This chapter of the Data Science for Water Utilities teaches how to load and explore water quality data in CSV files and spreadsheets

Loading and Exploring Water Quality Data from Spreadsheets

Peter Prevos | 10 May 2023
Last Updated | 9 June 2025
605 words | 3 minutes

Share this content

The first action in any analysis project is to obtain and load data. This data is often available in a database, the internet, or a file on your computer. One of the essential skills is thus to load this data into memory and view its content. This chapter introduces how to read CSV files and spreadsheets into R and explore their content using the Tidyverse suite of libraries. This chapter of Data Science for Water Utilities also presents a case study about water quality using synthetic data from an imaginary water supply network. This chapter has the following learning objectives:

Download and install R packages
Load and describe CSV files and spreadsheet
Explore the content of rectangular data (data frames)

Data Science for Water Utilities

Data Science for Water Utilities published by CRC Press is an applied, practical guide that shows water professionals how to use data science to solve urban water management problems using the R language for statistical computing.

Routledge Amazon

The data and code used in this chapter are available on GitHub:

The Tidyverse

The R language has a lot of capabilities out of the box. But its strongest feature is that it can be easily extended. Thousands of packages are available for almost every imaginable specialised task. Packages are available in public repositories, such as CRAN (The Comprehensive R Archive Network).

The Tidyverse is a collection of packages that makes working with R a bit easier. The remainder of this book will rely heavily on the Tidyverse way of analysing data.

A package provides additional functions for specialised tasks. You install packages with the install.packages("package-name") function. To use the functions in a package, you must evaluate library(package-name).

Loading Data into R

Using specialised packages, R can read almost every common data format, such as REST APIs, SQL databases and GIS shape files.

The Tidyverse readr package reads rectangular data from text files, such as the standard CSV format. Of course, R can also do this with base functions. Still, the Tidyverse version is faster and better at recognising data types. The Tidyverse readxl package reads data from Excel files.

Water Quality Case Study

This case study uses synthetic data from an imaginary water system named Gormsey. This data contains four analytes:

Escherichia coli (E. coli): A harmful bacterium.
Total chlorine [mg/l]. Chlorine is dosed to prevent bacteria from multiplying.
THM: Trihalomethanes. A byproduct of disinfection with chlorine that can cause a negative health impact.
Turbidity [NTU]: A measure of cloudiness.

The data is available in the GitHub repository for this book.

Variable Types and Classes

The previous chapter introduced scalar and vector variables. However, most data used in practice is stored in a rectangular format as a table. In the R language, a data frame is a rectangular data set where each column is a variable and each row an observation. The Tidyverse data frame is a Tibble, but these names are used interchangeably.

Scalar, vector and data frame variables in R.

Exploring Water Quality Data Screencast

Chapter three of Data Science for Water Utilities explains loading and exploring CSV and Excel files in more detail. This screencast below reviews the code for this chapter.

Loading and exploring spreadsheet data in R.

The data and code used in this chapter are available on GitHub:

Additional Resources

To help you remember the various functions discussed in the first five chapters of the book, a cheat sheet is available.

Other Chapters

Previous Chapter: Basics of the R Language

Next Chapter: Descriptive Statistics in Water Quality

Feel free to contact me if you have any comments, suggestions or questions about this book.