
Loading and Exploring Water Quality Data from Spreadsheets

Peter Prevos |
605 words | 3 minutes
Share this content
The first action in any analysis project is to obtain and load data. This data is often available in a database, the internet, or a file on your computer. One of the essential skills is thus to load this data into memory and view its content. This chapter introduces how to read CSV files and spreadsheets into R and explore their content using the Tidyverse suite of libraries. This chapter of Data Science for Water Utilities also presents a case study about water quality using synthetic data from an imaginary water supply network. This chapter has the following learning objectives:
- Download and install R packages
- Load and describe CSV files and spreadsheet
- Explore the content of rectangular data (data frames)
Data Science for Water Utilities
Data Science for Water Utilities published by CRC Press is an applied, practical guide that shows water professionals how to use data science to solve urban water management problems using the R language for statistical computing.
The data and code used in this chapter are available on GitHub:
The Tidyverse
The R language has a lot of capabilities out of the box. But its strongest feature is that it can be easily extended. Thousands of packages are available for almost every imaginable specialised task. Packages are available in public repositories, such as CRAN (The Comprehensive R Archive Network).
The Tidyverse is a collection of packages that makes working with R a bit easier. The remainder of this book will rely heavily on the Tidyverse way of analysing data.
A package provides additional functions for specialised tasks. You install packages with the install.packages("package-name")
function. To use the functions in a package, you must evaluate library(package-name)
.
Loading Data into R
Using specialised packages, R can read almost every common data format, such as REST APIs, SQL databases and GIS shape files.
The Tidyverse readr package reads rectangular data from text files, such as the standard CSV format. Of course, R can also do this with base functions. Still, the Tidyverse version is faster and better at recognising data types. The Tidyverse readxl package reads data from Excel files.
Water Quality Case Study
This case study uses synthetic data from an imaginary water system named Gormsey. This data contains four analytes:
- Escherichia coli (E. coli): A harmful bacterium.
- Total chlorine [mg/l]. Chlorine is dosed to prevent bacteria from multiplying.
- THM: Trihalomethanes. A byproduct of disinfection with chlorine that can cause a negative health impact.
- Turbidity [NTU]: A measure of cloudiness.
The data is available in the GitHub repository for this book.
Variable Types and Classes
The previous chapter introduced scalar and vector variables. However, most data used in practice is stored in a rectangular format as a table. In the R language, a data frame is a rectangular data set where each column is a variable and each row an observation. The Tidyverse data frame is a Tibble, but these names are used interchangeably.

Exploring Water Quality Data Screencast
Additional Resources
To help you remember the various functions discussed in the first five chapters of the book, a cheat sheet is available.
Other Chapters
Previous Chapter: Basics of the R Language
Next Chapter: Descriptive Statistics in Water Quality
Feel free to contact me if you have any comments, suggestions or questions about this book.
Share this content