This chapter of the Data Science for Utilities describes managing dirty data and cleaning it with a script using Tidyverse functionality

Managing and Cleaning Dirty Data

Peter Prevos

Peter Prevos |

474 words | 3 minutes

Share this content

Data is almost always produced to manage operations and rarely collected to analyse it in the future. The available data is thus never in an ideal format and needs to be converted to be suitable for analysis. We need, at minimum, a clear data structure, the correct variable types, and readable variables names. Preparing data for analysis, sometimes called data munging or wrangling, is an essential part of the data science workflow. This chapter of Data Science for Water Utilities introduces some techniques to clean data with R and the Tidyverse to create reproducible code. This chapter introduces a case study about customer perceptions about tap water. The learning objectives for this session are:

  • Use the dplyr package to transform data.
  • Apply the principles of tidy data.
  • Develop a script to automate data cleaning.

Data Science for Water Utilities

Data Science for Water Utilities

Data Science for Water Utilities published by CRC Press is an applied, practical guide that shows water professionals how to use data science to solve urban water management problems using the R language for statistical computing.

The data and code used in this chapter are available on GitHub:

Case Study 2: Understanding the Customer Experience

The data for this case study is a survey to measure the customer experience of the customers of a water utility. This data is generated with online survey software and contains information about:

Data Cleaning

Data cleaning, or data jujitsu as some call it, consists of three steps:

  1. Load and explore the data
  2. Convert the data structure
  3. Remove invalid data

A basic principle of good data science is never to manually clean data. Using a script to clean your data means that it is reproducible for other data sets with the same structure and issues and you can always go back to the raw data in case something went wrong.

Tidy Data

The Tidyverse uses the principles of Tidy Data, which means that:

  • Each variable forms a column
  • Each observation forms a row
  • Each type of observational unit forms a table

The data on the left is untidy because the variable is spread over multiple columns. We can fix this by pivoting the data so that the locations are no longer column names, but they become data.

Untidy and tidy version of laboratory data
Untidy and tidy version of laboratory data.

Managing Dirty Data Screencast

Chapter seven of Data Science for Water Utilities explains using the dplyr package to clean data in more detail. This screencast below reviews the code for this chapter.

Managing Dirty Data Screencast.

The data and code used in this chapter are available on GitHub:

Additional Resources

Other Chapters

Previous Chapter: Sharing Results of Data Analysis

Next Chapter: Analysing the Customer Experience.

Feel free to contact me if you have any comments, suggestions or questions about this book.

Share this content

You might also enjoy reading these articles

Analysing the Customer Experience

Basic Linear Regression

Basics of the R Language