
Managing and Cleaning Dirty Data

Peter Prevos |
474 words | 3 minutes
Share this content
Data is almost always produced to manage operations and rarely collected to analyse it in the future. The available data is thus never in an ideal format and needs to be converted to be suitable for analysis. We need, at minimum, a clear data structure, the correct variable types, and readable variables names. Preparing data for analysis, sometimes called data munging or wrangling, is an essential part of the data science workflow. This chapter of Data Science for Water Utilities introduces some techniques to clean data with R and the Tidyverse to create reproducible code. This chapter introduces a case study about customer perceptions about tap water. The learning objectives for this session are:
- Use the dplyr package to transform data.
- Apply the principles of tidy data.
- Develop a script to automate data cleaning.
Data Science for Water Utilities
Data Science for Water Utilities published by CRC Press is an applied, practical guide that shows water professionals how to use data science to solve urban water management problems using the R language for statistical computing.
The data and code used in this chapter are available on GitHub:
Case Study 2: Understanding the Customer Experience
The data for this case study is a survey to measure the customer experience of the customers of a water utility. This data is generated with online survey software and contains information about:
- Consumer Involvement
- Customer contact frequency
- Level of financial hardship
- Service Quality
Data Cleaning
Data cleaning, or data jujitsu as some call it, consists of three steps:
- Load and explore the data
- Convert the data structure
- Remove invalid data
A basic principle of good data science is never to manually clean data. Using a script to clean your data means that it is reproducible for other data sets with the same structure and issues and you can always go back to the raw data in case something went wrong.
Tidy Data
The Tidyverse uses the principles of Tidy Data, which means that:
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
The data on the left is untidy because the variable is spread over multiple columns. We can fix this by pivoting the data so that the locations are no longer column names, but they become data.

Managing Dirty Data Screencast
Additional Resources
Other Chapters
Previous Chapter: Sharing Results of Data Analysis
Next Chapter: Analysing the Customer Experience.
Feel free to contact me if you have any comments, suggestions or questions about this book.
Share this content