This tap water sentiment analysis looks at a corpus of tweets about tap water to better understand people's attitudes to tap water.

Tap Water Sentiment Analysis using Twitter and Tidytext

Peter Prevos

In developed countries, tap water is safe to drink and available cheaply. Even though high-quality drinking water is almost freely available, the consumption of bottled water is increasing every year. Bottled water companies use sophisticated marketing strategies, while water utilities are mostly passive public service providers. Australian marketing expert Russell Howcroft even called water utilities "lazy marketers". Can we use a tap water sentiment analysis to learn how people feel about tap water? What can we learn about the reasons behind this loss of trust in the municipal water supply?

Gruen Transfer (2009).

This tap water sentiment analysis estimates people's attitudes towards tap water by analysing tweets. This article explains how to examine tweets about tap water using the R language and the Tidytext package.

Sentiment Analysis

Sentiment indicates the perception a group of people has about a particular topic. Traditionally, assessing the sentiment of a group of people requires surveys or interviews. These methods are problematic because they create an artificial environment in which respondents often answer to meet the perceived expectations of the study, or exaggerate to get their point across. Sentiment analysis of ego documents written by consumers can overcome these problems. Ego documents, i.e. forms of personal writing, are a more direct way to find out what consumers think, but they are challenging to obtain and analyse.

With the advent of social media, access to ego documents has become much more straightforward, and many tools exist to collect and interpret this data. Using ego documents brings you closer to the consumer than is possible with surveys or focus groups. One medium gaining popularity with market researchers is Twitter.

Tap Water Sentiment Analysis

Each tweet containing "tap water" reveals the author's attitude towards that topic. Each text expresses a sentiment about the subject it describes.

Sentiment analysis is a data science technique that extracts subjective information from a text. The basic method compares the words in a text with a lexicon of words with calibrated sentiments. These lexicons are created by asking many people how they feel about a specific term. For example, "stink" expresses a negative sentiment, while "nice" expresses a positive sentiment.
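
The lookup described above can be sketched in a few lines of base R. The lexicon below is a toy example invented for illustration, not a published one, and the function name score_text is hypothetical.

```r
## Minimal sketch of lexicon-based scoring (toy lexicon, not a published one)
lexicon <- data.frame(
  word = c("stink", "horrible", "nice", "fresh"),
  sentiment = c("negative", "negative", "positive", "positive"),
  stringsAsFactors = FALSE
)

## Split a text into lower-case words and look each one up in the lexicon
score_text <- function(text, lexicon) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[words != ""]
  matched <- lexicon$sentiment[match(words, lexicon$word)]
  ## Words that are not in the lexicon become NA and are not counted
  table(factor(matched, levels = c("negative", "positive")))
}

example_score <- score_text("Nice glass of tap water, but the pipes stink", lexicon)
example_score
```

Words that do not appear in the lexicon do not contribute to the count, which is why the size and quality of the lexicon matter so much.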

This tap water sentiment analysis consists of three steps. The first step extracts 1000 tweets containing "tap water" from Twitter. The second step cleans the data, and the third step analyses the results.

The Water Data Aggregator is a commercial application of sentiment analysis for tap water. This website sells the regional social credit score to interested parties.

Extracting tweets using the twitteR package

The twitteR package by Jeff Gentry makes it very easy to retrieve tweets using search criteria. You must register an application with Twitter to receive the API keys and tokens. In the code below, the actual values have been removed.

Follow the instructions in this article to obtain these codes for yourself. This code snippet calls a private file to load the API codes, extracts the tweets and creates a data frame with a tweet id number and its text.

  ## Tapwater tweet sentiment analysis
  library(tidyverse)
  library(tidytext)
  library(twitteR)

  ## Private file that defines consumer_key, consumer_secret,
  ## access_token and access_secret
  source("case-studies/twitter-api.R") ## Secret keys
  setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

  ## Extract tap water tweets
  tapwater_tweets <- searchTwitter("tap water", n = 1000, lang = "en") %>%
    twListToDF() %>%
    select(id, text)
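  ## Remove duplicate tweets (mostly retweets) and normalise curly apostrophes
  tapwater_tweets <- subset(tapwater_tweets, !duplicated(tapwater_tweets$text))
  tapwater_tweets$text <- gsub("’", "'", tapwater_tweets$text)

  ## Time-stamped file name without spaces or colons
  write_csv(tapwater_tweets,
            paste0("data/tapwater-tweets-", format(Sys.time(), "%Y%m%d-%H%M%S"), ".csv"))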

When I first extracted these tweets, a tweet by CNN about tap water in Kentucky that smells like diesel was retweeted many times, so I removed all duplicate tweets from the set. Unfortunately, this left fewer than 300 original tweets in the corpus.
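
Removing duplicates by text keeps one copy of each distinct string, so a retweet prefixed with "RT @" can survive next to the original tweet. The sketch below uses invented tweets in the same id and text columns as above, and assumes that retweets start with "RT @". Whether to drop the flagged rows is a judgement call.

```r
library(dplyr)

## Toy data standing in for the output of twListToDF() (texts are invented)
tweets <- tibble(
  id = c("1", "2", "3", "4"),
  text = c("Our tap water smells like diesel",
           "RT @someone: Our tap water smells like diesel",
           "RT @someone: Our tap water smells like diesel",
           "I love the tap water here")
)

## Keep the first copy of each text, then flag the remaining retweets
unique_tweets <- tweets %>%
  distinct(text, .keep_all = TRUE) %>%
  mutate(is_retweet = grepl("^RT @", text))

unique_tweets
```

Filtering on is_retweet afterwards would leave only the original tweets, at the cost of an even smaller corpus.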

If you don't have a Twitter API key, then you can use the CSV file in the data folder:

  tapwater_tweets <- read_csv("data/tapwater_tweets.csv")

Sentiment analysis with Tidytext

Text mining can be a powerful tool for analysing large amounts of text. The R language has an extensive collection of packages to help you undertake such a task. The Tidytext package extends the Tidy Data logic promoted by Hadley Wickham and his Tidyverse software collection.

Data Cleaning

The first step in cleaning the data is to create unigrams, which involves splitting the tweets into individual words that can be analysed. Common stop words, the search terms and Twitter-specific tokens such as "rt" and "t.co" are then removed. The next step is to determine which words are most commonly used in the tap water tweets and visualise the result. The most common words relate to drinking and bottled water, which makes sense. The recent issues in Kentucky also feature in this list.

Most common words in tweets about tap water.

  ## Tokenise and clean the tweets
  tidy_tweets <- tapwater_tweets %>%
    unnest_tokens(word, text)

  data(stop_words)

  ## Remove stop words, search terms, Twitter debris and single digits
  tidy_tweets <- tidy_tweets %>%
    anti_join(stop_words, by = "word") %>%
    filter(!word %in% c("tap", "water", "rt", "https", "t.co", "gt", "amp",
                        as.character(0:9)))

  ## Most common words
  tidy_tweets %>%
    count(word, sort = TRUE) %>%
    top_n(10) %>%
    mutate(word = reorder(word, n)) %>%
    ggplot(aes(word, n)) +
    geom_col(fill = "dodgerblue4") +
    xlab(NULL) + coord_flip() +
    theme_bw(base_size = 20) + 
    ggtitle("Most common words in tap water tweets")

Sentiment Analysis

The Tidytext package provides access to several lexicons of thousands of single English words (unigrams) that were manually assessed for their sentiment. The principle of sentiment analysis is to compare the words in the text with the words in the lexicon and analyse the results. For example, the statement "This tap water tastes horrible" has a sentiment score of -3 in the AFINN lexicon by Finn Årup Nielsen due to the word "horrible". In this analysis, I have used the "Bing" lexicon published by Liu et al. in 2005. This lexicon labels each word it contains as either positive or negative; words that are not in the lexicon are effectively treated as neutral.
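
As a rough sketch of how a numeric lexicon produces the score in the example above, the code below joins the tokenised statement to a hand-entered excerpt in the AFINN style. Only the value for "horrible" comes from the text. The entry for "great" is an assumed value included for illustration, and the name afinn_excerpt is hypothetical.

```r
library(dplyr)
library(tidytext)

## Hand-entered excerpt in the AFINN style (not the full lexicon);
## the value for "great" is an assumption for illustration
afinn_excerpt <- tibble(word = c("horrible", "great"),
                        value = c(-3, 3))

statement <- tibble(id = 1, text = "This tap water tastes horrible")

## Tokenise, match against the lexicon and add up the scores
statement_score <- statement %>%
  unnest_tokens(word, text) %>%
  inner_join(afinn_excerpt, by = "word") %>%
  summarise(score = sum(value))

statement_score
```

In recent versions of Tidytext, the full AFINN lexicon is retrieved with get_sentiments("afinn"), which relies on the textdata package and asks for permission to download the data on first use.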

This method is not foolproof, as words with the same spelling can mean different things. For example, "This tap water contains too much lead" will be assessed as a positive sentiment because the verb lead is seen as positive. The noun lead has no sentiment, as it depends on context.
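
One possible workaround, sketched below, is to remove domain-specific words from the lexicon before joining. The vector water_terms is a hypothetical list that you would extend for your own corpus.

```r
library(dplyr)
library(tidytext)

## Words whose lexicon sentiment is misleading for tap water (assumed list)
water_terms <- c("lead")

## Bing lexicon without the ambiguous terms
bing_water <- get_sentiments("bing") %>%
  filter(!word %in% water_terms)

## "lead" no longer matches, so it does not count towards either sentiment
lead_example <- tibble(text = "This tap water contains too much lead") %>%
  unnest_tokens(word, text) %>%
  inner_join(bing_water, by = "word")

lead_example
```

An alternative would be to relabel such words as negative instead of dropping them, but that builds an assumption about the context into the lexicon.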

The other problem with sentiment analysis is that we mostly see negative commentary. Very few people contact a water utility about an excellent morning shower or a wonderful glass of water. Tap water resides in the background of everyday life, and people don't have any opinion about it unless it is unavailable or does not meet their aesthetic expectations.

This tap water sentiment analysis shows that two-thirds of the words that express a sentiment were negative. The most common negative words were "smells" and "scared". This analysis is not a positive result for water utilities. Unfortunately, most tweets were not spatially located, so I couldn't determine the origin of the sentiment.

Tap water sentiment analysis.

  ## Sentiment analysis
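  ## Keep only the words that appear in the Bing lexicon
  sentiment_bing <- tidy_tweets %>%
    inner_join(get_sentiments("bing"), by = "word")

  ## Number of negative and positive words
  sentiment_bing %>%
    summarise(negative = sum(sentiment == "negative"), 
              positive = sum(sentiment == "positive"))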

  sentiment_bing %>%
    group_by(sentiment) %>%
    count(word, sort = TRUE) %>%
    top_n(10) %>%
    ggplot(aes(word, n, fill = sentiment)) +
    geom_col(show.legend = FALSE) + 
    coord_flip() +
    facet_wrap(~sentiment, scales = "free_y") + 
    labs(title = "Contribution to sentiment", x = NULL, y = NULL)
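
The share of negative words, such as the two-thirds reported above, can be derived from the counts per sentiment. The sketch below uses a toy stand-in for sentiment_bing with invented words and labels, so the numbers are illustrative only.

```r
library(dplyr)

## Toy stand-in for sentiment_bing (words and labels are illustrative only)
sentiment_example <- tibble(
  word = c("smells", "scared", "clean", "smells", "fresh", "bad"),
  sentiment = c("negative", "negative", "positive",
                "negative", "positive", "negative")
)

## Count the words per sentiment and convert the counts to proportions
sentiment_share <- sentiment_example %>%
  count(sentiment) %>%
  mutate(share = n / sum(n))

sentiment_share
```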

Using Tap Water Sentiment Analysis

Sentiment analysis is an exciting exploration technique but not an absolute truth. This method cannot detect sarcasm or irony, and words don't always have the same meaning as described in the dictionary. For example, the algorithm interprets the word "lead" as positive sentiment in its function as a verb. In water, however, the chemical lead is never a positive sentiment. Interpreting tweets as bags of words is exciting but can lead to misunderstanding of the context.

The critical message for water utilities is that they need to start taking the aesthetic properties of tap water as seriously as the health parameters. A lack of trust will drive consumers to bottled water or to less healthy alternatives such as soft drinks.

Data Science for Water Utilities

Data Science for Water Utilities published by CRC Press is an applied, practical guide that shows water professionals how to use data science to solve urban water management problems using the R language for statistical computing.
