Call Centre Workforce Planning Using Erlang C in R language

Call centre workforce planning

We all hate the experience of calling a service provider and being placed on hold for a very long time. Organisations that take their level of service seriously plan their call centres so that the waiting times for customers is within acceptable limits. Having said this, making people wait for something can in some instances increase the level of perceived value.

Managing a call centre involves an interesting mix of rational production management and the soft skills of providing service with a smile. The mathematical part of managing a call centre relates to planning processes and staffing levels to ensure customers have a great experience.

Call centre performance can be numerically expressed by the Grade of Service, which is the percentage of calls that are answered within a specific time, for example, 90% of calls are answered within 30 seconds. This Grade of Service depends on the volume of calls made to the centre, the number of available agents and the time it takes to process a contact. Although working in a call centre can be chaotic, the Erlang C formula describes the relationship between the Grade of Service and these variables quite accurately.

Call centre workforce planning is a complex activity that is a perfect problem to solve in R code. This article explains how to use the Erlang C formula in the R language to manage a contact centre by calculating the number of agents needed to meet a required Grade of Service. This approach is extended with a Monte Carlo situation to understand the stochastic nature of the real world better.

The Erlang C Formula

The Erlang C formula describes the probability that a customer needs to queue instead of being immediately serviced (P_w). This formula is closely related to the Poisson distribution which describes queues such as traffic lights.

P_w = \frac{\frac{A^N}{N!}\frac{N}{N-A}}{\Big( \sum_{i=0}^{N-1} \frac{A^i}{i!} \Big)+\frac{A^N}{N!}\frac{N}{N-A}}

The intensity of traffic A is the number of calls per hour multiplied by the average duration of a call. Traffic intensity is measured in dimensionless Erlang units which expresses the time it would take to manage all calls if they arrived sequentially. The intensity is a measure of the amount of effort that needs to be undertaken in an hour. In reality, calls arrive at random times during the hour, which is where the Poisson distribution comes in. The waiting time is also influenced by the number of available operators N. The intensity defines the minimum number of agents needed to manage the workload.

We can now deconstruct this formula in a common-sense way by saying that the level of service increases as the intensity (the combination of call volume and average duration) reduces and the number of operator increases. The more staff, the higher the level of service, but precisely how many people do you need to achieve your desired grade of service efficiently?

The Grade of Service S is a function of the outcome of the Erlang C formula (P_w), the number of agents (N), the call intensity (A), call duration (\lambda) and lastly the target answering time (t).

S = 1 - \Large[ P_w e^ {-[(N-A](t/ \lambda)]} \large]

The Erlang C formula can be reworked to provide that answer. I sourced this formula from but must admit that I don’ t fully understand it and will take it at face value.

We now have a toolset for call centre planning which we can implement in the R language.

Erlang C in R

The Erlang C formula contains some factorials and powers, which become problematic when dealing with large call volumes or a large number of agents. The Multiple Precision Arithmetic package enables working with large integer factorials, but there is no need to wield such strong computing powers. To make life easier, the Erlang C formula includes the Erlang B formula, the inverse of which can be calculated using a small loop.

This implementation is very similar to an unpublished R package by Patrick Hubers, enhanced with work from This code contains four functions:

  1. intensity: Determines intensity in Erlangs based on the rate of calls per interval, the total call handling time and the interval time in minutes. All functions default to an interval time of sixty minutes.
  2. erlang_c: Calculates The Erlang C formula using the number of agents and the variables that determine intensity.
  3. service_level: Calculates the service level. The inputs are the same as above plus the period for the Grade of Service in seconds.
  4. resource: Seeks the number of agents needed to meet a Grade of Service. This function starts with the minimum number of agents (the intensity plus one agent) and keeps searching until it finds the number of agents that achieve the desired Grade of Service.

You can view the code below or download it from GitHub.

intensity <- function(rate, duration, interval = 60) {
    (rate / (60 * interval)) * duration

erlang_c <- function(agents, rate, duration, interval = 60) {
    int <- intensity(rate, duration, interval)
    erlang_b_inv <- 1
    for (i in 1:agents) {
        erlang_b_inv <- 1 + erlang_b_inv * i / int
    erlang_b <- 1 / erlang_b_inv
    agents * erlang_b / (agents - int * (1 - erlang_b))

service_level <- function(agents, rate, duration, target, interval = 60) {
    pw <- erlang_c(agents, rate, duration, interval)
    int <- intensity(rate, duration, interval)
    1 - (pw * exp(-(agents - int) * (target / duration)))

resource <- function(rate, duration, target, gos_target, interval = 60) {
    agents <- round(intensity(rate, duration, interval) + 1)
    gos <- service_level(agents, rate, duration, target, interval)
    while (gos < gos_target * (gos_target > 1) / 100) {
        agents <- agents + 1
        gos <- service_level(agents, rate, duration, target, interval)
    return(c(agents, gos))

Call Centre Workforce Planning Using an Erlang C Monte Carlo Simulation

I have used the Erlang C model to recommend staffing levels in a contact centre some years ago. What this taught me is that the mathematical model is only the first step towards call centre workforce planning. There are several other metrics that can be built on the Erlang C model, such as average occupancy of agents and average handling time.

The Erlang C formula is, like all mathematical models, an idealised version of reality. Agents are not always available; they need breaks, toilet stops and might even go on leave. Employers call this loss of labour shrinkage, which is a somewhat negative term to describe something positive for the employee. The Erlang C model provides you with the number of ‘bums on seats’.

The Erlang C formula is, like every model, not a perfect representation of reality. The formula tends to overestimate the required resources because it assumes that people will stay on hold indefinitely, while the queue will automatically shorten as people lose patience.

The number of employees needed to provide this capacity depends on the working conditions at the call centre. For example, if employees are only available to take calls 70% of their contracted time, you will need 1/0.7=1.4 staff members for each live agent to meet the Grade of Service.

Another problem is the stochastic nature of call volumes and handling times. The Erlang C model requires a manager to estimate call volume and handling time (intensity) as a static variable, while in reality, it is stochastic and subject to variation. Time series analysis can help to predict call volumes, but every prediction has a degree of uncertainty. We can manage this uncertainty by using a Monte Carlo simulation.

All the functions listed above are rewritten so that they provide a vector of possible answers based on the average call volume and duration and their standard deviation. This simulation assumes a normal distribution for both call volume and the length of each call. The outcome of this simulation is a distribution of service levels.

Monte Carlo Simulation

For example, a call centre receives on average 100 calls per half hour with a standard deviation of 10 calls. The average time to manage a call, including wrap-up time after the call, is 180 seconds with a standard deviation of 20 seconds. The centre needs to answer 80% of calls within 20 seconds. What is the likelihood of achieving this level of service?

The average intensity of this scenario is 10 Erlangs. Using the resource formula suggests that we need 14 agents to meet the Grade of Service. Simulating the intensity of the scenario 1000 times suggests we need between 6 and 16 agents to manage this workload.

> resource(100, 180, 20, 80, 30)
[1] 14.0000000  0.88835
> intensity_mc(100, 10, 180, 20) %&amp;amp;gt;% summary()
 Min. 1st Qu. Median Mean 3rd Qu. Max. 
5.480 8.975 9.939 10.025 10.993 15.932

The next step is to simulate the expected service level for this scenario. The plot visualises the outcome of the Monte Carlo simulation and shows that 95% of situations the Grade of Service is more than 77% and half the time it is more than 94%.

> service_level_mc(15, 100, 10, 180, 20, 20, 30, sims = 1000) %&gt;%
+ quantile(c(.05, .5, .95)) 
5%        50%       95% 
0.7261052 0.9427592 0.9914338 

This article shows that Using Erlang C in R helps managers with call centre workforce planning. Perhaps we need a Shiny application to develop a tool to manage the complexity of these functions. I would love to hear from people with practical experience in managing call centres in how they analyse data.

Simulated service levels using Erlang C in R and Monte Carlo simulation
Simulated service levels using Erlang C in R and Monte Carlo simulation.

You can view the code below or download it from GitHub.


intensity_mc &amp;lt;- function(rate_m, rate_sd, duration_m, duration_sd, interval = 60, sims = 1000) { (rnorm(sims, rate_m, rate_sd) / (60 * interval)) * rnorm(sims, duration_m, duration_sd) } intensity_mc(100, 10, 180, 20, interval = 30) %&amp;gt;% summary

erlang_c_mc &amp;lt;- function(agents, rate_m, rate_sd, duration_m, duration_sd, interval = 60) {
    int &amp;lt;- intensity_mc(rate_m, rate_sd, duration_m, duration_sd, interval)
    erlang_b_inv &amp;lt;- 1
    for (i in 1:agents) {
        erlang_b_inv &amp;lt;- 1 + erlang_b_inv * i / int
    erlang_b &amp;lt;- 1 / erlang_b_inv
    agents * erlang_b / (agents - int * (1 - erlang_b))

service_level_mc &amp;lt;- function(agents, rate_m, rate_sd, duration_m, duration_sd, target, interval = 60, sims = 1000) {
    pw &amp;lt;- erlang_c_mc(agents, rate_m, rate_sd, duration_m, duration_sd, interval)
    int &amp;lt;- intensity_mc(rate_m, rate_sd, duration_m, duration_sd, interval, sims)
    1 - (pw * exp(-(agents - int) * (target / rnorm(sims, duration_m, duration_sd))))

data_frame(ServiceLevel = service_level_mc(agents = 12,
                                           rate_m = 100,
                                           rate_sd = 10,
                                           duration_m = 180,
                                           duration_sd = 20,
                                           target = 20,
                                           interval = 30,
                                           sims = 1000)) %&amp;gt;%
    ggplot(aes(ServiceLevel)) +
        geom_histogram(binwidth = 0.1, fill = &amp;quot;#008da1&amp;quot;)

Euler Problem 33: Digit Cancelling Fractions and Ford Circles

Euler Problem 33: Digit Cancelling Fractions and Ford Circles

Euler Problem 33 takes us back to the world of fractions from our primary school days. Many kids hate and struggle learning about fractions but once you master them, a new world of numbers opens up. Unfortunately, the proliferation of digital calculators has negated the use of fractions in favour of decimal expressions. Fractions are an aesthetic way to express numbers, without having to resort to ugly random sequences of decimals. This is why I prefer to use 22/7 as an approximation of Pi over the ugly infinite series of decimals.

This Numberphile video below explains fractions and Farey sequences. A Farey sequence contains all fractions between 0 and 1 with a maximum denominator. More formally, a Farey sequence of order n is the sequence of completely reduced fractions between 0 and 1 which, when in lowest terms, have denominators less than or equal to n, arranged in order of increasing size. For example, the Farey Sequence with order 3 is:

F_3 = \Big\{ \frac{0}{1},\frac{1}{3},\frac{1}{2},\frac{2}{3},\frac{1}{1}\Big\}

These sequences can be visualised in fractal-esque Ford Circles, but before we get to this, first solve Euler problem 33.

Euler Problem 33 Definition

The fraction 49/98 is a curious fraction, as an inexperienced mathematician in attempting to simplify it may incorrectly believe that 49/98 = 4/8, which is correct, is obtained by cancelling the 9s. We shall consider fractions like 30/50 = 3/5, to be trivial examples.

There are exactly four nontrivial examples of this type of fraction, less than one in value, and containing two digits in the numerator and denominator. If the product of these four fractions is given in its lowest common terms, find the value of the denominator.

Proposed Solution in R

To solve this problem, we create a pseudo-Farey sequence by generating all different fractions with two decimals in the numerator and denominator. The loop generates all combinations of demoninators and numerators, excluding the trivial ones (multiples of 10 and 11). This solution converts the numbers to strings, strips any common duplicates, and tests the condition. The code concatenates vectors, which is not good practice. However, the loop is so short it does not matter much.

You can view the code below or download it from my GitHub page.

num &amp;lt;- vector()
den &amp;lt;- vector()
for (a in 11:99) {
    for (b in (a + 1):99) {
        trivial &amp;lt;- (a %% 10 == 0 | b &amp;amp;&amp;amp; 10 == 0 | a %% 11 == 0 | b %% 11 == 0)
        if (!trivial) {
            i &amp;lt;- as.numeric(unlist(strsplit(as.character(a), &amp;quot;&amp;quot;)))
            j &amp;lt;- as.numeric(unlist(strsplit(as.character(b), &amp;quot;&amp;quot;)))
            digs &amp;lt;- c(i, j)
            dup &amp;lt;- digs[duplicated(digs)]
            digs &amp;lt;- digs[which(digs != dup)]
            if (length(digs) == 2 &amp;amp; a/b == digs[1]/digs[2]) {
                num &amp;lt;- c(num, a)
                den &amp;lt;- c(den, b)
answer &amp;lt;- prod(den) / prod(num)

Farey Sequences and Ford Circles

Next step is to generalise Euler problem 33 and write a function to generate Farey Sequences and visualise them using Ford Circles.

The farey function generates a data table with the numerators (p) and denominators (q) of a Farey sequence. The function builds a list of all possible fractions for the solution space, excluding those with one as a Greatest Common Dominator, as defined by the gcd function.

farey &amp;lt;- function(n) {
    fseq &amp;lt;- list()
    fseq[[1]] &amp;lt;- c(0, 1)
    i &amp;lt;- 2
    gcd &amp;lt;- function(a, b) { # Euclid's method
        if (a == 0) return(b)
        if (b == 0) return(a)
        gcd(b, a%%b)
    for (q in 2:n) {
        for (p in 1:(q - 1)){
            if (gcd(p, q) == 1) {
                fseq[[i]] &amp;lt;- c(p, q)
                i &amp;lt;- i + 1
    fseq[[i]] &amp;lt;- c(1, 1)
    fseq &amp;lt;-, fseq))
    names(fseq) &amp;lt;- c(&amp;quot;p&amp;quot;, &amp;quot;q&amp;quot;)
    fseq &amp;lt;- fseq[order(fseq$p / fseq$q), ]

Standard ggplot2 cannot draw circles where the radius of the circles is related to the coordinate system. I tried to use the ggforce package to plot circles in ggplot2, but for some reason, I was not able to install this package on Ubuntu. As an alternative, I used a circle function sourced from StackOverflow. This function is called in a for-loop to build the circles on top of the empty canvas.

Farey Sequence and Ford Circles: Euler Problem 33
Farey Sequence and Ford Circles (n = 20).
lm_palette &amp;lt;- c(&amp;quot;#008da1&amp;quot;, &amp;quot;#005395&amp;quot;, &amp;quot;#262e43&amp;quot;, &amp;quot;#3b2758&amp;quot;, &amp;quot;#865596&amp;quot;, &amp;quot;#f26230&amp;quot;)
ford_circles &amp;lt;- farey(20) %&amp;gt;%
    mutate(x = p / q,
           y = 1 / (2* q^2),
           r = y,
           c = lm_palette[(q - 1)%%6 + 1])

g_circle &amp;lt;- function(r, x, y, color = NA, fill = &amp;quot;black&amp;quot;, ...) {
    x &amp;lt;- x + r * cos(seq(0, pi, length.out = 100))
    ymax &amp;lt;- y + r * sin(seq(0, pi, length.out = 100))
    ymin &amp;lt;- y + r * sin(seq(0, -pi, length.out = 100))
    annotate(&amp;quot;ribbon&amp;quot;, x = x, ymin = ymin, ymax = ymax,
             color = color, fill = fill, ...)

p &amp;lt;- ggplot(ford_circles, aes(x, y))
for (i in 1:nrow(ford_circles)) {
    p &amp;lt;- p + g_circle(ford_circles$r[i], ford_circles$x[i], ford_circles$y[i],
                      fill = ford_circles$c[i])
p + xlim(c(0, 1)) + coord_fixed() + theme_void()

Using R with Emacs and ESS – A Multifunctional Environment

Using R with Emacs and ESS

A few years ago I ditched the spreadsheet in favour of writing code in R. During the process I learnt a valuable lesson: The steeper the learning curve, the larger the reward. My time invested in learning R has paid off in spades, and I now use the R language for all my numerical and qualitative analysis.

Most data scientists solve data problems with the R language using the RStudio IDE, a free and open-source Integrated Development Environment. Although RStudio is a great product, like all other software products, it’s functionality is limited to doing one thing well.

Since last year I use Emacs and once again the rule that a steep learning curve has a valuable reward has come true. Emacs is the Swiss army chainsaw of productivity, the ultimate killer app and also one of the oldest active pieces of software. Although it might seem that newer is better when it comes to software, Emacs has continuously evolved. The power of Emacs is its extensive functionality and virtually infinite customisability. Almost ninety per cent of all my computing activity now takes place within Emacs. I use it to write notes, manage my action lists, write e-mails, articles, books. The Org Mode package in Emacs is the working horse that undertakes many of these functions. Since recently also to write R code in Emacs and ESS.

To master Emacs to the level, I am at now has taken me a couple of months. Initially, the bewildering amount of keyboard shortcuts is a challenge, but your muscle memory will soon kick in, and your fingers will glide across the keyboard like an eagle.

Using R with Emacs and ESS: Org Mode with embeded R code to write an academic paper in APA format.
Using R with Emacs and ESS: Org Mode with embedded R code to write an academic paper in APA format.

Using R with Emacs and ESS

The advantage of using Emacs over RStudio is that I can seamlessly switch between my notes, to-do lists, calendars and so on, while also developing code. There is no need to install, maintain and master multiple applications as Emacs cover almost all my computing needs. Just like R, Emacs includes thousands of packages that extend its functionality far beyond what one would expect from a text editor.

The disadvantage of Emacs compared to RStudio is that it is not as pretty and the screen more resembles an angry fruit salad than a slick material design. The image above shows my screen when I translated a paper about body image research into Org Mode with embedded R code. A temporary disadvantage is that there is a bit of preparation is required to enable coding in Emacs, but that is only a temporary hurdle and adds to the learning curve.

This article explains how to start Using R with Emacs and ESS (Emacs Speaks Statistics). The first section of this article provides links to resources on how to install Emacs on various platforms and how to enable it to start writing code in R with the ESS package. The second sections show how to use the R console and source files. The last section introduces literate programming by combining Org Mode and R and exporting the results to PDF, HTML etc.

Getting Started with Emacs

Installing Emacs depends on your operating system. A simple Google search will tell you what to do. The video playlist below shows how to install Emacs on Ubuntu, OS X or Windows 10. You will also need to install the R language on your machine, and if you like to create publication-quality output, the also include LaTeX.

Emacs needs an init file to define the packages that need to be loaded and to define several settings. You can download a minimal-working init file (init.el) from my GitHub repository.

To install packages, type M-x list-packages. This function displays all available packages. In Emacs-speak this means pressing the Alt (modify) key and the lowercase ‘x’ key, followed by ‘list-packages’ and return. You can find ESS by pressing ‘f’ and search for the package. Type ‘i’ to mark it for installation and press ‘x’ to execute the installation. The example init file loads the following packages that need to be manually installed:

  • ESS: Emacs Speaks Statistics.
  • ESS Smart UnderscoreSmarter underscore for Emacs Speaks Statistics.
  • Org-Reforg-mode modules for citations, cross-references, bibliographies in org-mode and useful BibTeX tools to go with it.
  • auto-complete: Auto Completion for GNU Emacs.

Using R with Emacs and ESS

To start using R with Emacs type M-x R RET. ESS will ask to nominate a working directory and opens a new buffer with the name *R*. You can now use the R console to write code. You can open multiple instances of the R console with different working directories and environment, just like projects in RStudio.

Emacs now recognises any file with the .R extension as being written in the R language. This functionality is a so-called Emacs major mode, which is the essential functionality of Emacs. A major mode defines how Emacs processes a file is managed, whether it is an R file, an Org Mode file and so on.

One bit of useful information is that the underscore key is mapped to the <- assignment operator. If you need an underscore, you need to type it twice in a row. This functionality can be annoying when you are an avid user of ggplot. The ess-smart-underscore package solves this issue by extending the functionality of ESS. You can install this package the same way you installed ESS itself. To enable it, add (require 'ess-smart-underscore) to your init file.

To write in a source file, you can create a .R file by typing C-x C-f (find-file function). Type the filename to create a new file, or when it already exists, open the file. You can now start writing R code as you would in any other editor. As soon as you evaluate code for the first time in a session, ESS will ask you what the starting project directory is, which is defaulted to the folder that your .R file is in. To evaluate the whole buffer, use C-c B, to evaluate a section use C-c c and to evaluate a line press M-RET.

To show the source file and the R console next to each other, type C-x 3 to split your window to show two buffers. You can then use C-x b to select the other buffer.

A disadvantage of using R in ESS is that there is no simple way to integrate plot outputs into the Emacs window. When I am iteratively working on visualisation, I save it to a file and open it in a separate buffer, as shown in the screendump.

Using R with Emacs and ESS to Write an Academic Paper

Org Mode is the most popular extension of Emacs and is precompiled with current versions of the software. Org Mode is an extremely versatile text editing extension. I use it to manage my projects using a calendar and To-Do lists, I write notes, write books an articles. I have translated an article I previously wrote LaTeX and Sweave to Org Mode.

Org Mode works very well with LaTeX. You will need to write some code into your init file to set a template, after which writing LaTeX code is a breeze. The Org Babel package functions as an interface between Org Mode and R. Perhaps explaining this file in detail is a topic for a future post. You can view the most recent version of the Org File and the associated setting in the init file on my GitHub page.

Using R with Emacs and ESS, Org Mode and LaTeX
Body Image research paper, written with Org Mode, R and LaTeX.

Defining Marketing with the Rvest and Tidytext Packages

Defining Marketing with the Rvest and Tidytext Packages

I am preparing to facilitate another session of the marketing course for the La Trobe University MBA. The first lecture delves into the definition of marketing. Like most other social phenomena, marketing is tough to define. Definitions of social constructs often rely on the perspective taken by the person or group writing the definition. As such, definitions also change over time. While a few decades ago, definitions of marketing revolved around sales and advertising, contemporary definitions are more holistic and reference creating value.

Heidi Cohen wrote a blog post where she collated 72 definitions of marketing. So rather than arguing over which definition is the best, why not use all definitions simultaneously? This article attempts to define a new definition of marketing, using a data science approach. We can use the R language to scrape the 72 definitions from Heidi’s website and attempt text analysis to extract the essence of marketing from this data set.

I have mentioned in a previous post about qualitative data science that automated text analysis is not always a useful method to extract meaning from a text. I decided to delve a little deeper into automated text analysis to see if we find out anything useful about marketing using the rvest and tidytext packages.

The presentation below shows the slides I use in my introductory lecture into marketing. The code and analyses are shown below the slideshow. You can download the most recent version of the code from my GitHub Repository.


Scraping text with Rvest

Web scraping is a technique to download data from websites where this data is not available as a clean data source. Web scraping starts with downloading the HTML code from the website and the filtering the wanted text from this file. The rvest package makes this process very easy.

The code for this article uses a pipe (%>%) with three rvest commands. The first step is to download the wanted html code from the web using the read_html function. The output of this function is piped to the html_nodes function, which does all the hard work. In this case, the code picks all lines of text that are embedded in an <ol><li> container, i.e. ordered lists. You can use the SelectorGadget to target the text you like to scrape. The last scraping step cleans the text by piping the output of the previous commands to the html_text function.

The result of the scraping process is converted to a Tibble, which is a type of data frame used in the Tidyverse. The definition number is added to the data, and the Tibble is converted to the format required by the Tidytext package. The resulting data frame is much longer than the 72 definitions because there are other lists on the page. Unfortunately, I could not find a way to filter only the 72 definitions.

definitions <- read_html("") %>%
    html_nodes("ol li") %>%
    html_text() %>%
    as_data_frame() %>%
    mutate(No = 1:nrow(.)) %>%
    select(No, Definition = value)

Tidying the Text

The Tidytext package extends the tidy data philosophy to a text. In this approach to text analysis, a corpus consists of a data frame where each word is a separate item. The code snippet below takes the first 72 rows, and the unnest_tokens function extracts each word from the 72 definitions. This function can also extract ngrams and other word groups from the text. The Tidytext package is an extremely versatile piece of software which goes far beyond the scope of this article. Julia Silge and David Robinson have written a book about text mining using this package, which provides a very clear introduction to the craft of analysing text.

The last section of the pipe removes the trailing “s” from each word to convert plurals into single words. The mutate function in the Tidyverse creates or recreates a new variable in a data frame.

def_words <- definitions[1:72, ] %>%
    unnest_tokens(word, Definition) %>%
    mutate(word = gsub("s$", "", word))

This section creates a data frame with two variables. The No variable indicates the definition number (1–72) and the word variable is a word within the definition. The order of the words is preserved in the row name. To check the data frame you can run unique(def_words$No[which(def_words$word == "marketing")]). This line finds all definition numbers with the word “marketing”, wich results, as expected, in the number 1 to 72.

Using Rvest and Tidytext to define marketing

We can now proceed to analyse the definitions scraped from the website with Rvest and cleaned with Tidytext. The first step is to create a word cloud, which is a popular way to visualise word frequencies. This code creates a data frame for each unique word, excluding the word marketing itself, and uses the wordcloud package to visualise the fifty most common words.

Defining Marketing with the Rvest and Tidytext Packages
72 Definitions of marketing summarised.

word_freq <- def_words %>%
    anti_join(stop_words) %>%
    count(word) %>%
    filter(word != "marketing")

word_freq %>%
    with(wordcloud(word, n, max.words = 50, rot.per = .5,
                   colors = rev(brewer.pal(5, "Dark2"))))

While a word cloud is indeed a pretty way to visualise the bag of words in a text, it is not the most useful way to get the reader to understand the data. The words are jumbled, and the reader needs to search for meaning. A better way to visualise word frequencies is a bar chart. This code takes the data frame created in the previous snippet, determines the top ten occurring words. The mutate statement reorders to factor levels so that the words are plotted in order.

Word frequency diagram of 72 definitions of marketing.
Word frequency diagram of 72 definitions of marketing.
word_freq %>%
    top_n(10) %>%
    mutate(word = reorder(word, n)) %>%
    ggplot(aes(word, n)) + 
        geom_col(fill = "dodgerblue4") +
        coord_flip() +
        theme(text = element_text(size=20))

A first look at the word cloud and bar chart suggests that marketing is about customers and products and services. Marketing is a process that includes branding and communication; a simplistic but functional definition.

Topic Modeling using Tidytext

Word frequencies are a weak method to analyse text because it interprets each word as a solitary unit. Topic modelling is a more advanced method that analyses the relationships between words, i.e. the distance between them. The first step is to create a Document-Term Matrix, which is a matrix that indicates how often a word appears in a text.  As each of the 72 texts are very short, I decided to treat the collection of definitions as one text about marketing. The cast_dtm function converts the data frame to a Document-Term Matrix.

The following pipe determines the top words in the topics. Just like k-means clustering, the analyst needs to choose the number of topics before analysing the text. In this case, I have opted for four topics. The code determines the contribution of each word to the four topics and selects the five most common words in each topic. The faceted bar chart shows each of the words in the four topics.

Topic modelling 72 definitions of marketing
Topic modelling 72 definitions of marketing
marketing_dtm <- word_freq %>%
    mutate(doc = 1) %>%
    cast_dtm(doc, word, n)

marketing_lda <- LDA(marketing_dtm, k = 4) %>%
    tidy(matrix = "beta") %>%
    group_by(topic) %>%
    top_n(5, beta) %>%
    ungroup() %>%
    arrange(topic, -beta)

marketing_lda %>%
    mutate(term = reorder(term, beta)) %>%
    ggplot(aes(term, beta, fill = factor(topic))) +
       geom_col(show.legend = FALSE) +
       facet_wrap(~topic, scales = "free") +
       coord_flip() +
       theme(text = element_text(size=20))

This example also does not tell me much more about what marketing is, other than giving a slightly more sophisticated version of the word frequency charts. This chart shows me that marketing is about customers that enjoy a service and a product. Perhaps the original definitions are not distinctive enough to be separated from each other. The persistence of the word “president” is interesting as it seems to suggest that marketing is something that occurs at the highest levels in the business.

This article is only a weak summary of the great work by Julia Silge who co-authored the Tidytext package. The video below provides a comprehensive introduction to topic modelling.

What have we learnt?

This excursion into text analysis using rvest and Tidytext shows that data science can help us to make some sense out of unread text. If I did not know what this page was about, then perhaps this analysis would enlighten me. This kind of analysis can assist us in wading through to large amounts of text to select the ones we want to read. I am still not convinced that this type of analysis will provide any knowledge beyond what can be obtained from actually reading and engaging with a text.

Although I am a data scientist and want to maximise the use of code in analysing data, I am very much in favour of developing human intelligence before we worry about the artificial kind.

Laser Beams and Elliptical Billiards: Euler Problem 144

Euler problem 144 has been one of the most fun to solve. The underlying problem is the pathway of the reflection of a laser inside an ellipse-shaped mirror. Before I delve into this problem, I like to share this delightful video from Numberphile in which Alex Bellos demonstrates an elliptical billiards table. The billiards problem is mathematically equivalent to the laser problem. The reflection rule optics is the same as the bouncing rule in mechanics, but instead of using light, we use a billiard ball.

This article outlines my solution to Euler problem 104 and simulates the elliptical pool table in the R language. You can find the code on the GitHub repository for this website.

Euler Problem 144 Definition

In laser physics, a “white cell” is a mirror system that acts as a delay line for the laser beam. The beam enters the cell, bounces around on the mirrors, and eventually works its way back out.

The specific white cell we will be considering is an ellipse with the equation 4x^2 + y^2= 100 . The section corresponding to -0.01 \leq \times \leq +0.01 at the top is missing, allowing the light to enter and exit through the hole.

Euler Problem 144

The light beam in this problem starts at the point (0.0,10.1) just outside the white cell, and the beam first impacts the mirror at (1.4,-9.6). Each time the laser beam hits the surface of the ellipse, it follows the usual law of reflection “angle of incidence equals the angle of reflection.” That is, both the incident and reflected beams make the same angle with the normal line at the point of incidence. In the figure on the left, the red line shows the first two points of contact between the laser beam and the wall of the white cell; the blue line shows the line tangent to the ellipse at the point of incidence of the first bounce. The slope m of the tangent line at any point (x,y) of the given ellipse is m = -4x/y . The normal line is perpendicular to this tangent line at the point of incidence.

How many times does the beam hit the internal surface of the white cell before exiting?

Proposed Solution to Euler Problem 144

The first step was to rewrite the equation to use functions to generalise the problem. The general Cartesian equation for an ellipse is:

\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1, a < b

The length of the axes for this problem is a =5 and b = 10. While the Project Euler description gives the formula for the slope of the tangent to the ellipse, I have generalised the code to reuse it for the elliptical billiards table. The slope of the tangent to an ellipse at point (x,y) is:


This first code snippet defines functions to draw an ellipse and calculate the bouncing angle. The last part of the code bounces the laser inside the cell until it exits through the top.

I sourced the formula to find the intersection between a line and an ellipse from Ambrsoft. The equation has two possible solutions, one of which is the same as the original point.

plot_ellipse &lt;- function(a, b, colour = NA, line = &quot;black&quot;) {
    plot.window(xlim = c(-a, a), ylim = c(-b, b), asp = 1)
    par(mar = rep(0,4))
    x &lt;- seq(-a, a, length = 200)
    y &lt;- sqrt(b^2 - (b^2 / a^2) * x^2)
    lines(x, y, col = line)
    lines(x, -y, col = line)
    polygon(x, y, col = colour, border = NA)
    polygon(x, -y, col = colour, border = NA)

bounce &lt;- function(coords) {
    x &lt;- coords$x
    y &lt;- coords$y
    ## Tangent to ellipse
    t &lt;- -(b^2 / a^2) * (x[2] / y[2])
    ## Deflection on sloping mirror y = mx + c
    dydx &lt;- diff(y) / diff(x)
    m &lt;- tan(pi - atan(dydx) + 2 * atan(t))
    c &lt;- y[2] - m * x[2]
    ## Determine intersection point
    ## Source:
    x[1] &lt;- x[2]
    y[1] &lt;- y[2]
    x2 &lt;- (-a^2 * m * c + c(-1, 1) * (a * b * sqrt(a^2 * m^2 + b^2 - c^2))) /
          (a^2 * m^2 + b^2)
    x[2] &lt;- ifelse(round(x[1] / x2[1], 6) == 1, x2[2], x2[1])
    y[2] &lt;- m * x[2] + c
    return(data.frame(x, y))

# Initial conditions
a &lt;- 5
b &lt;- 10
x1 &lt;- 0
y1 &lt;- 10.1
x2 &lt;- 1.4
y2 &lt;- -9.6
answer &lt;- 0
plot_ellipse(a, b)
points(c(0,0), c(-c, c), pch = 19)
## Bounce laser breams
laser &lt;- data.frame(x = c(x1, x2), y = c(y1, y2))
while((laser$x[2] &lt; -0.01 | laser$x[2] &gt; 0.01) | laser$y[2] &lt; 0) { ## Escape?
    lines(laser$x, laser$y, col = &quot;red&quot;, lwd = .5)
    laser &lt;- bounce(laser)
    answer &lt;- answer + 1
Solution to Euler Problem 144
Graphical solution to Euler Problem 144.

The result of this code is a pretty image of all the laser beams that have bounced around the mirror, which looks like the evil Eye of Sauron in Lord of the Rings.

Elliptical Pool Table

We can use the solution to Euler problem 144 to play billiards on an elliptical billiards table. To close the article, we return to the elliptical pool table demonstrated by Alex Bellos. This code draws the pool table to the dimensions mentioned in the video. We know that the table has an eccentricity of e = 0.43 and a long axis of a = 130 cm. The code defines the short axis (b) and the distance of the focal points from the centre.

The code selects a random starting point and angle of the shot. The code first determines whether the line passes through the pocket. If this is not the case, the algorithm then finds the place where the ball hits and keeps bouncing it until it falls into the pocket or the ball bounces 100 times.

Elliptical billiard tables have four possible outcomes. Any ball the pass through a focal point will fall into the pocket, ending the simulation. Any ball that passes outside the focal points will bounce around, and the combined trajectories form an ellipse. When the ball moves between the foci, the result is a hyperbola. Lastly, there are some unique circumstances which result in a regular polygon.

Elliptical billiards: Euler problem 144
Elliptical billiards: Three simulations.

If simulations are not enough for you, then head over to the Instructables website to find out how you can construct an elliptical billiards table. There is even a patent for an elliptical pocket billiard table, with the pockets at the edge.

e &lt;- 0.43
a &lt;- 130
b &lt;- a * sqrt((1 + e) * (1 - e)) # a &gt; b
f &lt;- sqrt(a^2 - b^2)
plot_ellipse(a, b, &quot;darkgreen&quot;, NA)
points(-f, 0, pch = 19, cex = 2)
points(f, 0, pch = 19, col = &quot;grey&quot;)

## Simulate random shot
angle &lt;- runif(1, 0, 2 * pi)
x1 &lt;- runif(1, -a, a)
ymax &lt;- sqrt(b^2 - (b^2 / a^2) * x1^2)
y1 &lt;- runif(1, -ymax, ymax)

## First shot
m &lt;- tan(angle)
c &lt;- y1 - m * x1
x2 &lt;- (-a^2 * m * c + c(-1, 1) * (a * b * sqrt(a^2 * m^2 + b^2 - c^2))) / (a^2 * m^2 + b^2)
y2 &lt;- m * x2 + c
x2 &lt;- x2[which(((x2 - x1) &lt; 0) == (cos(angle) &lt; 0))]
y2 &lt;- y2[which(((y2 - y1) &lt; 0) == (sin(angle) &lt; 0))]
shot &lt;- (data.frame(x = c(x1, x2), y = c(y1, y2)))

## Bounce ball
for (i in 1:100){
    dydx &lt;- diff(shot$y) / diff(shot$x)
    if (all.equal(dydx, (shot$y[1] - 0) / (shot$x[1] - -f)) == TRUE) {
        shot[2, ] &lt;- c(-f, 0)
    lines(shot$x, shot$y, col = &quot;yellow&quot;, lwd = 1)
    if (shot[2,2] == 0) break
    shot &lt;- bounce(shot)
points(x1, y1, pch = 19, col = &quot;blue&quot;, cex = 1.8)

Qualitative Data Science: Using RQDA to analyse interviews

Qualitative Data Science: Using RQDA to analyse interviews

Qualitative data science sounds like a contradiction in terms. Data scientists generally solve problems using numerical solutions. Even the analysis of text is reduced to a numerical problem using Markov chains, topic analysis, sentiment analysis and other mathematical tools.

Scientists and professionals consider numerical methods the gold standard of analysis. There is, however, a price to pay when relying on numbers alone. Numerical analysis reduces the complexity of the social world. When analysing people, numbers present an illusion of precision and accuracy. Giving primacy to quantitative research in the social sciences comes at a high price. The dynamics of reality are reduced to statistics, losing the narrative of the people that the research aims to understand.

Being both an engineer and a social scientist, I acknowledge the importance of both numerical and qualitative methods. My dissertation used a mixed-method approach to review the relationship between employee behaviour and customer perception in water utilities. This article introduces some aspects of qualitative data science with an example from my dissertation.

In this article, I show how I analysed data from interviews using both quantitative and qualitative methods and demonstrate why qualitative data science is better to understand text than numerical methods. The most recent version of the code is available on my GitHub repository. Unfortunately I cannot share the data set as this contains personally identifying data.

Qualitative Data Science

Qualitative Data Science

The often celebrated artificial intelligence of machine learning is impressive but does not come close to human intelligence and ability to understand the world. Many data scientists are working on automated text analysis to solve this issue (the topicmodels package is an example of such an attempt). These efforts are impressive but even the smartest text analysis algorithm is not able to derive meaning from text. To fully embrace all aspects of data science we need to be able to methodically undertake qualitative data analysis.

The capabilities of R in numerical analysis are impressive but it can also assist with Qualitative Data Analysis (QDA). Huang Ronggui from Hong Kong developed the RQDA package to analyse texts in R. RQDA assists with qualitative data analysis using a GUI front-end to analyse collections texts. The video below contains a complete course in using this software. Below the video, I share an example from my dissertation which compares qualitative and quantitative methods for analysing text.

For my dissertation about water utility marketing, I interviewed seven people from various organisations. The purpose of these interviews was to learn about the value proposition for water utilities. The data consists of the transcripts of six interviews which I manually coded using RQDA. For reasons of agreed anonymity, I cannot provide the raw data file of the interviews through GitHub.

Numerical Text Analysis

Word clouds are a popular method for exploratory analysis of texts. The wordcloud is created with the text mining and wordcloud packages. The transcribed interviews are converted to a text corpus (the native format for the tm package) and whitespace, punctuation etc is removed. This code snippet opens the RQDA file and extracts the texts from the database. RQDA stores all text in an SQLite database and the package provides a query command to extract data.


interviews &amp;amp;lt;- RQDAQuery(&amp;amp;amp;quot;SELECT file FROM source&amp;amp;amp;quot;)
interviews$file &amp;amp;lt;- apply(interviews, 1, function(x) gsub(&amp;amp;quot;…&amp;amp;quot;, &amp;amp;quot;...&amp;amp;quot;, x))
interviews$file &amp;amp;lt;- apply(interviews, 1, function(x) gsub(&amp;amp;quot;’&amp;amp;quot;, &amp;amp;quot;&amp;amp;quot;, x))
interviews &amp;amp;lt;- Corpus(VectorSource(interviews$file))
interviews &amp;amp;lt;- tm_map(interviews, stripWhitespace)
interviews &amp;amp;lt;- tm_map(interviews, content_transformer(tolower))
interviews &amp;amp;lt;- tm_map(interviews, removeWords, stopwords(&amp;amp;quot;english&amp;amp;quot;))
interviews &amp;amp;lt;- tm_map(interviews, removePunctuation)
interviews &amp;amp;lt;- tm_map(interviews, removeNumbers)
interviews &amp;amp;lt;- tm_map(interviews, removeWords, c(&amp;amp;quot;interviewer&amp;amp;quot;, &amp;amp;quot;interviewee&amp;amp;quot;))

wordcloud(interviews, min.freq = 10, max.words = 50, rot.per=0.35, 
          colors = brewer.pal(8, &amp;amp;quot;Blues&amp;amp;quot;)[-1:-5])
Word cloud of interview transcripts.
Word cloud of interview transcripts.

This word cloud makes it clear that the interviews are about water businesses and customers, which is a pretty obvious statement. The interviews are also about the opinion of the interviewees (think). While the word cloud is aesthetically pleasing and provides a quick snapshot of the content of the texts, they cannot inform us about their meaning.

Topic modelling is a more advanced method to extract information from the text by assessing the proximity of words to each other. The topic modelling package provides functions to perform this analysis. I am not an expert in this field and simply followed basic steps using default settings with four topics.

dtm &amp;amp;lt;- DocumentTermMatrix(interviews)
dtm &amp;amp;lt;- removeSparseTerms(dtm, 0.99)
ldaOut &amp;amp;lt;- LDA(dtm, k = 4)
terms(ldaOut, 6)

This code converts the corpus created earlier into a Document-Term Matrix, which is a matrix of words and documents (the interviews) and the frequency at which each of these words occurs. The LDA function applies a Latent Dietrich Allocation to the matrix to define the topics. The variable k defines the number of anticipated topics. An LDA is similar to clustering in multivariate data. The final output is a table with six words for each topic.

Topic 1Topic 2Topic 3Topic 4

This table does not tell me much at all about what was discussed in the interviews. Perhaps it is the frequent use of the word “water” or “think”—I did ask people their opinion about water-related issues. To make this analysis more meaningful I could perhaps manually remove the words water, yeah, and so on. This introduces bias in the analysis and reduces the reliability of the topic analysis because I would be interfering with the text.

Numerical text analysis sees a text as a bag of words instead of a set of meaningful words. It seems that any automated text mining needs a lot of manual cleaning to derive anything meaningful. This excursion shows that automated text analysis is not a sure-fire way to analyse the meaning of a collection of words. After a lot of trial and error to try to make this work, I decided to go back to my roots of qualitative analysis using RQDA as my tool.

Qualitative Data Science Using RQA

To use RQDA for qualitative data science, you first need to manually analyse each text and assign codes (topics) to parts of the text. The image below shows a question and answer and how it was coded. All marked text is blue and the codes are shown between markers. Coding a text is an iterative process that aims to extract meaning from a text. The advantage of this method compared to numerical analysis is that the researcher injects meaning into the analysis. The disadvantage is that the analysis will always be biased, which in the social sciences is unavoidable. My list of topics was based on words that appear in a marketing dictionary so that I analysed the interviews from that perspective.

Qualitative data science using RQDA.
Example of text coded with RQDA.

My first step was to look at the occurrence of codes (themes) in each of the interviews.

codings &amp;amp;lt;- getCodingTable()[,4:5]
categories &amp;amp;lt;- RQDAQuery(&amp;amp;quot;SELECT AS category, AS filename 
                         FROM treefile, filecat, source 
                         WHERE treefile.catid = filecat.catid AND treefile.fid = AND treefile.status = 1&amp;amp;quot;)
codings &amp;amp;lt;- merge(codings, categories, all.y = TRUE)
reorder_size &amp;amp;lt;- function(x) {
    factor(x, levels = names(sort(table(x))))
ggplot(codings, aes(reorder_size(codename), fill=category)) + geom_bar() + 
    facet_grid(~filename) + coord_flip() + 
    theme(legend.position = &amp;amp;quot;bottom&amp;amp;quot;, legend.title = element_blank()) + 
    ylab(&amp;amp;quot;Code frequency in interviews&amp;amp;quot;) + xlab(&amp;amp;quot;Code&amp;amp;quot;)

The code uses an internal RQDA function getCodingTable to obtain the primary data. The RQDAQuery function provides more flexibility and can be used to build more complex queries of the data. You can view the structure of the RQDA database using the RQDATables function.

Occurrence of themes from six interviews - qualitative data science
The occurrence of themes from six interviews.

This bar chart helps to explore the topics that interviewees discussed but it does not help to understand how these topic relate to each other. This method provides a better view than the ‘bag of words’ approach because the text has been given meaning.

RQDA provides a facility to assign each code to a code category. This structure can be visualised using a network. The network is visualised using the igraph package and the graph shows how codes relate to each other.

Qualitative data analysis can create value from a text by interpreting it from a given perspective. This article is not even an introduction to the science and art of qualitative data science. I hope it invites you to explore RQA and similar tools.

If you are interested in finding out more about how I used this analysis, then read chapter three of my dissertation on customer service in water utilities.

Qualitative data science: Using RQDA to analyse a text
Network diagram with communities of interview topics.
edges &amp;amp;lt;- RQDAQuery(&amp;amp;quot;SELECT, FROM codecat, freecode, treecode 
                    WHERE codecat.catid = treecode.catid AND = treecode.cid&amp;amp;quot;)
g &amp;amp;lt;- graph_from_edgelist(as.matrix(edges), directed = FALSE)
V(g)$name &amp;amp;lt;- gsub(&amp;amp;quot; &amp;amp;quot;, &amp;amp;quot;\n&amp;amp;quot;, V(g)$name)

c &amp;amp;lt;-
plot(c, g, 

Approximations of Pi: A Random Walk though the Beauty of Pi

Approximations of Pi: A Random Walk though the Beauty of Pi

Off all thinkable numbers, Pi has somewhat of celebrity status. It is so famous it even has its holiday. Mathematicians who use the American way to write dates recognise March the 14th as Pi Day. Some authors have even assigned a mystical status to the number of Pi. In the novel Contact by Carl Sagan, the heroine Ellie discovers a hidden message in the base-11 representation of Pi. In the 1998 film Pi by Darren Aronofsky, the protagonist is driven mad by the idea that a secret pattern in random numbers. The string of numbers that form the decimals of Pi might seem random, but they are of course perfectly predictable and calculable. Their apparent randomness can create artful visualisations.

This article discusses some approximations of Pi using the R language and visualises the results. The video below is an ode to the aesthetic beauty of the digits of Pi by Numberphile.

Approximations of Pi

My favourite approximation of Pi has served me well through the early part of my engineering career. When using calculators that did not have a button for \pi , I used 22/7. It is only accurate to two decimals, but that is more than enough for dredging projects.

The base R language can provide 22 decimals of Pi using options(digits = 22). If you need more accuracy, then the MPFR package in R provides a library for multiple-precision floating-point computations. The Const function can compute Pi to a high level of precision with thousands of digits.

pi_mpfr &amp;amp;lt;- Const(&amp;amp;quot;pi&amp;amp;quot;, prec = 200000)

There are also more creative ways to determine the number Pi using Monte Carlo simulations. Imagine you are terribly bored, sitting inside a room with floorboards. You decide to take a match with length l and the floorboards are t wide. The needle is half the length of the width of the floorboards (t/l = 2). You drop the matchstick randomly for about ten million times (n=10,000,000) and you record every time the needle crosses the groove of one of the boards (o). This thought experiment is known as Buffon’s needle. The number Pi can be approximated by:

\pi \approx \frac{c \cdot l}{o \cdot t}

This bit of code simulates dropping the needle ten million times, which gives an estimated value of \pi \approx 3.143096 . You would have to be extremely bored to throw a needle that many times to only achieve ccuracy to two decimals!

t &amp;amp;lt;- 2
l &amp;amp;lt;- 1
n &amp;amp;lt;- 10000000

needles &amp;amp;lt;- data_frame(phi = runif(n, 0, 2 * pi),
           x1 = runif(n, 0, (l + 3)), 
           y1 = runif(n, 0, (l + 3)),
           x2 = x1 + l * sin(phi),
           y2 = y1 + l * cos(phi), 
           overlap = (x1 &amp;amp;lt; 1 &amp;amp;amp; x2 &amp;amp;gt; 1) | (x1 &amp;amp;lt; 3 &amp;amp;amp; x2 &amp;amp;gt; 3))

ggplot(needles[1:1000,], aes(x1, y1)) + 
  geom_segment(aes(xend = x2, yend = y2, colour = overlap)) + 
  geom_vline(xintercept = c(1, 3), colour = &amp;amp;quot;red&amp;amp;quot;) +
  scale_color_manual(values = c(&amp;amp;quot;gray&amp;amp;quot;, &amp;amp;quot;black&amp;amp;quot;)) + 

pi_buffon &amp;amp;lt;- (n * l) / (sum(needles$overlap) * t)
Approximations of Pi: A Random Walk though the digits of Pi
Visualising Buffon’s Needle thought experiment.

Visualising the approximations of Pi

In 1888, the John Venn, inventor of the eponymous Venn diagram, wanted to visualise the apparent randomness of the digits of Pi. He obtained the first 707 digits of Pi from amateur mathematician William Shanks. Unfortunately, only the first 527 decimals of Pi were correct but this was only discovered in 1944. Venn assigned a compass point to the digits 0 to 7 and then drew lines to show the path indicated by each digit, ignoring the digits 8 and 9.

Running through the values of Pi this way produces a line that snakes its way through the graph. When you use random numbers with the same method, the graph looks very similar. In this case, I have downloaded the digits of Pi from the On-Line Encyclopedia of Integer Sequences (Nr. A000796).

# Download Pi Digits
pi_digits &amp;amp;lt;- read.csv(&amp;amp;quot;;amp;quot;, header = FALSE, sep = &amp;amp;quot; &amp;amp;quot;, skip = 1) %&amp;amp;gt;%
             select(digit = V2)

# Venn walk
venn_walk &amp;amp;lt;- function(digits) {
  digits &amp;amp;lt;- digits[digits != 8 &amp;amp;amp; digits != 9]
  l &amp;amp;lt;- length(digits) - 1
  x &amp;amp;lt;- rep(0, l)
  y &amp;amp;lt;- x
  for (i in 1:l) {
    a &amp;amp;lt;- digits[i + 1] * pi / 4
    dx &amp;amp;lt;- round(sin(a))
    dy &amp;amp;lt;- round(cos(a))
    x[i + 1] &amp;amp;lt;- x[i] + dx 
    y[i + 1] &amp;amp;lt;- y[i] + dy
  coords &amp;amp;lt;- data_frame(x = x, y = y) ggplot(coords, aes(x, y)) + geom_path() + geom_point(data = coords[c(1, l + 1), ], aes(x, y), colour = &amp;amp;quot;red&amp;amp;quot;, size = 2) + theme_void() } venn_walk(pi_digits) # Random Numbers data.frame(digit = sample(0:7, 20000, replace = TRUE)) %&amp;amp;gt;%
Random walk through the digits of Pi - The Venn method
Random walk through the digits of Pi – The Venn method.

The Numberphile video shows some beautiful visualisations of Pi, one of which I like to share with you to close this article. Martin Krzywinski created this, and many other, visualisations of Pi.

data_frame(x = rep(1:20, 20),
           y = unlist(lapply(1:20, function(x) rep(x, 20))),
           d = pi_digits$digit[1:400]) %&amp;amp;gt;%
  mutate(d = factor(d)) %&amp;amp;gt;%
  ggplot(aes(x, y, colour = d)) + geom_point(size = 3) +
  theme_void() + 
  theme(plot.background = element_rect(fill = &amp;amp;quot;black&amp;amp;quot;),
        panel.background = element_rect(fill = &amp;amp;quot;black&amp;amp;quot;),

As always, you can find the latest version of this code on GitHub. Feel free to subscribe to this blog if you like to receive articles in your mailbox.

Approximations of Pi: Art work that visualises the random nature of the decimals of Pi
Approximations of Pi: Art work that visualises the random nature of the decimals of Pi

Tap Water Sentiment Analysis using Tidytext

Tap Water Sentiment Analysis using Tidytext

In developed countries, tap water is safe to drink and available for a meagre price. Despite the fact that high-quality drinking water is almost freely available, the consumption of bottled water is increasing every year. Bottled water companies use sophisticated marketing strategies, while water utilities are mostly passive providers of public service. Australian marketing expert Russell Howcroft even called water utilities “lazy marketers”. Can we use data science to find out more about how people feel about tap water and learn about the reasons behind this loss in trust in the municipal water supply?

This tap water sentiment analysis estimates the attitudes people have towards tap water by analysing tweets. This article explains how to examine tweets about tap water using the R language for statistical computing and the Tidytext package. The most recent version of the code and the raw data set used in this analysis can be viewed on my GitHub page.

Sentiment Analysis

Sentiment analysis is a suitable technique to get a feel for the the perception a group of people has about a certain topic. Traditionaly, this ould require surveys. This method is problematic because it creates an artificial environment where the respondent often answers to meet the perceived expectations of the survey or customers exaggerate to get their point across. Using sentiment analysis of ego-documents written by consumers can overcome these problems. Ego-documents, i.e. forms of personal writing, are a more direct way to find out what consumers think, but they are not easy to obtain and analyse.

With the advent of social media, access to ego documents has become much simpler and many tools exist to collect and interpret this data. Using ego-documents brings you closer to the consumer than can be possible with surveys or focus groups. One medium gaining popularity with market researchers is Twitter.

Tap Water Sentiment Analysis

Each tweet that contains the words “tap water” contains a message about the attitude the author has towards that topic. Each text expresses a sentiment about the topic it describes. Sentiment analysis is a data science technique that extracts subjective information from a text. The basic method compares a string of words with a set of words with calibrated sentiments. These calibrated sets are created by asking many people how they feel about a certain word. For example, the word “stink” expresses a negative sentiment, while the word “nice” would be a positive sentiment.

This tap water sentiment analysis consists of three steps. The first step extracts 1000 tweets that contain the words “tap water” from Twitter. The second step cleans the data, and the third step undertakes the analysis visualises the results.

Extracting tweets using the TwitteR package

The TwitteR package by Geoff Gentry makes it very easy to retrieve tweets using search criteria. You will need to create an API on Twitter to receive the keys and tokens. In the code below, the actual values have been removed. Follow the instructions in this article to obtain these codes for yourself. This code snippet calls a private file to load the API codes, extracts the tweets and creates a data frame with a tweet id number and its text.


# Extract tap water tweets
setup_twitter_oauth(api_key, api_secret, token, token_secret)
tapwater_tweets &lt;- searchTwitter(&quot;tap water&quot;, n = 1000, lang = &quot;en&quot;) %&gt;%
  twListToDF() %&gt;%
  select(id, text)
tapwater_tweets &lt;- subset(tapwater_tweets, !duplicated(tapwater_tweets$text))
tapwater_tweets$text &lt;- gsub(&quot;’&quot;, &quot;'&quot;, tapwater_tweets$text)
write_csv(tapwater_tweets, &quot;Hydroinformatics/tapwater_tweets.csv&quot;)

When I first extracted these tweets, a tweet by CNN about tap water in Kentucky that smells like diesel was retweeted many times, so I removed all duplicate tweets from the set. Unfortunately, this left less than 300 original tweets in the corpus.

Sentiment analysis with Tidytext

Text analysis can be a powerful tool to help to analyse large amounts of text. The R language has an extensive range of packages to help you undertake such a task. The Tidytext package extends the Tidy Data logic promoted by Hadley Wickham and his Tidyverse software collection.

Data Cleaning

The first step in cleaning the data is to create unigrams, which involves splitting the tweets into individual words that can be analysed. The first step is to look at which words are most commonly used in the tap water tweets and visualise the result.

tidy_tweets &lt;- tapwater_tweets %&gt;%
  unnest_tokens(word, text)

tidy_tweets &lt;- tidy_tweets %&gt;%
  anti_join(stop_words) %&gt;%
  filter(!word %in% c(&quot;tap&quot;, &quot;water&quot;, &quot;rt&quot;, &quot;https&quot;, &quot;;, &quot;gt&quot;, &quot;amp&quot;, as.character(0:9)))

tidy_tweets %&gt;%
  count(word, sort = TRUE) %&gt;%
  filter(n &gt; 5) %&gt;%
  mutate(word = reorder(word, n)) %&gt;%
  ggplot(aes(word, n)) + geom_col(fill = &quot;dodgerblue4&quot;) +
    xlab(NULL) + coord_flip() + ggtitle(&quot;Most common words in tap water tweets&quot;)
ggsave(&quot;Hydroinformatics/tapwater_words.png&quot;, dpi = 300)

Most common words in tap water sentiment analysis

The most common words related to drinking the water and to bottled water, which makes sense. Also the recent issues in Kentucky feature in this list.

Sentiment Analysis

The Tidytext package contains three lexicons of thousands of single English words (unigrams) that were manually assessed for their sentiment. The principle of the sentiment analysis is to compare the words in the text with the words in the lexicon and analyse the results. For example, the statement: “This tap water tastes horrible” has a sentiment score of -3 in the AFFIN system by Finn Årup Nielsen due to the word “horrible”. In this analysis, I have used the “bing” method published by Liu et al. in 2005.

This method is certainly not fool proof as words with the same spelling can mean different things. For example, the phrase: “This tap water contains too much lead” will be assess as a positive sentiment because the verb lead is seen as positive. The noun lead has no sentiment as it depends on context.

sentiment_bing &lt;- tidy_tweets %&gt;%

sentiment_bing %&gt;%
  summarise(Negative = sum(sentiment == &quot;negative&quot;), 
            positive = sum(sentiment == &quot;positive&quot;))

sentiment_bing %&gt;%
  group_by(sentiment) %&gt;%
  count(word, sort = TRUE) %&gt;%
  filter(n &gt; 2) %&gt;%
  ggplot(aes(word, n, fill = sentiment)) + 
    geom_col(show.legend = FALSE) + 
    coord_flip() + 
    facet_wrap(~sentiment, scales = &quot;free_y&quot;) + 
    labs(x = NULL, y = NULL) + 
    ggtitle(&quot;Contribution to sentiment&quot;) 
ggsave(&quot;Hydroinformatics/tapwater_sentiment.png&quot;, dpi = 300)

This tap water sentiment analysis shows that two-thirds of the words that express a sentiment were negative. The most common negative words were “smells” and “scared”. This analysis is not a positive result for water utilities. Unfortunately, most tweets were not spatially located so I couldn’t determine the origin of the sentiment.

Tap Water sentiment analysis

Sentiment analysis is an interesting explorative technique, but it should not be interpreted as absolute truth. This method is not able to detect sarcasm or irony, and words don’t always have the same meaning as described in the dictionary.

The important message for water utilities is that they need to start taking the aesthetic properties of tap water as serious as the health parameters. A lack of trust will drive consumers to bottled water, or less healthy alternatives such as soft drinks are alternative water sources.

If you like to know more about customer perceptions of tap water, then read my book Customer Experience Management for Water Utilities by IWA Publishing.

Customer Experience Management for Water Utilities: Marketing Urban Water Supply

Topological Tomfoolery in R: Plotting a Möbius Strip

Topological Tomfoolery in R: Plotting a Möbius Strip

Geometry is an entertaining branch of mathematics. Topology is, according to Clifford Pickover, the “silly putty of mathematics”. This branch of maths studies the transformation of shapes, knots and other complex geometry problems.  One of the most famous topics in topology is the Möbius strip. This shape has some unusual properties which have inspired many artists, inventors, mathematicians and magicians.

You can make a Möbius strip by taking a strip of paper, giving it one twist and glue the ends together to form a loop. If you now cut this strip lengthwise in half, you don’t end-up with two separate strips, but with one long one.

The Möbius strip can also be described with the following parametric equations (where 0 \leq u \leq 2\pi , -1 \leq v \leq 1 and R is the radius of the loop):

x(u,v)= \left(R+\frac{v}{2} \cos \frac{u}{2}\right)\cos u
y(u,v)= \left(R+\frac{v}{2} \cos\frac{u}{2}\right)\sin u
z(u,v)= \frac{v}{2}\sin \frac{u}{2}

The mathematics of this set of parametric equations is not as compex as it looks. R is the radius of the ring, u is the polar angle of each point and v indicates the width of the strip. The polar angle u/2 indicates the number of half twists. To make the ring twist twice, change the anlge to u .

For my data science day job, I have to visualise some three-dimensional spaces so I thought I best learn how to do this by visualising a Möbis strip, using these three equations.

Plotting a Möbius Strip

The RGL package provides the perfect functionality for plotting Möbius strips. This package produces interactive three-dimensional plots that you can zoom and rotate. This package has many options to change lighting, colours, shininess and so on. The code to create for plotting a Möbius strip is straightforward.

The first section defines the parameters and converts the u and v sequences to a mesh (from the plot3D package). This function creates two matrices with every possible combination of u and v which are used to calculate the x, y, z points.

The last three lines define a 3D window with a white background and plot the 3D surface in blue. You can explore the figure with your mouse by zooming and rotating it. Parametric equations can be a bit of fun, play with the formula to change the shape and see what happens.


R <- 5
u <- seq(0, 2 * pi, length.out = 100)
v <- seq(-1, 1, length.out = 100)
m <- mesh(u, v)
u <- m$x
v <- m$y

## Móbius strip parametric equations
x <- (R + v/2 * cos(u /2)) * cos(u)
y <- (R + v/2 * cos(u /2)) * sin(u)
z <- v/2 * sin(u / 2)

## Visualise
bg3d(color = "white")
surface3d(x, y, z, color= "blue")

You can find the latest version of this code on GitHub.

Plotting a Möbius Strip
Plotting a Möbius Strip: RGL output.

We can take it to the next level by plotting a three-dimensional Möbius strip, or a Klein Bottle. The parametric equations for the bottle are mind boggling:

x(u,v) = -\frac{2}{15} \cos u (3 \cos{v}-30 \sin{u}+90 \cos^4{u} \sin{u} -60 \cos^6{u} \sin{u} +5 \cos{u} \cos{v} \sin{u})

y(u,v) = -\frac{1}{15} \sin u (3 \cos{v}-3 \cos^2{u} \cos{v}-48 \cos^4{u} \cos{v} + 48 \cos^6{u} \cos{v} - 60 \sin{u}+5 \cos{u} \cos{v} \sin{u}-5 \cos^3{u} \cos{v} \sin{u}-80 \cos^5{u} \cos{v} \sin{u}+80 \cos^7{u} \cos{v} \sin{u})

z(u,v) = \frac{2}{15} (3+5 \cos{u} \sin{u}) \sin{v}

Where: 0 \leq u \leq \pi and 0 \leq v \leq 2\leq .

The code to visualise this bottle is essentially the same, just more complex equations.

u <- seq(0, pi, length.out = 100)
v <- seq(0, 2 * pi, length.out = 100)
m <- mesh(u, v)
u <- m$x
v <- m$y
x <- (-2 / 15) * cos(u) * (3 * cos(v) - 30 * sin(u) + 90 * cos(u)^4 * sin(u) - 60 * cos(u)^6 * sin(u) + 5 * cos(u) * cos(v) * sin(u))
y <- (-1 / 15) * sin(u) * (3 * cos(v) - 3 * cos(u)^2 * cos(v) - 48 * cos(u)^4 * cos(v) + 48 * cos(u)^6 * cos(v) - 60 * sin(u) + 5 * cos(u) * cos(v) * sin(u) - 5 * cos(u)^3 * cos(v) * sin(u) - 80 * cos(u)^5 * cos(v) * sin(u) + 80 * cos(u)^7 * cos(v) * sin(u))
z <- (+2 / 15) * (3 + 5 * cos(u) * sin(u)) * sin(v)

bg3d(color = "white")
surface3d(x, y, z, color= "blue", alpha = 0.5)
Plotting a Klein Bottle in RGL
Plotting a Klein Bottle in RGL. Click to view RGL widget.

The RGL package has some excellent facilities to visualise three-dimensional objects, far beyond simple strips. I am still learning and am working toward using it to visualise bathymetric surveys of water reservoirs. Möbius strips are, however, a lot more fun.

If you are interested in visualising the oddities of mathematics then feel free to peruse my article about the Sierpinsi triangle.

Creating Real Möbius Strips

Even more fun than playing with virtual Möbius strips is to make some paper versions and start cutting, just like August Möbius did when he did his research. If you like to create a Möbius strip, you can recycle then purchase a large zipper from your local haberdashery shop, add some hook-and-loop fasteners to the ends and start playing. If you like to know more about the mathematics of the topological curiosity, then I can highly recommend Clifford Pickover’s book on the topic.

In the first half of the twentieth century, many magicians used the Möbius strip as a magic trick. The great Harry Blackstone performed it regularly in his show.

If you are interested in magic tricks and Möbius strips, including how to create a zippper version, then you can read my ebook on the Afghan bands.

The Möbius Strip in Magic. A Treatise on the Afghan Bands

Analysing Digital Water Meter Data using the Tidyverse

Analysing Digital Water Meter Data using the Tidyverse

In last week’s article, I discussed how to simulate water consumption data to help develop analytics and reporting. This post describes how to create a diurnal curve from standard digital metering data. The code for this article is available on GitHub.

Data Source

The simulated data consists  of three fields:

All analysis is undertaken in the local Australian Eastern Standard Time (AEST). The input to all functions is thus in AEST. The digital water meters send an hourly pulse at a random time within the hour. Each transmitter (RTU) uses a random offset to avoid network congestion. The digital meter counts each time the impeller makes a full turn, and for this analysis, we assume that this equates to a five-litre volume. The ratio between volume and count depends on the meter brand and type. The image below shows a typical data set for an RTU, including some missing data points.

Simulated water consumption.
Simulated water consumption (red: measured points, blue: interpolated points.

To analyse the data we need two auxiliary functions: one to slice the data we need and one to interpolate data for the times we need it. The Tidyverse heavily influences the code in this article. I like the Tidyverse way of doing things because it leads to elegant code that is easy to understand.

meter_reads &amp;amp;lt;- read.csv(&amp;amp;quot;meter_reads.csv&amp;amp;quot;)
rtu &amp;amp;lt;- unique(meter_reads$DevEUI)
meter_reads$TimeStampUTC &amp;amp;lt;- as.POSIXct(meter_reads$TimeStampUTC, tz = &amp;amp;quot;UTC&amp;amp;quot;)

Slicing Digital Water Metering Data

Data analysis is undertaken on slices of the complete data set. This function slices the available data by a vector of RTU ids and a timestamp range in AEST. This function adds a new timestamp variable in AEST. If no date range is provided, all available data for the selected RTUs is provided. The output of this function is a data frame (a Tibble in Tydiverse language).

slice_reads &amp;amp;lt;- function(rtus, dates = range(meter_reads$TimeStampUTC)) {
 filter(meter_reads, DevEUI %in% rtus) %&amp;amp;gt;%
    mutate(TimeStampAEST = as.POSIXct(format(TimeStampUTC, tz = &amp;amp;quot;Australia/Melbourne&amp;amp;quot;)) %&amp;amp;gt;%
    filter(TimeStampAEST &amp;amp;amp;amp;gt;= as.POSIXct(dates[1]) &amp;amp;amp; 
             TimeStampAEST &amp;amp;amp;amp;lt;= as.POSIXct(dates[2])) %&amp;amp;gt;%
    arrange(DevEUI, TimeStampAEST)

Interpolation of Meter Reads

This function interpolates the cumulative counts for a series of RTUs over a vector of timestamps in AEST. The function creates a list to store the results for each RTU, interpolates the data using the approx function and then flattens the list back to a data frame. The interpolation function contains a different type of pipe because of the approx for interpolation function does not take a data argument. The %$% pipe from the Magrittr package solves that problem.

The output is a data frame with DevEUI, the timestamp in AEST and the interpolated cumulative count. The image above shows the counts for two meters over two days an the graph superimposes an interpolated point over the raw data. Although the actual data consists of integer counts, interpolated values are numeric values. The decimals are retained to distinguish them from real reads.

interpolate_count &amp;amp;lt;- function(rtus, timestamps) {
  timestamps &amp;amp;lt;- as.POSIXct(timestamps, tz = &amp;amp;quot;Australia/Melbourne&amp;amp;quot;)
  results &amp;amp;lt;- vector(&amp;amp;quot;list&amp;amp;quot;, length(rtus))
  for (r in seq_along(rtus)) {
    interp &amp;amp;lt;- slice_reads(rtus[r]) %$%
      approx(TimeStampAEST, Count, timestamps)
    results[[r]] &amp;amp;lt;- data_frame(DevEUI = rep(rtus[r], length(timestamps)), 
                               TimeStampAEST = timestamps, Count = interp$y) 
  return(, results)) 

interpolate_count(rtu[2:3], seq.POSIXt(as.POSIXct(&amp;amp;quot;2020-02-01&amp;amp;quot;), as.POSIXct(&amp;amp;quot;2020-02-2&amp;amp;quot;), by = &amp;amp;quot;day&amp;amp;quot;)) 

slice_reads(rtu[2], c(&amp;amp;quot;2020-02-06&amp;amp;quot;, &amp;amp;amp;&amp;amp;quot;2020-02-08&amp;amp;quot;)) %&amp;amp;gt;%
  ggplot(aes(x = TimeStampAEST, y = Count))  + 
  geom_line(col = &amp;amp;quot;grey&amp;amp;quot;, size = 1) + 
    geom_point(col = &amp;amp;quot;red&amp;amp;quot;) + 
    geom_point(data = interpolate_count(rtu[2], as.POSIXct(&amp;amp;quot;2020-02-06&amp;amp;quot;) + (0:2) * 24 * 3600), colour = &amp;amp;quot;blue&amp;amp;quot;) + 
    ggtitle(paste(&amp;amp;quot;DevEUI&amp;amp;quot;, rtu[2]))

With these two auxiliary functions, we can start analysing the data.

Daily Consumption

Daily consumption for each connection is a critical metric in managing water resources and billing customers. The daily consumption of any water connection is defined by the difference between the cumulative counts at midnight. The interpolation function makes it easy to determine daily consumption. This function interpolates the midnight reads for each of the RTUs over the period, starting the previous day. The output of the function is a data frame that can be piped into the plotting function to visualise the data. When you group the data by date, you can also determine the total consumption over a group of services.

daily_consumption &amp;amp;lt;- function(rtus, dates) {
  timestamps &amp;amp;lt;- seq.POSIXt(as.POSIXct(min(dates)) - 24 * 3600, as.POSIXct(max(dates)), by = &amp;amp;quot;day&amp;amp;quot;) 
  interpolate_count(rtus, timestamps) %&amp;amp;gt;%
    group_by(DevEUI) %&amp;amp;gt;%
    mutate(Consumption = c(0, diff(Count)) * 5,
           Date = format(TimeStampAEST, &amp;amp;quot;%F&amp;amp;quot;)) %&amp;amp;gt;%
    filter(TimeStampAEST != timestamps[1]) %&amp;amp;gt;%
    select(DevEUI, Date, Consumption)

daily_consumption(rtu[32:33], c(&amp;amp;quot;2020-02-01&amp;amp;quot;, &amp;amp;quot;2020-02-7&amp;amp;quot;)) %&amp;amp;gt;%
  ggplot(aes(x = Date, y = Consumption)) + geom_col() + 
  facet_wrap(~DevEUI) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
Analysing digital water meter data: Daily consumption.
Analysing digital water meter data: Daily consumption.

Diurnal Curves

The diurnal curve is one of the most important pieces of information used in the design of water supply systems. This curve shows the usage of one or more services for each hour in the day. This curve is a reflection of human behaviour, as we use most water in the morning and the evenings.

This function slices data for a vector of RTUs over a period and then plots the average diurnal curve. The data is obtained by interpolating the cumulative counts for each whole hour in the period. The function then calculates the flow in litres per hour and visualises the minimum, mean and maximum value.

plot_diurnal_connections &amp;amp;lt;- function(rtus, dates) {
  timestamps &amp;amp;lt;- seq.POSIXt(as.POSIXct(dates[1]), as.POSIXct(dates[2]), by = &amp;amp;quot;hour&amp;amp;quot;) 
  interpolate_count(rtus, timestamps) %&amp;amp;gt;% 
    mutate(Rate = c(0, diff(Count * 5)),
           Hour = as.integer(format(TimeStampAEST, &amp;amp;amp;amp;quot;%H&amp;amp;amp;amp;quot;))) %&amp;amp;gt;% 
    filter(Rate &amp;amp;amp;amp;gt;= 0) %&amp;amp;gt;%
    group_by(Hour) %&amp;amp;gt;%
    summarise(min = min(Rate), mean = mean(Rate), max = max(Rate)) %&amp;amp;gt;%
    ggplot(aes(x = Hour, ymin = min, ymax = max)) + 
      geom_ribbon(fill = &amp;amp;quot;lightblue&amp;amp;quot;, alpha = 0.5) + 
      geom_line(aes(x = Hour, y = mean), col = &amp;amp;quot;orange&amp;amp;quot;;, size = 1) +
      ggtitle(&amp;amp;quot;Connections Diurnal flow&amp;amp;quot;) + ylab(&amp;amp;quot;Flow rate [L/h]&amp;amp;quot;)

plot_diurnal_connections(rtu[12:20], c(&amp;amp;quot;2020-02-01&amp;amp;quot;, &amp;amp;quot;2020-03-01&amp;amp;quot;))
Analysing digital water meter data: Diurnal curve.
Analysing digital water meter data: Diurnal curve.

Boxplots are also an informative way to visualise this curve. This method provides more statistical information on one page, and the ggplot function performs the statistical analysis.

plot_diurnal_box &amp;amp;lt;- function(rtus, dates) {
  timestamps &amp;amp;lt;- seq.POSIXt(as.POSIXct(dates[1]), as.POSIXct(dates[2]), by = &amp;amp;quot;hour&amp;amp;quot;) 
  interpolate_count(rtus, timestamps) %&amp;amp;gt;% 
    mutate(Rate = c(0, diff(Count * 5)),
           Hour = as.integer(format(TimeStampAEST, &amp;amp;quot;%H&amp;amp;quot;))) %&amp;amp;gt;% 
    filter(Rate &amp;amp;gt;= 0) %&amp;amp;gt;%
    group_by(Hour) %&amp;amp;gt;%
    ggplot(aes(x = factor(Hour), y = Rate)) + 
      geom_boxplot() + 
      ggtitle(&amp;amp;quot;Diurnal flow&amp;amp;quot;) + ylab(&amp;amp;quot;Flow rate [L/h]&amp;amp;quot;) + xlab(&amp;amp;quot;Time&amp;amp;quot;)
plot_diurnal_box(rtu[12:20], c(&amp;amp;quot;2020-02-01&amp;amp;quot;, &amp;amp;quot;2020-03-01&amp;amp;quot;))
Analysing digital water meter data: Diurnal curve.
Analysing digital water meter data: Diurnal curve.

Further Analysing Digital Water Metering Data

These are only glimpses into what is possible with this type of data. Further algorithms need to be developed to extract additional value from this data. I am working on developing leak detection algorithms and clustering diurnal curves, daily consumption graphs and so on. Any data science enthusiast who is interested in helping me to develop an Open Source R library to analyse digital metering data.


Simulating Water Consumption to Develop Analysis and Reporting

Simulating Water Consumption to Develop Analysis and Reporting

I am currently working on developing analytics for a digital water metering project. Over the next five years, we are enabling 70,000 customer water meters with digital readers and transmitters. The data is not yet available but we don’t want to wait to build reporting systems until after the data is live. The R language comes to the rescue as it has magnificent capabilities to simulate data. Simulating data is a useful technique to progress a project when data is being collected. Simulated data also helps because the outcomes of the analysis are known, which helps to validate the outcomes.

The raw data that we will eventually receive from the digital customer meters has the following basic structure:

  • DevEUI: Unique device identifier.
  • Timestamp: Date and time in (UTC) of the transmission.
  • Cumulative count: The number of revolutions the water meter makes. Each revolution is a pulse which equates to five litres of water.

Every device will send an hourly data burst which contains the cumulative meter read in pulse counts. The transmitters are set at a random offset from the whole our, to minimise the risk of congestion at the receivers. The time stamp for each read is set in the Coordinated Universal Time (UTC). Using this time zone prevents issues with daylight savings. All analysis will be undertaken in the Australian Eastern (Daylight) Time zone.

This article explains how we simulated test data to assist with developing reporting and analysis. The analysis of digital metering data follows in a future post. The code and the data can be found on GitHub. I have recently converted to using the Tidyverse for all my R coding. It has made my working life much easier and I will use it for all future posts.

Simulating water consumption

For simplicity, this simulation assumes a standard domestic diurnal curve (average daily usage pattern) for indoor water use. Diurnal curves are an important piece of information in water management. The curve shows water consumption over the course of a day, averaged over a fixed period. The example below is sourced from a journal article. This generic diurnal curve consists of 24 data points based on measured indoor water consumption, shown in the graph below.

Simulating water consumption: diurnal curve example
Source: Gurung et al. (2014) Smart meters for enhanced water supply network modelling and infrastructure planning. Resources, Conservation and Recycling (90), 34-50.

This diurnal curve only includes indoor water consumption and is assumed to be independent of seasonal variation. This is not a realistic assumption, but the purpose of this simulation is not to accurately model water consumption but to provide a data set to validate the reporting and analyses.

Simulating water consumption in R

The first code snippet sets the parameters used in this simulation. The unique device identifiers (DevEUI) are simulated as six-digit random numbers. The timestamps vector consists of hourly date-time variables in UTC. For each individual transmitter, this timestamp is offset by a random time. Each transmitter is also associated with the number of people living in each house. This number is based on a Poisson distribution.

# Boundary conditions
n <- 100 # Number of simulated meters
d <- 100 # Number of days to simulate
s <- as.POSIXct(&quot;2020-01-01&quot;, tz = &quot;UTC&quot;) # Start of simulation

set.seed(1969) # Seed random number generator for reproducibility
rtu <- sample(1E6:2E6, n, replace = FALSE) # 6-digit id
offset <- sample(0:3599, n, replace = TRUE) # Unique Random offset for each RTU

# Number of occupants per connection
occupants <- rpois(n, 1.5) + 1 %>%
  ggplot(aes(occupants)) + geom_bar(fill = &quot;dodgerblue2&quot;, alpha = 0.5) + 
  xlab("Occupants") + ylab("Connections") + ggtitle("Occupants per connection")
Simulated number of occupants per connection.
Simulated number of occupants per connection.

The diurnal curve is based on actual data which includes leaks as the night time use shows a consistent flow of about one litre per hour. For that reason, the figures are rounded and reduced by one litre per hour, to show a zero flow when people are usually asleep. The curve is also shifted by eleven hours because the raw data is stored in UTC.

diurnal <- round(c(1.36, 1.085, 0.98, 1.05, 1.58, 3.87, 9.37, 13.3, 12.1, 10.3, 8.44, 7.04, 6.11, 5.68, 5.58, 6.67, 8.32, 10.0, 9.37, 7.73, 6.59, 5.18, 3.55, 2.11)) - 1 

data.frame(TimeUTC = 0:23, Flow = diurnal) %>% 
  ggplot(aes(x = TimeUTC, y = Flow)) + 
  geom_area(fill = "dodgerblue2", alpha = 0.5) +
  scale_x_continuous(breaks = 0:23) + ylab(&quot;Flow [L/h/p]&quot;) + 
  ggtitle("Idealised diurnal curve for households")

tdiff <- 11
diurnal <- c(diurnal[(tdiff + 1): 24], diurnal[1:tdiff])

This simulation only aims to simulate a realistic data set and not to present an accurate depiction of reality. This simulation could be enhanced by using different diurnal curves for various customer segments and to include outdoor watering, temperature dependencies and so on.

Simulating Water Consumption

A leak is defined by a constant flow through the meter, in addition to the idealised diurnal curve. A weighted binomial distribution (θ = 0.1) models approximately one in ten properties with a leak. The size of the leak is derived from a random number between 10 and 50 litres per hour.

The data is stored in a matrix through a loop that cycles through each connection. The DevEUI is repeated over the simulated time period (24 times the number of days). The second variable is the time stamp plus the predetermined offset for each RTU. The meter count is defined by the cumulative sum of the diurnal flow, multiplied by the number of occupants. Each point in the diurnal deviates from the model curve by ±10%. Any predetermined leakage is added to each meter read over the whole period of 100 days. The hourly volumes are summed cumulatively to simulate meter reads. The flow is divided by five as each meter revolution indicate five litres.

The next code snippet simulates the digital metering data using the assumptions and parameters outlined above.

leaks <- rbinom(n, 1, prob = .1) * sample(10:50, n, replace = TRUE) data.frame(DevEUI = rtu, Leak = leaks) %>%
  subset(Leak > 0)

# Digital metering data simulation
meter_reads <- matrix(ncol = 3, nrow = 24 * n * d)
colnames(meter_reads) <- c("DevEUI", "TimeStampUTC , "Count")

for (i in 1:n) {
  r <- ((i - 1) * 24 * d + 1):(i * 24 * d)
  meter_reads[r, 1] <- rep(rtu[i], each = (24 * d))
  meter_reads[r, 2] <- seq.POSIXt(s, by = "hour", length.out = 24 * d) + offset[i]
  meter_reads[r, 3] <- round(cumsum((rep(diurnal * occupants[i], d) + leaks[i]) * 
                                     runif(24 * d, 0.9, 1.1))/5)

meter_reads &lt;- meter_reads %>% 
  as_data_frame() %>%
  mutate(TimeStampUTC = as.POSIXct(TimeStampUTC, origin = "1970-01-01", tz = "UTC"))

Missing Data Points

The data transmission process is not 100% reliable and the base station will not receive some reads. This simulation identifies reads to be removed from the data through the temporary variable remove. This simulation includes two types of failures:

  • Faulty RTUs (2% of RTUs with missing 95% of data)
  • Randomly missing data points (1% of data)
meter_reads <- mutate(meter_reads, remove = 0)
# Define faulty RTUs (2% of fleet)
faulty <- rtu[rbinom(n, 1, prob = 0.02) == 1]
meter_reads$remove[meter_reads$DevEUI %in% faulty] <- rbinom(sum(meter_reads$DevEUI %in% faulty), 1, prob = .95)

# Data loss
missing <- sample(1:(nrow(meter_reads) - 5), 0.005 * nrow(meter_reads))
for (m in missing){
  meter_reads[m:(m + sample(1:5, 1)), "remove"] <- 1

# Remove data points
meter_reads <- filter(meter_reads, remove == 0) %>%

filter(meter_reads, DevEUI %in% rtu[2]) %>%
  mutate(TimeStampAEST = as.POSIXct(format(TimeStampUTC, 
                                           tz = "Australia/Melbourne"))) %>%
  filter(TimeStampAEST >= as.POSIXct("2020-02-06") &
         TimeStampAEST <= as.POSIXct("2020-02-08")) %>%
  arrange(DevEUI, TimeStampAEST) %>% 
  ggplot(aes(x = TimeStampAEST, y = Count, colour = factor(DevEUI)))  + 
    geom_line() + geom_point() 

The graph shows an example of the cumulative reads and some missing data points.

Simulated water consumption


Analysing Digital Metering Data

Data simulation is a good way to develop your analysis algorithms before you have real data. I have also used this technique when I was waiting for survey results during my dissertation. When the data finally arrived, I simply had to plug it into the code and finetune the code. R has great capabilities to simulate reality to help you understand the data. The ggplot package provides excellent functionality to visualise water consumption.

In next week’s article, I will outline how I used R and the Tidyverse package to develop libraries to analyse digital metering data.

Writing Academic Articles using R Sweave and LaTeX

Body Image research paper written with Org Mode, R and laTeX

One of my favourite activities in R is using Markdown to create business reports. Most of my work I export to MS Word to communicate analytical results with my colleagues. For my academic work and eBooks, I prefer LaTeX to produce great typography. This article explains how to write academic articles and essays combining R Sweave and LaTeX. The article is formatted in accordance with the APA (American Psychological Association) requirements.

To illustrate the principles of using R Sweave and LaTeX, I recycled an essay about problems with body image that I wrote for a psychology course many years ago. You can find the completed paper and all necessary files on my GitHub repository.

In an other post I have outlined how to achieve the same result using Emacs and Org Mode to write acedemic articles.

Body Image

Body image describes the way we feel about the shape of our body. The literature on this topic demonstrates that many people, especially young women, struggle with their body image. A negative body image has been strongly associated with eating disorders. Psychologists measure body image using a special scale, shown in the image below.

My paper measures the current and ideal body shape of the subject and the body shape of the most attractive other sex. The results confirm previous research which found that body dissatisfaction for females is significantly higher than for men. The research also found a mild positive correlation between age and ideal body shape for women and between age and the female body shape found most attractive by men. You can read the full paper on my personal website.

Body shape measurement scale.
Body shape measurement scale.

R Sweave and LaTeX

The R file for this essay uses the Sweave package to integrate R code with LaTeX. The first two code chunks create a table to summarise the respondents using the xtable package. This package creates LaTeX or HTML tables from data generated by R code.

The first lines of the code read and prepare the data, while the second set of lines creates a table in LaTeX code. The code chunk uses results=tex to ensure the output is interpreted as LaTeX. This approach is used in most of the other chunks. The image is created within the document and saved as a pdf file and back integrated into the document as an image with appropriate label and caption.

View the Rnw file on my GitHub repository to see the full code.


I created this file in R Studio, using the Sweave and knitr functionality. To knit the R Sweave file for this paper you will need to install the apa6 and ccicons packages in your LaTeX distribution. The apa6 package provides macros to format papers in accordance with the requirements American Psychological Association.