Analyse Site Structure and Internal Linking with igraph

One of the essential parts of Search Engine Optimisation (SEO) is a logical site structure. Site structure relates to how the pages of a website link to each other. A well-designed structure of internal links improves the user’s experience and helps web crawlers find their way through the website. This article shows how to use the igraph package to analyse site structure.

The igraph package is a versatile library that can analyse many different types of networks and relational data.

The Lucid Manager website has been around for ten years now, and the content has grown organically over time, reflecting my changing interests and evolving career. Although I use categories and tags to organise the articles, internal linking between posts is a bit haphazard. This post shows how I extracted the relevant data from the WordPress database and visualised the internal links with the igraph package. The analysis shows that my website needs a bit of rework to untangle its jungle of internal links and provide you and search engines with a better experience. You can view the R code to investigate site structure below or download the latest version from my GitHub repository.

Site Structure

A website needs to have a structure to prevent it from being a loose collection of articles. Internal links on a website organise the available information through taxonomy, and these links provide context to the text by referring to related pages. The taxonomy consists of menus, breadcrumbs, categories, tags and other structural elements. Contextual links are located in the body text of each article and provide additional or related information to the reader.

WordPress automatically adds links to categories and tags and thus creates an ordered structure. Body text links are produced much more organically and can therefore quickly turn into a chaotic jungle of relationships. WordPress has several plugins that help with site structure, but none of these tools provides a complete overview of the structure of the internal links.

The taxonomy provides an automated structure of the website by categorising articles. The internal linking structure is organic and needs some further consideration. This analysis focuses on the organic links within the body of the text to visualise and analyse the internal linking structure.

Extracting WordPress Data

The Lucid Manager is created with WordPress. This system stores the body text of all articles in the wp_posts table in the database. To analyse site structure, we need the following fields:
post_name (slug)
post_content (full text)
post_type (only posts)
post_status (only published posts)

The post type is in most instances either a page or a post, but it can also be an attachment and so on. The post name (slug) links posts together, and the post content contains the full text of the article in HTML.

Several methods are available to extract the required data from the WordPress database. The easiest way is to use a plugin such as WP All Export, where you can drag and drop the required fields or run an SQL query. The second method is to log in to phpMyAdmin on your cPanel and run the relevant queries.

The third method involves using the RMySQL package to download data from the site directly. You will need to create a new user with limited access rights and open the database to your IP address.

The first step reads the data from the database and selects all published posts. The credentials are read from the dbconnect.R file, so I don’t publish sensitive data on this website. The table names will be different if you use a WordPress network. Thanks to ‘phaskat’ on the WordPress StackExchange for helping with the queries. Please note that when using a WordPress network (as I do), you will need to replace the wp prefix in the table name with the relevant string.
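A minimal sketch of this step, assuming the RMySQL package and a dbconnect.R file that defines the credential variables (the file and variable names are assumptions):

```r
library(RMySQL)  # also attaches DBI

source("dbconnect.R")  # assumed to define db_host, db_user, db_password and db_name

con <- dbConnect(MySQL(), host = db_host, user = db_user,
                 password = db_password, dbname = db_name)

## Select all published posts; replace the wp_ prefix for a WordPress network
posts <- dbGetQuery(con, "SELECT post_name, post_content
                          FROM wp_posts
                          WHERE post_status = 'publish' AND post_type = 'post';")
dbDisconnect(con)
```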

Converting the data to a network

The second step uses the tidytext package to convert the texts of the articles into tokens. In this case, a token is defined as any set of characters between spaces. In the default setting, the tokenisation function also splits words at dashes, which is not helpful because I want to preserve the hyperlinks as one token.
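A sketch of the tokenisation step, assuming the posts data frame from the previous snippet; splitting on whitespace keeps each hyperlink together as a single token:

```r
library(dplyr)
library(tidytext)

## Split the HTML body text on whitespace so URLs stay in one piece
post_words <- posts %>%
  unnest_tokens(word, post_content, token = "regex", pattern = "\\s+")
```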

The third step uses regular expressions to find internal links and only retain the slug. The slug is the text after the website name. The post_name field in the database contains the slug for the post itself.

These steps result in a table that shows how all slugs link to each other. In network analysis, this is an adjacency list as each line represents a relationship in the network.
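A possible implementation of the link extraction, assuming the site lives at lucidmanager.org; adjust the domain in the regular expression for your own website:

```r
library(stringr)

links <- post_words %>%
  filter(str_detect(word, "lucidmanager\\.org")) %>%                        # internal links only
  mutate(to = str_extract(word, "(?<=lucidmanager\\.org/)[a-z0-9-]+")) %>%  # retain the slug
  filter(!is.na(to)) %>%
  select(from = post_name, to)
```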

Analysing Site Structure with igraph

Network analysis is the perfect tool to analyse site structure because each post on the website is a node and a link between two articles is a graph edge (arrow).

The Lucid Manager currently discusses strategic and fun data analysis, but in the past, I wrote articles about water utility marketing and critical perspectives on management theories. The questions I like to answer with this analysis are:
– Are all pages connected to each other in one or more steps?
– What is the most linked page?
– Which pages link to themselves?
– Are there duplicated links within one post?

The igraph package transforms the link table into a network. Any posts and pages without links (solitary pages) are added separately. Each post is given a colour related to its category, and pages receive a separate colour.
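A minimal sketch of building the network, assuming the links adjacency list from above; the category colouring is omitted for brevity:

```r
library(igraph)

## Include solitary pages as vertices so unlinked posts also appear in the network
vertices <- data.frame(name = union(posts$post_name, c(links$from, links$to)))
network <- graph_from_data_frame(links, directed = TRUE, vertices = vertices)

plot(network, vertex.size = 4, vertex.label = NA, edge.arrow.size = 0.3)
```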

This analysis shows that the internal link structure contains several subnetworks, which are groups of pages that are only linked to each other. This graph helps to identify more linking opportunities to dissolve these subnetworks.

The degree function in igraph determines the number of adjacent edges for each node. The degree function can be restricted to incoming or outgoing links. My article about service quality in water utilities is the most linked post on this site. In network analysis, this is the node with the highest degree.
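A sketch of this step on the network object defined above:

```r
## Number of adjacent edges per node; mode = "in" counts incoming links only
head(sort(degree(network, mode = "in"), decreasing = TRUE))
```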

The loops in the diagram indicate self-referencing articles. The which_loop function identifies which arrows have the same start and end, and the E function lists the edges. This shows that four pages on my website refer to themselves.

Lastly, the which_multiple function identifies any duplicated edges in the graph. Within this context, these are pages that contain the same link more than once. The Lucid Manager website has seventeen instances of duplicated links.
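A sketch of these two checks, again assuming the network object from above:

```r
E(network)[which_loop(network)]      # edges that start and end at the same page
E(network)[which_multiple(network)]  # duplicated links within a post
sum(which_multiple(network))         # number of duplicated links
```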

This first phase gives me plenty of homework to improve the internal linking structure of this website.

Visualising website structure.

Future improvements

This analysis is a great step to start systematically analysing organic website structure. Several other techniques help to understand the structure of a website further. Community detection and the inclusion of tags and categories in the data can provide additional context.

SEO experts might find this concept useful. Perhaps somebody else can develop a WordPress plugin to provide this visualisation.

Decode Lyrics in Pop Music: Visualise Prose with the Songsim algorithm

Music is an inherently mathematical form of art. Ancient Greek mathematician Pythagoras was the first to describe the logic of the scales that form melody and harmony. Numbers can also represent the rhythm of the music. Even the lyrics have a mathematical structure. Poets structure syllables and repeat words to create pleasing sounding prose. This article shows how to decode lyrics from pop songs and visualise them using the Songsim method to analyse their metre.

Decode Lyrics using the Songsim algorithm

Data visualiser, pop music appreciator and machine learner Colin Morris has extensively analysed the repetitiveness of song lyrics. Colin demonstrated that lyrics have become more repetitive since the early days of pop music. The most repetitive song is Around the World by Daft Punk, which should not be a surprise since the artist repeats the same phrase 144 times. Bohemian Rhapsody by Queen has some of the least repetitive lyrics in popular music.

The TEDx presentation (see below) by Colin Morris shows how he visualises the repetitiveness of song lyrics with what he calls the Songsim algorithm. The more points in the graph, the more often a word is repeated.

Decoding lyrics: Daft Punk versus Queen. Visualising the lyrics of Around the World and Bohemian Rhapsody.

The visual language of song lyrics

Morris decided to use a self-similarity matrix, a technique also used to visualise DNA sequences, to decode lyrics. In this method, the individual words of the song are the names of the columns and the rows of a matrix. For every point in the song where the row name equals the column name, the matrix shows a dot. By definition, the diagonal of every similarity matrix is filled. The timeline of the song runs along the diagonal from top left to bottom right.

Patterns away from the diagonal represent two different points in time that have the same words. The more of these patterns we see, the more repetitive a song is. Let’s demonstrate this with the first words ever recorded by Thomas Edison in 1877.

Mary had a little lamb, whose fleece was white as snow. And everywhere that Mary went, the lamb was sure to go.

The similarity matrix below visualises the two first sentences of the famous nursery rhyme. It shows where the words “Mary”, “lamb” and “was” are repeated once.

Self-similarity matrix for Mary had a Little Lamb by Thomas Edison.
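The idea can be demonstrated in a few lines of R; this sketch builds the matrix for the nursery rhyme above:

```r
rhyme <- "Mary had a little lamb whose fleece was white as snow
          and everywhere that Mary went the lamb was sure to go"
words <- tolower(unlist(strsplit(rhyme, "\\s+")))

## TRUE wherever the word in row i equals the word in column j
songsim <- outer(words, words, "==")

## Plot the matrix; columns are reversed so the timeline runs top left to bottom right
image(songsim[, rev(seq_along(words))] * 1, col = c("white", "black"), axes = FALSE)
```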

The snowflake diagrams are a visual language for decoding lyrics. The verses are the gutters with only diagonal lines; a verse is not very repetitive besides some stop words. The verse repeats through the song. Many songs have a bridge that contrasts with the rest of the song. In most songs, the bridge appears as a unique pattern that is only similar to itself.

The diagram below visualises the lyrics of one of the most famous pop songs ever, Waterloo by Abba. The first 30 words are the opening verse, which shows little repetition, other than stop words such as "and" and the pronoun "I". After that, we see diagonal lines appearing that represent the repetitive use of the song title. Towards the end of the song, we see the bridge, which looks like a little snowflake within the diagram.

Decoding lyrics with Songsim: Waterloo by Abba.

The next section shows how to implement this approach with ggplot, scraping pop song lyrics from the azlyrics.com website.

Implementing Songsim with ggplot

The code below visualises song lyrics or poetry as suggested by Colin Morris. The code uses four libraries. I use the tidyverse series of libraries because it makes life very easy. The tidytext library uses the tidyverse principles to analyse text. The old reshape2 library helps to transform a matrix, and lastly, rvest helps to scrape song lyrics from the azlyrics website.

The first function scrapes song lyrics from the azlyrics website using the artist and song as input. The first three lines clean the artist and song variables. This code removes any character that is not a number or a letter, converts the text to lowercase and lastly removes the definite article from the artist name. These two fields are then concatenated to create the URL, which the function prints. The remainder of the code scrapes the lyrics from the website or stops with a 404 error when it cannot find the song/artist combination.
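A sketch of such a scraper, assuming the azlyrics.com URL pattern https://www.azlyrics.com/lyrics/artist/song.html; the CSS selector for the lyrics block is an assumption and may need adjusting when the site layout changes:

```r
library(rvest)
library(stringr)

get_lyrics <- function(artist, song) {
  ## Clean the inputs: keep letters and numbers, lowercase, drop the definite article
  artist <- str_remove(str_to_lower(str_remove_all(artist, "[^A-Za-z0-9]")), "^the")
  song <- str_to_lower(str_remove_all(song, "[^A-Za-z0-9]"))
  url <- paste0("https://www.azlyrics.com/lyrics/", artist, "/", song, ".html")
  print(url)
  read_html(url) %>%
    html_nodes("div.col-xs-12.col-lg-8.text-center > div:not([class])") %>%  # assumed selector
    html_text()
}
```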

The second function implements the Morris method to visualise the lyrics. The code extracts single words from the text and places them in a data frame (tibble). This data frame is converted to a boolean matrix that contains the visualisation.

The code looks at each word and places the value TRUE where it reappears in the song. Each of the resulting vectors is then concatenated into a matrix. Lastly, ggplot visualises the matrix as a raster.
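A sketch of the visualisation step, assuming a single character string of lyrics as input; it uses outer() as a shortcut for the word-by-word comparison:

```r
library(tidytext)
library(reshape2)
library(tibble)
library(ggplot2)

plot_songsim <- function(lyrics) {
  words <- unnest_tokens(tibble(text = lyrics), word, text)$word
  songsim <- outer(words, words, "==")            # TRUE where two positions share a word
  molten <- melt(songsim, varnames = c("x", "y"))
  ggplot(molten, aes(x, -y, fill = value)) +
    geom_raster() +
    scale_fill_manual(values = c("FALSE" = "white", "TRUE" = "black"), guide = "none") +
    theme_void()
}

## Example usage, combining both functions:
## plot_songsim(paste(get_lyrics("Abba", "Waterloo"), collapse = " "))
```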

What does your favourite song look like as a snowflake diagram?

The Songsim code

You can view the code below or download the latest version from my GitHub repository.

Strategic Data Science: Creating Value With Data Big and Small

Data science is without a doubt the most popular business fad of the past decade. The promise of machine learning blinds many managers, so they forget about deploying these new approaches strategically. This article provides a framework for data science strategy and is a synopsis of my book Principles of Strategic Data Science, available on LeanPub.

What is Data Science?

The first data scientist in a business context was Frederick Taylor. He was a pioneer in using data in business to lower the influence of opinions and rules of thumb in favour of a scientific approach to management. The term data science emerged in the middle of the last century when electronic computation first became a topic of study. In those days, the discipline was literally a science of storing and manipulating data.

The current definition has drifted away from this initial academic activity to business activity. The present data science hype can be traced back to an article in the 2012 edition of Harvard Business Review. Davenport and Patil proclaimed data scientist to be the sexiest job of the twenty-first century. In the wake of this article, the number of data science searches in Google increased rapidly.

Organisations have for a long time used data to improve the lives of their customers, shareholders or society overall. Management gurus promoted concepts such as the data-driven organisation, evidence-based management, business intelligence and Six Sigma to help businesses realise the benefits of their data. Data science is an evolution of these methods enabled by the data revolution.

The Data Revolution

Recent developments in information technology have significantly improved what we can do with data, resulting in what we now know as data science. Firstly, most business processes are managed electronically, which has exponentially increased the amount of available data. Developments in communication, such as the Internet of Things and personal mobile devices, have significantly reduced the price of collecting data.

Secondly, the computing capabilities on the average office worker’s desk outstrip the capabilities of the supercomputers of the past. Not only is it cheaper to collect vast amounts of electronic data, but processing these enormous volumes has also come within reach of the average office worker.

Lastly, developments in applied mathematics and open source licensing have accelerated our capabilities in analysing this data. These new technologies allow us to discover patterns that were previously invisible. Most tools required to examine data are freely available on the internet with a helpful community sharing knowledge on how to use them.

These three developments enabled an evolution from traditional business analysis to data science. Data science is the strategic and systematic approach to analysing data to achieve organisational objectives using electronic computing. This definition is agnostic of the promises of machine learning and leverages the three developments mentioned above. Data science is the next evolution in business analysis that maximises the value we can extract from data.

Data Science Strategy Competencies

The craft of data science combines three different competencies. Data scientist Drew Conway visualised the three core competencies of data science in a Venn diagram.

Data Science Venn Diagram (Conway, 2010).

Firstly and most importantly, data science requires domain knowledge. Any analysis needs to be grounded in the reality it seeks to improve. Subject-matter expertise is necessary to make sense of the investigation. Professional expertise in most areas uses mathematics to understand and improve outcomes. New mathematical tools expand the traditional approaches to develop a deeper understanding of the domain under consideration. Computer science is the competency that binds the available data with mathematics. Writing computer code to extract, transform and analyse data to create information and stimulate knowledge is an essential skill for any data scientist.

Good Data Science

To create value with data, we need to know how to create or recognise good data science. The second chapter uses three principles originally introduced two thousand years ago by Roman architect and engineer Vitruvius. He wrote that buildings need to be useful, sound and aesthetic. These requirements are also ideally suited to define best-practice in data science.

The Vitruvian triangle for data science.

For data science to be useful, it needs to contribute to the objectives of an organisation positively. It is in this sense that data science is an applied science and not an academic pursuit. The famous Data-Information-Knowledge pyramid visualises the process of creating value from data.

Usefulness

Useful data science meaningfully improves our reality through data. Data is a representation of either a social or physical reality. Any data source is ever only a sample of the fullness and complexity of the real world. Information is data imbued with context. The raw data collected from reality needs to be summarised, visualised and analysed for managers to understand the reality of their business. This information increases knowledge about a business process, which is in turn used to improve the reality from which the data was collected. This feedback loop visualises the essence of analysing data in businesses. Data science is a seductive activity because it is reasonably straightforward to create impressive visualisations with sophisticated algorithms. If data products don’t improve or enlighten the current situation, they are in essence useless.

The Reality, Data, Information, Knowledge pyramid.

Soundness

Data science needs to be sound in that the outcomes are valid and reliable. The validity and reliability of data are where the science meets the traditional approaches to analysing data. Validity is the extent to which the data represents the reality it describes. The reliability of data relates to the accuracy of the measurement. These two concepts depend on the type of data under consideration. Measuring physical processes is less complicated than the social aspects of society. Validity and reliability are in essence a sophisticated way of expressing the well-known Garbage-In-Garbage-Out principle.

Reliability and validity of data and analysis.

The soundness of data science also relates to the reproducibility of the analysis to ensure that other professionals can review the outcomes. Reproducibility prevents the data, and the process by which it was transformed and analysed, from becoming a black box in which we have no reason to trust the results. Data science also needs to be sound concerning the governance of the workflow. All data sources need to be curated by relevant subject matter experts to ensure their validity and reliability. Data experts ensure that the data is available to those who need it.

Aesthetics

Lastly, data science needs to be aesthetic to ensure that any visualisation or report is easy to understand by the consumer of the analysis. This requirement is not about beautification through infographics. Aesthetic data products minimise the risk of making wrong decisions because the information is presented without room for misinterpretation. Any visualisation needs to focus on telling a story with the data. This story can be a comparison, a prediction, a trend or whatever else is relevant to the problem.

One of the essential principles of aesthetic data science is the data-to-pixel ratio. This principle means that we need to maximise the ratio of pixels that present information to all the pixels on the screen. Good data visualisation practises austerity to ensure that the people who consume the information understand the story that needs to be told.

Example of low and high data-to-pixel ratio.

Strategic Data Science

The data science continuum is a strategic journey for organisations that seek to maximise value from data. As an organisation moves along the continuum, increased value is the payoff for increased complexity. This continuum is not a strict hierarchy, as all phases are equally important; however, the latter stages cannot exist without the previous ones.

 

Data science continuum.

Collecting data requires careful consideration of what to collect, how to collect it and at what frequency. Collecting meaningful data requires a good understanding of the relationship between reality and the data. There is no such thing as raw data, as all information relies on assumptions and practical limitations.

Describing the data is the first step in extracting value. Descriptive statistics are the core of most business reporting and are an essential first step in analysing the data.

Diagnostics or analysis is the core activity of most professions. Each subject area uses specialised methods to create new information from data.

Predictive analysis seems to be the holy grail for many managers. A prediction is not a perfect description of the future but provides the distribution of possible futures. Managers can use this information to change the present to construct their desired future.

Prescriptive analysis uses the knowledge created in the previous phases to automatically run a business process and even decide on future courses of action.

Any organisation starting with data science should follow the five phases in this process and not jump ahead to try to bypass the seemingly less valuable stages.

The Data-Driven Organisation

Implementing a data science strategy is more than a matter of establishing a specialised team and solving complex problems. Creating a data-driven organisation that maximises the value of data requires a whole-of-business approach that involves people with the right attitude and skills, appropriate systems and robust processes.

A data science team combines the three competencies described in the Conway Venn diagram. People who have skills in all three of these areas are rare, and the industry calls them unicorns. There is no need for recruiters to start hunting unicorns because these three areas of expertise can also exist within a team. Possibly more important than the technical skills are the social skills of a data scientist. Not only do they need to create useful, sound and aesthetic data science, they also need to convince the consumers of their work of its value.

One of the problems of creating value with data is ensuring that the results are implemented in the organisation. A starting point to achieve this is to ensure that the users of data products have a relevant level of data literacy. Developing data literacy among the consumers of data science is perhaps the greatest challenge. The required level of data literacy depends on the type of position and the role of the data consumer within the organisation.

Data scientists use an extensive range of tools and are often opportunistic in their choice of software. Spreadsheets are not very suitable to create good data science. Data science requires coding skills and the Python and R languages are powerful tools to solve complex problems. After the data specialists have developed the best way to analyse data, they need to communicate these to their customers. Many specific products exist to communicate data to users with interactive dashboards and many other dynamic systems.

The final part of this book delves into the ethics of data science. From the fact that something can be done, we cannot conclude that it should be done. Just like any other profession that impacts humans, data scientists need ethical guidelines to ensure that their activities cause no harm. This book provides some basic guidelines that can assist data scientists to assess the ethical merits of their projects.

Data science ethics: Don't be creepy.

Factor Analysis in R with Psych Package: Measuring Consumer Involvement

The first step for anyone who wants to promote or sell something is to understand the psychology of potential customers. Getting into the minds of consumers is often problematic because measuring psychological traits is a complex task. Researchers have developed many parameters that describe our feelings, attitudes, personality and so on. One of these measures is consumer involvement, which is a measure of the attitude people have towards a product or service.

The most common method to measure psychological traits is to ask people a battery of questions. Analysing these answers is complicated because it is difficult to relate the responses to a survey to the software of the mind. While the answers given by survey respondents are the directly measured variables, what we would like to know are the hidden (latent) states in the mind of the consumer. Factor Analysis is a technique that helps to discover latent variables within a set of response data, such as a customer survey.

The basic principle of measuring consumer attitudes is that the consumer’s state of mind causes them to respond to questions in a certain way. Factor analysis seeks to reverse this causality by looking for patterns in the responses that are indicative of the consumer’s state of mind. Using a computing analogy, factor analysis is a technique to reverse-engineer the source code by analysing the input and output.

This article introduces the concept of consumer involvement and how it can be predictive of other important marketing metrics such as service quality. An example using data from tap water consumers illustrates the theory. The data collected from these consumers is analysed using factor analysis in R, using the psych package.

What is Consumer Involvement?

Involvement is a marketing metric that describes the relevance of a product or service in somebody’s life. Judy Zaichkowsky defines consumer involvement formally as “a person’s perceived relevance of the object based on inherent needs, values, and interests”. People who own a car will most likely be highly involved with purchasing and driving the vehicle due to the money involved and the social role it plays in developing their public self. Consumers will most likely have a much lower level of involvement with the instant coffee they drink than with the clothes they wear.

From a managerial point of view, involvement is crucial because it is causally related to willingness to pay and perceptions of quality.  Consumers with a higher level of involvement are willing to pay more for a service and have a more favourable perception of quality. Understanding involvement in the context of urban water supply is also important because sustainably managing water as a common pool resource requires the active involvement of all users.

The level of consumer involvement depends on a complex array of factors, which are related to psychology, situational factors and the marketing mix of the service provider. The lowest level of involvement is considered a state of inertia which occurs when people habitually purchase a product without comparing alternatives.

Cult products have the highest possible level of involvement as customers are fully devoted to a particular product or brand. Commercial organisations use this knowledge to their advantage by maximising the level of consumer involvement through branding and advertising. This strategy is used effectively by the bottled water industry. Manufacturers focus on enhancing the emotional aspects of their product rather than on improving the cognitive elements. Water utilities tend to use a reversed strategy and emphasise the cognitive aspects of tap water, the pipes, plants and pumps, rather than trying to create an emotional relationship with their consumers.

Measuring Consumer Involvement

Asking consumers directly about their level of involvement would not lead to a stable answer because each respondent will interpret the question differently. The best way to measure psychological states or psychometrics is to ask a series of questions that are linguistically related to the topic of interest.

The most cited method to measure consumer involvement is the Personal Involvement Index, developed by Judy Zaichkowsky. This index is a two-dimensional scale consisting of:

  • cognitive involvement (importance, relevance, meaning, value and need)
  • affective involvement (involvement, fascination, appeal, excitement and interest).

The survey instrument consists of ten semantic-differential items. A semantic differential is a type of rating scale designed to measure the meaning of objects, events or concepts. The concept that is being measured, such as involvement, is translated into a list of several synonyms and their associated antonyms.

In the involvement survey, participants are asked to position their views between two extremes such as Worthless and Valuable or Boring and Interesting. The level of involvement is defined as the sum of all answers, which is a number between 10 and 70.

Measuring consumer involvement using the Personal Involvement Inventory (Zaichkowsky, 1994).

Exploratory Analysis

For my dissertation about customer service in water utilities, I measured the level of involvement that consumers have with tap water. 832 tap water consumers completed this survey in Australia and the United States.

This data set contains other information, and the code selects only those variable names starting with “p” (for Personal Involvement Inventory). Before any data is analysed, customers who provided the same answer to all items, or did not respond to all questions, are removed as these are most likely invalid responses, which leaves 757 rows of data.
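A sketch of this cleaning step, assuming the raw survey sits in a data frame called customers (the object name is an assumption):

```r
library(tidyverse)

## Select the ten involvement items (variable names starting with "p")
pii <- select(customers, starts_with("p"))

## Remove incomplete responses and straight-liners (the same answer to every item)
pii <- pii[complete.cases(pii), ]
pii <- pii[apply(pii, 1, function(x) length(unique(x)) > 1), ]
```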

A boxplot is a convenient way to view the responses to multiple survey items in one visualisation. This plot immediately shows an interesting pattern in the answers. It seems that responses to the first five items were generally higher than those for the last five items. This result seems to indicate a demarcation between cognitive and affective involvement.

Responses to Personal Involvement Index by tap water consumers.
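Under the assumptions of the previous sketch, the boxplot can be produced by pivoting the items into long format:

```r
pii %>%
  pivot_longer(everything(), names_to = "Item", values_to = "Response") %>%
  ggplot(aes(Item, Response)) +
  geom_boxplot()
```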

The next step in the exploratory analysis is to investigate how these items correlate with each other. The correlation plot below shows that all items correlate strongly. In correspondence with the boxplots above, the first five and the last five items correlate more strongly among themselves. This plot suggests that the two dimensions of the involvement index correlate with each other.

Correlation matrix for the Personal Involvement Index.
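A sketch of the correlation step; the corrplot package is an assumption, as the original may use a different plotting function:

```r
library(corrplot)

## Correlation matrix of the ten involvement items
corrplot(cor(pii))
```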

Factor Analysis in R

Factor Analysis is often confused with Principal Component Analysis because the outcomes are very similar when applied to the same data set. Both methods are similar but have a different purpose. Principal Component Analysis is a data-reduction technique that serves to reduce the number of variables in a problem. The specific purpose of Factor Analysis is to uncover latent variables. The mathematical principles for both techniques are similar, but not the same and should not be confused.

One of the most important decisions in factor analysis is how to rotate the factors. There are two types of rotation: orthogonal and oblique. In simple terms, orthogonal rotations keep the dimensions uncorrelated, while oblique rotations allow the dimensions to correlate with each other. Given the strong correlations in the correlation plot and the fact that both dimensions measure involvement, this analysis uses an oblique rotation. The visualisation below shows how each of the items and the two dimensions relate to each other.

Factor analysis in R with the psych package.
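A minimal sketch of the factor analysis, assuming the cleaned pii data frame from above:

```r
library(psych)

## Two-factor solution with an oblique (oblimin) rotation
pii_fa <- fa(pii, nfactors = 2, rotate = "oblimin")
fa.diagram(pii_fa)
```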

This simple exploratory analysis shows the basic principle of how to analyse psychometric data. The psych package has a lot more specialised tools to dig deeper into the information. This article has not assessed the validity of this construct, or evaluated the reliability of the factors. Perhaps that is for a future article.

The R Code

You can view the code below. Go to my GitHub repository to see the code and the data source.

5½ Reasons to Ditch Spreadsheets for Data Science: Code is Poetry

When I studied civil engineering some decades ago, we solved all our computing problems by writing code. Writing in BASIC or PASCAL, I could quickly perform fundamental engineering analysis, such as reinforced concrete beams, with my home-brew software library. Soon after I started my career, spreadsheets became widely available, and I fully embraced this fantastic business tool, first Lotus 123 and later grudgingly moved to MS Excel.

Screendump of Atari BASIC program to estimate concrete reinforcement surface area.

Spreadsheets were excellent in those early days because data, code, visualisations and tabular output are all stored in one convenient file. Creating graphs with computer code was a bit of a nightmare in those days, so spreadsheets were a minor miracle. Over the next twenty years, I must have created thousands of spreadsheets of varying complexity. I even developed a ‘jungle’ of interlinked spreadsheets to manage progress reporting.

In the pioneering days of spreadsheets, they provided enormous convenience for engineers and other professionals to quickly develop analytical tools. But after using this tool for a few years, cracks started to appear.

Spreadsheets are Chaos

Throughout my career, I had many nightmarish experiences trying to reverse engineer spreadsheets, even the ones I wrote myself. The combination of data, code and output that I loved at the start of my career was reaching its limits. Spreadsheets use incomprehensible names for variables (AZ346, XC89 and so on), and the formulas are impossible to read because all code is crammed on one line with deeply nested logic. The multiple parentheses make Excel formulas even harder to read than LISP expressions.

Furthermore, spreadsheets hide the formulas behind the results, which renders spreadsheets notoriously hard to understand. My love affair with the spreadsheet came to an end when I started writing my dissertation about customer service for water utilities. Excel was incapable of helping me with the complex machine learning I needed to draw my conclusions. A colleague suggested I look into this new thing called ‘Data Science’ and this advice changed my career.

My focus is to implement strategic data science to help organisations to create value from data. One of the ways to achieve this goal is to ditch the spreadsheet and start writing code instead.

Code is Poetry

I decided to learn how to write code in the R language for statistical analysis. The R language is like a Swiss army chainsaw for engineers, with capabilities that far exceed anything a spreadsheet can do. Writing in code, such as R or Python, is like writing an instruction manual on how to analyse data. Anyone who understands the language will be able to see how you derived your conclusions. Modern data science languages can generate print-quality visualisations and can output results in many formats, including a spreadsheet. In my job as a data science manager for a water utility, I use R code for everything and enjoy the awesome power of being able to easily combine large data sets, visualise data and undertake complex analysis. Now that I have rediscovered the poetry of writing computer code, I advocate learning R, or perhaps Python, and ditching the spreadsheet. On my data science blog, I share examples of creating value and having fun with the R language. The only purpose I still have for spreadsheets is as an interface for small data sets.

Geographic Bubble Chart: Visualising Water Consumption in Vietnam.

5½ Reasons to Ditch the Spreadsheet

If you are still using spreadsheets, or you are trying to convince a colleague to ditch this tool, here are 5½ reasons to start using code to analyse data:

  1. Good analysis is reproducible and can be peer-reviewed. Spreadsheets are hard to understand because of non-sequential references. Computer code is like an instruction book that can be read step-by-step.
  2. Spreadsheet variables are hard to understand (e.g. ZX81:ZX99). In computer code, you give them meaningful names (e.g. sales[81:99]).
  3. Best practice in data management is to separate data, code and output. In spreadsheets, it is not immediately clear which cell is the result of another cell and which ones are raw data. Computer code separates the data from the code and the output.
  4. You can only share spreadsheet output with people who have access to the relevant software package. Computer code can produce output in multiple formats, such as HTML, PDF or even Excel, including interactive dashboards you can publish on the web.
  5. Functionality in spreadsheets is limited to what is made available by Microsoft. The R and Python languages are extendable and have extensive libraries to solve complex problems.

The bonus reason to ditch the spreadsheet is that the best data science software, such as R and Python, is Open Source and freely available on the web. No license fees, and it comes with terrific community support. Feel free to leave a comment if you would like to defend the spreadsheet or if you have additional reasons to ditch this venerable but largely obsolete tool. Subscribe to this monthly blog if you are interested in using the R language for practical data science and some fun.

Discourse Network Analysis: Undertaking Literature Reviews in R

Literature reviews are the cornerstone of science. Keeping abreast of developments within any given field of enquiry has become increasingly difficult given the enormous amounts of new research. Databases and search technology have made finding relevant literature easy, but keeping a coherent overview of the discourse within a field of enquiry is an ever more encompassing task.

Scholars have proposed many approaches to analysing literature, which can be placed along a continuum from traditional narrative methods to systematic analytic syntheses of text using machine learning. Traditional reviews are biased because they rely entirely on the interpretation of the researcher. Analytical approaches follow a process that is more like scientific experimentation. These systematic methods are reproducible in the way literature is searched and collated but still rely on subjective interpretation.

Machine learning provides new methods to analyse large swaths of text. Although these methods sound exciting, they are incapable of providing insight by themselves. Machine learning cannot interpret a text; it can only summarise and structure a corpus. Machine learning still requires human interpretation to make sense of the information.

This article introduces a mixed-method technique for reviewing literature, combining qualitative and quantitative methods. I used this method to analyse literature published by the International Water Association as part of my dissertation into water utility marketing. You can read the code below, or download it from GitHub. Detailed information about the methodology is available through FigShare.

A literature review with RQDA

The purpose of this review was to ascertain the relevance of marketing theory to the discourse of literature in water management. This analysis uses a sample of 244 journal abstracts, each of which was coded with the RQDA library. This library provides functionality for qualitative data analysis. RQDA provides a graphical user interface to mark sections of text and assign them to a code, as shown below.

Marking topics in an abstract with RQDA.

You can load a corpus of text into RQDA and mark each of the texts with a series of codes. The texts and the codes are stored in an SQLite database, which can be easily queried for further analysis.

I used a marketing dictionary to assess the abstracts from journals published by the International Water Association from the perspective of marketing. This phase resulted in a database with 244 abstracts and their associated coding.

Discourse Network Analysis

Once all abstracts are coded, we can start analysing the internal structure of the IWA literature. First, let’s have a look at the occurrence of the topics identified for the corpus of abstracts.

The first lines in this snippet call the tidyverse and RQDA libraries and open the abstracts database. The getCodingTable function provides a data frame with each of the marked topics and their location.  This function allows us to visualise the occurrence of the topics in the literature.
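A sketch of this step; the project file name is an assumption:

```r
library(tidyverse)
library(RQDA)

openProject("iwa_abstracts.rqda")

## One row per coded text segment; count how often each topic occurs
getCodingTable() %>%
  count(codename) %>%
  ggplot(aes(reorder(codename, n), n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Topic", y = "Frequency")
```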

Frequencies of topics in IWA literature.

This bar chart tells us that the literature is preoccupied with asset management and the quality of the product (water) or the service (customer perception). This insight is interesting but not very enlightening. We can use discourse network analysis to find a deeper structure in the literature.

Discourse Network Analysis

We can view each abstract with two or more topics as a network where each topic is connected. The example below shows four abstracts with two or more codes and their internal networks.

Examples of complete networks for four abstracts.

The union of these four networks forms a more extensive network that allows us to analyse the structure of the corpus of literature, shown below.

Union of networks and community detection.

We can create a network of topics with the igraph package. The first step is to create a Document-Term-Matrix. This matrix counts how often a topic occurs within each abstract. From this matrix, we can create a graph by transforming it into an Adjacency Matrix. This matrix describes the graph which can be visualised and analysed. For more detailed information about this method, refer to my dissertation.
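A possible implementation of these steps, assuming the coding table from the RQDA project opened above, with its codename and filename columns:

```r
library(igraph)

codes <- getCodingTable()[, c("filename", "codename")]

## Document-Term Matrix: how often each topic occurs in each abstract
dtm <- as.matrix(table(codes$filename, codes$codename))

## Adjacency matrix: how often two topics appear in the same abstract
adjacency <- t(dtm) %*% dtm
diag(adjacency) <- 0

topics <- graph_from_adjacency_matrix(adjacency, mode = "undirected", weighted = TRUE)
plot(topics, layout = layout_with_fr(topics), vertex.label.cex = 0.7)
```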

The network of topics in IWA literature.

In this graph, each node is a topic in the literature, and each edge implies that a topic is used in the same abstract. This graph uses the Fruchterman-Reingold algorithm to position each of the nodes, with the most connected topic in the centre.

The last step is to identify the structure of this graph using community detection. A community is a group of nodes that are more connected with each other than with nodes outside the community.
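A sketch of this step on the topics graph built above; the choice of the Louvain algorithm is an assumption, as the original analysis may use a different community detection method:

```r
topic_communities <- cluster_louvain(topics)
membership(topic_communities)    # community assignment for each topic
plot(topic_communities, topics)  # plot the graph coloured by community
```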

Community detection in IWA literature.

We have now succeeded in converting a corpus of 244 journal abstracts into a parsimonious overview of four communities of topics. This analysis resulted in greater insight into how marketing theory applies to water management, which was used to structure a book about water utility marketing.

Celebrate Halloween with Creepy Computer Games in R

Halloween is upon us once more, and whoever said that data science can't be scary? This article translates the Gravedigger game from the 1983 Creepy Computer Games book to celebrate Halloween. It is part of my series on gaming with the R language.

In the 1980s, I spent my time writing code on my 8-bit ZX81 and Atari computers. I learnt everything I know about programming from copying and modifying printed code listings from books with computer games. The games in these books are mostly simple text-based games, but the authors gave them enticing names, often imaginatively illustrated to visualise the virtual world they represent. A line and a dot become a game of tennis, and a computer that was able to play Tic Tac Toe made it seem as if your machine had come alive.

Creepy Computer Games in R

The old books by Usborne Publishing are unique because they contain several versions of each program to ensure compatibility with some of the many dialects of the BASIC language. I first entered this code into the atari800 emulator to test what it does, after which I converted it to the R language.

Let’s step into the creepy world of computer games as imagined by Usborne Publishing.

Reynold, Colin and McCaig, Rob, Creepy Computer Games (Usborne, London).

Gravedigger

Gravedigger by Alan Ramsey is a typical example of the games listed in the books of the early days of home computing. You can download the original book for free from the publisher's Google Drive. The Gravedigger listing starts on page 10. The lyrical description of the game provides the instructions:

It’s dark and windy—not the kind of night to be lost in a graveyard, but that’s where you are. You have until midnight to find your way out. Skeletons lurk in the shadows waiting to scare you to death should you come too close. You can dig holes to help keep them away, but digging is tiring work and you cannot manage more than five in one game. You have to be careful not to fall down the holes you have dug. Grave stones (marked +) and the walls of the graveyard (marked :) block your way. The holes you dig are marked O, you are * and the skeletons are X. See if you can escape.

Partial page of the Gravedigger game in BASIC.

I translated the BASIC code as close to the original as possible. This game is not pretty code, but it works. Some of the variable names have been changed because, in BASIC, single variables and vectors can have the same name and names of character vectors end in $. A future version of this game could use graphics as I did in the Tic Tac Toe game.

The game is quite tricky, and I have only managed to escape the graveyard once. It looks like the likelihood of success very much depends on the random distribution of the graves. Perhaps we need some machine learning to optimise strategy.

You can view the code below, or download it from GitHub. I leave it up to you to deconstruct the program and safely work your way through the graveyard.

Happy Halloween!

Gravedigger screenshot (Emacs).

Geocoding with ggmap and the Google API

Some of the most popular articles on the Devil is in the Data show how to visualise spatial data creatively. In the old days, obtaining latitude and longitude required a physical survey; with Google Maps, this has become a lot easier.

The geocode function from the ggmap package extracts longitude and latitude from Google maps, based on a location query. The example below shows the result of a simple geocoding call for the White House and Uluru. The geocode function essentially constructs a URL to obtain the data.
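A minimal example of such a call; it assumes a registered Google API key, as explained in the next section:

```r
library(ggmap)

## Requires register_google() with a valid API key (see below)
geocode(c("White House", "Uluru"))
```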

In the middle of 2018, Google tightened access to the database, which means you need to register an API key for it to work. This article explains how to use the latest version of ggmap and a Google account to continue using this function.

The Google API

Before we can start geocoding, we need to obtain an API key from Google. Go to the registration page, and follow the instructions (select all mapping options). The geocoding API has a free monthly allowance, but you nevertheless need to associate a credit card with the account.

Please note that the Google Maps API is not a free service. There is a free allowance of 40,000 calls to the geocoding API per month, and beyond that calls are $0.005 each.

Geocoding with ggmap

You will need to ensure that you have the latest version of ggmap installed on your system. The current version on CRAN is 3.0.

The code snippet below shows a minimum working example of how you can map coordinates using ggplot. The register_google function stores the API key. I have stored the key itself in a private text file. Calling getOption("ggmap") summarises the Google credentials to check how you are connected.

The geocode function converts the request into a URL and captures the output into a data frame. The plot shows the places I have lived, projected orthogonally on the globe.
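A sketch of this workflow; the key file name and the list of places are assumptions, so replace them with your own:

```r
library(tidyverse)
library(ggmap)

## Store the key in a private file so it never appears in the code
register_google(key = readLines("google_api.txt"))
getOption("ggmap")  # check the stored credentials

places <- c("Rotterdam, Netherlands", "Melbourne, Australia")  # hypothetical examples
coords <- geocode(places)

world <- map_data("world")
ggplot() +
  geom_polygon(data = world, aes(long, lat, group = group), fill = "grey85") +
  geom_point(data = coords, aes(lon, lat), colour = "red", size = 3) +
  coord_map("orthographic", orientation = c(30, 80, 0)) +
  theme_void()
```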

Geocoding with ggmap and the Google API

Map Porn

The articles on this blog that rely on the geocode function are categorised as Map Porn because they mostly discuss having fun with maps in R. All code of these articles has been amended to function with the new method.

Flat Earth Mathematics with examples in the R Language

In September, I am embarking on a trip around the world to speak at the 2018 World Water Congress in Tokyo, visit family and friends in the Netherlands and enjoy some tourism in San Francisco. One of the workshops I speak at is about fake news in the water industry. There are many fake stories about tap water in circulation on the web, mainly related to the use of chlorine and fluoride. The internet provides humanity with almost unlimited knowledge, but instead, many people use it to spread conspiracy theories. One of the craziest fake news trends is the flat earth conspiracy. Some of you might ask whether my trip is actually around the world, or whether I am travelling across a flat disk.

This article discusses flat earth mathematics, and how to convert and visualise map projections in R. This article uses the code I published earlier in articles about creating flight maps and Pacific island hopping. You can view the code below or download the full version from GitHub.

The Flat Earth

YouTube contains thousands of videos from people that claim the earth is flat. These people pontificate that science as we know it is a “globalist conspiracy”. While their claims are hilarious, people that believe our planet is flat are often passionate and even conduct simple experiments and engage in naive flat earth mathematics.

Adherents to the idea that the world is flat often propose Gleason’s Map as their version of the correct representation of our planet. This map is not a map of the flat earth but a polar Azimuthal equidistant projection of the globe. Even though the map itself states that it is a projection of the world, flat earth believers nevertheless see it as a literal representation of the world they live on.

Gleason's Map is often touted as a map of the flat earth.

Their belief is based on an appeal to common sense as they can see that the world is flat with their own eyes. The second ingredient is a deep distrust in science, often inspired by religious motives.

This article shows two different ways to look at the earth and show that the spherical model better fits the reality of my trip around the world.

Projecting the spherical earth on a flat surface is a complex task which will always require compromise as it is impossible to truthfully draw the surface of the globe on a piece of paper. The video below provides a practical introduction to map projections that show how maps are stretched to display them on a flat surface.

We can recreate Gleason’s map with ggplot, which incorporates the mapproj package to show maps in various projections. The azimuthal equidistant projection is popularised by the flag of the United Nations. In this projection, Antarctica is displayed as a ring around the world. Flat earth evangelists believe that the South Pole is a ring of ice that prevents us from proceeding beyond the disc. The edge of this disc cannot be shown because this method projects the South Pole at an infinite distance from the centre of the map.

The coord_map function allows you to change projections. The orientation argument defines the centre point of the projection. The most common location is the North Pole (90, 0). The third value gives the clockwise rotation in degrees. Gleason’s map uses a rotation of 270 degrees.
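A sketch of this projection with ggplot and the mapproj package:

```r
library(tidyverse)
library(mapproj)  # provides the projections used by coord_map

world <- map_data("world")
ggplot(world, aes(long, lat, group = group)) +
  geom_polygon(fill = "lightgreen", colour = "grey50") +
  coord_map("azequidistant", orientation = c(90, 0, 270)) +
  theme_void()
```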

Polar azimuthal equidistant projection with ggmap.

My Round-the-World Itinerary

My trip will take me from Australia to Japan, from where I go to the Netherlands. The last two legs will bring me back to Australia via San Francisco.

The itinerary is stored in a data frame, and the ggmap package geocodes the longitude and latitude of each of the locations on my trip. You will need a Google API key to enable the geocoding function.

As the earth is a sphere, an intermediate point needs to be added for trips that pass the dateline, as I explained in my article about flight maps.

I visualised the itinerary using the same method as above but centring on Antarctica. The geosphere package helps to estimate the total travel distance, which is approximately 38,995 km, slightly less than a trip around the equator. This distance is the great circle distance, which is the shortest distance between two points on a sphere, adjusted for a spheroid (a flattened sphere).
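A sketch of the distance calculation with a simplified itinerary; the stops, and Melbourne as the Australian start and end point, are assumptions, and distGeo returns metres:

```r
library(ggmap)
library(geosphere)

itinerary <- geocode(c("Melbourne, Australia", "Tokyo, Japan", "Amsterdam, Netherlands",
                       "San Francisco, USA", "Melbourne, Australia"))
coords <- as.matrix(itinerary[, c("lon", "lat")])

## Sum the great circle distances of the consecutive legs and convert to kilometres
legs <- distGeo(coords[-nrow(coords), ], coords[-1, ])
sum(legs) / 1000
```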

The flight paths on this map are curved because of the inevitable distortions when projecting a sphere on a flat surface.

Round the world trip in polar azimuthal equidistant projection.

Flat Earth Mathematics

If the Gleason map were an actual map of the flat earth, then the flight paths on the map would show as straight lines.

The mapproj package contains the mapproject function that calculates the projected coordinates based on longitude and latitude. The output of this function is a grid with limits from −π to π. The first part of the code converts the longitude and latitude from the world data frame to projected coordinates.

A line from lon/lat (0, 0) to the North Pole has a projected distance of π/2, which in the spherical world is π/2 × 6378.137 = 10,018.75 km. We need to multiply the Euclidean distances by the radius of the Earth to derive the Gleason map coordinates.

This last code snippet converts the world map to flat earth coordinates. It calculates the Euclidean distance between the points on the itinerary and multiplies this by the Earth's radius.

This comparison shows why the Gleason map is not a map of a flat earth. On this map, the shortest distance between Sydney and Santiago de Chile is about 25,000 km, more than twice the real value. The real travel time is about 14 hours, which would imply that passenger jets break the sound barrier (25,000 km in 14 hours is roughly 1,800 km/h). This problem exists for journeys along the lines of latitude in the Southern Hemisphere. The distortion in this projection increases with the distance from the centre of the map.
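A sketch of this comparison for the Sydney to Santiago de Chile example, using mapproject directly; the coordinates are approximate:

```r
library(mapproj)

cities <- data.frame(city = c("Sydney", "Santiago de Chile"),
                     lon = c(151.21, -70.67),
                     lat = c(-33.87, -33.45))

## Project the cities with the same method as Gleason's map
proj <- mapproject(cities$lon, cities$lat, projection = "azequidistant",
                   orientation = c(90, 0, 270))

## Euclidean distance on the projected plane, scaled by the Earth's radius
sqrt(diff(proj$x)^2 + diff(proj$y)^2) * 6378.137  # roughly 25,000 km
```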

This map looks like the first one, but the coordinate system is now Euclidean instead of longitudes and latitudes, as indicated by the square grid. On a projected map, the shortest distance is a curved line, parallel to Antarctica, which is how ships and aeroplanes move between these cities.

Sydney to Santiago de Chile on a flat earth map.

This article proves that the Gleason map is not a representation of a flat earth. Aeroplanes would have to break the sound barrier to fly these distances in the time it takes to travel. Whichever way you project the globe on a flat map will lead to inevitable distortions. The Gleason map itself mentions that it is a projection.

However, these facts will not dissuade people from believing in a flat earth. I am after all an engineer and thus part of the globalist science conspiracy.

The Flat Earth Mathematics Code

You can view the complete code on my GitHub repository.

Marketing for Engineers: An Introduction

This presentation introduces marketing for engineers by providing some insight into what marketing is and how this applies to engineering as a profession. This guest lecture formed part of the Engineering Enterprise subject at La Trobe University by Eddie Custovic.

Definition of Marketing

As an engineer with a philosophy background, I used to believe that marketing is about selling people things they don't need. These are the words I told my MBA lecturer in my marketing subject many years ago. The late emeritus professor Rhett Walker managed to convince me otherwise, which eventually resulted in a PhD on the topic.

When you think about marketing, you might think about advertising and brands. But there is more to marketing than selling goods and services. The word cloud below visualises 72 definitions of marketing collected by blogger Heidi Cohen. I analysed these definitions using data science to summarise them in one image. This image is illuminating, but it does not help us to define marketing for engineers.

72 Definitions of marketing summarised.

The American Marketing Association provides a comprehensive definition of the subject that shows how it also relates to engineering:

Marketing is the activity, set of institutions, and processes for creating, communicating, delivering, and exchanging offerings that have value for customers, clients, partners, and society at large.

The significant words in this definition are “creating … value for … society at large”, which is precisely what engineers aim to do. Engineering is never done merely for the sake of engineering, but always to improve, in some shape or form, the lives of humanity. Marketing can help engineers to create products that better align with the needs and wants of the end users of their work.

Marketing For Engineers

The defining difference between marketing and engineering is that engineering uses the physical sciences to achieve its objectives while marketing implements the social sciences.

The physical and the social sciences are very different because the first is objective while the social sciences are subjective. This difference in methodology is not a value judgement but a fact. Engineers can find solutions by entering inputs into a model; in marketing, reality cannot be predicted so easily.

What marketing and engineering have in common is that they both seek to create societal value. In the words of Philip Kotler and Sidney Levy (Broadening the Concept of Marketing, Journal of Marketing, 33, January 1969, 10–15):

Marketing is customer satisfaction engineering.

Engineering Enterprise Guest Lecture

This 35-minute guest lecture discusses the broad definition of marketing proposed by the AMA and applies this to engineering.