There are lots of freely available resources for learning R online, and for any one task there are many different packages that can help. It can be difficult to identify which packages you should use, ideally you want packages that are being actively maintained and widely used.

The aim of this guide is not to be a training course but to be a list of recommended packages and useful resources, focused around the types of activities our analysts typically do. Often we include the basic commands to get you up and running with a package as soon as possible. It also shows the wide range of tasks that can easily be tackled with R; in fact this guide was written in R!

If you are after training that systematically goes through the R language and how it can applied to analytics we recommend:

Either reading the book R for Data Science
Or obtaining a licence for DataCamp and following the Data Scientist with R career track

There is also a good style guide which we would recommend as the best approach to formatting your R code for clarity and consistency.

Working with structured data

This section focuses on working with structured, tabular data. This could be from a csv file, a file format associated with a piece of proprietary software, or data held in a database.

Importing data from files

The are a number of different ways to import common data types. We recommend using: the readr package to import csv files; readxl to import excel files; and haven to import SAS, Stata, and SPSS files.

Examples:

Read a csv file

library(readr)
df <- read_csv("file.csv")

Read an excel file

library(readxl)
df <- read_excel("file.xlsx")

Read a SAS file

library(haven)
df <- read_sas("file.sas7bat")

Read a Stata file (up to v14)

library(haven)
df <- read_dta("file.dta")

Read an SPSS file

library(haven)
df_1 <- read_por("file.por")
df_2 <- read_sav("file.sav")

Resources:

RStudio cheat sheet on importing data (download here)
Data import chapter of R for Data Science

Connecting to a database

R can connect directly to databases. At time of writing this only works for our ADW database, but you should consult the CoDE/KAI IT Team guidance for the latest recommendations for connecting to databases.

To connect to ADW you could run:

library(DBI)
library(odbc)

connection <- "Driver={SQL Server};server=server_name,Database=db_name;trusted_connection=true;"

con <- dbConnect(
  odbc(),
  .connection_string = connection
)

where server_name should be replaced with the server name, and db_name by the database name.

Similarly, to connect to an existing Microsoft Access file do the following:

library(odbc)
library(DBI)

cs = "Driver=Microsoft Access Driver (*.mdb, *.accdb);DBQ=C:/path/to/access/file.mdb"
con = dbConnect(odbc::odbc(), .connection_string = cs)

# List tables within the database
dbListTables(con)

However, our version of Microsoft Access is 32-bit and by default RStudio uses 64-bit R. To run the code above you need to be using 32-bit R.

Transforming data

Once you have accessed your structured data you will either have read it directly into an R dataframe or established a connection to a database. Next you will want to explore the data and transform it.

Exploring a dataframe:

head(df) displays the first 6 rows of dataframe df
str(df) displays the structure of dataframe df
summary(df) displays summaries about the variables in dataframe df
names(df) displayes the column names in dataframe df
glimpse(df) is an improved version of the summary() function provided in the dplyr package

When using a connection to a database in R you will have used a package (such as obdc) which also imports the DBI package behind the scenes. This package allows you to perform many operations on databases, including running SQL commands. For example:

# List tables from database connection con
dbListTables(con)

# Run an SQL query and save the results to a variable
# But we would recommend using dplyr instead
dfGetQuery(con, "SELECT ... FROM ...")

These SQL commands are run on the server rather than on your local computer, but do return the results to your R session. Additionally, the sqldf command allows you to run SQL style queries against an R dataframe. Using SQL allows you to reuse old code developed outside of R. However, we would recommend a different approach to working with both database connections and dataframes: use the dplyr package instead. This provides a powerful, modern syntax for working with dataframes and database connections within R which is very similiar to SQL.

For example, if you want to select column1 from dataframe df where column2 contains the number 2 you would run

library(dplyr)

# Select column1 from df and where column2 equals 2
df_out <-
  df %>% 
  select(column1)
  filter(column2 == 2)

And dplyr also contains commands for joining data, grouping and summarising data, sorting data and many others. Details on these commands can be found in the Introduction to R in 3 hours course and in the data transformation chapter of R for Data Science. The Introduction to R in 3 hours course summarises these functions as:

filter() pick rows by values
select() pick variables by names
arrange() sort/reorder rows
mutate() create new variables from existing ones
summarise() collapse many values down to a summary
group_by() group up data and perform operations at group level
ungroup() remove the grouping of the variables

These commands can also be used on a table from a database connection using the package dbplyr. This extends dplyr and converts your commands into SQL to be run on the database. For example, if you have a database connection con and want to work with the table example_table within schema dbo as though it were a dataframe, you could run the following code:

library(dbplyr)

df <- tbl(con, in_schema("dbo", "example_table"))

And then you can use dplyr commands as before.

Resources:

RStudio cheat sheet on transforming data (download here)
Data transformation chapter of R for Data Science
Relational data chapter of R for Data Science
Tidy data chapter of R for Data Science

Working with json and XML

XML and json are semi-structured, hierarchical data structures. To import XML data into R use the xml2 package, and for json use jsonlite. XML and json are often the outputs of API calls and R can access HTTP-based APIs using the httr package.

Resources:

Documentation for xml2
Documentation for jsonlite
Getting started with httr is an introduction to using httr
data.parliament.uk lists information about APIs that provide access to data on the work of Parliament
Companies House has a series of APIs

Working with text

Text data can be explored in a number of different ways: identifying the most common words or phrases; sentiment analysis; using machine learning to split the documents into different topics (a form of machine learning); or, more generally, using text as features in a predictive model.

There are a number of packages in R for working with text. tm is the most established and will be the focus of the discussion here but many new packages have been released, with text2vec in particular looking very promising due to its simple interface and its inclusion of advanced techniques.

Before text can be used it needs to be preprocessed, which usually means cleaning the text and converting words into their root forms by either stemming or lemmatisation.

Resources:

tm package documentation
An example of text mining
A quick introduction to text mining in R
Text Mining with R focuses on using the tidytext package to analyse text. This is particularly good for sentiment analysis
More information about the text2vec package can be found here.

Preprocessing

A package you can use for stemming and cleaning data is: tm - Package for Text Mining.

Reading data into tm

Data can be made into a corpus (the main data structure tm uses) using either the VectorSource() function that reads a list of strings or DirSource which reads documents from a directory. For example, if you have a csv file called example.csv with a column of text called text you could run:

library(readr)
library(tm)

df <- read_csv('example.csv')
tm_corpus <- VCorpus(VectorSource(df$text))

Cleaning text

The following functions can be used in this package to clean your data:

tm_map(corpus, content_transformer(to_lower)) will remove all capital letters and replace them with lowercase.
tm_map(corpus, removePunctuation) will remove all punctuation.
tm_map(corpus, removeNumbers) will remove all numbers.
tm_map(corpus, removeWords, all_stop) takes a list of words that you wish to exclude from the data and removes them. R already has built-in lists of common words such as ‘i’, ‘me’, ‘you’, and ‘he’. In this case we have used a personalised list called ‘all_stop’.
tm_map(corpus, stripWhitespace) will remove any spacing that has occurred when removing numbers, punctuation, etc.

Below is an example of a function taking a corpus of documents, cleaning and stemming it and assigning it as cleaned_corpus.

library(tm)
library(dplyr)

stopwds <- stopwords('en')
all_stop <- c("concern", "concerned", "concerns", "may", "also",
              "will", "see", "around","yet","though", stopwds)

clean_corpus <- function(corpus){
  corpus %>%
  tm_map(content_transformer(tolower)) %>%    
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, all_stop) %>%
  tm_map(stripWhitespace) %>%
  tm_map(PlainTextDocument) -> clean_corpus
  return(clean_corpus)
}

This can then be applied on a corpus.

cleaned_corpus <- clean_corpus(tm_corpus)

You can also write your own functions to be applied to the data using the tm_map() function, for example you might apply regular expressions created with the stringr package. Often these functions will have to be wrapped within the content_transformer() function (see the tolower() example above).

Stemming and lemmatisation

Stemming is the process of reducing words to their stem, base or root form. This is done to group together similar words so that their frequencies reflect the true use of the word, another more sophisticated version of this is lemmatisation.

Stemming just removes the end of the word to approximate the root and can be done in tm using the following command:

tm_corpus %>% tm_map(stemDocument)

Lemmatisation finds the root word itself but is much slower. For example,

library(textstem)
corpus %>% tm_map(lemmatize_strings)

Spelling

Misspelt text can lead to underestimates of the prevalence of terms. The hunspell package can check spelling and suggest correct alternatives.

library(hunspell)

bad <- hunspell("spell checkers are not neccessairy for langauge ninja's")
print(bad[[1]])

## [1] "neccessairy" "langauge"

hunspell_suggest(bad[[1]])

## [[1]]
## [1] "necessary"   "necessarily"
## 
## [[2]]
## [1] "language" "Augean"   "Angela"

Visualisations

Common visualisations used to explore text data include word frequency charts and word clouds. These are demonstrated below, where the crude dataset (a corpus of documents about crude oil provided by the tm package) has been used .

Word frequency charts

Once your data has been cleaned you can start creating visualisations. One of these could be a word frequency chart. To do this we will use the ggplot2 package.
After cleaning your data you’ll have to create a document-term matrix using tm. This is a matrix whose elements indicate the number of times a given word (or term) has appeared within a given document. The columns denote words and the rows denote documents. The code below generate a document-term matrix on the crude dataset.

library(tm)
library(dplyr)
library(tibble)

data(crude)
crude <- clean_corpus(crude)

dtm <- DocumentTermMatrix(crude)

You will then need to select the word frequencies you want. The code below will generate word counts.

word_counts <- 
  data.frame(freq = dtm %>% as.matrix() %>% colSums()) %>% 
  rownames_to_column("word")

You are now ready to create your word frequency chart. The code below creates a word-frequency chart displaying the ten most frequently occuring words.

library(ggplot2)

wf <- word_counts %>% top_n(10, freq) 

ggplot(wf, aes(x = reorder(word, -freq), y = freq, fill="")) +
  geom_bar(stat = "identity", colour="black") + 
  scale_fill_manual(values=c("#3399FF")) +
  theme(axis.text.x=element_text(angle=90, hjust=1),
        axis.title.x = element_blank(),
        legend.position = "none")

Word clouds

Another way you can visualise your data is by using a word cloud. Word clouds do have their limitations but are good for picking out words at a quick glance. A package you can use to create these is ggwordcloud. This package provides a couple of shortcut functions to quickly produce wordclouds (ggwordcloud() and ggwordcloud2()) and also extends extends ggplot2 by providing a number of wordcloud geoms. Using these geoms requires a bit more code but gives you full control over your wordcloud; for example, it provides control over whether the size of the text scales with the word frequency or the square of the frequency and over the shape of the overall wordcloud.

Below we create a word cloud of all words appearing more than two times in the corpus. Note that in this example the word frequencies need to be sorted so that the most commonly occuring words appear at the start of the dataframe.

library(ggwordcloud)

words_filtered <- word_counts %>% filter(freq > 2) %>% arrange(desc(freq))
ggwordcloud2(words_filtered, shuffle = F, size = 2.5, ellipticity = 0.9)

More information about plotting with ggplot2 can be found in the Creating charts section and more information about ggwordcloud can be found here.

Machine learning

Machine learning can be used to try and detect patterns across text documents (topic modelling) and text can also be used as features in predictive models.

Topic modelling

Latent Dirichlet Allocation (LDA) is an unsupervised learning technique for identifying topics within a corpus. This is implemented in the topicmodels package. For example, the code below looks to identify five topics within the document-term matrix dtm.

library(topicmodels)

lda <- LDA(dtm, k = 5)

Once topics have been identified they can be explored interactively using the LDAvis package. This post on stackoverflow demonstrates how to convert the output of topicmodels into the format required by LDAvis.

Resources:

Text features

For other types of machine learning you will want to convert the document-term matrix into a dataframe of features. However, the document-term matrix may contain features that do not appear very often and that you want to remove. This can be done with the removeSparseTerms() function before converting to a dataframe.

dtm = removeSparseTerms(dtm, 0.99)
dtm <- as.data.frame(as.matrix(dtm))

N-grams

An n-gram is a list of n sequential words taken from a document. For example, the phrase

“the quick brown fox jumps over the lazy dog”

contains the bigrams

“the quick”, “quick brown”, “brown fox”, etc.

These are often useful to visualise in order to understand phrases in the data or as additional features for machine learning models.

The package tidytext can be used to produce a dataframe of bigrams that is useful for data exploration and data visualisation:

library(tidytext)

bigrams <- tm_corpus %>% 
  tm_map(PlainTextDocument) %>%
  tidy() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

The tm FAQ shows how to include bigrams within a document-term matrix.

Working with dates

Dates and date-times are a base type in R and can be used in arithmetic and within logical expressions.

If you import dates using one of the functions within readr it will try to infer the date format; turning the date field into a date variable where possible. Otherwise the field would appear as a string and you would need to tell R that this string is a date by specifying the format. For example:

string_date <- "30-01-2010"
date <- as.Date(string_date, format = "%d-%m-%Y")

Similarly date-times can be read in as follows:

string_datetime <- "30-01-2010 10:30:05"
date_time <- as.POSIXct(string_datetime, format = "%d-%m-%Y %H:%M:%S")

The lubridate package contains a variety of utilities that make it easier to work with dates and times. For example, it offers a series of functions that infer the date for you:

library(lubridate)

date_1 <- ymd("2010 April 01")
date_2 <- mdy("Apr 01, 10")
date_3 <- dmy("1st April 2010")

It also provides functions that work with intervals between date ranges, and to easily add a set number of days, weeks or months to dates.

Resources:

RStudio cheat sheet on working with dates and times (download here)
Quick-R guide to date values

Working with geographic data

There are a wide range of packages for working with geographic data in R, from packages such as sf that make it easy to work with geographic file types, to packages such as ggplot2, tmap, ggmap, and leaflet that enable easy plotting of this data.

Typically you want either to create a choropleth map (where different regions are coloured differently depending on a quantity associated with that geographic area), or to plot the location of given points on a map. These plots can either be static (perfect for traditional reports or slide decks) or interactive where you can pan and zoom across the map.

To make plots like these you need (in addition to your data containing geographic identifiers such as post codes): - Boundary data: to show the shape of areas (countries, counties, constituencies, wards etc.). - Map tiles: if you want to overlap the data on maps, for example road maps.

Resources:

Introduction to visualising spatial data in R is an introduction to displaying geographic information in R (PDF available here)
Geocomputation with R is a book for “people who want to analyse, visualize and model geographic data” in R
Spatial cheat sheet

Getting data

There are many resources of geographic data available but here we will focus on the ones provided by the ONS. Their data is collected in the ONS Geography Portal.

Particularly useful are a number of datasets that allow post codes to be mapped to areas (wards, output areas etc.). The outlines of these areas can be obtained by clicking on the boundaries ribbon:

Typically there are Full, Generalised, Super Generalised, and Ultra Generalised versions of these boundaries, with the Ultra Generalised version being the smallest download. These are available in a range of different file formats, and the increasingly popular GeoJSON format can be downloaded from the API tab:

The above boundaries data splits Britain into areas (for example, counties) and provides all the boundaries of that type. Individual boundaries (for example, for a single county) can be obtained from the ONS Geography Linked Data website.

The code below demonstrates downloading boundary data for the three countries within Great Britain as GeoJSON and loading it into R using the sf package.

library(sf)

url <- "https://opendata.arcgis.com/datasets/37bcb9c9e788497ea4f80543fd14c0a7_4.geojson"
download.file(url, "gb.json")
gb <- st_read("gb.json")

Static maps

Static maps can be created using ggplot2 which can natively handle data objects produced by the sf package using geom_sf(). These objects can be easily plotted.

library(ggplot2)

ggplot(gb) + 
  geom_sf(aes(fill = factor(ctry16nm))) +
  geom_point(aes(x = 0.1278, y = 51.5074)) +
  theme(axis.title=element_blank(),
        axis.text=element_blank(),
        axis.ticks = element_blank(),
        legend.title = element_blank(), 
        legend.spacing.x = unit(0.1, 'cm'))

Note that the sf geom has been used below for the converted boundary data and a point geom has been used to plot a point by specifying its longitude and latitude. More information about plotting with ggplot2 can be found in the Creating charts section of this guide.

The sf library extends dplyr so that its powerful range of data manipulation functions can be used with sf objects. This means that geographic data can be joined to other data. For example, given county boundaries in an sf object you could join this to a table listing each of their populations, which could then be plotted to produce a choropleth map.

The package ggmap package extends the mapping capabilities of ggplot2, allowing maps to be drawn on top of map tiles from the internet. However, there are issues getting this to work, but this can be done with the package tmap or with leaflet (discussed below).

Interactive maps

The package leaflet allows the creation of interactive maps that allow panning and zooming, as well as other features. It produces HTML, which can either be shared as a stand-alone file or be included within a report created using R Markdown (see the Creating Reports section).

The code below creates a basic leaflet map.

library(leaflet)

# This line fixes an issue with leaflet 2.0.2 and sf 0.7-1
names(st_geometry(gb)) = NULL

leaflet(gb) %>% 
  setView(lng = -5, lat = 55, zoom = 5) %>% 
  addTiles() %>% 
  addPolygons(weight = 1, color = "black", fillColor = c("red", "green", "blue"))

Note that that leaflet goes to the internet to retrieve the map tiles. These map tiles are provided by third parties and you should check the licences of the ones you are using.

Taking samples

This section focuses on generating samples from data. Most types of sampling can be done by manipulating the data using functions from the packages dplyr or purrr. These should be enough for most purposes.

More advanced sampling techniques can be found in the sampling package or in the boot package (which focuses on bootstrapping methods). If you already have survey data, the survey package allows you to calculate statistics taking into account factors such as different finite population corrections.

Examples of sampling the df dataframe:

Simple random sampling

library(dplyr)

my_sample <- 
  df %>%
    sample_n(100, replace = TRUE)

Stratified sampling (weighted)

library(dplyr)

my_sample <- 
  df %>%
  group_by(group_col) %>%
  mutate(num = n()) %>%
    sample_n(100, weight = num)

Cluster sampling

library(dplyr)

# Weights for each class can also be included
class_names <- c("class1", "class2", "class3", "class4")
class_sample <- sample(class_names, 3)

my_sample <- 
  df %>%
  filter(class_col %in% class_sample)

Systematic sampling

library(purrr)

n <- 100
every <- 10
rows <- nrow(df)
start <- sample(1:rows, size = 1)

# Generate indices for the sample
indx <- seq(from = start, to = (start + (n-1)*every), by = every)

# If index larger than the number of rows, cycle round the dataframe
indx2 <-
  indx %>%
  map_dbl(~ (.x -1) %% rows + 1 )

# Subset original data
my_sample <- df[indx2, ]

Note that the sample_n command might not work with database connections.

Resources:

dplyr documentation on sampling row
Quick-R guide to bootstrapping
survey package documentation

Making inferences

R is a programming language built for statistics with statistical techniques built in and a wide range of additional techniques available in packages. It is straightforward to perform ANOVA, t-tests, power analysis etc.

Resources:

Quick-R guide to basic statistics covers all these techniques and more

Making predictions

Predictive analytics is a suite of techniques for making predictions by learning patterns from data, typically using machine learning techiques. These techniques can be split into two types: supervised learning and unsupervised learning. In supervised learning you know the true value of the variable of interest, whereas with unsupervised learning you do not have this variable so you instead look to understand the structure present within the data.

Supervised learning

R contains an extremely wide range of different techniques, from well-established methods to more niche approaches. Typically, different techniques have their own package, and whilst syntaxes between packages tend to be similar they normally differ to some degree.

If you will only ever use one technique it makes sense to learn that package or function in detail. For example,

lm() from the build-in stats package performs linear regression.
glm() from the built-in stats package performs logistic regression (and other generalised linear models)
rpart() from the rpart package constructs decision trees.

However, if you want to use a range of different techniques, be it within a single project or using different techniques for different tasks, it makes sense to use the caret package (Classification And REgression Training), which provides a common interface to a wide range of R packages that deal with classification and regression.

caret also provides functions to preprocess data (for example scaling features or performing principal components analysis), for combining models (ensembling), and for evaluating the performance of models (including train/test splits, cross-validation and confusion matrices).

Quick overview of the caret package

caret requires that the package containing the underlying technique is also installed. To install caret alongside a number of different models run the command:

install.packages("caret", dependencies = c("Depends", "Suggests"))

which might take some time.

A typical workflow could be:

Preparing and preprocessing data
Splitting the data into training and test sets
Selecting and creating features
Training the model
Tuning the model hyperparameters
Evaluating the model
Running the model

All these steps can be done using tools provided by caret, but here we will only touch on a few of the above steps to outline the main commands that caret provides. The caret documentation provides more details and is easy to follow.

Splitting the data into training and test sets

The createDataPartition() function can be used to split your data into a training set and a test set.

in_train <- createDataPartition(df$outcome, p = 0.66)

training <- df[ inTrain, ]
testing <-  df[-inTrain, ]

caret also contains tools for the alternative approach of using cross-validation.

Training the model

Any of the models that caret contains can be trained using the train() function. For example, if you had a dataframe called training containing the columns outcome, feature_1, feature_2, and feature_3 you can train a decision using the following code:

# Train using just feature_1
model <- train(outcome ~ feature_1, data = training, method="rpart")

# Train using feature_1 and feature_2
model <- train(outcome ~ feature_1 + feature_2, data = training, method="rpart")

# Train using all features
model <- train(outcome ~ . , data = training, method="rpart")

Note that this uses R’s formula notation, where outcome ~ feature_1 means that you want to explain outcome using feature_1. Instead you can pass a matrix of features to the function train() as the x argument and a matrix of outcomes as the y argument.

Evaluating the model

Predictions can then be made using this model; this can be done on the test set in order to evaluate the model or on new data when the model is run for real.

predictions <- predict(model, newdata = testing)

which can be evaluated, for example by calculating the confusion matrix if it were a classification task.

confusionMatrix(data = prediction, testing$outcome)

Resources:

The detailed caret documentation
A short introduction to the caret package.
This guide to caret is also useful
More information on R’s formula notation can be found here and here
An introduction to machine learning
8 ways to deal with continuous variables in predictive modelling
Using Support Vector Machines for machine learning with caret
An introduction to feature selection
Feature selection with caret
A practical guide to PCA for feature selection
Understanding feature engineering
How to get good at feature engineering
Metrics to evaluate your machine learning algorithm
Performance metrics for classification methods
The pROC package contains tools for visualising ROC curves and for calculating the area under the curve. More information can be found here.

Unsupervised learning

One of the main techniques in unsupervised learning is clustering. This is a way of exploring the data and seeing which data points naturally group together. Clustering is typically used outside of making predictions, but can be used as features for a supervised learning model or, when you are happy with the clusters identified, you can predict which clusters unseen data points belong to.

The built-in stats package provides the function hclust() for hierarchical clustering and the function kmeans() for K-means clustering. Packages such as dbscan are also available that implement other clustering techniques.

K-means clustering is often used because it is computationally efficient, but it can fail to produce the clusters you expect in certain circumstances because it makes a number of assumptions about the data. The documentation for sklearn (a Python package) has a good demonstration of this.

More information about how to apply these two techniques in R can be found on Quick-R and a tutorial on using K-means clustering can be found on DataCamp.

Making forecasts

The R package forecast contains a number of tools for performing forecasting, including the techniques of ARIMA, exponential smoothing, and Holt-Winters. In addition it creates fan charts to show the uncertainty of the forecast.

Alternatively, if you currently use the X13 command line tool for performing forecasting you can use the X12 package within R to produce the exact same output.

Resources:

Two of the authors of the forecast package have published a book, Forecasting: Principles and Practice about forecasting with examples in R
Quick-R guide to time series and forecasting

Creating charts

One of the strengths of R is its ability to easily produce good charts. Here we give a quick introduction to how to create both static and interactive charts.

Static plots

While you can build plots with base graphics in R, we recommend using ggplot2, because it allows a high level of both flexibility and control. It’s based on the grammar of graphics, with all plots requiring 3 necessary elements:

Data
Aesthetic mappings
Geometries

A number of further components allow you to add additional layers, and overwrite plot defaults:

Statistics
Positions
Scales
Labels
Coordinates
Faceting specifications
Themes

ggplot2 contains some datasets, which you can use to practice creating charts. This includes diamonds, a dataset with the prices and 9 other characteristics of 50,000 round cut diamonds.

library(ggplot2)

ggplot(diamonds, aes(x=carat, y=price, colour = color)) +
  geom_point(shape = 18) +
  facet_wrap(~cut)

In the code above we defined a plot with data from the diamonds dataset, with carat mapped onto the x axis, price mapped onto the y axis and color mapped to colour. Then we added a geom layer to create a scatterplot with diamond-shaped points, and then split the plot into panels subset by cut.

The ggsave("filename.png") function can be used to save the plot which was most recently produced.

Aesthetics vs Attributes

It is worth noting here the difference between aesthetics and attributes. In the code above, colour is an aesthetic, because it has a variable mapped to it within the aes() function. By contrast, shape is an attribute; all points have the same shape, regardless of what category they are in. We could make the colour an attribute, and the shape an aesthetic, but this is not a sensible choice given the dataset.

ggplot(diamonds, aes(x=carat, y=price)) +
  geom_point(colour = "skyblue2", aes(shape = color)) +
  facet_wrap(~cut)

Resources:

Data Visualisation of R for Data Science
A Layered Grammar of Graphics is Hadley Wickham’s 2009 academic article introducing the grammar of graphics
ggplot2 Reference
Data Visualization with ggplot2 Cheat Sheet
sape ggplot2 Quick Reference
Recreating an existing chart can be a good way to explore the flexibility of the ggplot2 package. Here is an example on how to Recreate This Economist Graph

Interactive charts

There are many different packages for creating interactive graphs in R, including: plotly, dygraphs, ggiraph, and ggvis. At the moment plotly looks the most promising for interactive graphs in general, with dygraphs being good for time series data. Note, charts involving maps were discussed in the Working with geographic data section.

These interactive charts are produced in HTML format and can either be saved as their own HTML file or contained within a HTML report writtin in R Markdown (see the Creating reports section) or contained within a HTML dashboard or application such as those produced by Shiny and flexdashboard (see the Creating dashboards section).

One nice feature of the plotly package is that many graphs produced by ggplot2 can be quickly converted into interactive versions using the ggplotly command as demonstrated below.

library(plotly)

plot <- ggplot(diamonds, aes(x=price, fill=cut)) + 
          geom_histogram() 

ggplotly(plot)

Resources:

Plotly pages on the R interface contain simple examples of how to produce the most common chart types.
Plotly full reference contains full details of all the plotly functions.
Plotly for R book contains a good introduction to using plotly.
The dygraphs for R website contains all you need to know to get started with dygraphs.

Network visualisation

Networks describe the interactions between entities. This could be people communicating on social media, companies trading with one another, or structures such as corporate groups. This is an example interactive network diagram; try dragging the nodes.

There are a number of packages available for network visualisation in R. Which package is most suitable will depend on your requirements; are you seeking to visualise a static network or an interactive network, or wanting to perform network analysis? For non-interactive graphs ggraph is a good option, and for interactive graphs visNetwork is a good option.

Resources:

Introduction to Network Analysis with R is a general introduction to creating and plotting network type objects using 6 different packages.
Network visualization with R is an in-depth tutorial which is continuously updated and expanded. The 2018 version is available as a PDF here.
Interactive and Dynamic Network Visualization in R is a set of slides from a presentation focused on making dynamic and interactive network graphs.

Note that the simple network diagram above was generated with the following code.

library(visNetwork)

nodes <- data.frame(id = 1:5, label = c("A", "B", "C", "D", "E"))

edges <- data.frame(from  = c(1, 1, 1, 4), 
                    to    = c(2,3, 4, 5), 
                    value = c(1.5, 1.5, 1, 1))

visNetwork(nodes, edges, height = "350px", width = "100%")

And a similar non-interactive version could be created with this code.

library(igraph)
library(ggraph)

static_graph <- graph_from_data_frame(edges, vertices = nodes)
  
ggraph(static_graph, layout = "fr") +
            geom_edge_link2(
              aes(width = value),
              colour = "lightblue",
              show.legend = FALSE) +
            geom_node_point(
              size = 20,
              fill = "cornflowerblue", 
              color = "black", 
              stroke = 1, 
              shape = 21, 
              show.legend = FALSE) +
            geom_node_text(
              aes(label = name), 
              vjust = 3, 
              hjust = 3) +
            theme_void()

Creating reports

R Markdown can be used to write reports. You write your document using Markdown (a widely used simple markup language) and you can insert R code within the document. This code can be run when you create your output document (a process called knitting), allowing you to insert numbers, tables and charts produced by R in your report. This is good for reproducibility as the outputs of your analysis and the code that produced it are together within the same document.

R Markdown is also perfect for producing guidance on using R as it is simple to include blocks of code where different parts of the R syntax have been highlighted in different colours. You can also write equations using LaTeX syntax; for example, the code

$$e^{i\pi} + 1 = 0$$

produces the following rendered equation \[e^{i\pi} + 1 = 0\]

R Markdown can be used to create HTML, PDF and Word documents, and various presentation formats. However, at the time of writing only HTML documents (or HTML slides) can be easily produced on our network. Nevertheless, HTML documents have the advantage that they can include interactive elements from packages that produce HTML output such as leaflet for maps or plotly for graphs.

RStudio contains a lot of built-in support for R Markdown. To create a new R Markdown document just go to New File > R Markdown. This opens a menu which guides you through setting up a markdown project and even produces a template document to get you started.

To convert your R Markdown code into a document click the knit button within RSudio.

Resources:

RStudio reference guide to R Markdown (download here)
R Markdown: The Definitive Guide
R Markdown chapter of R for Data Science

Creating dashboards

Dashboards are typically collections of key visualisations and are most effective when they are interactive. Such interactivity is producing by writing R code that is converted into HTML and JavaScript that you can then view in a browser. These dashboards can often have so much functionality that they effectively become web apps.

Package options

In order to understand which package to use you need to understand the differences between static and dynamic webpages. A static webpage loads everything at the start and never changes; this loading could include interactive content such as D3 visualisations or leaflet maps which are all run client-side (the processing is done in the browser). For a dynamic webpage the content can be reloaded; for example, data could be pulled from a database or items changed due to the running of R code, normally on a remote server. Dynamic webpages are more flexible and allow you to work with larger datasets but require a web server, whereas static web pages can be run independently, not requiring the user to have R themselves or the web page to be hosted on a server.

Flexdashboard

The flexdashboard package in R extends R markdown (used to create reports in R) to include dashboard layouts. In effect these layouts consists of elements (be they tables, ggplot2 charts, plotly charts, or leaflet maps) which are arranged in rows and columns.

When using flexdashboard on its own these separate elements cannot talk to one another, but the resulting dashboard is a self-contained HTML file that can be emailed to a customer.

Examples can be found on the flexdashboard website. These dashboard layouts can be combined with crosstalk and shiny for additional functionality.

Crosstalk

The crosstalk package allows HTML content produced by certain packages to interact. This means that if you click on one element (which could be a slider to select data ranges), the others will update. At the time of writing crosstalk only works with plotly, leaflet, DT, SummaryWidget, and rgl.

Combining crosstalk with flexdashboard allows the creation of dashboards where the elements of a dashboard can simultaneously update based on user interaction with one particular element, and the output produced is still an easily sharable HTML file.

More information can be found on the crosstalk website.

Shiny

The shiny package allows the creation of very powerful dashboards where user interaction (for example, clicking a button) can result in arbitary R code being run behind the scenes and the dashboard being updated accordingly. This allows almost limitless functionality within the dashboard; models could be rerun, new data called from a database etc. However, Shiny requires R to be running somewhere, ideally on a server but it could be on a local computer.

At the moment CoDE does not have a web server on which to run these dashboards. The ways in which you could currently share a Shiny dashboard are:

Get your customer to install RStudio on their computer and run the dashboard code themselves.
Find an unused networked computer to host the dashboard on.
Pay for an internal server to run the application on.

Getting started with Shiny

Shiny apps have two main components, the user interface (file saved as ui.R) and the reactive server (file saved as server.R).

ui.R

The user interface script holds all information on the static (permanent) design and layout of the app.

There are a variety of different packages and pre-set designs available to use - to name a few:

navBarPage - good for tabs
fluidPage - good when using on different device types
dashboardpage - general purpose

Each of these designs allows you to easily create a sidebar/mainPanel layout for easy dashboard use.

This is where you tell Shiny where to position all visible elements:

An input is a reference for Shiny to take the value chosen by the user to change another element on the page (usually added on the ui.R side)
An output is a widget, plot, table, map, chart or input box (tip of the iceberg!) that can change depending on any other input on the page
The server script (see below) handles all interactive elements (inputs and outputs), updating them on user changes (AKA reactivity)

Note: It is possible to make pretty much any ui element reactive by adding an Output function (with an id) on the ui side, and then building it out on the server side.

library(shiny)

# Define UI for app that draws a histogram ----
ui <- fluidPage(

  # App title ----
  titlePanel("Hello Shiny!"),

  # Sidebar layout with input and output definitions ----
  sidebarLayout(

    # Sidebar panel for inputs ----
    sidebarPanel(

      # Input: Slider for the number of bins ----
      sliderInput(inputId = "bins",
                  label = "Number of bins:",
                  min = 1,
                  max = 50,
                  value = 30)

    ),

    # Main panel for displaying outputs ----
    mainPanel(

      # Output: Histogram ----
      plotOutput(outputId = "distPlot")

    )
  )
)

server.R

As mentioned above, the server element of the Shiny app exists to make it interactive.

In order to reference clicks/choices/hovers/selections from other elements in the app, use input$id (e.g. “input$bins” as below).

You can prevent elements from updating each time or make them wait for dependent inputs before rendering by using the isolate() and req() functions.

# Define server logic required to draw a histogram ----
server <- function(input, output) {

  # Histogram of the Old Faithful Geyser Data ----
  # with requested number of bins
  # This expression that generates a histogram is wrapped in a call
  # to renderPlot to indicate that:
  #
  # 1. It is "reactive" and therefore should be automatically
  #    re-executed when inputs (input$bins) change
  # 2. Its output type is a plot
  output$distPlot <- renderPlot({

    x    <- faithful$waiting
    bins <- seq(min(x), max(x), length.out = input$bins + 1)

    hist(x, breaks = bins, col = "#75AADB", border = "white",
         xlab = "Waiting time to next eruption (in mins)",
         main = "Histogram of waiting times")

    })

}

Running the app

It’s good practice to separate your ui and server scripts into separate files and then use a runApp() function (only requires the folder location) to draw them together.

However, for immediate learning processes here the full script that you need to run the app is shown above.

library(shiny)

# Define UI for app that draws a histogram ----
ui <- fluidPage(

  # App title ----
  titlePanel("Hello Shiny!"),

  # Sidebar layout with input and output definitions ----
  sidebarLayout(

    # Sidebar panel for inputs ----
    sidebarPanel(

      # Input: Slider for the number of bins ----
      sliderInput(inputId = "bins",
                  label = "Number of bins:",
                  min = 1,
                  max = 50,
                  value = 30)

    ),

    # Main panel for displaying outputs ----
    mainPanel(

      # Output: Histogram ----
      plotOutput(outputId = "distPlot")

    )
  )
)

# Define server logic required to draw a histogram ----
server <- function(input, output) {

  # Histogram of the Old Faithful Geyser Data ----
  # with requested number of bins
  # This expression that generates a histogram is wrapped in a call
  # to renderPlot to indicate that:
  #
  # 1. It is "reactive" and therefore should be automatically
  #    re-executed when inputs (input$bins) change
  # 2. Its output type is a plot
  output$distPlot <- renderPlot({

    x    <- faithful$waiting
    bins <- seq(min(x), max(x), length.out = input$bins + 1)

    hist(x, breaks = bins, col = "#75AADB", border = "white",
         xlab = "Waiting time to next eruption (in mins)",
         main = "Histogram of waiting times")

    })

}

shinyApp(ui = ui, server = server)

Advanced considerations

Shiny’s capabilities are effectively endless; here are some things to consider when designing your dashboards.

User Experience:

Speed:

Run all the aggregations, data processing and queries in advance, then run Shiny from condensed datasets at an individual level.
Add the data used into a database, then query as and when needed instead of running in memory.
Use the req() function inside render({}) functions on your server.R script to stop the app trying to draw plots/tables until the required reactive data/filters have already been sorted out.
Use the isolate() function to stop input changes affecting every output they’re linked to every time.

Other considerations:

Learning resources

Tutorials

Visualisation packages

Inspiration

Influencers

HMRC Examples

shinyTemplate

Programming

R is a fully fledged programming language. This section is a very quick introduction to how common programming constructs are implemented within R.

Defining your own function

To define a function called function_name you would use the following syntax.

function_name <- function(arguement_1, arguement_2){
  ...
  return(result)
}

The final return statement is optional; if it is not present the function will return the value of the last expression evaluated. These functions can then be called in the usual way, e.g. function_name(arg1, arg2).

If statements

The syntax for if statements is the following:

if (condition_1) {
  ...
} else if (condition_2) {
  ...
} else {
  ...
}

where the conditions are any logical tests.

For loops

R does give you the option to use for loops but also provides a number of other ways to iterate through a series of values. Consider the following example where we first define a vector of strings.

vec_a <- c("string", "str", "s")

We want to find the length of each of these strings. You could use a for loop to iterate over all the elements in this vector.

for (i in vec_a){
  print(nchar(i))
}

(Another helpful syntax is that, for example, 1:10 produces a list of the numbers between one and ten that can be iterated through.)

However, many functions in R have been designed to work directly with vectors and produce vectors as outputs. nchar() is one such function.

nchar(vec_a)

Let’s imagine that nchar did not work on a vector, that it would only take a single value as input. If we wanted to iterate over our vector and for each element apply the nchar function we could use the functions lapply, sapply, and vapply.

# The lapply function produces a list
lapply(vec_a, nchar)

# The sapply function tries to simplify to a vector
sapply(vec_a, nchar)

# It is recommended to use vapply where you specify the output type
vapply(vec_a, nchar, integer(1))

The package purrr (part of the tidyverse) provides newer, more consistent replacements for the apply family of functions. These are the map functions.

library(purrr)

# map produces a list
map(vec_a, nchar)

# map_int produces a vector of integers
# map_dble, map_char and map_lgl exist for other data types
map_int(vec_a, nchar)

Resources:

R for Data Science has a good chapter on programming which introduces many important concepts.
Programming with R has a number of episodes demonstrating different aspects of R programming.
RStudio cheat sheet on applying functions with purrr (Download here).
More advanced topics such as different programming styles and writing more performant code are discussed in Advanced R.
Creating your own R package is discussed here.
Debugging code is discussed in these articles.
The tidyverse style guide is a commonly followed style guide.

Finding out more

This section shows ways to find out more, from finding out how to use a given function or package to websites where you can find out about new packages.

Getting Help

There are a number of ways to find help with R commands and packages.

Built-in help

Help can be searched in the console with the commands shown to the left; this will automatically open the Help file in the Help pane. You can also search the help files using the search box on the Help pane.

?: Displays the Help file for a specific function. For example, ?data.frame displays the Help file for the data.frame() function.
??: Searches for a word (or pattern) in the Help files. For example, ??list returns the names of functions that contain the word list.
help(package = "package_name") displays the manual for the package in question. For example, help(package = "caret") shows the manual for the caret package.

Vignettes

Many packages contain vignettes; these are examples that demonstrate how to use the package by working through an example with actual code. Some packages have more than one vignette. For example, browse for the vignettes for the dplyr package with the following command: browseVignettes(package = "dplyr")

Stack Overflow

Stack Overflow is a programming question and answer site and is a good place to start when needing help with R as the chances are that someone else has already asked about the very thing you are stuck on. The questions on R can be found at: https://stackoverflow.com/questions/tagged/r

Additional resources

There is a wide range of R resources available online.

Test data sets

R comes with a number of test data sets built in, and many other packages provide additional example data sets. A list of available datasets (alongside which package needs to be loaded to access them) can be found here.

Cheat sheets

Cheat sheets are helpful reminders of shortcuts, functions, arguments and the like. They tend to be short but dense infographics, which can be a bit intimidating. However, they are useful for reminding you of the bit of code that does a particular task.

Useful cheat sheets include:

Further examples are listed under specific topics.

GitHub

GitHub is a development platform, and many R packages were developed on there. The homepage is not very helpful unless you have an account, but you should be able to view search results that point to GitHub from Google (other search engines are available…). Every repository should have a README.md file explaining the contents, which appears automatically at the bottom of the page.

Example repositories:

Style guide

Styles guides aim to make code consistent and easier to read by setting out how to format the different components of a program. For R the tidyverse style guide is commonly followed.

Other useful resources

These are good resources for working with R, and many have been referenced throughout this guide:

Summary

Below is a quick summary of useful packages to use for common tasks.

Working with structured data

readr - read csv files.
read_xl - read excel files.
haven - read SAS, Stata and SPSS files.
DBI and odbc - connect to a database.
tidyr - convert between wide and long data.
dplyr - manipulate / wrangle data.
dbplyr - using dplyr with database connections.

Working with unstructured data

jsonlite - read json data.
xml2 - read XML data.
httr - perform API calls.

Working with text

tm - clean and wrangle text data.
textstem - lemmatise text.
hunspell - check spellings.
ggwordcloud - produce wordclouds.
topicmodels - build LDA topic models.
LDAvis - visualise LDA topic models.
tidytext - produce n-grams and perform sentiment analysis.
text2vec - a modern alternative to tm.

Working with dates

lubridate - parse and manipulate dates.

Working with geographic data

sf - read common geographic data types.
ggplot2 - plot sf data to produce non-interactive maps.
leaflet - produce interactive maps.

Taking samples

dplyr - use to sample data.

Making inferences

stats - perform statistical analysis (preloaded package).

Making predictions

caret - perform supervised learning (common interface to a wide range of packages).
stats - perform unsupervised learning (preloaded package).

Making forecasts

forecast - perform forecasts and manipulate time series.
X12 - replicate the functionality of the X13 command line tool.

Creating charts

ggplot2 - produce non-interactive charts.
plotly - produce interactive charts.
dygraphs produce interactive time series charts.
visNetwork - produce interactive network visualisations.
ggraph - produce non-interactive network visualisations.

Creating reports

knitr - convert R Markdown documents into rendered files (via the “knit” button in RStudio).

Creating dashboards

flexdashboard - produce simple dashboards in self-contained HTML files.
crosstalk - produce dashboards where separate elements interact.
shiny - produce dashboards that can run any R code but need to be hosted.
DT - produce interactive tables.

Self-learning guide to R

HMRC Centre for Data Exploitation

Updated: December 2018

Working with structured data

Importing data from files

Connecting to a database

Transforming data

Working with json and XML

Working with text

Preprocessing

Reading data into tm

Cleaning text

Stemming and lemmatisation

Spelling

Visualisations

Word frequency charts

Word clouds

Machine learning

Topic modelling

Text features

N-grams

Working with dates

Working with geographic data

Getting data

Static maps

Interactive maps

Taking samples

Simple random sampling

Stratified sampling (weighted)

Cluster sampling

Systematic sampling

Making inferences

Making predictions

Supervised learning

Quick overview of the caret package

Splitting the data into training and test sets

Training the model

Evaluating the model

Unsupervised learning

Making forecasts

Creating charts

Static plots

Aesthetics vs Attributes

Interactive charts

Network visualisation

Creating reports

Creating dashboards

Package options

Flexdashboard

Crosstalk

Shiny

Getting started with Shiny

ui.R

server.R

Running the app

Advanced considerations

User Experience:

Speed:

Other considerations:

Learning resources

Programming

Defining your own function

If statements

For loops

Finding out more

Getting Help

Built-in help

Vignettes

Stack Overflow

Additional resources

Test data sets

Cheat sheets

GitHub

Style guide

Other useful resources

Summary

Working with structured data

Working with unstructured data

Working with text

Working with dates