6 min read

tidyr for finance

In this post I’ll show how to use tidyr package which is one of the core tidyverse packages. Throughout this post I will use three messy datasets, stocks1, stocks3, stocks4, and I’ll point out the underlying problem for each dataset and show how to tidy them using most used functions (gather, spread, separate and unite) from tidyr package.

tidyr is written by Hadley Wickham and he specifies the tidy data concept as:

“Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table (Hadley Wickham).”

There are three interrelated rules which make a dataset tidy:

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell (Hadley Wickham, http://r4ds.had.co.nz).

Figure 1: tidy-dataset-requirements, source: http://r4ds.had.co.nz

There are four main functions in tidyr package:

  • gather() : Gather columns into rows.
  • spread() : Spread rows into columns.
  • separate() : Separate one column into several columns.
  • unite() : Unite several columns into one column.

Figure 2: tidyr-main-functions

Let’s explain each of them with examples.

Problem with dataset: rows are columns

Let’s work with a messy data; stocks1 where columns “2018-03-26” to “2018-03-29” represents dates for each observation. Column names “2018-03-26” to “2018-03-29” are values of a variable.

# printing out stocks2 dataset
stocks1
##   eq_name 2018-03-26 2018-03-27 2018-03-28 2018-03-29
## 1    aapl     172.77     168.34     166.48     167.78
## 2    goog    1053.21    1005.10    1004.56    1031.79
## 3    msft      93.78      89.47      89.39      91.27

We need to gather those columns into a new pair of variables. In other words we will gather for columns in stocks2 data set and put into a column. We can manage this via gather function.

Let’s quickly have a check its arguments.

args(gather)
## function (data, key = "key", value = "value", ..., na.rm = FALSE, 
##     convert = FALSE, factor_key = FALSE) 
## NULL
  • data : is the dataset we’d like to gather its columns. A data frame.
  • key : Names of new key columns, as strings or symbols.
  • value : Names of new value columns, as strings or symbols.
  • … : A selection of columns. If empty, all variables are selected.

Figure 3: tidyr-gather-function, source: http://r4ds.had.co.nz

Let’s gather the columns “2018-03-26” to “2018-03-29” in stocks2 data set into new column named “date”.

stocks1 <- gather(data = stocks1, date, price, 2:5)
stocks1
##    eq_name       date   price
## 1     aapl 2018-03-26  172.77
## 2     goog 2018-03-26 1053.21
## 3     msft 2018-03-26   93.78
## 4     aapl 2018-03-27  168.34
## 5     goog 2018-03-27 1005.10
## 6     msft 2018-03-27   89.47
## 7     aapl 2018-03-28  166.48
## 8     goog 2018-03-28 1004.56
## 9     msft 2018-03-28   89.39
## 10    aapl 2018-03-29  167.78
## 11    goog 2018-03-29 1031.79
## 12    msft 2018-03-29   91.27

Looks better however stocks1 data is still not tidy.

Problem with dataset: columns are rows

In stocks1 dataset the eq_name column contains several column names. In this case we need to spread the rows in eq_name to three other columns which represents stock prices for Apple, Google and Microsoft. To overcome this we will use spread() function.

Let’s quickly have a check its arguments.

args(spread)
## function (data, key, value, fill = NA, convert = FALSE, drop = TRUE, 
##     sep = NULL) 
## NULL
  • data : is the dataset we’d like to spread its columns. A data frame.
  • key : Column names or positions(existing).
  • value : Column names or positions(existing) for kept values under it/them.

Figure 4: tidyr-spred-function, source: http://r4ds.had.co.nz

Spreading eq_name column in stocks1 into three columns.

stocks1_tidied <- spread(stocks1, eq_name, price)
stocks1_tidied
##         date   aapl    goog  msft
## 1 2018-03-26 172.77 1053.21 93.78
## 2 2018-03-27 168.34 1005.10 89.47
## 3 2018-03-28 166.48 1004.56 89.39
## 4 2018-03-29 167.78 1031.79 91.27

It now looks tidied where each column represent a variable and each rows represents an observation.

Problem with dataset: several columns are pasted into one column

As seem from stocks3 dataset aapl/goog/msft column pasted together altogouh it should be in three columns. To correct this we will use separate() function which takes a column and divides into separate columns.

stocks3
##         date                   aapl/goog/msft
## 1 2018-03-26 172.770004/1053.209961/93.779999
## 2 2018-03-27 168.339996/1005.099976/89.470001
## 3 2018-03-28 166.479996/1004.559998/89.389999
## 4 2018-03-29 167.779999/1031.790039/91.269997

Let’s quickly have a check its arguments.

args(separate)
## function (data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, 
##     convert = FALSE, extra = "warn", fill = "warn", ...) 
## NULL
  • data : is the dataset we’d like to separate its columns. A data frame.
  • col : Column name or position (existing) to be separated.
  • into : New column names.
  • sep : Separator between columns.

Figure 5: tidyr-separate-function, source: http://r4ds.had.co.nz

Let’s now separate “aapl/goog/msft” column into three new columns called aapl, goog and msft.

stocks3_separated <- separate(stocks3, "aapl/goog/msft",  c("aapl", "goog", "msft"), sep = "/")
stocks3_separated
##         date       aapl        goog      msft
## 1 2018-03-26 172.770004 1053.209961 93.779999
## 2 2018-03-27 168.339996 1005.099976 89.470001
## 3 2018-03-28 166.479996 1004.559998 89.389999
## 4 2018-03-29 167.779999 1031.790039 91.269997

Problem with dataset: one column are divided into several columns

As seem from stocks3 dataset year, month and day columns are divide although they should be in a single column. This to be corrected with use of unite() function which unites several columns into a single column.

stocks4
##   year month day   aapl    goog  msft
## 1 2018    03  26 172.77 1053.21 93.78
## 2 2018    03  27 168.34 1005.10 89.47
## 3 2018    03  28 166.48 1004.56 89.39
## 4 2018    03  29 167.78 1031.79 91.27

Looking at its arguments.

args(unite)
## function (data, col, ..., sep = "_", remove = TRUE) 
## NULL
  • data : is the dataset we’d like to unite its columns. A data frame.
  • col : New column name.
  • … : Column names or positions(existing) to be united.
  • sep : Separator between values.

Figure 6: tidyr-unite-function, source: http://r4ds.had.co.nz

Let’s now unite year, month and day columns into into a single column called “date”.

stocks4_united <- unite(stocks4, "date", c("year","month", "day"), sep = "-")
stocks4_united
##         date   aapl    goog  msft
## 1 2018-03-26 172.77 1053.21 93.78
## 2 2018-03-27 168.34 1005.10 89.47
## 3 2018-03-28 166.48 1004.56 89.39
## 4 2018-03-29 167.78 1031.79 91.27

Now looks stocks4_united is a tidied dataset.

The other functions in tidyr package:

  • complete() : completes a data frame with missing combinations of data.
  • drop_na() : drops rows containing missing values.
  • expand() : expands data frame to include all combinations of values.
  • extract() : extracts one column into multiple columns.
  • fill() : fills in missing values.
  • full_seq() : creates the full sequence of values in a vector.
  • nest() : nests repeated values in a list-variable.
  • replace_na() : replaces missing values.
  • separate_rows() : separates a collapsed column into multiple rows.
  • uncount() : uncounts a data frame. “Opposite of dplyr::count()”.
  • unnest() : unnests a list column.

The datasets in the package:

  • smiths
  • table1-2-3-4-5
  • who