[![AnalyticsDojo](https://github.com/rpi-techfundamentals/spring2019-materials/blob/master/fig/final-logo.png?raw=1)](http://rpi.analyticsdojo.com)
<center><h1>Introduction to R - Tidyverse </h1></center>
<center><h3><a href = 'http://rpi.analyticsdojo.com'>rpi.analyticsdojo.com</a></h3></center>

## Overview

> It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data. (Dasu and Johnson, 2003)

Thus before you can even get to doing any sort of sophisticated analysis or plotting, you'll generally first need to: 

1. ***Manipulate*** data frames, e.g. filtering, summarizing, and conducting calculations across groups.
2. ***Tidy*** data into the appropriate format.



# What is the Tidyverse?

## Tidyverse
- "The tidyverse is a set of packages that work in harmony because they share common data representations and API design." -Hadley Wickham
- The packages include `dplyr`, `tibble`, `tidyr`, `readr`, `purrr` (and more).


![](http://r4ds.had.co.nz/diagrams/data-science-explore.png)
- From [R for Data Science](http://r4ds.had.co.nz/explore-intro.html) by [Hadley Wickham](https://github.com/hadley)

## Schools of Thought

There are two competing schools of thought within the R community.

* We should stick to the base R functions to do manipulating and tidying; `tidyverse` uses syntax that's unlike base R and is superfluous.
* We should start teaching students to manipulate data using `tidyverse` tools because they are straightforward to use, more readable than base R, and speed up the tidying process.

We'll show you some of the `tidyverse` tools so you can make an informed decision about whether you want to use base R or these newfangled packages.

## Dataframe Manipulation using Base R Functions

- So far, you’ve seen the basics of manipulating data frames, e.g. subsetting, merging, and basic calculations. 
- For instance, we can use base R functions to calculate summary statistics across groups of observations,
- e.g. the mean GDP per capita within each region:


In [43]:
gapminder <- read.csv("../../input/gapminder-FiveYearData.csv",
          stringsAsFactors = TRUE)
head(gapminder)

country,year,pop,continent,lifeExp,gdpPercap
Afghanistan,1952,8425333,Asia,28.801,779.4453
Afghanistan,1957,9240934,Asia,30.332,820.853
Afghanistan,1962,10267083,Asia,31.997,853.1007
Afghanistan,1967,11537966,Asia,34.02,836.1971
Afghanistan,1972,13079460,Asia,36.088,739.9811
Afghanistan,1977,14880372,Asia,38.438,786.1134


Calculating group summaries with base R isn't ideal, though, because it involves a fair bit of repetition. Repeating yourself will cost you time, both now and later, and potentially introduce some nasty bugs.


# Dataframe Manipulation using dplyr



Here we're going to cover 6 of the most commonly used functions as well as using pipes (`%>%`) to combine them.

1. `select()`
2. `filter()`
3. `group_by()`
4. `summarize()`
5. `mutate()`
6. `arrange()`

If you have not installed this package earlier, please do so now:


```r
install.packages('dplyr')
```

## Dataframe Manipulation using `dplyr`

Luckily, the [`dplyr`](https://cran.r-project.org/web/packages/dplyr/dplyr.pdf) package provides a number of very useful functions for manipulating dataframes. These functions will save you time by reducing repetition. As an added bonus, you might even find the `dplyr` grammar easier to read.

- ["A fast, consistent tool for working with data frame like objects, both in memory and out of memory."](https://cran.r-project.org/web/packages/dplyr/index.html)
- Subset observations using their value with `filter()`.
- Reorder rows using `arrange()`.
- Select columns using  `select()`.
- Recode variables using `mutate()`.
- Summarize variables using `summarise()`.

In [44]:
#Now let's load some packages:
library(dplyr)
library(ggplot2)
library(tidyverse)

# dplyr select

Imagine that we just received the gapminder dataset, but are only interested in a few variables in it. We could use the `select()` function to keep only the columns corresponding to variables we select.


In [47]:
# Base R equivalent: subset columns by name (here only year and country)
year_country_gdp <- gapminder[, c("year", "country")]
year_country_gdp

year,country
1952,Afghanistan
1957,Afghanistan
1962,Afghanistan
1967,Afghanistan
1972,Afghanistan
1977,Afghanistan
1982,Afghanistan
1987,Afghanistan
1992,Afghanistan
1997,Afghanistan


In [45]:
year_country_gdp <- select(gapminder, year, country, gdpPercap)
head(year_country_gdp)

year,country,gdpPercap
1952,Afghanistan,779.4453
1957,Afghanistan,820.853
1962,Afghanistan,853.1007
1967,Afghanistan,836.1971
1972,Afghanistan,739.9811
1977,Afghanistan,786.1134


## dplyr Piping
- `%>%` is used to help write cleaner code.
- It is loaded by default when you load the `tidyverse`, but it comes from the `magrittr` package.
- The output of one command is piped into the next as its input, without saving it in an intermediate throwaway variable.
- Since the pipe grammar is unlike anything we've seen in R before, let's repeat what we've done above using pipes.

In [27]:
year_country_gdp <- gapminder %>% select(year,country,gdpPercap)
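
To make the pipe concrete, here is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not the gapminder data). It illustrates that `x %>% f(y)` is simply another way of writing `f(x, y)`.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- data.frame(
  year      = c(1952, 1957, 1952),
  country   = c("A", "A", "B"),
  gdpPercap = c(100, 110, 200),
  pop       = c(10, 11, 20)
)

# Standard function call: the data frame is the first argument
nested <- select(toy, year, country, gdpPercap)

# Piped call: the left-hand side becomes the first argument of select()
piped <- toy %>% select(year, country, gdpPercap)

# Both forms should produce the same data frame
identical(nested, piped)
```

The piped form reads left to right, which tends to matter more as more steps are chained together.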


## dplyr filter

Now let's say we're only interested in African countries. We can combine `select` and `filter` to select only the observations where `continent` is `Africa`.

As with last time, first we pass the gapminder dataframe to the `filter()` function, then we pass the filtered version of the gapminder dataframe to the `select()` function.

To clarify, both the `select` and `filter` functions subset the data frame. The difference is that `select` extracts certain *columns*, while `filter` extracts certain *rows*.

**Note:** The order of operations is very important in this case. If we used `select` first, `filter` would not be able to find the variable `continent` since we would have removed it in the previous step.


In [28]:
year_country_gdp_africa <- gapminder %>%
    filter(continent == "Africa") %>%
    select(year,country,gdpPercap)
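
The note on ordering can be checked with a small sketch. The data frame `toy` below is hypothetical and only stands in for gapminder; the `tryCatch()` wrapper is there so the failing pipeline does not stop the script.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- data.frame(
  continent = c("Africa", "Asia", "Africa"),
  country   = c("A", "B", "C"),
  year      = c(1952, 1952, 1957),
  gdpPercap = c(100, 200, 150)
)

# filter() first: `continent` is still available
ok <- toy %>%
  filter(continent == "Africa") %>%
  select(year, country, gdpPercap)

# select() first: `continent` has been dropped, so filter() raises an error
# (assuming no other object named `continent` exists in the workspace)
result <- tryCatch(
  toy %>%
    select(year, country, gdpPercap) %>%
    filter(continent == "Africa"),
  error = function(e) "filter() failed: continent is no longer a column"
)

ok
result
```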

## dplyr Calculations Across Groups

A common task you'll encounter when working with data is running calculations on different groups within the data. For instance, what if we wanted to calculate the mean GDP per capita for each continent?

In base R, you would have to run the `mean()` function for each subset of data:


In [29]:
mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])
mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])
mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])


# dplyr split-apply-combine

The abstract problem we're encountering here is known as "split-apply-combine":

![](../../fig/splitapply.png)

We want to *split* our data into groups (in this case continents), *apply* some calculations on each group, then  *combine* the results together afterwards. 
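
The three steps can be spelled out by hand in base R. This is only a sketch on hypothetical toy data (`toy`, `pieces`, `means`, and `combined` are illustrative names), meant to show what each step does rather than a recommended workflow.

```r
# Hypothetical toy data, used only for illustration
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  gdpPercap = c(100, 300, 200, 600)
)

# Split: one vector of gdpPercap values per continent
pieces <- split(toy$gdpPercap, toy$continent)

# Apply: compute the mean of each piece
means <- sapply(pieces, mean)

# Combine: gather the results back into a single data frame
combined <- data.frame(
  continent      = names(means),
  mean_gdpPercap = unname(means)
)
combined
```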

Module 4 gave some ways to do split-apply-combine using the `apply` family of functions, but those can be error-prone and messy.

Luckily, `dplyr` offers a much cleaner, more straightforward solution to this problem.



## dplyr group_by

We've already seen how `filter()` can help us select observations that meet certain criteria (in the above: `continent == "Africa"`). More helpful, however, is the `group_by()` function, which will essentially use every unique criterion that we could have used in `filter()`.

A `grouped_df` can be thought of as a `list` where each item in the `list` is a `data.frame` which contains only the rows that correspond to a particular value of `continent` (at least in the example above).

![](../../fig/dplyr-fig2.png)
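
The "list of data frames" picture is a mental model rather than how the object is necessarily stored. A small sketch on hypothetical toy data shows a few `dplyr` helpers that make the groups visible:

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia"),
  gdpPercap = c(100, 300, 200)
)

grouped <- toy %>% group_by(continent)

# The rows are unchanged; grouping only adds metadata
n_groups(grouped)    # number of groups
group_keys(grouped)  # one row per unique value of continent

# group_split() makes the "list of data frames" picture explicit
pieces <- group_split(grouped)
length(pieces)
```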


In [30]:
#Summarize returns a dataframe. 
gdp_bycontinents <- gapminder %>%
    group_by(continent) %>%
    summarize(mean_gdpPercap = mean(gdpPercap))
head(gdp_bycontinents)

continent,mean_gdpPercap
Africa,2193.755
Americas,7136.11
Asia,7902.15
Europe,14469.476
Oceania,18621.609


![](../../fig/dplyr-fig3.png)

That allowed us to calculate the mean gdpPercap for each continent. But it gets even better -- the function `group_by()` allows us to group by multiple variables. Let's group by `year` and `continent`.



In [48]:
gdp_bycontinents_byyear <- gapminder %>%
    group_by(continent, year) %>%
    summarize(mean_gdpPercap = mean(gdpPercap))
gdp_bycontinents_byyear

continent,year,mean_gdpPercap
Africa,1952,1252.572
Africa,1957,1385.236
Africa,1962,1598.079
Africa,1967,2050.364
Africa,1972,2339.616
Africa,1977,2585.939
Africa,1982,2481.593
Africa,1987,2282.669
Africa,1992,2281.81
Africa,1997,2378.76


In [50]:

mpg <- mpg  # mpg ships with ggplot2; this copies it into the workspace
str(mpg)


Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	234 obs. of  11 variables:
 $ manufacturer: chr  "audi" "audi" "audi" "audi" ...
 $ model       : chr  "a4" "a4" "a4" "a4" ...
 $ displ       : num  1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
 $ year        : int  1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
 $ cyl         : int  4 4 4 4 6 6 6 4 4 4 ...
 $ trans       : chr  "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
 $ drv         : chr  "f" "f" "f" "f" ...
 $ cty         : int  18 21 20 21 16 18 18 18 16 20 ...
 $ hwy         : int  29 29 31 30 26 26 27 26 25 28 ...
 $ fl          : chr  "p" "p" "p" "p" ...
 $ class       : chr  "compact" "compact" "compact" "compact" ...


Grouping by several variables is already quite powerful, but it gets even better! You're not limited to defining one new variable in `summarize()`.


In [33]:
gdp_pop_bycontinents_byyear <- gapminder %>%
    group_by(continent, year) %>%
    summarize(mean_gdpPercap = mean(gdpPercap),
              sd_gdpPercap = sd(gdpPercap),
              mean_pop = mean(pop),
              sd_pop = sd(pop))
head(gdp_pop_bycontinents_byyear)

continent,year,mean_gdpPercap,sd_gdpPercap,mean_pop,sd_pop
Africa,1952,1252.572,982.9521,4570010,6317450
Africa,1957,1385.236,1134.5089,5093033,7076042
Africa,1962,1598.079,1461.8392,5702247,7957545
Africa,1967,2050.364,2847.7176,6447875,8985505
Africa,1972,2339.616,3286.8539,7305376,10130833
Africa,1977,2585.939,4142.3987,8328097,11585184


## Basics
- Use the mpg dataset to create summaries by manufacturer/year for 8 cyl vehicles. 

In [34]:
mpg<-mpg
head(mpg)

manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
audi,a4,2.8,1999,6,manual(m5),f,18,26,p,compact


In [35]:
#This just gives a dataframe with 70 obs, only 8 cylinder cars 
mpg.8cyl<-mpg %>% 
  filter(cyl == 8)
mpg.8cyl


manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
audi,a6 quattro,4.2,2008,8,auto(s6),4,16,23,p,midsize
chevrolet,c1500 suburban 2wd,5.3,2008,8,auto(l4),r,14,20,r,suv
chevrolet,c1500 suburban 2wd,5.3,2008,8,auto(l4),r,11,15,e,suv
chevrolet,c1500 suburban 2wd,5.3,2008,8,auto(l4),r,14,20,r,suv
chevrolet,c1500 suburban 2wd,5.7,1999,8,auto(l4),r,13,17,r,suv
chevrolet,c1500 suburban 2wd,6.0,2008,8,auto(l4),r,12,17,r,suv
chevrolet,corvette,5.7,1999,8,manual(m6),r,16,26,p,2seater
chevrolet,corvette,5.7,1999,8,auto(l4),r,15,23,p,2seater
chevrolet,corvette,6.2,2008,8,manual(m6),r,16,26,p,2seater
chevrolet,corvette,6.2,2008,8,auto(s6),r,15,25,p,2seater


In [52]:
#Filter to only those cars that have 8 cylinders
mpg.8cyl<-mpg %>% 
  filter(cyl == 8)

#Alt Syntax
mpg.8cyl<-filter(mpg, cyl == 8)

mpg.8cyl

manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
audi,a6 quattro,4.2,2008,8,auto(s6),4,16,23,p,midsize
chevrolet,c1500 suburban 2wd,5.3,2008,8,auto(l4),r,14,20,r,suv
chevrolet,c1500 suburban 2wd,5.3,2008,8,auto(l4),r,11,15,e,suv
chevrolet,c1500 suburban 2wd,5.3,2008,8,auto(l4),r,14,20,r,suv
chevrolet,c1500 suburban 2wd,5.7,1999,8,auto(l4),r,13,17,r,suv
chevrolet,c1500 suburban 2wd,6.0,2008,8,auto(l4),r,12,17,r,suv
chevrolet,corvette,5.7,1999,8,manual(m6),r,16,26,p,2seater
chevrolet,corvette,5.7,1999,8,auto(l4),r,15,23,p,2seater
chevrolet,corvette,6.2,2008,8,manual(m6),r,16,26,p,2seater
chevrolet,corvette,6.2,2008,8,auto(s6),r,15,25,p,2seater
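
Putting the pieces together, one possible way to build the manufacturer/year summary for 8-cylinder vehicles is sketched below. The summary column names (`n_cars`, `mean_cty`, `mean_hwy`) and the choice of statistics are assumptions made for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Keep 8-cylinder cars, then summarize within each manufacturer/year group
mpg_8cyl_summary <- mpg %>%
  filter(cyl == 8) %>%
  group_by(manufacturer, year) %>%
  summarize(n_cars   = n(),        # rows in each group
            mean_cty = mean(cty),  # average city mpg
            mean_hwy = mean(hwy))  # average highway mpg

head(mpg_8cyl_summary)
```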


In [54]:
#Sort cars by MPG highway(hwy) then city(cty)
mpgsort<-arrange(mpg, hwy, cty)
mpgsort

manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
dodge,dakota pickup 4wd,4.7,2008,8,auto(l5),4,9,12,e,pickup
dodge,durango 4wd,4.7,2008,8,auto(l5),4,9,12,e,suv
dodge,ram 1500 pickup 4wd,4.7,2008,8,auto(l5),4,9,12,e,pickup
dodge,ram 1500 pickup 4wd,4.7,2008,8,manual(m6),4,9,12,e,pickup
jeep,grand cherokee 4wd,4.7,2008,8,auto(l5),4,9,12,e,suv
chevrolet,k1500 tahoe 4wd,5.3,2008,8,auto(l4),4,11,14,e,suv
jeep,grand cherokee 4wd,6.1,2008,8,auto(l5),4,11,14,p,suv
chevrolet,c1500 suburban 2wd,5.3,2008,8,auto(l4),r,11,15,e,suv
chevrolet,k1500 tahoe 4wd,5.7,1999,8,auto(l4),4,11,15,r,suv
dodge,dakota pickup 4wd,5.2,1999,8,auto(l4),4,11,15,r,pickup
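
`arrange()` sorts in ascending order by default. To sort from highest to lowest, a column can be wrapped in `desc()`. The sketch below uses a hypothetical toy data frame rather than `mpg` so the result is easy to check by eye.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- data.frame(
  model = c("a", "b", "c", "d"),
  hwy   = c(20, 30, 20, 25),
  cty   = c(15, 22, 12, 18)
)

# Highest hwy first; ties on hwy are broken by ascending cty
sorted <- toy %>% arrange(desc(hwy), cty)
sorted
```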


In [55]:
#From the documentation https://cran.r-project.org/web/packages/dplyr/dplyr.pdf  
select(iris, starts_with("petal")) #returns columns that start with "Petal"
select(iris, ends_with("width")) #returns columns that end with "Width"
select(iris, contains("etal"))
select(iris, matches(".t."))
select(iris, Petal.Length, Petal.Width)
vars <- c("Petal.Length", "Petal.Width")
select(iris, one_of(vars))

Petal.Length,Petal.Width
1.4,0.2
1.4,0.2
1.3,0.2
1.5,0.2
1.4,0.2
1.7,0.4
1.4,0.3
1.5,0.2
1.4,0.2
1.5,0.1


Sepal.Width,Petal.Width
3.5,0.2
3.0,0.2
3.2,0.2
3.1,0.2
3.6,0.2
3.9,0.4
3.4,0.3
3.4,0.2
2.9,0.2
3.1,0.1


Petal.Length,Petal.Width
1.4,0.2
1.4,0.2
1.3,0.2
1.5,0.2
1.4,0.2
1.7,0.4
1.4,0.3
1.5,0.2
1.4,0.2
1.5,0.1


Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
5.1,3.5,1.4,0.2
4.9,3.0,1.4,0.2
4.7,3.2,1.3,0.2
4.6,3.1,1.5,0.2
5.0,3.6,1.4,0.2
5.4,3.9,1.7,0.4
4.6,3.4,1.4,0.3
5.0,3.4,1.5,0.2
4.4,2.9,1.4,0.2
4.9,3.1,1.5,0.1


Petal.Length,Petal.Width
1.4,0.2
1.4,0.2
1.3,0.2
1.5,0.2
1.4,0.2
1.7,0.4
1.4,0.3
1.5,0.2
1.4,0.2
1.5,0.1


Petal.Length,Petal.Width
1.4,0.2
1.4,0.2
1.3,0.2
1.5,0.2
1.4,0.2
1.7,0.4
1.4,0.3
1.5,0.2
1.4,0.2
1.5,0.1
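
In recent versions of `dplyr` (1.0.0 and later, via `tidyselect`), `one_of()` is superseded by `all_of()` and `any_of()`. A minimal sketch, assuming such a version is installed:

```r
library(dplyr)

vars <- c("Petal.Length", "Petal.Width")

# all_of() is strict: every name in vars must exist in the data
strict <- select(iris, all_of(vars))

# any_of() is lenient: names that are missing are silently skipped
lenient <- select(iris, any_of(c(vars, "not_a_column")))

names(strict)
names(lenient)
```

`all_of()` is the safer default when a missing column should be treated as a mistake; `any_of()` is handy when the column set may vary between datasets.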


In [56]:
#Recoding Data
# See Creating new variables with mutate and ifelse: 
# https://rstudio-pubs-static.s3.amazonaws.com/116317_e6922e81e72e4e3f83995485ce686c14.html 
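# Note: the factor 61.0237 converts cubic inches to litres. displ in mpg is
# already recorded in litres, so this call only illustrates the mutate() syntax.
mutate(mpg, displ_l = displ / 61.0237)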


manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class,displ_l
audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,0.02949674
audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact,0.02949674
audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact,0.03277415
audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact,0.03277415
audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact,0.04588381
audi,a4,2.8,1999,6,manual(m5),f,18,26,p,compact,0.04588381
audi,a4,3.1,2008,6,auto(av),f,18,27,p,compact,0.05079994
audi,a4 quattro,1.8,1999,4,manual(m5),4,18,26,p,compact,0.02949674
audi,a4 quattro,1.8,1999,4,auto(l5),4,16,25,p,compact,0.02949674
audi,a4 quattro,2.0,2008,4,manual(m6),4,20,28,p,compact,0.03277415


In [59]:
# Example taken from David Ranzolin
# https://rstudio-pubs-static.s3.amazonaws.com/116317_e6922e81e72e4e3f83995485ce686c14.html#/9 
section <- c("MATH111", "MATH111", "ENG111")
grade <- c(78, 93, 56)
student <- c("David", "Kristina", "Mycroft")
gradebook <- data.frame(section, grade, student)

#mutate() returns a new data frame rather than modifying gradebook, so here we save each intermediate version.
gradebook2<-mutate(gradebook, Pass.Fail = ifelse(grade > 60, "Pass", "Fail"))  

gradebook3<-mutate(gradebook2, letter = ifelse(grade %in% 60:69, "D",
                                               ifelse(grade %in% 70:79, "C",
                                                      ifelse(grade %in% 80:89, "B",
                                                             ifelse(grade %in% 90:99, "A", "F")))))

gradebook3

section,grade,student,Pass.Fail,letter
MATH111,78,David,Pass,C
MATH111,93,Kristina,Pass,A
ENG111,56,Mycroft,Fail,F


In [61]:
#Here we are using piping to do this more effectively. 
gradebook4<-gradebook %>%
mutate(Pass.Fail = ifelse(grade > 60, "Pass", "Fail"))  %>%
mutate(letter = ifelse(grade %in% 60:69, "D", 
                                  ifelse(grade %in% 70:79, "C",
                                         ifelse(grade %in% 80:89, "B",
                                                ifelse(grade %in% 90:99, "A", "F")))))


gradebook4

section,grade,student,Pass.Fail,letter
MATH111,78,David,Pass,C
MATH111,93,Kristina,Pass,A
ENG111,56,Mycroft,Fail,F
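
Deeply nested `ifelse()` calls can be hard to read. One alternative is `dplyr`'s `case_when()`, sketched below on the same gradebook. Note that this version uses `>=` thresholds instead of `%in%` ranges, so it behaves differently at the edges: a grade of 100 or a non-integer grade such as 89.5 gets a letter here, whereas the nested `ifelse()` version would assign "F".

```r
library(dplyr)

gradebook <- data.frame(
  section = c("MATH111", "MATH111", "ENG111"),
  grade   = c(78, 93, 56),
  student = c("David", "Kristina", "Mycroft")
)

# Conditions are checked top to bottom; the first match wins
gradebook5 <- gradebook %>%
  mutate(letter = case_when(
    grade >= 90 ~ "A",
    grade >= 80 ~ "B",
    grade >= 70 ~ "C",
    grade >= 60 ~ "D",
    TRUE        ~ "F"   # fallback when no condition above matched
  ))

gradebook5
```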


In [63]:
#find the average city and highway mpg
summarise(mpg, mean(cty), mean(hwy))
#find the average city and highway mpg by cylinder
summarise(group_by(mpg, cyl), mean(cty), mean(hwy))
summarise(group_by(mtcars, cyl), m = mean(disp), sd = sd(disp))

# With data frames, you can create and immediately use summaries
by_cyl <- mtcars %>% group_by(cyl)
by_cyl %>% summarise(a = n(), b = a + 1)

mean(cty),mean(hwy)
16.85897,23.44017


cyl,mean(cty),mean(hwy)
4,21.01235,28.80247
5,20.5,28.75
6,16.21519,22.82278
8,12.57143,17.62857


cyl,m,sd
4,105.1364,26.87159
6,183.3143,41.56246
8,353.1,67.77132


cyl,a,b
4,11,12
6,7,8
8,14,15


#This was adapted from the Berkeley R Bootcamp. 