Introduction to R - Merging and Aggregating Data

rpi.analyticsdojo.com

Overview

Merging Dataframes
Aggregating Dataframes
Advanced Functions

Merging Data Frame with Vector

A vector can be combined with a data frame in more than one way.
data.frame(a, b), where a and b can be vectors, matrices, or data frames.

#Sample data: a shared key and two data frames built from random matrices.
key <- 1:10

#The row names and column names are passed to matrix() as a list via dimnames.
m<- data.frame(matrix(rnorm(40, mean=20, sd=5), nrow=10, ncol=4, dimnames=list((1:10),c("a","b","c","d"))))
m2<- data.frame(matrix(rnorm(40, mean=1000, sd=5), nrow=10, ncol=4, dimnames=list((1:10),c("e","f","g","h"))))

#This is one way of combining a vector with a data frame.
df<-  data.frame(key,m)
df2<- data.frame(key,m2)

#This is another way of combining a vector with a data frame.
dfb<-  cbind(key,m)
df2b<- cbind(key,m2)

df
df2
dfb
df2b

key	a	b	c	d
1	16.19873	14.41495	27.73225	15.564688
2	27.76677	18.54772	21.06688	15.697810
3	11.68592	14.91207	22.82086	15.666790
4	16.39982	28.30284	10.97550	19.083633
5	22.16232	16.82574	14.28676	20.162797
6	17.17425	14.36932	18.55487	13.498498
7	20.15380	18.00987	15.99028	14.325000
8	20.68866	12.83505	25.24119	24.538494
9	18.84664	24.01079	12.69775	8.095156
10	16.29913	21.51270	15.14676	23.722103

key	e	f	g	h
1	1004.1240	997.4379	997.5697	1000.8540
2	1002.6933	998.4041	1009.1720	1010.4120
3	995.9138	1001.0959	1004.6025	1002.5405
4	999.5493	998.8054	1003.9649	1000.0133
5	1007.2373	1006.2580	1000.1882	992.9980
6	1000.2068	994.7482	998.2876	1002.7093
7	999.1622	998.6231	998.7175	998.0497
8	1003.1263	1002.7279	1004.1623	1000.5204
9	1003.1548	994.2030	1002.1614	999.3726
10	1006.8744	1004.2677	998.8720	993.5726

key	a	b	c	d
1	16.19873	14.41495	27.73225	15.564688
2	27.76677	18.54772	21.06688	15.697810
3	11.68592	14.91207	22.82086	15.666790
4	16.39982	28.30284	10.97550	19.083633
5	22.16232	16.82574	14.28676	20.162797
6	17.17425	14.36932	18.55487	13.498498
7	20.15380	18.00987	15.99028	14.325000
8	20.68866	12.83505	25.24119	24.538494
9	18.84664	24.01079	12.69775	8.095156
10	16.29913	21.51270	15.14676	23.722103

key	e	f	g	h
1	1004.1240	997.4379	997.5697	1000.8540
2	1002.6933	998.4041	1009.1720	1010.4120
3	995.9138	1001.0959	1004.6025	1002.5405
4	999.5493	998.8054	1003.9649	1000.0133
5	1007.2373	1006.2580	1000.1882	992.9980
6	1000.2068	994.7482	998.2876	1002.7093
7	999.1622	998.6231	998.7175	998.0497
8	1003.1263	1002.7279	1004.1623	1000.5204
9	1003.1548	994.2030	1002.1614	999.3726
10	1006.8744	1004.2677	998.8720	993.5726
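
Because rnorm draws new random numbers on every run, the values printed above change each time the code is executed. The following is a minimal sketch, assuming a smaller 5-row frame and an arbitrary seed of 42 (both are illustrative choices, not part of the original material), of how set.seed makes the draws repeatable and how the two ways of combining can be compared:

```r
# Fix the random number generator so reruns give the same values.
# The seed (42) and the 5 x 2 shape are arbitrary choices for illustration.
set.seed(42)
key <- 1:5
m <- data.frame(matrix(rnorm(10, mean = 20, sd = 5), nrow = 5, ncol = 2,
                       dimnames = list(1:5, c("a", "b"))))

df_a <- data.frame(key, m)  # combine the vector and data frame with data.frame()
df_b <- cbind(key, m)       # combine the same pieces with cbind()

# Compare the column names and the contents of the two results.
print(names(df_a))
print(names(df_b))
print(all.equal(df_a, df_b))
```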

Merging Columns of Data Frame with another Data Frame

Data frames can be combined column-wise in multiple ways.
merge(a,b,by="key"), where a & b are data frames that share a key column; rows are matched on the key.
cbind(a,b), where a & b are data frames with the same number of rows; rows are matched by position.

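# merge() matches rows by the shared key column, so "key" appears once in the result.
df3 <- merge(df,df2,by="key")
# cbind() just does a "column bind" by position, so "key" appears twice.
df4<- cbind(df,df2)
# data.frame() also binds by position but renames the duplicate column to key.1.
df5<- data.frame(df,df2)
df3
df4
df5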

key	a	b	c	d	e	f	g	h
1	16.278799	23.004297	19.262524	22.11648	1004.4714	995.4055	1001.5156	1004.8862
2	19.287252	19.229253	18.817575	13.67939	1000.7399	1004.6248	995.8950	1008.5242
3	18.833623	16.142004	18.454224	18.29241	994.3794	1002.2578	998.6004	999.7609
4	23.780847	10.934207	13.448540	17.83936	1001.4795	1009.0885	1002.5866	998.8287
5	13.982935	16.924402	19.037475	20.19748	993.9745	999.9868	1001.0336	987.3751
6	15.534589	23.437320	6.795926	20.19305	996.1284	1008.8440	1005.5196	1003.6926
7	16.660076	18.315077	32.107139	23.35534	994.5026	1004.9990	1004.0972	1005.6532
8	19.447799	18.278384	11.823108	13.09162	1007.8858	993.8745	1005.1093	996.8686
9	9.225069	24.925796	13.868021	17.06181	997.6026	1001.1045	991.7969	1000.5898
10	25.809451	7.492747	18.483003	24.99244	995.6190	1010.2642	998.6192	998.8618

key	a	b	c	d	key	e	f	g	h
1	16.278799	23.004297	19.262524	22.11648	1	1004.4714	995.4055	1001.5156	1004.8862
2	19.287252	19.229253	18.817575	13.67939	2	1000.7399	1004.6248	995.8950	1008.5242
3	18.833623	16.142004	18.454224	18.29241	3	994.3794	1002.2578	998.6004	999.7609
4	23.780847	10.934207	13.448540	17.83936	4	1001.4795	1009.0885	1002.5866	998.8287
5	13.982935	16.924402	19.037475	20.19748	5	993.9745	999.9868	1001.0336	987.3751
6	15.534589	23.437320	6.795926	20.19305	6	996.1284	1008.8440	1005.5196	1003.6926
7	16.660076	18.315077	32.107139	23.35534	7	994.5026	1004.9990	1004.0972	1005.6532
8	19.447799	18.278384	11.823108	13.09162	8	1007.8858	993.8745	1005.1093	996.8686
9	9.225069	24.925796	13.868021	17.06181	9	997.6026	1001.1045	991.7969	1000.5898
10	25.809451	7.492747	18.483003	24.99244	10	995.6190	1010.2642	998.6192	998.8618

key	a	b	c	d	key.1	e	f	g	h
1	16.278799	23.004297	19.262524	22.11648	1	1004.4714	995.4055	1001.5156	1004.8862
2	19.287252	19.229253	18.817575	13.67939	2	1000.7399	1004.6248	995.8950	1008.5242
3	18.833623	16.142004	18.454224	18.29241	3	994.3794	1002.2578	998.6004	999.7609
4	23.780847	10.934207	13.448540	17.83936	4	1001.4795	1009.0885	1002.5866	998.8287
5	13.982935	16.924402	19.037475	20.19748	5	993.9745	999.9868	1001.0336	987.3751
6	15.534589	23.437320	6.795926	20.19305	6	996.1284	1008.8440	1005.5196	1003.6926
7	16.660076	18.315077	32.107139	23.35534	7	994.5026	1004.9990	1004.0972	1005.6532
8	19.447799	18.278384	11.823108	13.09162	8	1007.8858	993.8745	1005.1093	996.8686
9	9.225069	24.925796	13.868021	17.06181	9	997.6026	1001.1045	991.7969	1000.5898
10	25.809451	7.492747	18.483003	24.99244	10	995.6190	1010.2642	998.6192	998.8618
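
In the example above both data frames list the keys in the same order, so merge and cbind line the rows up identically. The difference only shows when the keys are ordered differently or only partly overlap. A small sketch, using two hypothetical data frames (left and right are made up for illustration):

```r
# Two toy data frames whose keys are in a different order and only partly overlap.
left  <- data.frame(key = c(1, 2, 3), a = c(10, 20, 30))
right <- data.frame(key = c(3, 1, 4), e = c(300, 100, 400))

# Default merge keeps only keys present in both data frames (1 and 3).
inner <- merge(left, right, by = "key")

# all = TRUE keeps every key and fills the gaps with NA.
outer <- merge(left, right, by = "key", all = TRUE)

# cbind pairs rows purely by position, ignoring the key values.
bound <- cbind(left, right)

print(inner)
print(outer)
print(bound)
```

Here cbind places key 1 from left next to key 3 from right, which is why merge is the safer choice whenever row order is not guaranteed.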

Merging Rows of Data Frame with another Data Frame

rbind(a,b) stacks the rows of data frames a and b.
rbind(a,b, make.row.names=FALSE) discards the original row names and numbers the rows sequentially.

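#Here we can combine rows with rbind.
df5<-df
#By default rbind keeps the original row names and alters duplicates to keep them unique.
df6<-rbind(df,df5)
df6
#With make.row.names=FALSE the rows are simply numbered from 1 to n.
df7<-rbind(df,df5, make.row.names=FALSE)
df7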

	key	a	b	c	d
1	1	16.19873	14.41495	27.73225	15.564688
2	2	27.76677	18.54772	21.06688	15.697810
3	3	11.68592	14.91207	22.82086	15.666790
4	4	16.39982	28.30284	10.97550	19.083633
5	5	22.16232	16.82574	14.28676	20.162797
6	6	17.17425	14.36932	18.55487	13.498498
7	7	20.15380	18.00987	15.99028	14.325000
8	8	20.68866	12.83505	25.24119	24.538494
9	9	18.84664	24.01079	12.69775	8.095156
10	10	16.29913	21.51270	15.14676	23.722103
11	1	16.19873	14.41495	27.73225	15.564688
21	2	27.76677	18.54772	21.06688	15.697810
31	3	11.68592	14.91207	22.82086	15.666790
41	4	16.39982	28.30284	10.97550	19.083633
51	5	22.16232	16.82574	14.28676	20.162797
61	6	17.17425	14.36932	18.55487	13.498498
71	7	20.15380	18.00987	15.99028	14.325000
81	8	20.68866	12.83505	25.24119	24.538494
91	9	18.84664	24.01079	12.69775	8.095156
101	10	16.29913	21.51270	15.14676	23.722103

key	a	b	c	d
1	16.19873	14.41495	27.73225	15.564688
2	27.76677	18.54772	21.06688	15.697810
3	11.68592	14.91207	22.82086	15.666790
4	16.39982	28.30284	10.97550	19.083633
5	22.16232	16.82574	14.28676	20.162797
6	17.17425	14.36932	18.55487	13.498498
7	20.15380	18.00987	15.99028	14.325000
8	20.68866	12.83505	25.24119	24.538494
9	18.84664	24.01079	12.69775	8.095156
10	16.29913	21.51270	15.14676	23.722103
1	16.19873	14.41495	27.73225	15.564688
2	27.76677	18.54772	21.06688	15.697810
3	11.68592	14.91207	22.82086	15.666790
4	16.39982	28.30284	10.97550	19.083633
5	22.16232	16.82574	14.28676	20.162797
6	17.17425	14.36932	18.55487	13.498498
7	20.15380	18.00987	15.99028	14.325000
8	20.68866	12.83505	25.24119	24.538494
9	18.84664	24.01079	12.69775	8.095156
10	16.29913	21.51270	15.14676	23.722103
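
For data frames, rbind matches columns by name rather than by position, so the two inputs need the same column names but not the same column order. A minimal sketch with two hypothetical data frames (top and bottom are made up for illustration):

```r
# Same column names, but listed in a different order.
top    <- data.frame(key = 1:2, a = c(1.5, 2.5))
bottom <- data.frame(a = c(3.5, 4.5), key = 3:4)

# rbind lines the columns up by name and uses the column order of the first argument.
stacked <- rbind(top, bottom)
print(stacked)
```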


`aggregate` and `by`

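Aggregation summarizes data by group, for example the mean of each measurement per species.
It is useful when variables/analyses happen at different levels (e.g., individual observations vs. groups).
aggregate(x, by, FUN) returns a data frame with one row per group.
by(data, INDICES, FUN) provides similar functionality but returns one result per group in a list-like object.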

iris=read.csv(file="../../input/iris.csv", header=TRUE,sep=",")
head(iris)

sepal_length	sepal_width	petal_length	petal_width	species
5.1	3.5	1.4	0.2	setosa
4.9	3.0	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5.0	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa

iris<-read.csv(file="../../input/iris.csv", header=TRUE,sep=",")

#Aggregate by Species  aggregate(x, by, FUN, ...)
iris.agg<-aggregate(iris[,1:4], by=list("species" = iris$species), mean)
print(iris.agg)

#Notice by() gives us the same means, but printed as one block per species instead of a data frame.
by(iris[, 1:4], iris$species, colMeans)

     species sepal_length sepal_width petal_length petal_width
   setosa        5.006       3.418        1.464       0.244
versicolor        5.936       2.770        4.260       1.326
virginica        6.588       2.974        5.552       2.026

iris$species: setosa
sepal_length  sepal_width petal_length  petal_width 
       5.006        3.418        1.464        0.244 
------------------------------------------------------------ 
iris$species: versicolor
sepal_length  sepal_width petal_length  petal_width 
       5.936        2.770        4.260        1.326 
------------------------------------------------------------ 
iris$species: virginica
sepal_length  sepal_width petal_length  petal_width 
       6.588        2.974        5.552        2.026 
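
aggregate also accepts a formula, which some readers find easier to read than the by=list(...) form. The sketch below assumes R's built-in datasets::iris rather than the CSV file used above, so the column names (Sepal.Length, Species) differ from the ones in this section:

```r
# Built-in copy of the iris data; its column names use dots and capital letters.
flowers <- datasets::iris

# Formula interface: "every other column, grouped by Species".
agg_formula <- aggregate(. ~ Species, data = flowers, FUN = mean)

# Equivalent call using the x / by / FUN form shown earlier.
agg_by <- aggregate(flowers[, 1:4], by = list(Species = flowers$Species), FUN = mean)

print(agg_formula)
print(all.equal(agg_formula$Sepal.Length, agg_by$Sepal.Length))
```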

`apply` (plus `lapply`/`sapply`/`tapply`/`rapply`)

apply - Apply a function over the rows or columns of an array or matrix, return a vector, array, or list of values. apply(X, MARGIN, FUN, ...)
lapply - Apply a function to each element of a list or vector, return a list.
sapply - A user-friendly version of lapply. Apply a function to each element of a list or vector, return a vector (or matrix) when the results can be simplified.
tapply - Apply a function to subsets of a vector (the subsets are defined by some other vector, usually a factor), return an array with one entry per subset.
rapply - Apply a function to each element of a nested list structure, recursively, return a list.
These are useful when a function isn't vectorized, or when you want to use a function on every row or column of a matrix/data frame, every element of a list, etc.
For more info see this tutorial
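
The list above can be made concrete with one small call per function. This is only a sketch on made-up inputs (m, values, and the grouping letters are hypothetical), meant to show what kind of object each function returns:

```r
m      <- matrix(1:6, nrow = 2)                # 2 x 3 matrix
values <- list(count = 1:5, numbers = 5:10)    # simple named list

row_sums <- apply(m, 1, sum)          # one value per row of the matrix
as_list  <- lapply(values, sum)       # always returns a list
as_vec   <- sapply(values, sum)       # simplified to a named vector
by_group <- tapply(c(1, 2, 3, 4), c("a", "a", "b", "b"), sum)  # one value per group

# rapply walks a nested list; how = "list" keeps the nesting in the result.
nested <- rapply(list(1, list(2, 3)), function(x) x * 10, how = "list")

str(row_sums)
str(as_list)
str(as_vec)
str(by_group)
str(nested)
```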

`apply`

apply - Apply a function over the rows or columns of an array or matrix, return a vector, array, or list of values. apply(X, MARGIN, FUN, ...)
If you pass a data frame it is converted to a matrix first, so the columns should all have the same data type.
apply(X, MARGIN, FUN, …) where X is an array or matrix.
MARGIN is a vector giving the dimensions over which the function should be applied. E.g., for a matrix 1 indicates rows, 2 indicates columns, c(1, 2) indicates rows and columns.
FUN is any function.

iris<-read.csv(file="../../input/iris.csv", header=TRUE,sep=",")
iris$sum<-apply(iris[1:4], 1, sum) #This provides a sum across columns for each row.
iris$mean<-apply(iris[1:4], 1, mean) #This provides a mean across columns for each row.
head(iris)
apply(iris[1:4], 2, mean) #This provides a mean for each column.

sepal_length	sepal_width	petal_length	petal_width	species	sum	mean
5.1	3.5	1.4	0.2	setosa	10.2	2.550
4.9	3.0	1.4	0.2	setosa	9.5	2.375
4.7	3.2	1.3	0.2	setosa	9.4	2.350
4.6	3.1	1.5	0.2	setosa	9.4	2.350
5.0	3.6	1.4	0.2	setosa	10.2	2.550
5.4	3.9	1.7	0.4	setosa	11.4	2.850

sepal_length	sepal_width	petal_length	petal_width
5.84333333333333	3.054	3.75866666666667	1.19866666666667
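
The requirement that all columns share one type follows from apply converting a data frame to a matrix before doing anything else. A short sketch on a hypothetical data frame (mixed is made up for illustration) shows the effect and the usual workaround of selecting only the numeric columns:

```r
# Toy data frame with two numeric columns and one character column.
mixed <- data.frame(x = c(1, 2), y = c(3, 4), label = c("a", "b"))

# With a character column present, the matrix apply() works on is all character.
print(is.character(as.matrix(mixed)))

# Selecting only the numeric columns keeps the matrix numeric.
numeric_cols <- mixed[, c("x", "y")]
row_means <- apply(numeric_cols, 1, mean)   # MARGIN = 1: mean of x and y within each row
col_means <- apply(numeric_cols, 2, mean)   # MARGIN = 2: mean of each column

print(row_means)
print(col_means)
```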

`lapply` & `sapply`

lapply - Apply a function to each element of a list or vector, return a list.
lapply(X, FUN, ...)
sapply - A user-friendly version of lapply. Apply a function to each element of a list or vector, return a vector (or matrix) when the results can be simplified.
sapply(X, FUN, ...)

# create a list with 2 elements
sample <- list("count" = 1:5, "numbers" =5:10)

# sum each and return as a list. 
sample.sum<-lapply(sample, sum)

class(sample.sum)
print(c(sample.sum, sample.sum["numbers"],sample.sum["count"]))

'list'

$count
[1] 15

$numbers
[1] 45

$numbers
[1] 45

$count
[1] 15

# create a list with 2 elements
sample <- list("count" = 1:5, "numbers" =5:10)

# sum each and return as a vector.
sample.sum<-sapply(sample, sum)

class(sample.sum)
print(c(sample.sum, sample.sum["numbers"],sample.sum["count"],sample.sum[["count"]]))

#Note the difference between sample.sum[["count"]] (value only) and sample.sum["count"] (value with its name).

'integer'

  count numbers numbers   count         
     15      45      45      15      15 
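
The unnamed last column in the output above comes from the double-bracket form, which returns the value without its name. A minimal sketch (the object names totals and totals_list are hypothetical) of the difference on both a vector and a list:

```r
values <- list(count = 1:5, numbers = 5:10)

# On the named vector returned by sapply:
totals <- sapply(values, sum)
single_bracket <- totals["count"]     # length-one vector that keeps the name
double_bracket <- totals[["count"]]   # the bare value, name dropped
print(names(single_bracket))
print(names(double_bracket))

# On the list returned by lapply the difference is larger:
totals_list <- lapply(values, sum)
print(class(totals_list["count"]))    # still a list containing one element
print(class(totals_list[["count"]]))  # the element itself
```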

# We can also define a simple function of our own.
square<-function(x) x^2
square(1:5)

# We can use our own function here.     
sapply(1:10, square)

#We can also specify the function directly in sapply.
sapply(1:10, function(x) x^2)

[1]  1  4  9 16 25

[1]   1   4   9  16  25  36  49  64  81 100

[1]   1   4   9  16  25  36  49  64  81 100
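
The ... in sapply(X, FUN, ...) passes extra arguments through to FUN. A short sketch, assuming a hypothetical helper named power that is not part of the original material:

```r
# Hypothetical helper: raise x to the power p.
power <- function(x, p) x^p

# p = 3 is forwarded to power() for every element of 1:5.
cubes <- sapply(1:5, power, p = 3)
print(cubes)

# ^ is already vectorized, so the same values are available without sapply.
print(all(cubes == (1:5)^3))
```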

`tapply`

tapply - Apply a function to subsets of a vector (the subsets are defined by some other vector, usually a factor), return an array with one entry per subset.
For a single variable this gives results similar to aggregate.

#tapply example
#tapply(X, INDEX, FUN, …)
#X = a vector, INDEX = a list of one or more factors, FUN = the function to apply to each subset.
iris<-read.csv(file="../../input/iris.csv", header=TRUE,sep=",")
iris.sepal_length.agg<-tapply(iris$sepal_length, iris$species, mean)
print(iris.sepal_length.agg)

    setosa versicolor  virginica 
     5.006      5.936      6.588 
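
To see how tapply relates to aggregate, the two can be run side by side. This sketch assumes R's built-in datasets::iris instead of the CSV file, so the columns are named Sepal.Length and Species:

```r
# Built-in copy of the iris data; its column names differ from the CSV used above.
flowers <- datasets::iris

# tapply returns a named array with one entry per species.
by_tapply <- tapply(flowers$Sepal.Length, flowers$Species, mean)

# aggregate returns a data frame with one row per species.
by_aggregate <- aggregate(flowers["Sepal.Length"],
                          by = list(Species = flowers$Species), FUN = mean)

print(by_tapply)
print(by_aggregate)

# The group means agree; only the container differs.
print(all.equal(as.vector(by_tapply), by_aggregate$Sepal.Length))
```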

CREDITS

Copyright AnalyticsDojo 2016. This work is licensed under the Creative Commons Attribution 4.0 International license agreement. This work is adapted from the Berkeley R Bootcamp.
