Introduction to R - DataFrames

rpi.analyticsdojo.com

Introduction to R DataFrames

Data frames are collections of vectors of the same length; the vectors can be of different types.
A data frame is a special type of list.
Data frames are used for standard rectangular (record by field) datasets, similar to a spreadsheet.
Data frames set R apart from some languages (e.g., Matlab) and provide functionality similar to some statistical packages (e.g., Stata, SAS) and Python's pandas package.
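To make the "list of equal-length vectors" idea concrete, here is a minimal sketch that builds a small data frame by hand. The name `toy` and its values are made up for illustration and are not part of the iris data used in the rest of this section.

```r
# A hypothetical three-row data frame built from vectors of different types.
toy <- data.frame(
  id    = 1:3,                 # integer vector
  score = c(2.5, 3.0, 4.5),    # numeric (double) vector
  label = c("a", "b", "c"),    # character vector
  stringsAsFactors = FALSE     # keep label as character in every R version
)

is.list(toy)          # a data frame is a list under the hood
length(toy)           # number of list elements, i.e. columns
sapply(toy, class)    # each column keeps its own type
toy[["score"]]        # list-style access returns the underlying vector
```

Because every column is an ordinary vector, anything that works on a vector (such as `mean()` or `class()`) can be applied to a single column.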

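# stringsAsFactors = TRUE reads text columns as factors, matching the output below.
# (Since R 4.0.0 the default is FALSE, which would read species as character.)
frame <- read.csv(file = "../../input/iris.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
class(frame)
head(frame) # The first few rows.
tail(frame) # The last few rows.
str(frame)  # The structure.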

'data.frame'

sepal_length	sepal_width	petal_length	petal_width	species
5.1	3.5	1.4	0.2	setosa
4.9	3.0	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5.0	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa

	sepal_length	sepal_width	petal_length	petal_width	species
145	6.7	3.3	5.7	2.5	virginica
146	6.7	3.0	5.2	2.3	virginica
147	6.3	2.5	5.0	1.9	virginica
148	6.5	3.0	5.2	2.0	virginica
149	6.2	3.4	5.4	2.3	virginica
150	5.9	3.0	5.1	1.8	virginica

'data.frame':	150 obs. of  5 variables:
 $ sepal_length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ sepal_width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ petal_length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ petal_width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

dim(frame)                # Rows x columns
nrow(frame)               # The number of rows
names(frame)              # The column names
length(frame)             # The number of columns
summary(frame)            # Summary statistics for each column
is.matrix(frame)          # FALSE: a data frame is a list of columns, not a matrix
is.list(frame)            # TRUE
class(frame$sepal_length) # The type of a numeric column
class(frame$species)      # The type of a categorical column
levels(frame$species)     # The distinct categories of the factor

150 5

150

'sepal_length' 'sepal_width' 'petal_length' 'petal_width' 'species'

5

  sepal_length    sepal_width     petal_length    petal_width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.054   Mean   :3.759   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

FALSE

TRUE

'numeric'

'factor'

'setosa' 'versicolor' 'virginica'

frame[c("species","sepal_width")]

species	sepal_width
setosa	3.5
setosa	3.0
setosa	3.2
setosa	3.1
setosa	3.6
setosa	3.9
setosa	3.4
setosa	3.4
setosa	2.9
setosa	3.1
setosa	3.7
setosa	3.4
setosa	3.0
setosa	3.0
setosa	4.0
setosa	4.4
setosa	3.9
setosa	3.5
setosa	3.8
setosa	3.8
setosa	3.4
setosa	3.7
setosa	3.6
setosa	3.3
setosa	3.4
setosa	3.0
setosa	3.4
setosa	3.5
setosa	3.4
setosa	3.2
⋮	⋮
virginica	3.2
virginica	2.8
virginica	2.8
virginica	2.7
virginica	3.3
virginica	3.2
virginica	2.8
virginica	3.0
virginica	2.8
virginica	3.0
virginica	2.8
virginica	3.8
virginica	2.8
virginica	2.8
virginica	2.6
virginica	3.0
virginica	3.4
virginica	3.1
virginica	3.0
virginica	3.1
virginica	3.1
virginica	3.1
virginica	2.7
virginica	3.2
virginica	3.3
virginica	3.0
virginica	2.5
virginica	3.0
virginica	3.4
virginica	3.0

frame['petals']<-0
frame$petals2<-0
head(frame)

sepal_length	sepal_width	petal_length	petal_width	species
5.1	3.5	1.4	0.2	setosa
4.9	3.0	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5.0	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa
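Assigning a single value, as in the code above, recycles it down every row. A new column can also be computed from existing ones. The sketch below uses a made-up two-row data frame (`toy`) and hypothetical `flag` and `petal_area` columns; none of these appear in the original notebook.

```r
# Hypothetical two-row data frame with iris-style column names.
toy <- data.frame(
  petal_length = c(1.4, 4.7),
  petal_width  = c(0.2, 1.4)
)

toy$flag <- 0                                          # a scalar is recycled to every row
toy$petal_area <- toy$petal_length * toy$petal_width   # element-wise product of two columns
toy[["flag"]] <- NULL                                  # assigning NULL removes a column

names(toy)
```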

mean.sepalLength.all <- mean(frame[, 'sepal_length']) # Mean sepal length across all species

Slicing a Dataframe by Column

Remember the syntax df[rows, columns].
Using dataframe$column provides one way of selecting a column.
We can also specify the index position: dataframe[, columnIndex].
We can also specify the column name: dataframe[, columnName].

sepal_length1 <- frame$sepal_length         # Using the dollar sign and the column name.
sepal_length2 <- frame[, 1]                 # Using the index location.
sepal_length3 <- frame[, 'sepal_length']    # Using the column name.
sepal_length4 <- frame[, c('sepal_length', 'sepal_width')] # Two columns, so the result is a data frame.

sepal_length1[1:5]  # Print the first 5 values.
sepal_length2[1:5]
sepal_length3[1:5]

5.1 4.9 4.7 4.6 5

5.1 4.9 4.7 4.6 5

5.1 4.9 4.7 4.6 5
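The selection forms listed above do not all return the same kind of object: some give a plain vector and some give a one-column data frame. The following is a small sketch of that difference, using a made-up three-row data frame (`toy`) rather than the iris file.

```r
# Hypothetical data frame with two numeric columns.
toy <- data.frame(
  sepal_length = c(5.1, 4.9, 4.7),
  sepal_width  = c(3.5, 3.0, 3.2)
)

v1 <- toy$sepal_length                       # numeric vector
v2 <- toy[, "sepal_length"]                  # numeric vector (a single column is dropped to a vector)
d1 <- toy["sepal_length"]                    # one-column data frame (list-style indexing)
d2 <- toy[, "sepal_length", drop = FALSE]    # one-column data frame (drop = FALSE keeps the structure)

class(v1); class(v2)
class(d1); class(d2)
identical(v1, v2)
```

Using `drop = FALSE` is a common choice when later code expects a data frame regardless of how many columns were selected.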

Selecting Rows

We can select rows from a data frame using index positions: dataframe[rowIndex, columnIndex]. Leaving columnIndex empty keeps all columns.
Use c(row1, row2, row3) to select specific rows.

frame2 <- frame[1:20, ]       # Rows 1 through 20, all columns
frame3 <- frame[c(1, 5, 6), ] # This selects specific rows
nrow(frame2)
nrow(frame3)
frame3

	sepal_length	sepal_width	petal_length	petal_width	species
1	5.1	3.5	1.4	0.2	setosa
5	5.0	3.6	1.4	0.2	setosa
6	5.4	3.9	1.7	0.4	setosa
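Row indices can also be computed rather than typed out. As a sketch, assuming you want the first half of the rows, every other row, or everything except one row, here is a made-up six-row data frame (`toy`; all names are illustrative):

```r
# Hypothetical six-row data frame.
toy <- data.frame(id = 1:6, value = c(10, 20, 30, 40, 50, 60))

half        <- nrow(toy) %/% 2                    # integer division
first_half  <- toy[seq_len(half), ]               # rows 1 to half
every_other <- toy[seq(1, nrow(toy), by = 2), ]   # rows 1, 3, 5
without_2   <- toy[-2, ]                          # a negative index drops that row

nrow(first_half)
rownames(without_2)   # original row labels are kept, so "2" is missing
```

`seq_len(half)` is used instead of `1:half` because it returns an empty sequence when `half` is 0, whereas `1:0` would count down.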

Conditional Statements and Dataframes with Subset

We can select a subset of a data frame by passing a logical condition on its columns to subset().
The result of subset() is also a data frame.
Columns can optionally be selected with select = c(col1, col2).

setosa.df <- subset(frame, species == 'setosa')

head(setosa.df)
class(setosa.df)
nrow(setosa.df)
mean.sepalLength.setosa <- mean(setosa.df$sepal_length) # A single number: the mean sepal length for setosa
mean.sepalLength.setosa
setosa.df.highSepalLength <- subset(setosa.df, sepal_length > mean.sepalLength.setosa)
nrow(setosa.df.highSepalLength)
head(setosa.df.highSepalLength)
setosa.df.highSepalLength2 <- subset(setosa.df, sepal_length > mean.sepalLength.setosa, select = c(sepal_length, species))
head(setosa.df.highSepalLength2)

sepal_length	sepal_width	petal_length	petal_width	species
5.1	3.5	1.4	0.2	setosa
4.9	3.0	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5.0	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa

'data.frame'

5.006

	sepal_length	sepal_width	petal_length	petal_width	species
1	5.1	3.5	1.4	0.2	setosa
6	5.4	3.9	1.7	0.4	setosa
11	5.4	3.7	1.5	0.2	setosa
15	5.8	4.0	1.2	0.2	setosa
16	5.7	4.4	1.5	0.4	setosa
17	5.4	3.9	1.3	0.4	setosa

	sepal_length	species
1	5.1	setosa
6	5.4	setosa
11	5.4	setosa
15	5.8	setosa
16	5.7	setosa
17	5.4	setosa
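Conditions passed to subset() can be combined, and select can drop columns as well as keep them. The following is a minimal sketch on a made-up four-row data frame (`toy`); the thresholds are arbitrary and chosen only for illustration.

```r
# Hypothetical data frame with iris-style column names.
toy <- data.frame(
  sepal_length = c(5.1, 4.9, 6.3, 5.8),
  sepal_width  = c(3.5, 3.0, 3.3, 2.7),
  species      = c("setosa", "setosa", "virginica", "virginica")
)

# Both conditions must hold (&); use | when either one is enough.
long_setosa <- subset(toy, species == "setosa" & sepal_length > 5)

# A negative name in select drops that column and keeps the rest.
no_width <- subset(toy, sepal_length > 5, select = -sepal_width)

long_setosa
names(no_width)
```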

Subsetting Rows Using Indices

Just as in pandas, we can put a conditional statement in the row position to select the rows for which it is true.
See here for good coverage and examples.

setosa.df <- frame[frame$species == "setosa", ]
head(setosa.df)
class(setosa.df)
nrow(setosa.df)
mean.sepalLength.setosa <- mean(setosa.df$sepal_length) # A single number: the mean sepal length for setosa
mean.sepalLength.setosa
setosa.df.highSepalLength <- setosa.df[setosa.df$sepal_length > mean.sepalLength.setosa, ]
nrow(setosa.df.highSepalLength)
head(setosa.df.highSepalLength)

	sepal_length	sepal_width	petal_length	petal_width	species
1	5.1	3.5	1.4	0.2	setosa
2	4.9	3	1.4	0.2	setosa
3	4.7	3.2	1.3	0.2	setosa
4	4.6	3.1	1.5	0.2	setosa
5	5	3.6	1.4	0.2	setosa
6	5.4	3.9	1.7	0.4	setosa

'data.frame'

5.006

	sepal_length	sepal_width	petal_length	petal_width	species
1	5.1	3.5	1.4	0.2	setosa
6	5.4	3.9	1.7	0.4	setosa
11	5.4	3.7	1.5	0.2	setosa
15	5.8	4	1.2	0.2	setosa
16	5.7	4.4	1.5	0.4	setosa
17	5.4	3.9	1.3	0.4	setosa

specific.df <- frame[frame$sepal_length %in% c(5.1, 5.8), ] # Rows whose sepal_length is 5.1 or 5.8
head(specific.df)

	sepal_length	sepal_width	petal_length	petal_width	species
1	5.1	3.5	1.4	0.2	setosa
15	5.8	4.0	1.2	0.2	setosa
18	5.1	3.5	1.4	0.3	setosa
20	5.1	3.8	1.5	0.3	setosa
22	5.1	3.7	1.5	0.4	setosa
24	5.1	3.3	1.7	0.5	setosa
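The condition inside the brackets is an ordinary logical vector, which matters when the data contain missing values. The sketch below, on a made-up data frame (`toy`) with one NA, illustrates how bracket indexing and subset() can differ in that case.

```r
# Hypothetical data frame with one missing sepal_length.
toy <- data.frame(
  sepal_length = c(5.1, NA, 5.8, 6.3),
  species      = c("setosa", "setosa", "virginica", "virginica")
)

keep <- toy$sepal_length > 5.5   # logical vector: FALSE NA TRUE TRUE
keep

by_bracket <- toy[keep, ]                       # an NA in the index yields a row of NAs
by_which   <- toy[which(keep), ]                # which() keeps only the TRUE positions
by_subset  <- subset(toy, sepal_length > 5.5)   # subset() treats NA as FALSE

nrow(by_bracket)
nrow(by_which)
nrow(by_subset)
```

Wrapping the condition in `which()` is one way to make bracket indexing behave like subset() when NAs are present.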

Basics

Load the Titanic train.csv data into an R data frame.
Calculate the number of rows in the data frame.
Calculate general descriptive statistics for the data frame.
Slice the data frame into 2 parts, selecting the first half of the rows.
Select just the columns for the passenger ID and whether the passenger survived.

CREDITS

Copyright AnalyticsDojo 2016. This work is licensed under the Creative Commons Attribution 4.0 International license agreement. Adapted from Berkeley R Bootcamp.