AnalyticsDojo

Introduction to R - DataFrames

rpi.analyticsdojo.com

Introduction to R DataFrames

  • A data frame is a collection of vectors of the same length, which can be of different types.
  • It is a special type of list.
  • Data frames are used for standard rectangular (record-by-field) datasets, similar to a spreadsheet.
  • Data frames set R apart from some languages (e.g., Matlab) and provide functionality similar to some statistical packages (e.g., Stata, SAS) and Python’s pandas package.
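To make the first two bullets concrete, here is a minimal sketch that builds a small data frame from equal-length vectors of different types. The names and values (`plants`, `height_cm`, and so on) are made up for illustration and are not part of the iris data used below.

```r
# Hypothetical toy data: three equal-length vectors of different types.
name      <- c("a", "b", "c")
height_cm <- c(10.5, 12.0, 9.8)
flowering <- c(TRUE, FALSE, TRUE)

# data.frame() binds the vectors together as columns.
plants <- data.frame(name, height_cm, flowering, stringsAsFactors = FALSE)

is.list(plants)        # a data frame is a list of columns
sapply(plants, class)  # each column keeps its own type
dim(plants)            # rows x columns
```

Each vector becomes one column, which is why all of them must have the same length.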
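frame <- read.csv(file="../../input/iris.csv", header=TRUE, sep=",", stringsAsFactors=TRUE) #stringsAsFactors=TRUE reads text columns as factors (the default before R 4.0).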
class(frame)
head(frame) #The first few rows.
tail(frame) #The last few rows.
str(frame) #The Structure.



'data.frame'
sepal_length  sepal_width  petal_length  petal_width  species
5.1           3.5          1.4           0.2          setosa
4.9           3.0          1.4           0.2          setosa
4.7           3.2          1.3           0.2          setosa
4.6           3.1          1.5           0.2          setosa
5.0           3.6          1.4           0.2          setosa
5.4           3.9          1.7           0.4          setosa
     sepal_length  sepal_width  petal_length  petal_width  species
145  6.7           3.3          5.7           2.5          virginica
146  6.7           3.0          5.2           2.3          virginica
147  6.3           2.5          5.0           1.9          virginica
148  6.5           3.0          5.2           2.0          virginica
149  6.2           3.4          5.4           2.3          virginica
150  5.9           3.0          5.1           1.8          virginica
'data.frame':	150 obs. of  5 variables:
 $ sepal_length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ sepal_width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ petal_length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ petal_width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
dim(frame) #Results in rows x columns
nrow(frame)  #The number of rows
names(frame) #Provides the column names
length(frame) #The number of columns
summary(frame) #Provides summary statistics.
is.matrix(frame) #Yields FALSE: a data frame is a list of columns, not a matrix.
is.list(frame) #Yields TRUE
class(frame$sepal_length)
class(frame$species)
levels(frame$species)

150 5
150
'sepal_length' 'sepal_width' 'petal_length' 'petal_width' 'species'
5
      sepal_length    sepal_width     petal_length    petal_width   
     Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
     1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
     Median :5.800   Median :3.000   Median :4.350   Median :1.300  
     Mean   :5.843   Mean   :3.054   Mean   :3.759   Mean   :1.199  
     3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
     Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
           species  
     setosa    :50  
     versicolor:50  
     virginica :50  
                    
                    
                    
    
    FALSE
    TRUE
    'numeric'
    'factor'
    'setosa' 'versicolor' 'virginica'
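The `species` column is a factor, which stores each value as an integer code that points into the vector of levels. The sketch below illustrates this on a short made-up vector (`sp` is a hypothetical name) rather than on the data frame itself.

```r
# Hypothetical example vector; the real column has 150 values.
sp <- factor(c("setosa", "virginica", "setosa", "versicolor"))

levels(sp)        # distinct categories, sorted alphabetically by default
as.integer(sp)    # integer codes indexing into levels(sp)
table(sp)         # counts per level
as.character(sp)  # convert back to plain strings
```

This is also why `summary(frame)` reports counts per species instead of quantiles for that column.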
    frame[c("species","sepal_width")]
    
    
    species    sepal_width
    setosa     3.5
    setosa     3.0
    setosa     3.2
    setosa     3.1
    setosa     3.6
    setosa     3.9
    setosa     3.4
    setosa     3.4
    setosa     2.9
    setosa     3.1
    setosa     3.7
    setosa     3.4
    setosa     3.0
    setosa     3.0
    setosa     4.0
    setosa     4.4
    setosa     3.9
    setosa     3.5
    setosa     3.8
    setosa     3.8
    setosa     3.4
    setosa     3.7
    setosa     3.6
    setosa     3.3
    setosa     3.4
    setosa     3.0
    setosa     3.4
    setosa     3.5
    setosa     3.4
    setosa     3.2
    ...        ...
    virginica  3.2
    virginica  2.8
    virginica  2.8
    virginica  2.7
    virginica  3.3
    virginica  3.2
    virginica  2.8
    virginica  3.0
    virginica  2.8
    virginica  3.0
    virginica  2.8
    virginica  3.8
    virginica  2.8
    virginica  2.8
    virginica  2.6
    virginica  3.0
    virginica  3.4
    virginica  3.1
    virginica  3.0
    virginica  3.1
    virginica  3.1
    virginica  3.1
    virginica  2.7
    virginica  3.2
    virginica  3.3
    virginica  3.0
    virginica  2.5
    virginica  3.0
    virginica  3.4
    virginica  3.0
    frame['petals']<-0
    frame$petals2<-0
    head(frame)
    
    
    sepal_length  sepal_width  petal_length  petal_width  species  petals  petals2
    5.1           3.5          1.4           0.2          setosa   0       0
    4.9           3.0          1.4           0.2          setosa   0       0
    4.7           3.2          1.3           0.2          setosa   0       0
    4.6           3.1          1.5           0.2          setosa   0       0
    5.0           3.6          1.4           0.2          setosa   0       0
    5.4           3.9          1.7           0.4          setosa   0       0
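Assigning a single value, as above, recycles it down the whole column. A new column can also be computed from existing ones. A minimal sketch, using a small stand-in data frame (`toy`, `flag`, and `petal_area` are hypothetical names, and the length-times-width formula is only for illustration):

```r
# Stand-in data with two of the column names from the iris file (first three rows).
toy <- data.frame(petal_length = c(1.4, 1.4, 1.3),
                  petal_width  = c(0.2, 0.2, 0.2))

toy$flag <- 0                                         # scalar is recycled to every row
toy$petal_area <- toy$petal_length * toy$petal_width  # element-wise, one value per row
toy
```

Both `toy$name <- value` and `toy['name'] <- value` add the column in place if it does not already exist.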
    mean.sepalLength <- mean(frame[,'sepal_length']) #Mean sepal length across all species.
    
    

    Slicing a Dataframe by Column

    • Remember the syntax dataframe[rows, columns].
    • Using dataframe$column provides one way of selecting a column.
    • We can also specify the index position: dataframe[, columnIndex]
    • We can also specify the column name as a string: dataframe[, 'columnName']
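    sepal_length1 <- frame$sepal_length #Using the dollar sign and the column name.
    sepal_length2 <- frame[,1]  #Using the index location.
    sepal_length3 <- frame[,'sepal_length'] #Using the column name as a string.
    sepal_length4 <- frame[,c('sepal_length','sepal_width')] #Two columns, so the result is a data frame rather than a vector.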
    
    sepal_length1[1:5]  #Print the first 5  
    sepal_length2[1:5]
    sepal_length3[1:5]
    
    
    
    5.1 4.9 4.7 4.6 5
    5.1 4.9 4.7 4.6 5
    5.1 4.9 4.7 4.6 5
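One detail worth knowing: selecting a single column with `[ , ]` returns a plain vector, while selecting several returns a data frame. The sketch below, on a made-up two-column data frame (`toy` is a hypothetical name), shows the difference and how `drop = FALSE` keeps the data frame structure.

```r
toy <- data.frame(sepal_length = c(5.1, 4.9, 4.7),
                  sepal_width  = c(3.5, 3.0, 3.2))

v  <- toy[, "sepal_length"]                # one column: numeric vector
d1 <- toy[, "sepal_length", drop = FALSE]  # one column, kept as a data frame
d2 <- toy["sepal_length"]                  # list-style indexing: also a data frame

class(v)
class(d1)
class(d2)
```

Keeping the data frame form can matter when later code expects `nrow()` or column names to work on the result.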

    Selecting Rows

    • We can select rows from a dataframe by index position: dataframe[rowIndex, columnIndex].
    • Use c(row1, row2, row3) to select specific rows.
    frame2<-frame[1:20,]   
    frame3<-frame[c(1,5,6),] #This selects out specific rows
    nrow(frame2)
    nrow(frame3)
    frame3
    
    
    20
    3
       sepal_length  sepal_width  petal_length  petal_width  species  petals  petals2
    1  5.1           3.5          1.4           0.2          setosa   0       0
    5  5.0           3.6          1.4           0.2          setosa   0       0
    6  5.4           3.9          1.7           0.4          setosa   0       0
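Row and column selections can be combined in a single call. A small sketch on a made-up data frame (`toy` and its columns are hypothetical names):

```r
toy <- data.frame(id    = 1:6,
                  value = c(5.1, 4.9, 4.7, 4.6, 5.0, 5.4),
                  group = c("a", "a", "b", "b", "c", "c"))

toy[c(1, 5, 6), ]           # specific rows, all columns
toy[1:3, c("id", "value")]  # first three rows, two named columns
toy[-1, ]                   # a negative index drops that row
```

Note that the original row numbers are kept as row names in the result, which is why the output above shows rows 1, 5 and 6.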

    Conditional Statements and Dataframes with Subset

    • We can select a subset of a dataframe by passing a logical condition on its columns to subset().
    • The result of subset() is also a dataframe.
    • We can optionally select columns with the argument select = c(col1, col2).
    setosa.df <- subset(frame, species == 'setosa')
    
    head(setosa.df)
    class(setosa.df)
    nrow(setosa.df)
    mean.sepalLength.setosa <- mean(setosa.df$sepal_length) #This creates a numeric vector of length one.
    mean.sepalLength.setosa
    setosa.df.highSepalLength <- subset(setosa.df, sepal_length > mean.sepalLength.setosa)
    nrow(setosa.df.highSepalLength)
    head(setosa.df.highSepalLength)
    setosa.df.highSepalLength2 <- subset(setosa.df, sepal_length > mean.sepalLength.setosa, select = c(sepal_length, species))
    head(setosa.df.highSepalLength2)
    
    
    sepal_length  sepal_width  petal_length  petal_width  species  petals  petals2
    5.1           3.5          1.4           0.2          setosa   0       0
    4.9           3.0          1.4           0.2          setosa   0       0
    4.7           3.2          1.3           0.2          setosa   0       0
    4.6           3.1          1.5           0.2          setosa   0       0
    5.0           3.6          1.4           0.2          setosa   0       0
    5.4           3.9          1.7           0.4          setosa   0       0
    'data.frame'
    50
    5.006
    22
        sepal_length  sepal_width  petal_length  petal_width  species  petals  petals2
    1   5.1           3.5          1.4           0.2          setosa   0       0
    6   5.4           3.9          1.7           0.4          setosa   0       0
    11  5.4           3.7          1.5           0.2          setosa   0       0
    15  5.8           4.0          1.2           0.2          setosa   0       0
    16  5.7           4.4          1.5           0.4          setosa   0       0
    17  5.4           3.9          1.3           0.4          setosa   0       0
        sepal_length  species
    1   5.1           setosa
    6   5.4           setosa
    11  5.4           setosa
    15  5.8           setosa
    16  5.7           setosa
    17  5.4           setosa

    Subsetting Rows Using Indices

    • Just as in pandas, we can put a logical condition in the row position to select the matching rows.
    • See here for good coverage and examples.
    setosa.df <- frame[frame$species == "setosa",]
    head(setosa.df)
    class(setosa.df)
    nrow(setosa.df)
    mean.sepalLength.setosa <- mean(setosa.df$sepal_length) #This creates a numeric vector of length one.
    mean.sepalLength.setosa
    setosa.df.highSepalLength <- setosa.df[setosa.df$sepal_length > mean.sepalLength.setosa,]
    nrow(setosa.df.highSepalLength)
    head(setosa.df.highSepalLength)
    
    
       sepal_length  sepal_width  petal_length  petal_width  species  petals
    1  5.1           3.5          1.4           0.2          setosa   0
    2  4.9           3            1.4           0.2          setosa   0
    3  4.7           3.2          1.3           0.2          setosa   0
    4  4.6           3.1          1.5           0.2          setosa   0
    5  5             3.6          1.4           0.2          setosa   0
    6  5.4           3.9          1.7           0.4          setosa   0
    'data.frame'
    50
    5.006
    22
        sepal_length  sepal_width  petal_length  petal_width  species  petals
    1   5.1           3.5          1.4           0.2          setosa   0
    6   5.4           3.9          1.7           0.4          setosa   0
    11  5.4           3.7          1.5           0.2          setosa   0
    15  5.8           4            1.2           0.2          setosa   0
    16  5.7           4.4          1.5           0.4          setosa   0
    17  5.4           3.9          1.3           0.4          setosa   0
    specific.df <- frame[frame$sepal_length %in% c(5.1,5.8),]
    head(specific.df)
    
    
    
        sepal_length  sepal_width  petal_length  petal_width  species  petals
    1   5.1           3.5          1.4           0.2          setosa   0
    15  5.8           4.0          1.2           0.2          setosa   0
    18  5.1           3.5          1.4           0.3          setosa   0
    20  5.1           3.8          1.5           0.3          setosa   0
    22  5.1           3.7          1.5           0.4          setosa   0
    24  5.1           3.3          1.7           0.5          setosa   0
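The two approaches above, `subset()` and logical indexing inside `[ ]`, return the same rows on the iris data, but they differ when the condition evaluates to `NA`. The sketch below uses a made-up data frame with one missing value (`toy` is a hypothetical name) to show the difference.

```r
toy <- data.frame(x = c(1, NA, 3),
                  y = c("a", "b", "c"))

by_subset  <- subset(toy, x > 2)  # NA in the condition is treated as FALSE
by_bracket <- toy[toy$x > 2, ]    # NA in the condition yields an all-NA row

nrow(by_subset)
nrow(by_bracket)

# which() drops the NA, so bracket indexing then matches subset().
toy[which(toy$x > 2), ]
```

If a column may contain missing values, it is worth checking which behavior the analysis needs before choosing between the two forms.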

    Basics

    1. Load the Titanic train.csv data into an R data frame.
    2. Calculate the number of rows in the data frame.
    3. Calculate general descriptive statistics for the data frame.
    4. Slice the data frame into 2 parts, selecting the first half of the rows.
    5. Select just the passenger ID column and the column indicating whether each passenger survived.

    CREDITS

    Copyright AnalyticsDojo 2016. This work is licensed under the Creative Commons Attribution 4.0 International license agreement. Adapted from the Berkeley R Bootcamp.