AnalyticsDojo

Introduction to R - Datastructures

rpi.analyticsdojo.com

Overview

Common to R and Python

  • Vectors
  • Operations on Numeric and String Variables
  • Lists

Vectors in R

  • The most basic form of an R object is a vector.
  • In fact, individual (scalar) values (variables) are vectors of length one.
  • An R vector is an ordered set of values that all share the same type.
  • We can concatenate values into a vector with c(): ages<-c(18,19,18,23)
  • Comparable Python objects include the pandas Series and the one-dimensional NumPy array.
  • While Python indexing starts at 0, R indexing starts at position 1.

ages<-c(18,19,18,23)
ages
ages[1]
ages[2:4]



18 19 18 23
18
19 18 23
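
The points above about length-one vectors and 1-based indexing can be checked directly. This is a minimal sketch; the names x, first, last, and without_first are made up for illustration and do not appear in the original notebook.

```r
# A single value is still a vector; it simply has length one.
x <- 5
is.vector(x)   # TRUE
length(x)      # 1

ages <- c(18, 19, 18, 23)

# R indexing starts at 1, so position 1 is the first element.
first <- ages[1]                # 18
last <- ages[length(ages)]      # 23

# A negative index drops that position rather than counting from the end.
without_first <- ages[-1]       # 19 18 23
```

Note the contrast with Python, where ages[-1] would return the last element instead of dropping the first.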

    Vector Types in R

    • Items in a vector must be of the same type.
    • Character. These are text (string) vectors. (Values are typically entered in quotes.)
    • Numeric. Numbers in a set. Note that whole numbers and decimals share this type by default.
    • Boolean. TRUE or FALSE values in a set.
    • Factor. A categorical variable with a fixed set of possible values, such as states or zip codes. Factors are closely related to dummy variables, a topic we will discuss later.
    • Determine the data type with the str function: str(teachers)
    names<-c("Sally", "Jason", "Bob", "Susy") #Text
    female<-c(TRUE, FALSE, FALSE, TRUE)  #While Python uses True and False, R uses TRUE and FALSE.
    teachers<-c("Smith", "Johnson", "Johnson", "Smith")
    teachers.f<-factor(teachers)
    grades<-c(20, 15, 13, 19) #25 points possible
    gradesdec<-c(20.32, 15.32, 13.12, 19.32) #25 points possible
    
    str(names)
    str(female)
    str(teachers)  
    str(teachers.f) 
    str(grades)    #Note that the grades and gradesdec are both numeric.
    str(gradesdec) #Note that the grades and gradesdec are both numeric.
    
    
    
     chr [1:4] "Sally" "Jason" "Bob" "Susy"
     logi [1:4] TRUE FALSE FALSE TRUE
     chr [1:4] "Smith" "Johnson" "Johnson" "Smith"
     Factor w/ 2 levels "Johnson","Smith": 2 1 1 2
     num [1:4] 20 15 13 19
     num [1:4] 20.3 15.3 13.1 19.3
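
Because items in a vector must share one type, c() silently converts mixed inputs to the most general type present. The sketch below illustrates this with made-up vectors (mixed and num_logical are hypothetical names, not from the notebook).

```r
# Mixing numbers, text, and logicals: everything becomes character.
mixed <- c(1, "a", TRUE)
class(mixed)        # "character"
mixed               # "1" "a" "TRUE"

# Mixing numbers and logicals: logicals become 1 and 0.
num_logical <- c(1.5, TRUE, FALSE)
class(num_logical)  # "numeric"
num_logical         # 1.5 1.0 0.0
```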
    

    Strings in R

    • We can perform many different types of operations on strings.
    chars <- c('hi', 'hallo', "mother's", 'father\'s', "He said, \'hi\'" )
    length(chars)
    nchar(chars)
    paste("bill", "clinton", sep = " ")  # paste together a set of strings
    paste(chars, collapse = ' ')  # paste together things from a vector
    
    strlist<-strsplit("This is the Analytics Dojo", split = " ") #This takes a string and splits it into a list
    strlist
    substring(chars, 2, 3) #this takes the 2nd-3rd characters of each string in chars
    chars2 <- chars
    substring(chars2, 2, 3) <- "ZZ"  #this replaces the 2nd-3rd characters of each string with "ZZ"
    chars2
    
    
    5
    2 5 8 8 13
    'bill clinton'
    'hi hallo mother\'s father\'s He said, \'hi\''
    1. 'This' 'is' 'the' 'Analytics' 'Dojo'
    'i' 'al' 'ot' 'at' 'e '
    'hZ' 'hZZlo' 'mZZher\'s' 'fZZher\'s' 'HZZsaid, \'hi\''
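
Since strsplit returns a list (one element per input string), a common next step is pulling out the character vector of words. The following is a small sketch under that assumption; parts_list, words, and words_alt are illustrative names.

```r
parts_list <- strsplit("This is the Analytics Dojo", split = " ")
class(parts_list)                # "list"

# Two ways to get the underlying character vector.
words <- parts_list[[1]]         # extract the first list element
words_alt <- unlist(parts_list)  # flatten the whole list

length(words)                    # 5
fourth_upper <- toupper(words[4])
fourth_upper                     # "ANALYTICS"
```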

    Factors in R

    • A factor is a special data type in R used for categorical data. In some cases it works like magic and in others it is incredibly frustrating.
    class(teachers.f)   # the class of the object is "factor"
    levels(teachers.f)  # What order are the levels in? (alphabetical by default)
    summary(teachers.f) #gives the count for each level. 
    
    
    
    'factor'
    'Johnson' 'Smith'
    Johnson: 2
    Smith: 2
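
One source of the frustration mentioned above is that a factor stores integer codes behind its labels, and the codes depend on the level order. The sketch below shows how the order can be set explicitly; teachers.ordered is a hypothetical name introduced for illustration.

```r
teachers <- c("Smith", "Johnson", "Johnson", "Smith")

# By default the levels are sorted, so "Johnson" is level 1.
teachers.f <- factor(teachers)
codes_default <- as.integer(teachers.f)       # 2 1 1 2

# Supplying levels explicitly changes the order and therefore the codes.
teachers.ordered <- factor(teachers, levels = c("Smith", "Johnson"))
levels(teachers.ordered)                      # "Smith" "Johnson"
codes_ordered <- as.integer(teachers.ordered) # 1 2 2 1

# table() gives the same per-level counts as summary().
counts <- table(teachers.f)
counts
```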

    Creating Vectors in R

    • Concatenate fields to a vector: nums <- c(1.1, 3, -5.7)
    • Generate random values from a normal distribution with devs <- rnorm(5)
    • Sample with replacement from an existing vector with idevs <- sample(ints, 100, replace = TRUE)
    # numeric vector
    nums <- c(1.1, 3, -5.7)
    devs <- rnorm(5)
    devs
    
    # integer vector
    ints <- c(1L, 5L, -3L) # force storage as integer not decimal number
    # "L" is for "long integer" (historical)
    
    idevs <- sample(ints, 100, replace = TRUE)
    
    # character vector
    chars <- c("hi", "hallo", "mother's", "father\'s", 
       "She said", "hi", "He said, \'hi\'" )
    chars
    cat(chars, sep = "\n")
    
    # logical vector
    bools <- c(TRUE, FALSE, TRUE)
    bools
    
    
    0.620478857748794 0.355719819931768 -0.482420730604138 1.9607784989951 -1.2218590305962
    'hi' 'hallo' 'mother\'s' 'father\'s' 'She said' 'hi' 'He said, \'hi\''
    hi
    hallo
    mother's
    father's
    She said
    hi
    He said, 'hi'
    
    TRUE FALSE TRUE
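
Beyond c(), rnorm(), and sample(), a few other base R helpers are commonly used to build vectors. This sketch is an addition to the notebook: the seed value 42 and the names devs1, devs2, evens, repeated, and one_to_five are arbitrary choices for illustration.

```r
# Random draws differ on every run unless a seed is set first.
set.seed(42)
devs1 <- rnorm(5)
set.seed(42)
devs2 <- rnorm(5)
identical(devs1, devs2)   # TRUE: same seed, same draws

# Regular sequences and repetition.
evens <- seq(2, 10, by = 2)               # 2 4 6 8 10
repeated <- rep(c("a", "b"), times = 3)   # "a" "b" "a" "b" "a" "b"

# The colon operator produces an integer vector.
one_to_five <- 1:5
class(one_to_five)        # "integer"
```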

    Variable Type

    • In R, writing b = 30 (or, more conventionally, b <- 30) assigns the value 30 to the object b.
    • R is dynamically typed.
    • Unlike some languages, we don't have to declare the type of a variable before using it.
    • Variable type can also change with the reassignment of a variable.
    • We can query the class of a value using the class function.
    • The str function gives additional details for complex objects like dataframes.
    a <- 1L
    print (c("The value of a is ", a))
    print (c("The value of a is ", a), quote=FALSE)
    class(a)
    str(a)
    
    a <- 2.5
    print (c("Now the value of a is ", a),quote=FALSE)
    class(a)
    str(a)
    
    a <- "hello there"
    print (c("Now the value of a is ", a ),quote=FALSE)
    class(a)
    str(a)
    
    
    [1] "The value of a is " "1"                 
    [1] The value of a is  1                 
    
    'integer'
     int 1
    [1] Now the value of a is  2.5                   
    
    'numeric'
     num 2.5
    [1] Now the value of a is  hello there           
    
    'character'
     chr "hello there"
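
Alongside class and str, base R offers typeof and the is.* family for checking a value's type programmatically. The sketch below reuses the variable a from the cell above; the result names (type_int, and so on) are made up for illustration.

```r
a <- 1L
type_int <- typeof(a)        # "integer"
int_is_numeric <- is.numeric(a)   # TRUE: integers count as numeric

a <- 2.5
type_dbl <- typeof(a)        # "double" (class(a) reports "numeric")

a <- "hello there"
chr_is_character <- is.character(a)  # TRUE
chr_is_numeric <- is.numeric(a)      # FALSE
```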
    

    Converting Values Between Types

    • We can convert values between different types.
    • To convert to string use the as.character function.
    • To convert to numeric use the as.numeric function.
    • To convert to an integer use the as.integer function.
    • To convert to a boolean use the as.logical function.
    #This is a way of specifying a long integer.
    a <- 1L
    a
    class(a)
    str(a)
    a<-as.character(a)
    a
    class(a)
    str(a)
    a<-as.numeric(a)
    a
    class(a)
    str(a)
    a<-as.logical(a)
    a
    class(a)
    str(a)
    
    
    
    1
    'integer'
     int 1
    
    '1'
    'character'
     chr "1"
    
    1
    'numeric'
     num 1
    
    TRUE
    'logical'
     logi TRUE
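
Conversions do not always succeed, and some results are surprising. The sketch below collects a few edge cases worth knowing; the variable names are illustrative only.

```r
# as.integer truncates toward zero rather than rounding.
trunc_pos <- as.integer(2.9)     # 2
trunc_neg <- as.integer(-2.9)    # -2

# Numeric 1 converts to TRUE, but the character "1" does not.
from_number <- as.logical(1)      # TRUE
from_string <- as.logical("1")    # NA
from_word <- as.logical("TRUE")   # TRUE

# Text that is not a number becomes NA (R also issues a warning,
# which is suppressed here to keep the output clean).
bad <- suppressWarnings(as.numeric("abc"))   # NA
```

This is why the order of conversions in the cell above matters: the value was converted to numeric before being converted to logical.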
    

    Quotes

    • Double quotes are preferred in R, though single quotes also work as long as a string opens and closes with the same kind of quote.
    #Double quotes are preferred in R, though single quotes also work as long as each string opens and closes with the same kind.
    a <- "hello"
    class(a)
    str(a)
    a <- 'hello'
    class(a)
    str(a)
    
    
    'character'
     chr "hello"
    
    'character'
     chr "hello"
    

    Null Values

    • Since it was designed by statisticians, R handles missing values very well relative to other languages.
    • NA is a missing value
    #Notice that the assignment itself prints nothing; typing the name prints the value.
    a<-NA
    a
    vec <- rnorm(12)    #This creates a vector of normally distributed random values
    vec[c(3, 5)] <- NA  #This sets values 3 and 5 as NA
    vec                 #This prints the Vector
    sum(vec)            #What is the Sum of a vector that has NA?  
    sum(vec, na.rm = TRUE)   #This Sums the vector with the NA removed. 
    is.na(vec)          #This returns a vector of whether specific values are equal to NA.
    
    
    <NA>
    -1.33644546389391 1.87421154996928 <NA> -0.217346245734894 <NA> 0.435770349019708 -1.14025525433378 -0.48345946330215 -0.900282592359427 -0.61861874592141 1.04707474251708 -1.50789510144605
    <NA>
    -2.84724622548555
    FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
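
A few more patterns for working with NA follow from the functions above. This is a sketch with a small fixed vector (so the results are reproducible); na_count, clean_mean, and kept are hypothetical names.

```r
vec <- c(1.5, NA, 3, NA, 4.5)

# is.na() returns TRUE/FALSE, so summing it counts the missing values.
na_count <- sum(is.na(vec))           # 2

# na.rm = TRUE works for mean() just as it does for sum().
clean_mean <- mean(vec, na.rm = TRUE) # 3

# Keep only the non-missing values.
kept <- vec[!is.na(vec)]              # 1.5 3.0 4.5

# Comparisons with NA are themselves NA, which is why is.na() is needed.
NA == NA   # NA
NA > 1     # NA
```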

    Logical/Boolean Vectors

    • Here we can see that summing and averaging boolean vectors treats TRUE as 1 and FALSE as 0.
    answers <- c(TRUE, TRUE, FALSE, FALSE)
    update <- c(TRUE, FALSE, TRUE, FALSE)
    
    # Here we see that TRUE counts as 1 and FALSE counts as 0
    sum(answers)
    mean(answers)
    total<-answers + update
    total
    class(total)
    
    
    2
    0.5
    2 1 1 0
    'integer'
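
Treating TRUE as 1 and FALSE as 0 is most useful when the logical vector comes from a comparison. A minimal sketch, reusing the grades values from earlier; the threshold of 15 and the names passed, n_passed, share_passed, and passing_grades are made up for illustration.

```r
grades <- c(20, 15, 13, 19)

# A comparison produces a logical vector.
passed <- grades >= 15          # TRUE TRUE FALSE TRUE

n_passed <- sum(passed)         # 3: how many met the threshold
share_passed <- mean(passed)    # 0.75: what fraction met it

any(grades == 25)               # FALSE: nobody got full marks
all(grades > 10)                # TRUE: everyone scored above 10

# A logical vector can also be used to select elements.
passing_grades <- grades[passed]   # 20 15 19
```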

    R Calculations

    • R can act as a basic calculator.
    2 + 2 # add numbers
    2 * pi # multiply by a constant
    7 + runif(1) # add a random number
    3^4 # powers
    sqrt(4^4) # functions
    log(10) # natural logarithm
    log(100, base = 10) # logarithm with a specified base
    23 %/% 2 # integer division
    23 %% 2 # remainder (modulo)
    
    # scientific notation
    5000000000 * 1000
    5e9 * 1e3
    
    
    4
    6.28318530717959
    7.38707033358514
    81
    16
    2.30258509299405
    2
    11
    1
    5e+12
    5e+12
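
The integer-division and remainder operators above fit together: the quotient times the divisor plus the remainder recovers the original number. A short sketch with illustrative names (a, b, quotient, remainder, rebuilt):

```r
a <- 23
b <- 2

quotient <- a %/% b     # 11
remainder <- a %% b     # 1

# The two pieces reconstruct the original value.
rebuilt <- quotient * b + remainder
rebuilt == a            # TRUE

# log10() is a shorthand for log(x, base = 10).
same_log <- all.equal(log(100, base = 10), log10(100))
same_log                # TRUE
```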

    Operations on Vectors

    • R can be used as a basic calculator.
    • We can do calculations on vectors easily.
    • Direct (vectorized) operations are much faster and easier than looping.
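    vals <- rnorm(10)                            # ten random values
    squared2vals <- vals^2                       # square every element at once
    sum_squared2vals <- sum(squared2vals)        # sum of the squared values
    count_squared2vals <- length(squared2vals)   # number of values
    vals
    squared2vals
    sum_squared2vals
    count_squared2vals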
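
To make the comparison with looping concrete, the sketch below squares a small fixed vector both ways and confirms the results match. The names squared_vectorized and squared_loop are introduced here for illustration.

```r
vals <- c(1, 2, 3, 4)

# Vectorized: one expression operates on every element.
squared_vectorized <- vals^2

# Loop: the same computation, written element by element.
squared_loop <- numeric(length(vals))   # pre-allocate the result
for (i in seq_along(vals)) {
  squared_loop[i] <- vals[i]^2
}

identical(squared_vectorized, squared_loop)   # TRUE
total <- sum(squared_vectorized)              # 30
```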
    
    
    

    R is a Functional Language

    • Operations are carried out with functions. Functions take objects as inputs and return objects as outputs.
    • An analysis can be considered a pipeline of function calls, with output from a function used later in a subsequent operation as input to another function.
    • Functions themselves are objects.
    • We can get help on functions with help(lm) or ?lm
    vals <- rnorm(10)
    median(vals)
    class(median)
    median(vals, na.rm = TRUE)
    mean(vals, na.rm = TRUE)
    help(lm)
    ?lm
    ?log
    
    
    
    
    0.644546626183949
    'function'
    0.644546626183949
    0.631539617203971
    lm {stats}    R Documentation

    Fitting Linear Models

    Description

    lm is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance (although aov may provide a more convenient interface for these).

    Usage

    lm(formula, data, subset, weights, na.action,
       method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
       singular.ok = TRUE, contrasts = NULL, offset, ...)
    

    Arguments

    formula

    an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given under ‘Details’.

    data

    an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which lm is called.

    subset

    an optional vector specifying a subset of observations to be used in the fitting process.

    weights

    an optional vector of weights to be used in the fitting process. Should be NULL or a numeric vector. If non-NULL, weighted least squares is used with weights weights (that is, minimizing sum(w*e^2)); otherwise ordinary least squares is used. See also ‘Details’,

    na.action

    a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. The ‘factory-fresh’ default is na.omit. Another possible value is NULL, no action. Value na.exclude can be useful.

    method

    the method to be used; for fitting, currently only method = "qr" is supported; method = "model.frame" returns the model frame (the same as with model = TRUE, see below).

    model, x, y, qr

    logicals. If TRUE the corresponding components of the fit (the model frame, the model matrix, the response, the QR decomposition) are returned.

    singular.ok

    logical. If FALSE (the default in S but not in R) a singular fit is an error.

    contrasts

    an optional list. See the contrasts.arg of model.matrix.default.

    offset

    this can be used to specify an a priori known component to be included in the linear predictor during fitting. This should be NULL or a numeric vector of length equal to the number of cases. One or more offset terms can be included in the formula instead or as well, and if more than one are specified their sum is used. See model.offset.

    ...

    additional arguments to be passed to the low level regression fitting functions (see below).

    Details

    Models for lm are specified symbolically. A typical model has the form response ~ terms where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response. A terms specification of the form first + second indicates all the terms in first together with all the terms in second with duplicates removed. A specification of the form first:second indicates the set of terms obtained by taking the interactions of all terms in first with all terms in second. The specification first*second indicates the cross of first and second. This is the same as first + second + first:second.

    If the formula includes an offset, this is evaluated and subtracted from the response.

    If response is a matrix a linear model is fitted separately by least-squares to each column of the matrix.

    See model.matrix for some further details. The terms in the formula will be re-ordered so that main effects come first, followed by the interactions, all second-order, all third-order and so on: to avoid this pass a terms object as the formula (see aov and demo(glm.vr) for an example).

    A formula has an implied intercept term. To remove this use either y ~ x - 1 or y ~ 0 + x. See formula for more details of allowed formulae.

    Non-NULL weights can be used to indicate that different observations have different variances (with the values in weights being inversely proportional to the variances); or equivalently, when the elements of weights are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations (including the case that there are w_i observations equal to y_i and the data have been summarized). However, in the latter case, notice that within-group variation is not used. Therefore, the sigma estimate and residual degrees of freedom may be suboptimal; in the case of replication weights, even wrong. Hence, standard errors and analysis of variance tables should be treated with care.

    lm calls the lower level functions lm.fit, etc, see below, for the actual numerical computations. For programming only, you may consider doing likewise.

    All of weights, subset and offset are evaluated in the same way as variables in formula, that is first in data and then in the environment of formula.

    Value

    lm returns an object of class "lm" or for multiple responses of class c("mlm", "lm").

    The functions summary and anova are used to obtain and print a summary and analysis of variance table of the results. The generic accessor functions coefficients, effects, fitted.values and residuals extract various useful features of the value returned by lm.

    An object of class "lm" is a list containing at least the following components:

    coefficients

    a named vector of coefficients

    residuals

    the residuals, that is response minus fitted values.

    fitted.values

    the fitted mean values.

    rank

    the numeric rank of the fitted linear model.

    weights

    (only for weighted fits) the specified weights.

    df.residual

    the residual degrees of freedom.

    call

    the matched call.

    terms

    the terms object used.

    contrasts

    (only where relevant) the contrasts used.

    xlevels

    (only where relevant) a record of the levels of the factors used in fitting.

    offset

    the offset used (missing if none were used).

    y

    if requested, the response used.

    x

    if requested, the model matrix used.

    model

    if requested (the default), the model frame used.

    na.action

    (where relevant) information returned by model.frame on the special handling of NAs.

    In addition, non-null fits will have components assign, effects and (unless not requested) qr relating to the linear fit, for use by extractor functions such as summary and effects.

    Using time series

    Considerable care is needed when using lm with time series.

    Unless na.action = NULL, the time series attributes are stripped from the variables before the regression is done. (This is necessary as omitting NAs would invalidate the time series attributes, and if NAs are omitted in the middle of the series the result would no longer be a regular time series.)

    Even if the time series attributes are retained, they are not used to line up series, so that the time shift of a lagged or differenced regressor would be ignored. It is good practice to prepare a data argument by ts.intersect(..., dframe = TRUE), then apply a suitable na.action to that data frame and call lm with na.action = NULL so that residuals and fitted values are time series.

    Note

    Offsets specified by offset will not be included in predictions by predict.lm, whereas those specified by an offset term in the formula will be.

    Author(s)

    The design was inspired by the S function of the same name described in Chambers (1992). The implementation of model formula by Ross Ihaka was based on Wilkinson & Rogers (1973).

    References

    Chambers, J. M. (1992) Linear models. Chapter 4 of Statistical Models in S eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole.

    Wilkinson, G. N. and Rogers, C. E. (1973). Symbolic descriptions of factorial models for analysis of variance. Applied Statistics, 22, 392–399. doi: 10.2307/2346786.

    See Also

    summary.lm for summaries and anova.lm for the ANOVA table; aov for a different interface.

    The generic functions coef, effects, residuals, fitted, vcov.

    predict.lm (via predict) for prediction, including confidence and prediction intervals; confint for confidence intervals of parameters.

    lm.influence for regression diagnostics, and glm for generalized linear models.

    The underlying low level functions, lm.fit for plain, and lm.wfit for weighted regression fitting.

    More lm() examples are available e.g., in anscombe, attitude, freeny, LifeCycleSavings, longley, stackloss, swiss.

    biglm in package biglm for an alternative way to fit linear models to large datasets (especially those with many cases).

    Examples

    require(graphics)
    
    ## Annette Dobson (1990) "An Introduction to Generalized Linear Models".
    ## Page 9: Plant Weight Data.
    ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
    trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
    group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
    weight <- c(ctl, trt)
    lm.D9 <- lm(weight ~ group)
    lm.D90 <- lm(weight ~ group - 1) # omitting intercept
    
    anova(lm.D9)
    summary(lm.D90)
    
    opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))
    plot(lm.D9, las = 1)      # Residuals, Fitted, ...
    par(opar)
    
    ### less simple examples in "See Also" above
    

    [Package stats version 3.5.1 ]
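
Because functions are themselves objects, we can define our own, store them in variables, and pass them to other functions. The sketch below is an addition to the notebook; square and apply_twice are hypothetical helpers written only for this example.

```r
# Define a function and check that it is an ordinary object.
square <- function(x) x^2
class(square)                       # "function"

# A function can be passed as an argument to another function.
apply_twice <- function(f, x) f(f(x))
twice <- apply_twice(square, 3)     # (3^2)^2 = 81

# sapply applies a function to each element of a vector.
squares <- sapply(c(1, 2, 3), square)   # 1 4 9

# A pipeline of calls: each output feeds the next function.
pipeline <- mean(sqrt(c(4, 9, 16)))     # mean of 2, 3, 4 = 3
```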

    Matrix

    • A matrix is made up of multiple column vectors of equal length.
    • A matrix is useful for linear algebra.
    • All elements of a matrix must be of the same type.
    • We could relatively easily do regression calculations using the underlying matrix operations.
    #This sets up a matrix with matrix(vector, nrow, ncol)
    mat <- matrix(rnorm(12), nrow = 3, ncol = 4)
    mat
    # This sets up a matrix with matrix(vector, nrow)
    A <- matrix(1:12, 3)
    B <- matrix(1:12, 4)
    C <- matrix(seq(4,36, by = 4), 3)
    A
    B
    C
    
    
    -0.83526885   0.06464287   0.39680157   0.89122461
     2.0589154   -0.4442069    0.4423272    0.1642597
     1.330993    -2.328589    -1.856013     2.048782

     1   4   7  10
     2   5   8  11
     3   6   9  12

     1   5   9
     2   6  10
     3   7  11
     4   8  12

     4  16  28
     8  20  32
    12  24  36
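
As a rough illustration of the linear algebra and regression points above, the sketch below multiplies the A and B matrices and then fits a simple regression with the normal equations, comparing the result to lm. The x and y values are made-up data for this example only, and solving the normal equations is one textbook approach, not necessarily what lm does internally.

```r
A <- matrix(1:12, 3)    # 3 x 4
B <- matrix(1:12, 4)    # 4 x 3

AB <- A %*% B           # matrix multiplication gives a 3 x 3 result
dim(AB)                 # 3 3
dim(t(A))               # 4 3: t() transposes

# Regression by hand: solve (X'X) beta = X'y.
x <- c(1, 2, 3, 4, 5)                  # made-up predictor
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)       # made-up response
X <- cbind(1, x)                       # design matrix with an intercept column
beta <- solve(t(X) %*% X, t(X) %*% y)

# The coefficients agree with lm() up to numerical tolerance.
fit <- lm(y ~ x)
matches_lm <- all.equal(as.numeric(beta), as.numeric(coef(fit)))
matches_lm              # TRUE
```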

    Slicing Vectors and Matrices

    • We can use matrix[rows, columns], specifying rows and columns by index, name, or range.
    vec <- rnorm(12)
    mat <- matrix(vec, 4, 3)
    rownames(mat) <- letters[1:4] #This assigns a row name
    vec
    mat
    
    
    
    
    0.235847096846066 -1.22658917279729 -1.18402683885278 1.50615064907639 -0.206182106051588 -0.412355878955171 -0.0799934537639763 -1.72561819147371 0.76499317236754 1.25314418417645 -1.12655889886209 -2.45245615420083

    a   0.2358471  -0.20618211   0.7649932
    b  -1.2265892  -0.41235588   1.2531442
    c  -1.1840268  -0.07999345  -1.1265589
    d   1.5061506  -1.72561819  -2.4524562
    #Slicing a vector
    vec[c(3, 5, 8:10)] # This gives positions 3, 5, and 8-10
    
    
    -1.18402683885278 -0.206182106051588 -1.72561819147371 0.76499317236754 1.25314418417645
    # matrix[rows,columns]  leaving blank means all columns/rows
    mat[c('a', 'd'), ]
    mat[c(1,4), ]
    mat[c(1,4), 1:2]
    mat[c(1,4), c(1,3)] #Notice when providing a set of positions we wrap them in c()
    mat[, 1:2]          #When providing a range we use a colon.
    
    
    a   0.2358471  -0.2061821   0.7649932
    d   1.5061506  -1.7256182  -2.4524562

    a   0.2358471  -0.2061821   0.7649932
    d   1.5061506  -1.7256182  -2.4524562

    a   0.2358471  -0.2061821
    d   1.5061506  -1.7256182

    a   0.2358471   0.7649932
    d   1.5061506  -2.4524562

    a   0.2358471  -0.20618211
    b  -1.2265892  -0.41235588
    c  -1.1840268  -0.07999345
    d   1.5061506  -1.72561819
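
Two further slicing patterns are worth a quick look: selecting with a logical condition, and keeping a single row as a matrix. The sketch below uses small fixed data so the results are predictable; positives, row_b, row_b_matrix, and big_rows are illustrative names.

```r
vec <- c(0.5, -1.2, 2.3, -0.7)

# A logical condition inside [] keeps only the elements where it is TRUE.
positives <- vec[vec > 0]            # 0.5 2.3

mat <- matrix(1:12, 4, 3)
rownames(mat) <- letters[1:4]

# Selecting one row returns a plain vector by default.
row_b <- mat["b", ]                  # 2 6 10

# drop = FALSE keeps the result as a 1 x 3 matrix.
row_b_matrix <- mat["b", , drop = FALSE]
dim(row_b_matrix)                    # 1 3

# Rows can be chosen by a condition on one of the columns.
big_rows <- mat[mat[, 1] > 2, ]      # rows c and d
```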

    Lists

    • Lists are collections of disparate or complicated objects.
    • Elements can be of different types.
    • Here we name the individual elements of the list with =.
    • Slice the list with the index position or the name.
    #myList <- list(stuff = 3, mat = matrix(1:4, nrow = 2), moreStuff = c('china', 'japan'), list(5, 'bear'))
    myList<-list(stuff=3,mat = matrix(1:4, nrow = 2),vector=c(1,2,3,4),morestuff=c("Albany","New York", "San Francisco"))
    myList
    
    #Slice the list by name and by index position
    myList['stuff']
    myList[2]
    myList[2:3]
    myList[c(1,4)]
    
    
    
    $stuff
    3
    $mat
    1 3
    2 4
    $vector
    1 2 3 4
    $morestuff
    'Albany' 'New York' 'San Francisco'

    $stuff = 3

    $mat =
    1 3
    2 4

    $mat
    1 3
    2 4
    $vector
    1 2 3 4

    $stuff
    3
    $morestuff
    'Albany' 'New York' 'San Francisco'
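
Single brackets, as used above, always return a smaller list. To get at the element itself, base R provides double brackets and the $ operator. A minimal sketch with a shortened version of myList; the element name newItem is made up for illustration.

```r
myList <- list(stuff = 3, vector = c(1, 2, 3, 4))

# Single brackets return a list containing the selected element.
class(myList["stuff"])      # "list"

# Double brackets (or $) return the element itself.
class(myList[["stuff"]])    # "numeric"
second <- myList$vector[2]  # 2

# Assigning to a new name adds an element to the list.
myList$newItem <- "added"
names(myList)               # "stuff" "vector" "newItem"
```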

    CREDITS

    Copyright AnalyticsDojo 2016 This work is licensed under the Creative Commons Attribution 4.0 International license agreement.

    Adapted from the Berkeley R Bootcamp.