AnalyticsDojo

Boston Housing - Feature Selection and Importance

rpi.analyticsdojo.com

Overview

  • Getting the Data
  • Reviewing Data
  • Modeling
  • Model Evaluation
  • Using and Storing the Model
  • Regularized Regression
  • Feature Importance and Selection

Getting Data

#From the sklearn tutorial.
#Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2;
#this notebook assumes an older version in which it is still available.
from sklearn.datasets import load_boston
boston = load_boston()
print("Type of boston dataset:", type(boston))


Type of boston dataset: <class 'sklearn.utils.Bunch'>
#A Bunch, as you may remember, is a dictionary-based dataset. Dictionaries are addressed by keys.
#Let's look at the keys.
print(boston.keys())


dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
#DESCR sounds like it could be useful. Let's print the description.
print(boston['DESCR'])

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

# Let's convert the data to a pandas DataFrame.
import pandas as pd
boston_df = pd.DataFrame(boston['data'])
boston_df.head()

0 1 2 3 4 5 6 7 8 9 10 11 12
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33
#Now add the column names.
boston_df.columns = boston['feature_names']
boston_df.head()

CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33
#Add the target as PRICE. 
boston_df['PRICE']= boston['target']
boston_df.head()

CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2

What types of data are there?

  • First let's focus on the dependent variable, since its nature is critical to the selection of a model.
  • The dependent variable is the median value of owner-occupied homes in $1000's (a continuous variable).
  • It is worth examining the distribution of the dependent variable first; a sketch for doing so follows this list.
  • The distribution is roughly normal for the most part, with some values at the top end that we could explore later.
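
A minimal sketch of that distribution check, assuming matplotlib is available:

import matplotlib.pyplot as plt

#Histogram of the dependent variable (PRICE).
plt.hist(boston_df['PRICE'], bins=30)
plt.xlabel("Median home value ($1000's)")
plt.ylabel('Frequency')
plt.title('Distribution of PRICE')
plt.show()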

Preparing to Model

  • It is common to separate y as the dependent variable and X as the matrix of independent variables.
  • Here we are using train_test_split to split the data into training and test sets.
  • This creates 4 subsets, with the IVs and DV separated: X_train, X_test, y_train, y_test
#This will throw an error at import if sklearn hasn't been upgraded;
#the old location was sklearn.cross_validation.
# from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split
#y is the dependent variable.
y = boston_df['PRICE']
#iloc slices the DataFrame by index number. Here it selects the matrix of
#independent variables (columns 0-12).
X = boston_df.iloc[:,0:13]

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)


(354, 13) (152, 13) (354,) (152,)

Modeling

  • First import the package: from sklearn.linear_model import LinearRegression
  • Then create the model object.
  • Then fit the data.
  • This creates a trained model (an object) of class LinearRegression.
  • The variety of methods and attributes available for regression is documented in the scikit-learn documentation.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit( X_train, y_train )


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Evaluating the Model Results

  • You have fit a model.
  • You can now store this model, save the object to disk, or evaluate it with different outcomes.
  • Trained regression objects expose the coefficients (coef_) and the intercept (intercept_) as attributes.
  • For regression we evaluate results with the coefficient of determination (R-squared), available via the score method of the fitted object.
print('labels\n', X.columns)
print('Coefficients: \n', lm.coef_)
print('Intercept: \n', lm.intercept_)
print('R2 for Train:', lm.score(X_train, y_train))
print('R2 for Test (cross validation):', lm.score(X_test, y_test))

labels
 Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT'],
      dtype='object')
Coefficients: 
 [-1.21310401e-01  4.44664254e-02  1.13416945e-02  2.51124642e+00
 -1.62312529e+01  3.85906801e+00 -9.98516565e-03 -1.50026956e+00
  2.42143466e-01 -1.10716124e-02 -1.01775264e+00  6.81446545e-03
 -4.86738066e-01]
Intercept: 
 37.93710774183255
R2 for Train: 0.7645451026942549
R2 for Test (cross validation): 0.6733825506400194
#Alternatively, we can show the results in a DataFrame using the zip function.
pd.DataFrame(list(zip(X.columns, lm.coef_)),
             columns=['features', 'estimatedCoeffs'])

features estimatedCoeffs
0 CRIM -0.121310
1 ZN 0.044466
2 INDUS 0.011342
3 CHAS 2.511246
4 NOX -16.231253
5 RM 3.859068
6 AGE -0.009985
7 DIS -1.500270
8 RAD 0.242143
9 TAX -0.011072
10 PTRATIO -1.017753
11 B 0.006814
12 LSTAT -0.486738
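
The overview also mentions using the model; predictions come from the predict method of the fitted object. A minimal sketch comparing the first few test-set predictions with the actual values:

#Generate predictions for the held-out test set and compare a few to actuals.
predicted = lm.predict(X_test)
print('Predicted:', predicted[:5])
print('Actual:   ', y_test[:5].values)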

L2 Regularized Regression (Ridge)

Ridge regression applies an L2 penalty. By increasing alpha, we shrink the coefficients toward zero, which highlights the variables that are more important in the analysis.

from sklearn import linear_model
reg = linear_model.Ridge(alpha=5000)
reg.fit(X_train, y_train ) 
print('R2 for Train:', reg.score(X_train, y_train))
print('R2 for Test (cross validation):', reg.score(X_test, y_test))

R2 for Train: 0.6099053511822028
R2 for Test (cross validation): 0.5339221870748787
#Alternatively, we can show the results in a DataFrame using the zip function.
pd.DataFrame(list(zip(X.columns, reg.coef_)),
             columns=['features', 'estimatedCoeffs'])

features estimatedCoeffs
0 CRIM -0.080361
1 ZN 0.059331
2 INDUS -0.052146
3 CHAS 0.018686
4 NOX 0.000412
5 RM 0.133028
6 AGE 0.029760
7 DIS -0.168325
8 RAD 0.141547
9 TAX -0.013997
10 PTRATIO -0.275157
11 B 0.007908
12 LSTAT -0.571049
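
For a true L1 penalty, which can drive coefficients exactly to zero, the same pattern works with Lasso. A minimal sketch; the alpha value is illustrative, not tuned, and the fit may emit a convergence warning on unscaled data:

from sklearn.linear_model import Lasso

#Lasso applies an L1 penalty, which can zero out coefficients entirely.
lasso = Lasso(alpha=1.0)  #illustrative alpha
lasso.fit(X_train, y_train)
print('R2 for Train:', lasso.score(X_train, y_train))
print('R2 for Test (cross validation):', lasso.score(X_test, y_test))
#Features whose coefficients are exactly 0 have been dropped by the penalty.
pd.DataFrame(list(zip(X.columns, lasso.coef_)),
             columns=['features', 'estimatedCoeffs'])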

Feature Importance With Random Forest Regression

from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(random_state=99)
forest.fit(X_train, y_train) 
importances = forest.feature_importances_

print('R2 for Train:', forest.score(X_train, y_train))
print('R2 for Test (cross validation):', forest.score(X_test, y_test))

R2 for Train: 0.9700450911248801
R2 for Test (cross validation): 0.8141525132875429
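
The importances array computed above has not yet been displayed; a quick way to inspect it, sorted, reusing the zip pattern from earlier:

#Pair each feature name with its importance and sort descending.
pd.DataFrame(list(zip(X.columns, importances)),
             columns=['features', 'importance']).sort_values('importance', ascending=False)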

Feature Selection

From the scikit-learn documentation: “SelectFromModel is a meta-transformer that can be used along with any estimator that has a coef_ or feature_importances_ attribute after fitting. The features are considered unimportant and removed, if the corresponding coef_ or feature_importances_ values are below the provided threshold parameter. Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument.”

from sklearn.feature_selection import SelectFromModel
#prefit=True reuses the already-fitted forest; max_features caps the selection at 3.
#With threshold unset, estimators with feature_importances_ default to the mean importance.
model = SelectFromModel(forest, prefit=True, max_features=3)
feature_idx = model.get_support()
feature_names = X.columns[feature_idx]
X_NEW = model.transform(X)
pd.DataFrame(X_NEW, columns= feature_names)


RM LSTAT
0 6.575 4.98
1 6.421 9.14
2 7.185 4.03
3 6.998 2.94
4 7.147 5.33
... ... ...
501 6.593 9.67
502 6.120 9.08
503 6.976 5.64
504 6.794 6.48
505 6.030 7.88

506 rows × 2 columns

Note that only two features (RM and LSTAT) survive even though max_features=3: the default mean-importance threshold still applies, and only these two features exceed the mean.

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X_NEW, y, test_size=0.3, random_state=0)
lm = LinearRegression()
lm.fit( X_train, y_train )
print('R2 for Train:', lm.score(X_train, y_train))
print('R2 for Test (cross validation):', lm.score(X_test, y_test))



R2 for Train: 0.6622109123027915
R2 for Test (cross validation): 0.5445178479963528
#The same fit as above; this time we compute R2 explicitly with r2_score.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
from sklearn.metrics import r2_score
r2_train_reg = r2_score(y_train, lm.predict(X_train))
r2_test_reg = r2_score(y_test, lm.predict(X_test))
print(r2_train_reg, r2_test_reg)

0.6622109123027915 0.5445178479963528
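
The overview also promised storing the model. A minimal sketch using joblib; the file name is illustrative:

from joblib import dump, load

#Persist the trained model to disk, then reload it and verify it still scores.
dump(lm, 'lm_boston.joblib')  #illustrative file name
lm_loaded = load('lm_boston.joblib')
print('R2 for Test (reloaded model):', lm_loaded.score(X_test, y_test))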

Copyright AnalyticsDojo 2016. This work is licensed under the Creative Commons Attribution 4.0 International license agreement.