AnalyticsDojo

Basic Text Feature Creation in Python

rpi.analyticsdojo.com

!wget https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/train.csv
!wget https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/test.csv

--2019-03-11 14:58:22--  https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 61194 (60K) [text/plain]
Saving to: ‘train.csv.1’

train.csv.1         100%[===================>]  59.76K  --.-KB/s    in 0.03s   

2019-03-11 14:58:23 (2.32 MB/s) - ‘train.csv.1’ saved [61194/61194]

--2019-03-11 14:58:23--  https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/test.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28629 (28K) [text/plain]
Saving to: ‘test.csv.1’

test.csv.1          100%[===================>]  27.96K  --.-KB/s    in 0.01s   

2019-03-11 14:58:24 (2.27 MB/s) - ‘test.csv.1’ saved [28629/28629]

import numpy as np
import pandas as pd

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

#Preview the first five rows of the training data
train.head()

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
#Summary statistics for the numeric columns
train.describe()

PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
train.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
#Let's look at the Age field. We can see "NaN", which indicates a missing value.
train["Age"]

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
5       NaN
6      54.0
7       2.0
8      27.0
9      14.0
10      4.0
11     58.0
12     20.0
13     39.0
14     14.0
15     55.0
16      2.0
17      NaN
18     31.0
19      NaN
20     35.0
21     34.0
22     15.0
23     28.0
24      8.0
25     38.0
26      NaN
27     19.0
28      NaN
29      NaN
       ... 
861    21.0
862    48.0
863     NaN
864    24.0
865    42.0
866    27.0
867    31.0
868     NaN
869     4.0
870    26.0
871    47.0
872    33.0
873    47.0
874    28.0
875    15.0
876    20.0
877    19.0
878     NaN
879    56.0
880    25.0
881    33.0
882    22.0
883    28.0
884    25.0
885    39.0
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64
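
Scrolling through a column is one way to spot NaN values, but counting them is quicker. The describe() output above reports 714 non-missing ages out of 891 rows, and the same kind of count can be produced for every column at once. The sketch below uses a small made-up frame (the name toy and its values are assumptions for illustration, not the Titanic data) so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the training data (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan],
    "Cabin": [None, "C85", None, "C123", None],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# isnull() marks each missing cell as True; sum() counts the True values per column.
missing_counts = toy.isnull().sum()
print(missing_counts)

# The mean of the same True/False mask is the share of missing values per column.
missing_share = toy.isnull().mean()
print(missing_share)
```

Running the same two lines on train instead of toy should show which columns need recoding before they can be used as features.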
#Now let's recode by filling the missing ages with the median age.
medianAge = train["Age"].median()
print("The Median age is:", medianAge, " years old.")
train["Age"] = train["Age"].fillna(medianAge)

#Option 2: all in one shot! (Option 1 already filled the gaps, so this line changes nothing here.)
train["Age"] = train["Age"].fillna(train["Age"].median())
train["Age"]

The Median age is: 28.0  years old.
0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
5      28.0
6      54.0
7       2.0
8      27.0
9      14.0
10      4.0
11     58.0
12     20.0
13     39.0
14     14.0
15     55.0
16      2.0
17     28.0
18     31.0
19     28.0
20     35.0
21     34.0
22     15.0
23     28.0
24      8.0
25     38.0
26     28.0
27     19.0
28     28.0
29     28.0
       ... 
861    21.0
862    48.0
863    28.0
864    24.0
865    42.0
866    27.0
867    31.0
868    28.0
869     4.0
870    26.0
871    47.0
872    33.0
873    47.0
874    28.0
875    15.0
876    20.0
877    19.0
878    28.0
879    56.0
880    25.0
881    33.0
882    22.0
883    28.0
884    25.0
885    39.0
886    27.0
887    19.0
888    28.0
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64
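
The recoding above only touches train, while the test output later in this notebook still shows NaN ages. One common approach, sketched here as an assumption rather than as part of the original notebook, is to compute the median on the training data and reuse that same value for the test data. The frames toy_train and toy_test and the AgeMissing column are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for train and test (values are made up).
toy_train = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0, 54.0]})
toy_test = pd.DataFrame({"Age": [34.5, np.nan, 62.0]})

# Compute the median from the training data only.
train_median = toy_train["Age"].median()

# Optionally keep a flag recording which ages were missing before filling.
toy_train["AgeMissing"] = toy_train["Age"].isnull().astype(int)
toy_test["AgeMissing"] = toy_test["Age"].isnull().astype(int)

# Fill both frames with the same training median.
toy_train["Age"] = toy_train["Age"].fillna(train_median)
toy_test["Age"] = toy_test["Age"].fillna(train_median)

print(train_median)
print(toy_test)
```

Reusing the training median keeps the two datasets on the same scale and avoids letting information from the test set influence the features.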
#To recode data, we can use what we know about selecting rows and columns.
#Missing Embarked values are filled with "S" first, then each port is mapped to a number.
train["Embarked"] = train["Embarked"].fillna("S")
train.loc[train["Embarked"] == "S", "EmbarkedRecode"] = 0
train.loc[train["Embarked"] == "C", "EmbarkedRecode"] = 1
train.loc[train["Embarked"] == "Q", "EmbarkedRecode"] = 2

# We can also use a lambda function (a small, unnamed function).
# You can read more about lambda functions here:
# http://www.python-course.eu/lambda.php
gender_fn = lambda x: 0 if x == 'male' else 1
train['Gender'] = train['Sex'].map(gender_fn)

#Or we can define and apply the lambda in one shot.
train['NameLength'] = train['Name'].map(lambda x: len(x))
train['Age2'] = train['Age'].map(lambda x: x*x)
train

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked EmbarkedRecode Gender NameLength Age2
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 0.0 0 23 484.0
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 1.0 1 51 1444.0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 0.0 1 22 676.0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 0.0 1 44 1225.0
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 0.0 0 24 1225.0
5 6 0 3 Moran, Mr. James male 28.0 0 0 330877 8.4583 NaN Q 2.0 0 16 784.0
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S 0.0 0 23 2916.0
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S 0.0 0 30 4.0
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S 0.0 1 49 729.0
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C 1.0 1 35 196.0
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S 0.0 1 31 16.0
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S 0.0 1 24 3364.0
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S 0.0 0 30 400.0
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S 0.0 0 27 1521.0
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S 0.0 1 36 196.0
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S 0.0 1 32 3025.0
16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 NaN Q 2.0 0 20 4.0
17 18 1 2 Williams, Mr. Charles Eugene male 28.0 0 0 244373 13.0000 NaN S 0.0 0 28 784.0
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.0 1 0 345763 18.0000 NaN S 0.0 1 55 961.0
19 20 1 3 Masselmani, Mrs. Fatima female 28.0 0 0 2649 7.2250 NaN C 1.0 1 23 784.0
20 21 0 2 Fynney, Mr. Joseph J male 35.0 0 0 239865 26.0000 NaN S 0.0 0 20 1225.0
21 22 1 2 Beesley, Mr. Lawrence male 34.0 0 0 248698 13.0000 D56 S 0.0 0 21 1156.0
22 23 1 3 McGowan, Miss. Anna "Annie" female 15.0 0 0 330923 8.0292 NaN Q 2.0 1 27 225.0
23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 A6 S 0.0 0 28 784.0
24 25 0 3 Palsson, Miss. Torborg Danira female 8.0 3 1 349909 21.0750 NaN S 0.0 1 29 64.0
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1 5 347077 31.3875 NaN S 0.0 1 57 1444.0
26 27 0 3 Emir, Mr. Farred Chehab male 28.0 0 0 2631 7.2250 NaN C 1.0 0 23 784.0
27 28 0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.0000 C23 C25 C27 S 0.0 0 30 361.0
28 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female 28.0 0 0 330959 7.8792 NaN Q 2.0 1 29 784.0
29 30 0 3 Todoroff, Mr. Lalio male 28.0 0 0 349216 7.8958 NaN S 0.0 0 19 784.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
861 862 0 2 Giles, Mr. Frederick Edward male 21.0 1 0 28134 11.5000 NaN S 0.0 0 27 441.0
862 863 1 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 0 0 17466 25.9292 D17 S 0.0 1 51 2304.0
863 864 0 3 Sage, Miss. Dorothy Edith "Dolly" female 28.0 8 2 CA. 2343 69.5500 NaN S 0.0 1 33 784.0
864 865 0 2 Gill, Mr. John William male 24.0 0 0 233866 13.0000 NaN S 0.0 0 22 576.0
865 866 1 2 Bystrom, Mrs. (Karolina) female 42.0 0 0 236852 13.0000 NaN S 0.0 1 24 1764.0
866 867 1 2 Duran y More, Miss. Asuncion female 27.0 1 0 SC/PARIS 2149 13.8583 NaN C 1.0 1 28 729.0
867 868 0 1 Roebling, Mr. Washington Augustus II male 31.0 0 0 PC 17590 50.4958 A24 S 0.0 0 36 961.0
868 869 0 3 van Melkebeke, Mr. Philemon male 28.0 0 0 345777 9.5000 NaN S 0.0 0 27 784.0
869 870 1 3 Johnson, Master. Harold Theodor male 4.0 1 1 347742 11.1333 NaN S 0.0 0 31 16.0
870 871 0 3 Balkic, Mr. Cerin male 26.0 0 0 349248 7.8958 NaN S 0.0 0 17 676.0
871 872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 1 11751 52.5542 D35 S 0.0 1 48 2209.0
872 873 0 1 Carlsson, Mr. Frans Olof male 33.0 0 0 695 5.0000 B51 B53 B55 S 0.0 0 24 1089.0
873 874 0 3 Vander Cruyssen, Mr. Victor male 47.0 0 0 345765 9.0000 NaN S 0.0 0 27 2209.0
874 875 1 2 Abelson, Mrs. Samuel (Hannah Wizosky) female 28.0 1 0 P/PP 3381 24.0000 NaN C 1.0 1 37 784.0
875 876 1 3 Najib, Miss. Adele Kiamie "Jane" female 15.0 0 0 2667 7.2250 NaN C 1.0 1 32 225.0
876 877 0 3 Gustafsson, Mr. Alfred Ossian male 20.0 0 0 7534 9.8458 NaN S 0.0 0 29 400.0
877 878 0 3 Petroff, Mr. Nedelio male 19.0 0 0 349212 7.8958 NaN S 0.0 0 20 361.0
878 879 0 3 Laleff, Mr. Kristo male 28.0 0 0 349217 7.8958 NaN S 0.0 0 18 784.0
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C 1.0 1 45 3136.0
880 881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 1 230433 26.0000 NaN S 0.0 1 44 625.0
881 882 0 3 Markun, Mr. Johann male 33.0 0 0 349257 7.8958 NaN S 0.0 0 18 1089.0
882 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22.0 0 0 7552 10.5167 NaN S 0.0 1 28 484.0
883 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.5000 NaN S 0.0 0 29 784.0
884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S 0.0 0 22 625.0
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q 2.0 1 36 1521.0
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S 0.0 0 21 729.0
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S 0.0 1 28 361.0
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female 28.0 1 2 W./C. 6607 23.4500 NaN S 0.0 1 40 784.0
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C 1.0 0 21 676.0
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q 2.0 0 19 1024.0

891 rows × 16 columns
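
The recodes above can also be written with a dictionary passed to map, which keeps the whole mapping in one place. The sketch below is an alternative under stated assumptions, not the notebook's own method: toy and its values are made up, and the dummy-variable step with pd.get_dummies is an extra option that the original code does not use.

```python
import pandas as pd

# Hypothetical toy frame standing in for the training data (values are made up).
toy = pd.DataFrame({
    "Embarked": ["S", "C", "Q", None, "S"],
    "Sex": ["male", "female", "female", "male", "female"],
})

# Fill missing ports with "S", as in the code above.
toy["Embarked"] = toy["Embarked"].fillna("S")

# A dictionary passed to map() recodes every category in one step.
embarked_codes = {"S": 0, "C": 1, "Q": 2}
toy["EmbarkedRecode"] = toy["Embarked"].map(embarked_codes)
toy["Gender"] = toy["Sex"].map({"male": 0, "female": 1})

# Alternative: one indicator column per port instead of a single numeric code.
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked")

print(toy)
print(dummies)
```

One difference to keep in mind: a dictionary leaves NaN for any value that is not a key, whereas the gender_fn lambda above sends every value other than 'male' to 1.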


#We can start to create small functions that search for a string.
def has_title(name):
    for s in ['Mr.', 'Mrs.', 'Miss.', 'Dr.', 'Sir.']:
        if name.find(s) >= 0:
            return True
    return False

#Now we are using that separate function in another function.  
title_fn = lambda x: 1 if has_title(x) else 0
#Finally, we apply the function to the Name column of each dataset
train['Title'] = train['Name'].map(title_fn)
test['Title'] = test['Name'].map(title_fn)


test

PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q 1
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S 1
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q 1
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S 1
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S 1
5 897 3 Svensson, Mr. Johan Cervin male 14.0 0 0 7538 9.2250 NaN S 1
6 898 3 Connolly, Miss. Kate female 30.0 0 0 330972 7.6292 NaN Q 1
7 899 2 Caldwell, Mr. Albert Francis male 26.0 1 1 248738 29.0000 NaN S 1
8 900 3 Abrahim, Mrs. Joseph (Sophie Halaut Easu) female 18.0 0 0 2657 7.2292 NaN C 1
9 901 3 Davies, Mr. John Samuel male 21.0 2 0 A/4 48871 24.1500 NaN S 1
10 902 3 Ilieff, Mr. Ylio male NaN 0 0 349220 7.8958 NaN S 1
11 903 1 Jones, Mr. Charles Cresson male 46.0 0 0 694 26.0000 NaN S 1
12 904 1 Snyder, Mrs. John Pillsbury (Nelle Stevenson) female 23.0 1 0 21228 82.2667 B45 S 1
13 905 2 Howard, Mr. Benjamin male 63.0 1 0 24065 26.0000 NaN S 1
14 906 1 Chaffee, Mrs. Herbert Fuller (Carrie Constance... female 47.0 1 0 W.E.P. 5734 61.1750 E31 S 1
15 907 2 del Carlo, Mrs. Sebastiano (Argenia Genovesi) female 24.0 1 0 SC/PARIS 2167 27.7208 NaN C 1
16 908 2 Keane, Mr. Daniel male 35.0 0 0 233734 12.3500 NaN Q 1
17 909 3 Assaf, Mr. Gerios male 21.0 0 0 2692 7.2250 NaN C 1
18 910 3 Ilmakangas, Miss. Ida Livija female 27.0 1 0 STON/O2. 3101270 7.9250 NaN S 1
19 911 3 Assaf Khalil, Mrs. Mariana (Miriam")" female 45.0 0 0 2696 7.2250 NaN C 1
20 912 1 Rothschild, Mr. Martin male 55.0 1 0 PC 17603 59.4000 NaN C 1
21 913 3 Olsen, Master. Artur Karl male 9.0 0 1 C 17368 3.1708 NaN S 0
22 914 1 Flegenheim, Mrs. Alfred (Antoinette) female NaN 0 0 PC 17598 31.6833 NaN S 1
23 915 1 Williams, Mr. Richard Norris II male 21.0 0 1 PC 17597 61.3792 NaN C 1
24 916 1 Ryerson, Mrs. Arthur Larned (Emily Maria Borie) female 48.0 1 3 PC 17608 262.3750 B57 B59 B63 B66 C 1
25 917 3 Robins, Mr. Alexander A male 50.0 1 0 A/5. 3337 14.5000 NaN S 1
26 918 1 Ostby, Miss. Helene Ragnhild female 22.0 0 1 113509 61.9792 B36 C 1
27 919 3 Daher, Mr. Shedid male 22.5 0 0 2698 7.2250 NaN C 1
28 920 1 Brady, Mr. John Bertram male 41.0 0 0 113054 30.5000 A21 S 1
29 921 3 Samaan, Mr. Elias male NaN 2 0 2662 21.6792 NaN C 1
... ... ... ... ... ... ... ... ... ... ... ... ...
388 1280 3 Canavan, Mr. Patrick male 21.0 0 0 364858 7.7500 NaN Q 1
389 1281 3 Palsson, Master. Paul Folke male 6.0 3 1 349909 21.0750 NaN S 0
390 1282 1 Payne, Mr. Vivian Ponsonby male 23.0 0 0 12749 93.5000 B24 S 1
391 1283 1 Lines, Mrs. Ernest H (Elizabeth Lindsey James) female 51.0 0 1 PC 17592 39.4000 D28 S 1
392 1284 3 Abbott, Master. Eugene Joseph male 13.0 0 2 C.A. 2673 20.2500 NaN S 0
393 1285 2 Gilbert, Mr. William male 47.0 0 0 C.A. 30769 10.5000 NaN S 1
394 1286 3 Kink-Heilmann, Mr. Anton male 29.0 3 1 315153 22.0250 NaN S 1
395 1287 1 Smith, Mrs. Lucien Philip (Mary Eloise Hughes) female 18.0 1 0 13695 60.0000 C31 S 1
396 1288 3 Colbert, Mr. Patrick male 24.0 0 0 371109 7.2500 NaN Q 1
397 1289 1 Frolicher-Stehli, Mrs. Maxmillian (Margaretha ... female 48.0 1 1 13567 79.2000 B41 C 1
398 1290 3 Larsson-Rondberg, Mr. Edvard A male 22.0 0 0 347065 7.7750 NaN S 1
399 1291 3 Conlon, Mr. Thomas Henry male 31.0 0 0 21332 7.7333 NaN Q 1
400 1292 1 Bonnell, Miss. Caroline female 30.0 0 0 36928 164.8667 C7 S 1
401 1293 2 Gale, Mr. Harry male 38.0 1 0 28664 21.0000 NaN S 1
402 1294 1 Gibson, Miss. Dorothy Winifred female 22.0 0 1 112378 59.4000 NaN C 1
403 1295 1 Carrau, Mr. Jose Pedro male 17.0 0 0 113059 47.1000 NaN S 1
404 1296 1 Frauenthal, Mr. Isaac Gerald male 43.0 1 0 17765 27.7208 D40 C 1
405 1297 2 Nourney, Mr. Alfred (Baron von Drachstedt")" male 20.0 0 0 SC/PARIS 2166 13.8625 D38 C 1
406 1298 2 Ware, Mr. William Jeffery male 23.0 1 0 28666 10.5000 NaN S 1
407 1299 1 Widener, Mr. George Dunton male 50.0 1 1 113503 211.5000 C80 C 1
408 1300 3 Riordan, Miss. Johanna Hannah"" female NaN 0 0 334915 7.7208 NaN Q 1
409 1301 3 Peacock, Miss. Treasteall female 3.0 1 1 SOTON/O.Q. 3101315 13.7750 NaN S 1
410 1302 3 Naughton, Miss. Hannah female NaN 0 0 365237 7.7500 NaN Q 1
411 1303 1 Minahan, Mrs. William Edward (Lillian E Thorpe) female 37.0 1 0 19928 90.0000 C78 Q 1
412 1304 3 Henriksson, Miss. Jenny Lovisa female 28.0 0 0 347086 7.7750 NaN S 1
413 1305 3 Spector, Mr. Woolf male NaN 0 0 A.5. 3236 8.0500 NaN S 1
414 1306 1 Oliva y Ocana, Dona. Fermina female 39.0 0 0 PC 17758 108.9000 C105 C 0
415 1307 3 Saether, Mr. Simon Sivertsen male 38.5 0 0 SOTON/O.Q. 3101262 7.2500 NaN S 1
416 1308 3 Ware, Mr. Frederick male NaN 0 0 359309 8.0500 NaN S 1
417 1309 3 Peter, Master. Michael J male NaN 1 1 2668 22.3583 NaN C 0

418 rows × 12 columns
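
The has_title function only records whether one of five titles appears in a name. A possible variation, sketched here as an assumption rather than the notebook's method, is to pull the title text itself out of each name with a regular expression. The names title_text and known_titles are hypothetical, and the pattern assumes every name follows the "Surname, Title. Given names" layout seen in the data above.

```python
import pandas as pd

# A few names copied from the output above, in "Surname, Title. Given names" form.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Montvila, Rev. Juozas",
])

# Capture the text between the first comma and the next period.
title_text = names.str.extract(r",\s*([^.]+)\.", expand=False)
print(title_text.tolist())

# The same 0/1 flag as has_title, built from the extracted text.
known_titles = ["Mr", "Mrs", "Miss", "Dr", "Sir"]
has_known_title = title_text.isin(known_titles).astype(int)
print(has_known_title.tolist())
```

Keeping the title as text makes it possible to recode it later, for example by grouping rare titles together, instead of collapsing everything into a single flag.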

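#Writing to File
#The test set has no 'Survived' column until predictions are added to it.
#Assumption: as a placeholder, predict that no passenger survived.
test['Survived'] = 0
submission = test.loc[:, ['PassengerId', 'Survived']]

#Save the submission as a CSV file, leaving out the index column
submission.to_csv('submission.csv', index=False)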