Coronavirus Data Modeling

Background

From Wikipedia…

“The 2019–20 coronavirus pandemic is an ongoing global pandemic of coronavirus disease 2019 (COVID-19) caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The virus was first reported in Wuhan, Hubei, China, in December 2019.[5][6] On March 11, 2020, the World Health Organization declared the outbreak a pandemic.[7] As of March 12, 2020, over 134,000 cases have been confirmed in more than 120 countries and territories, with major outbreaks in mainland China, Italy, South Korea, and Iran.[3] Around 5,000 people, with about 3200 from China, have died from the disease. More than 69,000 have recovered.[4]

The virus spreads between people in a way similar to influenza, via respiratory droplets from coughing.[8][9][10] The time between exposure and symptom onset is typically five days, but may range from two to fourteen days.[10][11] Symptoms are most often fever, cough, and shortness of breath.[10][11] Complications may include pneumonia and acute respiratory distress syndrome. There is currently no vaccine or specific antiviral treatment, but research is ongoing. Efforts are aimed at managing symptoms and supportive therapy. Recommended preventive measures include handwashing, maintaining distance from other people (particularly those who are sick), and monitoring and self-isolation for fourteen days for people who suspect they are infected.[9][10][12]

Public health responses around the world have included travel restrictions, quarantines, curfews, event cancellations, and school closures. They have included the quarantine of all of Italy and the Chinese province of Hubei; various curfew measures in China and South Korea;[13][14][15] screening methods at airports and train stations;[16] and travel advisories regarding regions with community transmission.[17][18][19][20] Schools have closed nationwide in 22 countries or locally in 17 countries, affecting more than 370 million students.[21]”

https://en.wikipedia.org/wiki/2019–20_coronavirus_pandemic

For additional background, see JHU’s COVID-19 Resource Center: https://coronavirus.jhu.edu/

#RPI IDEA

Check out these resources that IDEA has put together.

https://idea.rpi.edu/covid-19-resources

The Assignment

Our lives have been seriously disrupted by the coronavirus pandemic, and there is every indication that this is a global event that will require collaboration across a global community to solve. Studying the data provides an opportunity to connect the pandemic to a variety of themes from the class.

A number of folks have already been examining this data. https://ourworldindata.org/coronavirus-source-data

  1. Discussion. What is the role of open data? Why is it important in this case?

<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```python
###Answer here.
q1="""

"""
```

</div>

</div>



2. Read this. 
https://medium.com/@tomaspueyo/coronavirus-act-today-or-people-will-die-f4d3d9cd99ca


What is the role of bias in the data?  Identify 2 different ways that the data could be biased.  



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```python
###Answer here. 
q2="""

"""
```

```python
#Load some data
import pandas as pd
df=pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/03-22-2020.csv')
df
```

</div>

<div class="output_wrapper" markdown="1">
<div class="output_subarea" markdown="1">



<div markdown="0" class="output output_html">
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Province/State</th>
      <th>Country/Region</th>
      <th>Last Update</th>
      <th>Confirmed</th>
      <th>Deaths</th>
      <th>Recovered</th>
      <th>Latitude</th>
      <th>Longitude</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Hubei</td>
      <td>China</td>
      <td>2020-03-22T09:43:06</td>
      <td>67800</td>
      <td>3144</td>
      <td>59433</td>
      <td>30.9756</td>
      <td>112.2707</td>
    </tr>
    <tr>
      <th>1</th>
      <td>NaN</td>
      <td>Italy</td>
      <td>2020-03-22T18:13:20</td>
      <td>59138</td>
      <td>5476</td>
      <td>7024</td>
      <td>41.8719</td>
      <td>12.5674</td>
    </tr>
    <tr>
      <th>2</th>
      <td>NaN</td>
      <td>Spain</td>
      <td>2020-03-22T23:13:18</td>
      <td>28768</td>
      <td>1772</td>
      <td>2575</td>
      <td>40.4637</td>
      <td>-3.7492</td>
    </tr>
    <tr>
      <th>3</th>
      <td>NaN</td>
      <td>Germany</td>
      <td>2020-03-22T23:43:02</td>
      <td>24873</td>
      <td>94</td>
      <td>266</td>
      <td>51.1657</td>
      <td>10.4515</td>
    </tr>
    <tr>
      <th>4</th>
      <td>NaN</td>
      <td>Iran</td>
      <td>2020-03-22T14:13:06</td>
      <td>21638</td>
      <td>1685</td>
      <td>7931</td>
      <td>32.4279</td>
      <td>53.6880</td>
    </tr>
    <tr>
      <th>...</th>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
    </tr>
    <tr>
      <th>304</th>
      <td>NaN</td>
      <td>Jersey</td>
      <td>2020-03-17T18:33:03</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>49.1900</td>
      <td>-2.1100</td>
    </tr>
    <tr>
      <th>305</th>
      <td>NaN</td>
      <td>Puerto Rico</td>
      <td>2020-03-22T22:43:02</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>18.2000</td>
      <td>-66.5000</td>
    </tr>
    <tr>
      <th>306</th>
      <td>NaN</td>
      <td>Republic of the Congo</td>
      <td>2020-03-17T21:33:03</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>-1.4400</td>
      <td>15.5560</td>
    </tr>
    <tr>
      <th>307</th>
      <td>NaN</td>
      <td>The Bahamas</td>
      <td>2020-03-19T12:13:38</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>24.2500</td>
      <td>-76.0000</td>
    </tr>
    <tr>
      <th>308</th>
      <td>NaN</td>
      <td>The Gambia</td>
      <td>2020-03-18T14:13:56</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>13.4667</td>
      <td>-16.6000</td>
    </tr>
  </tbody>
</table>
<p>309 rows × 8 columns</p>
</div>
</div>


</div>
</div>
</div>



### Preprocessing
We have to deal with missing values first.

First let's check the missing values for each column. 



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```python
df.isnull().sum() 
```

```
Province/State    174
Country/Region      0
Last Update         0
Confirmed           0
Deaths              0
Recovered           0
Latitude            0
Longitude           0
dtype: int64
```

The 174 rows with a missing Province/State:

| | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 1 | NaN | Italy | 2020-03-22T18:13:20 | 59138 | 5476 | 7024 | 41.8719 | 12.5674 |
| 2 | NaN | Spain | 2020-03-22T23:13:18 | 28768 | 1772 | 2575 | 40.4637 | -3.7492 |
| 3 | NaN | Germany | 2020-03-22T23:43:02 | 24873 | 94 | 266 | 51.1657 | 10.4515 |
| 4 | NaN | Iran | 2020-03-22T14:13:06 | 21638 | 1685 | 7931 | 32.4279 | 53.6880 |
| 7 | NaN | Korea, South | 2020-03-22T11:13:17 | 8897 | 104 | 2909 | 35.9078 | 127.7669 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 304 | NaN | Jersey | 2020-03-17T18:33:03 | 0 | 0 | 0 | 49.1900 | -2.1100 |
| 305 | NaN | Puerto Rico | 2020-03-22T22:43:02 | 0 | 1 | 0 | 18.2000 | -66.5000 |
| 306 | NaN | Republic of the Congo | 2020-03-17T21:33:03 | 0 | 0 | 0 | -1.4400 | 15.5560 |
| 307 | NaN | The Bahamas | 2020-03-19T12:13:38 | 0 | 0 | 0 | 24.2500 | -76.0000 |
| 308 | NaN | The Gambia | 2020-03-18T14:13:56 | 0 | 0 | 0 | 13.4667 | -16.6000 |

174 rows × 8 columns

The 135 rows with a Province/State value:

| | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | Hubei | China | 2020-03-22T09:43:06 | 67800 | 3144 | 59433 | 30.9756 | 112.2707 |
| 5 | France | France | 2020-03-22T23:43:02 | 16018 | 674 | 2200 | 46.2276 | 2.2137 |
| 6 | New York | US | 2020-03-22T22:13:32 | 15793 | 117 | 0 | 42.1657 | -74.9481 |
| 9 | United Kingdom | United Kingdom | 2020-03-22T22:43:03 | 5683 | 281 | 65 | 55.3781 | -3.4360 |
| 10 | Netherlands | Netherlands | 2020-03-22T14:13:10 | 4204 | 179 | 2 | 52.1326 | 5.2913 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 294 | From Diamond Princess | Australia | 2020-03-14T02:33:04 | 0 | 0 | 0 | 35.4437 | 139.6380 |
| 297 | French Guiana | France | 2020-03-18T14:33:15 | 0 | 0 | 0 | 4.0000 | -53.0000 |
| 298 | Guadeloupe | France | 2020-03-18T14:33:15 | 0 | 0 | 0 | 16.2500 | -61.5833 |
| 299 | Mayotte | France | 2020-03-18T14:33:15 | 0 | 0 | 0 | -12.8431 | 45.1383 |
| 300 | Reunion | France | 2020-03-18T14:33:15 | 0 | 0 | 0 | -21.1351 | 55.2471 |

135 rows × 8 columns

### Data Reporting

#TBD

For the Last Update value, we could create a feature that is equal to the number of days since the last report. We might eliminate data that is too old.

```python
#TBD For the last update value, we could create a feature that is equal to the number of days since the last report.
```

</div>

</div>
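
The days-since-last-report idea can be sketched as follows. This is a minimal illustration on three rows copied by hand from the table above rather than on the full `df`; the `reference` timestamp, the `days_since_update` column name, and the 3-day cutoff are all assumptions chosen for the example.

```python
import pandas as pd

# Three rows copied from the table above; in practice the full df would be used.
sample = pd.DataFrame({
    "Country/Region": ["Italy", "Jersey", "The Gambia"],
    "Last Update": ["2020-03-22T18:13:20", "2020-03-17T18:33:03", "2020-03-18T14:13:56"],
})

# Parse the ISO-formatted strings into datetimes.
sample["Last Update"] = pd.to_datetime(sample["Last Update"])

# Assumed reference point: midnight at the end of the report date (03-22-2020).
reference = pd.Timestamp("2020-03-23")

# Whole days elapsed since each row was last updated.
sample["days_since_update"] = (reference - sample["Last Update"]).dt.days

# Hypothetical cutoff: keep only rows updated within the last 3 days.
recent = sample[sample["days_since_update"] <= 3]
print(sample[["Country/Region", "days_since_update"]])
```

Which cutoff is appropriate is a judgment call; rows with zero cases and an old timestamp may simply never have needed an update.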



### Missing Values and data
3. How might we deal with missing values? How is the data structured such that aggregation might be relevant?





<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```python
###Answer here. 
q3="""

"""
```

```python
#Note the country is then the index here.
country=pd.pivot_table(df, values=['Confirmed', 'Deaths', 'Recovered'], index='Country/Region', aggfunc='sum')
```

</div>

</div>
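
To make the aggregation step concrete, here is a small sketch on four rows copied by hand from the output above (a subset, so the France total differs from the full data). It suggests that summing by country does not depend on the Province/State column at all, and that `groupby` is an equivalent way to write the same aggregation. The frame names `rows`, `by_pivot` and `by_group` are assumptions for the example.

```python
import pandas as pd

# Rows copied from the output above (a subset of the full dataset).
rows = pd.DataFrame({
    "Province/State": ["France", "French Guiana", "Guadeloupe", None],
    "Country/Region": ["France", "France", "France", "Italy"],
    "Confirmed": [16018, 0, 0, 59138],
    "Deaths": [674, 0, 0, 5476],
    "Recovered": [2200, 0, 0, 7024],
})

# Summing by country never reads Province/State, so its missing values
# do not have to be filled or dropped before aggregating.
by_pivot = pd.pivot_table(rows, values=["Confirmed", "Deaths", "Recovered"],
                          index="Country/Region", aggfunc="sum")

# groupby produces the same totals.
by_group = rows.groupby("Country/Region")[["Confirmed", "Deaths", "Recovered"]].sum()

print(by_pivot)
```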



### Clustering

Here is an example of the elbow method, which is used to choose the number of clusters. 

https://scikit-learn.org/stable/modules/clustering.html#k-means

The K-means algorithm aims to choose centroids that minimise the inertia, or within-cluster sum-of-squares criterion.

By looking at the total inertia at different numbers of clusters, we can get an idea of the appropriate number of clusters.
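
As a worked illustration of the inertia definition, the sketch below computes the within-cluster sum of squares by hand for a tiny one-dimensional example. The points and cluster assignments are made up for illustration, and `inertia` is a hypothetical helper written for this example, not a scikit-learn function.

```python
def inertia(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    total = 0.0
    for p, lab in zip(points, labels):
        total += (p - centroids[lab]) ** 2
    return total

# Toy 1-D data: two tight groups.
points = [1.0, 2.0, 3.0, 10.0, 12.0]

# One cluster: the centroid is the overall mean.
one_cluster = inertia(points, [0] * 5, [sum(points) / len(points)])

# Two clusters: the centroids are the two group means (2.0 and 11.0).
two_clusters = inertia(points, [0, 0, 0, 1, 1], [2.0, 11.0])

print(one_cluster, two_clusters)
```

Splitting the data into its two natural groups cuts the inertia sharply, which is the effect the elbow plot looks for.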





<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```python
#This indicates the 

from sklearn.cluster import KMeans
sum_sq = {}
for k in range(1,30):
    kmeans = KMeans(n_clusters = k).fit(country)
    # Inertia: sum of squared distances of samples to their closest cluster center
    sum_sq[k] = kmeans.inertia_
```

```python
#inertia at different levels of K
sum_sq
```

</div>

<div class="output_wrapper" markdown="1">
<div class="output_subarea" markdown="1">


{:.output_data_text}

{1: 18388732383.6612, 2: 5781628112.668509, 3: 1437012534.1559324, 4: 437453272.5568181, 5: 249173713.07080925, 6: 133720143.75294116, 7: 101140484.75294116, 8: 70970124.11542442, 9: 41676542.30886076, 10: 30017447.30886076, 11: 16760158.796828683, 12: 8796949.343951093, 13: 4947365.970771144, 14: 3797381.970771144, 15: 2651536.5041044774, 16: 1877813.419029374, 17: 1363350.3819298667, 18: 1024544.2380769933, 19: 722546.6442815806, 20: 643241.4546326294, 21: 509133.6414066776, 22: 418223.49140667764, 23: 317512.6517577265, 24: 261568.8043432082, 25: 216271.43865631748, 26: 184034.2586798829, 27: 153780.61576464615, 28: 127928.16630164167, 29: 109987.48092580127}



</div>
</div>
</div>



## The Elbow Method

The elbow method is not a formal criterion like p<0.05. Instead, you look for the number of clusters beyond which adding another cluster produces only a small further drop in inertia (the within-cluster sum of squares). 
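
One rough way to put numbers on the elbow is to express each drop in inertia as a share of the inertia at k = 1. The sketch below does this for the first seven values printed above (rounded to one decimal place). Measuring drops as a share of the k = 1 total is an assumption made for illustration, not a standard rule.

```python
# Inertia values for k = 1..7, rounded from the output above.
sum_sq = {1: 18388732383.7, 2: 5781628112.7, 3: 1437012534.2,
          4: 437453272.6, 5: 249173713.1, 6: 133720143.8, 7: 101140484.8}

total = sum_sq[1]

# For each k > 1, the drop in inertia from adding the k-th cluster,
# expressed as a share of the k = 1 inertia.
drops = {k: (sum_sq[k - 1] - sum_sq[k]) / total for k in sorted(sum_sq) if k > 1}

for k, share in drops.items():
    print(f"k={k}: inertia falls by {share:.1%} of the k=1 total")
```

The shares shrink quickly, and where they become "small enough" remains a judgment call.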



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```python
# plot elbow graph
import matplotlib
from matplotlib import pyplot as plt
plt.plot(list(sum_sq.keys()),
         list(sum_sq.values()),
         linestyle = '-',
         marker = 'H',
         markersize = 2,
         markerfacecolor = 'red')
```

```
[<matplotlib.lines.Line2D at 0x7fdd6261ae48>]
```

[Figure: elbow plot of inertia against the number of clusters k]

#Looks like we can justify 4 clusters.

See how adding the 5th cluster doesn’t reduce the total inertia nearly as much? It might be interesting to run the analysis at both 4 and 5 clusters and try to interpret the results.

```python
kmeans = KMeans(n_clusters=4)
kmeans.fit(country)
y_kmeans = kmeans.predict(country)
```

</div>

</div>



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```python
y_kmeans
```

```
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 3, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0], dtype=int32)
```

Looks like they are mostly 0s. Let’s merge our data back together so we can get a clearer picture.

```python
loc=pd.pivot_table(df, values=['Latitude', 'Longitude'], index='Country/Region', aggfunc='mean')
loc['cluster']=y_kmeans
loc
```

</div>

<div class="output_wrapper" markdown="1">
<div class="output_subarea" markdown="1">



<div markdown="0" class="output output_html">
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Latitude</th>
      <th>Longitude</th>
      <th>cluster</th>
    </tr>
    <tr>
      <th>Country/Region</th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Afghanistan</th>
      <td>33.9391</td>
      <td>67.7100</td>
      <td>0</td>
    </tr>
    <tr>
      <th>Albania</th>
      <td>41.1533</td>
      <td>20.1683</td>
      <td>0</td>
    </tr>
    <tr>
      <th>Algeria</th>
      <td>28.0339</td>
      <td>1.6596</td>
      <td>0</td>
    </tr>
    <tr>
      <th>Andorra</th>
      <td>42.5063</td>
      <td>1.5218</td>
      <td>0</td>
    </tr>
    <tr>
      <th>Angola</th>
      <td>-11.2027</td>
      <td>17.8739</td>
      <td>0</td>
    </tr>
    <tr>
      <th>...</th>
      <td>...</td>
      <td>...</td>
      <td>...</td>
    </tr>
    <tr>
      <th>Uzbekistan</th>
      <td>41.3775</td>
      <td>64.5853</td>
      <td>0</td>
    </tr>
    <tr>
      <th>Venezuela</th>
      <td>6.4238</td>
      <td>-66.5897</td>
      <td>0</td>
    </tr>
    <tr>
      <th>Vietnam</th>
      <td>14.0583</td>
      <td>108.2772</td>
      <td>0</td>
    </tr>
    <tr>
      <th>Zambia</th>
      <td>-13.1339</td>
      <td>27.8493</td>
      <td>0</td>
    </tr>
    <tr>
      <th>Zimbabwe</th>
      <td>-19.0154</td>
      <td>29.1549</td>
      <td>0</td>
    </tr>
  </tbody>
</table>
<p>183 rows × 3 columns</p>
</div>
</div>


</div>
</div>
</div>
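
Assigning `y_kmeans` directly as a column works here because `loc` and `country` list the countries in the same order: both pivot tables use the same index column and sort it. A more explicit alternative, sketched below on toy data, is to wrap the labels in a Series keyed by the index of the frame that was clustered, so pandas aligns on country name rather than row position. The toy frames and label values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the `country` pivot table.
country_toy = pd.DataFrame({"Confirmed": [10, 500, 20]},
                           index=pd.Index(["A", "B", "C"], name="Country/Region"))

# Toy stand-in for `loc`: same countries, deliberately in a different row order.
loc_toy = pd.DataFrame({"Latitude": [3.0, 1.0, 2.0]},
                       index=pd.Index(["C", "A", "B"], name="Country/Region"))

# Pretend cluster labels, in the row order of country_toy.
labels = [0, 1, 0]

# Keying the labels by country makes pandas align on the index
# instead of on row position.
loc_toy["cluster"] = pd.Series(labels, index=country_toy.index)

print(loc_toy)
```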



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```python
#join in our dataframes
alldata= country.join(loc)
alldata.to_csv("alldata.csv")  
alldata
```

| Country/Region | Confirmed | Deaths | Recovered | Latitude | Longitude | cluster |
|---|---|---|---|---|---|---|
| Afghanistan | 40 | 1 | 1 | 33.9391 | 67.7100 | 0 |
| Albania | 89 | 2 | 2 | 41.1533 | 20.1683 | 0 |
| Algeria | 201 | 17 | 65 | 28.0339 | 1.6596 | 0 |
| Andorra | 113 | 1 | 1 | 42.5063 | 1.5218 | 0 |
| Angola | 2 | 0 | 0 | -11.2027 | 17.8739 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| Uzbekistan | 43 | 0 | 0 | 41.3775 | 64.5853 | 0 |
| Venezuela | 70 | 0 | 15 | 6.4238 | -66.5897 | 0 |
| Vietnam | 113 | 0 | 17 | 14.0583 | 108.2772 | 0 |
| Zambia | 3 | 0 | 0 | -13.1339 | 27.8493 | 0 |
| Zimbabwe | 3 | 0 | 0 | -19.0154 | 29.1549 | 0 |

183 rows × 6 columns

```python
from google.colab import files
files.download("alldata.csv")
```

```python
alldata.sort_values('cluster', inplace=True)
```

</div>

</div>



#How do we interpret our clusters?



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```python
alldata[alldata.cluster!=0]
```

| Country/Region | Confirmed | Deaths | Recovered | Latitude | Longitude | cluster |
|---|---|---|---|---|---|---|
| China | 81397 | 3265 | 72362 | 32.729748 | 111.684242 | 1 |
| Germany | 24873 | 94 | 266 | 51.165700 | 10.451500 | 2 |
| US | 33276 | 417 | 178 | 38.112296 | -84.664082 | 2 |
| France | 16044 | 674 | 2200 | 3.320689 | -13.517378 | 2 |
| Iran | 21638 | 1685 | 7931 | 32.427900 | 53.688000 | 2 |
| Spain | 28768 | 1772 | 2575 | 40.463700 | -3.749200 | 2 |
| Italy | 59138 | 5476 | 7024 | 41.871900 | 12.567400 | 3 |

```python
pd.set_option('display.max_rows', 500)  #this allows us to see all rows. 
alldata[alldata.cluster==0]
```

Confirmed Deaths Recovered Latitude Longitude cluster
Country/Region
Afghanistan 40 1 1 33.939100 67.710000 0
Albania 89 2 2 41.153300 20.168300 0
Algeria 201 17 65 28.033900 1.659600 0
Andorra 113 1 1 42.506300 1.521800 0
Angola 2 0 0 -11.202700 17.873900 0
Antigua and Barbuda 1 0 0 17.060800 -61.796400 0
Argentina 225 4 3 -38.416100 -63.616700 0
Armenia 194 0 2 40.069100 45.038200 0
Australia 1314 7 88 -24.502867 141.055589 0
Austria 3244 16 9 47.516200 14.550100 0
Azerbaijan 65 1 10 40.143100 47.576900 0
Bahamas, The 4 0 0 25.034300 -77.396300 0
Bahrain 332 2 149 26.066700 50.557700 0
Bangladesh 27 2 3 23.685000 90.356300 0
Barbados 14 0 0 13.193900 -59.543200 0
Belarus 76 0 15 53.709800 27.953400 0
Belgium 3401 75 263 50.503900 4.469900 0
Benin 2 0 0 9.307700 2.315800 0
Bhutan 2 0 0 27.514200 90.433600 0
Bolivia 24 0 0 -16.290200 -63.588700 0
Bosnia and Herzegovina 126 1 2 43.915900 17.679100 0
Brazil 1593 25 2 -14.235000 -51.925300 0
Brunei 88 0 2 4.535300 114.727700 0
Bulgaria 187 3 3 42.733900 25.485800 0
Burkina Faso 75 4 5 12.238300 -1.561600 0
Cabo Verde 3 0 0 16.538800 -23.041800 0
Cambodia 84 0 1 12.565700 104.991000 0
Cameroon 40 0 0 3.848000 11.502100 0
Canada 1465 21 10 50.993533 -92.262983 0
Cape Verde 0 0 0 15.111100 -23.616700 0
Central African Republic 3 0 0 6.611100 20.939400 0
Chad 1 0 0 15.454200 18.732200 0
Chile 632 1 8 -35.675100 -71.543000 0
Colombia 231 2 3 4.570900 -74.297300 0
Congo (Brazzaville) 3 0 0 -0.228000 15.827700 0
Congo (Kinshasa) 30 1 0 -4.038300 21.758700 0
Costa Rica 134 2 2 9.748900 -83.753400 0
Cote d'Ivoire 14 0 1 7.540000 -5.547100 0
Croatia 254 1 5 45.100000 15.200000 0
Cruise Ship 712 8 325 35.449800 139.664900 0
Cuba 35 1 0 21.521800 -77.781200 0
Cyprus 95 1 3 35.126400 33.429900 0
Czechia 1120 1 6 49.817500 15.473000 0
Denmark 1514 13 1 63.287800 -13.338100 0
Djibouti 1 0 0 11.825100 42.590300 0
Dominica 1 0 0 15.415000 -61.371000 0
Dominican Republic 202 3 0 18.735700 -70.162700 0
East Timor 0 0 0 -8.550000 125.560000 0
Ecuador 789 14 3 -1.831200 -78.183400 0
Egypt 327 14 56 26.820600 30.802500 0
El Salvador 3 0 0 13.794200 -88.896500 0
Equatorial Guinea 6 0 0 1.650800 10.267900 0
Eritrea 1 0 0 15.179400 39.782300 0
Estonia 326 0 2 58.595300 25.013600 0
Eswatini 4 0 0 -26.522500 31.465900 0
Ethiopia 11 0 4 9.145000 40.489700 0
Fiji 2 0 0 -17.713400 178.065000 0
Finland 626 1 10 61.924100 25.748200 0
French Guiana 18 0 6 3.933900 -53.125800 0
Gabon 5 1 0 -0.803700 11.609400 0
Gambia, The 1 0 0 13.443200 -15.310100 0
Georgia 54 0 3 42.315400 43.356900 0
Ghana 24 1 0 7.946500 -1.023200 0
Greece 624 15 19 39.074200 21.824300 0
Greenland 0 0 0 72.000000 -40.000000 0
Grenada 1 0 0 12.116500 -61.679000 0
Guadeloupe 56 0 0 16.265000 -61.551000 0
Guam 0 1 0 13.444300 144.793700 0
Guatemala 19 1 0 15.783500 -90.230800 0
Guernsey 0 0 0 49.450000 -2.580000 0
Guinea 2 0 0 9.945600 -9.696600 0
Guyana 7 1 0 4.860400 -58.930200 0
Haiti 2 0 0 18.971200 -72.285200 0
Holy See 1 0 0 41.902900 12.453400 0
Honduras 26 0 0 15.200000 -86.241900 0
Hungary 131 6 16 47.162500 19.503300 0
Iceland 568 1 36 64.963100 -19.020800 0
India 396 7 27 20.593700 78.962900 0
Indonesia 514 48 29 -0.789300 113.921300 0
Iraq 233 20 57 33.223200 43.679300 0
Ireland 906 4 5 53.142400 -7.692100 0
Israel 1071 1 37 31.046100 34.851600 0
Jamaica 16 1 2 18.109600 -77.297500 0
Japan 1086 40 235 36.204800 138.252900 0
Jersey 0 0 0 49.190000 -2.110000 0
Jordan 112 0 1 30.585200 36.238400 0
Kazakhstan 60 0 0 48.019600 66.923700 0
Kenya 15 0 0 -0.023600 37.906200 0
Korea, South 8897 104 2909 35.907800 127.766900 0
Kosovo 2 0 0 42.602600 20.903000 0
Kuwait 188 0 27 29.311700 47.481800 0
Kyrgyzstan 14 0 0 41.204400 74.766100 0
Latvia 139 0 1 56.879600 24.603200 0
Lebanon 248 4 8 33.854700 35.862300 0
Liberia 3 0 0 6.428100 -9.429500 0
Liechtenstein 37 0 0 47.166000 9.555400 0
Lithuania 131 1 1 55.169400 23.881300 0
Luxembourg 798 8 6 49.815300 6.129600 0
Madagascar 3 0 0 -18.766900 46.869100 0
Malaysia 1306 10 139 4.210500 101.975800 0
Maldives 13 0 0 3.202800 73.220700 0
Malta 90 0 2 35.937500 14.375400 0
Martinique 37 1 0 14.641500 -61.024200 0
Mauritania 2 0 0 21.007900 -10.940800 0
Mauritius 18 1 0 -20.348400 57.552200 0
Mayotte 11 0 0 -12.827500 45.166200 0
Mexico 251 2 4 23.634500 -102.552800 0
Moldova 94 1 1 47.411600 28.369900 0
Monaco 23 0 1 43.738400 7.424600 0
Mongolia 10 0 0 46.862500 103.846700 0
Montenegro 21 0 0 42.708700 19.374400 0
Morocco 115 4 3 31.791700 -7.092600 0
Mozambique 1 0 0 -18.665700 35.529600 0
Namibia 3 0 0 -22.957600 18.490400 0
Nepal 2 0 1 28.394900 84.124000 0
Netherlands 4216 180 2 23.716450 -49.180450 0
New Zealand 66 0 0 -40.900600 174.886000 0
Nicaragua 2 0 0 12.865400 -85.207200 0
Niger 2 0 0 17.607800 8.081700 0
Nigeria 30 0 2 9.082000 8.675300 0
North Macedonia 114 1 1 41.608600 21.745300 0
Norway 2383 7 1 60.472000 8.468900 0
Oman 55 0 17 21.473500 55.975400 0
Pakistan 776 5 5 30.375300 69.345100 0
Panama 245 3 0 8.538000 -80.782100 0
Papua New Guinea 1 0 0 -6.315000 143.955500 0
Paraguay 22 1 0 -23.442500 -58.443800 0
Peru 363 5 1 -9.190000 -75.015200 0
Philippines 380 25 17 12.879700 121.774000 0
Poland 634 7 1 51.919400 19.145100 0
Portugal 1600 14 5 39.399900 -8.224500 0
Puerto Rico 0 1 0 18.200000 -66.500000 0
Qatar 494 0 33 25.354800 51.183900 0
Republic of the Congo 0 0 0 -1.440000 15.556000 0
Reunion 47 0 0 -21.115100 55.536400 0
Romania 433 3 64 45.943200 24.966800 0
Russia 367 0 16 61.524000 105.318800 0
Rwanda 19 0 0 -1.940300 29.873900 0
Saint Lucia 2 0 0 13.909400 -60.978900 0
Saint Vincent and the Grenadines 1 0 0 12.984300 -61.287200 0
San Marino 160 20 4 43.942400 12.457800 0
Saudi Arabia 511 0 16 23.885900 45.079200 0
Senegal 67 0 5 14.497400 -14.452400 0
Serbia 222 2 1 44.016500 21.005900 0
Seychelles 7 0 0 -4.679600 55.492000 0
Singapore 455 2 144 1.352100 103.819800 0
Slovakia 185 1 7 48.669000 19.699000 0
Slovenia 414 2 0 46.151200 14.995500 0
Somalia 1 0 0 5.152100 46.199600 0
South Africa 274 0 0 -30.559500 22.937500 0
Sri Lanka 82 0 3 7.873100 80.771800 0
Sudan 2 1 0 12.862800 30.217600 0
Suriname 5 0 0 3.919300 -56.027800 0
Sweden 1934 21 16 60.128200 18.643500 0
Switzerland 7245 98 131 46.818200 8.227500 0
Syria 1 0 0 34.802100 38.996800 0
Taiwan* 169 2 28 23.700000 121.000000 0
Tanzania 12 0 0 -6.369000 34.888800 0
Thailand 599 1 44 15.870000 100.992500 0
The Bahamas 0 0 0 24.250000 -76.000000 0
The Gambia 0 0 0 13.466700 -16.600000 0
Timor-Leste 1 0 0 -8.874200 125.727500 0
Togo 16 0 1 8.619500 0.824800 0
Trinidad and Tobago 50 0 1 10.691800 -61.222500 0
Tunisia 75 3 1 33.886900 9.537500 0
Turkey 1236 30 0 38.963700 35.243300 0
Uganda 1 0 0 1.373300 32.290300 0
Ukraine 73 3 1 48.379400 31.165600 0
United Arab Emirates 153 2 38 23.424100 53.847800 0
United Kingdom 5741 282 67 37.641557 -31.984943 0
Uruguay 135 0 0 -32.522800 -55.765800 0
Uzbekistan 43 0 0 41.377500 64.585300 0
Venezuela 70 0 15 6.423800 -66.589700 0
Vietnam 113 0 17 14.058300 108.277200 0
Zambia 3 0 0 -13.133900 27.849300 0
Zimbabwe 3 0 0 -19.015400 29.154900 0

Try some EDA of your own. This is a 10-point in-class assignment. Submit it on LMS (Section 1: In class assignment Clustering) by next Monday, 3/30.

Using the Covid-19 Clustering example, try something different as part of the EDA.

Turn in ~1/2 page writeup (NOT A JUPYTER NOTEBOOK) describing what you did.

Examples:

Try visualizing the data differently.

Try running a different clustering algorithm.

Try a different number of clusters.

How would the results differ if we created ratios that controlled for total population? What if we tried a different clustering algorithm?
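
For the population question, one possible starting point is to convert counts to rates before clustering. The sketch below uses three rows from the cluster tables above. The population figures are rough round numbers supplied only for illustration (they are not part of the dataset), and the new column names are assumptions.

```python
import pandas as pd

rates = pd.DataFrame({
    "Confirmed": [59138, 28768, 24873],
    "Deaths": [5476, 1772, 94],
    # Approximate populations in millions; illustrative, not from the dataset.
    "population_millions": [60, 47, 83],
}, index=pd.Index(["Italy", "Spain", "Germany"], name="Country/Region"))

# Cases and deaths per million residents.
rates["confirmed_per_million"] = rates["Confirmed"] / rates["population_millions"]
rates["deaths_per_million"] = rates["Deaths"] / rates["population_millions"]

# Deaths as a share of confirmed cases needs no outside data.
rates["deaths_per_confirmed"] = rates["Deaths"] / rates["Confirmed"]

print(rates.round(3))
```

Ratio columns like these could then be passed to `KMeans` in place of the raw counts, which may keep the largest outbreaks from dominating the clusters.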