# Coronavirus Data Modeling
## Background
From Wikipedia…
“The 2019–20 coronavirus pandemic is an ongoing global pandemic of coronavirus disease 2019 (COVID-19) caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The virus was first reported in Wuhan, Hubei, China, in December 2019.[5][6] On March 11, 2020, the World Health Organization declared the outbreak a pandemic.[7] As of March 12, 2020, over 134,000 cases have been confirmed in more than 120 countries and territories, with major outbreaks in mainland China, Italy, South Korea, and Iran.[3] Around 5,000 people, with about 3200 from China, have died from the disease. More than 69,000 have recovered.[4]
The virus spreads between people in a way similar to influenza, via respiratory droplets from coughing.[8][9][10] The time between exposure and symptom onset is typically five days, but may range from two to fourteen days.[10][11] Symptoms are most often fever, cough, and shortness of breath.[10][11] Complications may include pneumonia and acute respiratory distress syndrome. There is currently no vaccine or specific antiviral treatment, but research is ongoing. Efforts are aimed at managing symptoms and supportive therapy. Recommended preventive measures include handwashing, maintaining distance from other people (particularly those who are sick), and monitoring and self-isolation for fourteen days for people who suspect they are infected.[9][10][12]
Public health responses around the world have included travel restrictions, quarantines, curfews, event cancellations, and school closures. They have included the quarantine of all of Italy and the Chinese province of Hubei; various curfew measures in China and South Korea;[13][14][15] screening methods at airports and train stations;[16] and travel advisories regarding regions with community transmission.[17][18][19][20] Schools have closed nationwide in 22 countries or locally in 17 countries, affecting more than 370 million students.[21]”
https://en.wikipedia.org/wiki/2019–20_coronavirus_pandemic
For ADDITIONAL BACKGROUND, see JHU’s COVID-19 Resource Center: https://coronavirus.jhu.edu/
## RPI IDEA
Check out these resources that IDEA has put together.
https://idea.rpi.edu/covid-19-resources
## The Assignment
Our lives have been seriously disrupted by the coronavirus pandemic, and there is every indication that this is going to be a global event that requires collaboration across a global community to solve. Studying the data provides an opportunity to connect the pandemic to a variety of themes from the class.
A number of folks have already been examining this data. https://ourworldindata.org/coronavirus-source-data
1. Discussion: What is the role of open data? Why is it important in this case?
```python
### Answer here.
q1 = """
"""
```
2. Read this: https://medium.com/@tomaspueyo/coronavirus-act-today-or-people-will-die-f4d3d9cd99ca
What is the role of bias in the data? Identify two different ways that the data could be biased.
```python
### Answer here.
q2 = """
"""
```
```python
# Load a daily-report snapshot from the JHU CSSE COVID-19 repository
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/03-22-2020.csv')
df
```
|     | Province/State | Country/Region        | Last Update         | Confirmed | Deaths | Recovered | Latitude | Longitude |
|-----|----------------|-----------------------|---------------------|-----------|--------|-----------|----------|-----------|
| 0   | Hubei          | China                 | 2020-03-22T09:43:06 | 67800     | 3144   | 59433     | 30.9756  | 112.2707  |
| 1   | NaN            | Italy                 | 2020-03-22T18:13:20 | 59138     | 5476   | 7024      | 41.8719  | 12.5674   |
| 2   | NaN            | Spain                 | 2020-03-22T23:13:18 | 28768     | 1772   | 2575      | 40.4637  | -3.7492   |
| 3   | NaN            | Germany               | 2020-03-22T23:43:02 | 24873     | 94     | 266       | 51.1657  | 10.4515   |
| 4   | NaN            | Iran                  | 2020-03-22T14:13:06 | 21638     | 1685   | 7931      | 32.4279  | 53.6880   |
| ... | ...            | ...                   | ...                 | ...       | ...    | ...       | ...      | ...       |
| 304 | NaN            | Jersey                | 2020-03-17T18:33:03 | 0         | 0      | 0         | 49.1900  | -2.1100   |
| 305 | NaN            | Puerto Rico           | 2020-03-22T22:43:02 | 0         | 1      | 0         | 18.2000  | -66.5000  |
| 306 | NaN            | Republic of the Congo | 2020-03-17T21:33:03 | 0         | 0      | 0         | -1.4400  | 15.5560   |
| 307 | NaN            | The Bahamas           | 2020-03-19T12:13:38 | 0         | 0      | 0         | 24.2500  | -76.0000  |
| 308 | NaN            | The Gambia            | 2020-03-18T14:13:56 | 0         | 0      | 0         | 13.4667  | -16.6000  |

309 rows × 8 columns
### Preprocessing
We have to deal with missing values before we can model. Let's start by checking the number of missing values in each column.
```python
# Count the missing values in each column
df.isnull().sum()
```
### Data Reporting
For the "Last Update" value, we could create a feature equal to the number of days since the last report. We might then eliminate data that is too old, as sketched below.
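A minimal sketch of that idea, assuming the `Last Update` column parses as timestamps; the snapshot date and the seven-day cutoff are arbitrary choices for illustration, not part of the original notebook:

```python
# Hypothetical staleness feature: days between the snapshot date and each
# row's last report (the 7-day threshold below is an assumption).
df['Last Update'] = pd.to_datetime(df['Last Update'])
snapshot = pd.Timestamp('2020-03-22')
df['days_since_update'] = (snapshot - df['Last Update']).dt.days
fresh = df[df['days_since_update'] <= 7]  # drop rows that are too stale
```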
### Missing Values and Data
3. How might we deal with missing values? How is the data structured such that aggregation might be relevant?
```python
### Answer here.
q3 = """
"""
```

```python
# Aggregate to country level; note 'Country/Region' becomes the index here.
country = pd.pivot_table(df, values=['Confirmed', 'Deaths', 'Recovered'],
                         index='Country/Region', aggfunc='sum')
```
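One simple option for question 3, sketched here and certainly not the only defensible answer: in this dataset a missing `Province/State` marks a country-level row, so the blank can be made an explicit category, and missing numeric counts can be filled with zeros before aggregating.

```python
# Sketch: make missing values explicit rather than dropping rows.
# The 'Whole country' label is a hypothetical placeholder.
df['Province/State'] = df['Province/State'].fillna('Whole country')
df[['Confirmed', 'Deaths', 'Recovered']] = (
    df[['Confirmed', 'Deaths', 'Recovered']].fillna(0)
)
```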
### Clustering
Here is an example of the elbow method, which is used to choose an appropriate number of clusters.
https://scikit-learn.org/stable/modules/clustering.html#k-means
The K-means algorithm aims to choose centroids that minimise the inertia, or within-cluster sum-of-squares criterion.
By looking at the total inertia at different numbers of clusters, we can get an idea of the appropriate number of clusters.
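In symbols (following the scikit-learn documentation linked above), K-means chooses centroids $\mu_j$ to minimize the inertia

$$\sum_{i=0}^{n} \min_{\mu_j \in C} \lVert x_i - \mu_j \rVert^2$$

over the $n$ samples $x_i$.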
```python
# Compute the K-means inertia for k = 1..29
from sklearn.cluster import KMeans

sum_sq = {}
for k in range(1, 30):
    kmeans = KMeans(n_clusters=k).fit(country)
    # Inertia: sum of squared distances of samples to their closest cluster center
    sum_sq[k] = kmeans.inertia_
```

```python
# Inertia at different levels of k
sum_sq
```
<div class="output_wrapper" markdown="1">
<div class="output_subarea" markdown="1">
{:.output_data_text}
{1: 18388732383.6612, 2: 5781628112.668509, 3: 1437012534.1559324, 4: 437453272.5568181, 5: 249173713.07080925, 6: 133720143.75294116, 7: 101140484.75294116, 8: 70970124.11542442, 9: 41676542.30886076, 10: 30017447.30886076, 11: 16760158.796828683, 12: 8796949.343951093, 13: 4947365.970771144, 14: 3797381.970771144, 15: 2651536.5041044774, 16: 1877813.419029374, 17: 1363350.3819298667, 18: 1024544.2380769933, 19: 722546.6442815806, 20: 643241.4546326294, 21: 509133.6414066776, 22: 418223.49140667764, 23: 317512.6517577265, 24: 261568.8043432082, 25: 216271.43865631748, 26: 184034.2586798829, 27: 153780.61576464615, 28: 127928.16630164167, 29: 109987.48092580127}
</div>
</div>
</div>
## The Elbow Method
Unlike a fixed criterion such as p < 0.05, with the elbow method you look for the point where adding another cluster stops substantially reducing the within-cluster variance.
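One rough way to put numbers on that drop, using the `sum_sq` dictionary computed above (a heuristic sketch, not a formal test):

```python
# Percent reduction in inertia from each k to k+1
ks = sorted(sum_sq)
for k in ks[:-1]:
    drop = 100 * (sum_sq[k] - sum_sq[k + 1]) / sum_sq[k]
    print(f"k={k} -> k={k + 1}: inertia drops {drop:.1f}%")
```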
```python
# Plot the elbow graph
from matplotlib import pyplot as plt

plt.plot(list(sum_sq.keys()),
         list(sum_sq.values()),
         linestyle='-',
         marker='H',
         markersize=2,
         markerfacecolor='red')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.show()
```
Looks like we can justify 4 clusters. See how adding the 5th doesn't reduce the total variance by nearly as much? It might be interesting to run the analysis at both 4 and 5 and compare the interpretations.

```python
# Fit K-means with 4 clusters and assign each country a label
kmeans = KMeans(n_clusters=4)
kmeans.fit(country)
y_kmeans = kmeans.predict(country)
```
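If you want to follow up on the 4-versus-5 question above, one quick sketch is to fit both and cross-tabulate the labels (this assumes `country` and `y_kmeans` from the cells above):

```python
# Compare the k=4 labels with a k=5 fit to see which cluster splits
labels5 = KMeans(n_clusters=5).fit_predict(country)
pd.crosstab(y_kmeans, labels5, rownames=['k=4'], colnames=['k=5'])
```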
```python
y_kmeans
```
Looks like the labels are mostly 0s. Let's merge our data back together so we can get a clearer picture.

```python
# Mean coordinates per country, then attach each country's cluster label
loc = pd.pivot_table(df, values=['Latitude', 'Longitude'],
                     index='Country/Region', aggfunc='mean')
loc['cluster'] = y_kmeans
loc
```
<div class="output_wrapper" markdown="1">
<div class="output_subarea" markdown="1">
<div markdown="0" class="output output_html">
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Latitude</th>
<th>Longitude</th>
<th>cluster</th>
</tr>
<tr>
<th>Country/Region</th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>Afghanistan</th>
<td>33.9391</td>
<td>67.7100</td>
<td>0</td>
</tr>
<tr>
<th>Albania</th>
<td>41.1533</td>
<td>20.1683</td>
<td>0</td>
</tr>
<tr>
<th>Algeria</th>
<td>28.0339</td>
<td>1.6596</td>
<td>0</td>
</tr>
<tr>
<th>Andorra</th>
<td>42.5063</td>
<td>1.5218</td>
<td>0</td>
</tr>
<tr>
<th>Angola</th>
<td>-11.2027</td>
<td>17.8739</td>
<td>0</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>Uzbekistan</th>
<td>41.3775</td>
<td>64.5853</td>
<td>0</td>
</tr>
<tr>
<th>Venezuela</th>
<td>6.4238</td>
<td>-66.5897</td>
<td>0</td>
</tr>
<tr>
<th>Vietnam</th>
<td>14.0583</td>
<td>108.2772</td>
<td>0</td>
</tr>
<tr>
<th>Zambia</th>
<td>-13.1339</td>
<td>27.8493</td>
<td>0</td>
</tr>
<tr>
<th>Zimbabwe</th>
<td>-19.0154</td>
<td>29.1549</td>
<td>0</td>
</tr>
</tbody>
</table>
<p>183 rows × 3 columns</p>
</div>
</div>
</div>
</div>
</div>
```python
# Join the case counts with the locations and cluster labels
alldata = country.join(loc)
alldata.to_csv("alldata.csv")
alldata
```

```python
# If running in Google Colab, download the CSV for later use
from google.colab import files
files.download("alldata.csv")
```

```python
alldata.sort_values('cluster', inplace=True)
```
## How do we interpret our clusters?
```python
# Countries that fall outside the largest (0) cluster
alldata[alldata.cluster != 0]
```

```python
pd.set_option('display.max_rows', 500)  # this allows us to see all rows
alldata[alldata.cluster == 0]
```

Try some EDA of your own. This is a 10-point in-class assignment; submit it to the LMS (Section 1: In class assignment Clustering) by next Monday, 3/30.

Using the COVID-19 clustering example, try something different as part of the EDA. Turn in a roughly half-page writeup (NOT A JUPYTER NOTEBOOK) describing what you did.

Examples:
- Try visualizing the data differently (see the sketch below for a starting point).
- Try running a different clustering algorithm.
- Try a different number of clusters.
- Create ratios that control for total population and see how the clusters change.
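As a starting point for the visualization idea in the list above, here is a sketch; it assumes the `alldata` frame built earlier, with its `Latitude`, `Longitude`, and `cluster` columns:

```python
# Rough world map: countries plotted by coordinates, colored by cluster
import matplotlib.pyplot as plt

plt.scatter(alldata['Longitude'], alldata['Latitude'],
            c=alldata['cluster'], cmap='viridis', s=20)
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('K-means clusters of countries')
plt.show()
```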