Exploratory Data Analysis for COVID-19 Outbreak in the U.S.

Author: Sultan Albogami

Last Updated: 3/31/2020

Description: Initial investigation of state-level and county-level COVID-19 data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

Importing Libraries

In [1]:
import os
# !pip install numpy, run only for the first time.
import numpy as np
# !pip install pandas
import pandas as pd
# !pip install matplotlib
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
from matplotlib import style
style.use('ggplot')

Reading Data

In [2]:
os.chdir(r"/home/jovyan/")
In [3]:
df = pd.read_csv(r'util/data/us-states-03-30-20.csv')
In [4]:
df.head()
Out[4]:
date state fips cases deaths
0 2020-01-21 Washington 53 1 0
1 2020-01-22 Washington 53 1 0
2 2020-01-23 Washington 53 1 0
3 2020-01-24 Illinois 17 1 0
4 2020-01-24 Washington 53 1 0
In [5]:
df.shape
Out[5]:
(1554, 5)

Plotting cases vs. deaths by date

In [6]:
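# Each date has one row per state, so total the counts by date before plotting
daily = df.groupby('date')[['cases', 'deaths']].sum().reset_index()
ax = plt.gca()
daily.plot(kind='line', x='date', y='cases', figsize=(12, 8), ax=ax)
daily.plot(kind='line', x='date', y='deaths', figsize=(12, 8), ax=ax)
plt.ylabel('Count')
plt.title('Increase of cases and deaths over time')
plt.show()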

Top 10 states with the most cases and deaths as of 03-30-2020

In [7]:
# Sum the cases and deaths (select the columns with a list to avoid the pandas FutureWarning)
latest_sum = df.groupby(['state'])[['cases', 'deaths']].agg('sum')

# Sort in descending order
latest_sum = latest_sum.sort_values(by=['cases', 'deaths'], ascending=False)

latest_sum.head(10)
Out[7]:
cases deaths
state
New York 387714 4991
New Jersey 73838 897
California 47065 921
Washington 40990 2145
Michigan 31601 702
Florida 28846 401
Massachusetts 27827 252
Illinois 27243 342
Louisiana 23678 896
Pennsylvania 18863 194
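
The per-date counts in this dataset appear to be cumulative (Washington, for example, reports 1 case on each of 2020-01-21 through 2020-01-23), so summing over dates counts the same cases once per day. Below is a minimal sketch of an alternative that keeps each state's most recent row instead. The `sample` frame, its values, and the helper name `latest_by_state` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical sample that mimics the us-states CSV layout (cumulative counts)
sample = pd.DataFrame({
    'date': ['2020-03-29', '2020-03-30', '2020-03-29', '2020-03-30'],
    'state': ['New York', 'New York', 'Washington', 'Washington'],
    'cases': [100, 150, 40, 45],
    'deaths': [5, 8, 2, 3],
})

def latest_by_state(frame):
    """Return each state's most recent cumulative cases and deaths."""
    # Sort by date so the last row within each state is the newest report
    ordered = frame.sort_values('date')
    latest = ordered.groupby('state')[['cases', 'deaths']].last()
    # Rank states by their latest counts, largest first
    return latest.sort_values(by=['cases', 'deaths'], ascending=False)

latest_snapshot = latest_by_state(sample)
print(latest_snapshot)
```

Applied to the real `df`, the same helper would be called as `latest_by_state(df).head(10)`, assuming the ISO-formatted `date` strings sort chronologically.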
In [8]:
# Plot the result
latest_sum.head(10).plot(kind='bar', figsize=(10, 6))
plt.ylabel('Count')
plt.title('Top 10 states with the most cases and deaths as of 03-30-2020')
plt.show()

Total Number of Cases and Deaths as of 2020-03-30

In [9]:
# Total the state counts by date, then keep only the most recent date
latest_total = df.groupby('date')[['cases', 'deaths']].sum().reset_index()
latest_total = latest_total[latest_total['date'] == latest_total['date'].max()].reset_index(drop=True)
latest_total
Out[9]:
date cases deaths
0 2020-03-30 163796 3073

Top 10 states with the most new cases, deaths, percentage of deaths, case growth rate, death growth rate as of 03-30-2020

In [10]:
# Extract the cases and deaths reported on 2020-03-30 using loc.
present_stats = df.loc[df['date'] == '2020-03-30', ['date', 'state', 'cases', 'deaths']]

# Compute the death percentage (deaths as a share of cases)
present_stats['death percentage'] = (present_stats['deaths'] / present_stats['cases']) * 100

# Sort in descending order
present_stats = present_stats.sort_values(by=['cases', 'deaths', 'death percentage'], ascending=False)

present_stats.head(10)
Out[10]:
date state cases deaths death percentage
1532 2020-03-30 New York 67174 1224 1.822134
1530 2020-03-30 New Jersey 16636 199 1.196201
1503 2020-03-30 California 7421 146 1.967390
1522 2020-03-30 Michigan 6508 197 3.027044
1521 2020-03-30 Massachusetts 5752 61 1.060501
1508 2020-03-30 Florida 5694 71 1.246927
1550 2020-03-30 Washington 5179 221 4.267233
1513 2020-03-30 Illinois 5070 84 1.656805
1539 2020-03-30 Pennsylvania 4156 48 1.154957
1518 2020-03-30 Louisiana 4025 186 4.621118
In [11]:
# Plot the result
present_stats.head(10).plot(kind='bar', x='state', y='death percentage', figsize=(10, 6))

# Set the plot title
plt.title('Death percentage in the 10 states with the most cases as of 03-30-2020')
Out[11]:
Text(0.5, 1.0, 'Death percentage in the 10 states with the most cases as of 03-30-2020')
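
The heading of this section mentions new cases and growth rates, which the cells in this section do not compute. Below is a minimal sketch of one way to derive them, assuming the counts are cumulative per state. The `sample` frame, its values, the helper name `add_daily_changes`, and the new column names are hypothetical.

```python
import pandas as pd

# Hypothetical cumulative counts for two states over three days
sample = pd.DataFrame({
    'date': ['2020-03-28', '2020-03-29', '2020-03-30'] * 2,
    'state': ['New York'] * 3 + ['Washington'] * 3,
    'cases': [100, 150, 300, 40, 50, 55],
    'deaths': [4, 5, 8, 2, 2, 3],
})

def add_daily_changes(frame):
    """Add day-over-day differences and percentage growth per state."""
    out = frame.sort_values(['state', 'date']).copy()
    # diff() gives the new count since the previous report for the same state
    out['new cases'] = out.groupby('state')['cases'].diff()
    out['new deaths'] = out.groupby('state')['deaths'].diff()
    # pct_change() gives relative growth; multiply by 100 for a percentage
    out['case growth rate'] = out.groupby('state')['cases'].pct_change() * 100
    out['death growth rate'] = out.groupby('state')['deaths'].pct_change() * 100
    return out

changes = add_daily_changes(sample)

# Keep the latest date and rank states by new cases
changes_latest = (changes[changes['date'] == '2020-03-30']
                  .sort_values(by='new cases', ascending=False)
                  .set_index('state'))
print(changes_latest)
```

The first report for each state has no previous day to compare against, so its change and growth columns are NaN.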
In [12]:
# !pip install plotly
# !conda install psutil --yes

import plotly.express as px

fig = px.bar(df, x='date', y='cases', color='state', labels={'cases': 'Cases'},
             hover_data=['state'],
             title='Evolution of Reported COVID-19 Cases in the United States')
fig.show()
In [13]:
fig = px.bar(df, x='date', y='deaths', color='state', labels={'deaths': 'Deaths'},
             hover_data=['state'],
             title='Evolution of Reported COVID-19 Deaths in the United States')
fig.show()
In [14]:
# Tree Map Visualization of COVID-19 Cases by Date and State
fig = px.treemap(df.sort_values(by='cases', ascending=False).reset_index(drop=True), 
                 path=["state", "date"], values="cases", height=700,
                 title='Number of COVID-19 Cases by State and Date',
                 color_discrete_sequence = px.colors.qualitative.Prism)
fig.data[0].textinfo = 'label+text+value'
fig.show()
In [15]:
# Tree Map Visualization of COVID-19 Deaths by State and Date
fig = px.treemap(df.sort_values(by='deaths', ascending=False).reset_index(drop=True), 
                 path=["state", "date"], values="deaths", height=700,
                 title='Number of deaths from COVID-19 by State and Date',
                 color_discrete_sequence = px.colors.qualitative.Prism)
fig.data[0].textinfo = 'label+text+value'
fig.show()

EDA for U.S. Counties

In [17]:
df = pd.read_csv('util/data/us-counties-03-30-20.csv')
In [18]:
df.head()
Out[18]:
date county state fips cases deaths
0 2020-01-21 Snohomish Washington 53061.0 1 0
1 2020-01-22 Snohomish Washington 53061.0 1 0
2 2020-01-23 Snohomish Washington 53061.0 1 0
3 2020-01-24 Cook Illinois 17031.0 1 0
4 2020-01-24 Snohomish Washington 53061.0 1 0
In [19]:
df.shape
Out[19]:
(21799, 6)
In [20]:
# Sum the cases and deaths (select the columns with a list to avoid the pandas FutureWarning)
latest_sum = df.groupby(['county'])[['cases', 'deaths']].agg('sum')

# Sort in descending order
latest_sum = latest_sum.sort_values(by=['cases', 'deaths'], ascending=False)

latest_sum.head(10)
Out[20]:
cases deaths
county
New York City 223933 4047
Westchester 57820 69
Nassau 41585 228
Suffolk 34767 280
Unknown 26922 276
King 21128 1682
Cook 20221 201
Wayne 15509 306
Bergen 13194 247
Los Angeles 13173 214
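
County names are not unique across states (a name such as Wayne is used in more than one state), so grouping by `county` alone may merge distinct counties into one row. Below is a minimal sketch of grouping by state and county together and keeping the latest cumulative count. The `sample` frame and its values are hypothetical and only illustrate the effect.

```python
import pandas as pd

# Hypothetical rows: two different counties that share the name "Wayne"
sample = pd.DataFrame({
    'date': ['2020-03-29', '2020-03-30'] * 2,
    'county': ['Wayne'] * 4,
    'state': ['Michigan', 'Michigan', 'New York', 'New York'],
    'cases': [10, 20, 1, 2],
    'deaths': [1, 2, 0, 0],
})

# Grouping by name only collapses both counties into a single row
by_name = sample.groupby('county')[['cases', 'deaths']].sum()

# Grouping by (state, county) keeps them apart; last() takes the newest report
by_state_county = (
    sample.sort_values('date')
          .groupby(['state', 'county'])[['cases', 'deaths']]
          .last()
          .sort_values(by=['cases', 'deaths'], ascending=False)
)
print(by_name)
print(by_state_county)
```

The `fips` column could serve as the grouping key instead, although the `Out[18]` preview shows it as a float, which suggests some rows have no FIPS code.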
In [21]:
# Plot the result
latest_sum.head(10).plot(kind='bar', figsize=(10, 6))
plt.ylabel('Count')
plt.title('Top 10 counties with the most cases and deaths as of 03-30-2020')
plt.show()
In [22]:
fig = px.bar(df, x='date', y='cases', color='county', labels={'cases': 'Cases'},
             hover_data=['county'],
             title='Evolution of Reported COVID-19 Cases in the United States Counties')
fig.show()
In [23]:
fig = px.bar(df, x='date', y='deaths', color='county', labels={'deaths': 'Deaths'},
             hover_data=['county'],
             title='Evolution of Reported COVID-19 Deaths in the United States Counties')
fig.show()
In [24]:
# Tree Map Visualization of COVID-19 Cases by County and Date
fig = px.treemap(df.sort_values(by='cases', ascending=False).reset_index(drop=True), 
                 path=["county", "date"], values="cases", height=700,
                 title='Number of COVID-19 Cases by County and Date',
                 color_discrete_sequence = px.colors.qualitative.Prism)
fig.data[0].textinfo = 'label+text+value'
fig.show()
In [ ]:
# Tree Map Visualization of COVID-19 Deaths by County and Date
fig = px.treemap(df.sort_values(by='deaths', ascending=False).reset_index(drop=True), 
                 path=["county", "date"], values="deaths", height=700,
                 title='Number of deaths from COVID-19 by County and Date',
                 color_discrete_sequence = px.colors.qualitative.Prism)
fig.data[0].textinfo = 'label+text+value'
fig.show()