Exploratory Data Analysis for COVID-19 Outbreak in the U.S.

Author: Sultan Albogami

Last Updated: 3/31/2020

Description: Initial investigation of state- and county-level COVID-19 data to discover patterns, spot anomalies, test hypotheses, and check assumptions using summary statistics and graphical representations.

Importing Libraries

import os

# Uncomment the "!pip install" lines only on the first run.
# !pip install numpy
import numpy as np
# !pip install pandas
import pandas as pd
# !pip install matplotlib
import matplotlib.pyplot as plt
from matplotlib import style

# Use the ggplot style for all matplotlib figures
style.use('ggplot')

Reading Data

os.chdir(r"/home/jovyan/")

df = pd.read_csv(r'util/data/us-states-03-30-20.csv')

df.head()

df.shape

(1554, 5)
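Before plotting, it can help to check the assumptions the rest of the analysis relies on. The sketch below is not part of the original notebook: it uses a tiny made-up `sample` frame with the same column names as the state file, and a hypothetical helper `check_cumulative`, to count missing values and flag rows where a count drops from one date to the next. It assumes `cases` and `deaths` are cumulative totals.

```python
import pandas as pd

# Made-up sample with the same columns as the state file (values are illustrative only)
sample = pd.DataFrame({
    "date": ["2020-03-28", "2020-03-29", "2020-03-30",
             "2020-03-28", "2020-03-29", "2020-03-30"],
    "state": ["A", "A", "A", "B", "B", "B"],
    "fips": [1, 1, 1, 2, 2, 2],
    "cases": [10, 15, 21, 5, 4, 9],   # state B drops from 5 to 4: an anomaly
    "deaths": [0, 1, 1, 0, 0, 0],
})


def check_cumulative(frame, key="state"):
    """Return rows where a count is lower than on the previous date (hypothetical helper)."""
    frame = frame.copy()
    frame["date"] = pd.to_datetime(frame["date"])
    frame = frame.sort_values([key, "date"])
    # Day-over-day change within each group; a negative value breaks the cumulative assumption
    change = frame.groupby(key)[["cases", "deaths"]].diff()
    return frame[(change < 0).any(axis=1)]


missing = sample.isna().sum()          # missing values per column
anomalies = check_cumulative(sample)   # rows that violate the cumulative assumption
print(missing)
print(anomalies)
```

On the real data, the same call would be `check_cumulative(df)`; an empty result would suggest the counts never decrease within a state.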

Plotting cases vs. deaths by date

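# Each date has one row per state, so sum across states before plotting
daily_total = df.groupby('date')[['cases', 'deaths']].sum().reset_index()

ax = plt.gca()
daily_total.plot(kind='line', x='date', y='cases', figsize=(12, 8), ax=ax)
daily_total.plot(kind='line', x='date', y='deaths', figsize=(12, 8), ax=ax)
plt.ylabel('Count')
plt.title('Increase of cases and deaths over time')
plt.show()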

Top 10 states with the most cases and deaths as of 03-30-2020

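# Sum the cases and deaths over all dates for each state.
# Note: if the counts are cumulative, this sum counts earlier cases repeatedly.
latest_sum = df.groupby(['state'])[['cases', 'deaths']].agg('sum')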

# Sort in descending order
latest_sum = latest_sum.sort_values(by=['cases', 'deaths'], ascending=False)

latest_sum.head(10)

# Plot the result
latest_sum.head(10).plot(kind='bar', figsize=(10, 6))
plt.ylabel('Count')
plt.title('Top 10 states with the most cases and deaths as of 03-30-2020')
plt.show()
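The output tables in this notebook list New York with 387714 cases in the per-state sum but 67174 cases in the 2020-03-30 rows, which suggests the counts are cumulative. Under that assumption, a ranking "as of" a date would take each state's most recent row instead of summing over dates. A minimal sketch on made-up data (the `sample` frame and its values are illustrative, not from the dataset):

```python
import pandas as pd

# Made-up cumulative counts for two states over two days
sample = pd.DataFrame({
    "date": ["2020-03-29", "2020-03-30", "2020-03-29", "2020-03-30"],
    "state": ["A", "A", "B", "B"],
    "cases": [100, 150, 40, 90],
    "deaths": [2, 3, 1, 4],
})

# Summing cumulative counts across dates counts earlier cases repeatedly
summed = sample.groupby("state")[["cases", "deaths"]].sum()

# Taking each state's row for its most recent date gives the count as of that date
latest = (
    sample.sort_values("date")
          .groupby("state")[["cases", "deaths"]]
          .last()
          .sort_values(by=["cases", "deaths"], ascending=False)
)
print(summed)
print(latest)
```

With the notebook's data, replacing `sample` with `df` and calling `latest.head(10)` would give a top-10 list based on the latest reported totals.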

Total Number of Cases and Deaths as of 2020-03-30

latest_total = df.groupby('date')[['cases', 'deaths']].sum().reset_index()
latest_total = latest_total[latest_total['date']==max(latest_total['date'])].reset_index(drop=True)
latest_total

Top 10 states by cases, deaths, and death percentage as of 03-30-2020

# Extract the cases and deaths reported for 2020-03-30 using loc.
present_stats = df.loc[df['date'] == '2020-03-30', ['date', 'state', 'cases', 'deaths']]

# Present death percentage
present_stats['death percentage'] = (present_stats['deaths'] / present_stats['cases']) * 100

# Sort in descending order
present_stats = present_stats.sort_values(by=['cases', 'deaths', 'death percentage'], ascending=False)

present_stats.head(10)
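The cell above computes the death percentage only. New daily counts and a case growth rate could be derived from the same columns, assuming `cases` and `deaths` are cumulative per state. The sketch below uses a made-up `sample` frame; the column names `new cases`, `new deaths`, and `case growth rate` are choices made here, not names from the notebook.

```python
import pandas as pd

# Made-up cumulative counts for two states over three days
sample = pd.DataFrame({
    "date": ["2020-03-28", "2020-03-29", "2020-03-30"] * 2,
    "state": ["A"] * 3 + ["B"] * 3,
    "cases": [100, 120, 150, 10, 20, 30],
    "deaths": [1, 2, 4, 0, 0, 1],
})

# Order rows by date within each state so consecutive rows are consecutive days
sample = sample.sort_values(["state", "date"]).reset_index(drop=True)

# New counts per day: difference between consecutive cumulative values
sample["new cases"] = sample.groupby("state")["cases"].diff()
sample["new deaths"] = sample.groupby("state")["deaths"].diff()

# Growth rate: percentage change of the cumulative case count from the previous day
sample["case growth rate"] = sample.groupby("state")["cases"].pct_change() * 100

# Keep the most recent date and rank states by new cases
latest = sample[sample["date"] == sample["date"].max()]
print(latest.sort_values("new cases", ascending=False))
```

A death growth rate can be computed the same way, though it is undefined or infinite wherever the previous day's count is zero, so those rows would need separate handling.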

# Plot the death percentage for the 10 states with the most cases
present_stats.head(10).plot(kind='bar', x='state', y='death percentage', figsize=(10, 6))
plt.ylabel('Death percentage (%)')

# Set the plot title
plt.title('Death percentage in the 10 states with the most cases as of 03-30-2020')
plt.show()

# !pip install plotly
# !conda install psutil --yes

import plotly.express as px

# The labels dict maps column names to the text shown on axes and in hover boxes
fig = px.bar(df, x='date', y='cases', color='state', labels={'cases': 'Cases'},
             hover_data=['state'],
             title='Evolution of Reported COVID-19 Cases in the United States')
fig.show()

fig = px.bar(df, x='date', y='deaths', color='state', labels={'deaths': 'Deaths'},
             hover_data=['state'],
             title='Evolution of Reported COVID-19 Deaths in the United States')
fig.show()

# Tree Map Visualization of COVID-19 Cases by Date and State
fig = px.treemap(df.sort_values(by='cases', ascending=False).reset_index(drop=True), 
                 path=["state", "date"], values="cases", height=700,
                 title='Number of COVID-19 Cases by State and Date',
                 color_discrete_sequence = px.colors.qualitative.Prism)
fig.data[0].textinfo = 'label+text+value'
fig.show()

# Tree Map Visualization of COVID-19 Deaths by State and Date
fig = px.treemap(df.sort_values(by='deaths', ascending=False).reset_index(drop=True), 
                 path=["state", "date"], values="deaths", height=700,
                 title='Number of deaths from COVID-19 by State and Date',
                 color_discrete_sequence = px.colors.qualitative.Prism)
fig.data[0].textinfo = 'label+text+value'
fig.show()

EDA for U.S. Counties

df = pd.read_csv('util/data/us-counties-03-30-20.csv')

df.head()

df.shape

(21799, 6)

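# Sum the cases and deaths over all dates for each county name.
# Note: counties that share a name across states are merged by this grouping.
latest_sum = df.groupby(['county'])[['cases', 'deaths']].agg('sum')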

# Sort in descending order
latest_sum = latest_sum.sort_values(by=['cases', 'deaths'], ascending=False)

latest_sum.head(10)

# Plot the result
latest_sum.head(10).plot(kind='bar', figsize=(10, 6))
plt.ylabel('Count')
plt.title('Top 10 counties with the most cases and deaths as of 03-30-2020')
plt.show()
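County names are not unique across states, so grouping on `county` alone merges unrelated counties. One possible alternative, sketched here on made-up rows, is to group on `state` and `county` together; the county file's `fips` column could serve the same purpose where it is populated.

```python
import pandas as pd

# Made-up rows: two different counties named Wayne, in different states
sample = pd.DataFrame({
    "date": ["2020-03-30"] * 3,
    "county": ["Wayne", "Wayne", "Cook"],
    "state": ["Michigan", "Ohio", "Illinois"],
    "cases": [50, 5, 30],    # illustrative values only
    "deaths": [3, 0, 1],
})

# Grouping by name alone adds the two Wayne counties together
by_name = sample.groupby("county")[["cases", "deaths"]].sum()

# Grouping by state and county keeps them separate
by_state_county = sample.groupby(["state", "county"])[["cases", "deaths"]].sum()

print(by_name)
print(by_state_county)
```

The same grouping would also keep rows labeled "Unknown" separated by state rather than pooled into a single entry.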

fig = px.bar(df, x='date', y='cases', color='county', labels={'cases': 'Cases'},
             hover_data=['county'],
             title='Evolution of Reported COVID-19 Cases in the United States Counties')
fig.show()

fig = px.bar(df, x='date', y='deaths', color='county', labels={'deaths': 'Deaths'},
             hover_data=['county'],
             title='Evolution of Reported COVID-19 Deaths in the United States Counties')
fig.show()

# Tree Map Visualization of COVID-19 Cases by County and Date
fig = px.treemap(df.sort_values(by='cases', ascending=False).reset_index(drop=True),
                 path=["county", "date"], values="cases", height=700,
                 title='Number of COVID-19 Cases by County and Date',
                 color_discrete_sequence = px.colors.qualitative.Prism)
fig.data[0].textinfo = 'label+text+value'
fig.show()

# Tree Map Visualization of COVID-19 Deaths by County and Date
fig = px.treemap(df.sort_values(by='deaths', ascending=False).reset_index(drop=True), 
                 path=["county", "date"], values="deaths", height=700,
                 title='Number of deaths from COVID-19 by County and Date',
                 color_discrete_sequence = px.colors.qualitative.Prism)
fig.data[0].textinfo = 'label+text+value'
fig.show()

Output of df.head() for the state data:
	date	state	fips	cases
0	2020-01-21	Washington	53	1
1	2020-01-22	Washington	53	1
2	2020-01-23	Washington	53	1
3	2020-01-24	Illinois	17	1
4	2020-01-24	Washington	53	1

Output of latest_sum.head(10) for the state data:
	cases	deaths
state
New York	387714	4991
New Jersey	73838	897
California	47065	921
Washington	40990	2145
Michigan	31601	702
Florida	28846	401
Massachusetts	27827	252
Illinois	27243	342
Louisiana	23678	896
Pennsylvania	18863	194

Output of present_stats.head(10):
	date	state	cases	deaths	death percentage
1532	2020-03-30	New York	67174	1224	1.822134
1530	2020-03-30	New Jersey	16636	199	1.196201
1503	2020-03-30	California	7421	146	1.967390
1522	2020-03-30	Michigan	6508	197	3.027044
1521	2020-03-30	Massachusetts	5752	61	1.060501
1508	2020-03-30	Florida	5694	71	1.246927
1550	2020-03-30	Washington	5179	221	4.267233
1513	2020-03-30	Illinois	5070	84	1.656805
1539	2020-03-30	Pennsylvania	4156	48	1.154957
1518	2020-03-30	Louisiana	4025	186	4.621118

Output of df.head() for the county data:
	date	county	state	fips	cases
0	2020-01-21	Snohomish	Washington	53061.0	1
1	2020-01-22	Snohomish	Washington	53061.0	1
2	2020-01-23	Snohomish	Washington	53061.0	1
3	2020-01-24	Cook	Illinois	17031.0	1
4	2020-01-24	Snohomish	Washington	53061.0	1

Output of latest_sum.head(10) for the county data:
	cases	deaths
county
New York City	223933	4047
Westchester	57820	69
Nassau	41585	228
Suffolk	34767	280
Unknown	26922	276
King	21128	1682
Cook	20221	201
Wayne	15509	306
Bergen	13194	247
Los Angeles	13173	214