DC Area COVID Estimator

This project attempts to estimate the rolling 14-day COVID-19 case load per 100,000 people in Washington, DC. A second goal is to give a snapshot of the DC-Metro area’s COVID risk as public sources of this data become harder to find through conventional means (e.g. newspapers or websites) and as public health authorities scale back their case reporting. The project was launched in response to DC's announcement that it would scale back its COVID reporting to once weekly.

  • COVID Tracker Prediction 9/6/2022

    The multiple linear regression model is trained on case data from the surrounding jurisdictions, with DC's COVID cases per 100,000 as the target variable. To predict yesterday’s COVID level in DC, the model needs the surrounding jurisdictions' data for that day.

    Fortunately, the NYTimes usually updates these numbers early in the day, whereas DC typically reports its own numbers around 1 pm Eastern:

    The model easily makes a prediction for 9/6 numbers in DC using the .predict() method in statsmodels:

    The actual DC numbers for 9/6 are:

    The model is off by almost 10 cases per 100,000 people, which is not a great result. The next step will be to explore other machine learning methods, starting with a simple and interpretable model: the decision tree.

  • Linear Regression

    The multiple linear regression estimator pulls COVID caseloads for the surrounding jurisdictions from the New York Times COVID tracker repository on GitHub. Population estimates were gathered manually from the Census Bureau:

    • Fairfax, Virginia: 1,150,309
    • Montgomery, Maryland: 1,051,000
    • Prince George’s, Maryland: 909,327
    • Falls Church city, Virginia: 14,658
    • Arlington, Virginia: 232,965
    • Prince William, Virginia: 470,335
    • Loudoun, Virginia: 413,538
    • Alexandria city, Virginia: 160,505
    • Charles, Maryland: 166,617
    • Anne Arundel, Maryland: 579,234

    The tracker also pulls DC-specific caseloads from the CDC API endpoint for state data. These serve as the target variable of the regression.

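    The raw case numbers are cumulative and need to be transformed into rolling daily case numbers per 100,000 in two steps:

    1. pandas .shift(1), to line up each day's cumulative total with the previous day's so that their difference gives the daily new cases
    2. pandas .rolling(14).mean() on the daily cases, divided by the locale population and multiplied by 100,000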
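
    The two steps above can be sketched as follows. This assumes the `.shift(1)` step is used to difference the cumulative totals into daily new cases. The function name and the input series are hypothetical, and the counts are made up. Only the Arlington population figure comes from the list above.

```python
import pandas as pd

def per_100k_rolling(cumulative: pd.Series, population: int,
                     window: int = 14) -> pd.Series:
    """Turn cumulative case counts into a rolling mean of daily cases per 100,000."""
    # Step 1: daily new cases = today's cumulative total minus yesterday's.
    daily = cumulative - cumulative.shift(1)
    # Step 2: rolling mean over the window, scaled to cases per 100,000 residents.
    return daily.rolling(window).mean() / population * 100_000

# Made-up cumulative counts growing by 50 cases per day (illustration only).
dates = pd.date_range("2022-08-01", periods=20, freq="D")
cumulative = pd.Series(range(1000, 2000, 50), index=dates, dtype=float)

rate = per_100k_rolling(cumulative, population=232_965)
print(rate.dropna().head())
```

    The first 14 values are missing: one day is lost to the differencing, and the rolling window needs 14 complete daily values before it produces a mean.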

    The transformed series for the ten locales then form the predictor matrix (X) of a multiple linear regression, fitted with the statsmodels package in Python with an added constant term. The target (y) variable is DC's actual reported cases, put through the same transformation.

    Two statistics are used to measure performance: R² and root mean squared error (RMSE). R² has been above 0.95 since the project began, meaning the model explains more than 95% of the variance in the actual COVID values. RMSE has been around 8.75 since the project began, which can be read as a typical estimation error of about 8.75 cases per 100,000.
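
    For reference, both statistics can be computed directly from actual and predicted values. The sketch below uses the standard definitions with made-up numbers. The function names are hypothetical and are not taken from the project code.

```python
import numpy as np

def r_squared(actual, predicted) -> float:
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def rmse(actual, predicted) -> float:
    """Root mean squared error, in the same units as the target."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up values for illustration only.
actual = [100.0, 110.0, 120.0, 130.0]
predicted = [102.0, 108.0, 123.0, 127.0]

print(r_squared(actual, predicted))
print(rmse(actual, predicted))
```

    Because RMSE squares the residuals before averaging, a few large misses raise it more than many small ones do.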