Our model predicts the average daily new cases per 100,000 people that should be expected over the next week from the prediction date. Where validation data are available, the "r" value is the Pearson correlation coefficient between predicted and observed values. The latest predictions cover the next 14 days and have not yet been validated.
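As a rough illustration of how an "r" value of this kind can be computed, the sketch below implements the Pearson correlation coefficient in plain Python. The function name `pearson_r` and its arguments are hypothetical and are not taken from our pipeline.

```python
import math

def pearson_r(predicted, observed):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    # Covariance and variances, left unnormalized because the 1/n factors cancel.
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined when either sequence is constant")
    return cov / math.sqrt(var_p * var_o)
```

A value near 1 means that predictions rise and fall together with the observed case rates; it does not by itself show that the predicted magnitudes are correct.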
We utilize machine learning on a dataset of relevant predictors of COVID-19 outbreaks or potential for outbreaks at a U.S. county level, generating a regression model that predicts future daily case counts.
Socioeconomic Data: We utilize county-level socioeconomic metrics including the CCVI from the Surgo Foundation, the proportions of elderly, Black, Hispanic, and male inhabitants, and population density from the US Census, as these factors are closely linked to COVID-19 outbreak risk.
Health Data: We utilize county-level data on various diseases' mortality rates and smoking prevalence statistics from IHME to understand the population susceptibility to COVID-19.
Rt Data: We use state-level estimates of the effective reproduction number (Rt) of COVID-19, computed with standard epidemiological models by covid19-projections.com and rt.live, to understand the level of statewide transmission.
Testing Data: We use state-level data on total tests conducted and positive tests received from covidtracking.com, smoothed with a 7-day moving average, to understand current diagnostic efforts.
Cases Data: We use county-level data on the cases reported in each county every day from Johns Hopkins University. This data is smoothed with a 7-day moving average to understand epidemics at the local level.
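The 7-day smoothing applied to the testing and cases data, and the conversion to a per-100,000 rate, can be sketched as follows. This is a minimal illustration that assumes a trailing window (each day is averaged with the six days before it); whether the window is trailing or centered is not stated above, and the function names are hypothetical.

```python
def moving_average(values, window=7):
    """Trailing moving average; early days average over the days available so far."""
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that has just fallen out of the window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed

def per_100k(cases, population):
    """Convert a raw case count into cases per 100,000 inhabitants."""
    return cases * 100_000 / population
```

Smoothing of this kind dampens weekly reporting artifacts, such as lower counts on weekends, at the cost of reacting more slowly to sudden changes.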
Feature Engineering: Our raw dataset includes a wide range of statistics that are linked to COVID-19 for dates ranging from March 2020 to the present. We are actively tuning our models with dimensionality reduction and normalization techniques to improve performance and reduce overfitting.
Algorithms: We currently utilize Random Forests and Artificial Neural Networks on our datasets. These regression models output the projected average daily cases per 100,000 for the next 2 weeks. We are still developing our prediction framework and are working with multiple other models.
Mobility: A challenge we encountered is that mobility data is unavailable for some rural counties. To maximize the population that we serve, we use two models, one of which incorporates mobility data and the other of which does not. If a mobility-based prediction is not available for a particular county on a particular date, the non-mobility prediction is used instead.
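The fallback between the two models can be sketched roughly as below. The dictionary layout, keyed by a (county FIPS code, date) pair, and the name `combine_predictions` are assumptions made for illustration only.

```python
def combine_predictions(mobility_preds, baseline_preds):
    """Prefer the mobility-model prediction; fall back to the non-mobility one.

    Both arguments map a (county_fips, date) key to a predicted rate.
    A missing key or a value of None in mobility_preds counts as unavailable.
    """
    combined = {}
    for key, baseline in baseline_preds.items():
        mobility = mobility_preds.get(key)
        combined[key] = mobility if mobility is not None else baseline
    return combined
```

Iterating over the non-mobility predictions ensures that every county covered by the non-mobility model appears in the output, including counties with no mobility data at all.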
Performance: Our latest model performs comparably on its training dataset (a 90% subset of the whole dataset) and a validation dataset (the remaining 10% of the whole dataset), as shown in the table below. Mean Absolute Error (cases per 100,000) is abbreviated as "MAE" and would ideally be < 1. The coefficient of determination is abbreviated as "R2" and would ideally be > 0.9.
| Model | Training R2 | Training MAE | Validation R2 | Validation MAE |
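For readers unfamiliar with these metrics, the sketch below shows one way to make a 90%/10% split and to compute MAE and R2 in plain Python. It is an illustration under stated assumptions rather than our actual evaluation code: the split is assumed to be a random shuffle, and all function names are hypothetical.

```python
import random

def train_validation_split(rows, validation_fraction=0.1, seed=0):
    """Shuffle rows and hold out a fraction of them for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]

def mean_absolute_error(observed, predicted):
    """Average absolute difference, in the units of the target (cases per 100,000)."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_o = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_o) ** 2 for o in observed)
    return 1 - ss_res / ss_tot
```

An R2 of 1 corresponds to perfect predictions, while an R2 of 0 means the model does no better than always predicting the mean of the observed values. Similar scores on the training and validation subsets suggest that the model is not badly overfit.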
A machine learning model is only as reliable as the data used to train it. We have done our best to obtain the most reliable datasets available that are relevant to the current COVID-19 epidemic in the United States. However, these datasets may still contain errors or skewing factors that are beyond our control.
Case counts per county may be skewed by inaccurate test results and disparities in testing capacity, especially in rural vs. urban regions. We attempt to factor this skew in by including testing datasets, but these are at the state level. In fact, many of our other datasets, such as Rt values, are also at the state level, and may not be very applicable to some counties.
Our model and almost all other epidemiological models for the US only take into account case data based on testing, so their predictions are based on a limited view of the actual scope of the epidemic.
We have chosen not to build a traditional epidemiological model due to scarcity of necessary data and our desire to easily take into account a diverse set of relevant risk factors at the county level. While our model has shown good performance thus far, please focus on the general trends of the predictions instead of taking them too literally.
We hope to release a licensed version of our source code in the coming weeks to aid other researchers once we are certain that our pipeline is well-optimized and bug-free.
We are working hard to make sure that every county in the US is represented in our dataset.
Please use the form at the top of this page to contact us with any questions, comments, or concerns.
COVID-19 has more adverse effects on certain populations within the U.S. For instance, men experience higher mortality than women and minorities like African Americans/Blacks experience higher mortality than Caucasians and Asians. Urban regions experience larger outbreaks than rural regions, but rural regions may have lower access to testing.
Consequently, our model takes into account racial and socioeconomic factors for every county to more accurately predict outbreaks.