PREreview of Influence of social determinants of health and county vaccination rates on machine learning models to predict COVID-19 case growth in Tennessee
- Published
- DOI: 10.5281/zenodo.5551162
- License: CC BY 4.0
This review is the result of a virtual, live-streamed preprint journal club organized and hosted by PREreview and OHSU’s BioData Club. The discussion was joined by 9 people, including OHSU researchers and the event organizing team.
Wylezinski et al. investigated the impact of clinical and social determinants of health (SDOH) risk factors on COVID-19 case growth in Tennessee. To that end, they used a variety of publicly available data to train machine learning (ML) models to predict COVID-19 rates, and ranked each SDOH factor’s impact on the model’s performance. The study reports that COVID-19 case growth data reflect disparities in socioeconomic, environmental, demographic, and health outcomes (particularly mental health). Such approaches could benefit community, policy, and research responses to disease outbreaks, COVID-19, and health disparities more generally. Our group particularly appreciated the use of openly available datasets. We do, however, have major concerns about the SDOH risk factors used, and the descriptions of the dataset and the ML models could benefit from more detail. Major and minor concerns, as well as some suggestions on how to address them, are listed below.
Major concerns and feedback
- The SDOH risk factors caused confusion because they were not clearly defined and no explanation of how they were measured was provided. Additionally, several distinct SDOH risk factors, such as race/ethnicity and poverty, were combined with no explanation. We believe that listing all SDOH risk factors, their definitions, sources, and how they were measured in a table would greatly help with understanding the results and their implications for policy.
- The description of the data is sparse. A descriptive table of the data used for training the models would be helpful. Also, the feature sets used for the ML models seem to be highly correlated, but there is no description of the methods used to account for such variables. Stating the results for each ML model used would be beneficial. Along the same lines, the manuscript mentions that multiple methods were used for training, but it is unclear which method is illustrated in the figures. We recommend stating that explicitly in the text and figure captions. While the methods section states that cross-validation and hold-out methods were used, there are no further details on how they were implemented. We recommend stating the cross-validation method used and the number of samples that were held out. We also suggest naming the software package used to implement the ML models and, if existing packages were used, citing the source.
- Some of the data are difficult to interpret (see minor comments below), and it is therefore hard to evaluate whether the conclusions are supported by the findings. Some conclusions seem overstated when correlating vaccination status and infection rates. We think it would be appropriate to present the conclusions with more transparency and with acknowledgment of their limitations. It would also be helpful to show counties with low infection rates color-coded by vaccination rate (complementary to Supplementary Figure 1C).
- There is not sufficient detail to allow reproduction and validation of the study without requesting more information from the authors, and it is unclear what qualifies as a “reasonable request”. It would be helpful to include information on the software package used, input samples and summary statistics, input features used, etc.
- In Figure 1, the color coding and size are difficult to interpret. It would be useful to have the captions better explain how the reader is supposed to interpret the visualization. Related to this, it is unclear what the difference (conceptually) is between blue and black dots. Figure 1 could be reconfigured, maybe with the addition of numbers and/or a smaller summary figure, to better display data and consistency.
- There are inconsistencies between the data presented in Figure 1 and the text. Specifically, the ‘race and ethnicity’ SDOH risk factor increases in the figure while the text states that it decreases, and 7 timepoints are shown in the figure while the text mentions that 13 timepoints were taken. We recommend resolving these inconsistencies for easier understanding.
- SDOH risk factors in Figure 1 could be correlated, but there seem to be no methods to control for this. It would be helpful to know whether this possible confounding effect was controlled for.
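As an illustration of the kind of reporting requested above (a screen for highly correlated features, plus an explicit statement of the hold-out size and cross-validation folds), a minimal Python sketch follows. It is not the authors' method: the feature names, the values, the 0.8 threshold, the hold-out size of 19, and the five folds are all hypothetical choices made for illustration, and the only fact assumed about the data is that Tennessee has 95 counties.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlated_pairs(features, threshold=0.8):
    """Return (name_a, name_b, r) for feature pairs with |r| >= threshold."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(features.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


def holdout_and_folds(n_samples, holdout_size, k, seed=0):
    """Shuffle sample indices, set aside a hold-out set, split the rest into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    holdout, rest = indices[:holdout_size], indices[holdout_size:]
    folds = [rest[i::k] for i in range(k)]
    return holdout, folds


# Hypothetical county-level values, invented purely for illustration.
features = {
    "poverty_rate": [10.0, 15.0, 20.0, 25.0, 30.0],
    "uninsured_rate": [5.0, 7.5, 10.0, 12.5, 15.0],
    "vaccination_rate": [40.0, 35.0, 42.0, 30.0, 38.0],
}
flagged = correlated_pairs(features)
print("Highly correlated feature pairs:", flagged)

# Illustrative split of 95 units (one per Tennessee county).
holdout, folds = holdout_and_folds(n_samples=95, holdout_size=19, k=5)
print("Hold-out size:", len(holdout))
print("Fold sizes:", [len(fold) for fold in folds])
```

Reporting the flagged pairs, the hold-out size, and the fold sizes in the methods section would let readers judge how correlated inputs and validation choices may have influenced the feature rankings.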
Minor concerns and feedback
- It is unclear whether data and variable definitions were consistent across the datasets used. Any differences in data collection between samples may require controlling for site differences. It would be helpful to state in the methods section whether all features in Figure 1 were used in the ML model.
- It is unclear how county data was aggregated into groups. We recommend either being more descriptive or finding a simpler way to present some of the data (e.g., top 5 counties with the highest increase or decrease in infection rates).
- The ethical concerns surrounding this study have not been adequately discussed. Specifically, it would be important to discuss the study's questions and methodology with respect to recommendations and discussions about public health research involving race (e.g., https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2837428). It is unclear whether and how using multiple datasets reduces implicit bias from the researchers or from the datasets. A more extended explanation of this point would be useful.
- Is there any reason why the Supplementary figures are not the main figures in the paper? We recommend moving them to the main paper as they present key information needed to interpret the results.
- In Supplementary Figure 2C, the legend's use of “Best” and “Worst” reads as biased and is inconsistent with the “Highest” and “Lowest” language used elsewhere. We recommend updating this.
- Supplementary Figure 2 could benefit from being replaced by a textual description and/or possibly from adding a figure showing the vaccination rate for each SDOH risk factor used.
- We recommend stating study limitations clearly in the conclusions. For instance, the different coronavirus variants were not taken into account in the prediction, possibly because of a lack of data, yet those undoubtedly would impact infection rates. Similarly, there is no explanation for why the specific time frame right after the onset of vaccine rollout was chosen; this period of time is rather short and involves a relatively small amount of data. It would be important to reflect on how these choices might have impacted the predictive models.
We thank the authors for posting this work as a preprint and hope our feedback will help improve the next version of the manuscript.
Acknowledgments
The organizing team is grateful to all the participants of the PREreview + BioData Club’s Open Reviewers Workshop. We especially thank those who engaged in the live-streamed preprint journal club discussion held during our last module of the workshop. It was a pleasure to have such a lively group.