Future waves of COVID-19 might be predicted using internet search data, according to a study published in the journal Scientific Reports.
In the study, researchers tracked the number of COVID-related Google searches made across the country and used that information, together with conventional COVID-19 metrics such as confirmed cases, to predict hospital admission rates weeks in advance.
Google Trends is an online portal that provides data on Google search volumes in real time. Using search data from the portal, scientists built a computational model to forecast COVID-19 hospitalizations.
“If you have a bunch of people searching for ‘COVID testing sites near me’ … you’re going to still feel the effects of that downstream at the hospital level in terms of admissions,” said data scientist Philip Turk of the University of Mississippi Medical Center, who was not involved in the study. “That gives health care administrators and leaders advance warning to prepare for surges — to stock up on personal protective equipment and staffing and to anticipate a surge coming at them.”
For predictions one or two weeks in advance, the new computer model stacks up well against existing ones. It beats the U.S. Centers for Disease Control and Prevention’s “national ensemble” forecast, which combines models made by many research teams — though there are some single models that outperform it.
According to study co-author Shihao Yang, a data scientist at the Georgia Institute of Technology, the new model’s value is its unique perspective — a data source that is independent of conventional metrics. Yang is working to add the new model to the CDC’s COVID-19 forecasting hub.
Watching trends in how often people Google certain terms, like “cough” or “COVID-19 vaccine,” could help fill in the gaps in places with sparse testing or weak health care systems.
Yang also thinks that his model will be especially useful when new variants pop up. It did a good job of predicting spikes in hospitalizations thought to be associated with new variants such as omicron, without the time delays typical of many other models.
“It’s like an earthquake,” Yang said. “Google search will tell me a few hours ahead that a tsunami is hitting. … A few hours is enough for me to get prepared, allocate resources and inform my staff. I think that’s the information that we are providing here. It’s that window from the earthquake to when the tsunami hit the shore where my model really shines.”
The model considers Google search volumes for 256 COVID-19-specific terms, such as “loss of taste,” “COVID-19 vaccine” and “cough,” together with core statistics like case counts and vaccination rates. It also has temporal and spatial components — terms representing the delay between today’s data and the future hospitalizations it predicts, and how closely connected different states are.
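To make the idea of a temporal delay concrete, here is a minimal Python sketch of how today's search volume might be lined up with hospital admissions some days later. It is an illustration only, not the study's method: the function name `lagged_pairs`, the `horizon` parameter and the numbers are all made up for the example.

```python
def lagged_pairs(search, admissions, horizon):
    """Pair each day's search volume with admissions `horizon` days later.

    Hypothetical helper for illustration; the study's actual model is
    more elaborate and is not reproduced here.
    """
    if horizon < 0:
        raise ValueError("horizon must be non-negative")
    # Only days that still have an admissions value `horizon` days ahead.
    usable_days = min(len(search), len(admissions)) - horizon
    return [(search[t], admissions[t + horizon]) for t in range(max(usable_days, 0))]


# Made-up daily series, not real data.
search_volume = [1, 2, 3, 4, 5]
hospital_admissions = [10, 20, 30, 40, 50]

# Each search value is matched with admissions two days later.
pairs = lagged_pairs(search_volume, hospital_admissions, horizon=2)
```

A forecasting model would then be trained on pairs like these, so that a search value seen today is tied to an outcome observed later.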
Every week, the model retrains itself using the past 56 days’ worth of data. This keeps the model from being weighed down by older data that don’t reflect how the virus acts now.
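The weekly retraining on a trailing 56-day window can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: a simple straight-line fit stands in for the real model, and the helper names (`training_window`, `fit_line`, `weekly_refits`) and the synthetic data are invented for the example. Only the 56-day window and the weekly schedule come from the description above.

```python
from datetime import date, timedelta

WINDOW_DAYS = 56  # the past 56 days' worth of data, as described above


def training_window(records, today, window_days=WINDOW_DAYS):
    """Keep only records dated within the `window_days` days before `today`."""
    start = today - timedelta(days=window_days)
    return [r for r in records if start <= r[0] < today]


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x (a stand-in, not the study's model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def weekly_refits(records, first_day, last_day):
    """Refit every 7 days, each time using only the trailing window."""
    fits = {}
    day = first_day
    while day <= last_day:
        window = training_window(records, day)
        if len(window) >= 2:
            xs = [r[1] for r in window]
            ys = [r[2] for r in window]
            fits[day] = fit_line(xs, ys)
        day += timedelta(days=7)
    return fits


# Synthetic daily records: (date, search_volume, hospital_admissions).
start = date(2021, 1, 1)
records = [(start + timedelta(days=i), float(i), 2.0 * i + 5.0) for i in range(120)]
fits = weekly_refits(records, start + timedelta(days=56), start + timedelta(days=119))
```

Because each refit sees only the most recent window, data older than 56 days drops out of the fit automatically.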
Turk previously developed a different model to predict COVID-19 hospitalizations on a local level for the Charlotte, North Carolina, metropolitan area. The new model developed by Yang and his colleagues uses a different method and is the first to make state- and national-level predictions using search data.
Turk was surprised by “just how harmonious” the result was with his earlier work.
“I mean, they’re basically looking at two different models, two different paths,” he said. “It’s a great example of science coming together.”
Using Google search data to make public health forecasts has downsides. For one, Google could stop allowing researchers to use the data at any time, something Yang admits is concerning to his colleagues.
‘Noise’ in searches
Additionally, search data are messy, with lots of random behavior that researchers call “noise,” and the quality varies regionally, so the information needs to be smoothed out during analysis using statistical methods.
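One common way to smooth a noisy series is a trailing moving average, sketched below in Python. The article does not say which statistical methods the researchers used, so this technique is an assumption chosen for illustration, and the function name `moving_average` and the sample numbers are made up.

```python
def moving_average(values, window=7):
    """Trailing moving average; early points average over what is available.

    Illustrative smoothing only; not necessarily the study's method.
    """
    smoothed = []
    for i in range(len(values)):
        # Take up to `window` values ending at position i.
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Made-up daily search counts with one random spike (30).
noisy = [10, 14, 9, 13, 11, 30, 12, 10, 13, 11]
smooth = moving_average(noisy, window=3)
```

The one-day spike is spread across neighboring days, so the smoothed series never climbs as high as the raw one.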
Local linguistic quirks can introduce problems, Yang said, because people from different regions sometimes use different words to describe the same thing. Media coverage can do the same when it either raises or calms pandemic fears. Privacy protections also introduce complications: user data are aggregated and injected with extra noise before publishing, a protection that makes it impossible to fish out individual users’ information from the public dataset.
Running the model with search data alone didn’t work as well as the model with search data and conventional metrics. Taking out search data and using only conventional COVID-19 metrics to make predictions also hurt the new model’s performance. This indicates that, for this model, the magic is in the mix — both conventional COVID-19 metrics and Google Trends data contain information that is useful for predicting hospitalizations.
“The fact that the data is valuable, and [the] data [is] difficult to process are two independent questions. There [is] information in there,” Yang said. “I can talk to my mom about this. It’s very simple, just intuitive. … If we are able to capture that intuition, I think that’s what makes things work.”