Google’s ambitious nowcasting effort made waves when it launched in 2008, and again when it was quietly shelved seven years later. Its premise was that real-time flu infection rates could be closely estimated from internet search activity rather than from actual counts reported by hospitals and clinics.

This was significant for several reasons. Previously, flu infection rates were based on counts gathered and published by the Centers for Disease Control and Prevention (CDC), drawing on updates from medical facilities around the country. The inevitable lag was around 3 weeks, but the information was known to be reasonably accurate.

Google’s innovation sought to eliminate the lag while preserving or even improving accuracy by correlating flu infection rates with internet search behaviour. Google believed, probably correctly, that flu sufferers were more likely to search the internet for their symptoms than they were to make the trip to a clinic, and to do so earlier and in greater numbers. And by aggregating the far bigger data generated by online search, Google could potentially generate a higher-resolution estimate or “nowcast” of real-time flu conditions.

At first, the results were exciting. Google’s Flu Trends data lined up with the CDC’s. But two significant misses tanked the original effort: underestimating the 2009 H1N1 outbreak, and then significantly overestimating the 2012–13 flu season. While different accounts tell the story differently, it seems the model didn’t sufficiently account for users’ shifting search behaviour.

That's not surprising, given that any comparative analysis requires some level of longitudinal consistency, and though the internet seems to have been with us forever, these are still relatively early days.

Not much was known about Google’s methodology - Flu Trends was a black box. Nearly all agree, however, that its problems lay in the model’s execution, not its conception.

What’s most significant about Flu Trends, however, is that it was among the first efforts to produce accurate, current information validated not by causal reasoning, but by correlation backed by sheer volume. It substituted sample size for observational control.

Flu Trends helped to herald big data as a living example of the potential value of our data exhaust. Our descriptive and predictive capacity need no longer be limited to direct cause-and-effect, but can now be based on correlations uncovered in huge reams of disparate data. And that opens up previously insurmountable tasks such as nowcasting and forecasting to anyone with access to a datastream and a model.
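The correlation-in-place-of-causation idea can be made concrete with a minimal sketch. Google’s actual model was never published in full, so the data and the single-predictor setup below are purely hypothetical: we fit an ordinary least-squares line mapping a flu-related query’s weekly search share to the CDC’s (lagged) reported illness rate, then use this week’s search share to estimate the rate weeks before the official count arrives.

```python
# Hypothetical training data: weekly search share (%) for a flu-related
# query, paired with the CDC's influenza-like-illness (ILI) rate for the
# same weeks (the latter published only weeks later).
search_share = [0.8, 1.1, 1.6, 2.4, 3.0, 2.2, 1.4]
ili_rate     = [1.0, 1.3, 1.9, 2.8, 3.5, 2.6, 1.7]

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

a, b = fit_line(search_share, ili_rate)

# The nowcast: this week's search share is known immediately, so the
# model estimates the ILI rate without waiting for the CDC's report.
this_week_share = 2.0
nowcast = a * this_week_share + b
```

No causal claim is made anywhere in this sketch: the model works only as long as the historical correlation between searching and being ill holds, which is exactly the assumption that broke down for Flu Trends.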

Research teams are actively working on refinements to enable accurate surveillance and forecasting of disease infection around the world. Models are being augmented to consider not just one or two factors, but disease-specific characteristics such as infection rates and incubation periods; climate, weather and historical trends; and immunity rates derived from live datastreams, in much the same way that traffic jams and congestion are predicted under traffic models. Machine learning will enable models to self-adjust based on the accuracy of their past reports.
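The self-adjustment idea can be sketched in a few lines. None of this reflects any published model; the data and the single-weight setup are invented for illustration. Each week the model nowcasts from the search signal, and once the true rate is eventually published, it nudges its weight in the direction that would have shrunk that week’s error (an online gradient step):

```python
# Hypothetical self-adjusting nowcast: one multiplicative weight,
# corrected after each week's ground truth arrives.
weight = 1.0          # multiplier applied to the raw search signal
learning_rate = 0.05

weeks = [  # (search signal, true rate published weeks later) -- made-up data
    (1.2, 1.5), (1.8, 2.3), (2.5, 3.1), (2.0, 2.4),
]

for signal, truth in weeks:
    nowcast = weight * signal                 # estimate made in real time
    error = nowcast - truth                   # known only once the report lands
    weight -= learning_rate * error * signal  # self-adjustment step
```

Because the invented "true" rates run higher than the raw signal, the weight drifts upward over the four weeks - the same mechanism that could, in principle, have tracked the shifting search behaviour that undid the original Flu Trends model.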

Though the public-facing service has been retired, Google Flu Trends lives on through the historical data Google continues to share with researchers, and it allows us to glimpse how our information will increasingly come to be used and produced.

See more examples of Machine Learning in our Everyday Encounters blog series >>