Answered step by step
Verified Expert Solution
Question
1 Approved Answer
Big Data: The Case of Google Flu Trends Assignment Instructions Influenza remains a world - wide health problem. In the U . S . ,
Big Data: The Case of Google Flu Trends
Assignment Instructions
Influenza remains a worldwide health problem. In the US the Center for Disease Control and Prevention CDC gets reports from doctors about visits in which the patient complains of flulike symptoms. CDC compiles these reports into regional and national early flu warnings. The CDC reports quantify influenzalike cases by week, by region and nationally. The reporting and analysis time required means that CDC reports will lag actual events.
Those who are sick may seek help by using social media queries to gain information about their symptoms. In Google scientists realized that such queries might be immediately useful in predicting the prevalence of flu. I.e rather than wait for CDC to announce the onset of a flu breakout, detect the onset immediately by analyzing what people are asking about online. Google called its proposed nowcasting system Google Flu Trends GFT
The appeal of this datadriven approach should be evident. Google had a large body of searches that were launched by people with the relevant symptoms. Thus, the proposed system could be developed quickly at low cost and might provide near realtime information. The proposed system could possibly be as accurate as the CDC system and it would be quicker. Possibly, web searches could come from more people than those who actually seek medical help, and so GFT might be based on a greater amount of information.
Google gathered million healthrelated searches from the period. Google wrote software that correlated the frequency of specific terms in web searches with the frequency of influenzalike cases reported by CDC in the period. From this mass of data, Google was thus able to identify search terms that seemed indicative of influenza cases. In this module, you learned that linear regression is a statistical tool that data scientists use to develop equations that are used to make predictions. In this case, the goal was to develop equations that could predict the actual incidence of influenzalike cases reported by the CDC
We should stop to think about what this bigdata analysis actually entails. Assume a person Google searched I feel lousy today, cant stop coughing and ache all over, what should I do about that? Does that person have influenza? Or a bad cold? What if the person Google searched merely I feel lousy, what should I do Is that search a predictor of anything? Googles software sifted through million such entries, to identify possible search terms perhaps lousycoughache would make it from the examples given By linear regression, terms that seemed to correlate well with CDC influenza frequencies were identified and weighted. The resulting linear regression equations had variables and their output closely fit the actual CDC frequencies for to both regionally and nationally.
In linear regression, overfitting occurs when the equations reflect finegrained learning of the training data, but also a failure to learn the general principles that underlay the data. An overfit system has learned the past, but does not do well when asked to predict the future. To test their system, Google held out the data searches and CDC frequencies from system development. The question was: when the search term values were input into the GFT equations would be resulting output closely predict the CDC frequencies? Google reported that in fact the correlations were very close, as they had been for the development years. Google reported that overfitting had been avoided, and that the bigdata based system was ready for use.
The HN flu epidemic that popped up in early summer was an immediate challenge. GFT significantly underpredicted flu cases for the first wave of this epidemic ie there were significantly more actual cases reported by CDC than were predicted by GFT
Google engineers revamped GFT replacing some search terms with others, and increasing the number of terms. There was a second wave of HN and the revamped model performed well in that wave. Each year, Google would update GFT for the most recent searches and CDC flu incidence data.
As time went on GFT often accurately projected the incidence of flu, but it also failed at times. GFT overestimated by a large margin in the flu season. Beginning in August GFT overestimated national flu incidence in out of weeks.
Some researchers found that lagged CDC data was at times a more accurate predictor than GFT for example, to estimate this weeks flu incidence, merely using the CDC numbers from weeks ago might be as accurate as the current GFT estimate. However, researchers who criticized GFT were quick to point out that big data efforts still hold great promise. In one study, combining GFT with lagged CDC data in a nonlinear regression yielded better performance than GFT alo
Step by Step Solution
There are 3 Steps involved in it
Step: 1
Get Instant Access to Expert-Tailored Solutions
See step-by-step solutions with expert insights and AI powered tools for academic success
Step: 2
Step: 3
Ace Your Homework with AI
Get the answers you need in no time with our AI-driven, step-by-step assistance
Get Started