Big Data: The Case of Google Flu Trends

Assignment Instructions

Influenza remains a worldwide health problem. In the U.S., the Centers for Disease Control and Prevention (CDC) gets reports from doctors about visits in which the patient complains of flu-like symptoms. The CDC compiles these reports into regional and national early flu warnings. The CDC reports quantify influenza-like cases by week, by region, and nationally. The reporting and analysis time required means that CDC reports lag actual events.

Those who are sick may seek help by using social media queries to gain information about their symptoms. In 2008, Google scientists realized that such queries might be immediately useful in predicting the prevalence of flu; i.e., rather than wait for the CDC to announce the onset of a flu outbreak, detect the onset immediately by analyzing what people are asking about online. Google called its proposed nowcasting system Google Flu Trends (GFT).

The appeal of this data-driven approach should be evident. Google had a large body of searches that were launched by people with the relevant symptoms. Thus, the proposed system could be developed quickly at low cost and might provide near-real-time information. The proposed system could possibly be as accurate as the CDC system, and it would be quicker. Possibly, web searches could come from more people than those who actually seek medical help, and so GFT might be based on a greater amount of information.

Google gathered 50 million health-related searches from the 2003-2007 period. Google wrote software that correlated the frequency of specific terms in web searches with the frequency of influenza-like cases reported by the CDC in the same period. From this mass of data, Google was thus able to identify 45 search terms that seemed indicative of influenza cases. In this module, you learned that linear regression is a statistical tool that data scientists use to develop equations that are used to make predictions. In this case, the goal was to develop equations that could predict the actual incidence of influenza-like cases reported by the CDC.

We should stop to think about what this big-data analysis actually entails. Assume a person Google-searched "I feel lousy today, can't stop coughing and ache all over, what should I do about that?" Does that person have influenza? Or a bad cold? What if the person Google-searched merely "I feel lousy, what should I do?" Is that search a predictor of anything? Google's software sifted through 50 million such entries to identify possible search terms (perhaps "lousy", "cough", and "ache" would make it from the examples given). By linear regression, 45 terms that seemed to correlate well with CDC influenza frequencies were identified and weighted. The resulting linear regression equations had 45 variables, and their output closely fit the actual CDC frequencies for 2003 to 2007, both regionally and nationally.

In linear regression, over-fitting occurs when the equations reflect fine-grained learning of the training data but also a failure to learn the general principles that underlie the data. An over-fit system has learned the past but does not do well when asked to predict the future. To test their system, Google held out the 2008 data (searches and CDC frequencies) from system development. The question was: when the 2008 search term values were input into the GFT equations, would the resulting output closely predict the 2008 CDC frequencies? Google reported that in fact the correlations were very close, as they had been for the 2003-2007 development years. Google reported that over-fitting had been avoided and that the big-data-based system was ready for use.

The H1N1 flu epidemic that popped up in early summer 2009 was an immediate challenge. GFT significantly under-predicted flu cases for the first wave of this epidemic; i.e., there were significantly more actual cases reported by the CDC than were predicted by GFT. Google engineers revamped GFT, replacing some search terms with others and increasing the number of terms. There was a second wave of H1N1, and the revamped model performed well in that wave.

Each year, Google would update GFT for the most recent searches and CDC flu incidence data. As time went on, GFT often accurately projected the incidence of flu, but it also failed at times. GFT overestimated by a large margin in the 2011-2012 flu season. Beginning in August 2011, GFT overestimated national flu incidence in 100 out of 108 weeks. Some researchers found that lagged CDC data was at times a more accurate predictor than GFT; for example, to estimate this week's flu incidence, merely using the CDC numbers from 3 weeks ago might be as accurate as the current GFT estimate. However, researchers who criticized GFT were quick to point out that big data efforts still hold great promise. In one study, combining GFT with lagged CDC data in a non-linear regression yielded better performance than GFT alone.
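The hold-out test and the lagged-CDC comparison described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not Google's method: it fits a single-variable least-squares line instead of GFT's 45-term equations, every number in it is made up, and the helper names (fit_line, mean_abs_error, lagged_baseline) are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y ~ slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def mean_abs_error(predicted, actual):
    """Average absolute gap between predictions and reported values."""
    return mean(abs(p - a) for p, a in zip(predicted, actual))


def lagged_baseline(series, lag):
    """Pair each week's value with the value reported `lag` weeks earlier.

    Returns (predictions, targets), both shortened by `lag` entries.
    """
    return series[:-lag], series[lag:]


# Made-up illustrative numbers, NOT real Google or CDC data.
dev_search = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]    # search-term frequency, development weeks
dev_cdc = [1.2, 1.9, 3.1, 4.2, 4.8, 6.1]       # CDC flu-like case rate, same weeks
holdout_search = [2.5, 3.5, 7.0, 8.0]          # held-out weeks, never used in fitting
holdout_cdc = [2.4, 3.6, 5.0, 5.5]

# Fit on the development weeks only, then score on the held-out weeks.
slope, intercept = fit_line(dev_search, dev_cdc)
dev_pred = [slope * s + intercept for s in dev_search]
holdout_pred = [slope * s + intercept for s in holdout_search]
print("development error:", mean_abs_error(dev_pred, dev_cdc))
print("hold-out error:   ", mean_abs_error(holdout_pred, holdout_cdc))

# Baseline: estimate each week with the CDC value from 3 weeks earlier.
weekly_cdc = [1.0, 1.4, 2.0, 2.9, 3.5, 3.8, 3.6, 3.0]
baseline_pred, baseline_target = lagged_baseline(weekly_cdc, lag=3)
print("lag-3 baseline error:", mean_abs_error(baseline_pred, baseline_target))
```

A hold-out error far larger than the development error is the signature of over-fitting described above. Comparing the model's error against the lag-3 baseline on the same weeks shows whether the search data adds anything beyond what older CDC reports already provide.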

