Question
In March 2023, Goldman Sachs published a report indicating that roughly 25% of tasks in the US and Europe could be automated using AI. However, not all industries will be affected equally. According to the report, certain fields, such as office and administrative tasks, legal work, architecture, and the social sciences, have a potential for 30%+ automation, while positions in construction, installation, and building maintenance are likely to be largely unaffected.
Facebook Research recently published a paper highlighting Moravec's paradox, the thesis that the hardest problems in AI involve sensorimotor skills rather than abstract thought or reasoning, which coincides with Goldman Sachs's predictions.
While both of these papers are impressive, they are also heavily influenced by recent advances in Large Language Models (LLMs). For this final project I have prepared a collection of ~200K news articles (about 900 MB) on important topics: Data Science, Machine Learning, and Artificial Intelligence. I need to identify which industries and job lines will be most impacted by AI over the next several years, based on the information I can extract from this text corpus.
The objective of this assignment is to identify which types of tasks and jobs are most likely to see the biggest impact from AI by extracting meaningful insights from unstructured text. My goal is to provide actionable recommendations on what can be done with AI to automate jobs and/or improve employee productivity. Please pay attention to the introduction of novel technologies and algorithms, such as AI for image generation and conversational AI, as they represent a paradigm shift in the adoption of AI technologies and data science in general.
You can access the data by using one of the following methods:
- Download the data from your browser via this link: https://storage.googleapis.com/msca-bdp-data-open/news_final_project/news_final_project.parquet
- Use pandas from anywhere (personal laptop, Colab, or any cloud): `df_news_final_project = pd.read_parquet('https://storage.googleapis.com/msca-bdp-data-open/news_final_project/news_final_project.parquet', engine='pyarrow')`
- Use the starter notebook NLP_GCP_11.1_Final_Project_Starter.ipynb. Note: this is live data, so the layout and record counts in your dataframe will vary from the counts in the attached notebook.
For the assignment, I'm trying to complete the following steps:
- Clean up the noise by eliminating newlines, tabs, remnants of web crawls, and other irrelevant text
- Discard irrelevant articles
- Detect major topics
- Identify top candidates for AI integration - these can be related to any industry and yield positive or negative results (sentiment analysis).
- Suggest why certain types of jobs are more likely to be impacted by AI
- Plot a timeline to illustrate how the sentiment is changing over time
- Identify new technologies and AI solutions that might be affecting the employment landscape
- Plot a timeline to illustrate the introduction of some of these technologies
- Demonstrate what companies, academic institutions and government entities can do to accelerate the development of these transformative capabilities
- Leverage appropriate NLP techniques to identify organizations, people and locations, then apply targeted sentiment
- What types of companies (based on their lines of business) are planning to invest in these technologies today or in the near future (success stories)?
- Showcase an appropriate visualization to summarize your recommendations (e.g. a word cloud or bubble chart)
- What types of applications cannot currently be transformed by AI, based on today's state of technology (failures)?
- Showcase an appropriate visualization to summarize your recommendations (e.g. a word cloud or bubble chart)
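For the clean-up step above, a small regex-based sketch (the specific patterns are assumptions about typical crawl noise; the real corpus will need its own corpus-specific rules):

```python
import re

def clean_text(text):
    """Strip common crawl artifacts: URLs, HTML entities, newlines/tabs, repeated whitespace."""
    text = re.sub(r"https?://\S+", " ", text)   # bare URLs left behind by the crawler
    text = re.sub(r"&[a-z]+;", " ", text)       # HTML entities such as &amp; or &nbsp;
    text = re.sub(r"[\r\n\t]+", " ", text)      # newlines and tabs
    text = re.sub(r"\s{2,}", " ", text)         # collapse repeated whitespace
    return text.strip()
```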
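The sentiment-over-time step could start from a monthly aggregation like the one below (a sketch assuming each article row has a date column and a numeric sentiment score; the column names `date` and `sentiment_score` are my placeholders, not the corpus schema):

```python
import pandas as pd

def monthly_sentiment(df, date_col="date", score_col="sentiment_score"):
    """Average sentiment score per calendar month, as a time-indexed Series."""
    out = df.copy()
    out[date_col] = pd.to_datetime(out[date_col])
    # "MS" = month-start frequency; mean() averages the scores within each month
    return out.set_index(date_col)[score_col].resample("MS").mean()

# series = monthly_sentiment(df)
# series.plot(title="Average AI-news sentiment by month")  # add axis labels before submitting
```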
Additional guidance:
- Clean up or sample the data if you need to shorten processing times or reduce memory usage
- Default sentiment from any off-the-shelf software package will likely be wrong and will require some tweaking; options include:
  - Keyword / dictionary approach
  - Data annotation and development of a custom classifier
  - Building a custom model on open-source data (e.g. Yelp)
  - Fine-tuning a Transformer pipeline
- We are encouraged to explore a combination of several techniques to identify key topics:
  - Topic modeling (e.g. LDA using gensim or ktrain) or BERTopic
  - Classification (hand-label several topics on a sample, then train a classifier)
  - Clustering (cluster topics around pre-selected keywords or word vectors)
  - Zero-shot (NLI) modeling
- Please submit actual program code (Jupyter notebooks)
- All plots should be of production quality and easily readable. Fuzzy plots, untitled plots, unreadable labels, and overlapping labels are unacceptable.
- Any statements made should be supported by data. Only the recommendations and project-goals sections can contain elements not directly supported by the data.
- We are welcome to use any software packages of our choice to complete the assignment
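As a starting point for the keyword / dictionary approach mentioned above, a tiny illustrative scorer (the lexicon below is hypothetical and deliberately small; a real one would be built from annotated samples of the corpus):

```python
# Hypothetical domain lexicon; terms are illustrative, not derived from the corpus.
POSITIVE = {"boost", "productivity", "opportunity", "augment", "hiring"}
NEGATIVE = {"layoff", "displace", "automate away", "job loss", "obsolete"}

def keyword_sentiment(text):
    """Label text by counting domain-specific positive vs. negative terms."""
    t = text.lower()
    pos = sum(t.count(term) for term in POSITIVE)
    neg = sum(t.count(term) for term in NEGATIVE)
    if pos == neg:
        return "neutral"
    return "positive" if pos > neg else "negative"
```

This sidesteps the "default sentiment will likely be wrong" problem by scoring only terms that matter for the AI-and-jobs domain, at the cost of missing anything outside the lexicon.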