Question
In March 2023, Goldman Sachs published a report indicating that roughly 25% of tasks in the US and Europe could be automated using AI. However, not all industries will be affected equally. According to the report, certain fields, such as office and administrative tasks, legal work, architecture, and the social sciences, have a potential for 30%+ automation, while positions in construction, installation, and building maintenance are likely to be largely unaffected.
Facebook Research recently published a paper highlighting Moravec's paradox, the thesis that the hardest problems in AI involve sensorimotor skills rather than abstract thought or reasoning, which coincides with Goldman Sachs's predictions.
While both of these papers are impressive, they are also heavily influenced by recent advances in Large Language Models (LLMs). For this final project I have prepared a collection of ~200K news articles (about 900 MB) on important topics: Data Science, Machine Learning, and Artificial Intelligence. I need to identify which industries and job lines will be most impacted by AI over the next several years, based on the information I can extract from this text corpus.
The objective of this assignment is to identify which types of tasks and jobs are most likely to see the biggest impact from AI by extracting meaningful insights from unstructured text. My goal is to provide actionable recommendations on what can be done with AI to automate jobs and/or improve employee productivity. Please pay attention to the introduction of novel technologies and algorithms, such as AI for image generation and conversational AI, as they represent a paradigm shift in the adoption of AI technologies and data science in general.
You can access the data by using one of the following methods:
- Download the data from your browser via this link: https://storage.googleapis.com/msca-bdp-data-open/news_final_project/news_final_project.parquet
- Use pandas from anywhere (personal laptop, Colab, or any cloud): `df_news_final_project = pd.read_parquet('https://storage.googleapis.com/msca-bdp-data-open/news_final_project/news_final_project.parquet', engine='pyarrow')`
- Use the starter notebook NLP_GCP_11.1_Final_Project_Starter.ipynb. Note: this is live data, so the layout and record counts in your dataframe will vary from the counts in the attached notebook.
For the assignment, I'm trying to complete the following steps:
- Clean up the noise by eliminating newlines, tabs, remnants of web crawls, and other irrelevant text
- Discard irrelevant articles
- Detect major topics
- Identify top candidates for AI integration - these can be related to any industry and yield positive or negative results (sentiment analysis).
- Suggest why certain types of jobs are more likely to be impacted by AI
- Plot a timeline to illustrate how the sentiment is changing over time
- Identify new technologies and AI solutions that might be affecting the employment landscape
- Plot a timeline to illustrate the introduction of some of these technologies
- Demonstrate what companies, academic institutions and government entities can do to accelerate the development of these transformative capabilities
- Leverage appropriate NLP techniques to identify organizations, people and locations, then apply targeted sentiment
- What types of companies (based on their lines of business) are planning to invest in these technologies today or in the near future (success stories)?
- Showcase an appropriate visualization to summarize your recommendations (e.g. a word cloud or bubble chart)
- What types of applications cannot currently be transformed by AI, based on today's state of technology (failures)?
- Showcase an appropriate visualization to summarize your recommendations (e.g. a word cloud or bubble chart)
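For the clean-up step above, a small regex-based sketch (the specific patterns are assumptions about typical crawl noise; the real corpus will need its own corpus-specific rules):

```python
import re

def clean_text(text):
    """Strip common crawl artifacts: URLs, HTML entities, newlines/tabs, repeated whitespace."""
    text = re.sub(r"https?://\S+", " ", text)   # bare URLs left behind by the crawler
    text = re.sub(r"&[a-z]+;", " ", text)       # HTML entities such as &amp; or &nbsp;
    text = re.sub(r"[\r\n\t]+", " ", text)      # newlines and tabs
    text = re.sub(r"\s{2,}", " ", text)         # collapse repeated whitespace
    return text.strip()
```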
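The sentiment-over-time step could start from a monthly aggregation like the one below (a sketch assuming each article row has a date column and a numeric sentiment score; the column names `date` and `sentiment_score` are my placeholders, not the corpus schema):

```python
import pandas as pd

def monthly_sentiment(df, date_col="date", score_col="sentiment_score"):
    """Average sentiment score per calendar month, as a time-indexed Series."""
    out = df.copy()
    out[date_col] = pd.to_datetime(out[date_col])
    # "MS" = month-start frequency; mean() averages the scores within each month
    return out.set_index(date_col)[score_col].resample("MS").mean()

# series = monthly_sentiment(df)
# series.plot(title="Average AI-news sentiment by month")  # add axis labels before submitting
```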
Additional guidance:
- Clean up or sample the data if you need to shorten processing times or reduce memory usage
- Default sentiment from any off-the-shelf software package will likely be wrong and will require some tweaking; options include:
  - Keyword / dictionary approach
  - Data annotation and development of a custom classifier
  - Building a custom model on open-source data (e.g. Yelp)
  - Fine-tuning a Transformer pipeline
- We are encouraged to explore a combination of several techniques to identify key topics:
  - Topic modeling (e.g. LDA using gensim or ktrain) or BERTopic
  - Classification (hand-label several topics on a sample, then train a classifier)
  - Clustering (cluster topics around pre-selected keywords or word vectors)
  - Zero-shot (NLI) modeling
- Please submit actual program code (Jupyter notebooks)
- All plots should be of production quality and easily readable. Fuzzy plots, untitled plots, unreadable labels, and overlapping labels are unacceptable.
- Any statements made should be supported by data. Only the recommendations and project-goals sections can contain elements not directly supported by the data.
- We are welcome to use any software packages of our choice to complete the assignment
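As a starting point for the keyword / dictionary approach mentioned above, a tiny illustrative scorer (the lexicon below is hypothetical and deliberately small; a real one would be built from annotated samples of the corpus):

```python
# Hypothetical domain lexicon; terms are illustrative, not derived from the corpus.
POSITIVE = {"boost", "productivity", "opportunity", "augment", "hiring"}
NEGATIVE = {"layoff", "displace", "automate away", "job loss", "obsolete"}

def keyword_sentiment(text):
    """Label text by counting domain-specific positive vs. negative terms."""
    t = text.lower()
    pos = sum(t.count(term) for term in POSITIVE)
    neg = sum(t.count(term) for term in NEGATIVE)
    if pos == neg:
        return "neutral"
    return "positive" if pos > neg else "negative"
```

This sidesteps the "default sentiment will likely be wrong" problem by scoring only terms that matter for the AI-and-jobs domain, at the cost of missing anything outside the lexicon.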