Scraping and Chunking: Mini Project
Your goal for this project is to scrape an entire publicly hosted knowledge base and chunk it effectively into clean, smaller pieces so that related content is preserved in a single chunk.
Considerations:
Please implement this solution in Python. We recommend using BeautifulSoup.
The priorities in descending order are:
Cleanliness/organization of the code (SIMPLER IS BETTER)
Correctness
Creativity
This project is not timed and you may use the internet and LLM helpers as you wish.
When you finish, provide the GitHub repo link, including the code and the result file.
Step 1
First, scrape all the Help Articles from the Notion help center. Make sure you get every page and all the relevant content from that page. Feel free to ignore any guides in Notion Academy.
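One way to approach this step is a small breadth-first crawl that starts at the help center home page and follows only help-article links. The sketch below is illustrative, not a definitive implementation: `START_URL`, `ARTICLE_PREFIX`, and `SKIP_PREFIXES` are assumptions about the site's URL layout (they do not come from the brief) and should be checked against the live site, and the helper names are hypothetical. Index or category pages that share the article prefix may need extra filtering.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

# Assumptions about the URL layout; verify against the live site.
START_URL = "https://www.notion.so/help"
ARTICLE_PREFIX = "/help/"
SKIP_PREFIXES = ("/help/notion-academy", "/help/guides")  # guides to ignore


def fetch(url):
    """Download one page and return its HTML as text."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def article_links(html, page_url):
    """Return the same-site help-article URLs linked from one page."""
    site = urlparse(page_url).netloc
    found = set()
    for anchor in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        url, _fragment = urldefrag(urljoin(page_url, anchor["href"]))
        parts = urlparse(url)
        if parts.netloc != site or not parts.path.startswith(ARTICLE_PREFIX):
            continue
        if parts.path.startswith(SKIP_PREFIXES):
            continue
        # Normalize so "/help/a" and "/help/a/" count as the same page.
        found.add(f"{parts.scheme}://{parts.netloc}{parts.path.rstrip('/')}")
    return found


def crawl(start_url, get_html=fetch):
    """Breadth-first crawl; returns {url: html} for every article reached."""
    pages, queue, seen = {}, deque([start_url]), {start_url}
    while queue:
        url = queue.popleft()
        html = get_html(url)
        if url != start_url:
            pages[url] = html
        for link in sorted(article_links(html, url) - seen):
            seen.add(link)
            queue.append(link)
    return pages
```

Calling `crawl(START_URL)` would fetch the live site. Passing a different `get_html` callable (for example, one that reads saved files) keeps the crawl logic easy to test offline.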
Step 2
Extract the core text content from each article. Feel free to ignore images, other media, and any components that are not directly related to the core article. Make sure to include all titles, notes, and paragraphs.
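A minimal sketch of this extraction with BeautifulSoup follows. It walks the article body and keeps headings, paragraphs, lists, and notes as ordered `(kind, text)` pairs, so images and other media simply drop out because they carry no text. The tag choices are assumptions rather than facts about the site's markup: the body is assumed to sit in `<article>` or `<main>`, notes are assumed to use `<aside>` or `<blockquote>`, and `extract_blocks` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

HEADINGS = ["h1", "h2", "h3", "h4", "h5", "h6"]
LISTS = ["ul", "ol"]
NOTES = ["aside", "blockquote"]  # assumption: notes/callouts use these tags
BLOCKS = HEADINGS + LISTS + NOTES + ["p"]
SKIP_INSIDE = ["nav", "footer"]  # assumption: site chrome lives in these


def clean(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def extract_blocks(html):
    """Return the article as ordered (kind, text) pairs."""
    soup = BeautifulSoup(html, "html.parser")
    root = soup.find("article") or soup.find("main") or soup
    blocks = []
    for tag in root.find_all(BLOCKS):
        # Skip site chrome and anything nested in a block already captured,
        # such as a <p> inside an <aside> or a <ul> inside an <li>.
        if tag.find_parent(BLOCKS + SKIP_INSIDE):
            continue
        if tag.name in LISTS:
            items = [clean(li.get_text()) for li in tag.find_all("li", recursive=False)]
            text = "\n".join(f"- {item}" for item in items if item)
            kind = "list"
        else:
            text = clean(tag.get_text())
            if tag.name in HEADINGS:
                kind = "heading"
            elif tag.name in NOTES:
                kind = "note"
            else:
                kind = "paragraph"
        if text:
            blocks.append((kind, text))
    return blocks
```

Keeping each list as a single block here means a later chunking pass never has to split one mid-list.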
Step 3
Now it's time to split the articles into smaller chunks. This is important for any RAG-type system. Make sure to keep headers and paragraphs together, and don't break up bulleted lists mid-list. Your chunks should be roughly characters or fewer but could be more if it's necessary to keep related context together.
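The chunking rules above can be sketched as a greedy packer. This is one possible approach under stated assumptions, not the expected solution: the input is assumed to be a list of `(kind, text)` pairs where each bulleted list is already a single block, `chunk_blocks` is a hypothetical name, and `max_chars` is a caller-supplied limit. Gluing a line that ends in a colon to the block after it is an added heuristic, not a requirement from the brief.

```python
def chunk_blocks(blocks, max_chars):
    """Greedily pack (kind, text) blocks into chunks of about max_chars."""
    # Pass 1: build indivisible units. Headings and "lead-in:" lines stay
    # glued to the block that follows them; a list is already one block.
    units, pending = [], []
    for kind, text in blocks:
        pending.append(text)
        if kind != "heading" and not text.endswith(":"):
            units.append("\n".join(pending))
            pending = []
    if pending:  # trailing heading with nothing after it
        units.append("\n".join(pending))

    # Pass 2: pack units into chunks, starting a new chunk on overflow.
    # A single unit longer than max_chars becomes its own oversized chunk
    # instead of being split.
    chunks, current = [], ""
    for unit in units:
        candidate = f"{current}\n\n{unit}" if current else unit
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = unit
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

For example, with a heading, a short paragraph, a second heading, a lead-in line, and a three-item list, a small `max_chars` yields two chunks: the first heading with its paragraph, then the second heading with its lead-in and the whole list, even though that second chunk exceeds the limit.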
