
Question

Web scraping

You will be scraping data taken from Goodreads.com, cleaning it, and extracting information from it. You will need to use the BeautifulSoup library to parse through the HTML documents. We have provided two static documents for you to use, but you will need to scrape some live content as well.
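As a rough illustration of what "parsing through an HTML document" looks like with BeautifulSoup, here is a minimal sketch. The inline HTML, the class names (`book-row`, `book-title`, `book-author`), and the helper `parse_books` are all hypothetical; the provided Goodreads files use their own tags and classes, which you would need to look up in your browser's inspector.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only. The real Goodreads files use
# different tags and class names, so inspect them before writing selectors.
SAMPLE_HTML = """
<table>
  <tr class="book-row">
    <td><a class="book-title" href="/book/show/1">
      First Book
    </a></td>
    <td><span class="book-author"> Author One </span></td>
  </tr>
  <tr class="book-row">
    <td><a class="book-title" href="/book/show/2">
      Second Book
    </a></td>
    <td><span class="book-author"> Author Two </span></td>
  </tr>
</table>
"""


def parse_books(html):
    """Return a list of (title, author) tuples from the sample markup."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for row in soup.find_all("tr", class_="book-row"):
        # .text keeps surrounding newlines and spaces, so strip() them.
        title = row.find("a", class_="book-title").text.strip()
        author = row.find("span", class_="book-author").text.strip()
        books.append((title, author))
    return books


if __name__ == "__main__":
    print(parse_books(SAMPLE_HTML))
```

For a static file, the same approach should work after reading the file contents into a string and passing that to `BeautifulSoup`.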

The files, including the starter code in the .py file, are in this folder:

https://drive.google.com/drive/folders/1Cm8mvolxpzlu4lqfhOROmTCgJsNpnXjQ?usp=sharing

After you've implemented all of the required functions, you will need to write test cases for each one. We have provided guidance for what to test for in the comments, but it will be up to you to implement the logic in the code. In order to write good test cases, you will need to open the websites, explore, and get a sense of what your data should actually look like.
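One possible shape for such a test case, using the unittest library, is sketched below. The function `get_sample_results` is a hypothetical stand-in that returns hard-coded data; in the assignment you would call your own functions instead, and the specific checks should follow the comments in the starter code.

```python
import unittest


def get_sample_results():
    """Hypothetical stand-in for one of the assignment's functions."""
    return [
        ("Book title 1", "Author 1", "4.62"),
        ("Book title 2", "Author 2", "4.50"),
    ]


class TestSampleResults(unittest.TestCase):
    def setUp(self):
        # Call the function once and reuse the result in every test.
        self.results = get_sample_results()

    def test_returns_list_of_three_tuples(self):
        self.assertIsInstance(self.results, list)
        for item in self.results:
            self.assertIsInstance(item, tuple)
            self.assertEqual(len(item), 3)

    def test_first_and_last_entries(self):
        # Non-trivial checks compare against values seen on the actual page.
        self.assertEqual(self.results[0], ("Book title 1", "Author 1", "4.62"))
        self.assertEqual(self.results[-1][2], "4.50")


def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSampleResults)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_tests()
```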

If you choose to do the extra credit part, you will practice using several data-cleaning methods together: you need to combine BeautifulSoup with regular expressions and write the output to a .csv file.

The code

You will need to write several functions and their test cases. Start from the starter code provided, which looks like the following:

def get_titles_from_search_results():

Write a function that creates a BeautifulSoup object on "search_results.html". Parse through the object and return a list of tuples containing book titles, authors, and ratings (as printed on the Goodreads website) in the format given below. Make sure to strip() any newlines from the book titles and author names.

    [('Book title 1', 'Author 1', 'Rating 1'), ('Book title 2', 'Author 2', 'Rating 2'), ...]

This is what we're expecting to see returned:

    [('Harry Potter and the Deathly Hallows (Harry Potter, #7)', 'J.K. Rowling', '4.62'),
     ('Harry Potter and the Order of the Phoenix (Harry Potter, #5)', 'J.K. Rowling,', '4.50'),
     ("Harry Potter and the Sorcerer's Stone (Harry Potter, #1)", 'J.K. Rowling', '4.47'),
     ('Harry Potter and the Prisoner of Azkaban (Harry Potter, #3)', 'J.K. Rowling', '4.57'),
     ('Harry Potter and the Chamber of Secrets (Harry Potter, #2)', 'J.K. Rowling', '4.43'),
     ('Harry Potter and the Goblet of Fire (Harry Potter, #4)', 'J.K. Rowling', '4.56'),
     ('Harry Potter and the Half-Blood Prince (Harry Potter, #6)', 'J.K. Rowling', '4.57'),
     ('Harry Potter and the Cursed Child: Parts One and Two (Harry Potter, #8)', 'John Tiffany (Adaptation),', '3.62'),
     ('Harry Potter and the Order of the Phoenix (Harry Potter, #5, Part 1)', 'J.K. Rowling', '4.62'),
     ('Harry Potter Series Box Set (Harry Potter, #1-7)', 'J.K. Rowling', '4.73'),
     ('Harry, a History: The True Story of a Boy Wizard, His Fans, and Life Inside the Harry Potter Phenomenon', 'Melissa Anelli (Goodreads Author),', '4.12'),
     ('Harry Potter Collection (Harry Potter, #1-6)', 'J.K. Rowling', '4.73'),
     ('The Unofficial Harry Potter Cookbook: From Cauldron Cakes to Knickerbocker Glory--More Than 150 Magical Recipes for Wizards and Non-Wizards Alike', 'Dinah Bucholz', '4.10'),
     ('Harry Potter: A History of Magic', 'British Library,', '4.22'),
     ('Selections from Harry Potter and the Order of the Phoenix: Piano Solos', 'John Williams,', '4.71'),
     ('Harry Potter Boxed Set, Books 1-5 (Harry Potter, #1-5)', 'J.K. Rowling,', '4.78'),
     ('Harry Potter and the Chamber of Secrets: Sheet Music for Flute with C.D', 'John Williams', '4.64'),
     ('Harry Potter Page to Screen: The Complete Filmmaking Journey', 'Bob Mccabe', '4.57'),
     ('Harry Potter: Film Wizardry', 'Brian Sibley', '4.50'),
     ('Harry Potter: The Prequel (Harry Potter, #0.5)', 'J.K. Rowling', '4.18')]

def get_search_links():

Write a function that creates a BeautifulSoup object after retrieving content from "https://www.goodreads.com/search?q=fantasy&qid=NwUsLiA2Nc". Parse through the object and return a list of URLs for each of the first ten books in the search using the following format:

    ['https://www.goodreads.com/book/show/84136.Fantasy_Lover?from_search=true&from_srp=true&qid=NwUsLiA2Nc&rank=1', ...]

Notice that you should ONLY add URLs that start with "/book/show/" to your list, and be sure to prepend the full path (https://www.goodreads.com) to the URL so that the URL is in the format "https://www.goodreads.com/book/show/kdka".

This is what get_search_links should return:

    ['https://www.goodreads.com/book/show/84136.Fantasy_Lover?from_search=true&from_srp=true&qid=NwUsLiA2Nc&rank=1',
     'https://www.goodreads.com/book/show/6542645-fantasy-in-death?from_search=true&from_srp=true&qid=NwUsLiA2Nc&rank=2',
     'https://www.goodreads.com/book/show/35082746-fantasy-of-frost?from_search=true&from_srp=true&qid=NwUsLiA2Nc&rank=3',
     'https://www.goodreads.com/book/show/2081.The_Mind_s_I?from_search=true&from_srp=true&qid=NwUsLiA2Nc&rank=4',
     'https://www.goodreads.com/book/show/25255723-gods-and-mortals?from_search=true&from_srp=true&qid=NwUsLiA2Nc&rank=5',
     'https://www.goodreads.com/book/show/6931452-the-kingdom-of-fantasy?from_search=true&from_srp=true&qid=NwUsLiA2Nc&rank=6',
     'https://www.goodreads.com/book/show/13600356-epic?from_search=true&from_srp=true&qid=NwUsLiA2Nc&rank=7',
     'https://www.goodreads.com/book/show/31363.How_to_Write_Science_Fiction_Fantasy?from_search=true&from_srp=true&qid=NwUsLiA2Nc&rank=8',
     'https://www.goodreads.com/book/show/39282719-kurintor-nyusi?from_search=true&from_srp=true&qid=NwUsLiA2Nc&rank=9',
     'https://www.goodreads.com/book/show/42667807-die-vol-1?from_search=true&from_srp=true&qid=NwUsLiA2Nc&rank=10']

def get_book_summary(book_html):

Write a function that creates a BeautifulSoup object that extracts book information from a book's webpage, given the HTML file of the book. Parse through the BeautifulSoup object, and capture the book title, book author, number of pages, and book rating. This function should return a tuple in the following format:

    ('Some book title', "the book's author", number of pages, book rating)

HINT: Using BeautifulSoup's find() method may help you here. You can easily capture CSS selectors with your browser's inspector window. Make sure to strip() any newlines from the book title, number of pages, and rating.

The list of tuples you will create in test_get_book_summery after calling get_book_summary on all 10 HTML files includes these books (they might be in a different order):

    [('Fantasy Lover', 'Sherrilyn Kenyon', 337, 4.14),
     ('Fantasy in Death', 'J.D. Robb', 356, 4.26),
     ('Fantasy of Frost', 'Kelly St. Clare', 264, 4.18),
     ("The Mind's I: Fantasies and Reflections on Self and Soul", 'Douglas R. Hofstadter', 512, 4.14),
     ('Gods and Mortals: Fourteen Free Urban Fantasy & Paranormal Novels Featuring Thor, Loki, Greek Gods, Native American Spirits, Vampires, Werewolves, & More', 'C. Gockel', 2948, 3.81),
     ('Epic: Legends of Fantasy', 'John Joseph Adams', 624, 3.7),
     ('The Kingdom of Fantasy', 'Geronimo Stilton', 316, 4.34),
     ('How to Write Science Fiction & Fantasy', 'Orson Scott Card', 140, 3.9),
     ('Kurintor Nyusi: Diverse Epic Fantasy',

def summarize_best_books(filepath):

Write a function to get a list of categories, book titles, and URLs from the "BEST BOOKS OF 2020" page in "best_books_2020.html". This function should create a BeautifulSoup object from a filepath and return a list of (category, book title, URL) tuples.

For example, if the best book in category "Fiction" is "The Testaments (The Handmaid's Tale, #2)", with URL https://www.goodreads.com/choiceawards/best-fiction-books-2020, then you should append ("Fiction", "The Testaments (The Handmaid's Tale, #2)", "https://www.goodreads.com/choiceawards/best-fiction-books-2020") to your list of tuples.

def write_csv(data, filename):

Write a function that takes in a list of tuples (called data, i.e. the one that is returned by get_titles_from_search_results()), sorts the tuples in descending order by rating, writes the data to a csv file, and saves it to the passed filename. The first row of the csv should contain "Book title", "Author Name", "Rating", respectively, as column headers. For each tuple in data, write a new row to the csv, placing each element of the tuple in the correct column.

When you are done, your CSV file should look like this:

    Book title,Author Name,Rating
    Book1,Author1,Rating1
    Book2,Author2,Rating2
    Book3,Author3,Rating3

in order of highest rating to lowest rating. This function should not return anything.

For each function you wrote above, you should write a non-trivial test case to make sure that your function works properly. We have described the test cases that you should write in the comments for the test functions. It is up to you to correctly implement this logic using the assert statements in the unittest library.

When you look at your written csv file, your result should have this as the first line in the csv file after the header:

    "Harry Potter Boxed Set, Books 1-5 (Harry Potter, #1-5)","J.K. Rowling,",4.73

and this as the last row in the csv file:

    "Harry Potter and the Cursed Child: Parts One and Two (Harry Potter, #8)","John Tiffany (Adaptation),",3.62

[Screenshot: Goodreads book page for The Testaments (The Handmaid's Tale, #2) by Margaret Atwood, showing the title, author, rating (4.20; 207,668 ratings; 21,266 reviews), description, and genre list.]
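The write_csv behaviour described above can be sketched with Python's standard csv module. This is only one possible approach, not the official solution; it assumes the ratings arrive as strings such as '4.62' and converts them to floats for sorting. The sample rows and the temporary output path in the demo are made up for illustration.

```python
import csv
import os
import tempfile


def write_csv(data, filename):
    """Sort (title, author, rating) tuples by rating, highest first, and save them."""
    # Ratings are assumed to be strings like '4.62', so compare them as floats.
    rows = sorted(data, key=lambda row: float(row[2]), reverse=True)
    # newline="" prevents the csv module from writing blank lines on Windows.
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Book title", "Author Name", "Rating"])
        writer.writerows(rows)


if __name__ == "__main__":
    # Made-up sample data, written to a temporary directory.
    sample = [
        ("Book A", "Author X", "4.10"),
        ("Book B, Vol. 2", "Author Y,", "4.73"),
        ("Book C", "Author Z", "3.62"),
    ]
    out_path = os.path.join(tempfile.mkdtemp(), "books.csv")
    write_csv(sample, out_path)
    with open(out_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            print(row)
```

The csv writer quotes any field that contains a comma, which is why titles and author names with commas appear in double quotes in the expected output.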
