
Question



Big Data (Sunnie Chung)
Information Extraction from Webpages as Semi-Structured Data or Unstructured Data

Method 1: Webpage in Semi-Structured Data Model with DOM and XPath
Method 2: Webpages as Unstructured Data with Parsing

You may use either method. Process the webpages at the following sites to extract the information listed below and create an index table in SQL Server for the next phases of text mining. You can choose any scripting or programming language for this lab (Python, JavaScript, PHP, or any other) to scrape the following collection of webpages, extract the information, and store it in a table in SQL Server as described below.

Source: Collected State of the Union Addresses of U.S. Presidents, at the following website (in Chrome):
https://www.infoplease.com/homework-help/history/collected-state-union-addresses-us-presidents
https://www.infoplease.com/homework-help/us-documents/state-union-address-george-washington-january-8-1790

Part 1: For each item (line) in the contents section of the page's HTML file, extract the following four pieces of information:
- Name of President
- Date of Union Address
- Link to Address
- Text of Address

Write the four values to a CSV/TSV file as a table, one address per line, or insert a record of these column values into a table in your SQL Server. Name of President and Date of Union Address are each stored as strings, and the Link to Address column holds a link built from the URL of each address. Store the text of each address in the full address text column; you may create it as a BLOB if necessary.

For example, your script extracts three pieces of information from the site's HTML code and displays the three columns in a table format, as below.
George Washington (January 8, 1790)
George Washington (December 8, 1790)

Save the full text of each address in a file and store the file path in a column as a pointer to the file, so that you can build a table from it and retrieve the texts at any time later. Create another column to hold the full text of the address.

The structured text file (for example, CSV or TSV) would look like the table below. Note that your transformed CSV/TSV file would not have a heading with the column names in its first line; the names are for the table schema (you may change the column names when creating the table in your SQL Server). Some Big Data APIs may require a heading with the column titles in the first line of the CSV file for it to be created as a table in a database server.

Name of President | Date of Union Address | Link to Address | Text of Address
George Washington | January 8, 1790 | https://www.infoplease.com/homework-help/us-documents/state-union-address-george-washington-january-8-1790 | Fellow-Citizens of the Senate and House of Representatives: I embrace wit[...] opportunity which now presents itself of congratulating you on the present fa[...]
George Washington | December 8, 1790 | https://www.infoplease.com/homework-help/us-documents/state-union-address-george-washington-december-8-1790 | Fellow-Citizens of the Senate and House of Representatives: In meeting you again I feel much satisfaction in being able to repeat my congratulat[...] prospects which continue to distinguish our public affairs. The abundant fruits of an[...]

The four column names in the heading will be your table schema when you later create a table in SQL Server from your CSV or text file. The four columns per row have the following schema: (President, Date of Union Address, Link to Address, Text_Address).

Note that some links on the infoplease site are hijacked and redirect to another page. To get the correct text of each address, go to each address page using the following format:
https://www.infoplease.com/homework-help/us-documents/state-union-address-firstname-lastname-month-day-year

Example: For John Adams (December 3, 1799), the address page is:
https://www.infoplease.com/homework-help/us-documents/state-union-address-john-adams-december-3-1799

Automate the table creation in your database server from your script/program, either with ODBC/JDBC calls that insert each record directly from your script/program into a table in your SQL Server, or by creating a stored procedure in your SQL Server that creates a table from your output file (.CSV) with BULK INSERT. You may also use other methods, such as a CLR table-valued function, to create the table.

Part 2: Extra credit (NOT required for Lab 1)
Create one big text file that includes all the address texts extracted from the HTML file of each address.
Combine (append) the address texts into one big text file that contains all the State of the Union Addresses (each text starts with the name of the President, followed by the date), so that you can perform further data processing for Big Data analytics on the combined text of your entire collection, such as a word count over the whole collection as part of the text mining process. Some later phases of data processing need all the texts, so it is easier to have them in one file.

You may use XPath library calls for this lab. See the instructions on how to debug and set up the XPath library in the Lab section. See the simplified HTML source below.

HTML (in SimplifiedInfoUnionAddress.htm) of the Simplified State of the Union Address page:

Collected State of the Union Addresses of U.S. Presidents


Next

Contents

George Washington (January 8, 1790)
George Washington (December 8, 1790)
George Washington (October 25, 1791)
George Washington (November 6, 1792)
George Washington (December 3, 1793)
George Washington (November 19, 1794)
George Washington (December 8, 1795)
George Washington (December 7, 1796)
John Adams (November 22, 1797)
John Adams (December 8, 1798)
John Adams (December 3, 1799)
John Adams (November 11, 1800)
Thomas Jefferson (December 8, 1801)
Thomas Jefferson (December 15, 1802)
Thomas Jefferson (October 17, 1803)
Thomas Jefferson (November 8, 1804)
Thomas Jefferson (December 3, 1805)
Thomas Jefferson (December 2, 1806)
Thomas Jefferson (October 27, 1807)
Thomas Jefferson (November 8, 1808)
James Madison (November 29, 1809)
James Madison (December 5, 1810)
James Madison (November 5, 1811)
James Madison (November 4, 1812)
James Madison (December 7, 1813)
James Madison (September 20, 1814)
James Madison (December 5, 1815)
James Madison (December 3, 1816)
James Monroe (December 12, 1817)
James Monroe (November 16, 1818)
James Monroe (December 7, 1819)
James Monroe (November 14, 1820)
James Monroe (December 3, 1821)
James Monroe (December 3, 1822)
James Monroe (December 2, 1823)
James Monroe (December 7, 1824)
John Quincy Adams (December 6, 1825)
John Quincy Adams (December 5, 1826)
John Quincy Adams (December 4, 1827)
John Quincy Adams (December 2, 1828)
Andrew Jackson (December 8, 1829)
Andrew Jackson (December 6, 1830)
Andrew Jackson (December 6, 1831)
Andrew Jackson (December 4, 1832)
Andrew Jackson (December 3, 1833)
Andrew Jackson (December 1, 1834)
Andrew Jackson (December 7, 1835)
Andrew Jackson (December 5, 1836)

For Python script: http://docs.python-guide.org/en/latest/scenarios/scrape/

Submit BOTH of the following on Blackboard:
1. A lab report that shows your setup/platform procedure, the execution that generates the output file in any structured text format (such as CSV or TSV), your output as text and as a SQL table, and your source code/scripts.
2. All your code/scripts, input files, and output files in one zip file.
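
As a possible starting point for Part 1 (Method 1), here is a minimal sketch, not a definitive solution. It assumes the contents list can be reached as links inside list items, which is a guess about the markup; SAMPLE_HTML is a hypothetical, well-formed stand-in and the real page may be structured differently. It uses only the Python standard library, whose xml.etree.ElementTree supports a limited XPath subset and requires well-formed input. The function names (parse_entry, build_url, extract_rows, to_tsv) are illustrative. Each link is rebuilt from the URL format given in the assignment rather than read from the page, because some links on the site are reported to redirect elsewhere. Fetching the address text and loading the rows into SQL Server are left out.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# URL prefix taken from the format stated in the assignment.
BASE = "https://www.infoplease.com/homework-help/us-documents/state-union-address-"

# Hypothetical, well-formed stand-in for the contents list (assumed markup).
SAMPLE_HTML = """<html><body>
<h1>Collected State of the Union Addresses of U.S. Presidents</h1>
<ul>
  <li><a href="#">George Washington (January 8, 1790)</a></li>
  <li><a href="#">John Adams (December 3, 1799)</a></li>
</ul>
</body></html>"""

# Matches entries of the form "Name of President (Month D, YYYY)".
ENTRY = re.compile(r"^(?P<name>.+?)\s*\((?P<date>[A-Z][a-z]+ \d{1,2}, \d{4})\)$")


def parse_entry(text):
    """Split one contents entry into (president, date)."""
    match = ENTRY.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized contents entry: {text!r}")
    return match.group("name"), match.group("date")


def build_url(name, date):
    """Build the address URL as firstname-lastname-month-day-year."""
    slug = f"{name} {date}".lower().replace(",", "")
    return BASE + "-".join(slug.split())


def extract_rows(html):
    """Return (president, date, link) tuples for every link in a list item."""
    root = ET.fromstring(html)
    rows = []
    # ".//li/a" is within ElementTree's supported XPath subset.
    for anchor in root.findall(".//li/a"):
        name, date = parse_entry(anchor.text or "")
        # The href attribute is ignored on purpose; the link is rebuilt.
        rows.append((name, date, build_url(name, date)))
    return rows


def to_tsv(rows):
    """Serialize rows as TSV with no heading line, one address per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_HTML)
print(to_tsv(rows), end="")
```

Real-world HTML is often not well-formed XML, so an HTML-aware parser with full XPath support may be needed in place of ElementTree. The slug rule for names with more than two parts (for example, John Quincy Adams) is assumed to follow the same pattern and should be checked against the site.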

 


