Question
Big Data
Sunnie Chung
Information Extraction from Webpages as Semi-Structured Data or Unstructured Data
Method 1: Webpage in Semi-Structured Data Model with DOM and XPath
Example output rows (the Text_Address column is cut off):

... I embrace wit... opportunity which now presents itself of congratulating you on the present fa...

George Washington | December 8, 1790 | https://www.infoplease.com/homework-help/us-documents/state-union-address-george-washington-december-8-1790 | Fellow-Citizens of the Senate and House of Representatives: In meeting you again I feel much satisfaction in being able to repeat my congratulat... prospects which continue to distinguish our public affairs. The abundant fruits of an...

The four column names in the heading will be your table schema when you later create a table in SQL Server from your CSV or text file. Each row has four columns with the following schema: (President, Date of Union Address, Link to Address, Text_Address).

Note that some links on the infoplease site have been hijacked and redirect to another page. To get the correct Union Address text, go to each address page using the following format:
https://www.infoplease.com/homework-help/us-documents/state-union-address-firstname-lastname-month-day-year
Example: For John Adams (December 3, 1799), the address page is:
https://www.infoplease.com/homework-help/us-documents/state-union-address-john-adams-december-3-1799

Automate the table creation in your database server from your script/program, either with ODBC/JDBC calls that insert each record directly from your script/program into a table in your SQL Server, or by creating a stored procedure in your SQL Server that creates a table from your output file (.CSV) with BULK INSERT. You may also use other methods, such as a CLR table-valued function, to create the table.

Part 2: For extra credit. It is NOT required for Lab 1. Create one big text file that includes all the address text extracted from the HTML file of each address.
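A minimal sketch of how the combined file for Part 2 might be built, offered as one possible approach and not as the required solution. The function names, the "President (Date)" header layout, and the placeholder records are assumptions made for illustration; the real address texts would come from the extraction step in Part 1.

```python
import re
from collections import Counter


def combine_addresses(records):
    """Join (president, date, text) records into one string.

    Each address is headed by "President (Date)"; this header layout is an
    assumption, the assignment only says name followed by date.
    """
    parts = []
    for president, date, text in records:
        parts.append(f"{president} ({date})\n{text.strip()}\n")
    return "\n".join(parts)


def write_combined(path, records):
    """Write the combined text of all records to a single file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(combine_addresses(records))


def word_counts(text):
    """Count lowercase word occurrences over the whole combined text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Hypothetical placeholder records; real texts come from the scraped pages.
records = [
    ("George Washington", "January 8, 1790", "placeholder text of the first address"),
    ("John Adams", "December 3, 1799", "placeholder text of another address"),
]
combined = combine_addresses(records)
counts = word_counts(combined)
```

Calling `write_combined("all_addresses.txt", records)` (a hypothetical file name) would then produce the single file that later processing phases can read.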
Combine (append) each address text into one big text file that holds the texts of all the Union Addresses (each text starts with the name of its President followed by the date), so that you can perform further data processing for Big Data analytics on the combined text of your entire collection, such as a word count over the whole collection for text mining. Some later phases of data processing need all the texts, so it is easier to have them in one file.

You may use an XPath library call for this lab. See the instructions on how to debug and set up the XPath library in the Lab section. See the simplified HTML source below.

HTML (in SimplifiedInfoUnionAddress.htm) of the Simplified State of the Union Address
Collected State of the Union Addresses of U.S. Presidents
Contents
George Washington (January 8, 1790)
George Washington (December 8, 1790)
George Washington (October 25, 1791)
George Washington (November 6, 1792)
George Washington (December 3, 1793)
George Washington (November 19, 1794)
George Washington (December 8, 1795)
George Washington (December 7, 1796)
John Adams (November 22, 1797)
John Adams (December 8, 1798)
John Adams (December 3, 1799)
John Adams (November 11, 1800)
Thomas Jefferson (December 8, 1801)
Thomas Jefferson (December 15, 1802)
Thomas Jefferson (October 17, 1803)
Thomas Jefferson (November 8, 1804)
Thomas Jefferson (December 3, 1805)
Thomas Jefferson (December 2, 1806)
Thomas Jefferson (October 27, 1807)
Thomas Jefferson (November 8, 1808)
James Madison (November 29, 1809)
James Madison (December 5, 1810)
James Madison (November 5, 1811)
James Madison (November 4, 1812)
James Madison (December 7, 1813)
James Madison (September 20, 1814)
James Madison (December 5, 1815)
James Madison (December 3, 1816)
James Monroe (December 12, 1817)
James Monroe (November 16, 1818)
James Monroe (December 7, 1819)
James Monroe (November 14, 1820)
James Monroe (December 3, 1821)
James Monroe (December 3, 1822)
James Monroe (December 2, 1823)
James Monroe (December 7, 1824)
John Quincy Adams (December 6, 1825)
John Quincy Adams (December 5, 1826)
John Quincy Adams (December 4, 1827)
John Quincy Adams (December 2, 1828)
Andrew Jackson (December 8, 1829)
Andrew Jackson (December 6, 1830)
Andrew Jackson (December 6, 1831)
Andrew Jackson (December 4, 1832)
Andrew Jackson (December 3, 1833)
Andrew Jackson (December 1, 1834)
Andrew Jackson (December 7, 1835)
Andrew Jackson (December 5, 1836)

For Python script: http://docs.python-guide.org/en/latest/scenarios/scrape/

Submit BOTH of the following on Blackboard:
1. A lab report that shows your setup/platform procedure, the execution that generates the output file in any structured text file format (such as CSV or TSV), your output as text and as a SQL table, and your source code/scripts.
2. All your code/scripts, input, and all output files in one zip file.
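The Part 1 steps (contents entry, address URL, four-column row) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the official solution: the helper names `parse_entry`, `address_url`, and `rows_to_csv` are hypothetical, the URL slug rule is inferred from the John Adams example in the assignment, and it has not been checked against names with more than two parts. Fetching each page and extracting the address text with XPath is left out, so Text_Address stays empty here.

```python
import csv
import io
import re

# URL prefix taken from the format given in the assignment.
BASE_URL = "https://www.infoplease.com/homework-help/us-documents/state-union-address-"

# Matches contents entries such as "John Adams (December 3, 1799)".
ENTRY_RE = re.compile(r"^(?P<president>.+?) \((?P<date>[A-Z][a-z]+ \d{1,2}, \d{4})\)$")


def parse_entry(entry):
    """Split a contents entry into (president, date)."""
    match = ENTRY_RE.match(entry.strip())
    if match is None:
        raise ValueError(f"unrecognized contents entry: {entry!r}")
    return match.group("president"), match.group("date")


def address_url(president, date):
    """Build the address page URL as firstname-lastname-month-day-year.

    Assumption: every name and date part is lowercased and joined by hyphens.
    """
    name_slug = "-".join(president.lower().split())
    date_slug = "-".join(date.lower().replace(",", "").split())
    return f"{BASE_URL}{name_slug}-{date_slug}"


def rows_to_csv(rows):
    """Serialize rows to CSV text with the four-column schema as the header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["President", "Date of Union Address", "Link to Address", "Text_Address"])
    writer.writerows(rows)
    return buffer.getvalue()


entries = ["George Washington (December 8, 1790)", "John Adams (December 3, 1799)"]
rows = []
for entry in entries:
    president, date = parse_entry(entry)
    # Text_Address is left empty; it would be filled by the page extraction step.
    rows.append((president, date, address_url(president, date), ""))

csv_text = rows_to_csv(rows)
print(csv_text)
```

Because the dates contain commas, the `csv` module quotes that column, which keeps the file loadable by tools that expect standard CSV quoting. Whether SQL Server's BULK INSERT accepts quoted fields depends on the options used, so a tab-separated output may be the simpler choice there.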