Question

1 Approved Answer

Posted on Aug 31, 2024

Please answer 2, 3 and 4

Data is everywhere nowadays. Analyzing data can help us discover the patterns in a data set and predict the outcome. It can also help us automatically classify documents. The article at the following link illustrates the document classification approach based on a simple case:

https://www.kdnuggets.com/2015/01/text-analysis-101-document-classification.html

It mentions three steps to classify a document. They are collecting data, preprocessing, and applying a classification algorithm and strategy. In this part, we are going to finish the first two steps by answering the questions below.

1) (2 points) Assume you are given a webpage with the URL below:

cnn.com/2018/03/28ews/companies/amazon-stock/index.html

The webpage is a news article about "amazon". Please write a single-line shell command to download the source code of the webpage and save the file as news.html. [Hint: you can use wget on Ubuntu and curl on macOS.]

2) (2 points) HTML is a markup language in which tags are used to tell the web browser how to format the content. You can refer to the link below for more details:

https://www.tutorialspoint.com/html/html_overview.htm

Now your task is to extract the actual article from the webpage source code news.html by extracting the text tagged by <p>. Please use a single-line shell command to finish this task and save the article to the file news.txt. Note: the extracted article is like below:

3) (3 points) The first two steps have helped us collect data from a web resource. Next we are going to preprocess the article by removing some less important words from the text file, such as "a", "the" and so on. These words are not used for text mining. Please write a C program removeChar.c to finish this task. To simplify this problem, your C program just needs to obtain the input from standard input and remove only word
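
For questions 1 and 2, here is a minimal sketch of one possible approach, not necessarily the expected answer. The download commands are left as comments so the sketch runs offline, and $URL is a placeholder for the address given in question 1. The sample news.html written by printf is a made-up stand-in for the real page. The extraction line assumes that each <p>...</p> element sits on a single line of the source, which may not hold for the actual CNN page.

```shell
# Question 1 (not executed here; $URL is a placeholder for the page address):
#   wget -O news.html "$URL"     # Ubuntu
#   curl -o news.html "$URL"     # macOS

# Hypothetical sample page standing in for the downloaded news.html.
printf '%s\n' \
  '<html>' \
  '<head><title>Sample page</title></head>' \
  '<body>' \
  '<div class="nav">Home</div>' \
  '<p>First paragraph of the article.</p>' \
  '<p class="body">Second paragraph with a <a href="#">link</a> inside.</p>' \
  '<pre>not part of the article</pre>' \
  '</body>' \
  '</html>' > news.html

# Question 2: keep only the <p>...</p> elements, then strip every remaining tag.
# The "( [^>]*)?" part allows attributes on <p> without also matching <pre>.
grep -oE '<p( [^>]*)?>.*</p>' news.html | sed -E 's/<[^>]+>//g' > news.txt

# Show the extracted article text.
cat news.txt
```

On the sample above, news.txt ends up with two lines, one per paragraph, with the inner link tag removed. A paragraph that spans several source lines would be missed by this line-based approach, so a real page may need the newlines joined first or a proper HTML parser.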