
Question


file:///C:/Users/14093/Downloads/School%20-%20TTU/DC.pdf . file:///C:/Users/14093/Downloads/School%20-%20TTU/BC3.pdf . Give two sentences explaining each step. This assignment asks you to examine text mining for classification. Provide your answers to the questions in a Word document named Assign6_LastName.doc, along with your source code saved as Assign6_LastName, and click the title link to upload and submit them.
The BC3 text file contains news about blockchain technologies, and the DC text file contains news about Ebola. Preprocessing is critical for text files, and because this assignment asks you to mine text files, it becomes even more critical for completing the two tasks below. Please read the directions carefully and answer the questions.
Please treat yourself as a data scientist and come up with solutions. To provide the best answers to the questions below, you may need to iterate on the procedures several times.
Data Preprocessing:
Import the necessary packages.
Import the two text files and create the corpus. You can create one corpus containing both text files or generate a separate corpus for each text file. If you create two corpora, you need to combine the two document-term matrices to answer the questions below. (10 pts)
Clean the data by applying the necessary steps. While cleaning the data, please come up with reasonable self-stopwords for both text files. If you have two corpora, perform the cleaning process separately for each corpus, and combine the document-term matrices after cleaning the data. There is no single way to conduct this task; hence you're strongly encouraged to provide detailed comments on your code. (35 pts)
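The cleaning and matrix-building steps above could be sketched roughly as follows. This is a minimal illustration in plain Python, not the R workflow that the later tidy() step suggests. The two sample strings, the stopword sets, and the helper names `clean` and `document_term_matrix` are all hypothetical stand-ins, not the contents of the actual BC3 or DC files.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the two text files (not the real contents).
RAW_DOCS = {
    "BC3": "Blockchain technologies said to reshape banking; the blockchain ledger is shared.",
    "DC": "Ebola outbreak: health officials said the Ebola virus spread in 2014.",
}

# A small general stopword list plus assumed corpus-specific "self-stopwords".
GENERAL_STOPWORDS = {"the", "is", "to", "in", "a", "of", "and"}
SELF_STOPWORDS = {"said"}  # assumption: reporting verbs carry no topical signal


def clean(text):
    """Lowercase, drop digits and punctuation, then remove stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stop = GENERAL_STOPWORDS | SELF_STOPWORDS
    return [t for t in tokens if t not in stop]


def document_term_matrix(docs):
    """Return (vocabulary, {doc name: row of term counts}) over a shared vocabulary."""
    counts = {name: Counter(clean(text)) for name, text in docs.items()}
    vocab = sorted(set().union(*counts.values()))
    rows = {name: [c[term] for term in vocab] for name, c in counts.items()}
    return vocab, rows


vocab, dtm = document_term_matrix(RAW_DOCS)
for name, row in dtm.items():
    print(name, dict(zip(vocab, row)))
```

Building both rows over one shared vocabulary plays the same role as combining two document-term matrices: each document gets a count for every term, including zeros for terms that appear only in the other file.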
Apply the Latent Dirichlet Allocation (LDA) technique to complete the following:
(a) Find the optimal number of topics k for the text data. After determining k, provide a rationale for why this number of topics is considered optimal for the dataset. (15 pts)
(b) Display the term-probability and document-probability by setting the number of topics to k. Then explain the significance of these probabilities in the context of the chosen topics. (20 pts)
(c) Verify your outcomes from (a) and (b) above with the tidy() function. Discuss how the tidy() function's results support or refine your findings, and explain any insights or patterns revealed by this analysis. (20 pts)
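To make the term-probability and document-probability ideas concrete, here is a small sketch of LDA fitted by collapsed Gibbs sampling in plain Python. It is an illustration under stated assumptions, not the assignment's expected solution: the toy corpus, the function name `fit_lda`, the hyperparameters `alpha` and `eta`, and the choice k=2 are all hypothetical, and choosing the optimal k is not shown. The names `beta` (per-topic term probabilities) and `gamma` (per-document topic probabilities) follow the labels used by the tidy() method for LDA models in R's tidytext package.

```python
import random

# Hypothetical toy corpus of already-cleaned tokens (not the real files).
DOCS = [
    ["blockchain", "ledger", "bitcoin", "blockchain", "ledger"],
    ["bitcoin", "ledger", "blockchain", "bitcoin"],
    ["ebola", "virus", "outbreak", "ebola", "virus"],
    ["outbreak", "virus", "ebola", "outbreak"],
]


def fit_lda(docs, k, iters=200, alpha=0.1, eta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns (vocab, beta, gamma)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    index = {w: i for i, w in enumerate(vocab)}
    v = len(vocab)
    doc_topic = [[0] * k for _ in docs]       # tokens in doc d assigned to topic t
    topic_term = [[0] * v for _ in range(k)]  # times term w is assigned to topic t
    topic_total = [0] * k                     # tokens assigned to topic t overall
    assign = []
    for d, doc in enumerate(docs):            # random initial topic for each token
        row = []
        for w in doc:
            t = rng.randrange(k)
            row.append(t)
            doc_topic[d][t] += 1
            topic_term[t][index[w]] += 1
            topic_total[t] += 1
        assign.append(row)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                t, wi = assign[d][n], index[w]
                # Remove the token, then resample its topic from the conditional.
                doc_topic[d][t] -= 1
                topic_term[t][wi] -= 1
                topic_total[t] -= 1
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_term[j][wi] + eta) / (topic_total[j] + v * eta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                assign[d][n] = t
                doc_topic[d][t] += 1
                topic_term[t][wi] += 1
                topic_total[t] += 1
    # Smoothed estimates: each row of beta and of gamma sums to 1.
    beta = [[(topic_term[t][w] + eta) / (topic_total[t] + v * eta) for w in range(v)]
            for t in range(k)]
    gamma = [[(doc_topic[d][t] + alpha) / (len(doc) + k * alpha) for t in range(k)]
             for d, doc in enumerate(docs)]
    return vocab, beta, gamma


vocab, beta, gamma = fit_lda(DOCS, k=2)

# Long "tidy" layout: one row per (topic, term) and one per (document, topic).
tidy_beta = [(t, term, p) for t, row in enumerate(beta) for term, p in zip(vocab, row)]
tidy_gamma = [(d, t, p) for d, row in enumerate(gamma) for t, p in enumerate(row)]

for t in range(2):
    top = sorted((r for r in tidy_beta if r[0] == t), key=lambda r: -r[2])[:3]
    print("topic", t, [(term, round(p, 3)) for _, term, p in top])
for d, t, p in tidy_gamma:
    print("doc", d, "topic", t, round(p, 3))
```

Reading the long layout is the point of the verification step: sorting the `tidy_beta` rows within a topic shows which terms define it, and the `tidy_gamma` rows show how strongly each document belongs to each topic.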
Note that the data files and pre-processing will be used in the test as well. It is strongly recommended that you prepare the source code of your assignment (Assign6_LastName) for the test.
