Question

1 Approved Answer

Posted on Sep 24, 2024

Use K-Means Clustering to Identify Twitter Topics in RapidMiner

To get tweets from Twitter, first create a connection in RapidMiner Studio. Go to the Connections tab and add a connection. Select Twitter as the connection type, select any local repository, and give the connection a name, e.g. Twitter. Click Create.

In the next window, you need to grant access to your Twitter account. Click the request access token icon on the right to authenticate the connection for RapidMiner. Click Open URL (or Show URL instead) and go to the link. Log in to Twitter; a code will appear on the screen, which you enter in the Copy Code section. Once done, click Complete and you will see the Access Token field filled with some encoded text. Click Test Connection to check that the connection has been established, then save the connection.

Next, extract data from Twitter and load it into RapidMiner. Search for the Search Twitter operator and place it in the process window. Set its parameters as follows: this example uses bitcoin as the query (you can use any other term, e.g. election or covid), and you can set the language to English using en. Connect the operator to the result port and view the output. If you do not have more than 50 tweets, increase the limit.

Add the Select Attributes operator and select only the id and text attributes of the Twitter data. At this point, you can write your Twitter data to an Excel file (or CSV file) for future use.

Now the data has to be cleaned. The first step is to convert the data to text and do some text processing. Add the Nominal to Text operator to the process. Then use the Replace operator to remove any URLs from the tweets; you can also remove any @ mentions.
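As a rough illustration outside RapidMiner, the @ mention removal could look like the Python sketch below. The pattern `@\w+` and the helper name `strip_mentions` are assumptions made for this example; they are not part of the original process, and the same pattern would have to be checked in the Replace operator before relying on it there.

```python
import re

# Assumed pattern: "@" followed by word characters (letters, digits, underscore).
MENTION = re.compile(r"@\w+")

def strip_mentions(text: str) -> str:
    """Remove @mentions and collapse the whitespace they leave behind."""
    without = MENTION.sub("", text)
    return " ".join(without.split())

print(strip_mentions("@alice bitcoin is up, says @bob_99 today"))
```

Note that this simple pattern would also clip the domain part of an e-mail address, so it is only suitable for tweet text.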
The regular expression used to remove URLs is:

[(http(s)?):\/\/(www\.)?a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)

Add the Process Documents from Data operator to the process. Double-click it to add the sub-processes: Tokenize (for tokenizing the document), Transform Cases, Filter Stopwords (English), and Filter Tokens (by Length), which filters out very short and very long words. Now change the parameters of the Process Documents from Data operator so that very popular and very unpopular words are removed.

The data is now ready for clustering. Add the k-Means clustering operator to the process; this example creates 5 clusters. Also add a Cluster Model Visualizer operator to visualize the clusters. Run the process to see the five clusters and how the various words relate to the different clusters.

You can aggregate the word values for each cluster using the Aggregate operator. Add the Aggregate operator and use default aggregation with the filter type set to value_type and the value type set to real. Set the aggregation function to average and group the attributes by cluster. Apply Transpose to convert the result into columnar form, then analyze the result and see how the words relate to the clusters.

Use Turbo Prep to put the data into a presentable form, then use the Charts option in the top right corner to create word clouds.

Task: Extract 100 tweets on Queen Elizabeth, analyze the data using the steps described above, and record your findings. Work on your assessment 3 and discuss any issues with your tutor.
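To sanity-check the RapidMiner results, the text-processing, clustering, and averaging steps above can be approximated in plain Python. This is a minimal sketch under several assumptions that do not come from the original: TF-IDF weighting for the word vectors, the pruning thresholds `min_df` and `max_df_ratio`, a tiny stopword list, a simplified URL pattern (`https?://\S+`) in place of the expression above, and a deterministic farthest-point start for k-means. It is not a reproduction of RapidMiner's operators.

```python
import math
import re
from collections import Counter

# Short illustrative stopword list (an assumption, far smaller than a real one).
STOPWORDS = {"the", "and", "for", "are", "was", "with", "this", "that", "has"}

def tokenize(text, min_len=3, max_len=25):
    """Lowercase, strip URLs, split on non-letters, filter stopwords and length."""
    text = re.sub(r"https?://\S+", " ", text.lower())  # simplified URL pattern
    tokens = re.split(r"[^a-z]+", text)
    return [t for t in tokens if min_len <= len(t) <= max_len and t not in STOPWORDS]

def tfidf(docs, min_df=2, max_df_ratio=0.9):
    """Return (vocab, unit-length TF-IDF vectors), pruning rare and common words."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    vocab = sorted(t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio)
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [counts[t] * math.log(n / df[t]) for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20):
    """Plain k-means with a deterministic farthest-point start (a simplification)."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def top_words(vocab, centroid, n=3):
    """Words with the highest average weight in a cluster."""
    ranked = sorted(zip(vocab, centroid), key=lambda wc: -wc[1])
    return [w for w, _ in ranked[:n]]

# Made-up sample tweets, for illustration only.
tweets = [
    "Bitcoin price rises again today https://example.com/a",
    "Bitcoin price falls sharply today",
    "Bitcoin price steady today",
    "Election vote count begins tonight",
    "Election vote results expected tonight",
    "Election vote turnout high tonight",
]
vocab, vectors = tfidf(tweets)
labels, centroids = kmeans(vectors, k=2)  # the tutorial uses 5 clusters on real data
for c, centroid in enumerate(centroids):
    print(c, top_words(vocab, centroid))
```

Each centroid is the per-cluster average of the word weights, which corresponds to the Aggregate step (average, grouped by cluster). The top-weighted words per cluster are the ones a word cloud would emphasize.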