Question
- Prepare a suitable data structure for text mining purposes.
- Develop algorithms to find separate words in a string.
- Develop a string matching method to find similar words.
- Develop readable and documented algorithms for business applications.
Scenario
Your current client is an international firm that provides text mining solutions to its clients. Their product basically works as follows: First, they obtain a text file or string, which could be a product review on a web page, a blog, or tweets about the product. Then, they identify each word in the text file or string, count the number of times each word appears, and finally sort the words from most frequent to least frequent. These frequency lists are used for obtaining text mining statistics.
Normally, after obtaining this wordlist, they use their existing dictionaries to combine similar or closely related words, such as "am", "is", and "are"; "recent" and "recently"; or "yearly" and "annually". However, they also want an API with a method that checks for similar words among the words present in the text file, especially when one word is part of another, as in "recent" and "recently" or "historic" and "prehistoric", and that provides automated suggestions for updating their dictionary. They plan to use this method for updating and developing dictionaries for different languages.
In text mining, some data preparation is required before the actual process begins. These preparations include changing every character to lowercase, removing punctuation marks, and so on. For this project, you need to transform text files into wordlists, count each word, and remove every symbol and punctuation mark.
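The preparation steps described above can be sketched roughly as follows. This is a minimal illustration, not the client's implementation: the class `TextNormalizer` and the method `toWordList` are hypothetical names, and the choice to keep digits and to treat apostrophes as separators is an assumption the task does not specify.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of the provided workspace files.
final class TextNormalizer {

    // Lowercases the text, replaces every run of characters that are not
    // letters, digits, or whitespace with a single space, then splits the
    // result on whitespace.
    static List<String> toWordList(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT)
                             .replaceAll("[^\\p{L}\\p{Nd}\\s]+", " ")
                             .trim();
        List<String> words = new ArrayList<>();
        if (cleaned.isEmpty()) {
            return words; // nothing left after cleaning
        }
        for (String word : cleaned.split("\\s+")) {
            words.add(word);
        }
        return words;
    }
}
```

With this sketch, a call such as `TextNormalizer.toWordList("Great product, RECENTLY updated!")` would yield the four lowercase words without punctuation.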
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.StringTokenizer;
import javax.swing.JFileChooser;
import javax.swing.JFrame;

public class WordSuggestions {

    /**
     *
     * @return
     *
     * @throws IOException
     */
    public ArrayList
        //Choose Files
        //Read words in Files
        //Count
        //sort
        return null;
    }

    /**
     *
     * @param wordCountList
     *
     * @return
     *
     * @throws IOException
     */
    public ArrayList

    /**
     *
     * @param wordList
     *
     * @return
     */
    public ArrayList

    /**
     *
     * @param fileArray
     *
     * @return
     *
     * @throws IOException
     */
    public ArrayList
        return null;
    }

    /**
     *
     * @return
     */
    public File[] getFiles() {
        return null;
    }

    /**
     * This method checks if P exists in T
     *
     * @param P Pattern to match
     * @param T Text to search
     *
     * @return true if P exists in T
     */
    public boolean badCharacterRuleMatch(String P, String T) {
        int n = T.length();
        int m = P.length();

        // left[i][c] holds the rightmost index of character c in P[0..i],
        // or -1 if c does not occur there. The table covers character
        // codes below 256 only.
        int e = 256;
        int left[][] = new int[m][e];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < e; j++)
                left[i][j] = -1;
        for (int i = 0; i < m; i++) {
            if (i != 0)
                for (int j = 0; j < e; j++)
                    left[i][j] = left[i - 1][j];
            left[i][P.charAt(i)] = i;
        }

        boolean hasMatch = false;
        int skip;
        // Slide the pattern along the text, comparing from right to left.
        for (int i = 0; i < n - m + 1; i += skip) {
            skip = 0;
            for (int j = m - 1; j >= 0; j--) {
                if (P.charAt(j) != T.charAt(i + j)) {
                    // Shift so the mismatched text character lines up with its
                    // rightmost occurrence in P[0..j], or by at least one position.
                    skip = Math.max(1, j - left[j][T.charAt(i + j)]);
                    break;
                }
            }
            if (skip == 0) {
                hasMatch = true;
                break;
            }
        }
        return hasMatch;
    }

    /**
     * main() method stub
     */
    public static void main(String args[]) {
        WordSuggestions ws;
        ws = new WordSuggestions();
        ws.getFiles();
    }
}
----------------------------------------------------------------------------------------------------------------------------------------------------------
public class WordCounter {

    private String word;
    private int counter;

    public WordCounter(String word) {
        super();
        this.word = word;
        counter = 1;
    }

    public WordCounter(String word, int counter) {
        super();
        this.word = word;
        this.counter = counter;
    }

    public String getWord() {
        return word;
    }

    public int getCounter() {
        return counter;
    }

    public void count() {
        // Increase the frequency of this word by one.
        counter++;
    }
}
- Do not forget to search for the shorter word within the longer word.
- Avoid checking each pair more than once.
- For easy readability, use full variable names instead of short two- or three-character abbreviations.
- For code reusability, develop separate methods where possible.
Before you begin, read the London.txt and Rome.txt files, which can be found in the workspace. A simple WordCounter.java class is created for storing word and counter pairs (that is, the word and its frequency in the file). Use the WordSuggestions.java file to develop methods.
Implement the getWordFrequencies() method. This method should take the text file, read the text, and convert it into a wordlist. Then, the wordlist should be counted and sorted. Lastly, the counted and sorted wordlist should be returned.
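The counting and sorting part of this step might look roughly like the sketch below. It is an illustration under assumptions rather than the expected solution: `FrequencySketch` and `countAndSort` are hypothetical names, `Map.Entry` pairs stand in for the provided `WordCounter` class so the block is self-contained, and the alphabetical tie-break is an assumption because the task does not define one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of the provided workspace files.
final class FrequencySketch {

    // Counts each word, then sorts from most frequent to least frequent.
    // Ties are broken alphabetically so the order is deterministic
    // (an assumption; the task does not specify a tie-break).
    static List<Map.Entry<String, Integer>> countAndSort(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum); // insert 1 or add 1
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort((first, second) -> {
            int byCount = Integer.compare(second.getValue(), first.getValue());
            return byCount != 0 ? byCount : first.getKey().compareTo(second.getKey());
        });
        return sorted;
    }
}
```

In the actual assignment, each entry would presumably be converted to a `WordCounter` using its two-argument constructor.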
Task
Develop the getSuggestions(ArrayList
Each word is compared with every other word to check whether one word is within the other as in like and likelihood. You can use bad-character rule, Boyer Moore, or Good Suffix to skip unnecessary character checks in the string.
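One way to organize the pairwise comparison is sketched below. The names `SuggestionSketch` and `findContainedPairs` are hypothetical, the input is assumed to contain distinct words, and `String.contains` is used only as a stand-in for the matcher; in the assignment, `badCharacterRuleMatch(shorter, longer)` would presumably take its place.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the provided workspace files.
final class SuggestionSketch {

    // Returns {shorter, longer} pairs in which the shorter word occurs
    // inside the longer one. Starting the inner loop at first + 1 ensures
    // that each unordered pair is examined exactly once.
    static List<String[]> findContainedPairs(List<String> words) {
        List<String[]> pairs = new ArrayList<>();
        for (int first = 0; first < words.size(); first++) {
            for (int second = first + 1; second < words.size(); second++) {
                String wordA = words.get(first);
                String wordB = words.get(second);
                String shorter = wordA.length() <= wordB.length() ? wordA : wordB;
                String longer = wordA.length() <= wordB.length() ? wordB : wordA;
                // String.contains is a stand-in for the string matcher here.
                if (!shorter.equals(longer) && longer.contains(shorter)) {
                    pairs.add(new String[] {shorter, longer});
                }
            }
        }
        return pairs;
    }
}
```

For a list of n words this still performs n(n-1)/2 comparisons; the bad-character rule only reduces the work inside each individual comparison.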
Tasks
Develop the countWords(ArrayList
Develop the getWordList(File[] fileArray) method to get all words in the text file. Ignore words that have three or fewer characters.
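The reading and filtering logic for one file could be sketched as follows. `WordListSketch`, `readWords`, and `MIN_WORD_LENGTH` are hypothetical names; the method takes a `BufferedReader` rather than a `File[]` so that it can be tried without files on disk, and stripping every non-letter character is an assumption about the intended cleanup.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

// Hypothetical helper class, not part of the provided workspace files.
final class WordListSketch {

    // Words with three or fewer characters are ignored.
    static final int MIN_WORD_LENGTH = 4;

    // Reads the input line by line, lowercases it, replaces non-letter
    // characters with spaces, and collects every sufficiently long word.
    static List<String> readWords(BufferedReader reader) throws IOException {
        List<String> words = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            String cleaned = line.toLowerCase(Locale.ROOT)
                                 .replaceAll("[^\\p{L}\\s]+", " ");
            StringTokenizer tokens = new StringTokenizer(cleaned);
            while (tokens.hasMoreTokens()) {
                String word = tokens.nextToken();
                if (word.length() >= MIN_WORD_LENGTH) {
                    words.add(word);
                }
            }
        }
        return words;
    }
}
```

For each entry of the file array, the reader would typically be created as `new BufferedReader(new FileReader(file))` inside a try-with-resources statement so that it is closed after reading.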
Lastly, implement the getFiles() method to allow the user to select a file using the JFileChooser class.
The getFiles() method will open a JFileChooser window that can be accessed in the Desktop tab. You may need to expand this pane to view the GUI.
Your customer wants you to develop a method that will find the sentences that contain a specific word. This is basically a word search, but your customer needs the list of the full sentence which has the search term.
They are planning to use this method for sentiment analysis, which involves computationally identifying and categorizing opinions expressed in a piece of text. They are planning to search the web (tweets, blogs, social media, and so on) for user opinions about a specific product or situation. Once they get the sentence with the search word in it, they will evaluate the attitude of the user using the whole sentence.
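A sentence search of this kind could be sketched as below. The names `SentenceSearchSketch` and `findSentences` are hypothetical, and the sentence splitting is deliberately naive: it assumes a sentence ends at '.', '!' or '?' followed by whitespace, so abbreviations such as "e.g." would be split incorrectly. The match is assumed to be case-insensitive and limited to whole words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the provided workspace files.
final class SentenceSearchSketch {

    // Returns every sentence of the text that contains the search word.
    static List<String> findSentences(String text, String searchWord) {
        // Case-insensitive whole-word match; Pattern.quote keeps any regex
        // metacharacters in the search term literal.
        Pattern wordPattern = Pattern.compile(
                "\\b" + Pattern.quote(searchWord) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        List<String> matches = new ArrayList<>();
        // Naive split: break after '.', '!' or '?' when whitespace follows.
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            String trimmed = sentence.trim();
            if (!trimmed.isEmpty() && wordPattern.matcher(trimmed).find()) {
                matches.add(trimmed);
            }
        }
        return matches;
    }
}
```

Returning the full sentence rather than only the match position keeps the surrounding context available for the later sentiment evaluation.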
Language Java