Question
(JAVA) Write code in the main function of class Analysis. You are free to modify Business.java if you use it. In the end, everything must be put in a package called Data.
Starter Code:
Business.java
package Data;

public class Business {
    String businessID;
    String businessName;
    String businessAddress;
    String reviews;
    int reviewCharCount;

    // Prints one field per line, matching the example output further down.
    public String toString() {
        return "-------------------------------------------------------------------------------\n"
            + "Business ID: " + businessID + "\n"
            + "Business Name: " + businessName + "\n"
            + "Business Address: " + businessAddress + "\n"
            //+ "Reviews: " + reviews + "\n"
            + "Character Count: " + reviewCharCount;
    }
}
Use the data set provided at the bottom of the post. It contains the following format:
{businessID, businessName, businessAddress, reviews}
{businessID, businessName, businessAddress, reviews}
The reviews consist of lowercase English letters and spaces, without any punctuation or non-English characters. The goal of this assignment is to process this data set and extract the meaningful words that represent each business.
Read the file, and create a Business Object for each business. Each Business should contain its reviews as a String. You may use the class Business provided in Business.java.
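A minimal sketch of turning one line of the data set into its four fields. It assumes each record sits on a single line, that the first three fields contain no commas, and that the braces shown in the format above may or may not be present in the file. BusinessLineParser and Record are hypothetical names used only so the sketch compiles on its own; in the assignment you would fill in a Business instead.

```java
// Hypothetical helper, not part of the starter code.
final class BusinessLineParser {

    /** Stand-in for Business so this sketch is self-contained. */
    static final class Record {
        final String businessID;
        final String businessName;
        final String businessAddress;
        final String reviews;
        final int reviewCharCount;

        Record(String id, String name, String address, String reviews) {
            this.businessID = id;
            this.businessName = name;
            this.businessAddress = address;
            this.reviews = reviews;
            this.reviewCharCount = reviews.length();
        }
    }

    /** Returns the parsed record, or null if the line does not have four fields. */
    static Record parse(String line) {
        String s = line.trim();
        // Assumption: the surrounding braces are optional.
        if (s.startsWith("{")) s = s.substring(1);
        if (s.endsWith("}")) s = s.substring(0, s.length() - 1);

        // Limit 4 keeps the whole review text in the last element.
        String[] parts = s.split(",", 4);
        if (parts.length < 4) return null;

        return new Record(parts[0].trim(), parts[1].trim(),
                          parts[2].trim(), parts[3].trim());
    }
}
```

Storing only the character count and the raw review String keeps each object small, which matters once the full data set is loaded.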
We will determine whether or not a word meaningfully represents a business by the number of times it appears in the reviews. However, words like "a", "that", "is", "or", and "and" are the most frequent, and these words clearly do not say much about the business.
Therefore, we use the term frequency-inverse document frequency (tf-idf) score:
$$\text{tf-idf}(w, D) = \frac{\text{number of times word } w \text{ appears in document } D}{\text{number of documents in the entire corpus that contain word } w}$$
The tf-idf score is high if a rare word appears many times in a certain document. The numerator is the term frequency, and the denominator is the document frequency.
For each word, count the number of documents the word appears in.
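One way to accumulate the document frequency is sketched below. It treats the reviews of one business as one document and uses a per-document HashSet so that a word is counted at most once per document. The class name DocumentFrequency is hypothetical, and this version takes the review text directly rather than a Business so that it compiles on its own.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the starter code.
final class DocumentFrequency {

    /** Adds 1 to the count of every distinct word that occurs in this document. */
    static void addDocumentCount(Map<String, Integer> corpusDFCount, String reviews) {
        Set<String> seen = new HashSet<>();
        for (String word : reviews.split(" ")) {
            // Skip the empty tokens produced by consecutive spaces.
            // seen.add returns false if the word was already counted for this document.
            if (!word.isEmpty() && seen.add(word)) {
                corpusDFCount.merge(word, 1, Integer::sum);
            }
        }
    }
}
```

The set is local to the call, so it can be garbage collected as soon as the document has been processed.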
For the top 10 Businesses with the most characters in their reviews, output (to the command line) the top 30 words with the highest tf-idf scores.
When you do so, you'll notice that some words with high tf-idf scores are so rare that they appear in only 1 or 2 documents. Some of these words are misspellings or slang that only makes sense to locals. Filter these out.
If a word appears in fewer than 5 documents, assign it a tf-idf score of 0.
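The scoring and filtering rules above can be sketched as follows: term frequency divided by document frequency, with a score of 0 for any word that appears in fewer than 5 documents, followed by a sort to pick the highest-scoring words. TfidfSketch, score, topWords, and MIN_DOCUMENTS are hypothetical names, and the document-frequency map is assumed to have been filled in a previous pass over the corpus.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the starter code.
final class TfidfSketch {

    /** Words found in fewer documents than this get a score of 0. */
    static final int MIN_DOCUMENTS = 5;

    /** Maps each word of one document to (term frequency) / (document frequency). */
    static Map<String, Double> score(Map<String, Integer> corpusDFCount, String reviews) {
        // Term frequency: how often each word occurs in this document.
        Map<String, Integer> tf = new HashMap<>();
        for (String word : reviews.split(" ")) {
            if (!word.isEmpty()) tf.merge(word, 1, Integer::sum);
        }

        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int df = corpusDFCount.getOrDefault(e.getKey(), 0);
            // The cast avoids integer division; rare words are filtered to 0.
            double s = df < MIN_DOCUMENTS ? 0.0 : (double) e.getValue() / df;
            scores.put(e.getKey(), s);
        }
        return scores;
    }

    /** Returns up to n entries with the highest scores, best first. */
    static List<Map.Entry<String, Double>> topWords(Map<String, Double> scores, int n) {
        List<Map.Entry<String, Double>> list = new ArrayList<>(scores.entrySet());
        list.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return list.subList(0, Math.min(n, list.size()));
    }
}
```

The per-document maps live only for the duration of the call, so nothing large has to be stored on the Business objects themselves.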
Poorly written code will take too long to process the full data set. Perform some optimization.
Optimize your code so that it runs on the full data set within 10 minutes.
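Your code could look something like this:
public static void main(String[] args) {
    Map<String, Integer> corpusDFCount = new HashMap<>();
    List<Business> businessList = new ArrayList<>();
    while (true) {
        Business b = readBusiness(???);
        if (b == null) // end of file and processed all businesses
            break;
        businessList.add(b);
    }
    for (Business b : businessList)
        addDocumentCount(corpusDFCount, b);
    // sort by character count
    Collections.sort(businessList, ???);
    // for the top 10 businesses with the most review characters
    for (int i = 0; i < 10; i++) {
        // getTfidfScore is a hypothetical helper that maps each word of the business to its tf-idf score
        Map<String, Double> tfidfScoreMap = getTfidfScore(corpusDFCount, businessList.get(i));
        // Entry is a static nested interface of class Map
        List<Map.Entry<String, Double>> tfidfScoreList = new ArrayList<>(tfidfScoreMap.entrySet());
        sortByTfidf(tfidfScoreList);
        System.out.println(businessList.get(i));
        printTopWords(tfidfScoreList, 30);
    }
}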
This code can be further optimized using a PriorityQueue:
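public static void main(String[] args) {
    Map<String, Integer> corpusDFCount = new HashMap<>();
    // ordered so that the business with the fewest review characters is removed first
    PriorityQueue<Business> businessQueue = new PriorityQueue<>(???);
    while (true) {
        Business b = readBusiness(???);
        if (b == null) // end of file and processed all businesses
            break;
        addDocumentCount(corpusDFCount, b);
        businessQueue.add(b);
        if (businessQueue.size() > 10)
            businessQueue.remove();
    }
    // for the top 10 businesses with the most review characters
    for (int i = 0; i < 10; i++) {
        Business currB = businessQueue.remove();
        // getTfidfScore is a hypothetical helper that maps each word of the business to its tf-idf score
        Map<String, Double> tfidfScoreMap = getTfidfScore(corpusDFCount, currB);
        // Entry is a static nested interface of class Map
        List<Map.Entry<String, Double>> tfidfScoreList = new ArrayList<>(tfidfScoreMap.entrySet());
        sortByTfidf(tfidfScoreList);
        System.out.println(currB);
        printTopWords(tfidfScoreList, 30);
    }
}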
(This code is just a suggestion provided to clarify the instructions.)
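The bounded-queue idea in the second suggestion can be sketched on its own as below: a min-heap ordered by character count that never holds more than k elements, so the smallest candidate is evicted whenever an eleventh (in general, a (k+1)-th) element arrives. TopKByCharCount and Item are hypothetical names; Item stands in for Business so that the sketch compiles by itself.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper, not part of the starter code.
final class TopKByCharCount {

    /** Stand-in for Business, holding only what the ordering needs. */
    static final class Item {
        final String name;
        final int reviewCharCount;

        Item(String name, int reviewCharCount) {
            this.name = name;
            this.reviewCharCount = reviewCharCount;
        }
    }

    /** Returns the k items with the largest character counts, largest first. */
    static List<Item> topK(Iterable<Item> items, int k) {
        // Min-heap: the head is always the smallest of the current candidates.
        PriorityQueue<Item> heap = new PriorityQueue<>(
                Comparator.comparingInt((Item it) -> it.reviewCharCount));
        for (Item it : items) {
            heap.add(it);
            if (heap.size() > k) heap.remove(); // evict the current smallest
        }

        // Draining a min-heap yields ascending order, so reverse at the end.
        List<Item> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(heap.remove());
        Collections.reverse(result);
        return result;
    }
}
```

Because at most k + 1 objects are reachable from the queue at any time, the review text of every other business can be reclaimed as soon as its document counts have been added. Draining with a loop on isEmpty() also avoids an exception when the file holds fewer than k businesses.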
Your output should look something like:
Example 1
Business ID: 60454
Business Name: Bacchanal Buffet
Business Address: Caesars Palace Las Vegas Hotel And Casino 3570 Las Vegas Boulevard South The Strip Las Vegas NV 89109
Character Count: 3780749
(bacchanal,10.75) (bacchanals,2.73) (buffet,1.48) (baccahanal,1.17) (bachannal,1.07) (buffets,1.06) (bacchanel,1) (bachanal,0.92) (ginseng,0.91) (alvaro,0.89) (baccanal,0.89) (carving,0.84) (legsclaws,0.83) (macarons,0.74) (oysters,0.64) (platinumdiamond,0.6) (virkelig,0.6) (wicked,0.57) (bacchanalian,0.56) (crab,0.54) (dionysus,0.5) (fastpass,0.5) (wicket,0.5) (gelato,0.48) (legs,0.46) (bucchanal,0.45) (the,0.45) (maccaroons,0.44) (jonah,0.44) (vomitorium,0.44)
Example 2
Business ID: 20187
Business Name: Mon Ami Gabi
Business Address: 3655 Las Vegas Blvd S The Strip Las Vegas NV 89109
Character Count: 3481415
Hint. When optimizing your code, consider the following advice.
Instead of directly using java.io.FileInputStream, java.io.BufferedInputStream, or java.io.Reader, use java.io.BufferedReader with java.io.FileReader. You can read a line of the data set with readLine() of BufferedReader.
You can split a String on commas with split() of String.
Strings are immutable, and using + to concatenate Strings is inefficient. When you're building up a large String, use java.lang.StringBuffer or java.lang.StringBuilder.
If you add additional fields to class Business, make sure your additional fields aren't too large. In particular, you're doing something wrong if each Business owns (as a field) a Collection or a Map.
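A small sketch of the I/O and String advice in this hint: a BufferedReader wrapped around a FileReader, a readLine() loop that handles one line at a time, and a StringBuilder for assembling a large String. ReadingSketch and its method names are hypothetical, and wrapping IOException in UncheckedIOException is just one possible way to handle the error.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of the starter code.
final class ReadingSketch {

    /** Opens the data set with buffering; the caller is responsible for closing it. */
    static BufferedReader open(String path) throws IOException {
        return new BufferedReader(new FileReader(path));
    }

    /** Passes each non-empty line to the handler without keeping the lines around. */
    static void forEachLine(BufferedReader in, Consumer<String> handler) {
        try {
            String line;
            while ((line = in.readLine()) != null) { // null signals end of file
                if (!line.isEmpty()) handler.accept(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Joins words with single spaces using one StringBuilder instead of repeated +. */
    static String joinWords(List<String> words) {
        StringBuilder sb = new StringBuilder();
        for (String w : words) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(w);
        }
        return sb.toString();
    }
}
```

Repeated + in a loop copies the whole String on every step, so its cost grows quadratically with the length, while StringBuilder appends into a growing buffer.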
Remark. As a point of reference, my code runs in 90 seconds. Almost all of this time is spent in the while loop reading the Businesses and accumulating the document count.
Remark. The suggested code for this assignment is not very object-oriented, and that's fine. Object-oriented programming is a tool, and a tool should only be used when it fits the task.
Remark. In our data analysis, we entirely ignore the grammar the reviews were written in. Natural language processing is a field of artificial intelligence that studies how to process human language with a computer. Techniques from natural language processing would use the grammatical information and produce better results.
I had issues uploading the data set as a file here, so I will post it in the question.