Question
IN JAVA!
Problem Statement: Design and implement a web crawler and data structures for a simulation of the Google Search Engine (if you did not build one before, or built a poor one, this is the time to retry and develop a better one). You are a Software Engineer at Google and are asked to carry out the following Google Search Engine Internal Process:
1. Based on the scores of the 30 URLs that you retrieved from your web crawler, use the Quicksort algorithm to sort the scores for PageRank instead of using Heapsort from PA1. (You MUST print a list of the 30 URLs including Index, total score, PageRank, and URL.)
2. Based on the scores of the 30 URLs, use a Binary Search Tree to manipulate the data. The following section lists the programming requirements for the BST.
3. To speed up the search result, the Google search engine dynamically collects a list of the top N popular keywords (N=10, for instance, in PA1) and uses Bucket Sort to sort the company names of the URLs in alphabetical order (example: http://www.abcde.com). You can store company names starting with A in bucket A, names starting with B in bucket B, and so on.
Programming Requirements:
1. The Google Search Engine Internal Process MUST be written in Java and is required to use the pseudocode in the textbook. (Please note that you MUST use the pseudocode provided in the textbook to write your Java code. Any other code will be treated as failing the requirements.)
2. You need to use a web crawler to enter keywords, and each keyword will then retrieve 30 URLs for the further Google Internal Process.
3. You must follow the four PageRank factors below to calculate the score for PageRank.
A Web page's PageRank depends on a few factors:
1. The frequency and location of keywords within the Web page: if a keyword appears only once within the body of a page, the page receives a low score for that keyword.
2. How long the Web page has existed: people create new Web pages every day, and not all of them stick around for long. Google places more value on pages with an established history.
3. The number of other Web pages that link to the page in question: Google looks at how many Web pages link to a particular site to determine its relevance.
4. How much the Web page owner has paid to Google for advertisement purposes: website owners pay a lump sum of money to Google to increase the priority of PageRank for advertising their services/products.
(A minimal sketch of combining these four factors into a total score follows this requirements list.)
4. Your simulation application MUST at least contain the following functions for the URL Binary Search Tree (BST):
a) Build up a Process BST. (It MUST follow the BST properties specified in the textbook and PPT slides; your own tree structure will not be accepted.)
b) Users can search for a specific PageRank and show the corresponding URL (users want to know the score of a specific website).
c) Users can insert a URL into the BST based on its total score and show the result.
d) Users can delete a URL from the BST and show the result.
e) Users can produce a list of URLs from the BST sorted by score and show the result.
f) To show the result, you MUST print a list of URLs including Index, total score, PageRank, and URL.
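Before the solution steps, the four PageRank factors above have to be combined into a single total score that every later step (Quicksort, the BST, and the printed lists) can use. Below is a minimal sketch of one way to represent a crawled URL and its score; the field names and the simple unweighted sum are illustrative assumptions, not values prescribed by the assignment. The sketches in the solution steps reuse this UrlScore class.

```java
// Hypothetical record for one crawled URL; the field names and the unweighted
// sum in totalScore() are illustrative assumptions, not assignment-mandated.
public class UrlScore {
    String url;
    int keywordFrequency;   // factor 1: how often/where the keyword appears
    int ageInDays;          // factor 2: how long the page has existed
    int inboundLinks;       // factor 3: number of pages linking to this page
    int adPayment;          // factor 4: amount paid to Google for advertising

    public UrlScore(String url, int keywordFrequency, int ageInDays,
                    int inboundLinks, int adPayment) {
        this.url = url;
        this.keywordFrequency = keywordFrequency;
        this.ageInDays = ageInDays;
        this.inboundLinks = inboundLinks;
        this.adPayment = adPayment;
    }

    // Total score used for PageRank: a plain sum of the four factors.
    // Any weighting scheme could be substituted here.
    public int totalScore() {
        return keywordFrequency + ageInDays + inboundLinks + adPayment;
    }
}
```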
Step by Step Solution
There are 3 Steps involved in it
Step: 1
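Step 1 sorts the total scores of the 30 URLs with Quicksort. A minimal sketch following the textbook-style QUICKSORT/PARTITION pseudocode (Lomuto partition, last element as pivot) is shown below; it sorts an array of the UrlScore objects from the earlier sketch in ascending order and then prints Index, total score, PageRank, and URL, with the highest score receiving PageRank 1. The three sample entries in main are made up for illustration; the real program would use the 30 crawled URLs.

```java
public class PageRankQuicksort {

    // QUICKSORT(A, p, r) from the textbook pseudocode, adapted to UrlScore objects.
    static void quicksort(UrlScore[] a, int p, int r) {
        if (p < r) {
            int q = partition(a, p, r);
            quicksort(a, p, q - 1);
            quicksort(a, q + 1, r);
        }
    }

    // PARTITION(A, p, r): Lomuto partition using the last element as the pivot.
    static int partition(UrlScore[] a, int p, int r) {
        int pivot = a[r].totalScore();
        int i = p - 1;
        for (int j = p; j < r; j++) {
            if (a[j].totalScore() <= pivot) {
                i++;
                swap(a, i, j);
            }
        }
        swap(a, i + 1, r);
        return i + 1;
    }

    static void swap(UrlScore[] a, int i, int j) {
        UrlScore tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }

    // Prints Index, total score, PageRank, and URL after an ascending sort;
    // the highest total score is assigned PageRank 1.
    static void printRanked(UrlScore[] a) {
        System.out.printf("%-6s%-12s%-10s%s%n", "Index", "TotalScore", "PageRank", "URL");
        for (int i = 0; i < a.length; i++) {
            System.out.printf("%-6d%-12d%-10d%s%n",
                    i + 1, a[i].totalScore(), a.length - i, a[i].url);
        }
    }

    public static void main(String[] args) {
        // Illustrative sample data; the real program uses the 30 crawled URLs.
        UrlScore[] urls = {
                new UrlScore("http://www.abcde.com", 12, 300, 45, 10),
                new UrlScore("http://www.example.com", 3, 1500, 80, 0),
                new UrlScore("http://www.sample.org", 7, 20, 5, 100)
        };
        quicksort(urls, 0, urls.length - 1);
        printRanked(urls);
    }
}
```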
Step: 2
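Step 2 manipulates the same URL records with a Binary Search Tree keyed on the total score. Below is a compact sketch of the required operations (build/insert, search by score, delete, and an in-order walk that yields the sorted list); it uses recursive versions of the textbook TREE-INSERT / TREE-SEARCH / TREE-DELETE / INORDER-TREE-WALK routines for brevity and reuses the UrlScore class from the earlier sketch. Printing the full Index / total score / PageRank / URL table would be layered on top of the in-order walk.

```java
public class UrlBst {

    // One BST node keyed by the URL's total score.
    static class Node {
        UrlScore data;
        Node left, right;
        Node(UrlScore data) { this.data = data; }
    }

    Node root;

    // TREE-INSERT: scores <= the node's key go left, larger scores go right.
    void insert(UrlScore u) {
        root = insert(root, u);
    }

    private Node insert(Node n, UrlScore u) {
        if (n == null) return new Node(u);
        if (u.totalScore() <= n.data.totalScore()) n.left = insert(n.left, u);
        else n.right = insert(n.right, u);
        return n;
    }

    // TREE-SEARCH by total score; returns null if no URL has that score.
    UrlScore search(int score) {
        Node n = root;
        while (n != null && n.data.totalScore() != score) {
            n = (score < n.data.totalScore()) ? n.left : n.right;
        }
        return n == null ? null : n.data;
    }

    // TREE-DELETE by total score (successor-based removal for two-child nodes).
    void delete(int score) {
        root = delete(root, score);
    }

    private Node delete(Node n, int score) {
        if (n == null) return null;
        if (score < n.data.totalScore()) n.left = delete(n.left, score);
        else if (score > n.data.totalScore()) n.right = delete(n.right, score);
        else {
            if (n.left == null) return n.right;
            if (n.right == null) return n.left;
            Node succ = n.right;                 // in-order successor: minimum of right subtree
            while (succ.left != null) succ = succ.left;
            n.data = succ.data;
            n.right = delete(n.right, succ.data.totalScore());
        }
        return n;
    }

    // INORDER-TREE-WALK: prints the URLs in ascending order of total score.
    void printSorted() {
        printSorted(root);
    }

    private void printSorted(Node n) {
        if (n == null) return;
        printSorted(n.left);
        System.out.println(n.data.totalScore() + "  " + n.data.url);
        printSorted(n.right);
    }
}
```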
Step: 3
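Step 3 groups the URLs of the top N keywords into alphabetical buckets by company name and sorts each bucket. The sketch below assumes the URLs follow the http://www.name.com pattern from the example, uses one bucket per letter A-Z, sorts each bucket, and concatenates the buckets; the parsing helper and the fallback for names that do not start with a letter are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CompanyBucketSort {

    // Extracts the company name from a URL of the form http://www.name.com.
    // This simple parsing is an assumption; real-world URLs may need java.net.URI.
    static String companyName(String url) {
        String host = url.replaceFirst("^https?://", "").replaceFirst("^www\\.", "");
        int dot = host.indexOf('.');
        return dot >= 0 ? host.substring(0, dot) : host;
    }

    // Bucket Sort: one bucket per letter A..Z, sort each bucket, then concatenate.
    static List<String> bucketSort(List<String> urls) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < 26; i++) buckets.add(new ArrayList<>());

        for (String url : urls) {
            char first = Character.toUpperCase(companyName(url).charAt(0));
            int index = (first >= 'A' && first <= 'Z') ? first - 'A' : 0; // non-letters fall into bucket A
            buckets.get(index).add(url);
        }

        List<String> sorted = new ArrayList<>();
        for (List<String> bucket : buckets) {
            Collections.sort(bucket, (a, b) -> companyName(a).compareToIgnoreCase(companyName(b)));
            sorted.addAll(bucket);
        }
        return sorted;
    }

    public static void main(String[] args) {
        // Illustrative sample data; the real program uses the URLs of the top N keywords.
        List<String> urls = List.of(
                "http://www.zebra.org", "http://www.abcde.com", "http://www.banana.net");
        bucketSort(urls).forEach(System.out::println);
    }
}
```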