
Question


IN JAVA!

Problem Statement: Design and implement a web crawler and data structure for the simulation of the Google search engine (if you didn't build one, or you built a bad one, now is the time to retry and develop a better one). You are a software engineer at Google and are asked to implement the following internal processes of Google's search engine:

1. Based on the scores of the 30 URLs retrieved by your web crawler, use the Quicksort algorithm to sort the scores for PageRank instead of the Heapsort used in PA1. (You MUST print a list of the 30 URLs including index, total score, PageRank, and URL.)

2. Based on the scores of the 30 URLs, use a Binary Search Tree (BST) to manipulate the data. The Programming Requirements section below lists the requirements for the BST.

3. To speed up search results, the Google search engine dynamically collects a list of the top N popular keywords (for instance, N = 10 in PA1) and uses Bucket Sort to sort the company names in their URLs in alphabetical order (example: http://www.abcde.com). You can store company names starting with A in bucket A, those starting with B in bucket B, and so on.
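The two sorting steps above can be sketched as follows. This is a minimal illustration, not the textbook's pseudocode: the quicksort uses the common last-element-pivot (Lomuto) partition, so adjust it to match the pseudocode your course requires. The class name `SortSketch`, the `Page` class, the integer factor scores, the rule "PageRank = position after sorting", and the `companyName` helper are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

class SortSketch {
    /** One crawled result; the four factor scores are hypothetical integer inputs. */
    static class Page {
        final String url;
        final int total; // assumed: total score = sum of the four factor scores
        Page(String url, int keyword, int age, int links, int ads) {
            this.url = url;
            this.total = keyword + age + links + ads;
        }
    }

    /** Quicksort on a[p..r], highest total score first. */
    static void quicksort(Page[] a, int p, int r) {
        if (p < r) {
            int q = partition(a, p, r);
            quicksort(a, p, q - 1);
            quicksort(a, q + 1, r);
        }
    }

    /** Lomuto partition with a[r] as pivot; returns the pivot's final index. */
    static int partition(Page[] a, int p, int r) {
        int pivot = a[r].total;
        int i = p - 1;
        for (int j = p; j < r; j++) {
            if (a[j].total >= pivot) { // ">=" keeps higher scores on the left
                i++;
                swap(a, i, j);
            }
        }
        swap(a, i + 1, r);
        return i + 1;
    }

    static void swap(Page[] a, int i, int j) {
        Page t = a[i];
        a[i] = a[j];
        a[j] = t;
    }

    /** Assumed rule: drop the scheme and a leading "www.", keep text up to the first dot. */
    static String companyName(String url) {
        String s = url.replaceFirst("^[a-zA-Z]+://", "").replaceFirst("^www\\.", "");
        int end = s.indexOf('.');
        return (end < 0 ? s : s.substring(0, end)).toLowerCase();
    }

    /** Bucket sort for lower-case names: one bucket per letter a..z, plus one for anything else. */
    static List<String> bucketSort(List<String> names) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < 27; i++) buckets.add(new ArrayList<>());
        for (String n : names) {
            char c = n.isEmpty() ? '?' : n.charAt(0);
            int idx = (c >= 'a' && c <= 'z') ? c - 'a' : 26;
            List<String> b = buckets.get(idx);
            int pos = b.size(); // insertion sort step inside the bucket
            while (pos > 0 && b.get(pos - 1).compareTo(n) > 0) pos--;
            b.add(pos, n);
        }
        List<String> out = new ArrayList<>();
        for (List<String> b : buckets) out.addAll(b); // concatenate buckets in order
        return out;
    }

    public static void main(String[] args) {
        Page[] pages = { // made-up sample data, not crawler output
            new Page("http://www.delta.com", 10, 5, 20, 0),
            new Page("http://www.alpha.com", 30, 10, 40, 5),
            new Page("http://www.charlie.com", 20, 5, 10, 15),
        };
        quicksort(pages, 0, pages.length - 1);
        System.out.printf("%-7s%-13s%-10s%s%n", "Index", "Total score", "PageRank", "URL");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < pages.length; i++) {
            System.out.printf("%-7d%-13d%-10d%s%n", i, pages[i].total, i + 1, pages[i].url);
            names.add(companyName(pages[i].url));
        }
        System.out.println(bucketSort(names));
    }
}
```

In a full solution the `Page` array would hold the 30 URLs returned by the crawler for a keyword, and the non-letter bucket placement (last) is an arbitrary choice for names that do not start with a letter.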

Programming Requirements:

1. The Google Search Engine Internal Process MUST be written in Java and is required to use the pseudocode in the textbook. (Please note that you MUST use the pseudocode provided in the textbook to write your Java code. Any other code will be treated as failing the requirements.)

2. You need to use a web crawler to enter keywords; each keyword will then receive 30 URLs for further Google internal processing.

3. You must follow the four PageRank factors to calculate score for PageRank.

A Web page's PageRank depends on a few factors: 1. The frequency and location of keywords within the Web page: If the keyword only appears once within the body of a page, it will receive a low score for that keyword. 2. How long the Web page has existed: People create new Web pages every day, and not all of them stick around for long. Google places more value on pages with an established history. 3. The number of other Web pages that link to the page in question: Google looks at how many Web pages link to a particular site to determine its relevance. 4. How much the Web page owner has paid Google for advertising: Website owners pay a lump sum of money to Google to increase the PageRank priority of the advertisements for their services/products.

4. Your simulation application MUST contain at least the following functions for the URL Binary Search Tree (BST):

a) Build up a Process BST. (MUST follow the BST properties specified in the textbook and PPT slides. Your own tree structure will not be accepted.)

b) Users can search for a specific PageRank and show the corresponding URL (the user wants to know the score of a specific website).

c) Users can insert a URL into the BST based on its total score and show the result.

d) Users can delete a URL from the BST and show the result.

e) Users can make a sorted list of URLs according to score from the BST and show the result.

f) To show the result, you MUST print a list of URLs including Index, total score, PageRank, and URL.
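The BST operations in a) through f) can be sketched as follows. This is a minimal illustration under stated assumptions, not the textbook's tree: the class name `UrlBst`, keying the tree on total score, sending equal scores to the right subtree, and deriving PageRank from position in a highest-first traversal are all choices made for this example. Deletion uses the standard successor-replacement approach.

```java
import java.util.ArrayList;
import java.util.List;

class UrlBst {
    static class Node {
        int score;   // total score, used as the BST key (assumption)
        String url;
        Node left, right;
        Node(int score, String url) { this.score = score; this.url = url; }
    }

    Node root;

    /** Iterative insert: smaller keys go left, equal or larger keys go right. */
    void insert(int score, String url) {
        Node z = new Node(score, url);
        Node parent = null, x = root;
        while (x != null) {
            parent = x;
            x = (score < x.score) ? x.left : x.right;
        }
        if (parent == null) root = z;
        else if (score < parent.score) parent.left = z;
        else parent.right = z;
    }

    /** Returns a node with the given score, or null if none exists. */
    Node search(int score) {
        Node x = root;
        while (x != null && x.score != score) {
            x = (score < x.score) ? x.left : x.right;
        }
        return x;
    }

    /** Removes one node with the given score, if present. */
    void delete(int score) { root = delete(root, score); }

    private Node delete(Node x, int score) {
        if (x == null) return null;
        if (score < x.score) {
            x.left = delete(x.left, score);
        } else if (score > x.score) {
            x.right = delete(x.right, score);
        } else {
            if (x.left == null) return x.right;  // zero or one child
            if (x.right == null) return x.left;
            Node s = x.right;                    // two children: find the successor
            while (s.left != null) s = s.left;
            x.score = s.score;                   // copy the successor's data up
            x.url = s.url;
            x.right = delete(x.right, s.score);  // then remove the successor node
        }
        return x;
    }

    /** Reverse in-order walk: collects nodes from highest to lowest score. */
    private void descending(Node x, List<Node> out) {
        if (x == null) return;
        descending(x.right, out);
        out.add(x);
        descending(x.left, out);
    }

    List<Node> sortedByScore() {
        List<Node> out = new ArrayList<>();
        descending(root, out);
        return out;
    }

    /** Prints index, total score, PageRank (assumed: 1 = highest score), and URL. */
    void print() {
        List<Node> nodes = sortedByScore();
        System.out.printf("%-7s%-13s%-10s%s%n", "Index", "Total score", "PageRank", "URL");
        for (int i = 0; i < nodes.size(); i++) {
            Node n = nodes.get(i);
            System.out.printf("%-7d%-13d%-10d%s%n", i, n.score, i + 1, n.url);
        }
    }

    public static void main(String[] args) {
        UrlBst tree = new UrlBst();
        tree.insert(50, "http://www.charlie.com"); // made-up sample data
        tree.insert(85, "http://www.alpha.com");
        tree.insert(35, "http://www.delta.com");
        tree.print();
        tree.delete(50);
        tree.print();
    }
}
```

To search by PageRank rather than by score, one option under the same assumptions is to take `sortedByScore().get(rank - 1)`, since rank here is just the position in the highest-first listing.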
