Question

Posted on Sep 26, 2024

Task 2: Secondary Index & Aggregation Query Processing

Objective: Experimentation with a Secondary Index over a non-ordering, non-key attribute and Aggregate Query Execution Planning.

Assume the relation CITIZEN( , Tax-Code, Salary, Age) storing information about UK citizens' tax codes and annual salaries. There are r = 60,000,000 records. Each attribute has the same size: 128 bytes. The relation is stored in a file sorted by the salary attribute. The block size is B = 1024 bytes and any pointer in the system has a size of 128 bytes. The salary attribute is assumed to be uniformly distributed across tuples. There are 6,000 distinct salary values and 10,000 different tax-code values. The tax-code and age are assumed to be statistically independent of salary. A data scientist, who has built a Secondary Index over the Tax Code (non-ordering, non-key attribute), is interested in the specific Tax Code: '1234L'.

The data scientist analysed the distribution of the Tax Code attribute and noted the following:

Let x be the number of citizens with Tax Code 1234L per data block. That is, given a random block, there are x citizens with Tax Code 1234L.

P(x ≥ 1) = 0.5, i.e., the probability that at least one citizen within a block has Tax Code 1234L is 50%.

Therefore, when we pick a block at random, the probability of finding therein at least one citizen with Tax Code 1234L is 0.5.

If there are b data blocks in the file and we are asked to retrieve those citizens with Tax Code 1234L, then ideally we expect to access b/2 blocks (ideal case). However, in reality, we do not know where these blocks are! If we use the naïve solution (scan the whole file) to retrieve all those citizens, then we need to access b blocks.
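The two reference costs above can be put into numbers once b is known. The following is a minimal sketch, assuming each record has four 128-byte attributes (the schema as printed lists three names after a leading comma, so the fourth is an assumption) and that records do not span blocks. The variable names are illustrative only.

```python
import math

# Figures quoted in the problem statement.
NUM_RECORDS = 60_000_000      # r
ATTRIBUTE_BYTES = 128
BLOCK_BYTES = 1024            # B

# Assumption: four attributes per record. The first attribute name is missing
# from the schema as printed; only the leading comma hints at it.
NUM_ATTRIBUTES = 4

record_bytes = NUM_ATTRIBUTES * ATTRIBUTE_BYTES          # 512 bytes
blocking_factor = BLOCK_BYTES // record_bytes            # records per block
num_blocks = math.ceil(NUM_RECORDS / blocking_factor)    # b

full_scan_cost = num_blocks     # naive solution: read every data block
ideal_cost = num_blocks // 2    # only the blocks that hold a match, P(x >= 1) = 0.5

print(blocking_factor, num_blocks, ideal_cost, full_scan_cost)
# prints: 2 30000000 15000000 30000000
```

Under these assumptions b = 30,000,000, so the ideal case reads 15,000,000 blocks and the full scan reads 30,000,000.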

2.1 The data scientist claims that the expected cost using the Secondary Index should be between b/2 and b. What is the expected cost of retrieving the citizens with Tax Code 1234L using the Secondary Index? How much bigger is this cost compared to the ideal case?
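One possible way to approach 2.1 is sketched below. It is not the official solution, and it rests on several assumptions that the problem text does not state: the index has one entry per distinct tax code pointing to a chain of pointer blocks (a level of indirection), the first level is searched by binary search, pointer blocks are completely filled with record pointers, and every matching data block is read exactly once. The problem gives only P(x ≥ 1), so the average number of matches inside a matching block is left as a parameter. The function name `secondary_index_cost` and its parameters are hypothetical.

```python
import math

def secondary_index_cost(num_blocks, avg_matches_per_hit_block, *,
                         hit_probability=0.5, distinct_keys=10_000,
                         key_bytes=128, pointer_bytes=128, block_bytes=1024):
    """Estimated block accesses for one tax-code lookup (illustrative sketch)."""
    # Level 1: one (key, pointer) entry per distinct tax code, binary search.
    entries_per_block = block_bytes // (key_bytes + pointer_bytes)
    index_blocks = math.ceil(distinct_keys / entries_per_block)
    search_cost = math.ceil(math.log2(index_blocks))

    # Level of indirection: blocks holding pointers to the matching records.
    hit_blocks = round(num_blocks * hit_probability)
    matching_records = round(hit_blocks * avg_matches_per_hit_block)
    pointers_per_block = block_bytes // pointer_bytes
    pointer_blocks = math.ceil(matching_records / pointers_per_block)

    # Data level: assumption that each matching block is fetched only once.
    return search_cost + pointer_blocks + hit_blocks

b = 30_000_000  # assumes 2 records per block, i.e. four 128-byte attributes
for avg in (1.0, 2.0):
    cost = secondary_index_cost(b, avg)
    print(avg, cost, round(cost / (b // 2), 3))
```

With one match per matching block the estimate is 16,875,012 accesses (about 1.125 times the ideal b/2); with two matches per matching block it is 18,750,012 (about 1.25 times). Both values fall between b/2 and b, but they depend on the assumptions listed above.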

2.2 The data scientist is asked for a query processing plan for the aggregation query:

    SELECT Salary, AVG(Age)
    FROM CITIZEN
    WHERE TaxCode = '1234L'
    GROUP BY Salary

If the data management system devotes 100,000 blocks of RAM (memory), i.e., approx. 103 (each one of 1024 bytes), for executing the query, and 5000 blocks for storing the results of the query, help the scientist by providing a query execution plan. Describe your plan (i.e., steps, methods, ideas) and report the corresponding expected number of block accesses of your proposed solution. How much memory would you need to store the result of the aggregate query?
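For the memory part of 2.2, a rough upper bound follows from the 6,000 distinct salary values: the result can have at most one row per salary. The sketch below assumes each result row stores Salary and AVG(Age) in 128 bytes each (the size of AVG(Age) is an assumption). It also shows a streaming group-by that keeps only one running sum and count in memory, which works if the qualifying records arrive in salary order, for example when record pointers are followed in physical file order. That ordering is likewise an assumption, and the helper names are hypothetical.

```python
import math
from itertools import groupby
from operator import itemgetter

def result_blocks(distinct_salaries=6_000, attribute_bytes=128, block_bytes=1024):
    """Upper bound on blocks needed for the (Salary, AVG(Age)) result."""
    row_bytes = 2 * attribute_bytes              # Salary + AVG(Age), assumed equal size
    rows_per_block = block_bytes // row_bytes
    return math.ceil(distinct_salaries / rows_per_block)

def avg_age_by_salary(records):
    """Yield (salary, average age) from (salary, age) pairs ordered by salary.

    The pairs are assumed to be already filtered on TaxCode = '1234L'.
    """
    for salary, group in groupby(records, key=itemgetter(0)):
        total = count = 0
        for _, age in group:
            total += age
            count += 1
        yield salary, total / count

print(result_blocks())                                             # prints: 1500
print(list(avg_age_by_salary([(20000, 30), (20000, 40), (35000, 50)])))
# prints: [(20000, 35.0), (35000, 50.0)]
```

Under these assumptions the result needs at most 1,500 blocks (1,536,000 bytes), which fits within the 5,000 blocks reserved for the output.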
