Answered step by step
Verified Expert Solution
Link Copied!

Question

1 Approved Answer

Task 2 : Secondary Index & Aggregation Query Processing Objective: Experimentation with Secondary Index over non - ordering non - key attribute and Aggregate Query

Task 2: Secondary Index & Aggregation Query Processing
Objective: Experimentation with Secondary Index over non-ordering non-key attribute and
Aggregate Query Execution Planning.
Assume the relation CITIZEN(ID, Tax-Code, Salary, Age) storing information about UK citizens' tax codes and
annual salaries. There are r=60,000,000 records. Each attribute has the same size: 128 bytes. The relation is
stored in a file sorted by the salary attribute. The block size B =1024 bytes and any pointer in the system has
size =128 bytes. The salary attribute is assumed to be uniformly distributed across tuples. There are 6,000
distinct salary values and there are 10,000 different tax-code values. The tax-code and age are assumed to be
statistically independent of salary. A data scientist, who has built a Secondary Index over the Tax Code (non-
ordering, non-key attribute), is interested in the specific Tax Code: '1234L'. The data scientist did analyse the
distribution of the Tax Code attribute and noted the following:
Let x be the number of citizens with Tax Code 1234L per data block. That is, given a random block,
there are X citizens with Tax Code 1234L.
P(x1)=0.5, i.e., the probability that at least one citizen has tax code 1234L is 50% within a block.
Therefore, when we pick up a block at random, the probability of finding therein at least one citizen
with Tax Code 1234L is 0.5.
If there are b data blocks in the file and we are asked to retrieve those citizens with Tax Code 1234L, then
ideally, we expect to access b2 blocks (ideal case). However, in reality, we do not know where these blocks
are! If we use the nave solution (scan the whole file) to retrieve all those citizens, then we need to access b
blocks.
Q2.1 The data scientist claims that the expected cost using the Secondary Index should be between
b2 and b. Which is the expected cost of retrieving the citizens of Tax Code 1234L using the Secondary Index?
How much bigger is this cost compared to the ideal case?
Q2.2 The data scientist is asked for a query processing plan for the aggregation query:
SELECT Salary, AVG (Age)
FROM CITIZEN
WHERE TaxCode ='1234L'
GROUP BY Salary
If the data management system devotes 100,000 blocks of RAM (memory), i.e., approx. 103MB (each one of
1024 bytes) for executing the query, and 5000 blocks for storing the results of the query, help the scientist by
providing a query execution plan. Describe your plan (e.g., steps, methods, ideas) and report on the
corresponding expected number of block accesses of your proposed solution. How much memory would you
need to store the result of the aggregate query?
image text in transcribed

Step by Step Solution

There are 3 Steps involved in it

Step: 1

blur-text-image

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

blur-text-image

Step: 3

blur-text-image

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Recommended Textbook for

Spatial Databases A Tour

Authors: Shashi Shekhar, Sanjay Chawla

1st Edition

0130174807, 978-0130174802

More Books

Students also viewed these Databases questions