Question
The Apriori algorithm for discovering frequent itemsets can be expensive, so people are always looking for ways to further improve the approach's efficiency.
One expensive step, for example, is validating candidate itemsets by counting their supports to find the ones meeting or exceeding the support threshold. A common implementation is to use a hash tree. If this step is done naively, the implementation will be too slow to be usable.
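To make that cost concrete, here is a minimal sketch of the naive alternative to a hash tree: counting supports by enumerating every k-subset of every transaction and probing a dictionary of candidates. The function name and the representation of itemsets as tuples in a fixed item order are choices made here for illustration, not part of the assignment.

    from itertools import combinations

    def count_supports(candidates, transactions, k):
        # Naive candidate validation: enumerate every k-subset of each
        # transaction and probe a dictionary keyed by the candidate itemsets.
        # Candidates and transactions must use the same item ordering.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            if len(t) < k:
                continue
            for subset in combinations(t, k):
                if subset in counts:
                    counts[subset] += 1
        return counts

A hash tree speeds this step up by organizing the candidates so that each transaction only needs to be checked against a small portion of them.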
Let us consider an Apriori program that takes as input the frequent itemsets of length k (with respect to some support-count threshold), the transaction itemsets, and a support-count threshold, and produces as output the frequent itemsets of length k+1. We assume the support-count threshold for the input frequent itemsets will be the same as, or lower than, the requested threshold for the output. Let us name such a program levelUp.
How much time levelUp takes for a given task depends on how many precandidates are produced by the join step. A precandidate advances to being a candidate itemset if it passes the apriori test. At a given support-count threshold, the same number of candidate itemsets will result regardless, but perhaps we can reduce the number of precandidates produced.
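For reference, a hedged sketch of what the join step and the apriori test could look like, assuming itemsets are kept as tuples under one consistent item ordering (the names below are not prescribed by the assignment):

    def join_and_prune(freq_k, k):
        # Join step: two frequent k-itemsets that agree on their first k-1
        # items yield a (k+1)-precandidate.  Itemsets are tuples whose items
        # follow one consistent ordering (lexicographic in the standard case).
        freq_set = set(freq_k)
        ordered = sorted(freq_k)
        precandidates = []
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                if a[:k - 1] != b[:k - 1]:
                    break              # later itemsets cannot share this prefix
                precandidates.append(a + (b[-1],))
        # Apriori test: a precandidate becomes a candidate only if every one
        # of its k-subsets is itself frequent.
        candidates = [c for c in precandidates
                      if all(c[:j] + c[j + 1:] in freq_set for j in range(k + 1))]
        return precandidates, candidates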
Consider that we order the items in each itemset from least to most frequent, rather than just ordering them in some arbitrary way, say lexicographically, as in the standard algorithm. That is, we order by how frequently each item appears in the transaction database, i.e., by the frequencies of the single-item itemsets. Why could this help? The prefixes of the frequent itemsets of length k will be less common, meaning fewer precandidates should result from the join step. If this effect is significant in practice, it could make the algorithm perform better.
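The assignment leaves the implementation of this ordering open. One simple approach, assumed here for illustration, is to relabel each item by its rank when items are sorted from least to most frequent; ordinary tuple sorting then coincides with the frequency ordering, so the join logic above needs no changes.

    def frequency_ranks(transactions):
        # Rank items from least to most frequent in the transaction database,
        # breaking ties by item id so the ordering is total and reproducible.
        counts = {}
        for t in transactions:
            for item in t:
                counts[item] = counts.get(item, 0) + 1
        ordered = sorted(counts, key=lambda item: (counts[item], item))
        return {item: rank for rank, item in enumerate(ordered)}

    def reorder(itemsets, rank):
        # Relabel items by frequency rank and sort within each itemset; the
        # standard lexicographic join then follows the frequency ordering.
        return [tuple(sorted(rank[i] for i in s)) for s in itemsets]

An inverse map from rank back to item id restores the original labels when printing results.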
Write a program in Python or Java, called levelUp.py or LevelUp.java respectively, to test this. Your program should take three arguments, and a fourth optional argument:
a file with the frequent itemsets of length k at the given support threshold (without the support counts reported),
a file with the transaction itemsets, and
the support-count threshold.
E.g.,
python levelUp.py mushroomlevsupdat mushroomtrans.dat
or
java LevelUp mushroomlevsupdat mushroomtrans.dat
The frequent itemsets are to be read in from a file, e.g., mushroomlevsupdat.
Each frequent itemset is on a separate line and is space separated.
Each item is represented by an integer value.
For each itemset, the items are ordered in the same way.
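Given this format, both the frequent-itemset file and the transaction file can be read with the same small helper; a sketch, with a function name chosen here for illustration:

    def read_itemsets(path):
        # One itemset per line; items are space-separated integers.
        with open(path) as f:
            return [tuple(int(tok) for tok in line.split())
                    for line in f if line.strip()]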
Called with the three required arguments, the program should run the usual Apriori algorithm and write to standard output the frequent itemsets of length k+1 as described above, with the support counts.
If called with the optional fourth argument, e.g.,
python levelUp.py mushroomlevsupdat mushroomtrans.dat mushroomlevsupwCount.dat
it should do the same, but applying the precandidate optimization of ordering the items in each itemset by frequency. The last argument, e.g., mushroomlevsupwCount.dat, is a file of the frequent itemsets with their support counts. The wCount variant of a frequent-itemset file has the same format as above, except:
The first integer per line is the support count.
Instrument your program to track running time, the number of precandidates found, and the number of candidates found.
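As a starting point, a minimal, unverified sketch of such a levelUp driver, assuming the helpers sketched above (read_itemsets, frequency_ranks, reorder, join_and_prune, count_supports), might look like the following. For simplicity it treats the presence of a fourth argument only as a switch and recomputes item frequencies from the transactions, whereas the assignment supplies them via the wCount file.

    import sys
    import time

    def level_up(freq_path, trans_path, threshold, use_frequency_order=False):
        freq_k = read_itemsets(freq_path)
        transactions = read_itemsets(trans_path)
        unrank = None
        if use_frequency_order:
            rank = frequency_ranks(transactions)
            unrank = {r: item for item, r in rank.items()}
            freq_k = reorder(freq_k, rank)
            transactions = reorder(transactions, rank)
        else:
            freq_k = [tuple(sorted(s)) for s in freq_k]
            transactions = [tuple(sorted(t)) for t in transactions]
        k = len(freq_k[0])
        start = time.time()
        precandidates, candidates = join_and_prune(freq_k, k)
        supports = count_supports(candidates, transactions, k + 1)
        for itemset, count in supports.items():
            if count >= threshold:
                items = [unrank[i] for i in itemset] if unrank else list(itemset)
                print(count, *sorted(items))   # first integer is the support count
        elapsed = time.time() - start
        print(f"time={elapsed:.2f}s  precandidates={len(precandidates)}  "
              f"candidates={len(candidates)}", file=sys.stderr)

    if __name__ == "__main__":
        level_up(sys.argv[1], sys.argv[2], int(sys.argv[3]),
                 use_frequency_order=len(sys.argv) > 4)

Timing and the precandidate/candidate counts go to standard error so that standard output stays a clean frequent-itemset listing.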