Note that the actual question asks to "Count the number of tokens in a stream", NOT the number of distinct tokens.
- To solve this question, knowledge of Hashing (Universal and 2-Universal) and Streaming Algorithms is mandatory.
- Below, background information is given VERY BRIEFLY, only for revision or to tell the tutor what sort of answers are expected.
- If you are not familiar with these concepts, GOOGLE/LEARN FIRST!
- If you still have doubts, leave a comment and I will reply within a few minutes.
Very brief background information regarding streaming algorithms (the actual question is after this):

In streaming algorithms, we think about space. The input consists of m elements (often called tokens). The algorithm sees the items one by one (in a stream) and has limited memory of size B, the total number of bits (B much smaller than m).

Reservoir sampling (keeping k elements of the stream in an array res):
1. Store the first k elements in res.
2. For the j-th element $a_j$ with j > k, pick a number uniformly at random in the range [1, j]; let's call it p.
3. If p <= k, replace res[p] (the p-th element of the reservoir array) with the element $a_j$, else discard $a_j$.
4. The claim is that the probability P of any element seen so far being in the reservoir is k/j (can be proven by induction).

Counting the number of distinct elements in a stream
1. Stream: 1, 1, 2, 3, 1, 2, 2, 5, 5, 2, 7. Distinct (n) = 5, Total (m) = 11. Space = $O(n)$. We need to do better. But if we want an exact deterministic solution, Space = $\Omega(n)$.
2. Modified target: use less space ($O(\log n)$), estimate n (allowing errors), and estimate well with high probability.

Idea I: use an idealized hash function $h : \{1, 2, 3, \ldots, n\} \to [0,1]$.
Big assumption: h maps any number in $\{1, 2, 3, \ldots, n\}$ uniformly and independently at random into the interval [0,1].
Lemma: Suppose $X_1, X_2, \ldots, X_n$ are independent random variables uniformly distributed in the interval [0,1], and suppose $Y = \min_i X_i$. Then $E[Y] = \frac{1}{n+1}$.
Proof: We find the probability density function of Y. Let dt be a very small interval in [0,1]. We find $P(Y \in [t, t+dt])$, meaning that the minimum falls in that tiny interval. For this to happen, one element/token must fall in the dt interval and the rest of the elements must fall in [t, 1], which has length 1 - t.
$P[(\text{one number in } [t, t+dt]) \cap (\text{all others falling in } [t, 1])] \approx n \cdot dt \cdot (1-t)^{n-1}$
$E[Y] = \int_0^1 t \cdot n (1-t)^{n-1} \, dt = \frac{1}{n+1}$

Algorithm using Idea I (basic estimator):
- Let Y = 1 (Y takes values in [0,1]; h is an idealized hash func).
- While (stream is non-empty): let i be the next element/token and set $Y = \min(Y, h(i))$.
- Since $E[Y] = \frac{1}{n+1}$, the value $\frac{1}{Y} - 1$ serves as the estimate of n.
The probability of a large error is then bounded using Chebyshev's inequality.

Idea II (using an aggregate estimator): Run k independent copies of the basic estimator: $Y_1, Y_2, \ldots, Y_k$. Let $Z = \frac{1}{k} \sum_{i=1}^{k} Y_i$ and return $\frac{1}{Z} - 1$. Then $E[Z] = \frac{1}{n+1}$ and $Var[Z] \le \frac{1}{k(n+1)^2}$. Applying Chebyshev's inequality again bounds the probability of error.

The actual question: Counting the number of tokens in a stream

It is trivial to see that if there are m tokens in the stream, then $\lceil \log_2 m \rceil$ many bits suffice to keep track of the number of tokens. Now consider the following randomized algorithm.

Probabilistic Counting:
- Initialize X = 0.
- While the stream is non-empty: with probability $2^{-X}$, increment X.

In the following, let Y be the final value returned by the algorithm.
b) Show that $Var[Y] = m(m-1)/2$.
c) It can be shown that the value of X grows only till $\log \log m$ with high probability.

For this part, we consider an alternate (and somewhat more elegant) way of modifying the basic estimator to achieve better estimates. Suppose you modify the given algorithm as follows: you increment X with probability $(1+a)^{-X}$, for some a > 0 (a = 1 in the above algorithm). What should the algorithm return now? Determine the value of a that you need to choose in order to find an estimate Y such that $|Y - m| \le \epsilon m$ with probability at least 9/10.

Disclaimer: The solution to the above problem can be found on the internet with a little effort. But I need an answer with a good and legitimate explanation.
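To make the probabilistic counting algorithm from the actual question easier to experiment with, here is a minimal Python simulation sketch. It is not part of the original problem statement: the function name `probabilistic_count` and the parameters `a` and `rng` are hypothetical. The returned estimate is also an assumption, since the algorithm's return step is cut off in the text. The sketch uses $((1+a)^X - 1)/a$, which equals $2^X - 1$ for a = 1. That choice looks reasonable because each token raises the expected value of $(1+a)^X$ by exactly a.

```python
import random


def probabilistic_count(stream, a=1.0, rng=None):
    """Simulate the probabilistic counter and return (X, estimate).

    a = 1 corresponds to incrementing X with probability 2^(-X).
    The estimate ((1 + a)^X - 1) / a is an assumed return value.
    """
    rng = rng or random.Random()
    x = 0
    for _ in stream:
        # Increment X with probability (1 + a)^(-X).
        if rng.random() < (1.0 + a) ** (-x):
            x += 1
    # Assumed estimator of the number of tokens m.
    estimate = ((1.0 + a) ** x - 1.0) / a
    return x, estimate


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs behave the same
    m = 10_000
    estimates = [probabilistic_count(range(m), a=1.0, rng=rng)[1]
                 for _ in range(200)]
    print("true m:", m)
    print("mean estimate over 200 runs:", sum(estimates) / len(estimates))
```

Running the counter many times and averaging should give a value close to m, while a single run with a = 1 can be far off. Trying smaller values of `a` shows the trade-off the question hints at: X grows larger (more bits), but the spread of the estimate shrinks.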