
Question

Large data sets
2 points possible (graded)

When a data set is large, we would ideally want to split up our loss function into a sum of individual loss functions for each data point: $f(w) = \sum_i f_i(w)$, where $f_i(w)$ is the loss function for the individual data point. For example, in a least squares regression, we choose $f_i$ to be the square of the residual between the measured data point $y_i$ and the prediction $\hat{y}_i$: $f_i(w) = (y_i - \hat{y}_i)^2$, so that the loss function $f$ is a sum of these squared residuals. The gradient of the loss function is then the sum of the gradients of each individual data-point loss function: $(\nabla f)(w) = \sum_i (\nabla f_i)(w)$. This requires computing the gradient for each and every data point before we can find the total gradient of the loss function.

Suppose we are working on a data set with 80 columns. Each entry is stored as an 8-byte double precision floating point number. In addition, we have ten billion data points in the data set, meaning that there are ten billion rows.

How much memory does the entire data set require, in gigabytes? (Use the SI convention that 1 GB = 1000 MB.)

Could the entire data set be loaded into RAM on an average laptop or desktop computer?
Yes.
No.

Instead of computing the full gradient, notice that if the data point $i$ is chosen uniformly at random, then $\frac{1}{N} f(w) = \frac{1}{N} \sum_i f_i(w) = \mathbb{E}[f_i(w)]$. We can therefore estimate the update step using the gradient for only one data point: $w_{t+1} = w_t - \alpha_t (\nabla f_i)(w_t)$. The data point to use should be chosen randomly at each iteration step. As long as we choose $f_i$ uniformly from the data set, this estimator is unbiased:
$\mathbb{E}[w_{t+1}] = w_t - \alpha_t \, \mathbb{E}[(\nabla f_i)(w_t)] = w_t - \frac{\alpha_t}{N} (\nabla f)(w_t)$,
which is the full-gradient update with the step size rescaled by $1/N$. This is known as stochastic gradient descent. In order to ensure that we use all available data, we should choose the maximum number of iterations to be several times larger than the size of the data set.

There are also variations on this scheme where the data point used at each step is selected sequentially from an array of data points. In that case, the update rule would be $w_{t+1} = w_t - \alpha_t (\nabla f_{t \bmod N})(w_t)$. Sequential stochastic gradient descent can be useful if selecting random data points on the fly is computationally expensive. This can occur for some storage media, such as hard drives, which have slow random access times. By reading the data sequentially, we improve performance by ordering our read requests to the storage medium.

In the sequential variation of stochastic gradient descent, how should we initialize this array of data points to minimize the number of iterations required to converge (i.e. increase the algorithmic performance of SGD itself)?
In the order that the data points were observed experimentally.
In random order.
The initialization strategy for the array of data points does not matter.
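To make the two update rules above concrete, here is a minimal Python sketch of stochastic gradient descent on the least-squares loss $f_i(w) = (y_i - x_i \cdot w)^2$, with a flag to switch between random selection and the sequential ($t \bmod N$) variant. The function names (`sgd`, `grad_ls`) and the small synthetic data set are illustrative assumptions, not part of the original problem; the full 80-column, ten-billion-row data set would be streamed from storage rather than generated in memory.

```python
import numpy as np

def sgd(X, y, grad_fi, w0, alpha=0.01, n_iter=20_000, sequential=False, rng=None):
    """Minimal SGD sketch: one data point per update step.

    grad_fi(w, x_i, y_i) returns the gradient of the single-point loss f_i at w.
    With sequential=True the points are visited in array order (index t mod N),
    as in the sequential variant described above.
    """
    rng = rng or np.random.default_rng(0)
    w = w0.astype(float).copy()
    N = len(y)
    for t in range(n_iter):
        i = (t % N) if sequential else rng.integers(N)  # choose a data point
        w -= alpha * grad_fi(w, X[i], y[i])             # single-point update
    return w

# Least squares: f_i(w) = (y_i - x_i . w)^2, so (grad f_i)(w) = -2 (y_i - x_i . w) x_i
def grad_ls(w, xi, yi):
    return -2.0 * (yi - xi @ w) * xi

rng = np.random.default_rng(1)
X = rng.normal(size=(1_000, 3))                 # small synthetic data set
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=1_000)
print(sgd(X, y, grad_ls, w0=np.zeros(3)))       # should land close to w_true
```

The sequential variant (`sequential=True`) touches every point exactly once per pass over the array, while the random variant may revisit some points and skip others within the same number of iterations; the ordering of the array itself is left to the caller.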
SGD variance
1 point possible (graded)

However, the variance of this estimator can be quite large. This means that while the estimator equals the whole-data-set update rule on average, any individual $w_{t+1}$ can be quite scattered. We can combat this variance in two ways.

First, notice that $\mathrm{Var}(w_{t+1}) = \alpha_t^2 \, \mathrm{Var}((\nabla f_i)(w_t))$. We cannot control the variance of the gradients; that depends on our data set. What we can control is the step size, and by reducing the step size we also reduce the variance of the estimator.

Second, we can modify the estimator so that it computes an average over a small sample $S$ of size $|S| = k$: $\frac{1}{|S|} \sum_{i \in S} (\nabla f_i)(w_t)$, where the sample is drawn uniformly from the population without replacement. Now the variance will be $\mathrm{Var}(w_{t+1}) \approx \frac{\alpha_t^2}{k} \, \mathrm{Var}((\nabla f_i)(w_t))$ when $k$ is much smaller than the population size. So we can see that by increasing the size of the sample, we decrease the variance of the estimator as well. This is known as mini-batch stochastic gradient descent.

What happens to the variance if we choose the mini-batch to be the same size as the data set ($k = N$)?
$\mathrm{Var}(w_{t+1}) = 0$
$\mathrm{Var}(w_{t+1}) = \alpha_t^2 \, \mathrm{Var}((\nabla f_i)(w_t))$
$\mathrm{Var}(w_{t+1}) = \frac{\alpha_t^2}{k} \, \mathrm{Var}((\nabla f_i)(w_t))$
We can't know this in general; it depends on the data set.

Both of these techniques come at the cost of computational power. If we reduce the step size, then we need to take more iterations to converge, and ...
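The variance claims above can be checked numerically. The sketch below (again with illustrative names and synthetic least-squares data, not the problem's actual data set) compares the spread of the single-point gradient estimator with that of a mini-batch average of size $k$ drawn without replacement; the mini-batch variance should come out roughly $1/k$ times smaller.

```python
import numpy as np

# Compare the spread of single-point vs. mini-batch gradient estimates
# around the full (averaged) gradient on a synthetic least-squares problem.
rng = np.random.default_rng(0)
N, k = 10_000, 100                     # population size, mini-batch size
X = rng.normal(size=(N, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(size=N)
w = np.zeros(3)                        # fixed point at which gradients are evaluated

def grad_f(idx):
    """Per-point gradients of f_i(w) = (y_i - x_i . w)^2 for the rows in idx."""
    xi, yi = X[idx], y[idx]
    return -2.0 * ((yi - xi @ w)[:, None] * xi)

full_grad = grad_f(np.arange(N)).mean(axis=0)   # (1/N) sum_i (grad f_i)(w)

single = np.array([grad_f(rng.integers(N, size=1)).mean(axis=0)
                   for _ in range(2000)])
batch = np.array([grad_f(rng.choice(N, size=k, replace=False)).mean(axis=0)
                  for _ in range(2000)])

print("full gradient (per point):      ", full_grad)
print("single-point estimator variance:", single.var(axis=0).sum())
print("mini-batch estimator variance:  ", batch.var(axis=0).sum())
# Expect the mini-batch variance to be roughly 1/k of the single-point variance.
```

Increasing $k$ toward $N$ in this experiment shows directly how the spread of the estimator behaves as the mini-batch approaches the full data set.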

