Question
Please provide a summary of the methodology and your understanding of this paper ("Rapid Object Detection using a Boosted Cascade of Simple Features"). Include necessary figures as well.
Rapid Object Detection using a Boosted Cascade of Simple Features

... a single feature [2]. As a result each stage of the boosting process, which selects a new weak classifier, can be viewed as a feature selection process. AdaBoost provides an effective learning algorithm and strong bounds on generalization performance [13, 9, 10].

The third major contribution of this paper is a method for combining successively more complex classifiers in a cascade structure which dramatically increases the speed of the detector by focusing attention on promising regions of the image. The notion behind focus-of-attention approaches is that it is often possible to rapidly determine where in an image an object might occur [17, 8, 1]. More complex processing is reserved only for these promising regions. The key measure of such an approach is the "false negative" rate of the attentional process: it must be the case that all, or almost all, object instances are selected by the attentional filter.

We will describe a process for training an extremely simple and efficient classifier which can be used as a "supervised" focus-of-attention operator. The term supervised refers to the fact that the attentional operator is trained to detect examples of a particular class. In the domain of face detection it is possible to achieve fewer than 1% false negatives and 40% false positives using a classifier constructed from two Haar-like features. The effect of this filter is to reduce by over one half the number of locations where the final detector must be evaluated.

Those sub-windows which are not rejected by the initial classifier are processed by a sequence of classifiers, each slightly more complex than the last. If any classifier rejects the sub-window, no further processing is performed. The structure of the cascaded detection process is essentially that of a degenerate decision tree, and as such is related to the work of Geman and colleagues [1, 4].

An extremely fast face detector will have broad practical applications. These include user interfaces, image databases, and teleconferencing.
In applications where rapid frame-rates are not necessary, our system will allow for significant additional post-processing and analysis. In addition our system can be implemented on a wide range of small low-power devices, including hand-helds and embedded processors. In our lab we have implemented this face detector on the Compaq iPaq handheld and have achieved detection at two frames per second (this device has a low-power 200 MIPS StrongARM processor which lacks floating-point hardware).

The remainder of the paper describes our contributions and a number of experimental results, including a detailed description of our experimental methodology. Discussion of closely related work takes place at the end of each section.

2. Features

Our object detection procedure classifies images based on the value of simple features. There are many motivations for using features rather than the pixels directly. The most common reason is that features can act to encode ad-hoc domain knowledge that is difficult to learn using a finite quantity of training data. For this system there is also a second critical motivation: the feature-based system operates much faster than a pixel-based system.

The simple features used are reminiscent of Haar basis functions which have been used by Papageorgiou et al. [10]. More specifically, we use three kinds of features. The value of a two-rectangle feature is the difference between the sums of the pixels within two rectangular regions. The regions have the same size and shape and are horizontally or vertically adjacent (see Figure 1). A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a center rectangle. Finally, a four-rectangle feature computes the difference between diagonal pairs of rectangles.

Figure 1: Example rectangle features shown relative to the enclosing detection window. The sum of the pixels which lie within the white rectangles is subtracted from the sum of pixels in the grey rectangles. Two-rectangle features are shown in (A) and (B). Figure (C) shows a three-rectangle feature, and (D) a four-rectangle feature.

Given that the base resolution of the detector is 24x24, the exhaustive set of rectangle features is quite large: over 180,000. Note that unlike the Haar basis, the set of rectangle features is overcomplete. (A complete basis has no linear dependence between basis elements and has the same number of elements as the image space, in this case 576. The full set of 180,000 features is many times over-complete.)

2.1. Integral Image

Rectangle features can be computed very rapidly using an intermediate representation for the image which we call the integral image. (There is a close relation to "summed area tables" as used in graphics [3]; we choose a different name here to emphasize its use for the analysis of images, rather than for texture mapping.) The integral image at location (x, y) contains the sum of the pixels above and to the left of (x, y), inclusive:

    ii(x, y) = Σ_{x' ≤ x, y' ≤ y} i(x', y')

where ii(x, y) is the integral image and i(x, y) is the original image. Using the following pair of recurrences:

    s(x, y)  = s(x, y - 1) + i(x, y)     (1)
    ii(x, y) = ii(x - 1, y) + s(x, y)    (2)

(where s(x, y) is the cumulative row sum, s(x, -1) = 0, and ii(-1, y) = 0), the integral image can be computed in one pass over the original image.
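To make the recurrences concrete, here is a minimal sketch of integral-image construction in NumPy. The function name and the use of library cumulative sums are our own illustration, not the authors' implementation; two cumulative sums are equivalent to applying recurrences (1) and (2) in one pass:

```python
import numpy as np

def integral_image(img):
    """Compute ii(x, y): the sum of img over all x' <= x, y' <= y.

    Cumulative sums along each axis reproduce recurrences (1) and (2)
    in a single pass over the image.
    """
    img = np.asarray(img, dtype=np.float64)
    return img.cumsum(axis=0).cumsum(axis=1)
```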
Using the integral image, any rectangular sum can be computed in four array references (see Figure 2). Clearly the difference between two rectangular sums can be computed in eight references. Since the two-rectangle features defined above involve adjacent rectangular sums, they can be computed in six array references, eight in the case of the three-rectangle features, and nine for four-rectangle features.

Figure 2: The sum of the pixels within rectangle D can be computed with four array references. The value of the integral image at location 1 is the sum of the pixels in rectangle A. The value at location 2 is A + B, at location 3 is A + C, and at location 4 is A + B + C + D. The sum within D can be computed as 4 + 1 - (2 + 3).
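Following Figure 2, the sum inside any upright rectangle reduces to four lookups in the integral image, and a two-rectangle feature to six. A hedged sketch follows; the zero-padding convention, which absorbs the s(x, -1) = ii(-1, y) = 0 boundary cases, is our own choice:

```python
import numpy as np

def rect_sum(ii, top, left, height, width):
    """Pixel sum over a rectangle using four array references,
    as in Figure 2: sum(D) = 4 + 1 - (2 + 3).

    `ii` is assumed zero-padded with one extra leading row and
    column, so ii[r, c] is the sum of all pixels above-left of (r, c).
    """
    r0, c0, r1, c1 = top, left, top + height, left + width
    return ii[r1, c1] + ii[r0, c0] - ii[r0, c1] - ii[r1, c0]

def two_rect_feature(ii, top, left, height, width):
    """Horizontally adjacent two-rectangle feature: the difference of
    two equal rectangular sums. The shared edge corners are reused,
    so only six distinct array references are needed."""
    return (rect_sum(ii, top, left, height, width)
            - rect_sum(ii, top, left + width, height, width))

# Usage with the earlier sketch:
#   ii = np.pad(integral_image(img), ((1, 0), (1, 0)))
```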
2.2. Feature Discussion

Rectangle features are somewhat primitive when compared with alternatives such as steerable filters [5, 7]. Steerable filters, and their relatives, are excellent for the detailed analysis of boundaries, image compression, and texture analysis. In contrast rectangle features, while sensitive to the presence of edges, bars, and other simple image structure, are quite coarse. Unlike steerable filters, the only orientations available are vertical, horizontal, and diagonal. The set of rectangle features does however provide a rich image representation which supports effective learning. In conjunction with the integral image, the efficiency of the rectangle feature set provides ample compensation for its limited flexibility.

3. Learning Classification Functions

Given a feature set and a training set of positive and negative images, any number of machine learning approaches could be used to learn a classification function. In our system a variant of AdaBoost is used both to select a small set of features and to train the classifier [6]. In its original form, the AdaBoost learning algorithm is used to boost the classification performance of a simple (sometimes called weak) learning algorithm. There are a number of formal guarantees provided by the AdaBoost learning procedure. Freund and Schapire proved that the training error of the strong classifier approaches zero exponentially in the number of rounds. More importantly, a number of results were later proved about generalization performance [14]. The key insight is that generalization performance is related to the margin of the examples, and that AdaBoost achieves large margins rapidly.

Recall that there are over 180,000 rectangle features associated with each image sub-window, a number far larger than the number of pixels. Even though each feature can be computed very efficiently, computing the complete set is prohibitively expensive. Our hypothesis, which is borne out by experiment, is that a very small number of these features can be combined to form an effective classifier. The main challenge is to find these features.

In support of this goal, the weak learning algorithm is designed to select the single rectangle feature which best separates the positive and negative examples (this is similar to the approach of [2] in the domain of image database retrieval). For each feature, the weak learner determines the optimal threshold classification function, such that the minimum number of examples are misclassified. A weak classifier h_j(x) thus consists of a feature f_j, a threshold θ_j, and a parity p_j indicating the direction of the inequality sign:

    h_j(x) = 1 if p_j f_j(x) < p_j θ_j, and 0 otherwise.

Here x is a 24x24 pixel sub-window of an image. See Table 1 for a summary of the boosting process.

Table 1: The AdaBoost algorithm for classifier learning. Each round of boosting selects one feature from the 180,000 potential features.

- Given example images (x_1, y_1), ..., (x_n, y_n) where y_i = 0, 1 for negative and positive examples respectively.
- Initialize weights w_{1,i} = 1/(2m) for y_i = 0 and w_{1,i} = 1/(2l) for y_i = 1, where m and l are the number of negatives and positives respectively.
- For t = 1, ..., T:
  1. Normalize the weights, w_{t,i} ← w_{t,i} / Σ_{j=1..n} w_{t,j}, so that w_t is a probability distribution.
  2. For each feature j, train a classifier h_j which is restricted to using a single feature. The error is evaluated with respect to w_t: ε_j = Σ_i w_i |h_j(x_i) - y_i|.
  3. Choose the classifier h_t with the lowest error ε_t.
  4. Update the weights: w_{t+1,i} = w_{t,i} β_t^{1-e_i}, where e_i = 0 if example x_i is classified correctly, e_i = 1 otherwise, and β_t = ε_t / (1 - ε_t).
- The final strong classifier is h(x) = 1 if Σ_{t=1..T} α_t h_t(x) ≥ (1/2) Σ_{t=1..T} α_t, and 0 otherwise, where α_t = log(1/β_t).

Figure 3: The first and second features selected by AdaBoost. The two features are shown in the top row and then overlayed on a typical training face in the bottom row. The first feature measures the difference in intensity between the region of the eyes and a region across the upper cheeks; it capitalizes on the observation that the eye region is often darker than the cheeks. The second feature compares the intensities in the eye regions to the intensity across the bridge of the nose.
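For concreteness, the following is one possible NumPy rendering of Table 1. It assumes feature values are precomputed into a matrix F of shape (n_features, n_examples); the brute-force threshold search is a deliberately simple stand-in for the authors' optimized weak learner, which sorts each feature's values once:

```python
import numpy as np

def train_adaboost(F, y, T):
    """Sketch of Table 1. F: (n_features, n_examples) matrix of
    precomputed feature values; y: labels in {0, 1}; T: boosting
    rounds. Returns a list of stumps (j, theta, p, alpha).
    """
    m, l = int(np.sum(y == 0)), int(np.sum(y == 1))
    w = np.where(y == 0, 1.0 / (2 * m), 1.0 / (2 * l))   # initial weights
    stumps = []
    for _ in range(T):
        w = w / w.sum()                                  # 1. normalize
        best_err, best = np.inf, None
        for j in range(F.shape[0]):                      # 2. per-feature stump
            for theta in np.unique(F[j]):
                for p in (1, -1):
                    h = (p * F[j] < p * theta).astype(int)
                    err = float(np.sum(w * np.abs(h - y)))
                    if err < best_err:
                        best_err, best = err, (j, theta, p)
        j, theta, p = best                               # 3. lowest error
        beta = best_err / (1.0 - best_err)
        h = (p * F[j] < p * theta).astype(int)
        e = (h != y).astype(int)        # e_i = 0 iff correctly classified
        w = w * beta ** (1 - e)         # 4. down-weight correct examples
        stumps.append((j, theta, p, np.log(1.0 / beta)))
    return stumps

def strong_classify(F_x, stumps):
    """Final strong classifier: alpha-weighted vote vs. half the mass."""
    total = sum(a * int(p * F_x[j] < p * theta) for j, theta, p, a in stumps)
    return int(total >= 0.5 * sum(a for _, _, _, a in stumps))
```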
3.1. Learning Discussion

Many general feature selection procedures have been proposed (see chapter 8 of [18] for a review). Our final application demanded a very aggressive approach which would discard the vast majority of features. For a similar recognition problem, Papageorgiou et al. proposed a scheme for feature selection based on feature variance [10]. They demonstrated good results selecting 37 features out of a total of 1734 features.

Roth et al. propose a feature selection process based on the Winnow exponential perceptron learning rule [11]. The Winnow learning process converges to a solution where many of the weights are zero. Nevertheless a very large number of features are retained (perhaps a few hundred or thousand).

3.2. Learning Results

While details on the training and performance of the final system are presented in Section 5, several simple results merit discussion. Initial experiments demonstrated that a frontal face classifier constructed from 200 features yields a detection rate of 95% with a false positive rate of 1 in 14084. These results are compelling, but not sufficient for many real-world tasks. In terms of computation, this classifier is probably faster than any other published system, requiring 0.7 seconds to scan a 384 by 288 pixel image. Unfortunately, the most straightforward technique for improving detection performance, adding features to the classifier, directly increases computation time.

For the task of face detection, the initial rectangle features selected by AdaBoost are meaningful and easily interpreted. The first feature selected seems to focus on the property that the region of the eyes is often darker than the region of the nose and cheeks (see Figure 3). This feature is relatively large in comparison with the detection sub-window, and should be somewhat insensitive to the size and location of the face. The second feature selected relies on the property that the eyes are darker than the bridge of the nose.

In practice no single feature can perform the classification task with low error. Features selected in early rounds of the boosting process have error rates between 0.1 and 0.3. Features selected in later rounds, as the task becomes more difficult, yield error rates between 0.4 and 0.5.

4. The Attentional Cascade

This section describes an algorithm for constructing a cascade of classifiers which achieves increased detection performance while radically reducing computation time. The key insight is that smaller, and therefore more efficient, boosted classifiers can be constructed which reject many of the negative sub-windows while detecting almost all positive instances (i.e. the threshold of a boosted classifier can be adjusted so that the false negative rate is close to zero). Simpler classifiers are used to reject the majority of sub-windows before more complex classifiers are called upon to achieve low false positive rates.

The overall form of the detection process is that of a degenerate decision tree, what we call a "cascade" (see Figure 4). A positive result from the first classifier triggers the evaluation of a second classifier which has also been adjusted to achieve very high detection rates. A positive result from the second classifier triggers a third classifier, and so on. A negative outcome at any point leads to the immediate rejection of the sub-window.

Figure 4: Schematic depiction of the detection cascade. A series of classifiers are applied to every sub-window. The initial classifier eliminates a large number of negative examples with very little processing. Subsequent layers eliminate additional negatives but require additional computation. After several stages of processing, the number of sub-windows has been reduced radically. Further processing can take any form, such as additional stages of the cascade (as in our detection system) or an alternative detection system.

Stages in the cascade are constructed by training classifiers using AdaBoost and then adjusting the threshold to minimize false negatives. Note that the default AdaBoost threshold is designed to yield a low error rate on the training data. In general a lower threshold yields higher detection rates and higher false positive rates.

For example, an excellent first stage classifier can be constructed from a two-feature strong classifier by reducing the threshold to minimize false negatives. Measured against a validation training set, the threshold can be adjusted to detect 100% of the faces with a false positive rate of 40%. See Figure 3 for a description of the two features used in this classifier.

Computation of the two-feature classifier amounts to about 60 microprocessor instructions. It seems hard to imagine that any simpler filter could achieve higher rejection rates. By comparison, scanning a simple image template, or a single layer perceptron, would require at least 20 times as many operations per sub-window.

The structure of the cascade reflects the fact that within any single image an overwhelming majority of sub-windows are negative. As such, the cascade attempts to reject as many negatives as possible at the earliest stage possible. While a positive instance will trigger the evaluation of every classifier in the cascade, this is an exceedingly rare event.

Much like a decision tree, subsequent classifiers are trained using those examples which pass through all the previous stages. As a result, the second classifier faces a more difficult task than the first. The examples which make it through the first stage are "harder" than typical examples. The more difficult examples faced by deeper classifiers push the entire receiver operating characteristic (ROC) curve downward. At a given detection rate, deeper classifiers have correspondingly higher false positive rates.

4.1. Training a Cascade of Classifiers

The cascade training process involves two types of tradeoffs. In most cases classifiers with more features will achieve higher detection rates and lower false positive rates. At the same time, classifiers with more features require more time to compute. In principle one could define an optimization framework in which i) the number of classifier stages, ii) the number of features in each stage, and iii) the threshold of each stage are traded off in order to minimize the expected number of evaluated features. Unfortunately, finding this optimum is a tremendously difficult problem.

In practice a very simple framework is used to produce an effective classifier which is highly efficient. Each stage of the cascade reduces the false positive rate and decreases the detection rate. A target is selected for the minimum reduction in false positives and the maximum decrease in detection. Each stage is trained by adding features until the target detection and false positive rates are met (these rates are determined by testing the detector on a validation set). Stages are added until the overall target for false positive and detection rate is met.
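The framework above is specified only as control flow, so the sketch below leaves the hard parts as hypothetical helpers: train_stage (AdaBoost over the current training sets), adjust_threshold (lowering the stage threshold to keep detections high), eval_rates, and collect_false_positives (bootstrapping hard negatives) are placeholder names, and the multiplicative bookkeeping of the overall false-positive rate is our reading of the text:

```python
def train_cascade(pos, neg, val, f_overall_target, f_stage_max, d_stage_min):
    """Sketch of the Section 4.1 stage-training loop.

    f_overall_target: overall false-positive goal; f_stage_max and
    d_stage_min: per-stage false-positive / detection targets, measured
    on the validation set `val`.
    """
    cascade, f_overall = [], 1.0
    while f_overall > f_overall_target:
        n_features = 0
        while True:
            n_features += 1                          # add features one at a time
            stage = train_stage(pos, neg, n_features)
            adjust_threshold(stage, d_stage_min, val)  # keep detections high
            f, d = eval_rates(stage, val)              # per-stage rates
            if f <= f_stage_max and d >= d_stage_min:
                break                                # per-stage targets met
        cascade.append(stage)
        f_overall *= f       # stage rates compound multiplicatively
        # Deeper stages see only negatives the cascade still passes.
        neg = collect_false_positives(cascade, neg)
    return cascade
```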
4.2. Detector Cascade Discussion

The complete face detection cascade has 38 stages with over 6000 features. Nevertheless the cascade structure results in fast average detection times. On a difficult dataset, containing 507 faces and 75 million sub-windows, faces are detected using an average of 10 feature evaluations per sub-window. In comparison, this system is about 15 times faster than an implementation of the detection system constructed by Rowley et al. [12]. (Henry Rowley very graciously supplied us with implementations of his detection system for direct comparison; reported results are against his fastest system.)

A notion similar to the cascade appears in the face detection system described by Rowley et al., in which two detection networks are used [12]. Rowley et al. used a faster yet less accurate network to prescreen the image in order to find candidate regions for a slower, more accurate network. Though it is difficult to determine exactly, it appears that Rowley et al.'s two-network face system is the fastest existing face detector. (It is difficult to determine from the published literature, but the Rowley-Baluja-Kanade detector is widely considered the fastest detection system and has been heavily tested on real-world problems; other published detectors have either neglected to discuss performance or have not reported results on a large and difficult test set.)

The structure of the cascaded detection process is essentially that of a degenerate decision tree, and as such is related to the work of Amit and Geman [1]. Unlike techniques which use a fixed detector, Amit and Geman propose an alternative point of view where unusual co-occurrences of simple image features are used to trigger the evaluation of a more complex detection process. In this way the full detection process need not be evaluated at many of the potential image locations and scales. While this basic insight is very valuable, in their implementation it is necessary to first evaluate some feature detector at every location. These features are then grouped to find unusual co-occurrences. In practice, since the form of our detector and the features that it uses are extremely efficient, the amortized cost of evaluating our detector at every scale and location is much faster than finding and grouping edges throughout the image.

In recent work, Fleuret and Geman have presented a face detection technique which relies on a "chain" of tests in order to signify the presence of a face at a particular scale and location [4]. The image properties measured by Fleuret and Geman, disjunctions of fine scale edges, are quite different from rectangle features, which are simple, exist at all scales, and are somewhat interpretable. The two approaches also differ radically in their learning philosophy. The motivation for Fleuret and Geman's learning process is density estimation and density discrimination, while our detector is purely discriminative. Finally, the false positive rate of Fleuret and Geman's approach appears to be higher than that of previous approaches like Rowley et al. and this approach. Unfortunately the paper does not report quantitative results of this kind; the included example images each have between 2 and 10 false positives.
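At detection time, the cascade's early exit is what yields the average of 10 feature evaluations per sub-window. A minimal sketch, reusing the stump representation from the AdaBoost sketch above (the per-stage threshold is the one lowered during training, an assumption of this illustration):

```python
def cascade_classify(F_x, cascade):
    """Run one sub-window's feature vector F_x through the cascade.

    Each stage is (stumps, threshold), with stumps as returned by
    train_adaboost above. Any stage scoring below its (lowered)
    threshold rejects immediately; most sub-windows stop in the
    first stage or two.
    """
    for stumps, threshold in cascade:
        score = sum(alpha * int(p * F_x[j] < p * theta)
                    for j, theta, p, alpha in stumps)
        if score < threshold:
            return 0   # rejected early
    return 1           # survived all stages: report a detection
```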
5. Results

A 38 layer cascaded classifier was trained to detect frontal upright faces. To train the detector, a set of face and non-face training images were used. The face training set consisted of 4916 hand labeled faces scaled and aligned to a base resolution of 24 by 24 pixels. The faces were extracted from images downloaded during a random crawl of the world wide web. Some typical face examples are shown in Figure 5. The non-face sub-windows used to train the detector come from 9544 images which were manually inspected and found not to contain any faces. There are about 350 million sub-windows within these non-face images.

The number of features in the first five layers of the detector is 1, 10, 25, 25 and 50 respectively. The remaining layers have increasingly more features. The total number of features in all layers is 6061.

Each classifier in the cascade was trained with the 4916 training faces (plus their vertical mirror images, for a total of 9832 training faces) and 10,000 non-face sub-windows (also of size 24 by 24 pixels) using the AdaBoost training procedure. For the initial one-feature classifier, the non-face training examples were collected by selecting random sub-windows from a set of 9544 images which did not contain faces. The non-face examples used to train subsequent layers were obtained by scanning the partial cascade across the non-face images and collecting false positives. A maximum of 10000 such non-face sub-windows were collected for each layer.

Speed of the Final Detector

The speed of the cascaded detector is directly related to the number of features evaluated per scanned sub-window. Evaluated on the MIT+CMU test set [12], an average of 10 features out of a total of 6061 are evaluated per sub-window. This is possible because a large majority of sub-windows are rejected by the first or second layer in the cascade. On a 700 MHz Pentium III processor, the face detector can process a 384 by 288 pixel image in about 0.067 seconds (using a starting scale of 1.25 and a step size of 1.5, described below). This is roughly 15 times faster than the Rowley-Baluja-Kanade detector [12] and about 600 times faster than the Schneiderman-Kanade detector [15].

Image Processing

All example sub-windows used for training were variance normalized to minimize the effect of different lighting conditions. Normalization is therefore necessary during detection as well. The variance of an image sub-window can be computed quickly using a pair of integral images. Recall that σ² = (1/N) Σ x² - m², where σ is the standard deviation, m is the mean, and x is the pixel value within the sub-window. The mean of a sub-window can be computed using the integral image. The sum of squared pixels is computed using an integral image of the image squared (i.e. two integral images are used in the scanning process). During scanning, the effect of image normalization can be achieved by post-multiplying the feature values rather than pre-multiplying the pixels.
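The variance computation maps directly onto the two-integral-image trick. A sketch reusing rect_sum and the padded integral images from the earlier code (the window geometry convention, a (top, left) corner plus a square size, is our assumption):

```python
import numpy as np

def window_std(ii, ii_sq, top, left, size):
    """Standard deviation of a size x size sub-window, computed from
    `ii` (integral image) and `ii_sq` (integral image of the squared
    image), using sigma^2 = (1/N) * sum(x^2) - m^2."""
    n = float(size * size)
    m = rect_sum(ii, top, left, size, size) / n
    sum_sq = rect_sum(ii_sq, top, left, size, size)
    return np.sqrt(max(sum_sq / n - m * m, 0.0))

# During scanning, each feature value is divided by sigma
# (post-multiplying the features) instead of normalizing the pixels.
```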
Scanning the Detector

The final detector is scanned across the image at multiple scales and locations. Scaling is achieved by scaling the detector itself, rather than scaling the image. This process makes sense because the features can be evaluated at any scale with the same cost. Good results were obtained using a set of scales a factor of 1.25 apart.

The detector is also scanned across location. Subsequent locations are obtained by shifting the window some number of pixels Δ. This shifting process is affected by the scale of the detector: if the current scale is s, the window is shifted by [sΔ], where [·] is the rounding operation.

The choice of Δ affects both the speed of the detector and its accuracy. The results we present are for Δ = 1.0. We can achieve a significant speedup by setting Δ = 1.5, with only a slight decrease in accuracy.

Integration of Multiple Detections

Since the final detector is insensitive to small changes in translation and scale, multiple detections will usually occur around each face in a scanned image. The same is often true of some types of false positives. In practice it often makes sense to return one final detection per face. Toward this end it is useful to postprocess the detected sub-windows in order to combine overlapping detections into a single detection.

In these experiments detections are combined in a very simple fashion. The set of detections are first partitioned into disjoint subsets. Two detections are in the same subset if their bounding regions overlap. Each partition yields a single final detection. The corners of the final bounding region are the average of the corners of all detections in the set.
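Scanning and detection merging, as described above, might be sketched as follows. The 1.25 scale factor, the [sΔ] shift, and corner averaging are from the text; classify_window (feature extraction plus cascade evaluation on one window) is a hypothetical helper, and the greedy grouping is our approximation of the disjoint-subset partition:

```python
def scan(image, cascade, base=24, scale_factor=1.25, delta=1.0):
    """Scan the detector over locations and scales (sketch).
    The detector is scaled rather than the image; the shift grows
    with the scale as round(s * delta)."""
    h, w = image.shape
    detections, s = [], 1.0
    while round(base * s) <= min(h, w):
        win = round(base * s)
        shift = max(1, round(s * delta))
        for top in range(0, h - win + 1, shift):
            for left in range(0, w - win + 1, shift):
                # classify_window: hypothetical helper that computes the
                # window's normalized features and runs the cascade.
                if classify_window(image, cascade, top, left, win):
                    detections.append((top, left, win))
        s *= scale_factor
    return merge_detections(detections)

def overlaps(a, b):
    """Bounding-box overlap test for detections (top, left, win)."""
    (ta, la, wa), (tb, lb, wb) = a, b
    return ta < tb + wb and tb < ta + wa and la < lb + wb and lb < la + wa

def merge_detections(dets):
    """Greedy approximation of the disjoint-subset partition: group
    overlapping detections, then average each group's coordinates
    (roughly equivalent to averaging the bounding-region corners)."""
    groups = []
    for d in dets:
        for g in groups:
            if any(overlaps(d, e) for e in g):
                g.append(d)
                break
        else:
            groups.append([d])
    return [tuple(sum(v) / len(g) for v in zip(*g)) for g in groups]
```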
Experiments on a Real-World Test Set

We tested our system on the MIT+CMU frontal face test set [12]. This set consists of 130 images with 507 labeled frontal faces. A ROC curve showing the performance of our detector on this test set is shown in Figure 6. To create the ROC curve, the threshold of the final layer classifier is adjusted from -∞ to +∞. Adjusting the threshold to +∞ will yield a detection rate of 0.0 and a false positive rate of 0.0. Adjusting the threshold to -∞, however, increases both the detection rate and false positive rate, but only to a certain point: neither rate can be higher than the rate of the detection cascade minus the final layer. In effect, a threshold of -∞ is equivalent to removing that layer. Further increasing the detection and false positive rates requires decreasing the threshold of the next classifier in the cascade. Thus, in order to construct a complete ROC curve, classifier layers are removed. We use the number of false positives, as opposed to the rate of false positives, for the x-axis of the ROC curve to facilitate comparison with other systems. To compute the false positive rate, simply divide by the total number of sub-windows scanned. In our experiments, the number of sub-windows scanned is 75,081,800.

Figure 6: ROC curve for our face detector on the MIT+CMU test set. The detector was run using a step size of 1.0 and a starting scale of 1.0 (75,081,800 sub-windows scanned).

Unfortunately, most previously published results on face detection include only a single operating regime (i.e. a single point on the ROC curve). To make comparison with our detector easier, we have listed our detection rate for the false positive rates reported by the other systems. Table 2 lists the detection rates for various numbers of false positive detections for our system as well as for other published systems. For the Rowley-Baluja-Kanade results [12], a number of different versions of their detector were tested, yielding a number of different results; they are all listed under the same heading. For the Roth-Yang-Ahuja detector [11], they reported their result on the MIT+CMU test set minus 5 images containing line-drawn faces.

Table 2: Detection rates for various numbers of false positives on the MIT+CMU test set containing 130 images and 507 faces.

Figure 7 shows the output of our face detector on some test images from the MIT+CMU test set.

A simple voting scheme to further improve results

In Table 2 we also show results from running three detectors (the 38 layer one described above plus two similarly trained detectors) and outputting the majority vote of the three detectors. This improves the detection rate as well as eliminating false positives. The improvement would be greater if the detectors were more independent; the correlation of their errors results in a modest improvement over the best single detector.

6. Conclusions

We have presented an approach for object detection which minimizes computation time while achieving high detection accuracy. The approach was used to construct a face detection system which is approximately 15 times faster than any previous approach.

This paper brings together new algorithms, representations, and insights which are quite generic and may well have broader application in computer vision and image processing.

Finally, this paper presents a set of detailed experiments on a difficult face detection dataset which has been widely studied. This dataset includes faces under a very wide range of conditions, including illumination, scale, pose, and camera variation. Experiments on such a large and complex dataset are difficult and time consuming. Nevertheless, systems which work under these conditions are unlikely to be brittle or limited to a single set of conditions. More importantly, conclusions drawn from this dataset are unlikely to be experimental artifacts.

Figure 7: Output of our face detector on a number of test images from the MIT+CMU test set.

References

[1] Y. Amit, D. Geman, and K. Wilder. Joint induction of shape features and tree classifiers, 1997.
[2] Anonymous. Anonymous. In Anonymous, 2000.
[3] F. Crow. Summed-area tables for texture mapping. In Proceedings of SIGGRAPH, volume 18(3), pages 207-212, 1984.
[4] F. Fleuret and D. Geman. Coarse-to-fine face detection. Int. J. Computer Vision, 2001.
[5] William T. Freeman and Edward H. Adelson. The design and use of steerable filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(9):891-906, 1991.
[6] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory: Eurocolt '95, pages 23-37. Springer-Verlag, 1995.
[7] H. Greenspan, S. Belongie, R. Goodman, P. Perona, S. Rakshit, and C. Anderson. Overcomplete steerable pyramid filters and rotation invariance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1994.
[8] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Patt. Anal. Mach. Intell., 20(11):1254-1259, November 1998.
[9] Edgar Osuna, Robert Freund, and Federico Girosi. Training support vector machines: an application to face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1997.
[10] C. Papageorgiou, M. Oren, and T. Poggio. A general framework for object detection. In International Conference on Computer Vision, 1998.
[11] D. Roth, M. Yang, and N. Ahuja. A SNoW-based face detector. In Neural Information Processing 12, 2000.
[12] H. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Patt. Anal. Mach. Intell., volume 20, pages 22-38, 1998.
[13] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Stat., 26(5):1651-1686, 1998.
[14] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. In Proceedings of the Fourteenth International Conference on Machine Learning, 1997.
[15] H. Schneiderman and T. Kanade. A statistical method for 3D object detection applied to faces and cars. In International Conference on Computer Vision, 2000.