
Question

...
1 Approved Answer

I. INTRODUCTION

The popularity of online social networks (OSNs) has been continuously increasing in recent years. Facebook (FB), in particular, is one of the most important OSNs today. It has the highest number of active users (at least 250M [1]), with more than half of active FB users returning daily, and the largest number of visitors among OSNs according to Comscore [2] (295M unique worldwide Internet users in March 2009). This success has generated interest within the networking community and has given rise to a number of measurement and characterization studies. Some studies [3], [4] are based on complete datasets of specific FB communities, which are called "networks" within FB. However, the complete dataset is typically not available and, as with most OSNs, Facebook is unwilling to share its data. Therefore, a relatively small but representative sample is desirable in order to study properties and test algorithms for these OSNs. A number of studies have already crawled social networks and Facebook, using mostly BFS-like techniques, which are known to introduce bias.

In this paper, we compare several candidate crawling techniques on Facebook. First, we use BFS as a baseline. Second, we use the classic Random Walk (RW), which is also biased towards high-degree nodes, but at least its bias can be quantified by Markov Chain analysis and thus can be corrected via appropriate re-weighting (RWRW). Third, we consider the Metropolis-Hastings Random Walk (MHRW), which directly achieves the goal, i.e., yields a uniform stationary distribution of nodes (users). This technique has been used in the past for P2P sampling [8] and recently for a few OSNs [9], [10], but not for Facebook. Finally, we also collect a sample that represents the "ground truth" (UNI), i.e., a truly uniform sample of Facebook userIDs, selected by a rejection sampling procedure from the system's 32-bit ID space. Such ground truth is in general unavailable, and our ability to use it as a basis of comparison is therefore a valuable asset of this study. We compare all sampling methods in terms of their bias and convergence properties. We also provide recommendations for their use in practice: e.g., we implement online formal convergence diagnostic tests and parallel walks for improved speed, and we discuss the pros and cons of MHRW vs. RWRW in practice.

In terms of results, we show that MHRW and RWRW work remarkably well. We demonstrate their aggregate statistical properties, validating them against the known uniform sample, and show how our formal diagnostics can be used to identify convergence during the sampling process. In contrast, we find that the more traditional methods - BFS and RW - lead to a significant bias in the case of FB. Finally, using one of our validated samples (MHRW), we also characterize some key properties of Facebook; we find some of them to be substantively different from what was previously believed based on biased samples. The collected datasets are made publicly available for use by the research community at [11].

The structure of the paper is as follows. Section II discusses related work. Section III describes the sampling techniques and convergence diagnostics. Section IV summarizes the data collection process and the datasets. Section V evaluates and compares all sampling techniques in terms of convergence of various node properties and quality (lack of bias) of the obtained sample. Section VI provides a characterization of some key Facebook properties, based on the MHRW sample. Section VII concludes the paper.

(Acknowledgments: Minas Gjoka and Athina Markopoulou were partially supported by NSF CAREER grant 0747110. Maciej Kurant was visiting UC Irvine during the period that this work was conducted and was supported by grant ManCom 2110 of the Hasler Foundation, Bern, Switzerland. Carter T. Butts was supported by DOD ONR award N00014-06-1-1015.)
II. RELATED WORK

Crawling techniques can be roughly classified into two categories: (a) graph traversal techniques and (b) random walks. In graph traversal techniques, each node in the connected component is visited exactly once, if we let the process run until completion. These methods vary in the order in which they visit the nodes; examples include Breadth-First Search (BFS), Depth-First Search (DFS), Forest Fire (FF) and Snowball Sampling (SBS). BFS, in particular, is a basic technique that has been used extensively for sampling OSNs in past research [5]-[7]. One reason for this popularity is that an (even incomplete) BFS sample collects a full view (all nodes and edges) of some particular region of the graph, which is sometimes believed to be representative of the entire graph [7]. However, BFS leads to a bias towards high-degree nodes [12], [13], and this bias has not been analyzed so far for arbitrary graphs. In order to remove this bias, effort is usually put on completing the BFS, i.e., on collecting all or most of the nodes in the graph.

Random walks allow node re-visiting and have well-known properties - see [14] for an excellent survey. They have been used for sampling the Web [15], P2P networks [8], [16], [17], and other large graphs [18]. The application of random walks to OSNs, such as Twitter [10] and Friendster [9], is very recent; to the best of our knowledge we are the first to apply these techniques to Facebook sampling [19]. Random walks can be biased, but their bias can be analyzed using classic results from Markov Chains and corrected by re-weighting the estimators. This has been demonstrated in the context of P2P sampling [16], where the re-weighted random walk is considered a special case of Respondent-Driven Sampling (RDS) [20] (if revisiting nodes is allowed and exactly one neighbor is selected in every step [21]). Alternatively, the random walk can be modified using the Metropolis filter so as to achieve, by design, any desired stationary distribution [22], [23]; in our case, this distribution is the uniform one, because it has no sampling bias. This algorithm, known as the Metropolis-Hastings Random Walk (MHRW), has been applied to P2P networks [8], modified to deal with peer churn (Metropolized Random Walk with Backtracking), and recently compared against the Re-Weighted Random Walk (or RDS, in the terminology of [9], [16]).

Compared to the aforementioned sampling techniques, our work is mostly related to the random walk techniques, as we obtain unbiased estimators using MHRW and RWRW; BFS and RW (without re-weighting) are used mainly as baselines for comparison. We accompany the basic crawling techniques with formal, online convergence diagnostic tests on several node properties, which, to the best of our knowledge, has not been done before in measurements of such systems. We also implement multiple parallel chains, which have also been used recently in [16] but started at the same node (whereas we start from different nodes, thus better utilizing the multiple chains). In terms of application, we perform unbiased sampling of Facebook for the first time. A unique asset of our study is a true uniform sample, obtained through sampling of userIDs, which can serve as ground truth to evaluate the crawling techniques.

Other Measurements of Facebook. The work by Wilson et al. [7] measures social and user interaction graphs in Facebook between March and May 2008. Their sampling methodology is a region-constrained BFS. Such a region-constrained BFS might be appropriate to study particular regions, but it does not provide Facebook-wide information, which is the goal of our study; furthermore, and unlike random walks, the bias of BFS has not been formally analyzed for arbitrary graphs. In [24] the authors examine the usage of privacy settings in Myspace and Facebook and the potential privacy leakage. In our previous work [25], we characterized the popularity and user reach of Facebook applications. Finally, there are also two complete and publicly available datasets corresponding to two university networks from Facebook, namely Harvard [3] and Caltech [4]. In contrast, we sample the global Facebook social graph and also make the dataset publicly available [11]. To the best of our knowledge, compared to previous measurements, this paper provides the first unbiased sample of Facebook.
The application of random walks to set of edges (mutual friendship relationships). Let k, be the OSNs, such as Twitter [10] and Friendster [9], is very recent; degree of node v. In this paper: (i) we are interested only in the to the best of our knowledge we are the first to apply these publicly declared friends, which, under default privacy settings, techniques to Facebook sampling [19]. Random walks can be are available to any logged-in user; (ii) we are not interested in biased but their bias can be analyzed using classic results from isolated users, Le, users without any declared friends; (iii) we Markov Chains and corrected by re-weighting the estimators. consider that the FB graph remains static during our crawling. This has been demonstrated in the context of P2P sampling We justify and discuss in detail assumption (iii) in Section IV. [16], where the re-weighted random walk is considered as B. Sampling Methodsput on completing the BFS, Le., on collecting all or most of the nodes in the graph. III. SAMPLING METHODOLOGY Random walks allow node re-visiting and have well-known A. Scope and Assumptions properties - see [14] for an excellent survey. They have been The FB social graph can be modeled as an undirected graph used for sampling the Web [15]. P2P networks [8]. [16], [17]. G = (V, E), where V is a set of nodes (users) and E is a and other large graphs [18]. The application of random walks to set of edges (mutual friendship relationships). Let &, be the OSNs, such as Twitter [10] and Friendster [9], is very recent; degree of node v. In this paper. (i) we are interested only in the to the best of our knowledge we are the first to apply these publicly declared friends, which, under default privacy settings, techniques to Facebook sampling [19]. Random walks can be are available to any logged-in user; (ii) we are not interested in biased but their bias can be analyzed using classic results from isolated users, ie., users without any declared friends; (iii) we Markov Chains and corrected by re-weighting the estimators. consider that the FB graph remains static during our crawling. This has been demonstrated in the context of P2P sampling We justify and discuss in detail assumption (iii) in Section IV. [16], where the re-weighted random walk is considered as a special case of Respondent-Driven Sampling (RDS) [20] B. Sampling Methods (if revisiting nodes is allowed and exactly one neighbor is The crawling of the social graph starts from an initial selected in every step [21]). Alternatively, the random walk node and proceeds iteratively. In every operation, we visit a can be modified using the Metropolis filter so as to achieve, node and discover all its neighbors. There are many ways, by design. any desired stationary distribution [22], [23]. In depending on the particular sampling method, in which we can our case, this distribution is the uniform, because it has no proceed. In this section, we describe the sampling methods we sampling bias. This algorithm, known as Metropolis-Hasting implemented in this paper. Our ultimate goal is to obtain a Random Walk (MHRW) has been applied to P2P networks uniform random sample of users in Facebook. [8]. modified to deal with peer churn (Metropolized Random 1) Breadth First Search (BFS): At each new iteration the Walk with Backtracking) and recently compared against Re- earliest explored but not-yet-visited node is selected next. 
2) Random Walk (RW): In the classic random walk [14], the next-hop node w is chosen uniformly at random among the neighbors of the current node v, i.e., the probability of moving from v to w is

    P^{RW}_{v,w} = 1/k_v   if w is a neighbor of v,
                 = 0       otherwise.

The random walk is inherently biased [14]. In a connected and aperiodic graph, the probability of being at a particular node v converges to the stationary distribution \pi^{RW}_v = k_v / (2|E|), i.e., the classic RW samples nodes with probability proportional to their degree. This is clearly biased towards high-degree nodes; e.g., a node with twice the degree will be visited by RW twice as often. In Section V, we show that several other node properties are correlated with the node degree and are thus also estimated with bias by RW sampling.

3) Re-Weighted Random Walk (RWRW): A natural next step is to crawl the network using RW, but to correct for the degree bias by appropriately re-weighting the measured values. This can be done using the Hansen-Hurwitz estimator [26], as first shown in [21], [27] for random walks and later also used in [16]. Consider a stationary random walk that has visited V = v_1, ..., v_n unique nodes. Each node can belong to one of m groups with respect to a property of interest A, which might be the degree, the network size, or any other discrete-valued node property. Let (A_1, A_2, ..., A_m) be all possible values of A and the corresponding groups, with \bigcup_i A_i = V; e.g., if the property of interest is the node degree, A_i contains all nodes u with degree k_u = i. To estimate the probability distribution of A, we estimate the proportion of nodes with value A_i, i = 1, ..., m, as

    \hat{P}(A_i) = ( \sum_{u \in A_i} 1/k_u ) / ( \sum_{u \in V} 1/k_u ).

Estimators for continuous properties can be obtained using related methods, e.g., kernel density estimators.
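The discrete re-weighting above is straightforward to implement. The following Python sketch estimates the distribution of a node property from an RW sample; the variable names (walk, degree, prop) are illustrative and not from the paper.

    from collections import defaultdict

    def rwrw_estimate(walk, degree, prop):
        """Hansen-Hurwitz re-weighted estimate of P(A = a) from a random-walk
        sample. Each visited node u is weighted by 1/k_u to undo the
        degree-proportional visiting probability of the classic RW.
        walk   : list of visited node ids (repetitions allowed)
        degree : dict node -> degree k_u
        prop   : dict node -> discrete property value A(u)
        """
        weight_per_value = defaultdict(float)
        total_weight = 0.0
        for u in walk:
            w = 1.0 / degree[u]
            weight_per_value[prop[u]] += w
            total_weight += w
        return {a: w / total_weight for a, w in weight_per_value.items()}

    # Example: estimate the degree distribution itself, i.e., A(u) = k_u:
    #   degree_dist = rwrw_estimate(walk, degree, prop=degree)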
4) Metropolis-Hastings Random Walk (MHRW): Instead of correcting the bias after the walk, one can appropriately modify the transition probabilities so that the walk converges to the desired uniform distribution. The Metropolis-Hastings algorithm [22] is a general Markov Chain Monte Carlo (MCMC) technique [23] for sampling from a probability distribution that is difficult to sample from directly. In our case, we would like to sample nodes from the uniform distribution \mu_v = 1/|V|. This can be achieved with the following transition probabilities:

    P^{MH}_{v,w} = (1/k_v) * min(1, k_v/k_w)          if w is a neighbor of v,
                 = 1 - \sum_{y \neq v} P^{MH}_{v,y}   if w = v,
                 = 0                                   otherwise.

It can be shown that the resulting stationary distribution is \pi^{MH}_v = 1/|V|, which is exactly the uniform distribution we are looking for. P^{MH}_{v,w} implies the following algorithm, which we refer to simply as MHRW in the rest of the paper:

    v <- initial node
    while stopping criterion not met do
        select node w uniformly at random from the neighbors of v
        generate a number p uniformly at random in [0, 1]
        if p <= k_v / k_w then
            v <- w
        else
            stay at v
        end if
    end while

In every iteration of MHRW, at the current node v we randomly select a neighbor w and move there with probability min(1, k_v/k_w). We always accept a move towards a node of smaller degree, and reject some of the moves towards higher-degree nodes. This eliminates the bias towards high-degree nodes.
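A minimal Python sketch of this walk, directly transcribing the pseudocode above; it is not the authors' crawler. As before, get_friends(user) and the iteration count are hypothetical placeholders.

    import random

    def mhrw(seed, get_friends, num_iterations=10000):
        """Metropolis-Hastings Random Walk targeting the uniform distribution
        over nodes: propose a uniformly random neighbor w of the current node v
        and accept with probability min(1, k_v / k_w)."""
        v = seed
        neighbors_v = get_friends(v)
        sample = []
        for _ in range(num_iterations):
            sample.append(v)                  # repetitions are part of the sample
            w = random.choice(neighbors_v)    # propose a uniformly random neighbor
            neighbors_w = get_friends(w)      # needed only to learn k_w
            if random.random() <= len(neighbors_v) / len(neighbors_w):
                v, neighbors_v = w, neighbors_w   # accept: move to w
            # otherwise reject: stay at v and sample it again next iteration
        return sample

In practice the results of get_friends would be cached, so that repeated visits and rejected proposals consume no extra bandwidth; this is exactly the point of the "fair" comparison discussed in Section V.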
C. Convergence

1) Using Multiple Parallel Walks: Multiple parallel walks are used in the MCMC literature [23] to improve convergence. Intuitively, if we have only one walk, we might get trapped in a certain region of the graph and erroneously declare convergence. Having multiple parallel chains reduces the probability of this happening and allows for more accurate convergence diagnostics. An additional advantage of multiple parallel walks is that they are amenable to parallel implementation on different machines, or on different threads of the same machine; in both cases, this reduces the duration of the crawl. We implemented each of the considered crawling algorithms with several parallel walks. Each walk starts from a different node in V_0 \subseteq V, |V_0| >= 1 (|V_0| = 28 in our case) and proceeds independently of the others. The initial nodes V_0 are chosen randomly. For a fair comparison, we compare multiple MHRWs to multiple RWs and multiple BFSs, all starting from the same set of nodes V_0.

2) Detecting Convergence with Online Diagnostics: Inferences from MCMC assume that the samples are drawn from the equilibrium distribution, which is true only asymptotically. To correctly diagnose when convergence occurs, we use online diagnostic tests developed within the MCMC literature [23], for the first time in the OSN sampling context.

One type of convergence has to do with losing dependence on the starting point. A standard approach is to run the sampling long enough and to discard a number of initial "burn-in" iterations. This comes at a cost, which in the case of FB is the consumed bandwidth (on the order of TB) and the measurement time (days or weeks). It is therefore crucial to assess the convergence of our MCMC sampling and to decide on appropriate settings of burn-in and total running time. The burn-in can be decided using intra-chain and inter-chain diagnostics. In particular, we use two standard convergence tests, widely accepted and well documented in the MCMC literature, Geweke [28] and Gelman-Rubin [29], described below. We outline the rationale of these tests and refer the interested reader to the references for more details. In Section V-A3, we apply these tests to several node properties, such as node degree, privacy settings, network ID and network membership.

Geweke Diagnostic. The Geweke diagnostic [28] detects the convergence of a single Markov chain. Let X be a single sequence of samples of our metric of interest. Geweke considers two subsequences of X, its beginning X_a (typically the first 10%) and its end X_b (typically the last 50%). Based on X_a and X_b, we compute the z-statistic

    z = ( E(X_a) - E(X_b) ) / sqrt( Var(X_a) + Var(X_b) ).

Once the chain has converged, X_a and X_b are samples from the same distribution and z is approximately standard normal; values close to zero indicate convergence.

Gelman-Rubin Diagnostic. The Gelman-Rubin diagnostic [29] instead monitors several sequences at once. Intuitively speaking, it compares the empirical distributions of the individual chains with the empirical distribution of all sequences pooled together: if these are similar, we declare convergence. The test outputs a single value R that is a function of the means and variances of all chains. With time, R approaches 1, and convergence is typically declared for values smaller than 1.02.

We note that even after the burn-in period, strong correlation of consecutive samples in the chain may affect sequential analysis. This is typically addressed by thinning, i.e., keeping only one out of every x samples. Instead of thinning, we perform sub-sampling of nodes, which has essentially the same effect.
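Both diagnostics are easy to compute from per-chain sequences of a scalar metric (e.g., node degree). The sketch below follows the formulas as stated above and the standard potential-scale-reduction definition of R; it is an illustration under those assumptions, not the authors' implementation, and the 10%/50% window sizes are the conventional defaults mentioned above.

    import numpy as np

    def geweke_z(chain, first=0.1, last=0.5):
        """Geweke z-score for a single chain of a scalar metric."""
        x = np.asarray(chain, dtype=float)
        a = x[: int(first * len(x))]          # beginning of the chain
        b = x[-int(last * len(x)):]           # end of the chain
        return (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) + b.var(ddof=1))

    def gelman_rubin_r(chains):
        """Gelman-Rubin R for m parallel chains of equal length n."""
        x = np.asarray(chains, dtype=float)   # shape (m, n)
        m, n = x.shape
        b = n * x.mean(axis=1).var(ddof=1)    # between-chain variance
        w = x.var(axis=1, ddof=1).mean()      # within-chain variance
        var_hat = (n - 1) / n * w + b / n     # pooled variance estimate
        return np.sqrt(var_hat / w)

    # Convergence is declared when |z| stays within [-1, 1] for every chain
    # and R drops below roughly 1.02.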
D. Ground Truth: Uniform Sample (UNI)

Assessing the quality of any sampling method on an unknown graph is a challenging task. In order to have a "ground truth" to compare against, the performance of such methods is typically tested on artificial graphs (using models such as Erdos-Renyi, Watts-Strogatz or Barabasi-Albert). This has the disadvantage that one can never be sure that the results generalize to real networks that do not follow the simulated graph models and parameters.

Fortunately, Facebook was an exception at the time we performed our crawling. It allowed us to obtain a truly uniform sample of Facebook nodes by generating uniformly random 32-bit userIDs and polling Facebook about their existence: if the ID exists, we keep it; otherwise we discard it. This simple method is a textbook technique known as rejection sampling [30] and, in general, it allows sampling from any distribution of interest, in our case the uniform. In particular, it is guaranteed to select uniformly random userIDs from the existing FB users regardless of their actual distribution in the userID space, i.e., even though the userIDs are not allocated sequentially or evenly across the userID space. For completeness, we derive this property of UNI sampling in the Appendix. We refer to this method as UNI and use it as a ground-truth uniform sampler.

Although UNI sampling solved the problem of uniform node sampling in Facebook at the time, and is a valuable asset of this study, it is not a general solution for sampling OSNs. First, the ID space must not be too sparse for this operation to be efficient. (The number of Facebook users at the time of our study, roughly 2x10^8, was comparable to the size of the 32-bit userID space, about 4.3x10^9, resulting in about one valid user retrieved per 22 attempts on average. If the userID were 64 bits long, e.g., to hinder data collection or to allow for a larger userID space in the future, or a string of arbitrary length, UNI would be infeasible; for example, Orkut has a 64-bit userID and hi5 uses a concatenation of userID and name.) Second, such an operation must be supported by the system, which is not the case in many OSNs. At the time, FB allowed us to verify the existence of an arbitrary userID and retrieve her list of friends. However, soon after we collected the UNI sample, FB moved from using numbers to using names as user IDs, and in the near future it may remove access to userIDs through the web front-end altogether.

In summary, we were fortunate to obtain a uniform sample of userIDs and thus to be able to evaluate the different sampling methods against ground truth. However, crawling friendship relations is a fundamental primitive available in all OSNs and, we believe, the right building block for designing sampling techniques for OSNs in the long run.
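A sketch of the UNI rejection sampler described above: draw 32-bit IDs uniformly at random, keep those that correspond to existing users, and reject the rest. The id_exists(uid) predicate stands in for the actual poll of Facebook and is hypothetical; only the accept/reject logic is the point here.

    import random

    def uni_sample(id_exists, num_valid):
        """Rejection sampling over the 32-bit userID space: the accepted IDs
        form a uniform sample of existing users, regardless of how the IDs
        are allocated within the space."""
        accepted, attempts = [], 0
        while len(accepted) < num_valid:
            uid = random.randrange(2 ** 32)   # uniform over the whole ID space
            attempts += 1
            if id_exists(uid):                # keep existing users, discard the rest
                accepted.append(uid)
        return accepted, attempts             # attempts/num_valid ~ (ID space size)/(number of users)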
IV. DATA COLLECTION

In this paper, we focus on openly, publicly available basic information and do not study detailed user profiles, which are more privacy-sensitive.

One node view. Fig. 1 shows the information collected when visiting the "show friends" webpage of a given user u, which we refer to as basic node information. We emphasize that when we visit user u, we also collect network and privacy information for all her friends.

[Fig. 1. Basic node information collected when visiting user u. (a) Friends list: a core feature of any OSN; in FB, friendship is always mutual, leading to undirected edges. (b) UserID and Name: each user is uniquely defined by her userID, a 32-bit number, and provides her (presumably real) name. (c) Networks: Facebook groups its users into networks of two types, regional (geographical) and workplace/school. (d) Privacy settings Q_u: each user u can restrict the information visible to, and the possible interactions with, any non-friend w. These are captured by four basic binary privacy attributes: 1 (Add as friend), 2 (Photo), 3 (View friends), 4 (Send message). We refer to the resulting 4-bit number as the privacy settings Q_u of node u. By default, Facebook sets Q_u = 1111 (allow all).]

Invalid nodes. There are two types of nodes that we declare invalid. First, if a user u decides to hide her friends and sets her privacy settings to Q_u = **0*, the crawl cannot continue. We address this problem by backtracking to the previous node and continuing the crawl from there, as if u had never been selected. Second, there exist nodes with degree k_v = 0; these are not reachable by any crawl, but we stumble upon them during the UNI sampling of the userID space. Discarding both types of nodes is consistent with our problem statement, where we declared that we exclude such nodes (either not publicly available or isolated) from the graph under study.

Implementation details. In Section III-C1, we mentioned that we ran |V_0| = 28 independent crawls for each algorithm, namely MHRW, BFS and RW, all seeded at the same initial, randomly selected nodes V_0. The number of independent crawls is determined by the number of different machines used. We let each independent crawl continue until exactly 81K samples are collected. In addition to the 28x3 crawls (BFS, RW and MHRW), we ran the UNI sampling until we collected 984K valid users, which is comparable to the 957K unique users collected with MHRW.

A crawler performs HTML scraping to extract the basic node information (Fig. 1) of each visited node u. A server coordinates the crawls so as to avoid downloading duplicate information about previously visited users. This coordination brings many benefits: it takes advantage of the parallel chains to speed up the process, it avoids overloading the FB platform with duplicate requests, and the crawl proceeds at a faster pace since each request to the FB servers returns new information.
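The coordination described above amounts to a shared "already fetched" store consulted by all parallel chains before issuing a request. A minimal in-process sketch, assuming a hypothetical fetch_basic_info(user) call (the real system used a separate coordinating server rather than a shared object):

    import threading

    class CrawlCoordinator:
        """Shared cache so that parallel walks avoid re-downloading users
        whose basic node information has already been fetched."""
        def __init__(self, fetch_basic_info):
            self._fetch = fetch_basic_info
            self._cache = {}              # user -> basic node information
            self._lock = threading.Lock()

        def get(self, user):
            with self._lock:
                cached = self._cache.get(user)
            if cached is not None:
                return cached             # duplicate request avoided
            info = self._fetch(user)      # only not-yet-seen users hit FB
            with self._lock:
                self._cache[user] = info
            return info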
Ego networks. Elaborate topological measures, such as the clustering coefficient and assortativity, cannot be estimated based purely on a single-node view. For this reason, after finishing the BFS, RW and MHRW crawls, we also collected a number of ego nets for a sub-sample of the MHRW dataset only (which is the representative one). The ego net is defined in the social-network literature [31] as the full information (edges and node properties) about a user and all its one-hop neighbors. This requires visiting about 100 nodes per ego on average, which is impossible to do for all visited nodes. For this reason, during 04/24-05/01 we collected the ego nets of only 37K nodes, randomly selected from all nodes in the MHRW sample.

Data sets description. The datasets collected for this paper are summarized in Table I. This information refers to all sampled nodes, before discarding any burn-in. The MHRW dataset contains 957K unique nodes, which is fewer than the 28 x 81K = 2.26M iterations in all 28 walks; this is because MHRW may repeat the same node within a walk. The number of rejected nodes in the MHRW process, counted without repetitions, adds up to 645K. For the UNI sampling, we checked 18.53M userIDs picked uniformly at random from [0, 2^32 - 1]. Among them, only 1216K users existed; the rest were discarded. Also, 232K valid userIDs had zero friends; we discarded these isolated users to be consistent with our problem statement. This results in a set of 984K valid users with at least one friend each. Considering that the percentage of zero-degree nodes is unusually high, we manually confirmed that 200 of the discarded users indeed have zero friends.

[Table I. Collected datasets by different algorithms. The crawling algorithms (MHRW, RW and BFS) consist of 28 parallel walks each, with the same 28 randomly selected starting points; UNI is the uniform sample of userIDs. For each dataset the table reports the number of unique users, the number of unique neighbors, the crawling period (between 04/18 and 05/08), and the median degree.]

Finally, we collected ~37K ego nets, a randomly chosen sub-sample of the ~1M-node MHRW sample, which contain basic node information (see Fig. 1) for 5.83M unique neighbors. Overall, we crawled 11.6M unique nodes with basic node information. The total number of unique users for which we have basic privacy and network membership information (which includes the sampled nodes and their neighbors) is much larger: we have such data for ~172M unique Facebook users. This is a significant sample by itself, given that Facebook had close to 200M active users at the time of the measurements.

Timescale of crawls. We treat the FB graph as static during the execution of our crawls, despite the fact that the number of valid users keeps growing. We believe that this assumption is a valid approximation in practice, for several reasons. First, the FB characteristics change on longer timescales than the duration of our walks. During the period of our crawls (see Table I), Facebook was growing at a rate of about 450K users per day, as reported by websites such as [1], [32]. With a population of ~200M users during that period, this translates into a growth of 0.22% of users per day. Each of our crawls lasted around 4-7 days (during which the total FB growth was 0.9%-1.5%); in fact, our convergence analysis shows that the process converged even faster, i.e., in only about one day. Therefore, the growth of Facebook was negligible during our crawls. Second, the FB social (not interaction) graph is much more static than P2P systems, which are known to have high churn; in the latter case, dealing with dynamic graphs becomes important [8], [33]. Third, we obtained empirical evidence by comparing our metrics of interest between the UNI sample of Table I and a similarly sized UNI sample obtained 45 days later; the distributions were virtually identical (we omit the details due to lack of space). Thus, while dynamics are important to consider when sampling changing graphs, they do not appear to be problematic for this particular study.
V. EVALUATION OF SAMPLING TECHNIQUES

In this section, we evaluate all candidate methodologies, namely BFS, RW, RWRW and MHRW, in terms of convergence and estimation bias. First, in Section V-A, we study in detail the convergence of the random walk methods with respect to several properties of interest. We find a burn-in period of 6K samples, which we exclude from each independent crawl. The remaining 75K x 28 sampled nodes constitute our main sample dataset; for a fair comparison, we also exclude the same number of burn-in samples from all other datasets. Second, in Section V-B, we examine the quality of the estimation based on each sample. Finally, in Section V-C, we summarize our findings and provide recommendations for the use of these sampling methods in practice.

A. Convergence Analysis

There are several crucial parameters that affect the convergence of MCMC; they apply to the random walk methods under study (but not to BFS).

1) How to count: Counting samples in BFS is trivial, since nodes are visited at most once. In the random walks, however, nodes can be revisited, and repetitions must be included in the sample in order to ensure the desired statistical properties. For RW, the same node cannot be visited twice in a row, but non-consecutive repetitions are possible; in practice, this happens infrequently in the RW sample (as can be seen from the number of unique nodes in Table I). On the other hand, MHRW repeatedly samples some (typically low-degree) nodes, a property that is intrinsic to its operation. For instance, if some node v has only one neighbor w, then the chain stays at (repeatedly samples) v for an average of k_w iterations, where k_w is the degree of node w. Where k_w is large, the number of repetitions may be locally large. While counterintuitive, this behavior is essential for convergence to the uniform distribution. In our MHRW sample, roughly 45% of the proposed moves are accepted (the acceptance rate, in MCMC terms).

As a result, a typical MHRW visit sequence contains fewer unique nodes than an RW or BFS sequence of the same length. This raises the question: what is a fair way to compare the results of MHRW with RW and BFS? Since queries are issued only for new nodes, if MHRW stays at a node v for several iterations when crawling an OSN, the bandwidth consumed is equal in cost to one iteration (assuming that we cached the already-visited neighbors of v). This suggests that an appropriate practical comparison should be based not on the total number of iterations, but rather on the number of visited unique nodes. In our subsequent comparisons, we denote the RW and MHRW results as "RW-Fair" and "MHRW-Fair" when we compare on the basis of the number of visited unique nodes, as this represents the methods in terms of equivalent bandwidth cost.
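Under this accounting, the cost of a walk is the number of distinct nodes whose friend lists had to be downloaded, not the number of iterations. A tiny illustrative helper (the names are not from the paper, and unique-node count is only an approximation of the full bandwidth accounting used for the "Fair" curves):

    def walk_cost(visit_sequence):
        """Return (iterations, unique_nodes); the latter approximates the
        bandwidth cost, since repeated visits reuse cached friend lists."""
        return len(visit_sequence), len(set(visit_sequence))

    # Example: the MHRW sequence ['a', 'b', 'b', 'b', 'c', 'b'] takes
    # 6 iterations but costs only 3 unique-node downloads.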
2) Convergence tests: A first decision we have to make is the number of iterations for which to run each algorithm. This length should be long enough to ensure that we are at equilibrium (in the case of the random walks). The iterations taken before reaching (approximate) equilibrium are known as "burn-in" draws and should be discarded, to remove the bias due to the choice of the initial seed node. We ran the Geweke and Gelman-Rubin diagnostics on RW, RWRW and MHRW to determine the burn-in period. The Geweke diagnostic was run separately on each of the 28 chains for the node-degree metric. Fig. 2 (top) presents the results for the convergence of the average node degree in the MHRW sample; we declare convergence when all 28 values fall within the [-1, 1] interval, which happens at roughly iteration 500. In contrast, the Gelman-Rubin diagnostic analyzes all 28 chains at once. In Fig. 2 (bottom) we plot the R score for four different metrics in the MHRW sample, namely (i) node degree, (ii) regional network, (iii) privacy settings, and (iv) membership in specific regional networks. After 3000 iterations, all R scores drop below 1.02, the typical target value used as a convergence indicator. We omit the plots for RW and RWRW, since the results look similar. We declare convergence when all tests have detected it; the Gelman-Rubin test converges around 3K nodes. In each independent chain we conservatively discard the first 6K nodes out of 81K in total, and in the remainder of the paper we work only with the remaining 75K nodes per independent chain for RW, RWRW and MHRW.

[Fig. 2. Convergence of the MHRW technique. (Top) Geweke z score for the node degree; each line corresponds to one of the 28 parallel chains. (Bottom) Gelman-Rubin R score for four different metrics (number of friends, regional network, privacy settings, membership in specific networks).]

In addition, we compared the random walk techniques in terms of their distance from the true uniform (UNI) distribution as a function of the number of iterations. In Fig. 3, we show the distance of the estimated degree distribution from the ground truth in terms of the KL (Kullback-Leibler) divergence, which captures the distance between the two distributions over the bulk of their mass. Similar results hold for the Kolmogorov-Smirnov (KS) statistic, which captures the maximum vertical distance between two distributions; we omit them due to lack of space. We note that distance metrics such as KL and KS cannot replace the formal diagnostics, which are able to determine convergence online and, most importantly, in the absence of ground truth.

[Fig. 3. Efficiency of the random walk techniques (RWRW, MHRW) in estimating the degree distribution of FB, in terms of KL divergence from UNI. We observe that (i) RWRW converges faster than MHRW and approximates UNI slightly better at the end, and (ii) RWRW-Fair is also more efficient than MHRW-Fair. The "Fair" versions of the algorithms count the real bandwidth cost of contacting a previously unseen neighbor, either for sampling (in RW) or to learn its degree (in MHRW), based on our measurements.]
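Computing this distance is straightforward once both the walk-based estimate and the UNI estimate of the degree distribution are expressed over the same support. A small sketch (the smoothing constant eps is an illustrative guard against empty bins, not something specified in the paper):

    import math

    def kl_divergence(p, q, eps=1e-12):
        """KL divergence D(p || q) between two discrete distributions given as
        dicts value -> probability (e.g., degree -> estimated frequency)."""
        support = set(p) | set(q)
        return sum(
            p.get(k, 0.0) * math.log((p.get(k, 0.0) + eps) / (q.get(k, 0.0) + eps))
            for k in support
            if p.get(k, 0.0) > 0.0
        )

    # Example: distance of the RWRW degree estimate from the UNI ground truth:
    #   d = kl_divergence(rwrw_degree_dist, uni_degree_dist)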
3) The choice of metric matters: MCMC is typically used to estimate some feature or metric, i.e., a function of the underlying random variable. The choice of this metric can greatly affect the convergence time. The metrics used in the online diagnostics of Fig. 2 were chosen according to the following principles. We chose the node degree because it is one of the metrics we want to estimate, so we need to ensure that the MCMC has converged at least with respect to it; the distribution of the node degree is also typically heavy-tailed, and thus slow to converge. We also used several additional metrics (e.g., network ID, privacy settings and network membership), which have no necessary relationship to the node degree or to each other, and thus provide additional assurance of convergence.

Let us focus on two of these metrics of interest, namely the node degree and the sizes of geographical networks, and study their convergence in more detail. The results for both metrics and all four methods are shown in Fig. 4. We expected node degrees not to depend strongly on geography, while the relative sizes of geographical networks strongly depend on geography. If our expectation is right, then (i) the degree distribution will converge quickly to a good uniform sample even if the chain mixes poorly and stays in the same region for a long time, whereas (ii) a chain that mixes poorly will take a long time to even reach the networks of interest, let alone produce a reliable network-size estimate. The results presented in the bottom part of Fig. 4 confirm our expectations: e.g., MHRW performs much better when estimating the probability of a node having a given degree than when estimating the probability of a node belonging to a specific regional network.
One MHRW crawl overestimates the size of "New York, NY" by roughly 100%. The probability that a perfectly uniform sample makes such an error (or a larger one) is \sum_{i=i_0}^{n} \binom{n}{i} p^i (1-p)^{n-i}, which is vanishingly small for i_0 = 1K, n = 81K and p = 0.006. Even given such single-chain deviations, however, the multiple-chain average of the MHRW and RWRW crawls provides an excellent estimate of the true population size.

[Fig. 4. Histograms of visits to nodes of a specific degree (left) and in a specific regional network (right), for the four sampling techniques BFS, RW, RWRW and MHRW. We show how often a specific type of node is visited by each of the 28 crawlers ("crawls") and by the uniform UNI sampler ("uniform"), as well as the visit frequency averaged over all 28 crawlers ("average crawl"). "size" denotes the real size of each regional network normalized by the total FB size. We use all 81K nodes visited by each crawl, except the first 6K burn-in nodes. The metrics of interest cover roughly the same number of nodes (about 0.1% to 1%), which allows for a fair comparison.]

B. Unbiased Estimation

This section presents the main results of this paper. First, the MHRW and RWRW methods perform very well: they estimate the two distributions of interest (namely node degree and regional network size) essentially identically to the UNI sampler. Second, the baseline algorithms (BFS and RW) deviate substantively from the truth and lead to misleading estimates.
1) Node degree distribution: In Fig. 5 we present the degree distributions estimated by BFS, RW, RWRW and MHRW. The pdf of the average MHRW crawl, shown in Fig. 5(a), is virtually identical to UNI; moreover, the degree distributions found by each of the 28 chains separately are almost perfect. In contrast, RW and BFS, shown in Fig. 5(b) and (c), introduce a strong bias towards high-degree nodes: for example, low-degree nodes are under-represented by two orders of magnitude. As a result, the estimated average node degree is k ~ 95 for MHRW and UNI, but k ~ 330 for BFS and RW. Interestingly, this bias is almost the same for BFS and RW, although BFS exhibits a much higher variance. Notice also that BFS and RW get wrong not only the parameters but also the shape of the degree distribution, thus leading to qualitatively wrong conclusions. Re-weighting the simple RW corrects for this bias and yields RWRW, which performs almost identically to UNI, as shown in Fig. 5(b). As a side observation, we can also see that the true degree distribution clearly does not follow a power law.

[Fig. 5. Degree distribution (pdf) estimated by the sampling techniques (MHRW - Metropolis-Hastings Random Walk, RW - Random Walk, BFS - Breadth First Search), shown per chain ("28 crawls") and averaged ("average crawl"), against the ground truth (uniform sampler). MHRW and RWRW agree almost perfectly with the UNI sample, while BFS and RW deviate significantly.]

2) Regional networks: Let us now consider a geography-dependent, sensitive metric, i.e., the relative size of regional networks. The results are presented in Fig. 4 (right). BFS performs very poorly, which is expected due to its local coverage. RW also produces biased results, possibly because of a slight positive correlation that we observed between network size and average node degree. In contrast, MHRW and RWRW perform very well.

C. Findings and Practical Recommendations

Choosing between methods. First and most important, the above comparison demonstrates that both MHRW and RWRW succeed in estimating several Facebook properties of interest virtually identically to UNI. In contrast, the commonly used baseline methods (BFS and simple RW) fail, i.e., they deviate significantly from the truth and lead to substantively erroneous estimates. Moreover, the bias of BFS and RW shows up not only when directly estimating node degrees (which was expected), but also for other, seemingly uncorrelated metrics (such as the size of a regional network) that turn out to be correlated with node degree. This makes the case for moving from "1st generation" traversal methods such as BFS, which have been predominantly used by the measurement community so far [5]-[7], to more principled, "2nd generation" sampling techniques whose bias can be analyzed and/or corrected for. The random walks considered in this paper, RW, RWRW and MHRW, are well known in the field of Markov Chain Monte Carlo (MCMC). We apply and adapt these methods to Facebook for the first time, and we demonstrate that, when appropriately used, they perform remarkably well on real-world OSNs.

Adding convergence diagnostics and parallel crawls. A key ingredient of our implementation - not previously employed in network sampling - was the use of formal online convergence diagnostic tests. We tested these on several metrics of interest, within and across chains, showing that convergence was obtained within a reasonable number of iterations.
We believe that such tests can and should be used in field implementations of walk-based sampling methods to ensure that the samples are adequate for subsequent analysis. Another key ingredient of our implementation, which we also recommend, was the use of parallel crawlers/chains, started from several independent, randomly chosen starting points (unlike [9], [16], which use a single starting point); this both improved convergence and decreased the duration of the crawls.

MHRW vs. RWRW. Both MHRW and RWRW performed excellently in practice. When comparing the two, RWRW is slightly more efficient for the applications considered here, consistent with the findings in [9], [16]; this appears to be due to faster mixing in the latter Markov chain, which (unlike the former) does not require large numbers of rejections during the initial sampling process. However, when choosing between the two methods there are additional trade-offs to consider. First, MHRW yields an asymptotically uniform sample, which requires no additional processing for subsequent analysis. By contrast, RWRW samples are heavily biased towards high-degree nodes and require re-weighting before use. For our present purposes, these trade-offs favor MHRW, and we employ it here to produce a uniform, ready-to-use sample of users. However, both approaches are viable alternatives in many settings, so we present and analyze both in this paper.

VI. FACEBOOK CHARACTERIZATION

In this section, we use the unbiased sample of 1M nodes collected through MHRW, and the sub-sample of 37K ego nets, to study some features of Facebook. In contrast to previous work, which focused on particular regions [3], [4] or used larger but potentially biased samples [5], [7], our results are representative of the entire FB graph. Due to lack of space, we outline observations about topological characteristics only and refer the interested reader to our technical report [19] for additional details, as well as for other features (e.g., privacy) omitted here.

Degree distribution. In Fig. 5, we present the node degree distribution of Facebook. Interestingly, and unlike previous studies of crawled datasets of online social networks [5]-[7], we observe that the node degree does not follow a power law. Instead, we can identify [...]
