Summarize the following text and prepare the summary in a Word document.

1. Introduction

Undoubtedly, the world is shrinking into a small village owing to the tangible influence of social media. It connects people from different parts of the world, ages, and nationalities and allows them to share their opinions, experiences, feelings, hobbies, pictures, and videos. This has opened the door for public and private organizations from all domains to promote, benefit, analyze, learn, and improve their organizations based on the data provided in social media. Thus, the significance of social media for academia and industry is quite conspicuous in the amount of research done by these two sectors, seeking answers to pivotal questions.

Social media data is unstructured and appears in different forms, such as text, voice, images, and videos [1]. Moreover, social media provides an enormous amount of continuous real-time data, which makes traditional statistical methods unsuitable for analyzing it [2]. Therefore, data mining techniques can play an important role in overcoming this problem.

Despite the large number of empirical studies on data mining techniques and social media, only a scant number compare data mining techniques in terms of accuracy, performance, and suitability. For instance, it was observed that the accuracy of certain machine learning techniques is calculated in various ways, which makes it difficult to judge the suitability of the data mining techniques.

Many researchers have selected their data mining techniques based solely on expert judgment (A31, A56). A few surveys have been conducted in this area, but without giving full justification for using data mining techniques in social media [3], [4]. However, some studies discussed certain aspects of the data mining techniques used in social media. In [5], Vuori et al. discussed information gathering and knowledge and information sharing through social media for companies. In [6], Rafeeque et al. reviewed the work and challenges related to short text analysis. Akin to this study, Tsytsarau et al. [7] reviewed the development of opinion mining and sentiment analysis, providing a summary of the proposed methods of contradiction analysis. In [8], Gole et al. discussed mining big data in social media and the challenges that arise from big data features such as Volume, Velocity, Variety, Veracity, and Value.

To the best of our knowledge, no previous study systematically concentrates on the data mining techniques implemented in social media research, which triggered the idea of the present survey. The review presented in this paper covers research published from January 1, 2003 to January 7, 2015. The goal of this study is to probe the available articles with regard to: (I) the data mining techniques used to extract social media data, (II) the research areas that require mining data from social media, (III) a comparison between machine learning and non-machine learning data mining techniques, (IV) a comparison between different data mining techniques, and (V) the strengths and weaknesses of the recommended data mining techniques in social media.

This manuscript is divided into five sections. Section 2 explains the implemented methodology. Section 3 describes our findings. Section 4 discusses the limitations of this review. Finally, Section 5 presents our findings, recommendations, and future work.

2. Methodology

In this review, we conducted a survey based on the Systematic Literature Review (SLR) methodology proposed by Kitchenham and Charters [9], which consists of planning, conducting, and reporting phases, each of which consists of several stages. In the planning phase, we created a review protocol consisting of six stages: specifying research questions, designing the search strategy, identifying the study selection procedures, specifying the quality assessment rules, detailing the data extraction strategy, and synthesizing the extracted data. Fig. 1 shows the review protocol stages.

Fig. 1. Review protocol stages.

The research questions were specified based on the objectives of this review. At the next stage, we designed the search strategy, referring to the first stage, to retrieve the required and related articles. We also identified the search terms and the article selection process, which are required for an accurate search. Stage three covered the selection criteria, which specify the inclusion and exclusion rules; we also included more related articles from the references of the articles we used, to enrich our literature resources related to the research questions. Stage four included the quality questions used to filter the related articles. In stage five, we described the extraction strategy used to obtain the data required to answer the research questions. Finally, in the last stage, we identified the methodologies used to synthesize the extracted data.
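
The six planning stages described above can be written down as an ordered structure, which may help when tracking progress through the protocol. This is only a minimal sketch: the names `ProtocolStage` and `next_stage` are illustrative assumptions and are not part of the guidelines of Kitchenham and Charters [9].

```python
from enum import IntEnum
from typing import Optional


class ProtocolStage(IntEnum):
    """Six stages of the review protocol, in the order the text lists them."""
    RESEARCH_QUESTIONS = 1
    SEARCH_STRATEGY = 2
    STUDY_SELECTION = 3
    QUALITY_ASSESSMENT = 4
    DATA_EXTRACTION = 5
    DATA_SYNTHESIS = 6


def next_stage(stage: ProtocolStage) -> Optional[ProtocolStage]:
    """Return the stage that follows `stage`, or None after the last one."""
    if stage is ProtocolStage.DATA_SYNTHESIS:
        return None
    return ProtocolStage(stage + 1)
```

For example, `next_stage(ProtocolStage.STUDY_SELECTION)` gives `ProtocolStage.QUALITY_ASSESSMENT`, matching the order in which stages three and four are described.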

As indicated by Kitchenham and Charters [9], the review protocol is considered a critical element of any SLR. Therefore, to avoid researcher bias and to ensure the quality of the review protocol, the authors held regular meetings.

The following subsections (2.1 Research questions, 2.2 Search strategy, 2.3 Study selection, 2.4 Quality Assessment Rules (QARs), 2.5 Data extraction strategy, and 2.6 Synthesis of extracted data) describe in detail the review protocol followed in this review.

2.1. Research questions

Our main goal in this work is to summarize and provide evidence on the use of data mining techniques in social media. Thus, we identified the following five research questions (RQs):

RQ1: Which data mining techniques have been used in Social Media?

The role of this question is to specify the data mining techniques that were implemented in mining social network data.

RQ2: In which research areas have data mining techniques been applied?

The aim of this question is to identify the domains where the data mining techniques were applied and the research objectives among these domains. The most frequent domain will be identified as well as any new domains suggested.

RQ3: Do machine learning methods perform better than non-machine learning methods in data mining?

RQ3 compares the machine learning and non-machine learning methods implemented in mining social media in terms of accuracy. Few articles compared machine learning with non-machine learning methods. As mentioned in [10], [11], only statistical techniques are considered non-machine learning methods, whereas the other computational techniques are considered machine learning methods.

RQ4: Is there any comparison that has been performed among different data mining techniques?

The aim of RQ4 is to specify the data mining technique with high performance. The results produced by the answer of this question will be considered as evidence of the recommended techniques.

RQ5: What are the strengths and weaknesses of the implemented data mining techniques in social media?

This question will prove the suitable practice of the selected data mining techniques in social media such as text mining, media mining, content-based mining, context-aware mining, graph data mining, and multimedia mining.

2.2. Search strategy

The search strategy that we followed in this survey is explained in detail as follows:

2.2.1. Search terms

To construct the search terms, we used the following procedure [9]:

The main terms were derived from the research questions.

We defined alternative terms that can replace the main terms, such as jargon, alternative spellings, and synonyms.

The top ten data mining algorithms were selected from published papers and books [12], [13].

We used Boolean search operators (AND and OR) to limit the search results, in addition to searching for specific phrases.

We included in our search terms the top ten data mining techniques identified by [12], [13]. Fig. 1 shows the stages of the review protocol.

The search terms used to retrieve the related publications are as follows. Note that several different search terms were used in order to retrieve more related publications. The last search was conducted on January 9, 2015.

data mining AND techniques OR technique AND social media.

data mining AND machine learning AND social media.

social media AND fuzzy AND data mining.

social media OR social network AND (C4.5 OR J48 OR K-Means OR SVM OR support vector machines OR Apriori OR EM OR expectation maximization OR PageRank OR AdaBoost OR KNN OR k-NN OR k-nearest neighbors OR Naive Bayes OR CART).
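As a rough illustration of how such a search string can be assembled, the sketch below rebuilds the fourth search term from two lists of synonyms. The helper names (`quote`, `any_of`, `all_of`) are hypothetical, and the explicit parentheses and the quoting of multi-word phrases are assumptions; each digital library has its own query syntax.

```python
# Terms copied from the fourth search string above.
PLATFORM_TERMS = ["social media", "social network"]
TECHNIQUE_TERMS = [
    "C4.5", "J48", "K-Means", "SVM", "support vector machines", "Apriori",
    "EM", "expectation maximization", "PageRank", "AdaBoost", "KNN", "k-NN",
    "k-nearest neighbors", "Naive Bayes", "CART",
]


def quote(term):
    """Wrap multi-word phrases in quotes so they are searched as a phrase (assumed syntax)."""
    return f'"{term}"' if " " in term else term


def any_of(terms):
    """Join synonyms with OR and parenthesize the group."""
    return "(" + " OR ".join(quote(t) for t in terms) + ")"


def all_of(*groups):
    """Require every group to match by joining them with AND."""
    return " AND ".join(groups)


query = all_of(any_of(PLATFORM_TERMS), any_of(TECHNIQUE_TERMS))
print(query)
```

Grouping each set of synonyms in parentheses makes the intended precedence explicit, which the flat strings above leave to the search engine.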

2.2.2. Survey resources

The following digital libraries were searched for the required articles:

IEEE Xplore

Google Scholar

Science Direct

ACM Digital Library

Computing Research Repository

Web of Science

SPIE

The first search process covered journals and Tier I social network related conferences, such as the International Conference on Advances in Social Networks Analysis and Mining (ASONAM), the ACM Conference on Online Social Networks (COSN), the International World Wide Web Conference (WWW), and the International Conference on Data Engineering (ICDE), from the digital libraries listed above. The search terms were matched against any part of the articles (metadata), and the search was restricted to articles published between January 2003 and 2015, because the most popular social networks (Facebook, Twitter, LinkedIn, and MySpace) began after 2002 [14].

2.2.3. Search phases

We used the specified search terms to retrieve the primary related articles from these digital libraries. Moreover, a quick scan of the references of the selected papers helped to enrich the resources used to answer the research questions. The inclusion criteria are explained in detail in Section 2.3.

The Google document platform was used to share and manage the search results and documents among the authors. Based on the inclusion criteria, 147 relevant publications were chosen as candidate publications: 83 journal papers and 64 conference papers. Fig. 2 illustrates the breakdown of the identified articles at each search and selection phase.

Fig. 2. Search and selection process.

2.3. Study selection

We obtained 1187 articles in the first search process. Because many articles did not provide sufficient information to answer the research questions, we performed another filtration step (see Fig. 2).

The filtration process was conducted individually by the authors and the results were discussed in scheduled meetings to ensure the accuracy and to resolve any differences. The selection and filtration steps are explained below:

Step 1: remove the duplicated articles obtained by authors and/or different libraries.

Step 2: apply inclusion and exclusion criteria to the candidate papers to avoid any irrelevant articles.

Step 3: apply the quality assessment rules to include the qualified articles that give the best answers to the research questions.

Step 4: search for additional related articles from the article references obtained from step 3 and repeat step 3 on the extra articles.
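Step 1 above can be sketched as follows. This is only an illustration: the review does not say how duplicates were detected, so matching on a normalized title is an assumption, and the record titles and library labels are invented placeholders.

```python
import re


def title_key(title):
    """Normalize a title so trivial formatting differences do not hide duplicates."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def remove_duplicates(records):
    """Keep the first record seen for each normalized title (assumed matching rule)."""
    seen = set()
    unique = []
    for record in records:
        key = title_key(record["title"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Placeholder records, not taken from the review.
records = [
    {"title": "Mining Social Media: A Survey", "library": "Library A"},
    {"title": "mining social media - a survey", "library": "Library B"},
    {"title": "Sentiment Analysis of Microblogs", "library": "Library A"},
]
unique = remove_duplicates(records)
print(len(unique))
```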

The inclusion and exclusion criteria applied in this survey are defined below:

Inclusion criteria:

Use data mining techniques in social media.

Use machine learning and non-machine learning data mining techniques in social media.

Comparative studies that compare among data mining techniques.

Comparative studies that compare between data mining and non-data mining techniques.

Consider the latest edition of the article of the same research (if different versions are available).

Consider only articles published between January 2003 and 2015.

Exclusion criteria:

Exclude articles that include data mining that is not related to social media.

Exclude articles that do not include data mining but are related to social media.

Exclude articles that are not journal or conference papers.
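The criteria above can be read as a simple predicate over each candidate article. The following is a minimal sketch under stated assumptions: the `Candidate` fields are hypothetical, the topical judgments are reduced to two booleans that a reviewer would fill in by hand, and the date check works at year granularity only.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    title: str
    year: int
    venue_type: str           # e.g. "journal" or "conference" (hypothetical labels)
    uses_data_mining: bool    # reviewer judgment
    about_social_media: bool  # reviewer judgment


def is_included(c):
    """Apply the period, venue, and topic rules listed above."""
    in_period = 2003 <= c.year <= 2015
    peer_reviewed_venue = c.venue_type in {"journal", "conference"}
    on_topic = c.uses_data_mining and c.about_social_media
    return in_period and peer_reviewed_venue and on_topic
```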

Finally, after applying all filtration steps, 66 articles were considered as the resources for this review. The selected articles are listed in Appendix (A), Table A1.

2.4. Quality Assessment Rules (QARs)

The QARs were applied in the selected studies to evaluate article suitability in accordance with the research questions. Ten QARs were identified, and each one is worth 1 mark out of 10. Each QAR is scored as follows: fully answered=1, above average=0.75, average=0.5, below average=0.25, not answered=0. The overall score of the article will be the summation of the marks obtained for the 10 QARs. If the result was 5 or higher, the article was considered; otherwise it was excluded.

QAR1: Are the research objectives clearly defined?

QAR2: Is the data mining background clearly addressed?

QAR3: Are the data mining techniques used clearly defined?

QAR4: Is the design of the experiment suitable and acceptable?

QAR5: Is the study performed on sufficient social media data?

QAR6: Is the data mining technique measured and reported?

QAR7: Is the proposed data mining technique compared with other techniques?

QAR8: Are the conclusions of the experiment clearly identified and reported?

QAR9: Are the methods used to analyze the results appropriate?

QAR10: Does the experiment enrich academia or industry?

The scores that resulted from applying the QARs on the selected articles are shown in Appendix (A), Table A2.
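The scoring rule described in this section can be sketched as follows. The function names are hypothetical; the allowed marks, the ten-rule total, and the threshold of 5 are taken from the text above.

```python
ALLOWED_MARKS = {0, 0.25, 0.5, 0.75, 1}
NUM_QARS = 10
PASS_MARK = 5


def qar_score(marks):
    """Sum the marks an article received for the ten QARs."""
    if len(marks) != NUM_QARS:
        raise ValueError(f"expected {NUM_QARS} marks, got {len(marks)}")
    if any(m not in ALLOWED_MARKS for m in marks):
        raise ValueError("each mark must be 0, 0.25, 0.5, 0.75, or 1")
    return sum(marks)


def is_qualified(marks):
    """An article is kept when its overall score is 5 or higher."""
    return qar_score(marks) >= PASS_MARK


# Made-up marks for two imaginary articles.
print(is_qualified([1, 1, 0.5, 0.5, 0.75, 0.25, 0.5, 0.5, 0, 0]))
print(is_qualified([0.5] * 9 + [0.25]))
```

Because every mark is a multiple of 0.25, the sums are exact in floating point and the comparison with the threshold is safe.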

2.5. Data extraction strategy

In this stage, we explored the articles selected to extract the information required to answer the research questions. Therefore, we have designed an extraction form (see Table 1) to extract the needed data [9].

Table 1. Data extraction form.

Article ID

Data extractor

Data checker

Publication year

Authors

Article source

Article title

Article type

Domain

RQ1

RQ2

RQ3

RQ4

RQ5

Based on the extraction form, two authors played the role of extraction and checking. In case of a disagreement between the extractor and checker, group meetings were conducted between all authors to resolve any issue.

Some difficulties occurred during the extraction process. For instance, different terminology was used for the same data mining technique: the C4.5 algorithm also appears under the name J48 [15], which is the name of its implementation in the WEKA tool commonly used by researchers (A26). Moreover, some articles used different names and abbreviations for the same technique, such as KNN, K-NN, and Nearest Neighbor (A12, A34), or Naïve Bayes, Naive Bayes, and NB (A2, A37). Furthermore, many researchers compared their techniques with other common techniques without naming those techniques or, where they were named, without giving the reason for picking them (A31, A42, A53, A55).
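One way to handle such naming differences during extraction is a small alias table that maps each variant to a single canonical label. The sketch below is illustrative only: the review does not describe using such a table, and the `ALIASES` mapping and `canonical` helper are hypothetical, covering just the variants mentioned above.

```python
# Lower-cased variant -> canonical label (only the variants named in the text).
ALIASES = {
    "c4.5": "C4.5",
    "j48": "C4.5",
    "knn": "k-NN",
    "k-nn": "k-NN",
    "nearest neighbor": "k-NN",
    "naive bayes": "Naive Bayes",
    "naïve bayes": "Naive Bayes",
    "nb": "Naive Bayes",
}


def canonical(name):
    """Return the canonical technique label, or the cleaned input if it is unknown."""
    key = " ".join(name.lower().split())
    return ALIASES.get(key, name.strip())


print(canonical("J48"), canonical("K-NN"), canonical("NB"))
```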

Not all selected articles answered all five RQs. Appendix (A), Table A3 illustrates the RQs that were answered by each selected study.

2.6. Synthesis of extracted data

To synthesize the data extracted from the selected articles, we used different procedures to aggregate evidence that will answer the RQs. The following explains the synthesis procedure we followed in detail:

For RQ1 and RQ2, we used the narrative synthesis method [9], where the extracted information was tabulated according to RQ1 and RQ2.

For the quantitative data extracted for RQ3 and RQ4, which came from articles that calculate accuracy in different ways, we used binary outcomes to express the results in a comparable form [9].

In RQ5, the strengths and weaknesses of the data mining techniques have the same meaning but are written in different ways. Therefore, to unify these points, we followed the reciprocal translation method [9] which is considered as one of the techniques that can be used for synthesizing the qualitative data.

3. Results and discussion

In this section, we discuss the results obtained from this review. We first give an overview of the selected articles. The result of each RQ is then discussed in detail in the next five subsections: 3.1 Types of data mining techniques (RQ1), 3.2 Data mining techniques research areas (RQ2), 3.3 Machine learning versus non-machine learning methods in mining social media data (RQ3), 3.4 Data mining techniques versus other data mining techniques (RQ4), and 3.5 Strengths and weaknesses of data mining techniques (RQ5).

The total number of the selected studies was 66 articles (see Appendix (A), Table A4) that implemented data mining techniques used in social media. The selected articles were retrieved only from journals published between January 2003 and 2015. Appendix (A), Table A4 shows the number of articles and the percentage grouped by publisher name. The types of articles considered in this survey are: experiment, case study, and survey. Table 2 shows the distribution of the selected articles among the three types.

Table 2. Selected articles' types distribution.

| Article type | Freq. |
|---|---|
| Case study | 4 |
| Experiment | 60 |
| Survey | 2 |
| Grand total | 66 |

With regard to the quality of the selected articles, we applied the quality assessment criteria to screen the candidate articles based on the marks gained. Only the articles with a score of five or greater (out of ten) were taken into consideration (see Table 3).

Table 3. Candidate articles' quality distribution.

| Classification criteria | Freq. | % |
|---|---|---|
| Between 0 and 2.5 | 53 | 36 |
| Between 2.75 and 4.75 | 28 | 19 |
| Between 5 and 6.75 | 35 | 24 |
| Between 7 and 8.5 | 22 | 15 |
| Between 8.75 and 10 | 9 | 6 |
| Grand total | 147 | 100 |
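The percentages in Table 3, and the link between the 147 candidates and the 66 selected articles, can be checked with a few lines of arithmetic. The band labels below are shorthand for the table rows; rounding to the nearest whole percent is an assumption about how the published figures were produced.

```python
# Frequencies copied from Table 3 (score band -> number of candidate articles).
bands = {
    "0-2.5": 53,
    "2.75-4.75": 28,
    "5-6.75": 35,
    "7-8.5": 22,
    "8.75-10": 9,
}

total = sum(bands.values())
percentages = {band: round(100 * n / total) for band, n in bands.items()}

# Articles scoring 5 or higher pass the quality threshold.
accepted = bands["5-6.75"] + bands["7-8.5"] + bands["8.75-10"]

print(total, percentages, accepted)
```

The three bands at or above the threshold add up to the 66 articles retained for the review.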

3.1. Types of data mining techniques (RQ1)

We identified 19 data mining techniques that had been applied by researchers in the area of social media. The list of these techniques is below.

AdaBoost

Artificial Neural Network (ANN)

Apriori

Bayesian Networks (BN)

Decision Trees (DT)

Density Based Algorithm (DBA)

Fuzzy

Genetic Algorithm (GA)

Hierarchical Clustering (HC)

K-Means

k-nearest Neighbors (k-NN)

Linear Discriminant Analysis (LDA)

Linear-Regression (Lin-R)

Logistic Regression (LR)

Markov

Maximum Entropy (ME)

Novel

Support Vector Machine (SVM)

Wrapper

Fig. 3 shows that SVM, BN, and DT are the most applied techniques in the area of social media, together accounting for 51% of the technique uses recorded in the selected articles. Novel techniques, at 9%, were not counted among the most applied because each article proposes its own dedicated novel technique. Table 4 gives detailed information about the frequencies of the data mining techniques used by the selected articles in this review.

Fig. 3. Data mining techniques among selected papers.

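Table 4. Data mining techniques frequencies among articles.

| Technique | Frequency | Technique | Frequency |
|---|---|---|---|
| AdaBoost | 2 | k-NN | 9 |
| ANN | 8 | LDA | 9 |
| Apriori | 1 | Lin-R | 1 |
| BN | 26 | LR | 4 |
| DT | 11 | Markov | 1 |
| DBA | 3 | ME | 2 |
| Fuzzy | 1 | Novel | 12 |
| GA | 1 | SVM | 29 |
| HC | 2 | Wrapper | 1 |
| K-Means | 6 | | |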
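The shares quoted for SVM, BN, DT, and the novel techniques follow from the frequencies in Table 4. The sketch below recomputes them, assuming the percentages are taken over the total number of technique uses (an article may use more than one technique) and rounded to the nearest whole percent.

```python
# Frequencies copied from Table 4.
frequencies = {
    "AdaBoost": 2, "ANN": 8, "Apriori": 1, "BN": 26, "DT": 11, "DBA": 3,
    "Fuzzy": 1, "GA": 1, "HC": 2, "K-Means": 6, "k-NN": 9, "LDA": 9,
    "Lin-R": 1, "LR": 4, "Markov": 1, "ME": 2, "Novel": 12, "SVM": 29,
    "Wrapper": 1,
}

total_uses = sum(frequencies.values())
top_three = frequencies["SVM"] + frequencies["BN"] + frequencies["DT"]

top_three_share = round(100 * top_three / total_uses)
novel_share = round(100 * frequencies["Novel"] / total_uses)

print(total_uses, top_three_share, novel_share)
```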

Appendix (A), Fig. A1 illustrates the distribution of the data mining techniques per year during the considered period. The figure shows that the number of data mining techniques adopted by researchers in the social media area peaked in 2012 and 2014, with 39 and 35 techniques respectively, and dropped to 24 techniques in 2013. Moreover, it is worth mentioning that 12 novel techniques were proposed between 2012 and early 2015.

3.2. Data mining techniques research areas (RQ2)

From the selected articles, we identified six general domains which applied various techniques in nine different research areas to mine the flow of big data gathered from social media. The list of these domains follows:

Business and Management (BM)

Education (EDU)

Finance (FIN)

Government and Public (GP)

Medical and Health (MH)

Social Networks (SN)

Fig. 4 shows that social networks and business and management were the domains in which data mining techniques were most actively applied, together accounting for 79% among all domains. Government and public, with 9%, was the third most active domain. Appendix (A), Table A5 gives detailed information about all domains.

Fig. 4. Domains among articles.

To analyze Table 2 further, we investigated the experiments reported in the selected articles and plotted Fig. 5, which shows the popularity of the various types of social media applications in research. Some experiments mined and analyzed data from one or more social media applications. Microblogging applications such as Twitter were the most popular among researchers, with 31 experiments, followed by social networks such as Facebook, with 12 experiments. Appendix (A), Table A6 gives detailed information about the frequencies of the social media applications used by the selected articles in this review.

Fig. 5. Popularity of various social media applications in research.

Fig. 6 illustrates the distribution per year of the domains applying data mining techniques. The figure shows that the number of publications peaked in 2012 and 2014, with 19 articles in 5 domains in each of these years. In 2013, the number went down to 12 articles in 5 domains. Social network data analysis remained the most active domain throughout the considered period.

Fig. 6. Domains distribution per year.

Among the selected articles, we identified 9 active research objectives that adopted data mining techniques. The list of these research objectives follows:

Biometric

Content Analysis

Cyber Crime

Disease Awareness

Geolocating

Quality Improvement

Risk Management

Semantic Analysis

Sentiment Analysis

Fig. 7 illustrates the distribution of these research areas. Sentiment analysis and quality improvement were the most active areas among the articles, with frequencies of 21 and 14 respectively.

Fig. 7. Research objectives among domains.

3.3. Machine learning versus non-machine learning methods in mining social media data (RQ3)

Data mining is the process of extracting hidden knowledge from data [16]. This can be done with machine learning methods such as KNN, K-Means, and SVM. Statistical methods, which are also used to discover patterns, are in some cases considered non-machine learning methods. As Berson et al. mentioned [11], statistical techniques are driven by the data and are used to discover patterns and build predictive models.

Out of the 66 papers identified, only three contain either experimental or theoretical knowledge about non-machine learning methods. Two of these papers (A11, A19) integrated non-machine learning methods with machine learning methods to improve the results of their proposed solutions. The third paper (A53) stated that text mining techniques that depend on machine learning methods differ from non-machine learning methods in three ways: (i) in traditional quantitative analysis methods, conclusions are derived from a sample of the population, whereas machine learning methods allow the researcher to derive conclusions from the entire population; (ii) traditional quantitative methods require the researcher to analyze the data using a theoretical platform, while machine learning methods give the researcher the ability to extract the actual meaning of the mined data contained in natural language text; and (iii) machine learning methods investigate the textual data without human interaction, whereas traditional quantitative methods need the researcher to interpret the data before analyzing it.

However, we disagree with the authors of paper A53 because the definition of data mining consists of three concepts [17]: Statistics, Data (Big or Small), and Machine Learning and Lifting. Thus, data mining includes all statistics (Descriptive and non-inferential parts of the classical statistics) and Exploratory Data Analysis (EDA) for the data using the power of computers for the purpose of lifting and learning the patterns of the data [17].

Consequently, machine learning data mining techniques and non-machine learning data mining techniques, such as the traditional quantitative methods of statistics, are complementary to each other in data mining.

3.4. Data mining techniques versus other data mining techniques (RQ4)

This RQ compares the different data mining techniques that have been used in the selected articles. Since most of the articles based their findings either on weak statistical analysis or on no statistics at all, we built our comparison on the authors' own judgments, which relied on the experiments they conducted or on the references they cited. For instance, papers (A31, A53) indicate that the SVM technique is one of the best categorization and feature selection techniques available, relying on references published in 1998 and 2003; however, the paper was published in 2013. Further details are provided in Section 5.

After reviewing the selected papers, we found that many of them report common findings on the same data mining techniques. For instance, papers (A31, A45, A53, A59) found that SVM outperforms other techniques such as Naive Bayes. In contrast, papers (A41, A51) claimed that Naive Bayes and MLP performed better than SVM. Some other papers (A3, A20, A35) claimed that K-Means performed better than other techniques such as C4.5. Finally, (A42, A60) found that the DBA technique outperforms other techniques in terms of working with noisy data.
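The SVM versus Naive Bayes findings reported in this subsection can be tallied as binary outcomes, in the spirit of the synthesis approach described in Section 2.6. The sketch below is a simplification: it assumes each cited paper contributes exactly one SVM versus Naive Bayes comparison, which the text does not confirm, and the `outcomes` structure is hypothetical.

```python
from collections import Counter

# (article ID, technique reported as better, technique reported as worse)
# Article IDs come from the paragraph above; one comparison per paper is assumed.
outcomes = [
    ("A31", "SVM", "Naive Bayes"),
    ("A45", "SVM", "Naive Bayes"),
    ("A53", "SVM", "Naive Bayes"),
    ("A59", "SVM", "Naive Bayes"),
    ("A41", "Naive Bayes", "SVM"),
    ("A51", "Naive Bayes", "SVM"),
]

# Count how many papers favor each technique.
wins = Counter(winner for _, winner, _ in outcomes)
print(wins.most_common())
```

A tally like this only counts reported directions; it says nothing about the size of the accuracy differences or the quality of the underlying experiments.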

3.5. Strengths and weaknesses of data mining techniques (RQ5)

This part of the review represents a good source of information where the best practices of the primary data mining techniques could be implemented. Table 5 summarizes the data mining techniques that could be implemented in the social media area. In addition to the traditional data mining techniques, Appendix (A), Table A7, summarizes the description and the main features of the novel techniques proposed by the researchers.
