Question
Summarize the following text in one paragraph and prepare it in a Word document.
1. Introduction
Undoubtedly, the world is shrinking into a small village owing to the tangible influence of social media. It connects people from different parts of the world, ages, and nationalities and allows them to share their opinions, experiences, feelings, hobbies, pictures, and videos. This has opened the door for public and private organizations from all domains to promote, benefit, analyze, learn, and improve their organizations based on the data provided in social media. Thus, the significance of social media for academia and industry is quite conspicuous in the amount of research done by these two sectors, seeking answers to pivotal questions.
Social media data is unstructured and appears in different forms such as text, voice, images, and videos [1]. Moreover, social media provides an enormous amount of continuous real-time data, which makes traditional statistical methods unsuitable for analyzing it [2]. Therefore, data mining techniques can play an important role in overcoming this problem.
Despite the large body of empirical research on data mining techniques and social media, only a scant number of studies compare data mining techniques in terms of accuracy, performance, and suitability. For instance, it was observed that the accuracy of certain machine learning techniques is calculated by various methods, which makes it difficult to judge the suitability of the data mining techniques.
Many researchers have selected their data mining techniques based solely on expert judgment (A31, A56). Few surveys have been conducted in this area, and those did not fully justify the use of data mining techniques in social media [3], [4]. However, some studies have discussed particular aspects of the data mining techniques used in social media. In [5], Vuori et al. discussed information gathering and knowledge and information sharing through social media for companies. In [6], Rafeeque et al. reviewed the work and challenges related to short text analysis. Similarly, in [7], Tsytsarau et al. reviewed the development of opinion mining and sentiment analysis, providing a summary of the proposed methods of contradiction analysis. In [8], Gole et al. discussed mining big data in social media and the challenges that result from big data features such as Volume, Velocity, Variety, Veracity, and Value.
To the best of our knowledge, there is no previous study that systematically concentrates on the data mining techniques implemented in social media research, which has triggered the idea of the present survey. The review presented in this paper discusses the research published in the period from January 1, 2003 to January 7, 2015. The goal of this study is to probe the available articles with regard to: (I) the data mining techniques used to extract social media data, (II) the research areas that require mining data from social media, (III) a comparison between machine learning and non-machine learning data mining techniques, (IV) a comparison between different data mining techniques, and (V) the strengths and weaknesses of the recommended data mining techniques in social media.
This manuscript is divided into five sections. Section 2 explains the implemented methodology. Section 3 describes our findings. Section 4 discusses the limitation of this review. Finally, Section 5 presents our findings, recommendations, and future work.
2. Methodology
In this review, we conducted a survey based on the Systematic Literature Review (SLR) methodology proposed by Kitchenham and Charters [9], which consists of planning, conducting, and reporting phases, where each phase consists of several stages. In the planning phase, we created a review protocol which consists of six stages: specifying research questions, designing the search strategy, identifying the study selection procedures, specifying the quality assessment rules, detailing the data extraction strategy, and synthesizing the extracted data. Fig. 1 shows the review protocol stages.
Fig. 1. Review protocol stages.
The research questions have been specified based on the objectives of this review. At the next stage, we designed the search strategy referring to the first stage to retrieve the required and related articles. We also identified the search terms and article selection process, which is required for an accurate search. Stage three covered the selection criteria which specify the inclusion and exclusion rules; we also included more related articles from the references in the articles we used to enrich our literature resources related to the research questions. Stage four included the quality questions to filter the related articles. In stage five, we described the extraction strategy used to obtain the required data which could answer the research questions. Finally, in the last stage, we identified the methodologies used to synthesize the extracted data.
As indicated by Kitchenham and Charters [9], the review protocol is considered to be a critical element of any SLR. Therefore, to avoid researcher bias and to ensure the quality of the review protocol, regular meetings were held among the authors.
The following subsections (2.1 Research questions, 2.2 Search strategy, 2.3 Study selection, 2.4 Quality Assessment Rules (QARs), 2.5 Data extraction strategy, and 2.6 Synthesis of extracted data) describe in detail the review protocol followed in this review.
2.1. Research questions
Summarizing and providing evidence of implementing the data mining techniques in social media is our main goal in this work. Thus, we identified the following five research questions (RQs):
1. RQ1: Which data mining techniques have been used in social media?
The role of this question is to specify the data mining techniques that were implemented in mining social network data.
2. RQ2: In which research areas have data mining techniques been applied?
The aim of this question is to identify the domains where the data mining techniques were applied and the research objectives among these domains. The most frequent domain will be identified as well as any new domains suggested.
3. RQ3: Does machine learning perform better than non-machine learning in data mining techniques?
RQ3 compares machine learning and non-machine learning methods implemented in mining social media in terms of accuracy. Few articles made a comparison between machine learning and non-machine learning methods. As mentioned in [10], [11], only statistical techniques were considered as non-machine learning, whereas the other computational techniques are considered as machine learning methods.
4. RQ4: Has any comparison been performed among different data mining techniques?
The aim of RQ4 is to identify the data mining technique with the highest performance. The answer to this question will be considered as evidence for the recommended techniques.
5. RQ5: What are the strengths and weaknesses of the implemented data mining techniques in social media?
This question will establish the suitable practice of the selected data mining techniques in social media, such as text mining, media mining, content-based mining, context-aware mining, graph data mining, and multimedia mining.
2.2. Search strategy
The search strategy that we followed in this survey is explained in detail as follows:
2.2.1. Search terms
To construct the search terms we followed the following procedure [9]:
1. The main terms were derived from the research questions.
2. We defined new terms that can replace the main terms, such as jargon, alternative spellings, and synonyms.
3. The top ten data mining algorithms were selected from published papers and books [12], [13].
4. We used Boolean search operators (ANDs and ORs) to limit the search results, in addition to searching for specific phrases.
We included in our search terms the top ten data mining techniques identified by [12], [13]. Fig. 1 shows the stages of the review protocol.
The search terms used to retrieve the related publications are as follows. Note that different search terms were used to retrieve more related publications. The last search was conducted on January 9, 2015.
data mining AND techniques OR technique AND social media.
data mining AND machine learning AND social media.
social media AND fuzzy AND data mining.
social media OR social network AND (C4.5 OR J48 OR K-Means OR SVM OR support vector machines OR Apriori OR EM OR expectation maximization OR PageRank OR AdaBoost OR KNN OR k-NN OR k-nearest neighbors OR Naive Bayes OR CART).
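The search strings above mix AND and OR, and digital libraries do not all resolve that mix the same way. As a minimal sketch, the fourth search term could be assembled programmatically with explicit grouping. The helper names `any_of` and `all_of` are hypothetical, and grouping the two platform terms in parentheses is an assumption, since the source does not state the intended precedence.

```python
# Sketch only: any_of/all_of are hypothetical helpers, not part of any library.
# Assumption: "social media OR social network" is meant as one OR-group.

def any_of(terms):
    """Join terms with OR inside parentheses, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"

def all_of(groups):
    """Join already-grouped expressions with AND."""
    return " AND ".join(groups)

platforms = ["social media", "social network"]
algorithms = [
    "C4.5", "J48", "K-Means", "SVM", "support vector machines",
    "Apriori", "EM", "expectation maximization", "PageRank", "AdaBoost",
    "KNN", "k-NN", "k-nearest neighbors", "Naive Bayes", "CART",
]

query = all_of([any_of(platforms), any_of(algorithms)])
print(query)
```

Keeping the term lists as data makes it easier to record exactly which string was submitted to each library, which helps when the search has to be repeated.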
2.2.2. Survey resources
The following digital libraries were searched for the required articles:
IEEE Xplore
Google Scholar
Science Direct
ACM Digital Library
Computing Research Repository
Web of Science
SPIE
The first search process included journals, and Tier I social network related conferences, such as International Conference on Advances in Social Networks Analysis and Mining (ASONAM), ACM Conference on Online Social Networks (COSN), International World Wide Web Conference (WWW), and International Conference on Data Engineering (ICDE), from the above mentioned digital libraries. The search terms considered cover any part of the articles (metadata) and were restricted to articles published between January, 2003 and 2015, because the most popular social networks (Facebook, Twitter, LinkedIn, and MySpace) began after 2002 [14].
2.2.3. Search phases
We used the specified search terms to retrieve the primary related articles from these digital libraries. Moreover, a quick scan of the references of the selected papers helped to enrich the resources used to answer the research questions. The inclusion criteria are explained in detail in Section 2.3.
The Google document platform was used to share and manage the search results and documents among authors. Based on the inclusion criteria, 147 relevant publications were chosen as candidate publications: 83 journal papers and 64 conference papers. Fig. 2 illustrates the breakdown of the identified articles at each search and selection phase.
Fig. 2. Search and selection process.
2.3. Study selection
We obtained 1187 articles in the first search process. Because many articles did not provide sufficient information to answer the research questions, we performed another filtration step (see Fig. 2).
The filtration process was conducted individually by the authors and the results were discussed in scheduled meetings to ensure the accuracy and to resolve any differences. The selection and filtration steps are explained below:
1. Step 1: remove the duplicated articles obtained by authors and/or different libraries.
2. Step 2: apply inclusion and exclusion criteria to the candidate papers to avoid any irrelevant articles.
3. Step 3: apply the quality assessment rules to include the qualified articles that give the best answers to the research questions.
4. Step 4: search for additional related articles from the article references obtained from step 3 and repeat step 3 on the extra articles.
The inclusion and exclusion criteria applied in this survey are defined below:
Inclusion criteria:
Use data mining techniques in social media.
Use machine learning and non-machine learning data mining techniques in social media.
Comparative studies that compare among data mining techniques.
Comparative studies that compare between data mining and non-data mining techniques.
Consider the latest edition of the article of the same research (if different versions are available).
Consider only articles published between January 2003 and 2015.
Exclusion criteria:
Exclude articles that include data mining that is not related to social media.
Exclude articles that do not include data mining but are related to social media.
Exclude non-journal and non-conference articles.
Finally, after applying all filtration steps, 66 articles were considered as the resources for this review. The selected articles are listed in Appendix (A), Table A1.
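Steps 1 and 2 of the filtration can be illustrated with a short sketch. Everything here is an assumption made for illustration: the `Article` fields, the title-based duplicate key, and the sample records are hypothetical and do not come from the review, which performed this screening manually.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Article:
    # Hypothetical record layout for a candidate publication.
    title: str
    year: int
    venue_type: str           # e.g. "journal", "conference", "blog"
    uses_data_mining: bool
    about_social_media: bool

def deduplicate(articles):
    """Step 1: drop repeats, matching on a case- and whitespace-normalized title."""
    seen, unique = set(), []
    for a in articles:
        key = " ".join(a.title.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(a)
    return unique

def is_included(a):
    """Step 2: inclusion and exclusion criteria expressed as one predicate."""
    return (
        a.uses_data_mining
        and a.about_social_media
        and a.venue_type in {"journal", "conference"}
        and 2003 <= a.year <= 2015
    )

def select(articles):
    return [a for a in deduplicate(articles) if is_included(a)]

# Made-up sample records, for illustration only.
candidates = [
    Article("Mining Tweets with SVM", 2013, "journal", True, True),
    Article("mining tweets with  SVM", 2013, "journal", True, True),   # duplicate
    Article("Clustering Sensor Data", 2012, "journal", True, False),   # not social media
    Article("Social Media Usage Survey", 2014, "conference", False, True),  # no data mining
    Article("Early Forum Mining", 2001, "conference", True, True),     # outside period
    Article("Blog Post on Twitter Mining", 2014, "blog", True, True),  # wrong venue type
]
selected = select(candidates)
print([a.title for a in selected])
```

Steps 3 and 4 (quality assessment and reference scanning) are left out of this sketch because they depend on reviewer judgment.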
2.4. Quality Assessment Rules (QARs)
The QARs were applied in the selected studies to evaluate article suitability in accordance with the research questions. Ten QARs were identified, and each one is worth 1 mark out of 10. Each QAR is scored as follows: fully answered=1, above average=0.75, average=0.5, below average=0.25, not answered=0. The overall score of the article will be the summation of the marks obtained for the 10 QARs. If the result was 5 or higher, the article was considered; otherwise it was excluded.
1. QAR1: Are the research objectives clearly defined?
2. QAR2: Is the data mining background clearly addressed?
3. QAR3: Are the data mining techniques used clearly defined?
4. QAR4: Is the design of the experiment suitable and acceptable?
5. QAR5: Is the study performed on sufficient social media data?
6. QAR6: Is the data mining technique measured and reported?
7. QAR7: Is the proposed data mining technique compared with other techniques?
8. QAR8: Are the conclusions of the experiment clearly identified and reported?
9. QAR9: Are the methods used to analyze the results appropriate?
10. QAR10: Does the experiment enrich academia or industry?
The scores that resulted from applying the QARs on the selected articles are shown in Appendix (A), Table A2.
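The scoring rule described in this section can be written down in a few lines. This is a minimal sketch: the function names and the example marks are hypothetical, while the five mark levels, the ten rules, and the threshold of 5 are taken from the text above.

```python
# Mark levels from the text: fully answered, above average, average,
# below average, not answered.
ALLOWED_MARKS = {1, 0.75, 0.5, 0.25, 0}
NUM_QARS = 10
PASS_THRESHOLD = 5

def qar_score(marks):
    """Sum the marks for the ten QARs after validating them."""
    if len(marks) != NUM_QARS:
        raise ValueError(f"expected {NUM_QARS} marks, got {len(marks)}")
    for m in marks:
        if m not in ALLOWED_MARKS:
            raise ValueError(f"invalid mark: {m}")
    return sum(marks)

def is_accepted(marks):
    """An article is kept when its total is 5 or higher."""
    return qar_score(marks) >= PASS_THRESHOLD

# Made-up marks for two hypothetical articles.
borderline = [1, 0.75, 0.5, 0.5, 0.25, 1, 0, 0.5, 0.25, 0.25]
weak = [0.25] * 10

print(qar_score(borderline), is_accepted(borderline))
print(qar_score(weak), is_accepted(weak))
```

Because every mark is a multiple of 0.25, the floating-point sums are exact, so comparing the total against the threshold is safe here.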
2.5. Data extraction strategy
In this stage, we explored the articles selected to extract the information required to answer the research questions. Therefore, we have designed an extraction form (see Table 1) to extract the needed data [9].
Table 1. Data extraction form.
Article ID
Data extractor
Data checker
Publication year
Authors
Article source
Article title
Article type
Domain
RQ1
RQ2
RQ3
RQ4
RQ5
Based on the extraction form, two authors played the role of extraction and checking. In case of a disagreement between the extractor and checker, group meetings were conducted between all authors to resolve any issue.
Some difficulties occurred during the extraction process. For instance, different terminology was used for the same data mining technique: C4.5 and J48 refer to the same technique [15], and the WEKA tool (which is commonly used by researchers) uses the name J48 for its implementation of C4.5 (A26). Moreover, some articles used different names or abbreviations for the same technique, such as KNN, K-NN, and Nearest Neighbor (A12, A34), or Naïve Bayes, Naive Bayes, and NB (A2, A37). Furthermore, many researchers compared their techniques with other common techniques without naming those techniques or, where they were named, without giving the reason for picking them (A31, A42, A53, A55).
Not all selected articles answered all five RQs. Appendix (A), Table A3 shows the RQs that were answered by each selected study.
2.6. Synthesis of extracted data
To synthesize the data extracted from the selected articles, we used different procedures to aggregate evidence that will answer the RQs. The following explains the synthesis procedure we followed in detail:
For RQ1 and RQ2, we used the narrative synthesis method [9], where the extracted information was tabulated according to RQ1 and RQ2.
For the data extracted (quantitative) in RQ3 and RQ4, which came from different articles that have various accuracy calculation techniques, we used binary outcomes to measure the results, which are demonstrated in a comparable way [9].
In RQ5, the strengths and weaknesses of the data mining techniques have the same meaning but are written in different ways. Therefore, to unify these points, we followed the reciprocal translation method [9] which is considered as one of the techniques that can be used for synthesizing the qualitative data.
3. Results and discussion
In this section, we discuss the results obtained from this review. We first give an overview of the selected articles. The result of each RQ is then discussed in detail in the next five subsections: 3.1 Types of data mining techniques (RQ1), 3.2 Data mining techniques research areas (RQ2), 3.3 Machine learning versus non-machine learning methods in mining social media data (RQ3), 3.4 Data mining techniques versus other data mining techniques (RQ4), and 3.5 Strengths and weaknesses of data mining techniques (RQ5).
The total number of the selected studies was 66 articles (see Appendix (A), Table A4) that implemented data mining techniques used in social media. The selected articles were retrieved only from journals published between January 2003 and 2015. Appendix (A), Table A4 shows the number of articles and the percentage grouped by publisher name. The types of articles considered in this survey are: experiment, case study, and survey. Table 2 shows the distribution of the selected articles among the three types.
Table 2. Selected articles' types distribution.
Article type Freq.
Case study 4
Experiment 60
Survey 2
Grand total 66
With regard to the quality of the selected articles, we applied the quality assessment criteria to grade the articles based on the marks gained. The articles with a grade of five or greater (out of ten) were taken into consideration (see Table 3).
Table 3. Candidate articles' quality distribution.
Classification criteria (QAR score)   Freq.   %
Between 0 and 2.5                     53      36
Between 2.75 and 4.75                 28      19
Between 5 and 6.75                    35      24
Between 7 and 8.5                     22      15
Between 8.75 and 10                   9       6
Grand total                           147     100
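The figures in Table 3 can be cross-checked with a short calculation. The sketch below recomputes the percentage column from the frequencies and confirms that the bands at or above a score of 5 add up to the 66 articles retained. The variable names are hypothetical; the numbers are copied from the table.

```python
# (score band, frequency) pairs copied from Table 3.
bands = [
    ("0 to 2.5", 53),
    ("2.75 to 4.75", 28),
    ("5 to 6.75", 35),
    ("7 to 8.5", 22),
    ("8.75 to 10", 9),
]

total = sum(freq for _, freq in bands)

# Percentage of candidate articles in each band, rounded to whole numbers.
percentages = {label: round(100 * freq / total) for label, freq in bands}

# Articles scoring 5 or higher fall in the last three bands.
accepted = sum(freq for _, freq in bands[2:])

print(total, percentages, accepted)
```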
3.1. Types of data mining techniques (RQ1)
We identified 19 data mining techniques that had been applied by researchers in the area of social media. The list of these techniques is below.
AdaBoost
Artificial Neural Network (ANN)
Apriori
Bayesian Networks (BN)
Decision Trees (DT)
Density Based Algorithm (DBA)
Fuzzy
Genetic Algorithm (GA)
Hierarchical Clustering (HC)
K-Means
k-nearest Neighbors (k-NN)
Linear Discriminant Analysis (LDA)
Linear-Regression (Lin-R)
Logistic Regression (LR)
Markov
Maximum Entropy (ME)
Novel
Support Vector Machine (SVM)
Wrapper
Fig. 3 shows that SVM, BN, and DT are the most applied techniques in the area of social media, with a percentage of 51% of the selected articles. Novel techniques, with a percentage of 9%, were not counted among the most applied because each article has its own dedicated novel technique. Table 4 includes detailed information about the frequencies of data mining techniques used by the selected articles in this review.
Fig. 3. Data mining techniques among selected papers.
Table 4. Data mining techniques frequencies among articles.
Technique   Frequency
AdaBoost    2
ANN         8
Apriori     1
BN          26
DT          11
DBA         3
Fuzzy       1
GA          1
HC          2
K-Means     6
k-NN        9
LDA         9
Lin-R       1
LR          4
Markov      1
ME          2
Novel       12
SVM         29
Wrapper     1
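The percentages quoted for Fig. 3 can be reproduced from the Table 4 frequencies. In this sketch the denominator is assumed to be the total number of technique occurrences across articles rather than the number of articles; that assumption is not stated in the source, but it yields the reported 51% and 9%.

```python
# Frequencies copied from Table 4.
frequencies = {
    "AdaBoost": 2, "ANN": 8, "Apriori": 1, "BN": 26, "DT": 11,
    "DBA": 3, "Fuzzy": 1, "GA": 1, "HC": 2, "K-Means": 6,
    "k-NN": 9, "LDA": 9, "Lin-R": 1, "LR": 4, "Markov": 1,
    "ME": 2, "Novel": 12, "SVM": 29, "Wrapper": 1,
}

# Assumed denominator: every technique occurrence, not every article.
total_occurrences = sum(frequencies.values())

def share(techniques):
    """Percentage of all occurrences accounted for by the given techniques."""
    count = sum(frequencies[t] for t in techniques)
    return round(100 * count / total_occurrences)

top_three_share = share(["SVM", "BN", "DT"])
novel_share = share(["Novel"])

print(total_occurrences, top_three_share, novel_share)
```

An article that applies several techniques is counted once per technique under this reading, which is why the total exceeds the 66 selected articles.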
Appendix (A), Fig. A1 further illustrates the findings by showing the distribution of the data mining techniques per year during the considered period. Based on the figure, the number of data mining techniques adopted by researchers in the social media area increased dramatically in 2012 and 2014, with 39 and 35 techniques respectively, and dropped to 24 techniques in 2013. Moreover, it is worth mentioning that many novel techniques arose between 2012 and early 2015, with a total of 12 new techniques.
3.2. Data mining techniques research areas (RQ2)
From the selected articles, we identified six general domains which applied various techniques in nine different research areas to mine the flow of big data gathered from social media. The list of these domains follows:
Business and Management (BM)
Education (EDU)
Finance (FIN)
Government and Public (GP)
Medical and Health (MH)
Social Networks (SN)
Fig. 4 shows that social networks and business and management were the domains in which data mining techniques were most actively applied, together accounting for 79% of all domains. Government and public, with a percentage of 9%, is the third most active domain. Appendix (A), Table A5 includes detailed information about all domains.
Fig. 4. Domains among articles.
For further analysis of Table 2, we investigated the experiments in the selected articles and plotted Fig. 5, which shows the popularity of the various types of social media applications in research. Some experiments were conducted to mine and analyze data from one or more social media applications. Microblogging applications such as Twitter were the most popular among researchers with 31 experiments, followed by social networks such as Facebook with 12 experiments. Appendix (A), Table A6 includes detailed information about the frequencies of social media applications used by the selected articles in this review.
Fig. 5. Popularity of various social media applications in research.
Fig. 6 provides further information about the findings by illustrating the distribution per year of the domains applying data mining techniques. Based on the figure, the number of publications increased dramatically in 2012 and 2014, with 19 articles in 5 domains in each of these years. In 2013, the number went down to 12 articles in 5 domains. Social network data analysis remained the most active domain throughout the considered period.
Fig. 6. Domains distribution per year.
Among the selected articles, we identified 9 active research objectives that adopted data mining techniques. The list of these research objectives follows:
Biometric
Content Analysis
Cyber Crime
Disease Awareness
Geolocating
Quality Improvement
Risk Management
Semantic Analysis
Sentiment Analysis
Fig. 7 illustrates the distribution of these research areas. Sentiment analysis and quality improvement were the most active areas among the articles, with frequencies of 21 and 14 respectively.
Fig. 7. Research objective among domains.
3.3. Machine learning versus non-machine learning methods in mining social media data (RQ3)
Data mining is the process of extracting hidden knowledge from data [16]. This can be done in many ways, for example with machine learning methods such as KNN, K-Means, and SVM. Statistical methods, which are also used to discover patterns, are in some cases considered non-machine learning methods. As Berson et al. mentioned [11], statistical techniques are driven by the data and are used to discover patterns and build predictive models.
Out of the 66 papers identified, only three contain either experimental or theoretical knowledge about non-machine learning methods. Two of these papers (A11, A19) integrated non-machine learning methods with machine learning methods to improve the result of their proposed solution. The third paper (A53) argued that text mining techniques that depend on machine learning methods differ from non-machine learning methods in three ways: (i) in traditional quantitative analysis methods, conclusions are derived from a sample of the population, whereas machine learning methods allow the researcher to derive conclusions from the entire population; (ii) traditional quantitative methods require the researcher to analyze the data using a theoretical platform, while machine learning methods give the researcher the ability to extract the actual meaning of the mined data contained in natural language text; and (iii) machine learning methods investigate the textual data without human interaction, whereas traditional quantitative methods need the researcher to interpret the data before analyzing it.
However, we disagree with the authors of paper A53 because the definition of data mining consists of three concepts [17]: Statistics, Data (Big or Small), and Machine Learning and Lifting. Thus, data mining includes all statistics (Descriptive and non-inferential parts of the classical statistics) and Exploratory Data Analysis (EDA) for the data using the power of computers for the purpose of lifting and learning the patterns of the data [17].
Consequently, machine learning data mining techniques and non-machine learning data mining techniques, such as traditional quantitative methods in statistics, are complementary to each other in data mining.
3.4. Data mining techniques versus other data mining techniques (RQ4)
This RQ compares different data mining techniques that have been used in the selected articles. Since most of the articles based their findings either on weak statistical analysis or on no statistical analysis at all, we built our comparison on the authors' own judgments, which relied on the experiments they conducted or on the references they cited. For instance, papers (A31, A53) indicate that the SVM technique is one of the best categorization and feature selection techniques available, relying on references published in 1998 and 2003, even though the citing work was published in 2013. Further details are provided in Section 5.
After reviewing the selected papers, we found that many papers have common findings on the same data mining techniques. For instance, papers (A31, A45, A53, A59) found that SVM outperforms other techniques such as Naive Bayes. In contrast, papers (A41, A51) claimed that Naive Bayes and MLP performed better than SVM. Some other papers (A3, A20, A35) claimed that K-Means performed better than other techniques such as C4.5. Finally, (A42, A60) found that the DBA technique outperforms other techniques in terms of working with noisy data.
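Because the articles report accuracy in incompatible ways, one way to make such findings comparable is to reduce each one to a binary outcome. The sketch below encodes the SVM findings reported above as 1 when a paper found SVM to be the better technique and 0 when it found otherwise. The encoding and variable names are assumptions for illustration; the review does not spell out its exact procedure.

```python
# 1 = the paper reported SVM outperforming the compared techniques,
# 0 = the paper reported another technique outperforming SVM.
# Article IDs are those cited in the paragraph above.
svm_outcomes = {
    "A31": 1, "A45": 1, "A53": 1, "A59": 1,
    "A41": 0, "A51": 0,
}

svm_wins = sum(svm_outcomes.values())
comparisons = len(svm_outcomes)
svm_losses = comparisons - svm_wins

print(f"SVM favored in {svm_wins} of {comparisons} comparisons")
```

A tally like this only counts the direction of each claim. It says nothing about effect size or about how rigorous each underlying experiment was.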
3.5. Strengths and weaknesses of data mining techniques (RQ5)
This part of the review represents a good source of information where the best practices of the primary data mining techniques could be implemented. Table 5 summarizes the data mining techniques that could be implemented in the social media area. In addition to the traditional data mining techniques, Appendix (A), Table A7, summarizes the description and the main features of the novel techniques proposed by the researchers.