Question

Posted on Oct 11, 2024

I need help with my assignment. It was returned marked as needing revision, and I am not sure what needs to be done. If I had a sample document, I could understand what the question means.

I. DATA ANALYSIS: UNIVARIATE STATISTICS

Not Evident

The submission does not identify the distribution of variables using univariate statistics.

M. DATA ANALYSIS: VISUAL REPRESENTATION

Not Evident

The submission does not justify the methods chosen to visually present the data.

This aspect will be assessed once appropriate justifications for the use of the correlation plot, bar charts, and univariate statistics graphs are in place.

N. DATA SUMMARY: PHENOMENON

Not Evident

The submission does not explain how the data shows discrimination.

EVALUATOR COMMENTS: ATTEMPT 2

This aspect will be assessed once a determination whether the variables in the dataset were able to determine the phenomenon to be predicted is in place.

Please see the attached document. I would appreciate any help.

Introduction

We have been asked to analyze the customer data for a telecommunications company. The company is concerned about the churn rate of its customers. Customer churn occurs when customers stop doing business with a company. We are going to use R to predict customer churn from the provided telecom dataset. R is well suited to data mining and prediction. R and Python have similar capabilities in statistical data analytics. Both are free to the user community, while SAS is commercial software and is expensive. As far as I know, R has more advanced graphical capabilities than SAS or Python. As standalone software, R and Python can be easily installed on a user's computer. For this dataset, I have chosen R for the capabilities described above and for my familiarity with the software.

Before analysis, the data has to be cleaned so that it is consistent, uniform and normalized. The data also needs to be checked for missing values and outliers, either of which could skew the results. Another thing to check is that the categorical values are meaningful and consistent: the same value should not appear in different forms. When using categorical data, I aim for five or fewer factor levels (groups) per variable to support accurate classification/prediction. The target variable "Churn" has a 'Yes' or 'No' value. We can convert this to binary 0 or 1 to help with logistic regression and other analysis. The idea is to train models on part of the dataset so that they predict the target "Churn" value well. We will evaluate the accuracy of the prediction and also evaluate which variables are most important for it. I will use visual tools such as bar charts and box plots to look at how the data is distributed across categories.

We will use several supervised machine learning methods to discover trends and patterns in this data. Since the target variable has two categories ("Yes"/"No") and there are only 20 variables, logistic regression is a natural method to use. It predicts the probability that a customer churns. We will not use PCA here, since the data does not have many dimensions to reduce. Another method that I will use is the decision tree. The key purpose of a decision tree is to build a model that predicts the value of the target variable by learning decision rules. The rules are inferred from prior data (the training data). It works like a flow chart, separating data points into two groups at a time from the "tree trunk" to "branches" to "leaves", where the groups become progressively more homogeneous. I will also try the random forest method, which is an extension of the decision tree and can provide more accurate results. It is also used for identifying the most important features among all the available features in the training dataset. In this case, we need to identify the features that contribute most to churn, to help the company make better business decisions to retain its customers. Lastly, I will use the confusion matrix to evaluate the accuracy of the predictions.

A. Data Processing

The raw data consists of 7043 rows and 21 columns. Most of the columns have different types of values. For good prediction, the data needs to be uniform, with categorical variables converted to factors of 5 or fewer groups. We also need to look for outliers and clean or delete them.

1. The raw dataset consists of 7043 rows and 21 columns (variables). See Figure 1 below for the structure of the data. As the customer is interested in the churn rate, the target (dependent) variable is the "Churn" column, and all others are the independent variables. Let's pull some basic statistics from the data.
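The basic statistics mentioned here can be sketched as follows. The analysis itself was done in R; this is a minimal illustration in plain Python so that it runs without the dataset. The `monthly_charges` values and the `summarize` helper are hypothetical, not taken from the telecom data.

```python
import statistics

# Hypothetical MonthlyCharges values; the real column has thousands of entries.
monthly_charges = [29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.75]

def summarize(values):
    """Return basic univariate statistics for one numeric column."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    return {
        "n": len(values),
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
        "sd": statistics.stdev(values),  # sample standard deviation
    }

summary = summarize(monthly_charges)
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

A table like this for each numeric variable describes where the values are centered and how widely they spread, which is the kind of information the univariate statistics are meant to supply.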

The data was checked for missing values in each column. We found 11 missing values in the "TotalCharges" column. These rows were removed.
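The missing-value check and row removal can be illustrated with a small sketch in plain Python (the analysis itself was done in R). The rows below are made up, and the helper names `count_missing` and `drop_incomplete` are hypothetical.

```python
# Hypothetical rows; None marks a missing TotalCharges value.
rows = [
    {"Customerid": "A1", "tenure": 1, "TotalCharges": 29.85},
    {"Customerid": "A2", "tenure": 34, "TotalCharges": 1889.50},
    {"Customerid": "A3", "tenure": 2, "TotalCharges": None},
    {"Customerid": "A4", "tenure": 45, "TotalCharges": 1840.75},
]

def count_missing(records):
    """Count missing (None) values per column."""
    counts = {}
    for record in records:
        for column, value in record.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts

def drop_incomplete(records):
    """Keep only the rows that have no missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

missing = count_missing(rows)
clean_rows = drop_incomplete(rows)
print(missing)
print(len(rows), "rows before,", len(clean_rows), "rows after")
```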

2. Let's plot all the categorical variables by their count and see how their values are distributed. I will use bar charts to explore these values, because a bar chart shows the count of each level of a categorical variable side by side.

Gender is split almost equally between male and female. SeniorCitizen = '0' is much more frequent than '1'. These values need to change to 'No' and 'Yes' to make the column uniform with the others. Partner is split almost equally. Dependents has few 'Yes' values.

PhoneService has more 'Yes' values than 'No'. MultipleLines is split almost equally between 'No' and 'Yes', but also has a small set of 'No phone service' values. We need to convert those to 'No', since for this variable 'No phone service' and 'No' mean the same thing. OnlineSecurity has more 'No' than 'Yes'. 'No internet service' is also a value in this column; it means 'No', so we will convert those values.

OnlineBackup, DeviceProtection and StreamingTV have more 'No' than 'Yes'. Again, these have an extra 'No internet service' value, which needs to be changed to 'No'.

StreamingMovies has more 'Yes' than 'No'. Again, this column has an extra 'No internet service' value, which needs to be changed to 'No'. Contract is mostly month-to-month. All other variables seem to be fairly evenly distributed.
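The counts behind each bar chart amount to a frequency table per variable. A minimal sketch in plain Python (the analysis itself was done in R), using made-up values for one column and a hypothetical `frequency_table` helper:

```python
from collections import Counter

# Hypothetical values for one categorical column.
multiple_lines = ["No", "Yes", "No phone service", "Yes", "No", "No", "Yes", "No"]

def frequency_table(values):
    """Return (level, count, share) tuples, most common level first."""
    counts = Counter(values)
    total = len(values)
    return [(level, n, n / total) for level, n in counts.most_common()]

table = frequency_table(multiple_lines)
for level, n, share in table:
    print(f"{level:<18}{n:>3}  {share:.0%}")
```

Reporting the share next to the count makes it easier to state how a variable is distributed, for example that one level holds only a small fraction of the rows.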

3. Data manipulation:

a. In the columns below, the value "No internet service" is changed to "No", so that each column has only two levels.

OnlineSecurity

OnlineBackup

DeviceProtection

TechSupport

StreamingTV

StreamingMovies

b. Next, we will change the "MultipleLines" column value of "No phone service" to "No".

c. Let's look at the tenure values before processing them.

The tenure column has many distinct values and needs to be converted to intervals. The minimum value for tenure is 1 and the maximum is 72, so the values can be grouped into 5 groups: 0-12, 12-24, 24-48, 48-60 and >60.

d. We need to change the values in the column "SeniorCitizen" from 0 or 1 to "No" or "Yes" to make it categorical.

e. Lastly, we will remove the columns we don't need. "Customerid" is just a sequence identifier, and tenure has been converted to "tenure_group", so we don't need these columns for analysis.
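Steps a to e can be sketched as one row-level transformation. This is an illustration in plain Python rather than the R code used for the analysis. It assumes each tenure interval includes its upper bound (so 12 months falls in "0-12"), which the text does not state; the `prepare` and `tenure_group` helpers and the example row are hypothetical.

```python
# Columns whose "No internet service" level is folded into "No" (step a).
INTERNET_COLUMNS = [
    "OnlineSecurity", "OnlineBackup", "DeviceProtection",
    "TechSupport", "StreamingTV", "StreamingMovies",
]

def tenure_group(months):
    """Map tenure in months to one of the five groups (step c).

    Assumption: each upper bound is inclusive, so 12 falls in "0-12".
    """
    if months <= 12:
        return "0-12"
    if months <= 24:
        return "12-24"
    if months <= 48:
        return "24-48"
    if months <= 60:
        return "48-60"
    return ">60"

def prepare(record):
    """Apply steps a to e to one row and return a new dict."""
    row = dict(record)
    for column in INTERNET_COLUMNS:                      # step a
        if row.get(column) == "No internet service":
            row[column] = "No"
    if row.get("MultipleLines") == "No phone service":   # step b
        row["MultipleLines"] = "No"
    row["tenure_group"] = tenure_group(row["tenure"])    # step c
    row["SeniorCitizen"] = "Yes" if row["SeniorCitizen"] == 1 else "No"  # step d
    for column in ("Customerid", "tenure"):              # step e
        row.pop(column, None)
    return row

# Hypothetical row, for illustration only.
example = {
    "Customerid": "A1", "SeniorCitizen": 0, "tenure": 12,
    "MultipleLines": "No phone service",
    "OnlineSecurity": "No internet service",
}
prepared = prepare(example)
print(prepared)
```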

B. Exploratory Data Analysis

4. Let's first look at the numeric columns, "TotalCharges" and "MonthlyCharges", using a correlation plot. Two types of graphs have been plotted.

Since the two are correlated, they carry overlapping information, so we will remove the "TotalCharges" column.
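The strength of the relationship shown in a correlation plot can be summarized with the Pearson correlation coefficient. A minimal sketch in plain Python (the analysis itself was done in R), using made-up charge values and a hypothetical `pearson` helper:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values, not taken from the telecom dataset.
monthly = [20.0, 45.0, 60.0, 80.0, 100.0]
total = [240.0, 900.0, 2100.0, 3900.0, 6500.0]

r = pearson(monthly, total)
print(f"correlation: {r:.3f}")
```

Quoting the coefficient alongside the plot gives a concrete basis for deciding that one of the two columns is redundant.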

5. Next, we will plot all the categorical variables/columns against Churn (the dependent/target variable).

i) Variables: gender, SeniorCitizen, Partner, Dependents

The churn rate for male and female customers is almost the same.

The churn rate for non-senior citizens is higher than for senior citizens.

The churn rate for customers without a partner is higher than for customers with a partner.

The churn rate for customers without dependents is higher than for customers with dependents.

ii) Variables: PhoneService, MultipleLines, InternetService, OnlineSecurity

The churn rate for phone service customers is higher than for customers without phone service.

The churn rate for multiple-line customers is higher than for customers without multiple lines.

The churn rate for fiber optic customers is higher than for DSL customers.

The churn rate for customers with online security is lower than for customers without it.

iii) Variables: OnlineBackup, DeviceProtection, TechSupport, StreamingTV

The churn rate for online backup customers is lower than for customers without online backup.

The churn rate for device protection customers is lower than for customers without device protection.

The churn rate for tech support customers is lower than for customers without tech support.

The churn rate for streaming TV customers is almost the same as for customers without streaming TV.

iv) Variables: StreamingMovies, Contract, PaperlessBilling, PaymentMethod

The churn rate for streaming movies customers is almost the same as for customers without streaming movies.

The churn rate for month-to-month contract customers is higher than for the other customers.

The churn rate for paperless billing customers is higher than for the other customers.

The churn rate for customers paying by electronic check is higher than for the other customers.

v) tenure_group

The churn rate for the 0-12 month tenure group is higher than for the others.

All the variables appear to be broadly distributed, so we will keep all of them.
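The comparisons in this section can be read either as counts of churned customers or as churn rates within each level, and the two can rank levels differently when the levels differ in size. A minimal sketch in plain Python (the analysis itself was done in R) that reports both, using made-up rows and a hypothetical `churn_by_level` helper:

```python
from collections import defaultdict

# Hypothetical (Contract, Churn) pairs, not taken from the telecom dataset.
pairs = [
    ("Month-to-month", "Yes"), ("Month-to-month", "No"),
    ("Month-to-month", "Yes"), ("Month-to-month", "No"),
    ("One year", "No"), ("One year", "No"), ("One year", "Yes"),
    ("Two year", "No"), ("Two year", "No"), ("Two year", "No"),
]

def churn_by_level(rows):
    """Return {level: (churned, total, rate)} for one categorical variable."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for level, churn in rows:
        totals[level] += 1
        churned[level] += churn == "Yes"
    return {
        level: (churned[level], totals[level], churned[level] / totals[level])
        for level in totals
    }

rates = churn_by_level(pairs)
for level, (yes, total, rate) in rates.items():
    print(f"{level:<15} churned {yes} of {total} ({rate:.0%})")
```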

C. Logistic Regression (Rai)

1. Let's first convert the Churn column values to binary: 1 for "Yes" and 0 for "No".

2. Next, let's do logistic regression. The data is split into a 70% training set and a 30% testing set. We will check the dimensions of each set to confirm the split is correct.
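The binary recoding and the 70/30 split can be sketched as follows, in plain Python rather than the R code used for the analysis. The labels, the seed value and the `train_test_split` helper are hypothetical; the source does not say how rows were sampled.

```python
import random

def to_binary(label):
    """Recode Churn: 1 for "Yes", 0 for "No"."""
    return 1 if label == "Yes" else 0

def train_test_split(records, train_fraction=0.7, seed=2024):
    """Shuffle a copy of the records and cut it into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical labels, for illustration only.
labels = ["Yes", "No", "No", "Yes", "No", "No", "No", "Yes", "No", "No"]
data = [{"row": i, "Churn": to_binary(label)} for i, label in enumerate(labels)]

train, test = train_test_split(data)
print(len(train), "training rows,", len(test), "testing rows")
```

Checking the two lengths against the total is the same sanity check as inspecting the dimensions of the training and testing sets.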

3. Run the regression model.

We can see that the gender, SeniorCitizen, Partner, Dependents, MultipleLines, InternetService, DeviceProtection, StreamingTV, StreamingMovies and PaperlessBilling variables are not statistically significant. MonthlyCharges, PhoneService, Contract and tenure have the lowest p-values.

4. Feature Analysis - ANOVA

Analyzing the deviance table, we can see the drop in deviance when adding each variable one at a time. Adding MonthlyCharges, Contract, OnlineSecurity and tenure significantly reduces the residual deviance.

5. Assessing the Predictive Ability of the Logistic Regression Model

The logistic regression accuracy is 0.81.

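6. Logistic Regression Confusion Matrix

Taking Churn = Yes as the positive class, the four cells are:

True negative is 1409 - actual No, predicted No.

True positive is 299 - actual Yes, predicted Yes.

False positive is 139 - actual No, predicted Yes.

False negative is 261 - actual Yes, predicted No.

These counts give the accuracy reported above: (1409 + 299) / 2108 = 0.81.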
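The usual summary measures follow directly from the four counts. The sketch below, in plain Python rather than R, reads the counts with Churn = Yes as the positive class (299 correctly predicted churners and 1409 correctly predicted non-churners); that reading is an assumption about how the table was laid out, and `confusion_metrics` is a hypothetical helper.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Summary measures from the four cells of a 2x2 confusion matrix."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that are correct
        "precision": tp / (tp + fp),     # share of predicted positives that are right
        "recall": tp / (tp + fn),        # share of actual positives that are found
        "specificity": tn / (tn + fp),   # share of actual negatives that are found
    }

# Counts from the logistic regression test set, assuming Churn = Yes is positive.
metrics = confusion_metrics(tp=299, tn=1409, fp=139, fn=261)
for name, value in metrics.items():
    print(f"{name:<12}{value:.3f}")
```

Accuracy does not depend on which class is called positive, so it matches the reported 0.81 under either reading; the other three measures do depend on that choice.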

D. Odds Ratio

The odds ratio (OR) compares the odds of an event (here, churn) happening under two conditions:

OR > 1 indicates increased odds of the event

OR < 1 indicates decreased odds of the event
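In logistic regression, the odds ratio for a variable is the exponential of its coefficient, because the coefficients are on the log-odds scale. A minimal sketch in plain Python (the analysis itself was done in R); the coefficient names and values below are hypothetical, not the fitted values from this model.

```python
import math

# Hypothetical logistic regression coefficients (log-odds), not fitted values.
coefficients = {"ContractTwo year": -1.4, "PaperlessBillingYes": 0.3}

def odds_ratio(beta):
    """Convert a log-odds coefficient to an odds ratio."""
    return math.exp(beta)

odds_ratios = {name: odds_ratio(beta) for name, beta in coefficients.items()}
for name, value in odds_ratios.items():
    direction = "increases" if value > 1 else "decreases"
    print(f"{name}: OR = {value:.2f} ({direction} the odds of churn)")
```

A negative coefficient always gives OR < 1 and a positive one gives OR > 1, with a coefficient of zero corresponding to OR = 1 (no change in the odds).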

E. Decision Tree (Rai)

Let's create a decision tree based on the three most relevant features: Contract, OnlineSecurity, and tenure_group.

Contract is the most important variable for predicting whether a customer churns. Customers on one-year or two-year contracts are less likely to churn, even if they have online security. But a month-to-month contract customer who is in the 0-12 month tenure group and using online security is more likely to churn.
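A decision tree picks the variable to split on by measuring how much a split separates churners from non-churners. Gini impurity is one common criterion; the source does not say which criterion this tree used, so the sketch below is only an illustration in plain Python, with hypothetical counts and helper names.

```python
def gini(yes, no):
    """Gini impurity of a node holding `yes` churners and `no` non-churners."""
    total = yes + no
    if total == 0:
        return 0.0
    p = yes / total
    return 2 * p * (1 - p)

def split_gain(parent, children):
    """Impurity decrease when a parent (yes, no) node is split into children."""
    n = sum(parent)
    weighted = sum((yes + no) / n * gini(yes, no) for yes, no in children)
    return gini(*parent) - weighted

# Hypothetical counts: 100 customers, split by contract type.
parent = (40, 60)                   # (churned, stayed) before the split
by_contract = [(35, 25), (5, 35)]   # month-to-month vs. longer contracts

gain = split_gain(parent, by_contract)
print(f"impurity decrease: {gain:.4f}")
```

The variable whose split gives the largest decrease is placed nearest the root, which is the sense in which a variable at the top of the tree is the most important one.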

F. Decision Tree Confusion Matrix (Rai)

The decision tree has not improved on the accuracy of the logistic regression.

G. Random Forest (Rai)

1. Initial Model: Let's do some analysis with a random forest.

Here ntree=500 and mtry=4. The error rate is 22.01%, which is about 78% accuracy. The error rate for "No" (0) is lower than the error rate for "Yes" (1).

2. Random Forest Prediction and Confusion Matrix with Training Data

We are going to use all the variables to make predictions and produce a confusion matrix table. For the training data, accuracy is 93%.

Testing Data:

The model correctly classified 1386 customers as "No" churn and 286 as "Yes" churn. Only 162 "Yes" and 274 "No" predictions are misclassified. The accuracy is 0.79, so the misclassification error is 21%. The 95% confidence interval for the accuracy is 77% to 81%. This result is good. Also, the sensitivity is 90%, with "No" treated as the positive class.
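The accuracy, its confidence interval and the sensitivity can be reproduced from the four test-set counts. The sketch below is in plain Python rather than R and rests on two assumptions: the 162 are non-churners predicted as "Yes" (the reading that reproduces the reported 90% sensitivity with "No" as the positive class), and the interval uses the normal approximation, since the source does not say how its interval was computed. `accuracy_interval` is a hypothetical helper.

```python
import math

def accuracy_interval(correct, total, z=1.96):
    """Accuracy with a normal-approximation confidence interval (z=1.96 for 95%)."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, p - half_width, p + half_width

# Test-set counts from the text.
correct_no, correct_yes = 1386, 286
wrong_yes = 162   # assumed: actual "No" predicted as "Yes"
wrong_no = 274    # assumed: actual "Yes" predicted as "No"

total = correct_no + correct_yes + wrong_yes + wrong_no
accuracy, lower, upper = accuracy_interval(correct_no + correct_yes, total)

# Sensitivity with "No" as the positive class: correct "No" over all actual "No".
sensitivity = correct_no / (correct_no + wrong_yes)

print(f"accuracy {accuracy:.3f}, 95% CI {lower:.3f} to {upper:.3f}")
print(f"sensitivity {sensitivity:.3f}")
```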

Let's now plot the error rate.

As the number of trees grows, the out-of-bag error initially drops and then becomes more or less constant. We are not able to improve this error after about 100 to 300 trees.

3. Tune Random Forest Model: Tuning the model gives the graph below.

The out-of-bag (OOB) error is lowest when mtry is 2. Therefore we will choose mtry to be 2 in the model.

4. Fit the Random Forest Model After Tuning

Let's run the model again with the updated mtry=2. After trying different combinations, I found that ntree=400 and mtry=2 gave the lowest error rate. The error rate decreased to 20.69% from 21.42%.

The OOB error rate decreased to 21.77% from 22.01% earlier.

5. Random Forest Predictions and Confusion Matrix After Tuning

The accuracy has not increased, but sensitivity has increased to 94% from the initial model (90%).

6. Let's look at the random forest top features.

Using importance (Sagar)

H. Summary

For this dataset, logistic regression and random forest appear to have performed better than the decision tree. Tenure, monthly charges, contract and online security are some of the important features for predicting churn. Gender, partner, dependents, internet service and paperless billing do not seem to have any relationship to churn. Contract is the most important variable for predicting whether a customer churns. Customers on one-year or two-year contracts are less likely to churn, even if they have online security. But a month-to-month contract customer who is in the 0-12 month tenure group and using online security is more likely to churn.