Question
Predictive data mining models
Continuous - 'Children', 'Income', 'Tenure', 'MonthlyCharge', 'Item1', 'Item2', 'Item3', 'Item4', 'Item5', 'Item6', 'Item7', 'Item8'
Categorical - 'Marital', 'Gender', 'Techie', 'Contract', 'Port_modem', 'Tablet', 'InternetService', 'Phone', 'Multiple', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling', 'PaymentMethod'
Target - 'Churn' (Yes/No, converted to a binary 0/1 variable in the code below)
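To confirm this split between continuous and categorical variables, a quick dtype check can be run against the dataset (a minimal sketch, assuming churn_clean.csv contains the columns listed above):
import pandas as pd
df = pd.read_csv('churn_clean.csv')
# Numeric columns should match the continuous list; object (string) columns the categorical list
print(df.select_dtypes(include='number').columns.tolist())
print(df.select_dtypes(include='object').columns.tolist())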
Steps to prepare the data for analysis are:
# Import libraries needed for analysis
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Import dataset
data = 'churn_clean.csv'
df = pd.read_csv(data)
# Convert the predictor variable into a binary numeric variable
df['Churn'] = df['Churn'].replace({'Yes': 1, 'No': 0})
# Balance labels so there are equal churn vs non-churn
churners_number = len(df[df['Churn'] == 1])
print("Number of churners", churners_number)
churners = df[df['Churn'] == 1]
non_churners = df[df['Churn'] == 0].sample(n=churners_number)
print("Number of non-churners", len(non_churners))
df2 = pd.concat([churners, non_churners])
# Drop unneeded identifier and demographic columns, then convert categorical variables to dummies
df2 = df2.drop(['CaseOrder', 'Customer_id', 'Interaction', 'UID', 'Job', 'City', 'State', 'County',
                'Zip', 'Lat', 'Lng', 'Population', 'Area', 'TimeZone', 'Age', 'Outage_sec_perweek',
                'Email', 'Contacts', 'Yearly_equip_failure', 'Bandwidth_GB_Year'], axis=1)
# Separate the target label from the features before one-hot encoding
label = df2['Churn']
ml_dummies = pd.get_dummies(df2.drop('Churn', axis=1))
ml_dummies.fillna(value=0, inplace=True)
ml_dummies.head()
# Add a random column to the dataframe as a sanity check for feature importance:
# any genuinely informative feature should rank above pure noise
ml_dummies['randomColumn'] = np.random.randint(0, 1000, size=len(ml_dummies))
# Perform KNN classification
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(ml_dummies, label, test_size=0.25, random_state=8)
# Classifiers
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
classifiers = [
    KNeighborsClassifier(n_neighbors=5),
    DecisionTreeClassifier(max_depth=5)]
# Iterate over classifiers
for item in classifiers:
    classifier_name = type(item).__name__
    print(classifier_name)
    # Create the classifier, train it, and score it on the test set
    clf = item
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    score = clf.score(X_test, y_test)
    print(round(score, 3), "\n", "- - - - - ", "\n")
# Scale all variables to a range of 0 to 1
from sklearn.preprocessing import MinMaxScaler
features = ml_dummies.columns.values
scaler = MinMaxScaler(feature_range = (0,1))
scaler.fit(ml_dummies)
ml_dummies = pd.DataFrame(scaler.transform(ml_dummies))
ml_dummies.columns = features
# Create train/test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(ml_dummies, label, test_size=0.25, random_state = 8)
# Run logistic regression model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
result = model.fit(X_train, y_train)
# Print the prediction accuracy
from sklearn import metrics
prediction_test = model.predict(X_test)
print (metrics.accuracy_score(y_test, prediction_test))
# Get the weights of the most impactful variables
weights = pd.Series(model.coef_[0], index=ml_dummies.columns.values)
print(weights.sort_values())
# Random Forest Algorithm
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(ml_dummies, label, test_size=0.25, random_state = 8)
# max_features='auto' was removed in scikit-learn 1.3; 'sqrt' is its equivalent for classifiers
model_rf = RandomForestClassifier(n_estimators=1000, oob_score=True, n_jobs=-1,
                                  random_state=50, max_features="sqrt",
                                  max_leaf_nodes=30)
model_rf.fit(X_train, y_train)
# Make predictions
prediction_test = model_rf.predict(X_test)
print (metrics.accuracy_score(y_test, prediction_test))
# Graph of Random Forest results
importances = model_rf.feature_importances_
weights = pd.Series(importances, index=ml_dummies.columns.values)
weights.sort_values()[-10:].plot(kind = 'barh')
Part V. Data Summary and Implications
E1. The accuracy of each model is printed by the accuracy_score calls in the code above (the screenshot of the output referenced in the original question is not reproduced here).
Mean Squared Error was computed using a snip of code that is likewise missing from the question.
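A minimal sketch of how it could be computed, assuming the model_rf, X_test, and y_test objects defined above:
from sklearn.metrics import mean_squared_error
# Compare the hard 0/1 predictions against the true test labels
prediction_test = model_rf.predict(X_test)
print(mean_squared_error(y_test, prediction_test))
Note that for a binary 0/1 target, the MSE of hard class predictions equals 1 minus the accuracy, so it restates the accuracy result rather than giving an independent metric.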
1. Kindly review all the code and correct it.
2. Discuss one limitation of your random forest data analysis.