Question
3. (16 points) In this question we will examine the correlation between the features in the dataset credit_risk_dataset.csv. Load this dataset from shared/data/credit_risk_dataset.csv. More information about the data can be found here: https://www.kaggle.com/datasets/laotse/credit-risk-dataset/data
(a) (2 points) Check whether there are any missing values, i.e. NAs, in the data. For this, explore the dataframe.isna() function.
i. Report the names of the columns that have NAs.
ii. Drop all rows that have NAs.
(b) (2 points) Now we will analyze only a subset of the dataframe. Create a subset of the dataframe containing only the columns person_age, person_income, loan_amnt, loan_percent_income, cb_person_cred_hist_length.
(c) (4 points) Find the correlation between the columns in the data using dataframe.corr(). Pick a pair of covariates and interpret their correlation. Which two predictors are the most highly correlated? The least? Do these correlations make sense in context?
(d) (1 point) Using matplotlib.pyplot, draw a scatter plot with person_income on the X-axis and loan_amnt on the Y-axis.
(e) (3 points) Study the plot from Q.3(d).
i. Do you identify any outliers?
ii. If yes, then suggest a transformation of the data that would reduce the influence of those outlier
Step by Step Solution
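The parts of the question map onto a handful of pandas calls, so a minimal sketch may help. It is not an official solution. It assumes the underscored column names used by the Kaggle dataset and that person_income is strictly positive (needed for the logarithm). The helper names columns_with_nas, most_and_least_correlated, run_analysis and plot_income_vs_loan are hypothetical, introduced only for illustration. The log transform is one common choice for damping large outliers, not the only acceptable answer to part (e).

```python
import numpy as np
import pandas as pd

# Column names assumed from the Kaggle dataset (underscored forms).
SUBSET_COLUMNS = [
    "person_age",
    "person_income",
    "loan_amnt",
    "loan_percent_income",
    "cb_person_cred_hist_length",
]


def columns_with_nas(df):
    """Part (a)i: names of columns containing at least one missing value."""
    return df.columns[df.isna().any()].tolist()


def most_and_least_correlated(corr):
    """Part (c): column pairs with the largest and smallest |correlation|."""
    # Keep only the strict upper triangle so each pair appears once and the
    # diagonal of 1.0 values is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.abs().where(mask).stack().dropna()
    return pairs.idxmax(), pairs.idxmin()


def run_analysis(path):
    """Parts (a) to (c): load, drop rows with NAs, subset, and correlate."""
    df = pd.read_csv(path)
    na_cols = columns_with_nas(df)
    clean = df.dropna()                # part (a)ii
    subset = clean[SUBSET_COLUMNS]     # part (b)
    corr = subset.corr()               # Pearson correlation by default
    strongest, weakest = most_and_least_correlated(corr)
    return subset, na_cols, corr, strongest, weakest


def plot_income_vs_loan(df, log_income=False):
    """Parts (d) and (e): scatter plot, optionally with log10 of income."""
    # Imported here so the data helpers above work without matplotlib.
    import matplotlib.pyplot as plt

    if log_income:
        # Assumes person_income > 0; the log compresses very large incomes.
        x, label = np.log10(df["person_income"]), "log10(person_income)"
    else:
        x, label = df["person_income"], "person_income"
    fig, ax = plt.subplots()
    ax.scatter(x, df["loan_amnt"], s=5, alpha=0.4)
    ax.set_xlabel(label)
    ax.set_ylabel("loan_amnt")
    return ax
```

Calling run_analysis("shared/data/credit_risk_dataset.csv") returns the subset, the NA column names, the correlation matrix, and the two extreme pairs. Passing the subset to plot_income_vs_loan, once with log_income=False and once with log_income=True, lets you compare the raw and transformed scatter plots side by side.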