Question

Posted on Oct 09, 2024

Assignment

You will work on datasets related to the direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether a client will subscribe to a term deposit. The bank ran its marketing campaigns by phone. Often, more than one contact with the same client was required in order to assess whether the bank term deposit would be subscribed ('yes') or not ('no').

This assignment is an individual assignment.

I. Datasets:

There are three datasets: demographics.csv, campaign.csv and deposit.csv. You need to merge these three datasets to produce a "big table", which will be the training dataset. The dependent variable is the variable "deposit" contained in deposit.csv.

Source of the data: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014.

Variable Information:

- Independent variables:
- cid: customer ID (nominal); exists in all three tables.
- Other independent variables include:

# Table: demographics
1. age (numeric)
2. job: type of job (categorical)
3. marital: marital status (categorical)
4. education (categorical)
5. default: has credit in default? (categorical)
6. balance: account balance (numeric)
7. housing: has housing loan? (categorical)
8. loan: has personal loan? (categorical)

# Table: campaign (related to the last contact of the current campaign)
9. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known.
   Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
10. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
11. pdays: number of days that passed after the client was last contacted from a previous campaign (numeric; -1 means the client was not previously contacted)
12. previous: number of contacts performed before this campaign and for this client (numeric)
13. poutcome: outcome of the previous marketing campaign (categorical)

# Table: deposit
14. deposit (categorical: 'yes', 'no')

To work on this assignment, if you are not familiar with SAS, you need to watch the tutorial videos posted on the course video site. The slides, dataset and code I developed for the tutorial are available on D2L under content -> SAS tutorial.

II. Your tasks: Base SAS Programming

You need to submit your assignment in just ONE Microsoft Word or PDF file. For each task below (except task 1), copy and paste your code/output into the Word/PDF document. Clearly label your code/output to indicate which task/subtask it corresponds to. (C) means that you need to submit the SAS code for the task. (O) means that you need to submit the SAS output for the task.

1. On D2L under content -> Assignments -> Assignment 1, you will find a zip file called "Assignment1Data.zip". Download and unzip the file; you will find three data files: "demographics.csv", "campaign.csv" and "deposit.csv". Import these three data files into SAS and name the SAS datasets "demographics", "campaign" and "deposit" respectively. You don't need to submit anything for this task. Please note: to use a permanent SAS dataset, you must create a library.
You can use "explorer" and right-click "libraries" to create a new library, but a more commonly used way of creating a library is to submit a libname statement pointing to the folder where you saved your data. Be sure you point to the correct folder on your computer:

    LIBNAME ASSGN1 '...the directory where you store your datasets'; /* please modify this line of code. */

You run this statement first, and then import the dataset.

2. Sort these three datasets by "cid". You have data in three tables and need to merge them to produce your training dataset. In order to merge tables, you must first sort the tables by the variable you want to merge the tables on. You cannot merge the tables while they are unsorted. (C)

    /* below is the code for sorting the dataset demographics */
    PROC SORT DATA=ASSGN1.DEMOGRAPHICS;
        BY CID;
    RUN;
    /* ---- please write your code for sorting CAMPAIGN and DEPOSIT */

3. Merge the three datasets to create your training dataset. All three datasets include a variable "cid". We merge the datasets based on the common variable "cid". If you have taken a relational database course before, you probably know that there are different kinds of "join" operations. In your database course, you have mainly used "inner join". In data mining, we usually use "left join". "Left join" returns all rows from the left table (table1 in the figure below), with the matching rows in the right table (table2). The result is NULL on the right side when there is no match. You need to first find the table that contains the dependent variable ("deposit" in our case), and use that table to merge the other two tables. Please read https://en.wikipedia.org/wiki/Join_(SQL)#Left_outer_join for a more detailed explanation.
(C) The code for left-joining "deposit" and "demographics" is as follows:

    DATA ASSGN1.TRAINING; /* we produce a new dataset called training by merging two datasets, deposit and demographics */
        MERGE ASSGN1.DEPOSIT(IN=A) ASSGN1.DEMOGRAPHICS(IN=B); /* deposit is now your "A" table, and demographics is your "B" table */
        BY CID; /* we merge the two datasets based on the common variable cid */
        IF A; /* if a record is in "A" table, output */
    RUN;
    /* You have merged deposit and demographics and created the dataset training. Now, write your code for left-joining ASSGN1.TRAINING with ASSGN1.CAMPAIGN. */

4. After the previous step, you should have created your training dataset ASSGN1.TRAINING. Please print the first 20 rows of ASSGN1.TRAINING. (C)(O)

    /* Print statement. */
    PROC PRINT DATA=ASSGN1.TRAINING; /* You need to modify this statement - you want to print just the first 20 records of ASSGN1.TRAINING */
    RUN;

5. Run "proc contents" to get a summary of the training dataset. (C)(O)

6. Write SAS code to drop two columns, "cid" and "duration", from the training dataset. Obviously "cid" is not useful for the prediction. "duration" cannot be included in the model. (C)

7. For each categorical variable in the dataset, please use "proc freq" to get descriptive stats. (C)(O) For instance, for the variable "education", you write:

    proc freq data=ASSGN1.TRAINING;
        tables education;
    run;

The output should show how many categories each categorical variable includes and whether the variable has missing values. For instance, the variable "education" includes 4 categories, "primary", "secondary", "tertiary" and "unknown", as well as 5 missing values. Please use "proc freq" to obtain the descriptive stats for all categorical variables.

8. For each numeric variable in the dataset, please use "proc means" to get descriptive stats.
For each numeric variable, please write code to output the mean, min, max, the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th and 99th percentiles, and the number of missing values. You can find the example code for proc means in the tutorial slides. (C)(O)

9. Categorical variable recoding and missing value imputation. (C)

We need to consider the following situations:

1) For all categorical variables that are binary and have no missing values, you use 1 and 0 to represent the two categories. For instance, the dependent variable "deposit" is binary with no missing values, so you can simply replace 'yes' with 1 and 'no' with 0. (C)

2) For all categorical variables that have missing values and also have an "unknown" category, you treat missing values as belonging to the "unknown" category and replace missing values with "unknown". For instance, "education" includes 4 categories, "primary", "secondary", "tertiary" and "unknown", and it also has missing values. Please replace the missing values with "unknown". (C)

3) For all categorical variables that have missing values but do not have an "unknown" category, you need to replace missing values with the string "missing". For instance, the variable "housing" is binary (including "yes" and "no") and has some missing values, so you need to replace the missing values with "missing". (C)

4) After 1), 2) and 3), all your categorical variables should have no missing values. For these categorical variables, as I will discuss in one of the lecture recordings, you are not required to do categorical/nominal variable dummy coding if you use SAS, but in this step, I ask you to write SAS code to create dummy variables for just one variable, "poutcome". This variable includes four categories, "success", "failure", "other" and "unknown", so you need to create four dummy variables: "poutcome_success", "poutcome_failure", "poutcome_other" and "poutcome_unknown".
Please assign values to the dummy variables according to the following table using if-else statements in SAS. For instance, if a training example's poutcome is "success", we set "poutcome_success" to be 1 and all the other three dummy variables to be 0. (C)

    poutcome   poutcome_success   poutcome_failure   poutcome_other   poutcome_unknown
    success    1                  0                  0                0
    failure    0                  1                  0                0
    other      0                  0                  1                0
    unknown    0                  0                  0                1

5) After doing dummy coding for the categorical variable "poutcome", you need to drop the variable "poutcome". Also, to avoid the "dummy variable trap" (please google to find out what it means), you need to drop one of the four dummy variables you have created. Let's drop "poutcome_failure". (C)

10. There are also numeric variables with missing values in the dataset. For instance, "balance" (originally in the demographics dataset) should be included in the analysis as a numeric variable, and this variable contains some missing values. For these numeric variables with missing values, we want to do mean imputation: you need to replace the missing values with the mean of the variable. Please use PROC STDIZE to do mean imputation. Please google to find out how to do this. (C)

11. Run logistic regression using ASSGN1.TRAINING. The dependent variable is "deposit", and the other variables are all independent variables. Please note you need to include the dummy variables you created in step 9.4. You need to put all categorical independent variables that include more than 2 categories (including the missing or unknown category) in the CLASS statement. (C)(O) For those categorical variables with just 2 categories (including the dummy variables you just created), it's not required to add them to the CLASS statement.

    PROC LOGISTIC DATA=ASSGN1.REG_DATA;
    /* You need to write two lines of code to specify your logistic regression model. Please take a look at the documentation of the procedure logistic.
    https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#logistic_toc.htm
    In the example section, you can find some example code (example 51.2 is very relevant to this task). */
    RUN;
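
To see what the left join in tasks 2 and 3 actually does, here is a minimal sketch on two made-up toy tables. The dataset names WORK.DEP_TOY, WORK.DEMO_TOY, WORK.JOINED_TOY and WORK.JOINED_SQL and all values are assumptions for illustration only; they are not part of the assignment data. The PROC SQL step is an alternative shown for comparison, not the method the assignment asks for.

```sas
/* Toy "left" table: contains the dependent variable. Values are made up. */
DATA WORK.DEP_TOY;
    INPUT CID DEPOSIT $;
    DATALINES;
1 yes
2 no
3 no
;
RUN;

/* Toy "right" table: cid 2 is absent, cid 4 has no match on the left. */
DATA WORK.DEMO_TOY;
    INPUT CID AGE;
    DATALINES;
1 41
3 29
4 55
;
RUN;

/* Both inputs must be sorted by the BY variable before MERGE. */
PROC SORT DATA=WORK.DEP_TOY;  BY CID; RUN;
PROC SORT DATA=WORK.DEMO_TOY; BY CID; RUN;

DATA WORK.JOINED_TOY;
    MERGE WORK.DEP_TOY(IN=A) WORK.DEMO_TOY(IN=B);
    BY CID;
    IF A;  /* keep every row that came from the left table, matched or not */
RUN;

/* Same idea written as a SQL left join (no prior sorting needed). */
PROC SQL;
    CREATE TABLE WORK.JOINED_SQL AS
    SELECT d.*, m.AGE
    FROM WORK.DEP_TOY AS d
    LEFT JOIN WORK.DEMO_TOY AS m
        ON d.CID = m.CID;
QUIT;

PROC PRINT DATA=WORK.JOINED_TOY; RUN;
```

In the printed result, cid 2 is kept with a missing AGE and cid 4 is dropped, which is the left-join behaviour described in task 3. The two approaches agree here because cid is unique in each table; with repeated keys, MERGE and SQL joins can behave differently.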
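
For the inspection steps in tasks 4, 5, 7 and 8, the following sketch shows the general shape of each procedure call on a small made-up dataset. WORK.TRAIN_TOY and its values are hypothetical; on the real data you would point each procedure at your own training dataset and list your own variables.

```sas
/* Toy data with a missing numeric value and a missing character value. */
DATA WORK.TRAIN_TOY;
    INFILE DATALINES DSD DLM=',' MISSOVER;
    LENGTH EDUCATION $ 10;
    INPUT CID AGE BALANCE EDUCATION $;
    DATALINES;
1,41,1200,primary
2,35,,secondary
3,29,-50,
4,55,300,unknown
;
RUN;

/* OBS= is a dataset option that limits how many rows are read. */
PROC PRINT DATA=WORK.TRAIN_TOY(OBS=20);
RUN;

/* Variable names, types and lengths. */
PROC CONTENTS DATA=WORK.TRAIN_TOY;
RUN;

/* MISSING makes missing values appear as their own row in the table. */
PROC FREQ DATA=WORK.TRAIN_TOY;
    TABLES EDUCATION / MISSING;
RUN;

/* Statistic keywords on the PROC MEANS statement select what is reported. */
PROC MEANS DATA=WORK.TRAIN_TOY
           N NMISS MEAN MIN MAX P1 P5 P10 P25 P50 P75 P90 P95 P99;
    VAR AGE BALANCE;
RUN;
```

Without the MISSING option, PROC FREQ still reports the number of missing values, but below the table rather than as a category.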
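
The recoding rules in task 9 can be sketched in a single DATA step. This is a minimal illustration on made-up rows, not a full solution: WORK.RAW_TOY, WORK.CODED_TOY and the helper variable DEPOSIT_NUM are assumed names. Two details are easy to miss: a character variable cannot be turned into a numeric one in place, so the sketch creates a new numeric variable and renames it; and a character variable must be long enough to hold the replacement string, otherwise "missing" is silently truncated.

```sas
/* Toy data; HOUSING is declared with length 7 so that 'missing' fits. */
DATA WORK.RAW_TOY;
    INFILE DATALINES DSD DLM=',' MISSOVER;
    LENGTH DEPOSIT $ 3 EDUCATION $ 10 HOUSING $ 7 POUTCOME $ 8;
    INPUT DEPOSIT $ EDUCATION $ HOUSING $ POUTCOME $;
    DATALINES;
yes,primary,yes,success
no,,no,failure
no,tertiary,,other
yes,unknown,yes,unknown
;
RUN;

/* RENAME= is applied after the DROP statement, so the numeric helper
   variable ends up under the original name DEPOSIT. */
DATA WORK.CODED_TOY(RENAME=(DEPOSIT_NUM=DEPOSIT));
    SET WORK.RAW_TOY;

    /* 1) binary, no missing values: 'yes' -> 1, 'no' -> 0 */
    IF DEPOSIT = 'yes' THEN DEPOSIT_NUM = 1;
    ELSE DEPOSIT_NUM = 0;

    /* 2) missing values join the existing "unknown" category */
    IF MISSING(EDUCATION) THEN EDUCATION = 'unknown';

    /* 3) no "unknown" category: use the string "missing" */
    IF MISSING(HOUSING) THEN HOUSING = 'missing';

    /* 4) dummy coding with if-else: start at 0, then set one flag to 1 */
    POUTCOME_SUCCESS = 0;
    POUTCOME_FAILURE = 0;
    POUTCOME_OTHER   = 0;
    POUTCOME_UNKNOWN = 0;
    IF POUTCOME = 'success' THEN POUTCOME_SUCCESS = 1;
    ELSE IF POUTCOME = 'failure' THEN POUTCOME_FAILURE = 1;
    ELSE IF POUTCOME = 'other' THEN POUTCOME_OTHER = 1;
    ELSE IF POUTCOME = 'unknown' THEN POUTCOME_UNKNOWN = 1;

    /* 5) drop the source variable and one dummy (the reference level) */
    DROP DEPOSIT POUTCOME POUTCOME_FAILURE;
RUN;

PROC PRINT DATA=WORK.CODED_TOY; RUN;
```

On data imported from CSV, check the lengths reported by PROC CONTENTS first; a variable that only ever held "yes" and "no" may have been given a length too short for "missing".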
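
For tasks 10 and 11, the sketch below shows the syntax of mean imputation with PROC STDIZE followed by a logistic regression with a CLASS variable. The datasets WORK.MODEL_TOY and WORK.MODEL_IMP, the variable list and all values are made up for illustration; the real model would use the full set of independent variables. With so few rows, PROC LOGISTIC may print convergence or separation warnings, so treat this as a syntax illustration rather than a meaningful fit.

```sas
/* Toy modelling data: DEPOSIT already coded 1/0, BALANCE has missing values. */
DATA WORK.MODEL_TOY;
    INPUT DEPOSIT AGE BALANCE JOB $;
    DATALINES;
1 41 1200 admin
0 35 . tech
0 29 -50 admin
1 55 300 retired
0 47 800 tech
1 33 . retired
0 60 150 retired
1 38 950 tech
0 52 400 admin
1 26 700 admin
0 44 1100 tech
1 58 50 retired
;
RUN;

/* REPONLY replaces missing values only and leaves the other values
   unstandardized; METHOD=MEAN uses the variable mean as the replacement. */
PROC STDIZE DATA=WORK.MODEL_TOY OUT=WORK.MODEL_IMP METHOD=MEAN REPONLY;
    VAR BALANCE;
RUN;

/* JOB has more than two categories, so it goes in the CLASS statement.
   EVENT='1' states explicitly that the model predicts DEPOSIT = 1. */
PROC LOGISTIC DATA=WORK.MODEL_IMP;
    CLASS JOB / PARAM=REF;
    MODEL DEPOSIT(EVENT='1') = AGE BALANCE JOB;
RUN;
```

Stating EVENT= explicitly is a design choice worth keeping: by default PROC LOGISTIC models the lower ordered response level, which for a 0/1 variable would be the probability of 0.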