Question

Please use R Programming and R Studio for this question.

Link to the file of the data for this question: https://drive.google.com/file/d/18kGNrHUfgcVv2hMKqL5E05L40xCl6e1M/view?usp=sharing

Problem 1 (12 points): For this question, we will use the US census dataset from 1994, which is in adult.csv.

a. Show the descriptive and summary statistics for this dataset. Based on those metrics, what can we say about the distribution of age and education-num? Hint: Create a histogram. [4 points]

b. Now create a scatterplot matrix of the numerical variables. Are there any strong correlations between any two variables? If so, what are they? Hint: Refer to the Tutorial 2 scatterplot section. [2 points]

c. Based on descriptive and summary statistics and box plots for age, education-num and hours-per-week, are there any differences between males and females? Hint: Use the "filter" function from the tidyverse to create male and female subsets, and create box plots following Tutorial 2. [5 points]

Question 2: We will use SVM in this problem, showing how it often gets used even when the data are not suitable, by first engineering the numerical features we need. There is a Star Wars dataset in the dplyr library. Load that library and you will be able to see it (head(starwars)).

a. There are some variables we will not use, so first remove films, vehicles, starships and name. Also remove rows with missing values.

b. Several variables are categorical. We will use dummy variables to make it possible for SVM to use these. Show the resulting head of the dummy variables, including the target column gender.

c. Use SVM to predict gender and report the accuracy. First, create the dataset with 66% training and 34% testing and a seed of 94 for the random partitioning.

d. Given that we have so many variables, it makes sense to consider using PCA. Run PCA on the data and determine an appropriate number of components to use from the graph. Create a reduced version of the data with that number of principal components by first finding and removing near-zero-variance predictors using the following code: nzv <- nearZeroVar(numeric_train) ... filtered ...
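As a starting point for Question 2, steps a to c, here is a minimal sketch rather than a definitive solution. It rests on several assumptions that are not in the question: the dummy variables are built with base R's model.matrix() instead of a dedicated helper such as caret's dummyVars(), the 66/34 split uses sample() rather than a stratified partition, and the SVM comes from the e1071 package with a linear kernel. The names sw, sw_dummy, train, test and svm_fit are illustrative only.

```r
# Sketch of Question 2, steps a-c. Assumes dplyr and e1071 are installed.
library(dplyr)
library(e1071)

# a. Drop the unused columns, then drop rows with missing values.
sw <- starwars %>%
  select(-films, -vehicles, -starships, -name) %>%
  na.omit()

# b. Dummy-encode the categorical predictors with model.matrix()
#    (assumption: the course may expect caret::dummyVars() instead).
#    gender is kept aside as the target column.
predictors <- sw %>% select(-gender)
dummies <- model.matrix(~ . - 1, data = predictors)
sw_dummy <- data.frame(dummies, gender = factor(sw$gender))
head(sw_dummy)

# c. 66% training / 34% testing split with seed 94
#    (assumption: a plain random sample, not a stratified partition).
set.seed(94)
train_idx <- sample(nrow(sw_dummy), size = round(0.66 * nrow(sw_dummy)))
train <- sw_dummy[train_idx, ]
test  <- sw_dummy[-train_idx, ]

# Fit a linear-kernel SVM and report accuracy on the held-out rows.
svm_fit <- svm(gender ~ ., data = train, kernel = "linear")
pred <- predict(svm_fit, newdata = test)
accuracy <- mean(pred == test$gender)
accuracy
```

Because very few rows survive the missing-value filter, many dummy columns are likely to be nearly constant, so svm() may warn about columns it cannot scale. That sparsity is also what motivates the near-zero-variance filtering and PCA in step d.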
