Question
Subject: data mining
Please answer questions 4, 5, 6, 7, and 8.
In this last exploration we're going to make use of the excellent distance metrics provided with Scikit-learn. In class we talked about several distance metrics, one of the most useful for numeric data being Euclidean, defined by: $D_{euclidean}(x, y) = \sqrt{ \sum_i ( x_i -y_i )^2 }$.
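As a quick worked example of the formula (the vectors here are arbitrary, chosen only for illustration), the Euclidean distance can be computed directly with NumPy:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# sqrt((1-4)^2 + (2-6)^2 + (3-3)^2) = sqrt(9 + 16 + 0) = 5.0
d = np.sqrt(np.sum((x - y) ** 2))
print(d)  # 5.0
```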
We're going to use the sklearn.metrics library to understand the similarity of certain diamonds to a reference set I will be giving you. As we go through this exercise, you'll build up intuitions that will carry you into clustering concepts and other methods of data exploration.
First, familiarize yourself with the sklearn.metrics.pairwise_distances and sklearn.preprocessing.normalize methods. These will be critical for completing these tasks. You will also need to read up on the pandas.get_dummies() method.
In class, I talked a bit about nominal and categorical versus numeric variables. Categorical variables take on two or more named categories, like "Male" or "Female", or color grades like "E", "G", or "H". Most machine learning and data mining algorithms do not deal easily with such variables as strings, so we will often have to convert them to numeric or binary values, where 1 represents the presence and 0 the absence of the category, or to a scale of values like 1 to 10, etc.
Luckily, Pandas provides this capability for us with the pandas.get_dummies() method, which will automatically convert these categorical and nominal values to binary values. Sometimes this is also called binarizing or binarization. What the technique effectively does is map a series of outcome values, like color grades, to numerical indicator variables, with a 1 for the category that is present and 0 for all others. You may notice that this can create very large, sparse datasets if a variable has many possible outcomes.
You will need to load the diamonds dataset and convert the categorical variables (cut, color, clarity) to numeric. Once this is done, you will be able to complete the tasks below.
Once you have used get_dummies() to create binary (numeric) features for each category's presence or absence, you can perform more interesting operations, such as comparing one set of data to another.
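As a minimal sketch, assuming the dataset lives in a local diamonds.csv (the path is an assumption; adjust it to your copy), binarizing the three categorical columns might look like:

```python
import pandas as pd

# Load the diamonds dataset (the path is an assumption; adjust as needed).
diamonds = pd.read_csv("diamonds.csv")

# Binarize the categorical columns: each category becomes its own 0/1 column.
# dtype=int keeps the indicator columns numeric rather than boolean.
diamonds_binarized = pd.get_dummies(
    diamonds, columns=["cut", "color", "clarity"], dtype=int
)

print(diamonds_binarized.head())
```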
In the interest of being mindful of our computation restrictions on the Hub, we are going to randomly sample the data in order to develop some distance measures. There are a number of ways to get a random sample of rows from a DataFrame, but one of the easiest is the DataFrame.sample() method, which takes a parameter n indicating the size of the random sample. We will take 10% of the total data, about 5,400 rows, to be kind to the cloud computational resources. If you run this on your own machine, you may increase that to a much larger fraction, say 30%.
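For example, continuing from the binarized frame above (the random_state is an optional choice for reproducibility, not something the prompt requires):

```python
# Draw 5,400 rows (~10% of the ~54,000 diamonds) at random; a fixed
# random_state makes the sample reproducible across runs.
sample = diamonds_binarized.sample(n=5400, random_state=42)
```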
Now you will perform a final step: normalization. Normalization is the process of bringing data values into a common scale. Scikit-learn offers two common methods, the L1-norm and the L2-norm. The L1-norm is also known as the least absolute deviations or least absolute errors method and attempts to minimize $\sum_i |y_i - f(x_i)|$, where $f(x_i)$ is the estimated value for the target $y_i$. The L2-norm is known as least squares and minimizes $\sum_i (y_i - f(x_i))^2$, the sum of squares. Each method has its advantages and disadvantages, but the L2-norm is the stable and computationally efficient default within the sklearn.preprocessing.normalize() method, so we will stay with that default in this part of the exercise.
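Continuing the sketch, L2 row-normalization might look like this; wrapping the result back into a DataFrame is optional but keeps the column names:

```python
import pandas as pd
from sklearn.preprocessing import normalize

# normalize() scales each row to unit L2 length and returns a NumPy array,
# so wrap it back into a DataFrame to keep the column names and index.
normalized = pd.DataFrame(
    normalize(sample, norm="l2"),  # norm="l2" is the default, shown explicitly
    columns=sample.columns,
    index=sample.index,
)
```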
1. Once the data is normalized and your sample has been made, please save the resulting DataFrame to a CSV file called normalized_datasample_5400k.txt (this step appears in the sketch after this list). Also make sure the head of this dataset is displayed in your notebook.
2. NOTE: We have sample data rows in sample_data.csv for the next questions. You might find it just as easy to insert these rows into your original dataset and normalize when answering the questions, but there are other ways of doing the same thing.
3. For diamond #2 (the 0.38 carat, Ideal, G, VS1, $759 diamond), please find the 5 most similar diamonds in the sample set that you have. You will need to learn to use the sklearn.metrics.pairwise_distances method to obtain the distances and the numpy.argsort() method to determine the indices of the 5 closest diamonds (see the sketch after this list). Use the default Euclidean metric to perform the distance calculation.
4. Run normalization again, but drop price from the sample before normalizing, and find the 5 closest diamonds (also sketched after this list). The price per carat for the diamond we were looking at is $1,946.15. How does this compare with the price per carat of the 5 most similar diamonds? You will obviously need to keep the original dataset in order to determine the prices of the 5 most similar diamonds that you find.
5. When looking at the top 5, what commonalities do they have with sample diamond #2? What differences?
6. Perform the same analysis on diamond #14 (1.13 carat, Ideal, F, VS2, $6,283). Again, how does the price per carat compare? Use the data you have as evidence to back up your answer.
7. What are the similarities and differences amongst the top 5? Please be specific in your answer.
8. Provide a reason (an intuition will do) for dropping the price feature from the data.
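A minimal sketch of the workflow for tasks 1, 3, and 4, continuing from the snippets above. The position of the reference diamond is a placeholder assumption: the code takes the first row, so adjust it to wherever you inserted diamond #2 from sample_data.csv before normalizing.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import normalize

# Task 1: save the normalized sample under the filename given in the prompt
# and show the head in the notebook.
normalized.to_csv("normalized_datasample_5400k.txt", index=False)
print(normalized.head())

# Task 3: Euclidean distances from the reference diamond to every sampled row.
# iloc[[0]] is a placeholder; point it at the row where you inserted
# diamond #2 from sample_data.csv before normalizing.
reference = normalized.iloc[[0]]
dists = pairwise_distances(reference, normalized, metric="euclidean").ravel()

# argsort returns indices in ascending distance order; position 0 is the
# reference itself (distance 0), so take the next five.
closest5 = np.argsort(dists)[1:6]
print(sample.iloc[closest5])

# Task 4: drop price, re-normalize, and look up price per carat in the
# original (un-normalized) rows of the 5 closest diamonds.
no_price = normalize(sample.drop(columns=["price"]), norm="l2")
dists_np = pairwise_distances(no_price[[0]], no_price).ravel()
closest5_np = np.argsort(dists_np)[1:6]
print((sample["price"] / sample["carat"]).iloc[closest5_np])
```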