

Posted on Sep 25, 2024


It is important to define or select similarity measures in data mining applications such as clustering, outlier analysis, and nearest-neighbor classification. However, studies show that no single similarity measure consistently outperforms the others in all situations. Nonetheless, seemingly different similarity measures may be equivalent after some transformations. Let us consider the 5 data objects in Table 1:

	skin	insu	mass	pedi
x₁	19	88	33.6	0.627
x₂	20	188	26.6	0.351
x₃	28	128	23.3	0.672
x₄	21	94	28.1	0.167
x₅	34	168	43.1	2.288

Table 1: Diabetes

Attribute information is listed below:

Triceps skin fold thickness in mm (skin): the minimum value is 0 and the maximum value is 99.

2-Hour serum insulin in mu U/ml (insu): the minimum value is 0 and the maximum value is 850.

Body mass index measured as weight in kg/(height in m)^2 (mass): the minimum value is 0 and the maximum value is 70.0.

Diabetes pedigree function (pedi): the minimum value is 0.05 and the maximum value is 2.50.

Given a new object (20, 98, 25.6, 0.201) as a query, rank the objects in Table 1 based on similarity with the query using Cosine similarity. Then, identify which of the following is a true statement about the ranking.

	a.	x2, x1, x3, x4, x5
	b.	x1, x2, x5, x3, x4
	c.	x3, x2, x5, x4, x1
	d.	x3, x1, x2, x4, x5
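The cosine ranking requested above can be checked with a short script. This is a minimal sketch, assuming the raw (unnormalized) attribute values are used; the question also lists attribute ranges, so min-max normalization may be intended in other parts of the exercise and could change the ranking. The names `objects`, `query`, `cosine_similarity`, and `rank_by_cosine` are illustrative, not from the original. Note that the question does not say whether the answer options are listed from most to least similar or the reverse.

```python
import math

# Table 1 rows as (skin, insu, mass, pedi), plus the query object.
objects = {
    "x1": (19, 88, 33.6, 0.627),
    "x2": (20, 188, 26.6, 0.351),
    "x3": (28, 128, 23.3, 0.672),
    "x4": (21, 94, 28.1, 0.167),
    "x5": (34, 168, 43.1, 2.288),
}
query = (20, 98, 25.6, 0.201)


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_cosine(objs, q):
    """Return object names ordered from most to least similar to q."""
    return sorted(objs, key=lambda name: cosine_similarity(objs[name], q), reverse=True)


# Assumption: raw values, no normalization.
for name in rank_by_cosine(objects, query):
    print(name, round(cosine_similarity(objects[name], query), 5))
```

Because insu has by far the largest magnitude, it dominates the raw cosine values, which is one reason normalization is often considered before comparing objects like these.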

Suppose the test scores of a group of 12 students (72, 50, 21, 65, 97, 36, 85, 69, 70, 77, 88, and 93) are partitioned into four intervals using the equal-width approach. Do the partition, and identify the second smallest and second largest values (among those values that appear in the list above) in each of the intervals. Then, find the true statement in the list below.

	a.	36 is the second smallest value in its interval.
	b.	88 is the second largest value in its interval.
	c.	72 is the second largest value in its interval.
	d.	93 is the second smallest value in its interval.
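Equal-width partitioning can be sketched as follows. This is a minimal sketch, assuming the range runs from the observed minimum (21) to the observed maximum (97), so each of the four intervals has width 19; a different convention (for example a 0 to 100 range) would give different intervals. The function name `equal_width_bins` is illustrative.

```python
scores = [72, 50, 21, 65, 97, 36, 85, 69, 70, 77, 88, 93]


def equal_width_bins(values, k):
    """Split values into k intervals of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        # The maximum value would land in bin k, so clamp it into the last bin.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins


for i, members in enumerate(equal_width_bins(scores, 4), start=1):
    print(f"interval {i}: {members}")
```

Each printed interval is sorted, so the second smallest and second largest members can be read off directly.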

Suppose the test scores of a group of 12 students (72, 50, 21, 65, 97, 36, 85, 69, 70, 77, 88, and 93) are partitioned into four intervals using the equal-frequency approach. Do the partition, and identify the smallest and largest values (among those values that appear in the list above) in each of the intervals. Then, find the true statement in the list below.

	a.	36 is the smallest value in its interval.
	b.	65 is the largest value in its interval.
	c.	88 is the smallest value in its interval.
	d.	93 is the largest value in its interval.
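Equal-frequency partitioning sorts the values and cuts them into groups of equal size. The sketch below assumes 12 values in 4 intervals, so 3 values per interval; the function name `equal_frequency_bins` and the handling of leftovers (extra values go to the earliest intervals) are illustrative choices, not part of the original question.

```python
scores = [72, 50, 21, 65, 97, 36, 85, 69, 70, 77, 88, 93]


def equal_frequency_bins(values, k):
    """Split sorted values into k groups holding (nearly) the same number of items."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), k)
    bins, start = [], 0
    for i in range(k):
        # If len(values) is not divisible by k, the first `extra` bins get one more item.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


for i, members in enumerate(equal_frequency_bins(scores, 4), start=1):
    print(f"interval {i}: min={members[0]}, max={members[-1]}, members={members}")
```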

Suppose a hospital tested the age and body fat data for 18 randomly selected adults with the following result:

age	8	12	13	13	13	14	14	16	17	18	19	20	20	20	21	21	22	25
%fat	9.5	6.5	7.8	16.5	30.2	25.3	26.4	26.1	30.5	33.5	41.5	26.6	12.5	28.5	25.3	12.3	14.0	15.0

The five-number summary of a distribution (minimum, first quartile, median, third quartile, maximum) provides a good summary of the shape of the distribution. The five-number summary of the %fat data is:

	a.	7.80, 26.68, 30.70, 33.93, 41.50
	b.	8.00, 13.25, 17.50, 20.00, 25.00
	c.	6.50, 12.88, 25.30, 28.03, 41.50
	d.	41.50, 33.93, 28.78, 26.68, 7.80
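A five-number summary can be computed with the Python standard library. This is a minimal sketch, assuming quartiles are obtained by linear interpolation between sorted data points (the `"inclusive"` method of `statistics.quantiles`); other quartile conventions give slightly different values for Q1 and Q3. The function name `five_number_summary` is illustrative.

```python
import statistics

# %fat values from the table above.
fat = [9.5, 6.5, 7.8, 16.5, 30.2, 25.3, 26.4, 26.1, 30.5,
       33.5, 41.5, 26.6, 12.5, 28.5, 25.3, 12.3, 14.0, 15.0]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    # Assumption: "inclusive" quartiles, i.e. linear interpolation over the sorted sample.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


print(five_number_summary(fat))
```

Under this convention Q1 is 12.875 and Q3 is 28.025, which round to 12.88 and 28.03.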

Give any one other term used for:

Input variable:

Target variable:

Attribute:

Row:

In what ways is data mining different from statistics? Choose the correct statements from the following.

	a.	Statistics tends to employ simpler algorithms
	b.	Data mining tends to employ simpler algorithms
	c.	In classical statistical inference, the same sample is used to make an estimate, and also to determine how reliable that estimate might be. In data mining, different samples are used.
	d.	Data mining tends not to involve the strict limits around the question being addressed that classical inference requires.

From a statistical perspective, accurate models can be built in data mining with as few as several hundred records.

	a.	The statement is true because what we need to build a model is a 'good' representation of the population, which we can often get with a few hundred records.
	b.	The statement is contradictory to the idea of data mining sifting through large amounts of data to gain useful information and therefore false.
	c.	The statement is false because several hundred records are very unlikely to lead to an accurate model.
	d.	The statement is false because using a small number of records would lead to overfitting.

List, in the correct order, the essential steps for building a data mining model.

- 1. 2. 3. 4. 5. Run several modeling techniques, choosing one on the basis of its performance on the validation data. Results with the test data are an indicator of how well it will do with the rest of the database.
- 1. 2. 3. 4. 5. Sampling from a larger database.
- 1. 2. 3. 4. 5. Explore, Clean, Preprocess and Reduce the Data, including treatment of outliers and missing data.
- 1. 2. 3. 4. 5. Develop the understanding of variables and selection of variables for building a model.
- 1. 2. 3. 4. 5. Data partitioning into training, validation and test data sets.
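The partitioning step named in the list above can be illustrated with a short sketch. It assumes a random 60/20/20 split into training, validation, and test sets; the fractions, the fixed seed, and the function name `partition` are assumptions made for illustration, not values from the question.

```python
import random


def partition(records, train=0.6, valid=0.2, seed=0):
    """Shuffle records and split them into training, validation, and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains after the first two sets
    return training, validation, test


training, validation, test = partition(range(100))
print(len(training), len(validation), len(test))
```

Models are fitted on the training set and compared on the validation set, while the test set is held back to estimate how the chosen model will perform on the rest of the database.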