Question

Posted on Apr 24, 2024

How many training instances are in the dataset? How many test instances? How many features are in the training data? What is the distribution of labels in the training data? That is, what percentage of instances are 'Android' versus 'iPhone'?

Assignment overview

In this assignment, you will build a classifier that tries to infer whether tweets from @realDonaldTrump were written by Trump himself or by a staff member. This is an example of binary classification on a text dataset.

It is known that Donald Trump uses an Android phone, and it has been observed that some of his tweets come from Android while others come from other devices (most commonly iPhone). It is widely believed that Android tweets are written by Trump himself, while iPhone tweets are written by staff. For more information, you can read this blog post by David Robinson, written prior to the 2016 election, which finds a number of differences in the style and timing of tweets published from these two devices. (Some tweets are written from other devices, but for simplicity the dataset for this assignment is restricted to these two.)

This is a classification task known as "authorship attribution", which is the task of inferring the author of a document when the authorship is unknown. We will see how accurately this can be done with linear classifiers using word features.

In this assignment, you will experiment with perceptron and logistic regression in sklearn. Much of the code has already been written for you. We will use a class called SGDClassifier (which you should read about in the sklearn documentation), which implements stochastic gradient descent (SGD) for a variety of loss functions, including both perceptron and logistic regression, so it is an easy way to move between the two classifiers. The code further below will load the datasets.
There are two data collections: the "training" data, which contains the tweets that you will use for training the classifiers, and the "testing" data, which are tweets that you will use to measure the classifier accuracy. The test tweets are instances the classifier has never seen before, so they are a good way to see how the classifier will behave on new data. However, we still know the labels of the test tweets, so we can measure the accuracy.

For this problem, we will use what are called "bag of words" features, which are commonly used when doing classification with text. Each feature is a word, and the value of a feature for a particular tweet is the number of times the word appears in the tweet (with value 0 if the word does not appear in the tweet).

Run the block of code below to load the data. You don't need to do anything yet. Move on to "Problem 1" next.

    import pandas as pd
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    # training set: column 0 is the label, column 1 is the tweet text
    df_train = pd.read_csv('tweets.train.tsv', sep='\t', header=None)
    Y_train = df_train.iloc[0:, 0].values
    text_train = df_train.iloc[0:, 1].values

    # build the vocabulary from the training text and count words
    vec = CountVectorizer()
    x_train = vec.fit_transform(text_train)
    feature_names = np.asarray(vec.get_feature_names_out())

    # testing set: reuse the vocabulary learned from the training text
    df_test = pd.read_csv('tweets.test.tsv', sep='\t', header=None)
    Y_test = df_test.iloc[0:, 0].values
    text_test = df_test.iloc[0:, 1].values
    x_test = vec.transform(text_test)

Problem 1: Understand the data

Before doing anything else, take time to understand the code above. The variables df_train and df_test are dataframes that store the training and testing datasets, which are contained in tab-separated files where the first column is the label and the second column is the text of the tweet. The CountVectorizer class converts the raw text into a bag-of-words feature vector representation that sklearn can use.
You should print out the values of the variables and write any other code needed to answer the following questions.

Deliverable 1.1: How many training instances are in the dataset? How many test instances?

[your answer here]

    # Your code can go here

Deliverable 1.2: How many features are in the training data?

[your answer here]

    # Your code can go here

Deliverable 1.3: What is the distribution of labels in the training data? That is, what percentage of instances are 'Android' versus 'iPhone'?

[your answer here]

    # Your code can go here
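One possible way to approach the three deliverables is sketched below. It is an illustration, not the official solution: the helper name summarize_dataset and the tiny stand-in arrays are hypothetical, and the numbers printed here come from the stand-ins, not from the real tweet files. In the notebook, the same function would be called with x_train, x_test and Y_train from the loading code.

```python
import numpy as np

def summarize_dataset(X_train, X_test, y_train):
    """Return instance counts, feature count, and label percentages.

    Hypothetical helper: relies only on .shape, so it works for dense
    arrays and for the sparse matrices that CountVectorizer returns.
    """
    n_train, n_features = X_train.shape      # rows = instances, cols = features
    n_test = X_test.shape[0]
    labels, counts = np.unique(y_train, return_counts=True)
    percentages = {
        str(label): 100.0 * int(count) / len(y_train)
        for label, count in zip(labels, counts)
    }
    return {
        "n_train": n_train,
        "n_test": n_test,
        "n_features": n_features,
        "label_percentages": percentages,
    }

# Tiny stand-in data (NOT the real dataset) to show the output shape.
fake_X_train = np.zeros((4, 3))
fake_X_test = np.zeros((2, 3))
fake_y_train = np.array(["Android", "Android", "Android", "iPhone"])

summary = summarize_dataset(fake_X_train, fake_X_test, fake_y_train)
print(summary)
```

With the real variables the call would be summarize_dataset(x_train, x_test, Y_train). An equivalent check for Deliverable 1.3 is to look at the label column of df_train with pandas value_counts(normalize=True).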
