
Question


**PROVIDE SCREENSHOTS OF WHAT IS ASKED**

Run a recommender on the MovieLens dataset.

Create a directory for the MovieLens dataset and download it:

    mkdir MovieLens
    cd MovieLens
    wget http://rasinsrv07.cstcis.cti.depaul.edu/CSC555/ml-1m.zip

Unzip the dataset (this one happens to be compressed with Zip rather than GZip):

    unzip ml-1m.zip
    cd ..

Take a look at the data file:

    more MovieLens/ml-1m/ratings.dat

(You can press q or Ctrl-C to exit; the more command shows the first few lines of text. Each line contains the user ID, movie ID, user rating, and the timestamp of the rating, as already discussed in class.)

The next step is to use a Linux command to convert the ::-separated file into a comma-separated file. The first part (cat) simply outputs the file, the second part substitutes , for ::, and the third part extracts just the 3 attributes relevant to us (no timestamp):

    cat MovieLens/ml-1m/ratings.dat | sed -e s/::/,/g | cut -d, -f1,2,3 > MovieLens/ml-1m/ratings.csv

(NOTE: if you wanted to extract all 4 columns from the original data set, you could run the same command with 1,2,3,4 instead of 1,2,3.)

Create a movielens directory and copy the ratings over to HDFS into that directory:

    $HADOOP_HOME/bin/hadoop fs -mkdir movielens
    $HADOOP_HOME/bin/hadoop fs -put MovieLens/ml-1m/ratings.csv movielens

Split the data set into a 90% training set and a 10% evaluation set. In this case we are using Hadoop to perform the split. Naturally, you can change the percentages here to any other values instead of 0.9/0.1. bin/mahout will only work from the $MAHOUT_HOME directory, or you can adjust the path accordingly.

    bin/mahout splitDataset --input movielens/ratings.csv --output ml_dataset --trainingPercentage 0.9 --probePercentage 0.1 --tempDir dataset/tmp

Verify and report the file sizes of the input ratings.csv file and the two sampled files (the two files are in the /user/ec2-user/ml_dataset/trainingSet/ and /user/ec2-user/ml_dataset/probeSet/ directories on the HDFS side). Do the sampled file sizes add up to the original input file size?
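The sed/cut conversion above can be sanity-checked on a single line before running it over the whole ratings file. The sample line below is hypothetical but follows the userID::movieID::rating::timestamp format of ratings.dat:

```shell
# Hypothetical line in the ratings.dat format: userID::movieID::rating::timestamp
line='1::1193::5::978300760'
# Replace every "::" with "," then keep only the first three fields (drop the timestamp)
echo "$line" | sed -e 's/::/,/g' | cut -d, -f1,2,3
# → 1,1193,5
```

Passing -f1,2,3,4 to cut instead would keep the timestamp column as well, matching the NOTE above.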
Factorize the rating matrix based on the training set. As always, this is a single-line command, so be sure to run it as such. The --numFeatures value configures the number of hidden variables, i.e. the dimension size to use in the matrix factorization. --numIterations sets how many passes to perform; we expect a better fit with more iterations.

    time bin/mahout parallelALS --input ml_dataset/trainingSet/ --output als/out --tempDir als/tmp --numFeatures 20 --numIterations 3 --lambda 0.065

Measure the prediction quality against the probe set:

    bin/mahout evaluateFactorization --input ml_dataset/probeSet/ --output als/rmse/ --userFeatures als/out/U/ --itemFeatures als/out/M/ --tempDir als/tmp

What is the resulting RMSE value? (See the rmse.txt file in /user/ec2-user/als/rmse/ on HDFS.)

Finally, let's generate some predictions:

    bin/mahout recommendfactorized --input als/out/userRatings/ --output recommendations/ --userFeatures als/out/U/ --itemFeatures als/out/M/ --numRecommendations 6 --maxRating 5

Look at recommendations/part-m-00000 and report the first 10 rows by running the following command. These are the top-6 recommendations (note the --numRecommendations setting in the previous command) for each user. Each recommendation consists of a movie ID and the estimated rating that the user might give to that movie.

    $HADOOP_HOME/bin/hadoop fs -cat recommendations/part-m-00000 | head

What is the top movie recommendation (movie ID) for users 4, 5, and 6?
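To pull just the top movie ID for a given user out of the recommendation output, a short sed pipeline is enough. This is a sketch only: it assumes output lines of the form userID<TAB>[movieID:rating,movieID:rating,...], and the IDs and ratings below are made-up placeholders, not actual results:

```shell
# Hypothetical recommendation line: userID, a tab, then [movieID:estimatedRating,...]
# sorted best-first, so the first movieID inside the brackets is the top pick
printf '4\t[2905:4.92,318:4.87,858:4.85]\n' | sed -e 's/.*\[//' -e 's/:.*//'
# → 2905
```

The first sed expression strips everything up to and including the opening bracket; the second cuts at the first colon, leaving only the leading movie ID.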


