Consider the KASANDR data set from the UCI Machine Learning Repository, which can be downloaded from https://archive.ics.uci.edu/ml/machine-learning-databases/00385/de.tar.bz2.

This archive file is about 900 MB in size, so it may take a while to download.

Uncompressing the file (e.g., via 7-Zip) yields a directory de containing two large CSV files: test_de.csv and train_de.csv, with sizes 372 MB and 3 GB, respectively. Files of this size can still be processed efficiently in pandas, provided there is enough memory. The files contain records of user information from Kelkoo web logs in Germany as well as meta-data on users, offers, and merchants. The data sets have 7 attributes and 1919561 and 15844717 rows, respectively. The data sets are anonymized via hex strings.

(a) Load train_de.csv into a pandas DataFrame object de, using read_csv('train_de.csv', delimiter = '\t').

If not enough memory is available, load test_de.csv instead. Note that entries are separated here by tabs, not commas. Time how long it takes for the file to load, using the time package. (It took 38 seconds for train_de.csv to load on one of our computers.)
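One possible way to approach part (a) is sketched below. The helper name load_timed is hypothetical and introduced only for illustration; it wraps read_csv between two readings of time.perf_counter from the time package and returns both the DataFrame and the elapsed seconds. This is a minimal sketch, not the only way to time the load.

```python
import time

import pandas as pd


def load_timed(path):
    """Read a tab-separated file; return (DataFrame, elapsed seconds).

    load_timed is a hypothetical helper name, not part of pandas.
    """
    start = time.perf_counter()             # high-resolution wall-clock timer
    df = pd.read_csv(path, delimiter='\t')  # entries are separated by tabs
    elapsed = time.perf_counter() - start
    return df, elapsed
```

Calling de, seconds = load_timed('train_de.csv') from the directory de would then give the DataFrame de together with the load time; substitute 'test_de.csv' if memory is short. The measured time depends on the machine and on whether the file is already cached by the operating system.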

(b) How many unique users and merchants are in this data set?
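For part (b), a minimal sketch using the nunique method of a pandas Series follows. The column names 'userid' and 'merchant' are assumptions about the file header and should be checked against de.columns after loading; count_unique is a hypothetical helper name.

```python
import pandas as pd


def count_unique(df, columns=('userid', 'merchant')):
    """Return {column: number of distinct non-missing values}.

    The default column names are assumed, not confirmed by the exercise text.
    """
    return {col: df[col].nunique() for col in columns}
```

With the DataFrame from part (a), count_unique(de) returns a dictionary holding the two counts. Note that nunique ignores missing values by default.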
