Question

Posted on Sep 26, 2024

I need help with this.

In this project, students are to program some data pre-processing techniques on gene expression datasets.

The dataset (p1data.csv) provided in the project folder contains 62 samples collected from colon-cancer patients divided into two classes; there are 22 positive tuples and 40 negative ones. Each tuple (row) consists of the readings for the genes and the class (which is the last column) on one biopsy. Each gene is an attribute. The columns are separated by ",". We number the genes 1 to N in the left-to-right order; we refer to the genes using gi where i is a column number; for example the first gene (column) is called g1.

Your program should work on other datasets with a similar format, but they may have different numbers of rows and columns (and perhaps different class names). [You can assume that there are exactly two classes.] Your program may need to scan the data to determine the number of genes and the number of instances/rows.
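To illustrate the scan mentioned above, here is a minimal sketch of reading the file once and deriving the number of genes and rows from it. The class name GeneData and its methods are hypothetical, not part of the assignment. It assumes there is no header row, that every row has the same number of columns, and that the class label is the last column.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for the parsed dataset (name is an assumption).
final class GeneData {
    final double[][] values; // values[row][gene]; gene index 0 corresponds to g1
    final String[] labels;   // class label of each row (taken from the last column)

    GeneData(double[][] values, String[] labels) {
        this.values = values;
        this.labels = labels;
    }

    int rows() { return labels.length; }

    int genes() { return values.length == 0 ? 0 : values[0].length; }

    // Parses comma-separated lines; assumes no header row and a label in the last column.
    static GeneData parse(Iterable<String> lines) {
        List<double[]> rows = new ArrayList<>();
        List<String> labels = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) continue; // skip blank lines
            String[] parts = line.split(",");
            int n = parts.length - 1;            // every column except the last is a gene
            double[] row = new double[n];
            for (int i = 0; i < n; i++) {
                row[i] = Double.parseDouble(parts[i].trim());
            }
            rows.add(row);
            labels.add(parts[n].trim());
        }
        return new GeneData(rows.toArray(new double[0][]), labels.toArray(new String[0]));
    }

    // Reads the whole file, then parses it; I/O errors are rethrown unchecked.
    static GeneData readFile(String path) {
        try {
            return parse(Files.readAllLines(Paths.get(path)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this sketch, GeneData.readFile("p1data.csv") would give rows() and genes() without hard-coding 62 or any column count, and the class names stay as plain strings, so different label names should still work.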

Your (compiled) program will be run using the following command-line command:

java P1DM datafilename k m

Task 1. Discretize, rank, and select the top-k genes of the data using the entropy-based method (for 2 intervals). This task produces three files; only the k highest ranked genes (in information gain order) will be included in these files:

(a) A file (called geneRankEntropy.txt) containing the entropy-based ranking of the k genes, in decreasing information gain order. Each row of the file should contain the gene number, the split value determined by the entropy-based binning method for the gene, and the information gain of the split.
(b) A file (called entropyItemMap.txt), where each row contains a tuple of the form (gi, lb, rb, j), where gi is a gene ID, lb is the left bound and rb is the right bound of an interval/bin, and j is the integer to be used to represent the interval in the itemized data.
(c) A file (called itemizedDataEntropy.txt) containing the itemized data for the top-k genes.
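The two-interval entropy-based split for a single gene could be sketched roughly as follows. This is one common reading of the method, not necessarily the exact variant the course expects. The assumptions are that candidate split points are the midpoints between consecutive distinct sorted values, that entropy uses log base 2, and that a value equal to or below the split falls into the left interval. The class name EntropySplit and the method best are hypothetical.

```java
import java.util.Arrays;

// Hypothetical result type: best split value and its information gain for one gene.
final class EntropySplit {
    final double split;
    final double gain;

    EntropySplit(double split, double gain) {
        this.split = split;
        this.gain = gain;
    }

    // Entropy (base 2) of a two-class group with a and b members.
    static double entropy(int a, int b) {
        int n = a + b;
        double h = 0.0;
        for (int c : new int[] {a, b}) {
            if (c > 0) {
                double p = (double) c / n;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    // values[i] is the gene reading of row i; positive[i] tells which of the two classes row i has.
    // Returns split = NaN and gain = -1 when all values are equal (no split is possible).
    static EntropySplit best(double[] values, boolean[] positive) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        // Sort row indices by gene value so class counts can be accumulated left to right.
        Arrays.sort(order, (Integer x, Integer y) -> Double.compare(values[x], values[y]));

        int totalPos = 0;
        for (boolean p : positive) if (p) totalPos++;
        double base = entropy(totalPos, n - totalPos);

        double bestSplit = Double.NaN;
        double bestGain = -1.0;
        int leftPos = 0;
        for (int k = 0; k < n - 1; k++) {
            if (positive[order[k]]) leftPos++;
            double lo = values[order[k]];
            double hi = values[order[k + 1]];
            if (lo == hi) continue; // no boundary between equal values
            int leftN = k + 1;
            int rightN = n - leftN;
            int rightPos = totalPos - leftPos;
            // Weighted entropy of the two intervals after splitting here.
            double after = (leftN * entropy(leftPos, leftN - leftPos)
                    + rightN * entropy(rightPos, rightN - rightPos)) / n;
            double gain = base - after;
            if (gain > bestGain) {
                bestGain = gain;
                bestSplit = (lo + hi) / 2.0; // midpoint convention (assumption)
            }
        }
        return new EntropySplit(bestSplit, bestGain);
    }
}
```

Running best once per gene, sorting the genes by gain in decreasing order, and keeping the first k would give the ranking for file (a). The split value then defines the two intervals written to file (b), and replacing each reading by the integer of its interval gives file (c). How the integers j are numbered across genes is not stated in the question, so that part is left open here.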

You need to write your program in Java. When compiled, your program will produce an executable program called P1DM. The program should be able to do the two tasks described above using a command of the form java P1DM datafilename 3 4 when the datafilename file is in the same folder as the program. Here 3 is k (the number of genes) and 4 is m (the number of intervals for equidensity).
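Reading the three command-line arguments might look something like the sketch below. The helper class P1Args is hypothetical. In the real submission this logic would sit in the main method of P1DM. Requiring k and m to be at least 1 is an assumption, since the question does not state the valid ranges.

```java
// Hypothetical helper for the arguments of: java P1DM datafilename k m
final class P1Args {
    final String dataFile; // path of the CSV file, e.g. p1data.csv
    final int k;           // number of top-ranked genes to keep
    final int m;           // number of intervals for equidensity binning

    P1Args(String dataFile, int k, int m) {
        this.dataFile = dataFile;
        this.k = k;
        this.m = m;
    }

    // Throws IllegalArgumentException on a wrong argument count or a non-positive k or m;
    // Integer.parseInt throws NumberFormatException (a subclass) on non-numeric input.
    static P1Args parse(String[] args) {
        if (args.length != 3) {
            throw new IllegalArgumentException("Usage: java P1DM datafilename k m");
        }
        int k = Integer.parseInt(args[1]);
        int m = Integer.parseInt(args[2]);
        if (k < 1 || m < 1) {
            throw new IllegalArgumentException("k and m must be positive integers");
        }
        return new P1Args(args[0], k, m);
    }
}
```

For the example command java P1DM datafilename 3 4, parse would yield k = 3 and m = 4. It would also make sense to cap k at the number of genes found in the file, since a dataset could have fewer than k columns.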