Question

Posted on Sep 22, 2024

To carry out this assignment, make sure that word truncation is enabled.

Formal concepts and Galois lattices

In the "Galois lattice" sense, and in our particular context, a formal concept is simply a set of words paired with a set of documents such that the documents are exactly those that contain all the words and, conversely, the words are exactly those found in all the documents. For example, if the texts A, B and C are the only ones that contain both steam and house, and they share no other word, then the sets (steam, house) and (A,B,C) form a formal concept. A Galois (or concept) lattice is built from all the concepts found, and it represents the correspondence between the documents and the words.

For example, consider the following occurrence matrix:

      house   steam
A
B               X
C       X       X
D

In this matrix, the concept grouping the most words is (C) x (steam, house). The concept grouping the most documents, apart from (A,B,C,D) itself, is (B,C) x (steam).
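The concepts of such a small matrix can be enumerated by brute force. The sketch below is only an illustration, not a method given in the assignment: it assumes that documents A and D contain neither word, and the class and method names (FormalConcepts, extent, intent, concepts, example) are hypothetical. It closes every subset of the vocabulary, so it is only practical for a handful of words.

```java
import java.util.*;

class FormalConcepts {
    // Documents containing every word in `words` (all documents if `words` is empty).
    static Set<String> extent(Map<String, Set<String>> docs, Set<String> words) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : docs.entrySet())
            if (e.getValue().containsAll(words)) result.add(e.getKey());
        return result;
    }

    // Words shared by every document in `names` (the whole vocabulary if `names` is empty).
    static Set<String> intent(Map<String, Set<String>> docs, Set<String> names,
                              Set<String> vocabulary) {
        Set<String> result = new TreeSet<>(vocabulary);
        for (String name : names) result.retainAll(docs.get(name));
        return result;
    }

    // Brute force: for each subset of the vocabulary, compute its extent and
    // then the intent of that extent. Returns a map from extent to intent.
    static Map<Set<String>, Set<String>> concepts(Map<String, Set<String>> docs) {
        Set<String> vocabulary = new TreeSet<>();
        for (Set<String> w : docs.values()) vocabulary.addAll(w);
        List<String> vocab = new ArrayList<>(vocabulary);
        Map<Set<String>, Set<String>> found = new LinkedHashMap<>();
        for (int mask = 0; mask < (1 << vocab.size()); ++mask) {
            Set<String> words = new TreeSet<>();
            for (int i = 0; i < vocab.size(); ++i)
                if ((mask & (1 << i)) != 0) words.add(vocab.get(i));
            Set<String> ext = extent(docs, words);
            found.put(ext, intent(docs, ext, vocabulary));
        }
        return found;
    }

    // The occurrence matrix of the example; A and D are ASSUMED to be empty rows.
    static Map<String, Set<String>> example() {
        Map<String, Set<String>> docs = new TreeMap<>();
        docs.put("A", new TreeSet<>());
        docs.put("B", new TreeSet<>(Arrays.asList("steam")));
        docs.put("C", new TreeSet<>(Arrays.asList("house", "steam")));
        docs.put("D", new TreeSet<>());
        return docs;
    }

    public static void main(String[] args) {
        for (Map.Entry<Set<String>, Set<String>> c : concepts(example()).entrySet())
            System.out.println(c.getKey() + " x " + c.getValue());
    }
}
```

Under that assumption, the enumeration yields three concepts: all four documents with no word, (B,C) with steam, and (C) with both words.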

Reminder: timing a task in Java

When you are asked to measure how long a task takes to run on your machine in Java, you can use the following sample code:

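long before = System.currentTimeMillis();
myFunction(); // the task being timed
long after = System.currentTimeMillis();
// the difference is in milliseconds; divide by 1000.0 to get seconds
System.out.println("It took " + (after - before) / 1000.0 + " seconds.");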

Obviously, you should avoid using other applications while your program is running!

Sometimes, to get a better estimate of the time, you can repeat the work in a loop and take the average, as in this example:

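long before = System.currentTimeMillis();
for (int k = 0; k < 1000; ++k)
    myFunction(); // the task being timed, repeated 1000 times
long after = System.currentTimeMillis();
System.out.println("It took " + (after - before) / 1000.0 + " seconds.");
// milliseconds -> seconds (1000.0), then divided by the 1000 repetitions
System.out.println("An average of " + (after - before) / (1000 * 1000.0) + " seconds.");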
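The averaging loop above can be wrapped in a small reusable helper. The following is a minimal sketch under stated assumptions: the class name Timing and the method averageSeconds are hypothetical, and it swaps in System.nanoTime() for System.currentTimeMillis(), since nanoTime() is the standard call intended for measuring elapsed time. The workload in main is a placeholder, not the task from the assignment.

```java
class Timing {
    // Runs `task` `runs` times and returns the average duration of one run, in seconds.
    static double averageSeconds(Runnable task, int runs) {
        long before = System.nanoTime();
        for (int k = 0; k < runs; ++k)
            task.run();
        long after = System.nanoTime();
        return (after - before) / 1e9 / runs; // nanoseconds -> seconds, then per run
    }

    public static void main(String[] args) {
        // Hypothetical workload standing in for the task being measured.
        Runnable work = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10_000; ++i) sb.append(i);
        };
        System.out.println("An average of " + averageSeconds(work, 1000) + " seconds.");
    }
}
```

Passing the task as a Runnable keeps the timing logic in one place, so the same helper can time both indexing and searching.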

Reminder: read a list in a text file

If you have a text file, you can read its lines in Java and put them in an array with the following code:

import java.io.*;
import java.util.*;

public class test {
    public static void main(String[] arg) throws IOException {
        List<String> rows = new ArrayList<>();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader sr = new BufferedReader(new FileReader(arg[0]))) {
            String line;
            while ((line = sr.readLine()) != null)
                rows.add(line);
        }
        // copy the lines into an array and print them
        String[] li = rows.toArray(new String[0]);
        for (int k = 0; k < li.length; ++k)
            System.out.println(li[k]);
    }
}

Reminder: access to files

Given an instance of the java.io.File class, you can check whether it is a folder with the isDirectory() method. If it is, you can access the files it contains with the listFiles() method. If it is a regular file, you can check its extension with a call like getName().endsWith(".txt").

The getPath() method returns the file path, which is useful when indexing. For example, you can compose a Lucene field like this:

new StringField("path", file.getPath(), Field.Store.YES);
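The File methods mentioned above can be combined to walk a folder tree. The sketch below is one possible way to do it, using only java.io.File; the class name TextFiles and the methods collect and totalBytes are hypothetical names chosen for illustration. It gathers every .txt file under a starting folder and sums their sizes with File.length().

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class TextFiles {
    // Adds every file under `root` whose name ends with ".txt" to `found`.
    static void collect(File root, List<File> found) {
        if (root.isDirectory()) {
            File[] children = root.listFiles(); // null if the folder cannot be read
            if (children == null) return;
            for (File child : children) collect(child, found);
        } else if (root.getName().endsWith(".txt")) {
            found.add(root);
        }
    }

    // Sum of the sizes, in bytes, of the given files.
    static long totalBytes(List<File> files) {
        long total = 0;
        for (File f : files) total += f.length();
        return total;
    }

    public static void main(String[] args) {
        // Starts from the folder given on the command line, or the current folder.
        List<File> found = new ArrayList<>();
        collect(new File(args.length > 0 ? args[0] : "."), found);
        System.out.println(found.size() + " .txt files, " + totalBytes(found) + " bytes");
    }
}
```

The same totalBytes helper can be applied to the files of an index folder to measure its size on disk.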

Reminder: Read the contents of a file

You can read the contents of a file in Java with the java.io.FileReader class. For performance reasons, it is most often wrapped in a BufferedReader, like this:

Reader reader = new BufferedReader(new FileReader(file));

Lucene allows you to index a file using an instance of the Reader class, like this:

new TextField("content",reader);

Problem statement

- First index only 10 documents, then 20, then 30, then all the documents with the .txt extension on a CD-ROM. Measure the time required for each of these indexing steps, the size (on disk) of each index, and the corresponding total size of the indexed files. Then, for each index, run 10 keyword searches 100 times in a loop and divide the total time by 1000 to obtain the average search time. Make a five-column table with the following headings: number of files, total file size, indexing time, index size, and average search time.

- From the table of the previous exercise, can you tell if the performance of Lucene decreases with the size of the index? Suppose your index is 1000 times larger, approximately what would be the average search time?

- From your table, estimate the size of the index and the time needed to build it if you had to index 8 billion documents averaging 20 KB each. Is this feasible? What would you do if you were given the mandate to do it?
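The extrapolations asked for above start from simple arithmetic. The sketch below is only one way to set it up: the class name Extrapolation and its methods are hypothetical, 1 KB is taken as 1000 bytes, and the linear scaling is an assumption to be checked against your own table, not a property guaranteed by Lucene. The measured values come from the command line, so no figures are invented here.

```java
class Extrapolation {
    // Corpus size in bytes for `documents` files of `averageBytes` each.
    static long corpusBytes(long documents, long averageBytes) {
        return documents * averageBytes;
    }

    // Scales a measured quantity (index size, indexing time) in proportion
    // to corpus size. Linear growth is an ASSUMPTION, not a guarantee.
    static double scaleLinearly(double measured, long measuredCorpusBytes,
                                long targetCorpusBytes) {
        return measured * ((double) targetCorpusBytes / measuredCorpusBytes);
    }

    public static void main(String[] args) {
        // 8 billion documents of 20 KB each, taking 1 KB = 1000 bytes.
        long target = corpusBytes(8_000_000_000L, 20_000L);
        System.out.println("Target corpus: " + target / 1e12 + " TB");
        // Optional arguments: measured corpus size in bytes, then a measured value
        // from your own table (for example the indexing time in seconds).
        if (args.length == 2) {
            long measuredCorpus = Long.parseLong(args[0]);
            double measured = Double.parseDouble(args[1]);
            System.out.println("Linear estimate: "
                    + scaleLinearly(measured, measuredCorpus, target));
        }
    }
}
```

The target corpus works out to 160 TB, which already suggests that a single machine and a single index are unlikely to be enough.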