Lucene for Information Retrieval Research and Evaluation
In ~/lucene4ir/data, there are a number of folders containing different data sets (or parts thereof).
In ~/lucene4ir/src/, there are currently three apps: an Indexing Application (IndexerApp), a Retrieval Application (RetrievalApp) and an application that pulls out various statistics (ExampleStatsApp). Each of these apps is configured through an XML parameter file (see ~/lucene4ir/params for examples).
The code is based on the examples developed at https://github.com/isoboroff/trec-demo and https://github.com/lintool/Anserini.
We are using IntelliJ by JetBrains (https://www.jetbrains.com/idea/download/), so once you clone the repo you'll need to set up a few things so that you can compile and run the apps.
To clone the repository from GitHub using IntelliJ, follow the instructions at https://www.jetbrains.com/help/webstorm/cloning-a-repository-from-github.html
Assuming you are using IntelliJ, go to File, Project Structure, then the Project tab.
Next, go to the Modules tab and mark the directories as follows.
Also, click + Add Content Root and set it to ~/lucene4ir/
Then, go to the SDKs tab, and add in the jars in ~/lucene4ir/jars. We are using Lucene 6.2 for this demo code and JSoup 1.6.2.
Again assuming that you are using the IntelliJ IDE, create the following run configurations. Go to the Run menu and select Edit Configurations. In the top left-hand corner of the Run/Debug Configurations window, click the add button (+) to add a new configuration, and select Application. Repeat this for each of the following apps.
If the following error appears when running the app:
Error:java: javacTask: source release 1.8 requires target release 1.8
change the target bytecode version to 1.8. You can follow the steps explained at https://stackoverflow.com/questions/29888592/errorjava-javactask-source-release-8-requires-target-release-1-8
To execute an app, you can use the command line or terminal and run the following command
Check which directory you are in when you execute the jar, because it needs the files listed in data/cacm_file_list, which is resolved relative to the working directory.
Classes with a main method can be executed by right-clicking on the class name, selecting the Run As option and then selecting Java Application. If a class needs command line parameters, right-click and select Run Configurations. Add the parameters in the Arguments tab of the Run Configurations window. Example parameter arguments for IndexerApp and RetrievalApp are as follows:
IndexerApp: params/index_params.xml
RetrievalApp: params/retrieval_params.xml
Run mvn package from the shell; the executable jar will be in target/lucene4ir-0.0.1-SNAPSHOT.jar
TBA
Now that you have these applications set up, you can try them out. First, run the IndexerApp, which, given index_params.xml, will index the CACM collection. It will take about 30 seconds.
Then you can run the ExampleStatsApp, which will read in the CACM index and print out a number of different statistics. Finally, you can run the RetrievalApp, which will take a list of queries, run them against the index using BM25, and save the results to a result file (~/lucene4ir/data/cacm/bm25_results.res).
To evaluate the output you will need to download and install trec_eval from NIST, http://trec.nist.gov/trec_eval/
In ~/lucene4ir/data/cacm/, the file cacm.qrels lists the documents relevant to each query. Using trec_eval, we can measure precision, recall, etc.:
trec_eval ~/lucene4ir/data/cacm/cacm.qrels ~/lucene4ir/data/cacm/bm25_results.res
which will output something like:
num_q all 51
num_ret all 4317
num_rel all 795
num_rel_ret all 264
map all 0.1477
gm_ap all 0.0590
R-prec all 0.1798
bpref all 0.4347
recip_rank all 0.4573
. . . .
P5 all 0.2549
P10 all 0.2059
P15 all 0.1660
P20 all 0.1422
P30 all 0.1105
P100 all 0.0518
P200 all 0.0259
P500 all 0.0104
P1000 all 0.0052
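To make a couple of these measures concrete, here is a minimal Java sketch of how precision at k (the P5, P10, ... rows) and average precision (the per-query quantity behind map) can be computed for a single query. It is not trec_eval's implementation: the class name EvalSketch, the document ids and the relevance judgments are all made up for illustration, and this sketch always divides by k even when fewer than k documents were retrieved.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: a hypothetical class, not part of lucene4ir or trec_eval.
class EvalSketch {

    // Precision at k: the fraction of the top k ranked documents that are relevant.
    // Ranks past the end of the list are treated as non-relevant.
    static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        int hits = 0;
        for (int i = 0; i < Math.min(k, ranked.size()); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
            }
        }
        return (double) hits / k;
    }

    // Average precision: sum the precision at each rank where a relevant
    // document appears, then divide by the total number of relevant documents.
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return sum / relevant.size();
    }

    public static void main(String[] args) {
        // Made-up ranking and judgments for one query.
        List<String> ranked = Arrays.asList("d3", "d7", "d1", "d9", "d4");
        Set<String> relevant = new HashSet<>(Arrays.asList("d3", "d1", "d8"));
        System.out.println("P5 = " + precisionAtK(ranked, relevant, 5));
        System.out.println("AP = " + averagePrecision(ranked, relevant));
    }
}
```

In this toy example two of the five retrieved documents are relevant, so P5 is 2/5, and the relevant document d8 that was never retrieved still counts in the denominator of average precision.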
The Indexer Application lets you specify the type of TREC documents to be indexed and the location of the index.
Below is an example of the index parameters:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<indexParams>
<indexName>index</indexName>
<fileList>data/cacm_file_list</fileList>
<indexType>cacm</indexType>
</indexParams>
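where:
indexName is the location where the index will be stored.
fileList is a file listing the documents to be indexed.
indexType is the type of TREC documents to be indexed (cacm in this example).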
In the data directory we have provided some sample files to show how the indexing works.
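As a rough illustration of how a parameter file like the one above can be read, the sketch below parses its elements with the DOM parser that ships with the JDK. This is an assumption-laden example and not the actual IndexerApp code: the class IndexParamsSketch and its read method are hypothetical, and the XML is inlined as a string so that the example runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Illustrative only: a hypothetical helper, not the parameter reader used by lucene4ir.
class IndexParamsSketch {

    // Returns every child element of the root as an (element name -> text) entry.
    static Map<String, String> read(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> params = new LinkedHashMap<>();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Skip whitespace text nodes between elements.
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    params.put(child.getNodeName(), child.getTextContent().trim());
                }
            }
            return params;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse parameter XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
                + "<indexParams>"
                + "<indexName>index</indexName>"
                + "<fileList>data/cacm_file_list</fileList>"
                + "<indexType>cacm</indexType>"
                + "</indexParams>";
        Map<String, String> params = read(xml);
        System.out.println("indexName = " + params.get("indexName"));
        System.out.println("fileList  = " + params.get("fileList"));
        System.out.println("indexType = " + params.get("indexType"));
    }
}
```

Reading the elements into a map keeps the sketch independent of any particular parameter class; a real application might instead bind the XML to a typed object.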
The Retrieval Application lets you specify the collection/index, the queries and the retrieval model, along with how it is parameterized.
Below is an example of the retrieval parameters.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<indexParams>
<indexName>path/to/the/index</indexName>
<queryFile>data/cacm/title.query</queryFile>
<maxResults>100</maxResults>
<model>bm25</model>
<resultFile>data/cacm/bm25_results.res</resultFile>
<b>0.75</b>
<k>1.2</k>
</indexParams>
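where:
indexName is the location of the index to search.
queryFile is the file containing the queries.
maxResults is the maximum number of results to return for each query.
model is the retrieval model to use (bm25 in this example).
resultFile is the file that the results are written to.
b and k are the parameters of the BM25 model.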
If no model is given, the default model is selected. If no parameters are provided, default values are used. If no resultFile is provided, a result file name is automatically generated from the model name.
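For orientation, the sketch below shows how values like b and k enter the textbook BM25 formula when scoring a single query term against a single document. It assumes that k plays the role usually called k1 (term-frequency saturation) and b controls document length normalisation. This is not Lucene's BM25Similarity, whose implementation may differ in details such as how document lengths are stored; the class name Bm25Sketch and all numbers in main are invented for illustration.

```java
// Illustrative only: textbook BM25 for one term, not the code used by RetrievalApp.
class Bm25Sketch {

    // Values taken from the example parameter file above.
    static final double EXAMPLE_K = 1.2;
    static final double EXAMPLE_B = 0.75;

    // Inverse document frequency, using a common variant that adds 1 inside
    // the logarithm so that the value is never negative.
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // BM25 contribution of one term occurring tf times in a document of
    // length docLen, in a collection whose average document length is avgDocLen.
    static double score(double tf, double docLen, double avgDocLen,
                        long docCount, long docFreq, double k, double b) {
        double lengthNorm = k * (1.0 - b + b * docLen / avgDocLen);
        return idf(docCount, docFreq) * tf * (k + 1.0) / (tf + lengthNorm);
    }

    public static void main(String[] args) {
        // Made-up statistics: 1000 documents, the term occurs in 10 of them.
        double shortDoc = score(2, 50, 100, 1000, 10, EXAMPLE_K, EXAMPLE_B);
        double longDoc = score(2, 200, 100, 1000, 10, EXAMPLE_K, EXAMPLE_B);
        System.out.println("short document: " + shortDoc);
        System.out.println("long document:  " + longDoc);
    }
}
```

With b greater than 0, the same term frequency earns a lower score in a longer document; setting b to 0 switches length normalisation off, and larger k values let repeated occurrences keep adding to the score for longer.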
This application provides a number of examples on how to access various statistics given the index. It is a work in progress and is designed to show how to read through postings lists, how to access fields, how to get document/field counts, etc.
Once you have indexed a collection, simply specify the location of the index:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<exampleStatsParams>
<indexName>index</indexName>
</exampleStatsParams>
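To illustrate the kind of statistics involved without depending on the Lucene API, the following self-contained sketch builds a toy in-memory inverted index and reads document counts, document frequencies and total term frequencies off its postings lists. Everything here (the class PostingsSketch, its method names, the two documents and the simple tokenisation) is hypothetical; ExampleStatsApp itself works against a real Lucene index.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a toy inverted index, not how Lucene stores or exposes postings.
class PostingsSketch {

    // term -> (document id -> number of times the term occurs in that document)
    private final Map<String, Map<Integer, Integer>> postings = new TreeMap<>();
    private int docCount = 0;

    // Lowercases the text, splits it on non-word characters, and records each token.
    void addDocument(int docId, String text) {
        docCount++;
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Number of documents added so far.
    int docCount() {
        return docCount;
    }

    // Document frequency: the number of documents that contain the term.
    int docFreq(String term) {
        Map<Integer, Integer> list = postings.get(term);
        return list == null ? 0 : list.size();
    }

    // Total term frequency: occurrences of the term summed over all documents.
    long totalTermFreq(String term) {
        long total = 0;
        Map<Integer, Integer> list = postings.get(term);
        if (list != null) {
            for (int tf : list.values()) {
                total += tf;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        PostingsSketch index = new PostingsSketch();
        index.addDocument(1, "information retrieval with Lucene");
        index.addDocument(2, "Lucene indexes, Lucene searches");
        System.out.println("documents: " + index.docCount());
        System.out.println("docFreq(lucene): " + index.docFreq("lucene"));
        System.out.println("totalTermFreq(lucene): " + index.totalTermFreq("lucene"));
        // Walk the postings list for one term: document id and in-document frequency.
        for (Map.Entry<Integer, Integer> posting : index.postings.get("lucene").entrySet()) {
            System.out.println("  doc " + posting.getKey() + " tf " + posting.getValue());
        }
    }
}
```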