Using Indri to Search a Document Collection

Ephyra provides an interface for the Indri document retrieval system that allows searching a local document collection such as a TREC corpus. This tutorial describes how to build an index with Indri and how to set up Ephyra to use it.

  1. You can obtain Indri from http://sourceforge.net/projects/lemur/.

    Windows: Download and install the 'exe' file.

    Other systems: Download the source code and unzip it.
  1. Build Indri.

    Windows: Indri has already been built for you.

    Other systems: Go to the Indri folder and run the following commands:
    • ./configure --enable-java --with-javahome=<Java home directory> --with-swig=<location of swig, often /usr/bin/swig>
    • make
    • make install
  1. Copy the shared library from the Indri folder swig/obj/java/ to the Ephyra folder lib/search/.

    Windows: Copy indri_jni.dll.

    Linux: Copy libindri_jni.so.

    Mac OS: Copy libindri_jni.jnilib.
  1. If your Indri version does not match the version in lib/search/indri.version, overwrite the Java library in lib/search/ with the one in swig/src/java/indri.jar.
  1. Prepare your document collection for indexing. By default, Ephyra retrieves individual paragraphs rather than whole documents. This means that in your document collection, all paragraphs need to be annotated with <P>...</P> tags.

    The package info.ephyra.indexing contains preprocessors for the AQUAINT, AQUAINT2 and Blog06 corpora that add these tags and do some additional processing (refer to the Javadoc comments for details). You can develop your own preprocessor for another corpus based on this code.

    If you want to retrieve entire documents instead, use the knowledge miner info.ephyra.search.searchers.IndriDocumentKM in place of info.ephyra.search.searchers.IndriKM in step 9.
  1. Build the index by running buildindex/buildindex parameters.xml from the Indri directory, where parameters.xml is a parameter file. A sample parameter file is shown below. See the Indri documentation for details.
    <parameters>
        <corpus>
            <path>/path/to/your/corpus</path>
            <class>trectext</class>
        </corpus>
        <memory>1g</memory>
        <index>/target/path/of/your/index</index>
        <metadata>
            <field>docno</field>
        </metadata>
        <metadata>
            <field>doctype</field>
        </metadata>
        <metadata>
            <field>dateline</field>
        </metadata>
        <field>
            <name>title</name>
        </field>
        <field>
            <name>text</name>
        </field>
        <field>
            <name>p</name>
        </field>
        <stemmer>
            <name>krovetz</name>
        </stemmer>
    </parameters>
    
  1. If you want to run the index and Ephyra on different machines, you can start an Indri server by running indrid/indrid -index=/path/to/your/index from the Indri directory.
  1. If your index is located on the same machine as Ephyra, set the environment variable INDRI_INDEX to point to the index directory. To integrate multiple indices, use INDRI_INDEX2, INDRI_INDEX3, etc.

    If you run an Indri server, set the environment variable INDRI_SERVER to point to the URL of the server. To integrate multiple servers, use INDRI_SERVER2, INDRI_SERVER3, etc.

    When you run Ephyra, you will also need to specify the VM argument -Djava.library.path=lib/search/ to allow Java to find the shared library for Indri.
  1. Finally, you need to add the Indri knowledge miner to one of the init-methods in the main class that you are running. For instance, if you want to use the Indri index to answer factoid questions, add the following line to the search part of the initFactoid() method in your main class:
    Search.addKnowledgeMiner(new IndriKM());
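
Before starting Ephyra, it can help to verify the settings from step 8. The following is a minimal sketch of such a check, not part of Ephyra: the class IndriSetupCheck and its method numbered are hypothetical names, and stopping at the first unset number (for example, ignoring INDRI_INDEX4 when INDRI_INDEX3 is missing) is an assumption about how the variables are read. The library file names are the ones listed in step 3.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical pre-flight check for the Indri settings; not part of Ephyra. */
public class IndriSetupCheck {

    /**
     * Collects the values of PREFIX, PREFIX2, PREFIX3, ... from the given
     * environment. Assumption: the sequence ends at the first unset number.
     */
    static List<String> numbered(Map<String, String> env, String prefix) {
        List<String> values = new ArrayList<>();
        String value = env.get(prefix);
        for (int i = 2; value != null && !value.isEmpty(); i++) {
            values.add(value);
            value = env.get(prefix + i);
        }
        return values;
    }

    public static void main(String[] args) {
        Map<String, String> env = System.getenv();
        List<String> indices = numbered(env, "INDRI_INDEX");
        List<String> servers = numbered(env, "INDRI_SERVER");

        for (String index : indices) {
            boolean ok = new File(index).isDirectory();
            System.out.println("index:  " + index + (ok ? "" : "  (not a directory)"));
        }
        for (String server : servers) {
            System.out.println("server: " + server);
        }
        if (indices.isEmpty() && servers.isEmpty()) {
            System.out.println("Neither INDRI_INDEX nor INDRI_SERVER is set.");
        }

        // Look for one of the shared library files from step 3 in the
        // directories given by -Djava.library.path=...
        String[] names = {"indri_jni.dll", "libindri_jni.so", "libindri_jni.jnilib"};
        String path = System.getProperty("java.library.path", "");
        boolean found = false;
        for (String dir : path.split(File.pathSeparator)) {
            for (String name : names) {
                if (new File(dir, name).isFile()) {
                    found = true;
                }
            }
        }
        System.out.println("Indri shared library " + (found ? "found" : "not found")
                + " on java.library.path");
    }
}
```

Run it with the same environment variables and the same -Djava.library.path=lib/search/ argument that you intend to use for Ephyra, so that it sees the same settings.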

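The paragraph tagging described in step 5 can be sketched as follows for a corpus without an existing preprocessor. This is only an illustration, not the code in info.ephyra.indexing: the class ParagraphTagger is a hypothetical name, and it assumes that paragraphs in the raw text are separated by one or more blank lines. It covers only the <P>...</P> tags, not the rest of the document markup that Indri expects.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; Ephyra's own preprocessors are in info.ephyra.indexing. */
public class ParagraphTagger {

    /**
     * Wraps each paragraph of the given text in <P>...</P> tags.
     * Assumption: paragraphs are separated by one or more blank lines.
     */
    static String tagParagraphs(String text) {
        List<String> tagged = new ArrayList<>();
        // Split wherever a line break is followed by optional whitespace
        // and another line break, i.e. at blank lines.
        for (String paragraph : text.trim().split("\\r?\\n\\s*\\r?\\n")) {
            String body = paragraph.trim();
            if (!body.isEmpty()) {
                tagged.add("<P>\n" + body + "\n</P>");
            }
        }
        return String.join("\n", tagged);
    }

    public static void main(String[] args) {
        String text = "First paragraph,\nsecond line.\n\nSecond paragraph.";
        System.out.println(tagParagraphs(text));
    }
}
```

A real preprocessor would apply this to the text portion of each document and leave the surrounding document tags untouched.
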
Comments about this tutorial? Please email Nico Schlaefer.