The MontySolr contrib/invenio module comes with a component that automatically updates the Solr index. Here is how to activate it.

First, make sure you compile contrib/invenio:

cd contrib/invenio
ant

Alternatively, build everything from the montysolr root:

ant build-all

After the compilation, we should see:

build/contrib/invenio/montysolr-invenio-*.*-SNAPSHOT.jar

Add this jar to the lib folder of your Solr installation. We will also need dataimporthandler.jar from the Solr contrib directory.

Once the jars are in place, we can configure montysolr.xml:

  <!-- Handler that keeps Invenio in-sync on request -->
  <requestHandler name="/invenio/update" default="false">
    <lst name="defaults">
      <str name="inveniourl">${solr.inveniourl:http://localhost/search}</str>
      <str name="importurl">http://localhost:8983/solr/invenio/import?command=full-import&amp;dirs=</str>
      <str name="updateurl">http://localhost:8983/solr/invenio/import?command=full-import&amp;dirs=</str>
      <str name="deleteurl">blankrecords</str>
    </lst>
  </requestHandler>
  <requestHandler name="/invenio/import" class="solr.WaitingDataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
      <bool name="clean">false</bool>
      <bool name="commit">false</bool>
    </lst>
  </requestHandler>

We have registered a new handler under the URL /invenio/update. This handler automatically retrieves the recids of all new, updated, and deleted records from Invenio and then decides what to do with them. It is invoked through a URL; here are some examples:

  • http://yoursite/solr/invenio/update
    – retrieve recids of the new/changed/deleted records since the time of the last update operation (if this is the first time you invoke this url, it will retrieve and index all records)
  • http://yoursite/solr/invenio/update?last_recid=-1
    – to force reindexing of everything
  • http://yoursite/solr/invenio/update?last_recid=94
    – it will find out when the record with recid 94 was changed and will gather all the changes that happened after that
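
The three invocations above differ only in the last_recid query parameter. A minimal Python sketch of building them, assuming the handler is reachable at http://yoursite/solr/invenio/update as in the examples (the helper name update_url is made up for illustration and is not part of MontySolr):

```python
from urllib.parse import urlencode

# Base URL taken from the examples above; adjust it to your own Solr instance.
BASE = "http://yoursite/solr/invenio/update"


def update_url(last_recid=None, base=BASE):
    """Build the URL that triggers the /invenio/update handler.

    last_recid=None -> changes since the last update operation
    last_recid=-1   -> force reindexing of everything
    last_recid=N    -> changes made after record N was last changed
    """
    if last_recid is None:
        return base
    return base + "?" + urlencode({"last_recid": int(last_recid)})


if __name__ == "__main__":
    for recid in (None, -1, 94):
        print(update_url(recid))
```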

For this handler to work, MontySolr must be configured correctly. First of all, set the PYTHONPATH: make sure Python can load the following modules: montysolr, monty_invenio, montysolr_java

For example:

export PYTHONPATH=/some/path/montysolr/src/python:/some/path/montysolr/contrib/invenio/src/python:/some/path/montysolr/build/dist:$PYTHONPATH
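
To check that the PYTHONPATH is right before starting Solr, something like the following sketch can help. It only asks Python whether the three modules named above can be located, without importing them. The function name missing_modules is hypothetical, and importlib.util.find_spec needs Python 3.4 or newer; on an older interpreter, a plain import of each module in a Python shell serves the same purpose.

```python
import importlib.util

# Module names listed in the text above.
REQUIRED = ("montysolr", "monty_invenio", "montysolr_java")


def missing_modules(names=REQUIRED):
    """Return the names that Python cannot locate on the current sys.path."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for broken or half-initialised modules
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules()
    print("missing:", ", ".join(missing) if missing else "none")
```

Run it in the same shell (and with the same environment) that will start Solr, so that it sees the same PYTHONPATH.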

Then we must also configure the /invenio/update handler itself. The update handler has two modes of operation. In the first, it generates empty Lucene documents for any existing Invenio recid. In that case we can say:

....
<lst name="defaults">
   <bool name="generate">true</bool>
</lst>

In the second mode, the update handler invokes other handlers. In this case we use:

<lst name="defaults">
       <str name="importurl">http://localhost:8983/solr/invenio/import?command=full-import&amp;dirs=</str>
       <str name="updateurl">http://localhost:8983/solr/invenio/import?command=full-import&amp;dirs=</str>
       <str name="deleteurl">blankrecords</str>
       <str name="inveniourl">${solr.inveniourl:http://localhost/search}</str>
 </lst>

updateurl : complete URL of the Solr update handler (this handler should fetch updated source documents and index them). The handler used here is just a slightly modified version of the DataImportHandler: it does not revert changes if one record fails, and it postpones its reply (blocks) until all records have been processed. Apart from that, it is a normal DataImportHandler (http://wiki.apache.org/solr/DataImportHandler).

importurl : complete URL of the Solr update handler (this handler should fetch new source documents and index them). We use the same handler as for updates. If we set it to blankrecords instead, empty Lucene documents would be created, carrying only the Invenio recid <-> Lucene docid mapping.

deleteurl : complete URL of the Solr update handler (this handler should remove deleted documents from the Solr index). blankrecords means that we simply delete the Lucene docs.

inveniourl : URL of the Invenio search instance. When new records are discovered, they are retrieved from there.

For example, with the following parameters:

last_recid: 90
inveniourl: http://invenio-server/search
updateurl: http://localhost:8983/solr/update-dataimport?command=full-import&dirs=/proj/fulltext/extracted
importurl: http://localhost:8983/solr/waiting-dataimport?command=full-import&arg1=val1&arg2=val2
deleteurl: blankrecords

We ping this url: http://localhost:8983/solr/invenio/update?last_recid=90

… the handler asks Invenio and discovers the following changes:

updated records: 53, 54, 55, 100
added records: 101,103
deleted records: 91,92,93,102

… which results in 2 requests and 1 local delete operation:

  1. http://localhost:8983/solr/update-dataimport?command=full-import&dirs=/proj/fulltext/extracted&url=http://invenio-server/search?p=recid:53->55 OR recid:100&rg=200&of=xm
  2. http://localhost:8983/solr/waiting-dataimport?command=full-import&arg1=val1&arg2=val2&url=http://invenio-server/search?p=recid:101 OR recid:103&rg=200&of=xm
  3. <local deletion of lucene docs that map to recids: 91,92,93,102>
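
The mapping from the discovered changes to these three operations can be sketched in Python. This is an illustration that reproduces the example above, not MontySolr's actual code: the function names, the operation tuples, and the rule that any run of consecutive recids is collapsed into an a->b range are assumptions. The URLs are left unencoded, exactly as printed above; a real HTTP client would need to percent-encode the nested url value.

```python
EXAMPLE_CFG = {
    "inveniourl": "http://invenio-server/search",
    "updateurl": "http://localhost:8983/solr/update-dataimport"
                 "?command=full-import&dirs=/proj/fulltext/extracted",
    "importurl": "http://localhost:8983/solr/waiting-dataimport"
                 "?command=full-import&arg1=val1&arg2=val2",
    "deleteurl": "blankrecords",
}
EXAMPLE_CHANGES = {
    "updated": [53, 54, 55, 100],
    "added": [101, 103],
    "deleted": [91, 92, 93, 102],
}


def recid_query(recids):
    """Collapse recids into a query such as 'recid:53->55 OR recid:100'."""
    runs = []
    for recid in sorted(set(recids)):
        if runs and recid == runs[-1][1] + 1:
            runs[-1][1] = recid          # extend the current run
        else:
            runs.append([recid, recid])  # start a new run
    return " OR ".join(
        "recid:%d" % a if a == b else "recid:%d->%d" % (a, b) for a, b in runs
    )


def plan(changes, cfg, rows=200):
    """List the operations for a dict of updated/added/deleted recids."""
    ops = []
    for kind, key in (("updated", "updateurl"),
                      ("added", "importurl"),
                      ("deleted", "deleteurl")):
        recids = changes.get(kind)
        if not recids:
            continue
        if cfg[key] == "blankrecords":
            # handled locally, no HTTP request involved
            ops.append(("local", kind, sorted(recids)))
        else:
            source = "%s?p=%s&rg=%d&of=xm" % (
                cfg["inveniourl"], recid_query(recids), rows)
            ops.append(("request", cfg[key] + "&url=" + source))
    return ops


if __name__ == "__main__":
    for op in plan(EXAMPLE_CHANGES, EXAMPLE_CFG):
        print(op)
```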

However, the changes will not be committed to the index unless you specify:

<lst name="defaults">
       <str name="importurl">http://localhost:8983/solr/invenio/import?command=full-import&amp;dirs=</str>
       <str name="updateurl">http://localhost:8983/solr/invenio/import?command=full-import&amp;dirs=</str>
       <str name="deleteurl">blankrecords</str>
       <str name="inveniourl">${solr.inveniourl:http://localhost/search}</str>
       <bool name="commit">true</bool>
  </lst>

Unfortunately, the commit setting may not affect the index as you would expect, because the documents are probably still being imported in parallel at the moment commit is called. Commit is therefore useful only when you use blankrecords in place of the URLs. In general, you should configure the commit policy inside your dataimport handler or have a site-wide commit policy. Or invoke the commit manually at the end:

http://localhost:8983/solr/update?commit=true
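
A script that triggers the update and then issues the manual commit could look like the sketch below. The URLs are the ones used in this post; the function names are made up. Note that it only sequences the two calls. It cannot tell whether an import is still running in the background, so a commit policy in the dataimport handler or a site-wide one remains the safer option.

```python
from urllib.request import urlopen

# URLs as used in the examples of this post; adjust host and port as needed.
UPDATE_URL = "http://localhost:8983/solr/invenio/update"
COMMIT_URL = "http://localhost:8983/solr/update?commit=true"


def http_get(url, timeout=3600):
    """Fetch a URL and return the HTTP status code."""
    with urlopen(url, timeout=timeout) as response:
        return response.status


def update_then_commit(fetch=http_get, update_url=UPDATE_URL,
                       commit_url=COMMIT_URL):
    """Call the update handler and, once it has replied, send a commit."""
    status = fetch(update_url)
    if status != 200:
        raise RuntimeError("update failed with HTTP status %s" % status)
    return fetch(commit_url)


if __name__ == "__main__":
    print("commit returned HTTP status", update_then_commit())
```

The fetch argument exists only so that the sequencing can be tried out without a running Solr instance.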

Final note

There are many parameters, so it can look daunting. But once everything is configured, you can forget that the update handler exists. Just set up a cron job to periodically invoke

http://your/solr/invenio/update

… or, even better, we could change the Invenio source to invoke a certain URL whenever a record is changed…
