This entry shows how to reproduce the behaviour of the Invenio query parser in Java, using the ANTLR parser library and the antlrqueryparser module of Lucene and montysolr.

First, we need a query syntax grammar for ANTLR. No formal grammar existed before, and the Invenio query syntax was defined only through examples (#1, #2), so we must write a formal grammar ourselves.

cd contrib/antlrqueryparser/grammars
vim Invenio.g

You can download the complete grammar from: https://github.com/romanchyla/montysolr/blob/master/contrib/invenio/grammars/Invenio.g

So we open up ANTLRWorks:

java -jar antlrworks.jar &

And we load the grammar…

In my experience, ANTLRWorks is a very good GUI for designing a grammar, but it has some serious problems. For example, you cannot rely on the Interpreter (depicted above) to always show the correct parse. Sometimes parsing fails but the Interpreter pretends it worked, and sometimes it is the other way round.

So, when I edit the grammar, I do the following:

  1. Edit the grammar
  2. Ctrl+D (to check the grammar syntax)
  3. Run example in the interpreter tab
  4. Verify parsing on the command line
ant try-view -Dgrammar=Invenio -Dquery="some #funny^0.4 token"
Buildfile: /dvt/workspace/montysolr/contrib/antlrqueryparser/build.xml
     [echo] Building antlrqueryparser...

antlr-generate:
     [echo]
     [echo]                 Regenerating: Invenio
     [echo]                 Output: /dvt/workspace/montysolr/contrib/antlrqueryparser/src/java/org/apache/lucene/queryParser/aqp/parser
     [echo]                 
    [javac] Compiling 2 source files to /dvt/workspace/montysolr/build/contrib/antlrqueryparser/classes/java
    [javac] InvenioParser.java:181: warning: [cast] redundant cast to java.lang.Object
    [javac]                     root_0 = (Object)adaptor.nil();
    [javac]                              ^
    ... a lot of running text
    [javac] 100 warnings

antlr:

compile:
    [javac] /dvt/workspace/montysolr/contrib/antlrqueryparser/build.xml:131: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
    [javac] Compiling 1 source file to /dvt/workspace/montysolr/build/contrib/antlrqueryparser/classes/java

compile-all:
    [javac] /dvt/workspace/montysolr/contrib/antlrqueryparser/build.xml:122: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds

dot:
     [echo]
     [echo]                     Generating DOT: Invenio  
     [echo]                     Query: some #funny^0.4 token
     [echo]                     Rule: mainQ       
     [echo]                 

tree:
     [echo]
     [echo]                 Generating TREE: Invenio  
     [echo]                 Query: some #funny^0.4 token
     [echo]                 Rule: mainQ       
     [echo]             
     [java] Grammar: Invenio rule:mainQ
     [java] query: some #funny^0.4 token
     [java] (DEFOP (DEFOP (MODIFIER (TMODIFIER (FIELD (QNORMAL some)))) (MODIFIER # (TMODIFIER (BOOST ^0.4) FUZZY (FIELD (QNORMAL funny))))) (MODIFIER (TMODIFIER (FIELD (QNORMAL token)))))

display:

If you have a dot viewer installed, you will see the following picture of the AST (Abstract Syntax Tree):

The picture shows the AST produced by ANTLR, and everything looks OK. If there were an error, you would notice it on the command line and see a blank picture.

If you can’t see any picture despite a correct grammar, verify that the contrib/antlrqueryparser/build.properties file has the correct path to your viewer.

dot_viewer=/usr/bin/xdot
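
A quick way to check that setting without running the whole build is a small helper script. The sketch below is in Python and is not part of montysolr; the function names read_properties and find_viewer are made up for illustration. It assumes the properties file uses plain key=value lines.

```python
import shutil


def read_properties(path):
    """Read simple key=value lines, skipping blanks and # or ! comments."""
    props = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line[0] in "#!" or "=" not in line:
                continue
            key, value = line.split("=", 1)
            props[key.strip()] = value.strip()
    return props


def find_viewer(properties_path, key="dot_viewer"):
    """Return the viewer executable if it can be run, otherwise None."""
    viewer = read_properties(properties_path).get(key)
    if not viewer:
        return None
    # shutil.which checks that the file exists and is executable
    return shutil.which(viewer)
```

Calling find_viewer("contrib/antlrqueryparser/build.properties") from the montysolr checkout should return the viewer path, or None when the entry is missing or points to something that cannot be executed.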

Notice that the debug output includes the textual representation of the AST.

(DEFOP (DEFOP (MODIFIER (TMODIFIER (FIELD (QNORMAL some)))) (MODIFIER # (TMODIFIER (BOOST ^0.4) FUZZY (FIELD (QNORMAL funny))))) (MODIFIER (TMODIFIER (FIELD (QNORMAL token)))))
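
Because the textual form is just nested parentheses, it is easy to load into a script and inspect. The following Python sketch is my own helper, not something shipped with antlrqueryparser. It turns the string into nested lists and collects the search terms. It assumes that tokens never contain parentheses or spaces, which holds for the example above but not necessarily for every query.

```python
def parse_tree(text):
    """Parse a parenthesised tree string into nested lists of strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token  # a leaf such as '#', 'FUZZY' or 'some'
        node = []
        while tokens[pos] != ")":
            node.append(read())
        pos += 1  # skip the closing ')'
        return node

    tree = read()
    if pos != len(tokens):
        raise ValueError("unexpected text after the end of the tree")
    return tree


def terms(node):
    """Collect the values stored under QNORMAL nodes, left to right."""
    if isinstance(node, str):
        return []
    if node and node[0] == "QNORMAL":
        return node[1:]
    found = []
    for child in node[1:]:
        found.extend(terms(child))
    return found


ast = parse_tree(
    "(DEFOP (DEFOP (MODIFIER (TMODIFIER (FIELD (QNORMAL some)))) "
    "(MODIFIER # (TMODIFIER (BOOST ^0.4) FUZZY (FIELD (QNORMAL funny))))) "
    "(MODIFIER (TMODIFIER (FIELD (QNORMAL token)))))"
)
print(ast[0])      # DEFOP
print(terms(ast))  # ['some', 'funny', 'token']
```

The first element of every list is the node label and the rest are its children, so two trees can be compared with a plain == on the parsed lists.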

This representation is equivalent to the picture. We can use it to test whole batches of examples. Create a file called Invenio.gunit, save it inside the grammars folder, and run the following:

ant gunit -Dgrammar=Invenio -Dquery="some #funny^0.4 token"

We get output similar to this:

gunit:
     [echo]
     [echo]         Running GUNIT: Invenio        
     [echo]         
     [java] -----------------------------------------------------------------------
     [java] executing testsuite for grammar:Invenio with 118 tests
     [java] -----------------------------------------------------------------------
     [java] 2 failures found:
     [java] test19 (mainQ, line67) -
     [java] expected: FAIL
     [java] actual: OK
     [java]
     [java] test65 (mainQ, line193) -
     [java] expected:
     [java] actual: line 1:4 no viable alternative at input '+'
     [java] line 1:9 no viable alternative at input '-'
     [java]
     [java]
     [java]
     [java] Tests run: 118, Failures: 2
     [java] Java Result: 2
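
If you capture this output in a script (for example in a nightly check), the summary and the failing tests can be picked out with two regular expressions. This is only a sketch in Python: the patterns are derived from the output shown above, and the summarize function is a hypothetical helper, not part of gunit or ant.

```python
import re

SUMMARY = re.compile(r"Tests run: (\d+), Failures: (\d+)")
FAILURE = re.compile(r"\[java\]\s+(test\d+) \((\w+), line(\d+)\)")


def summarize(output):
    """Return (tests_run, failures, failing_tests) from captured gunit output."""
    summary = SUMMARY.search(output)
    if summary is None:
        raise ValueError("no 'Tests run' summary line found")
    failing = [
        (name, rule, int(line))
        for name, rule, line in FAILURE.findall(output)
    ]
    return int(summary.group(1)), int(summary.group(2)), failing


# A shortened copy of the output above, used as sample input
sample = """\
     [java] 2 failures found:
     [java] test19 (mainQ, line67) -
     [java] expected: FAIL
     [java] actual: OK
     [java]
     [java] test65 (mainQ, line193) -
     [java] expected:
     [java] actual: line 1:4 no viable alternative at input '+'
     [java] Tests run: 118, Failures: 2
"""

tests_run, failures, failing = summarize(sample)
print(tests_run, failures)  # 118 2
print(failing)              # [('test19', 'mainQ', 67), ('test65', 'mainQ', 193)]
```

The line numbers refer to lines in the gunit file, so they point straight at the test cases that need attention.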

Running gunit is much faster than using ANTLRWorks, and you can see immediately whether a grammar change broke something else. In our case we got two failures; here is the offending portion of the tests:

"C++" -> "(MODIFIER (TMODIFIER (FIELD (QNORMAL C++))))"

"O\'Shea" -> "(MODIFIER (TMODIFIER (FIELD (QNORMAL O'Shea))))"

"$e^{+}e^{-}$" -> ""

"hep-ph/0204133" -> "(MODIFIER (TMODIFIER (FIELD (QNORMAL hep-ph/0204133))))"

"BlaCK hOlEs" -> "(DEFOP (MODIFIER (TMODIFIER (FIELD (QNORMAL BlaCK)))) (MODIFIER (TMODIFIER (FIELD (QNORMAL hOlEs)))))"

"пушкин" -> "(MODIFIER (TMODIFIER (FIELD (QNORMAL пушкин))))"

I’ll show in one of the next entries how to deal with such problems in the Invenio grammar.
