Once we have an ANTLR grammar which produces an AST (Abstract Syntax Tree), we can translate the user input (a query string) into Lucene Query objects.

Recall that the grammar produces an AST which looks something like this:

We’ll use Lucene’s modern (contrib) query parser, and in particular its child, antlrqueryparser.

The new contrib parser is much more flexible than the traditional query parser, but it is also rather confusing when you see it for the first time. Basically, there are 4 types of components:

  1. parser configuration
  2. syntax parser
  3. query node processor
  4. query builder

To illustrate it, I’ll use this (ugly) drawing

The query parser is the box into which we put the configuration, and which also holds the user input (in our case: this OR that*).

In the first stage, the query parser takes “this OR that*” and runs it through the ANTLR syntax parser. We obtain the AST.

The AST is then transformed into QueryNodes by a series of NodeProcessors. We obtain the QNode tree, which is much more compact (smaller) than the original AST. A lot of nodes of the original AST are consumed or discarded along the way, but some new nodes may also be added.

The QueryNodes are then passed to a QueryBuilder, and from there we obtain a Lucene Query instance. The query parsing has finished, the searching starts…
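To make the four stages a bit more concrete, here is a toy sketch in plain Java. It is not the real Lucene or AQP API: every name in it (PipelineSketch, Node, parseSyntax, runProcessors, build) is made up for illustration, the "syntax parser" just splits on " OR ", and the "query" is only a string. It is meant to show the shape of the flow: configuration, syntax parser, processors, builder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Toy stand-ins only; none of these are the real Lucene or AQP classes.
final class PipelineSketch {

    // A tree node: a label (e.g. "DEFOP", "OR", "TERM"), a value and children.
    static final class Node {
        final String label;
        final String value;
        final List<Node> children;

        Node(String label, String value, List<Node> children) {
            this.label = label;
            this.value = value;
            this.children = children;
        }
    }

    // Stage 2: the "syntax parser". Hard-wired here to split on " OR ".
    static Node parseSyntax(String input) {
        List<Node> terms = new ArrayList<>();
        for (String part : input.split(" OR ")) {
            terms.add(new Node("TERM", part.trim(), List.of()));
        }
        return new Node("DEFOP", "", List.of(new Node("OR", "", terms)));
    }

    // Stage 3: each processor takes a tree and returns a (possibly rewritten) tree.
    static Node runProcessors(Node tree, List<UnaryOperator<Node>> processors) {
        for (UnaryOperator<Node> processor : processors) {
            tree = processor.apply(tree);
        }
        return tree;
    }

    // Stage 4: the "builder" turns the final tree into a query (a string here).
    static String build(Node node, Map<String, String> config) {
        if (node.children.isEmpty()) {
            return config.get("defaultField") + ":" + node.value;
        }
        List<String> parts = new ArrayList<>();
        for (Node child : node.children) {
            parts.add(build(child, config));
        }
        return String.join(" ", parts);
    }

    static String parse(String input, Map<String, String> config) {
        // A single processor: unwrap a DEFOP node that has exactly one child.
        UnaryOperator<Node> unwrapDefop = n ->
            n.label.equals("DEFOP") && n.children.size() == 1 ? n.children.get(0) : n;

        Node ast = parseSyntax(input);
        Node queryTree = runProcessors(ast, List.of(unwrapDefop));
        return build(queryTree, config);
    }

    public static void main(String[] args) {
        // Stage 1: the configuration (just a default field in this sketch).
        Map<String, String> config = Map.of("defaultField", "field");
        System.out.println(parse("this OR that*", config)); // prints: field:this field:that*
    }
}
```

In the real parser each stage is a separate pluggable object, which is exactly what makes it flexible (and a little confusing).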

The most important parts of this query parser are the NodeProcessors and QueryNodeBuilders. The magic happens there. Let’s look at the processor pipeline first, starting with its initialization:

  public AqpQueryNodeProcessorPipeline(QueryConfigHandler queryConfig) {
    super(queryConfig);

    add(new AqpDEFOPProcessor());
    add(new AqpTreeRewriteProcessor());
    add(new AqpMODIFIERProcessor());
    add(new AqpOPERATORProcessor());
    add(new AqpCLAUSEProcessor());
    add(new AqpTMODIFIERProcessor());
    add(new AqpBOOSTProcessor());
    add(new AqpFUZZYProcessor());
    add(new AqpQRANGEINProcessor());
    add(new AqpQRANGEEXProcessor());
    add(new AqpQNORMALProcessor());
    add(new AqpQPHRASEProcessor());
    add(new AqpQPHRASETRUNCProcessor());
    add(new AqpQTRUNCATEDProcessor());
    add(new AqpQRANGEINProcessor());
    add(new AqpQRANGEEXProcessor());
    add(new AqpQANYTHINGProcessor());
    add(new AqpFIELDProcessor());
    add(new AqpFuzzyModifierProcessor());
    add(new WildcardQueryNodeProcessor());
    add(new MultiFieldQueryNodeProcessor());
    add(new FuzzyQueryNodeProcessor());
    add(new MatchAllDocsQueryNodeProcessor());
    add(new LowercaseExpandedTermsQueryNodeProcessor());
    add(new ParametricRangeQueryNodeProcessor());
    add(new AllowLeadingWildcardProcessor());
    add(new AnalyzerQueryNodeProcessor());
    add(new PhraseSlopQueryNodeProcessor());
    add(new NoChildOptimizationQueryNodeProcessor());
    add(new RemoveDeletedQueryNodesProcessor());
    add(new RemoveEmptyNonLeafQueryNodeProcessor());
    add(new BooleanSingleChildOptimizationQueryNodeProcessor());
    add(new DefaultPhraseSlopQueryNodeProcessor());
    add(new BoostQueryNodeProcessor());
    add(new MultiTermRewriteMethodProcessor());
    add(new AqpGroupQueryOptimizerProcessor());
    add(new AqpOptimizationProcessor());
  }

Errrr, that looks scary, right? There are a number of transformers which massage the AST. For example, the AqpDEFOPProcessor takes the query node, looks at its “label”, and if the label equals “DEFOP”, it replaces the node with a new QueryNode and sets the default operator (and how do we know the default operator? The configuration tells us). If you want to see what is actually happening in the processors, you can turn on debugging (I’ll describe that later) and see this output on stdout:

query:        this OR that*
     0. starting
--------------------------------------------

<astOPERATOR label="DEFOP" name="OPERATOR" type="25" >
    <astOPERATOR label="OR" name="OPERATOR" type="25" >
        <astMODIFIER label="MODIFIER" name="MODIFIER" type="21" >
            <astTMODIFIER label="TMODIFIER" name="TMODIFIER" type="49" >
                <astFIELD label="FIELD" name="FIELD" type="14" >
                    <astQNORMAL label="QNORMAL" name="QNORMAL" type="33" >
                        <astTERM_NORMAL value="this" start="0" end="3" name="TERM_NORMAL" type="45" />
                    </astQNORMAL>
                </astFIELD>
            </astTMODIFIER>
        </astMODIFIER>
        <astMODIFIER label="MODIFIER" name="MODIFIER" type="21" >
            <astTMODIFIER label="TMODIFIER" name="TMODIFIER" type="49" >
                <astFIELD label="FIELD" name="FIELD" type="14" >
                    <astQTRUNCATED label="QTRUNCATED" name="QTRUNCATED" type="38" >
                        <astTERM_TRUNCATED value="that*" start="8" end="12" name="TERM_TRUNCATED" type="47" />
                    </astQTRUNCATED>
                </astFIELD>
            </astTMODIFIER>
        </astMODIFIER>
    </astOPERATOR>
</astOPERATOR>
     1. step class org.apache.lucene.queryParser.aqp.processors.AqpDEFOPProcessor
     Tree changed: YES
--------------------------------------------

<astOPERATOR label="OR" name="OPERATOR" type="25" >
    <astMODIFIER label="MODIFIER" name="MODIFIER" type="21" >
        <astTMODIFIER label="TMODIFIER" name="TMODIFIER" type="49" >
            <astFIELD label="FIELD" name="FIELD" type="14" >
                <astQNORMAL label="QNORMAL" name="QNORMAL" type="33" >
                    <astTERM_NORMAL value="this" start="0" end="3" name="TERM_NORMAL" type="45" />
                </astQNORMAL>
            </astFIELD>
        </astTMODIFIER>
    </astMODIFIER>
    <astMODIFIER label="MODIFIER" name="MODIFIER" type="21" >
        <astTMODIFIER label="TMODIFIER" name="TMODIFIER" type="49" >
            <astFIELD label="FIELD" name="FIELD" type="14" >
                <astQTRUNCATED label="QTRUNCATED" name="QTRUNCATED" type="38" >
                    <astTERM_TRUNCATED value="that*" start="8" end="12" name="TERM_TRUNCATED" type="47" />
                </astQTRUNCATED>
            </astFIELD>
        </astTMODIFIER>
    </astMODIFIER>
</astOPERATOR>
     2. step class org.apache.lucene.queryParser.aqp.processors.AqpTreeRewriteProcessor
     Tree changed: NO
--------------------------------------------

     3. step class org.apache.lucene.queryParser.aqp.processors.AqpMODIFIERProcessor
     Tree changed: YES
--------------------------------------------

<astOPERATOR label="OR" name="OPERATOR" type="25" >
    <astTMODIFIER label="TMODIFIER" name="TMODIFIER" type="49" >
        <astFIELD label="FIELD" name="FIELD" type="14" >
            <astQNORMAL label="QNORMAL" name="QNORMAL" type="33" >
                <astTERM_NORMAL value="this" start="0" end="3" name="TERM_NORMAL" type="45" />
            </astQNORMAL>
        </astFIELD>
    </astTMODIFIER>
    <astTMODIFIER label="TMODIFIER" name="TMODIFIER" type="49" >
        <astFIELD label="FIELD" name="FIELD" type="14" >
            <astQTRUNCATED label="QTRUNCATED" name="QTRUNCATED" type="38" >
                <astTERM_TRUNCATED value="that*" start="8" end="12" name="TERM_TRUNCATED" type="47" />
            </astQTRUNCATED>
        </astFIELD>
    </astTMODIFIER>
</astOPERATOR>
     4. step class org.apache.lucene.queryParser.aqp.processors.AqpOPERATORProcessor
     Tree changed: YES
--------------------------------------------
<boolean operation='OR'>
<modifier operation='MOD_NONE'>

<astTMODIFIER label="TMODIFIER" name="TMODIFIER" type="49" >
    <astFIELD label="FIELD" name="FIELD" type="14" >
        <astQNORMAL label="QNORMAL" name="QNORMAL" type="33" >
            <astTERM_NORMAL value="this" start="0" end="3" name="TERM_NORMAL" type="45" />
        </astQNORMAL>
    </astFIELD>
</astTMODIFIER>
</modifier>
<modifier operation='MOD_NONE'>

<astTMODIFIER label="TMODIFIER" name="TMODIFIER" type="49" >
    <astFIELD label="FIELD" name="FIELD" type="14" >
        <astQTRUNCATED label="QTRUNCATED" name="QTRUNCATED" type="38" >
            <astTERM_TRUNCATED value="that*" start="8" end="12" name="TERM_TRUNCATED" type="47" />
        </astQTRUNCATED>
    </astFIELD>
</astTMODIFIER>
</modifier>
</boolean>
     5. step class org.apache.lucene.queryParser.aqp.processors.AqpCLAUSEProcessor
     Tree changed: NO
--------------------------------------------

     6. step class org.apache.lucene.queryParser.aqp.processors.AqpTMODIFIERProcessor
     Tree changed: YES
--------------------------------------------
<boolean operation='OR'>
<modifier operation='MOD_NONE'>

<astFIELD label="FIELD" name="FIELD" type="14" >
    <astQNORMAL label="QNORMAL" name="QNORMAL" type="33" >
        <astTERM_NORMAL value="this" start="0" end="3" name="TERM_NORMAL" type="45" />
    </astQNORMAL>
</astFIELD>
</modifier>
<modifier operation='MOD_NONE'>

<astFIELD label="FIELD" name="FIELD" type="14" >
    <astQTRUNCATED label="QTRUNCATED" name="QTRUNCATED" type="38" >
        <astTERM_TRUNCATED value="that*" start="8" end="12" name="TERM_TRUNCATED" type="47" />
    </astQTRUNCATED>
</astFIELD>
</modifier>
</boolean>
     7. step class org.apache.lucene.queryParser.aqp.processors.AqpBOOSTProcessor
     Tree changed: NO
--------------------------------------------

     8. step class org.apache.lucene.queryParser.aqp.processors.AqpFUZZYProcessor
     Tree changed: NO
--------------------------------------------

     9. step class org.apache.lucene.queryParser.aqp.processors.AqpQRANGEINProcessor
     Tree changed: NO
--------------------------------------------

     10. step class org.apache.lucene.queryParser.aqp.processors.AqpQRANGEEXProcessor
     Tree changed: NO
--------------------------------------------

     11. step class org.apache.lucene.queryParser.aqp.processors.AqpQNORMALProcessor
     Tree changed: YES
--------------------------------------------
<boolean operation='OR'>
<modifier operation='MOD_NONE'>

<astFIELD label="FIELD" name="FIELD" type="14" ><field start='0' end='3' field='field' text='this'/>
</astFIELD>
</modifier>
<modifier operation='MOD_NONE'>

<astFIELD label="FIELD" name="FIELD" type="14" >
    <astQTRUNCATED label="QTRUNCATED" name="QTRUNCATED" type="38" >
        <astTERM_TRUNCATED value="that*" start="8" end="12" name="TERM_TRUNCATED" type="47" />
    </astQTRUNCATED>
</astFIELD>
</modifier>
</boolean>
     12. step class org.apache.lucene.queryParser.aqp.processors.AqpQPHRASEProcessor
     Tree changed: NO
--------------------------------------------

     13. step class org.apache.lucene.queryParser.aqp.processors.AqpQPHRASETRUNCProcessor
     Tree changed: NO
--------------------------------------------

     14. step class org.apache.lucene.queryParser.aqp.processors.AqpQTRUNCATEDProcessor
     Tree changed: YES
--------------------------------------------
<boolean operation='OR'>
<modifier operation='MOD_NONE'>

<astFIELD label="FIELD" name="FIELD" type="14" ><field start='0' end='3' field='field' text='this'/>
</astFIELD>
</modifier>
<modifier operation='MOD_NONE'>

<astFIELD label="FIELD" name="FIELD" type="14" ><wildcard field='field' term='that*'/>
</astFIELD>
</modifier>
</boolean>
     15. step class org.apache.lucene.queryParser.aqp.processors.AqpQRANGEINProcessor
     Tree changed: NO
--------------------------------------------

     16. step class org.apache.lucene.queryParser.aqp.processors.AqpQRANGEEXProcessor
     Tree changed: NO
--------------------------------------------

     17. step class org.apache.lucene.queryParser.aqp.processors.AqpQANYTHINGProcessor
     Tree changed: NO
--------------------------------------------

     18. step class org.apache.lucene.queryParser.aqp.processors.AqpFIELDProcessor
     Tree changed: YES
--------------------------------------------
<boolean operation='OR'>
<modifier operation='MOD_NONE'>
<field start='0' end='3' field='field' text='this'/>
</modifier>
<modifier operation='MOD_NONE'>
<wildcard field='field' term='that*'/>
</modifier>
</boolean>
     19. step class org.apache.lucene.queryParser.aqp.processors.AqpFuzzyModifierProcessor
     Tree changed: NO
--------------------------------------------

     20. step class org.apache.lucene.queryParser.standard.processors.WildcardQueryNodeProcessor
     Tree changed: YES
--------------------------------------------
<boolean operation='OR'>
<modifier operation='MOD_NONE'>
<field start='0' end='3' field='field' text='this'/>
</modifier>
<modifier operation='MOD_NONE'>
<prefixWildcard field='field' term='that*'/>
</modifier>
</boolean>
     21. step class org.apache.lucene.queryParser.standard.processors.MultiFieldQueryNodeProcessor
     Tree changed: NO
--------------------------------------------

     22. step class org.apache.lucene.queryParser.standard.processors.FuzzyQueryNodeProcessor
     Tree changed: NO
--------------------------------------------

     23. step class org.apache.lucene.queryParser.standard.processors.MatchAllDocsQueryNodeProcessor
     Tree changed: NO
--------------------------------------------

     24. step class org.apache.lucene.queryParser.standard.processors.LowercaseExpandedTermsQueryNodeProcessor
     Tree changed: NO
--------------------------------------------

     25. step class org.apache.lucene.queryParser.standard.processors.ParametricRangeQueryNodeProcessor
     Tree changed: NO
--------------------------------------------

     26. step class org.apache.lucene.queryParser.standard.processors.AllowLeadingWildcardProcessor
     Tree changed: NO
--------------------------------------------

     27. step class org.apache.lucene.queryParser.standard.processors.AnalyzerQueryNodeProcessor
     Tree changed: NO
--------------------------------------------

     28. step class org.apache.lucene.queryParser.standard.processors.PhraseSlopQueryNodeProcessor
     Tree changed: NO
--------------------------------------------

     29. step class org.apache.lucene.queryParser.core.processors.NoChildOptimizationQueryNodeProcessor
     Tree changed: NO
--------------------------------------------

     30. step class org.apache.lucene.queryParser.core.processors.RemoveDeletedQueryNodesProcessor
     Tree changed: NO
--------------------------------------------

     31. step class org.apache.lucene.queryParser.standard.processors.RemoveEmptyNonLeafQueryNodeProcessor
     Tree changed: NO
--------------------------------------------

     32. step class org.apache.lucene.queryParser.standard.processors.BooleanSingleChildOptimizationQueryNodeProcessor
     Tree changed: NO
--------------------------------------------

     33. step class org.apache.lucene.queryParser.standard.processors.DefaultPhraseSlopQueryNodeProcessor
     Tree changed: NO
--------------------------------------------

     34. step class org.apache.lucene.queryParser.standard.processors.BoostQueryNodeProcessor
     Tree changed: NO
--------------------------------------------

     35. step class org.apache.lucene.queryParser.standard.processors.MultiTermRewriteMethodProcessor
     Tree changed: NO
--------------------------------------------

     36. step class org.apache.lucene.queryParser.aqp.processors.AqpGroupQueryOptimizerProcessor
     Tree changed: NO
--------------------------------------------

     37. step class org.apache.lucene.queryParser.aqp.processors.AqpOptimizationProcessor
     Tree changed: NO
--------------------------------------------

final result:
--------------------------------------------
<boolean operation='OR'>
<modifier operation='MOD_NONE'>
<field start='0' end='3' field='field' text='this'/>
</modifier>
<modifier operation='MOD_NONE'>
<prefixWildcard field='field' term='that*'/>
</modifier>
</boolean>

query:        this OR that*
result:        field:this field:that*
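If the trace is hard to digest, the core idea behind each step fits in a few lines. The following is only a sketch under my own assumptions, not the actual AQP code: ProcessorSketch, Node, Processor, unwrap and run are hypothetical names. Each processor reacts to one label, the tree is walked children-first, and after every step we report whether the tree changed, loosely like the "Tree changed: YES/NO" lines in the trace.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of a label-driven processor; not the real AQP classes.
final class ProcessorSketch {

    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node... kids) {
            this.label = label;
            for (Node kid : kids) {
                children.add(kid);
            }
        }

        @Override
        public String toString() {
            return children.isEmpty() ? label : label + children;
        }
    }

    // A processor acts only on nodes with a matching label; other nodes pass through.
    interface Processor {
        String targetLabel();
        Node rewrite(Node match);
    }

    // Depth-first walk: children are processed first, then the node itself.
    static Node process(Node node, Processor p) {
        for (int i = 0; i < node.children.size(); i++) {
            node.children.set(i, process(node.children.get(i), p));
        }
        return node.label.equals(p.targetLabel()) ? p.rewrite(node) : node;
    }

    // Replaces a wrapper node that has exactly one child with that child.
    static Processor unwrap(String label) {
        return new Processor() {
            public String targetLabel() { return label; }
            public Node rewrite(Node match) {
                return match.children.size() == 1 ? match.children.get(0) : match;
            }
        };
    }

    // Runs the processors in order and reports after each step whether the tree changed.
    static String run(Node tree, List<Processor> pipeline) {
        int step = 1;
        for (Processor p : pipeline) {
            String before = tree.toString();
            tree = process(tree, p);
            boolean changed = !before.equals(tree.toString());
            System.out.println(step++ + ". " + p.targetLabel()
                + " changed: " + (changed ? "YES" : "NO"));
        }
        return tree.toString();
    }

    static Node demoTree() {
        return new Node("DEFOP",
            new Node("OR",
                new Node("MODIFIER", new Node("QNORMAL")),
                new Node("MODIFIER", new Node("QTRUNCATED"))));
    }

    public static void main(String[] args) {
        List<Processor> pipeline =
            List.of(unwrap("DEFOP"), unwrap("BOOST"), unwrap("MODIFIER"));
        // Steps 1 and 3 change the tree, step 2 finds no BOOST node.
        System.out.println(run(demoTree(), pipeline)); // prints: OR[QNORMAL, QTRUNCATED]
    }
}
```

The real processors do more than unwrap (they create proper QueryNode instances and consult the configuration), but the "match a label, rewrite the node, move on" rhythm is the same one visible in the steps above.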

The final result is what the QueryBuilder receives, and from there things are actually easier. The standard query builder contains:

  public AqpStandardQueryTreeBuilder() {
    setBuilder(GroupQueryNode.class, new GroupQueryNodeBuilder());
    setBuilder(FieldQueryNode.class, new AqpFieldQueryNodeBuilder());
    setBuilder(BooleanQueryNode.class, new BooleanQueryNodeBuilder());
    setBuilder(FuzzyQueryNode.class, new FuzzyQueryNodeBuilder());
    setBuilder(BoostQueryNode.class, new BoostQueryNodeBuilder());
    setBuilder(ModifierQueryNode.class, new ModifierQueryNodeBuilder());
    setBuilder(WildcardQueryNode.class, new WildcardQueryNodeBuilder());
    setBuilder(TokenizedPhraseQueryNode.class, new PhraseQueryNodeBuilder());
    setBuilder(MatchNoDocsQueryNode.class, new MatchNoDocsQueryNodeBuilder());
    setBuilder(PrefixWildcardQueryNode.class,
        new PrefixWildcardQueryNodeBuilder());
    setBuilder(RangeQueryNode.class, new RangeQueryNodeBuilder());
    setBuilder(SlopQueryNode.class, new SlopQueryNodeBuilder());
    setBuilder(StandardBooleanQueryNode.class,
        new StandardBooleanQueryNodeBuilder());
    setBuilder(MultiPhraseQueryNode.class, new MultiPhraseQueryNodeBuilder());
    setBuilder(MatchAllDocsQueryNode.class, new MatchAllDocsQueryNodeBuilder());
  }

So, unless you introduce a completely new QueryNode type, the standard query builder will be able to translate the tree into a Lucene Query instance.
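The idea of a builder keyed by node class can be sketched as follows. Again, this is a simplified illustration and not Lucene's implementation: BuilderRegistrySketch and the FieldNode, PrefixNode and BoolNode types are hypothetical, the "query" is a plain string, and the lookup is by exact class only (an assumption made to keep the sketch short).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical node and builder types for illustration; not Lucene's own classes.
final class BuilderRegistrySketch {

    interface QNode {}

    static final class FieldNode implements QNode {
        final String field, text;
        FieldNode(String field, String text) { this.field = field; this.text = text; }
    }

    static final class PrefixNode implements QNode {
        final String field, term;
        PrefixNode(String field, String term) { this.field = field; this.term = term; }
    }

    static final class BoolNode implements QNode {
        final List<QNode> clauses;
        BoolNode(List<QNode> clauses) { this.clauses = clauses; }
    }

    // One builder per node class; the "query" it builds is just a string here.
    private final Map<Class<? extends QNode>, Function<QNode, String>> builders = new HashMap<>();

    <T extends QNode> void setBuilder(Class<T> type, Function<T, String> builder) {
        builders.put(type, node -> builder.apply(type.cast(node)));
    }

    String build(QNode node) {
        Function<QNode, String> builder = builders.get(node.getClass());
        if (builder == null) {
            // This is the "completely new QueryNode type" case: nothing is registered for it.
            throw new IllegalStateException(
                "No builder registered for " + node.getClass().getSimpleName());
        }
        return builder.apply(node);
    }

    static BuilderRegistrySketch standard() {
        BuilderRegistrySketch r = new BuilderRegistrySketch();
        r.setBuilder(FieldNode.class, n -> n.field + ":" + n.text);
        r.setBuilder(PrefixNode.class, n -> n.field + ":" + n.term);
        // The boolean builder recurses into the registry for each of its clauses.
        r.setBuilder(BoolNode.class,
            n -> n.clauses.stream().map(r::build).collect(Collectors.joining(" ")));
        return r;
    }

    public static void main(String[] args) {
        QNode tree = new BoolNode(List.of(
            new FieldNode("field", "this"),
            new PrefixNode("field", "that*")));
        System.out.println(standard().build(tree)); // prints: field:this field:that*
    }
}
```

With a registry like this, supporting a new node type means registering one more builder, and everything else stays untouched.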

If you would like to study a working lucene query parser for the Invenio grammar we wrote about in the first part, then have a look at:

org/apache/lucene/queryParser/aqp/AqpInvenioSyntaxParser.java

The source code contains a complete configuration of a query parser for Invenio, including the setup of default values, the QueryProcessor pipeline and the query builders. It can be run as:

import org.apache.lucene.queryParser.aqp.AqpQueryParser;
import org.apache.lucene.queryParser.aqp.AqpInvenioQueryParser;

AqpQueryParser qp = new AqpInvenioQueryParser();
qp.setDebug(true); // to see the info on stderr

// parse(query, defaultField) returns the Lucene Query object
System.out.println(qp.parse("this or that*", "defaultField"));

OK, and in the next part we will see why Invenio parsing is syntactically ambiguous, how we can solve that, and how to plug the Invenio grammar into SOLR.
