Tags

,

Let’s first look at how parsers usually work (including Invenio). First they split the input stream into tokens – ie. they identify elements of the input stream. And then they parse (recognize) this token stream. Take this example:

(this is "phrase")

The query is semantically identical (in our grammar) with the query:

this is "phrase"

But for the parser, they are not the same – ANTLR sees them in this way:

LEFT-BRACKET + TOKEN + TOKEN + PHRASE-TOKEN + RIGHT-BRACKET

While Invenio query parser sees them as:

TOKEN TOKEN TOKEN TOKEN TOKEN

Yes, correct – Invenio splits elements by empty space and does not recognize their types during tokenization. But of course, it is not as simple. Invenio parser does know about a few special characters (tokens) and about one special case of tokens.

The special characters are (without commas):

(, ), +, |, -, + -

These are basically query operators (AND, OR, NOT) and brackets to groups clauses.

The special tokens are for Invenio (things like) these:

e(-)e(+)
U(1)
U(1,x)

You will notice that there is always ‘something else than bracket’ immediately before the bracket, then the opening bracket, ‘something else than bracket’, closing bracket. These expressions contain the special characters, but should not be parsed. Therefore series of regular expressions ‘escape’ them before tokenization:

s = s.replace('->', '####DATE###RANGE##OP#') # XXX: Save '->'
s = re.sub('(?P<outside>[a-zA-Z0-9_,=:]+)\((?P<inside>[a-zA-Z0-9_,+-/]*)\)',
                       '#####\g<outside>####PAREN###\g<inside>##PAREN#', s) # XXX: Save U(1) and SL(2,Z)
s = re.sub('####PAREN###(?P<content0>[.0-9/-]*)(?P<plus>[+])(?P<content1>[.0-9/-]*)##PAREN#',
                       '####PAREN###\g<content0>##PLUS##\g<content1>##PAREN#', s)
s = re.sub('####PAREN###(?P<content0>([.0-9/]|##PLUS##)*)(?P<minus>[-])' +\
                                   '(?P<content1>([.0-9/]|##PLUS##)*)##PAREN#',
                       '####PAREN###\g<content0>##MINUS##\g<content1>##PAREN#', s) # XXX: Save e(+)e(-)

It works in this sequence:

hey (u(2))      -->
hey (####u####PAREN###2##PAREN#)     -->
['hey', '(', 'u(2)', ')']

We must mimic Invenio query parsing behaviour, but ANTLR works differently. First of all, we have written our grammar that recognizes more types of tokens (and for good reasons, which will be visible later). The tokens are

*, ?, (, ), +, -, |, AND, OR, NOT, /, ", PHRASE-TOKEN, WILDCARD-TOKEN, REGEX etc...

Secondly, there is no parse rule for the ‘Invenio special family of tokens’. It would be too complicated to write these rules together with the parsing rules, it would be inefficient because of necessity to look at the context of a bracket, it would be domain-specific – ie. that family of tokens is very special and is perhaps necessary for High-Energy Physics and its formulas, but it makes little sense in a digital library of French literature. And in Biology they might require a different type of tokens to be recognized as well. Perhaps these:

U(1, 2)  # notice the space between '1,' and '2'

Btw, I am puzzled by this special family of tokens. Not only because they are sort of ‘hard-coded’, but also because they are inherently ambiguous. They treat spaces differently, but an unaware user types spaces! And it will be parsed differently. Notice this:

from invenio.search_engine import SearchQueryParenthesisedParser as sp
qp = sp()
In [13]: qp.parse_query('this e(1, 2)')
Out[13]: ['+', 'this', '+', 'e', '+', '1, + 2']

In [14]: qp.parse_query('this e(1,2)')
Out[14]: ['+', 'this', '+', 'e(1,2)']

But let us return to our ANTLR grammar. So how do we solve the problem of special tokens? In the similar way to Invenio escaping. In our ANTLR grammar, if users want to search for special characters, she must escape them

U\(1,2\)

Or she must type them as a phrase

"U(1,2)"

This solution would save everybody the troubles with special characters. Such tokens are NOT ambiguous. And can even contains spaces…

"U(1, 2)"

But Invenio requires that we parse even these cases. To do that, I’ll use a mini-grammar which modifies the input query.

https://github.com/romanchyla/montysolr/blob/master/contrib/invenio/grammars/FixInvenio.g

The most important part is this:

SAFE_TOKEN
:(~(' ' | '\t' | '\n' | '\r' | '\u3000'
| '\'' | '\"'| '\\' | '/' | ')' | '('
)
| ESC_CHAR )+;
SUSPICIOUS_TOKEN
:
(SAFE_TOKEN LPAREN SUB_SUS? RPAREN)+ SUB_SUS?
;
fragment SUB_SUS
:
LPAREN SUB_SUS RPAREN
| SAFE_TOKEN
;
It handles the cases described above, ant it is actually a bit smarter than Invenio, because it allows for recursion, which Invenio cannot understand (and it ought to).
In [15]: qp.parse_query('this e(e(1,2))')
Out[15]: ['+', 'this', '+', 'e', '+', 'e(1,2)']
The mini-grammar is more complicated then the regular expressions, but much more powerful. Though strictly speaking, we could reproduce (the broken) behavior of the Invenio parser by using the same regular expressions…
And now let’s see how it looks in the code:
public class AqpInvenioSyntaxParser extends AqpSyntaxParserAbstract {
    # .........

    @Override
    public TokenStream getTokenStream(CharSequence input) throws QueryNodeParseException {
        ANTLRStringStream in = new ANTLRStringStream(input.toString());
        FixInvenioLexer fLexer = new FixInvenioLexer(in);
        FixInvenioParser fParser = new FixInvenioParser(new CommonTokenStream(fLexer));
        
        try {
            fParser.mainQ();
        } catch (RecognitionException e) {
            throw new QueryNodeParseException(new MessageImpl(input + " " + 
                    fParser.getErrorMessage(e, fParser.getTokenNames())));
        }
        
        in = new ANTLRStringStream(fParser.corrected.toString());
        InvenioLexer lexer = new InvenioLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        
        return tokens;
    }

    #.......
}

First we parse the query input through the FixInvenio mini-grammar, we properly escape the special cases and concatenate them into a new query string. Then we serve the new escaped query to the standard (unambiguous) Invenio parser.

Advertisements