Measuring SOLR performance II


In the previous post (https://29min.wordpress.com/2013/07/31/measuring-solr-query-performance) I tried to look at the effect of the default CMS vs the G1 garbage collector. Both defaults were suboptimal and I removed them from future comparisons. Having received suggestions on the solr mailing list, particularly from Shawn Heisey, I’ve tried running the measurements again with added sets of parameters.

cms-x1 ="-Xmx20480m \
 -XX:NewRatio=3 \
 -XX:SurvivorRatio=4 \
 -XX:TargetSurvivorRatio=90 \
 -XX:MaxTenuringThreshold=8 \
 -XX:+UseConcMarkSweepGC \
 -XX:+CMSScavengeBeforeRemark \
 -XX:PretenureSizeThreshold=64m \
 -XX:CMSFullGCsBeforeCompaction=1 \
 -XX:+UseCMSInitiatingOccupancyOnly \
 -XX:CMSInitiatingOccupancyFraction=70 \
 -XX:CMSTriggerPermRatio=80 \
 -XX:CMSMaxAbortablePrecleanTime=6000 \
 -XX:+CMSParallelRemarkEnabled \
 -XX:+ParallelRefProcEnabled \
 -XX:+UseLargePages \
 -XX:+AggressiveOpts \
 -Dmontysolr.enable.warming=false -Dsolr.cache.size=0"

cms-x2="-Xmx20480m \
 -server \
 -XX:+PrintGCTimeStamps \
 -XX:+PrintGCDetails \
 -XX:MaxPermSize=64m \
 -XX:NewSize=1024m \
 -XX:SurvivorRatio=1 \
 -XX:TargetSurvivorRatio=90 \
 -XX:MaxTenuringThreshold=8 \
 -XX:+UseConcMarkSweepGC \
 -XX:+CMSScavengeBeforeRemark \
 -XX:PretenureSizeThreshold=512m \
 -XX:CMSFullGCsBeforeCompaction=1 \
 -XX:+UseCMSInitiatingOccupancyOnly \
 -XX:CMSInitiatingOccupancyFraction=70 \
 -XX:CMSTriggerPermRatio=80 \
 -XX:CMSMaxAbortablePrecleanTime=6000 \
 -XX:+CMSConcurrentMTEnabled \
 -XX:+UseParNewGC \
 -XX:ConcGCThreads=7 \
 -XX:ParallelGCThreads=7 \
 -XX:+UseLargePages \
 -Dmontysolr.enable.warming=false -Dsolr.cache.size=0"

cms-x3 = "-Xmx20480m \
 -XX:+AggressiveOpts \
 -XX:+HeapDumpOnOutOfMemoryError \
 -XX:+OptimizeStringConcat \
 -XX:+UseFastAccessorMethods \
 -XX:+UseG1GC \
 -XX:+UseStringCache \
 -XX:-UseSplitVerifier \
 -XX:MaxGCPauseMillis=50 \
 -Dmontysolr.enable.warming=false -Dsolr.cache.size=0"

This was the scenario (if you want to repeat it): https://github.com/romanchyla/solrjmeter/blob/master/scenarios/comparing-garbage-collectors.sh

The results are not at all easy to interpret, but I’ll throw them out here, just in case somebody has some interesting insight. I have two comments: even though the tests execute hundreds of thousands of queries per run (for every garbage collector configuration), these queries are neither good nor worst-case scenarios – especially the AND and NEAR5 clauses. In the majority of cases these searches find nothing, or only a very small number of hits, and that is not helpful – I am working on a better set, but that takes time.

The next observation I have to offer is that the CMS-custom configuration seems to be pretty consistent and the fastest of all; it keeps having the lowest averages (which may or may not be significant, but most likely it should count, because of the sheer number of requests executed). So I am thinking that the CMS-custom (dark blue) and the cms-x3 (green) configurations are my favourites right now. But I need better queries to test, and the differences may NOT BE SIGNIFICANT.

Colour legend (sorry, no time to play with it now):

  • dark (shade of blue/grey) = cms-custom
  • yellow = g1-custom
  • green = cms-x1
  • pink = cms-x2
  • light blue = cms-x3

I have just tweaked the chart to separate the columns – but basically, cms-custom and g1-custom ran 6 times, the others ran 3 times.

[chart: gc-run2]


Measuring SOLR query performance


So, the problem statement: I have a server, I have a SOLR index – how do I optimize the Java VM that is running my SOLR instance? How do I know one parameter is better than another? I’ll run an experiment: measure response times, change the parameters, and repeat the experiment(s) again and again.
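At its core, each measurement is nothing more exotic than timing a request and recording the result. A minimal sketch (the URL is a placeholder, not my actual instance):

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class MeasureOne {
    public static void main(String[] args) throws IOException {
        // placeholder URL - point it at your own Solr instance
        URL url = new URL("http://localhost:8983/solr/select?q=*:*");
        long start = System.nanoTime();
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        InputStream in = conn.getInputStream();
        while (in.read() != -1) { /* drain the response body */ }
        in.close();
        System.out.println("response time: " + (System.nanoTime() - start) / 1000000 + " ms");
    }
}

The hard part is not the timing itself, it is doing this reliably for hundreds of thousands of requests and keeping all the data.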

The lucene benchmark module is probably the right way to measure ‘things’. I looked at the source code and wanted to extend it, but hmm, cough, it is a bit hard to read :). But I know JMeter and I have used it before to measure server responses (and I like statistics, so I want to see details of distributions, not just nice pictures). Looking around more, I could not find anything else that satisfies my needs – persistence, reliability and rigour. The solrmeter is a bunch of utilities, nice, true, but not better than JMeter, so why use it? I won’t.

The sematext monitoring platform looks great (I would like to use it, but first I have to convince our team), though it seems a better fit for real-time monitoring. And I want to run my experiments and look at very detailed info, for example such as this one:

[chart: latencies-over-time]

This innocuous graph shows how long our users are waiting when I use the standard Java garbage collector. The test in this experiment ran for only 1 minute, but I already know that the large spike there is not random. And it is NOT innocent! It can be pretty bad…

[chart: latencies-over-time]

This graph shows latencies for queries of the form x NEAR5 y (randomly generated); the two large spikes are garbage collector sweeps. And I think I can’t see all these details in Sematext SPM, so I am forced to write my own tool to measure SOLR performance (besides, as I discovered later, 3 loops of 4 scenarios produced 720MB of data … eeeek, I don’t think SPM would take that much data).

 

So here it is:

https://github.com/romanchyla/solrjmeter

I have decided to test it on the problem of tuning the Java VM for Solr. There are several posts on the SOLR mailing list with recommendations for different Java VM parameters.

First things first: this will run against an index with ~9.5M docs; the index is 200GB, on a machine running CentOS 6; SOLR will use 20GB of RAM; the machine has 64GB of RAM and the rest is for OS caching – I won’t bother you with other details :)

I have generated thousands of queries; they are of different types, exercise boolean operations and proximity operators, and search in fields as well as across fields (unfielded search, using edismax). All of that is very specific to the way we parse queries at ADS (Astrophysics Data System), but you should assume it makes sense for us to test these random queries ;-) – a sketch of such a generator follows the list below. So the 4 scenarios are:

  1. Standard Java garbage collector (without custom settings)
  2. CMS with custom settings (thanks to Shawn Heisey, who shared his setup with the mailing list): -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:NewRatio=3 -XX:MaxTenuringThreshold=8 -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -XX:+UseLargePages -XX:+AggressiveOpts
  3. New G1 Garbage collector without parameters
  4. Custom G1 (again recommended on the list, perhaps by Otis Gospodnetić, but I am not sure): -XX:+AggressiveOpts -XX:+UseG1GC -XX:+UseStringCache -XX:+OptimizeStringConcat -XX:-UseSplitVerifier -XX:+UseNUMA -XX:MaxGCPauseMillis=50 -XX:GCPauseIntervalMillis=1000
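To give a flavour of the generated queries, here is a hypothetical sketch (the terms are made up; the real generation is part of solrjmeter): it combines terms drawn from frequency buckets with boolean/proximity operators, which is how query classes such as booleanHighFreqAndLowFreq or booleanHighFreqNear5MedFreq below come about.

import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class QueryGen {
    // made-up term buckets; in reality the terms come from the index's frequency statistics
    static final List<String> HIGH = Arrays.asList("star", "galaxy", "mass");
    static final List<String> LOW = Arrays.asList("quintessence", "magnetar");
    static final List<String> OPS = Arrays.asList("AND", "OR", "NOT", "NEAR2", "NEAR5");
    static final Random RND = new Random();

    static String pick(List<String> bucket) {
        return bucket.get(RND.nextInt(bucket.size()));
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            System.out.println(pick(HIGH) + " " + pick(OPS) + " " + pick(LOW));
        }
    }
}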

It ran last night, each test taking around 1.5 hours (there are many queries, and my dev machine was doing the measurement and generating graphs – so it takes a long time to finish). The server is restarted with every run; there is a ramp-up period of 30 secs per query class, then 6 threads search for 60 secs. The loop of 30 classes was repeated 3 times.

As expected, the default Java garbage collector had the worst spread (the variation as measured by standard deviation – i.e. how much, and how many times, the response times were far away from the average response time).
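For the record, the ‘spread’ is nothing more exotic than this (toy numbers, with one GC-like pause among fast responses):

public class Spread {
    public static void main(String[] args) {
        long[] latenciesMs = {10, 12, 9, 11, 950, 10, 13}; // made-up samples
        double sum = 0;
        for (long l : latenciesMs) sum += l;
        double mean = sum / latenciesMs.length;
        double sq = 0;
        for (long l : latenciesMs) sq += (l - mean) * (l - mean);
        double stdev = Math.sqrt(sq / latenciesMs.length); // population stdev
        System.out.printf("mean=%.1f ms, stdev=%.1f ms%n", mean, stdev);
    }
}

A single pause inflates the standard deviation far more than it moves the median – which is exactly why the spread, and not just the average, matters.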

[chart: cms-is-worst]

The screenshot is not showing labels, but the default collector is the light yellow one. So I decided to remove it from the comparison and look only at the remaining three. And they all look quite similar! I had to drill down into the details of each scenario. This is from the middle run, #2:

G1 (default)

 +------------------------------------------+-------+-------+-------+-------+--------+--------+-------+------+--------+-----+--------+
 |                                     Test |   %90 |   %95 |   %98 |   %99 | %error |    avg | count |  max | median | min |  stdev |
 +------------------------------------------+-------+-------+-------+-------+--------+--------+-------+------+--------+-----+--------+
 |            booleanHighFreqAndHighFreq.aq |  41.0 |  54.0 |  69.0 |  83.0 |  0.032 | 17.000 |  7872 |  325 |     10 |   3 | 21.916 |
 |   booleanHighFreqAndHighFreqUnfielded.aq |  43.0 |  55.0 |  74.0 |  91.0 |  0.033 | 18.000 |  7690 |  364 |     11 |   3 | 22.067 |
 |             booleanHighFreqAndLowFreq.aq |  22.0 |  32.0 |  47.0 |  62.0 |  0.021 | 10.000 | 11198 |  292 |      5 |   2 | 14.297 |
 |    booleanHighFreqAndLowFreqUnfielded.aq |  21.0 |  30.0 |  43.0 |  55.0 |  0.021 | 10.000 | 11405 |  260 |      5 |   2 | 11.426 |
 |             booleanHighFreqAndMedFreq.aq |  42.0 |  63.0 |  98.0 | 133.0 |  0.035 | 21.000 |  5123 | 1047 |     10 |   2 | 53.822 |
 |    booleanHighFreqAndMedFreqUnfielded.aq |  43.0 |  64.0 | 103.0 | 137.0 |  0.035 | 22.000 |  5141 | 1177 |     10 |   2 | 55.122 |
 |          booleanHighFreqNear2HighFreq.aq |  61.0 | 143.0 | 291.0 | 562.0 |  0.181 | 31.000 |  6190 |  875 |      9 |   2 | 82.651 |
 | booleanHighFreqNear2HighFreqUnfielded.aq |  64.0 | 142.0 | 292.0 | 574.0 |  0.181 | 32.000 |  6192 |  865 |      9 |   2 | 83.321 |
 |           booleanHighFreqNear2LowFreq.aq |  16.0 |  23.0 |  49.0 |  70.0 |  0.000 |  9.000 | 12013 | 5074 |      6 |   3 | 47.976 |
 |  booleanHighFreqNear2LowFreqUnfielded.aq |  15.0 |  21.0 |  46.0 |  67.0 |  0.000 |  9.000 | 12924 |  254 |      6 |   3 | 11.578 |
 |           booleanHighFreqNear2MedFreq.aq |  31.0 |  51.0 |  73.0 |  81.0 |  0.058 | 14.000 | 10586 |  306 |      8 |   2 | 19.352 |
 |  booleanHighFreqNear2MedFreqUnfielded.aq |  29.0 |  53.0 |  71.0 |  79.0 |  0.058 | 13.000 | 10585 |  249 |      8 |   2 | 17.330 |
 |          booleanHighFreqNear5HighFreq.aq |  68.0 | 122.0 | 279.0 | 464.0 |  0.170 | 30.000 |  6501 |  604 |      9 |   2 | 69.919 |
 | booleanHighFreqNear5HighFreqUnfielded.aq |  74.0 | 124.0 | 281.0 | 455.0 |  0.170 | 31.000 |  6340 |  680 |      9 |   2 | 70.063 |
 |           booleanHighFreqNear5LowFreq.aq |  21.0 |  27.0 |  47.0 |  66.0 |  0.000 | 10.000 | 11705 |  255 |      6 |   3 | 13.651 |
 |  booleanHighFreqNear5LowFreqUnfielded.aq |  21.0 |  28.0 |  49.0 |  69.0 |  0.000 | 10.000 | 11739 |  268 |      6 |   3 | 13.623 |
 |           booleanHighFreqNear5MedFreq.aq |  35.0 |  53.0 |  71.0 | 100.0 |  0.039 | 15.000 | 10173 |  296 |      8 |   2 | 20.671 |
 |  booleanHighFreqNear5MedFreqUnfielded.aq |  33.0 |  53.0 |  70.0 | 105.0 |  0.039 | 14.000 | 10408 |  277 |      8 |   2 | 20.575 |
 |            booleanHighFreqNotHighFreq.aq |  54.0 |  74.0 | 133.0 | 201.0 |  0.041 | 26.000 |  6556 |  425 |     17 |   2 | 32.122 |
 |   booleanHighFreqNotHighFreqUnfielded.aq |  51.0 |  71.0 | 128.0 | 197.0 |  0.041 | 25.000 |  6657 |  519 |     16 |   2 | 31.576 |
 |             booleanHighFreqNotLowFreq.aq |  63.0 |  84.0 |  93.0 |  98.0 |  0.000 | 35.000 |  5253 |  290 |     32 |   3 | 22.410 |
 |    booleanHighFreqNotLowFreqUnfielded.aq |  62.0 |  81.0 |  93.0 |  95.0 |  0.000 | 35.000 |  5269 |  292 |     31 |   3 | 21.747 |
 |             booleanHighFreqNotMedFreq.aq |  54.0 |  73.0 | 137.0 | 197.0 |  0.039 | 28.000 |  6263 |  404 |     20 |   3 | 30.826 |
 |    booleanHighFreqNotMedFreqUnfielded.aq |  55.0 |  74.0 | 138.0 | 198.0 |  0.039 | 29.000 |  6192 |  450 |     20 |   3 | 31.909 |
 |             booleanHighFreqOrHighFreq.aq |  89.0 | 113.0 | 138.0 | 223.0 |  0.043 | 38.000 |  4734 |  388 |     22 |   2 | 43.684 |
 |    booleanHighFreqOrHighFreqUnfielded.aq |  90.0 | 111.0 | 134.0 | 219.0 |  0.043 | 38.000 |  4746 |  392 |     22 |   2 | 42.164 |
 |              booleanHighFreqOrLowFreq.aq | 101.0 | 181.0 | 241.0 | 267.0 |  0.010 | 52.000 |  3705 |  401 |     38 |   3 | 52.070 |
 |     booleanHighFreqOrLowFreqUnfielded.aq |  96.0 | 178.0 | 239.0 | 262.0 |  0.010 | 51.000 |  3719 |  443 |     37 |   3 | 50.728 |
 |              booleanHighFreqOrMedFreq.aq |  73.0 |  92.0 | 217.0 | 261.0 |  0.048 | 37.000 |  5288 |  339 |     24 |   2 | 42.742 |
 |     booleanHighFreqOrMedFreqUnfielded.aq |  73.0 |  90.0 | 208.0 | 257.0 |  0.048 | 36.000 |  5349 |  337 |     23 |   3 | 42.115 |
 +------------------------------------------+-------+-------+-------+-------+--------+--------+-------+------+--------+-----+--------+

G1-customized

 +------------------------------------------+-------+-------+-------+-------+--------+--------+-------+------+--------+-----+--------+
 |                                     Test |   %90 |   %95 |   %98 |   %99 | %error |    avg | count |  max | median | min |  stdev |
 +------------------------------------------+-------+-------+-------+-------+--------+--------+-------+------+--------+-----+--------+
 |            booleanHighFreqAndHighFreq.aq |  44.0 |  57.0 |  75.0 |  90.0 |  0.033 | 18.000 |  7719 |  389 |     10 |   3 | 22.858 |
 |   booleanHighFreqAndHighFreqUnfielded.aq |  43.0 |  55.0 |  71.0 |  81.0 |  0.033 | 18.000 |  7608 |  311 |     12 |   3 | 20.056 |
 |             booleanHighFreqAndLowFreq.aq |  27.0 |  37.0 |  53.0 |  70.0 |  0.021 | 11.000 | 10836 |  508 |      6 |   2 | 16.697 |
 |    booleanHighFreqAndLowFreqUnfielded.aq |  27.0 |  37.0 |  54.0 |  71.0 |  0.021 | 11.000 | 10875 |  262 |      5 |   3 | 15.767 |
 |             booleanHighFreqAndMedFreq.aq |  53.0 |  76.0 | 127.0 | 213.0 |  0.036 | 25.000 |  5047 | 1235 |     10 |   3 | 63.891 |
 |    booleanHighFreqAndMedFreqUnfielded.aq |  45.0 |  67.0 |  98.0 | 156.0 |  0.035 | 23.000 |  5094 | 1290 |     10 |   3 | 62.324 |
 |          booleanHighFreqNear2HighFreq.aq |  70.0 | 144.0 | 285.0 | 536.0 |  0.183 | 32.000 |  6122 |  858 |      9 |   2 | 81.670 |
 | booleanHighFreqNear2HighFreqUnfielded.aq |  60.0 | 147.0 | 289.0 | 524.0 |  0.181 | 31.000 |  6186 |  935 |      9 |   2 | 83.910 |
 |           booleanHighFreqNear2LowFreq.aq |  18.0 |  29.0 |  56.0 |  70.0 |  0.000 | 10.000 | 12444 |  311 |      6 |   3 | 15.384 |
 |  booleanHighFreqNear2LowFreqUnfielded.aq |  17.0 |  29.0 |  56.0 |  66.0 |  0.000 |  9.000 | 12523 |  271 |      6 |   3 | 14.220 |
 |           booleanHighFreqNear2MedFreq.aq |  29.0 |  53.0 |  78.0 |  96.0 |  0.058 | 13.000 | 10625 |  256 |      8 |   2 | 18.725 |
 |  booleanHighFreqNear2MedFreqUnfielded.aq |  29.0 |  51.0 |  71.0 |  89.0 |  0.058 | 13.000 | 10390 |  244 |      8 |   2 | 17.707 |
 |          booleanHighFreqNear5HighFreq.aq |  76.0 | 125.0 | 278.0 | 448.0 |  0.170 | 31.000 |  6338 |  632 |      9 |   2 | 71.548 |
 | booleanHighFreqNear5HighFreqUnfielded.aq |  76.0 | 126.0 | 300.0 | 455.0 |  0.170 | 31.000 |  6341 |  642 |      9 |   3 | 72.610 |
 |           booleanHighFreqNear5LowFreq.aq |  23.0 |  32.0 |  54.0 |  71.0 |  0.000 | 11.000 | 11526 |  671 |      6 |   3 | 15.748 |
 |  booleanHighFreqNear5LowFreqUnfielded.aq |  26.0 |  36.0 |  59.0 |  72.0 |  0.000 | 11.000 | 11418 |  297 |      6 |   3 | 16.288 |
 |           booleanHighFreqNear5MedFreq.aq |  37.0 |  55.0 |  74.0 | 100.0 |  0.038 | 15.000 | 10097 |  318 |      8 |   3 | 20.688 |
 |  booleanHighFreqNear5MedFreqUnfielded.aq |  39.0 |  56.0 |  70.0 |  87.0 |  0.038 | 15.000 | 10084 |  285 |      8 |   2 | 21.326 |
 |            booleanHighFreqNotHighFreq.aq |  61.0 |  80.0 | 138.0 | 210.0 |  0.042 | 28.000 |  6344 |  376 |     18 |   2 | 34.587 |
 |   booleanHighFreqNotHighFreqUnfielded.aq |  59.0 |  79.0 | 133.0 | 205.0 |  0.042 | 28.000 |  6414 |  473 |     18 |   2 | 34.226 |
 |             booleanHighFreqNotLowFreq.aq |  66.0 |  88.0 |  95.0 | 108.0 |  0.000 | 38.000 |  5090 |  312 |     34 |   4 | 24.931 |
 |    booleanHighFreqNotLowFreqUnfielded.aq |  66.0 |  88.0 |  96.0 | 109.0 |  0.000 | 38.000 |  5034 |  303 |     34 |   4 | 25.333 |
 |             booleanHighFreqNotMedFreq.aq |  57.0 |  76.0 | 140.0 | 200.0 |  0.038 | 30.000 |  6134 |  306 |     20 |   3 | 32.044 |
 |    booleanHighFreqNotMedFreqUnfielded.aq |  58.0 |  79.0 | 140.0 | 200.0 |  0.038 | 31.000 |  6119 |  293 |     21 |   2 | 32.188 |
 |             booleanHighFreqOrHighFreq.aq |  97.0 | 118.0 | 148.0 | 255.0 |  0.042 | 41.000 |  4491 |  387 |     24 |   3 | 46.294 |
 |    booleanHighFreqOrHighFreqUnfielded.aq |  96.0 | 118.0 | 149.0 | 250.0 |  0.041 | 41.000 |  4566 |  389 |     24 |   3 | 45.882 |
 |              booleanHighFreqOrLowFreq.aq |  99.0 | 180.0 | 243.0 | 271.0 |  0.010 | 53.000 |  3691 |  471 |     37 |   3 | 51.968 |
 |     booleanHighFreqOrLowFreqUnfielded.aq | 113.0 | 194.0 | 251.0 | 277.0 |  0.010 | 57.000 |  3524 |  464 |     39 |   3 | 56.207 |
 |              booleanHighFreqOrMedFreq.aq |  75.0 |  95.0 | 226.0 | 274.0 |  0.048 | 38.000 |  5198 |  342 |     25 |   2 | 43.792 |
 |     booleanHighFreqOrMedFreqUnfielded.aq |  74.0 |  94.0 | 231.0 | 265.0 |  0.047 | 37.000 |  5241 |  359 |     24 |   2 | 44.198 |
 +------------------------------------------+-------+-------+-------+-------+--------+--------+-------+------+--------+-----+--------+

CMS-customized

 +------------------------------------------+-------+-------+-------+-------+--------+--------+-------+------+--------+-----+--------+
 |                                     Test |   %90 |   %95 |   %98 |   %99 | %error |    avg | count |  max | median | min |  stdev |
 +------------------------------------------+-------+-------+-------+-------+--------+--------+-------+------+--------+-----+--------+
 |            booleanHighFreqAndHighFreq.aq |  41.0 |  54.0 |  69.0 |  82.0 |  0.032 | 17.000 |  7874 |  328 |     10 |   2 | 20.510 |
 |   booleanHighFreqAndHighFreqUnfielded.aq |  45.0 |  58.0 |  78.0 |  93.0 |  0.033 | 19.000 |  7463 |  358 |     12 |   3 | 22.407 |
 |             booleanHighFreqAndLowFreq.aq |  24.0 |  32.0 |  42.0 |  49.0 |  0.021 | 10.000 | 10982 |  623 |      6 |   2 | 13.924 |
 |    booleanHighFreqAndLowFreqUnfielded.aq |  25.0 |  34.0 |  44.0 |  53.0 |  0.021 | 10.000 | 11174 |  289 |      5 |   2 | 15.003 |
 |             booleanHighFreqAndMedFreq.aq |  48.0 |  68.0 | 104.0 | 168.0 |  0.035 | 23.000 |  5101 | 1176 |     10 |   3 | 57.798 |
 |    booleanHighFreqAndMedFreqUnfielded.aq |  48.0 |  72.0 | 116.0 | 213.0 |  0.036 | 24.000 |  5022 | 1307 |     10 |   2 | 64.380 |
 |          booleanHighFreqNear2HighFreq.aq |  62.0 | 137.0 | 284.0 | 529.0 |  0.181 | 31.000 |  6281 |  870 |      9 |   2 | 79.851 |
 | booleanHighFreqNear2HighFreqUnfielded.aq |  65.0 | 143.0 | 287.0 | 559.0 |  0.181 | 32.000 |  6194 |  875 |      9 |   2 | 81.449 |
 |           booleanHighFreqNear2LowFreq.aq |  18.0 |  26.0 |  39.0 |  50.0 |  0.000 |  9.000 | 12787 |  302 |      6 |   3 | 12.519 |
 |  booleanHighFreqNear2LowFreqUnfielded.aq |  17.0 |  25.0 |  39.0 |  49.0 |  0.000 |  9.000 | 12962 |  251 |      6 |   3 | 11.857 |
 |           booleanHighFreqNear2MedFreq.aq |  28.0 |  41.0 |  66.0 |  80.0 |  0.058 | 13.000 | 10839 |  274 |      7 |   2 | 17.402 |
 |  booleanHighFreqNear2MedFreqUnfielded.aq |  30.0 |  44.0 |  68.0 |  82.0 |  0.058 | 13.000 | 10756 |  308 |      7 |   2 | 18.348 |
 |          booleanHighFreqNear5HighFreq.aq |  67.0 | 119.0 | 269.0 | 455.0 |  0.170 | 29.000 |  6536 |  611 |      9 |   3 | 68.869 |
 | booleanHighFreqNear5HighFreqUnfielded.aq |  66.0 | 118.0 | 270.0 | 458.0 |  0.171 | 29.000 |  6590 |  639 |      9 |   2 | 68.033 |
 |           booleanHighFreqNear5LowFreq.aq |  22.0 |  29.0 |  37.0 |  46.0 |  0.000 | 10.000 | 11819 |  281 |      6 |   3 | 12.898 |
 |  booleanHighFreqNear5LowFreqUnfielded.aq |  23.0 |  30.0 |  41.0 |  48.0 |  0.000 | 10.000 | 11722 |  300 |      6 |   3 | 13.517 |
 |           booleanHighFreqNear5MedFreq.aq |  35.0 |  50.0 |  65.0 |  85.0 |  0.038 | 15.000 | 10278 |  284 |      8 |   3 | 20.559 |
 |  booleanHighFreqNear5MedFreqUnfielded.aq |  33.0 |  49.0 |  63.0 |  77.0 |  0.039 | 14.000 | 10422 |  259 |      8 |   2 | 18.036 |
 |            booleanHighFreqNotHighFreq.aq |  53.0 |  73.0 | 136.0 | 206.0 |  0.041 | 26.000 |  6586 |  426 |     17 |   2 | 32.367 |
 |   booleanHighFreqNotHighFreqUnfielded.aq |  57.0 |  76.0 | 136.0 | 205.0 |  0.042 | 27.000 |  6476 |  353 |     17 |   2 | 33.097 |
 |             booleanHighFreqNotLowFreq.aq |  64.0 |  83.0 |  93.0 |  97.0 |  0.000 | 36.000 |  5220 |  294 |     33 |   3 | 22.540 |
 |    booleanHighFreqNotLowFreqUnfielded.aq |  66.0 |  86.0 |  94.0 |  99.0 |  0.000 | 37.000 |  5139 |  335 |     33 |   3 | 23.835 |
 |             booleanHighFreqNotMedFreq.aq |  57.0 |  75.0 | 139.0 | 203.0 |  0.038 | 30.000 |  6117 |  313 |     21 |   3 | 31.941 |
 |    booleanHighFreqNotMedFreqUnfielded.aq |  57.0 |  78.0 | 150.0 | 207.0 |  0.039 | 31.000 |  6072 |  414 |     21 |   3 | 34.089 |
 |             booleanHighFreqOrHighFreq.aq |  90.0 | 109.0 | 131.0 | 223.0 |  0.042 | 37.000 |  4771 |  367 |     22 |   2 | 41.495 |
 |    booleanHighFreqOrHighFreqUnfielded.aq |  95.0 | 115.0 | 136.0 | 232.0 |  0.041 | 40.000 |  4596 |  378 |     24 |   2 | 43.960 |
 |              booleanHighFreqOrLowFreq.aq | 104.0 | 178.0 | 248.0 | 270.0 |  0.010 | 54.000 |  3689 |  363 |     37 |   3 | 52.516 |
 |     booleanHighFreqOrLowFreqUnfielded.aq | 100.0 | 173.0 | 245.0 | 270.0 |  0.010 | 53.000 |  3695 |  548 |     38 |   3 | 51.579 |
 |              booleanHighFreqOrMedFreq.aq |  76.0 |  97.0 | 239.0 | 274.0 |  0.048 | 39.000 |  5138 |  332 |     27 |   3 | 43.823 |
 |     booleanHighFreqOrMedFreqUnfielded.aq |  74.0 |  92.0 | 226.0 | 264.0 |  0.048 | 37.000 |  5272 |  417 |     25 |   2 | 43.144 |
 +------------------------------------------+-------+-------+-------+-------+--------+--------+-------+------+--------+-----+--------+

I didn’t like the fact that G1 had a query that took 5s, and it had more of the high maxes as well. So I decided to concentrate on the customized G1 and CMS. But they do seem remarkably similar!

Their distributions for the X OR Y queries are almost exactly the same; custom G1 finished slightly more queries there, but that may just be random. When I looked at the distribution of one class, I first thought there was a bug in my tool, because they were almost exactly overlapping! But I checked, and there was no bug.

So there is no conclusion for me yet. Both options seem good and much better than the default garbage collector. Sematext published an interesting graph on their experience with G1, which may seem to suggest that G1 is better there: http://blog.sematext.com/2013/06/24/g1-cms-java-garbage-collector/ – but I am not yet ready to make such a conclusion.

There are of course many problems with my litmus test – it triggers the garbage collector through an unknown number of dead objects present during searching/fetching of results, it is not using a great set of queries, and it runs for only 90s per query type. But I am confident my tool is measuring what I want to measure – i.e. the responsiveness of our servers. It has clearly shown latency problems with the default CMS garbage collector and indicated problems with the default G1. I should prepare better queries, something that loads many hits and does heavy searching, and I may also want to let it run for a much longer period.

And here is the good news: I now have a way to test different scenarios. I don’t need to guess, or rely on somebody else’s measurements; I can target our machines and I can process the data using statistical tools (because the data is still available after it was harvested – btw, over 500MB of data, I should trim that down).

If you want to look at your own SOLR instances in the same way, you can. The tool is available as open source at: https://github.com/romanchyla/solrjmeter

You can reproduce the same test easily – just change the following file to run against your standard SOLR server and execute the scenario: https://github.com/romanchyla/solrjmeter/blob/master/scenarios/comparing-garbage-collectors.sh

Here are a few more screenshots; first, a comparison of the means across the different scenarios:

[chart: comparison-view]

Next, a view of the details of one of the scenario runs:

[chart: day-view]

And last, the detailed view of one query type:

[chart: test-view]

Developing grammars for Solr

Recently, there have been some interesting updates to the way grammars can be built and tested with MontySolr. But because the old (existing) features were never presented here, I shall include them in the pack as well.

Let’s start with the tests – inside contrib/antlrqueryparser/grammars (and also contrib/invenio/grammars) there is a new spreadsheet. It contains the gunit tests for the grammar(s). The first sheet is for adding and reorganizing the tests, the second sheet for exporting them. So usually what I do is (inside the second sheet):

  1. open StandardLuceneGrammar.gunit in an editor
  2. open StandardLuceneGrammar.xls
  3. edit Sheet1
  4. switch to Sheet2
  5. select all and copy (Ctrl+A, Ctrl+C)
  6. paste into StandardLuceneGrammar.gunit
  7. save
  8. run "ant gunit" (inside contrib/antlrqueryparser)

This way you will obtain:

gunit:
     [echo] 
     [echo]             Running GUNIT: StandardLuceneGrammar        
     [echo]             
     [java] -----------------------------------------------------------------------
     [java] executing testsuite for grammar:StandardLuceneGrammar with 179 tests
     [java] -----------------------------------------------------------------------
     [java] 36 failures found:
     [java] test144 (atom, line159) - 
     [java] expected: (MODIFIER (TMODIFIER (FIELD (QNORMAL this))))
     [java] actual: (MODIFIER (TMODIFIER (FIELD (QNORMAL th\\*is))))
     .......
     [java] test179 (atom, line194) - 
     [java] expected: (QTRUNCATED *t*\\a)
     [java] actual: (MODIFIER (TMODIFIER (FIELD (QTRUNCATED *t*\\a))))
     [java] 
     [java] 
     [java] Tests run: 179, Failures: 36
     [java] Java Result: 36

Gunit found errors. It identifies each failing test with a line number, so I go to the spreadsheet, to that line (e.g. line 194), and check the test: either I update the expected result, or I fix the grammar if it was wrong. This way edits are much faster than editing the gunit file by hand.

And because ANTLRWorks (the popular editor for ANTLR grammars) is not always reliable in reporting mistakes, gunit tests are crucial – I run them after every significant change to the grammar.

By default, the Ant build (inside contrib/antlrqueryparser) will assume that the debugged grammar is called “StandardLuceneGrammar”, but nothing prevents you from creating a new grammar, placing it inside ./grammars (both the .g and the .gunit file) and calling ant like:

ant gunit -Dgrammar=MyNewGrammar

The other handy feature for developing/debugging a grammar is to generate a graph of the parse tree for every gunit test.

ant generate-html

This will produce an HTML page with a chart for every (valid) gunit line:

To use this feature, your system must have the xdot package, which generates SVG images from .dot files (e.g. on Ubuntu: “sudo apt-get install xdot”); if it is installed in a non-standard location, you can edit contrib/antlrqueryparser/build.properties.

For testing individual cases, you don’t want to wait for hundreds of graphs, but rather see the individual queries. There are a few interesting commands for that:

rchyla@diana antlrqueryparser> ant view -Dquery="x (a or b)"

Will produce:

If, instead, you want to regenerate a grammar, you can do:

ant generate-antlr -Dgrammar=<name>

Or, if you are debugging, you can use the following commands

rchyla@diana antlrqueryparser> ant try-view -Dquery="a -c"

Which will recompile the grammar before displaying the chart.

And …

rchyla@diana antlrqueryparser> ant try-tree -Dquery="a -c"

…will show just the parse tree (as you get familiar with ANTLR, this output will become much faster to decode).

tree:
     [echo] 
     [echo]                 Generating TREE: StandardLuceneGrammar  
     [echo]                 Query: a -c 
     [echo]                 Rule: mainQ       
     [echo]             
     [java] Grammar: StandardLuceneGrammar rule:mainQ
     [java] query: a -c
     [java] (DEFOP (MODIFIER (TMODIFIER (FIELD (QNORMAL a)))) (MODIFIER - (TMODIFIER (FIELD (QNORMAL c)))))

Continuous integration for MontySolr

I have been wanting some automated build and testing for MontySolr. Recently, after playing and reading around, I tried out Hudson and am pretty happy with the results.

Firstly, the reading made me re-discover Joel’s blog, where I found the great Joel Test. This blog post really made me think…

Installation of Hudson on Ubuntu doesn’t work as described in the official documentation, but it is fairly simple anyway:

wget http://java.net/projects/hudson/downloads/download/Debian/hudson-debian-2.2.0.deb
sudo dpkg --install hudson-debian-2.2.0.deb

Maybe you’ll have to install the daemon package as well:

sudo apt-get install daemon

And then you have Hudson on localhost:8080 (you can edit /etc/default/hudson to set a different port).

Then I created a “New Job” inside Hudson with the following parameters:

Source repository: git://github.com/romanchyla/montysolr.git
Build triggers: Poll SCM
Schedule: */6 * * * *
#1 actions - execute shell: cp $WORKSPACE/build.properties.default $WORKSPACE/build.properties
#2 action - execute Ant: target=build-all

Initially, there were a few errors (because on my dev machine I had some extra jars available), but I fixed that and now MontySolr builds cleanly inside Hudson.

And because I happened to be sitting in a plane above the Atlantic while playing with Hudson, there was no internet and the build could not work. So I created a new Hudson job and directed it to my local Git repository. With a few more steps, this can be used to build MontySolr without any internet connection:

#1 Go to Hudson and clone the existing Hudson project 
#2 Set Git: /dvt/workspace/montysolr 
#3 Add actions:
- Execute shell:
cp $WORKSPACE/build.properties.default $WORKSPACE/build.properties
mkdir -p $WORKSPACE/build/solr-download
cp /dvt/workspace/montysolr/build/solr-download/*.tgz $WORKSPACE/build/solr-download
- Invoke Ant:
untar-solr
 build-solr-example
- Invoke Ant:
build-all

Using this setup, Hudson grabs Solr from the dev folder (I assume it is there) instead of downloading it over the internet, unpacks it and then runs the standard ‘build-all’ target. This can be very useful for dev-machine development.

How to build and test Invenio query parser (III.)


Let’s first look at how parsers usually work (including Invenio’s). First they split the input stream into tokens – i.e. they identify elements of the input stream. Then they parse (recognize) this token stream. Take this example:

(this is "phrase")

The query is semantically identical (in our grammar) to the query:

this is "phrase"

But for the parser, they are not the same – ANTLR sees them in this way:

LEFT-BRACKET + TOKEN + TOKEN + PHRASE-TOKEN + RIGHT-BRACKET

While Invenio query parser sees them as:

TOKEN TOKEN TOKEN TOKEN TOKEN

Yes, correct – Invenio splits elements on whitespace and does not recognize their types during tokenization. But of course, it is not that simple: the Invenio parser does know about a few special characters (tokens) and about one special family of tokens.

The special characters are (without commas):

(, ), +, |, -, + -

These are basically query operators (AND, OR, NOT) and brackets to group clauses.

The special tokens are, for Invenio, things like these:

e(-)e(+)
U(1)
U(1,x)

You will notice that there is always ‘something other than a bracket’ immediately before the opening bracket, then the opening bracket, ‘something other than a bracket’, and the closing bracket. These expressions contain the special characters, but should not be parsed. Therefore a series of regular expressions ‘escapes’ them before tokenization:

import re  # needed by the snippet below

s = s.replace('->', '####DATE###RANGE##OP#') # XXX: Save '->'
s = re.sub('(?P<outside>[a-zA-Z0-9_,=:]+)\((?P<inside>[a-zA-Z0-9_,+-/]*)\)',
                       '#####\g<outside>####PAREN###\g<inside>##PAREN#', s) # XXX: Save U(1) and SL(2,Z)
s = re.sub('####PAREN###(?P<content0>[.0-9/-]*)(?P<plus>[+])(?P<content1>[.0-9/-]*)##PAREN#',
                       '####PAREN###\g<content0>##PLUS##\g<content1>##PAREN#', s)
s = re.sub('####PAREN###(?P<content0>([.0-9/]|##PLUS##)*)(?P<minus>[-])' +\
                                   '(?P<content1>([.0-9/]|##PLUS##)*)##PAREN#',
                       '####PAREN###\g<content0>##MINUS##\g<content1>##PAREN#', s) # XXX: Save e(+)e(-)

It works in this sequence:

hey (u(2))      -->
hey (####u####PAREN###2##PAREN#)     -->
['hey', '(', 'u(2)', ')']

We must mimic Invenio’s query parsing behaviour, but ANTLR works differently. First of all, we have written our grammar to recognize more types of tokens (and for good reasons, which will become visible later). The tokens are:

*, ?, (, ), +, -, |, AND, OR, NOT, /, ", PHRASE-TOKEN, WILDCARD-TOKEN, REGEX etc...

Secondly, there is no parse rule for the ‘Invenio special family of tokens’. It would be too complicated to write these rules together with the parsing rules; it would be inefficient because of the necessity to look at the context of a bracket; and it would be domain-specific – i.e. that family of tokens is very special and is perhaps necessary for High-Energy Physics and its formulas, but it makes little sense in a digital library of French literature. And in biology they might require a different type of token to be recognized as well. Perhaps these:

U(1, 2)  # notice the space between '1,' and '2'

Btw, I am puzzled by this special family of tokens. Not only because they are sort of ‘hard-coded’, but also because they are inherently ambiguous. They treat spaces differently – but an unaware user types spaces! And the query will be parsed differently. Notice this:

from invenio.search_engine import SearchQueryParenthesisedParser as sp
qp = sp()
In [13]: qp.parse_query('this e(1, 2)')
Out[13]: ['+', 'this', '+', 'e', '+', '1, + 2']

In [14]: qp.parse_query('this e(1,2)')
Out[14]: ['+', 'this', '+', 'e(1,2)']

But let us return to our ANTLR grammar. So how do we solve the problem of special tokens? In a similar way to Invenio’s escaping. In our ANTLR grammar, if a user wants to search for special characters, she must escape them:

U\(1,2\)

Or she must type them as a phrase:

"U(1,2)"

This solution would save everybody the trouble with special characters. Such tokens are NOT ambiguous. And they can even contain spaces…

"U(1, 2)"

But Invenio requires that we parse even these cases. To do that, I’ll use a mini-grammar which modifies the input query.

https://github.com/romanchyla/montysolr/blob/master/contrib/invenio/grammars/FixInvenio.g

The most important part is this:

SAFE_TOKEN
    : (~(' ' | '\t' | '\n' | '\r' | '\u3000'
        | '\'' | '\"' | '\\' | '/' | ')' | '('
        )
      | ESC_CHAR )+;

SUSPICIOUS_TOKEN
    : (SAFE_TOKEN LPAREN SUB_SUS? RPAREN)+ SUB_SUS?
    ;

fragment SUB_SUS
    : LPAREN SUB_SUS RPAREN
    | SAFE_TOKEN
    ;
It handles the cases described above, and it is actually a bit smarter than Invenio, because it allows for recursion, which Invenio cannot understand (and it ought to):

In [15]: qp.parse_query('this e(e(1,2))')
Out[15]: ['+', 'this', '+', 'e', '+', 'e(1,2)']

The mini-grammar is more complicated than the regular expressions, but much more powerful. Though strictly speaking, we could reproduce the (broken) behaviour of the Invenio parser by using the same regular expressions…

And now let’s see how it looks in the code:
public class AqpInvenioSyntaxParser extends AqpSyntaxParserAbstract {
    // .........

    @Override
    public TokenStream getTokenStream(CharSequence input) throws QueryNodeParseException {
        ANTLRStringStream in = new ANTLRStringStream(input.toString());
        FixInvenioLexer fLexer = new FixInvenioLexer(in);
        FixInvenioParser fParser = new FixInvenioParser(new CommonTokenStream(fLexer));
        
        try {
            fParser.mainQ();
        } catch (RecognitionException e) {
            throw new QueryNodeParseException(new MessageImpl(input + " " + 
                    fParser.getErrorMessage(e, fParser.getTokenNames())));
        }
        
        in = new ANTLRStringStream(fParser.corrected.toString());
        InvenioLexer lexer = new InvenioLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        
        return tokens;
    }

    // .......
}

First we parse the query input through the FixInvenio mini-grammar, which properly escapes the special cases and concatenates them into a new query string. Then we feed the escaped query to the standard (unambiguous) Invenio parser.

How to setup Invenio indexer in MontySolr

MontySolr’s contrib/invenio comes with a component which automatically updates the Solr index. I’ll show here how to activate it.

First, make sure you compile contrib/invenio:

cd contrib/invenio
ant

Or alternatively, from the montysolr root:

ant build-all

After the compilation, we should see:

build/contrib/invenio/montysolr-invenio-*.*-SNAPSHOT.jar

This jar can be added into your Solr installation’s lib folder. We’ll also need dataimporthandler.jar from Solr’s contrib.

If the jars are in place, we can configure montysolr.xml:

<!-- Handler that keeps Invenio in-sync on request -->
  <requestHandler name="/invenio/update" default="false">
     <lst name="defaults">
       <str name="inveniourl">${solr.inveniourl:http://localhost/search}</str>
       <str name="importurl">http://localhost:8983/solr/invenio/import?command=full-import&amp;dirs=</str>
       <str name="updateurl">http://localhost:8983/solr/invenio/import?command=full-import&amp;dirs=</str>
       <str name="deleteurl">blankrecords</str>
     </lst>
  </requestHandler>
<requestHandler name="/invenio/import" class="solr.WaitingDataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
      <bool name="clean">false</bool>
      <bool name="commit">false</bool>
    </lst>
  </requestHandler>

We have registered a new handler under the url /invenio/update. This handler will automatically retrieve all the new/updated/deleted recids from Invenio and then decide what to do with them. It is invoked through a url; here are some examples:

  • http://yoursite/solr/invenio/update
    – retrieve recids of the new/changed/deleted records since the time of the last update operation (if this is the first time you invoke this url, it will retrieve and index all records)
  • http://yoursite/solr/invenio/update?last_recid=-1
    – to force reindexing of everything
  • http://yoursite/solr/invenio/update?last_recid=94
    – it will find out when the record with recid 94 was changed and will gather all the changes that happened after that

But in order for this handler to work, we must correctly configure MontySolr. First of all, the PYTHONPATH: make sure Python can load the following modules: montysolr, monty_invenio, montysolr_java.

e.g.

export PYTHONPATH=/some/path/montysolr/src/python:/some/path/montysolr/contrib/invenio/src/python:/some/path/montysolr/build/dist:$PYTHONPATH

Then we must also correctly configure the /invenio/update handler. The update handler has two modes of operation. It can either generate empty lucene documents for any existing Invenio recid – in that case we say:

....
<lst name="defaults">
   <bool name="generate">true</bool>
</lst>

Or we may want the update handler to invoke various other handlers. In that case we do:

<lst name="defaults">
       <str name="importurl">http://localhost:8983/solr/invenio/import?command=full-import&amp;dirs=</str>
       <str name="updateurl">http://localhost:8983/solr/invenio/import?command=full-import&amp;dirs=</str>
       <str name="deleteurl">blankrecords</str>
       <str name="inveniourl">${solr.inveniourl:http://localhost/search}</str>
 </lst>

updateurl is the complete url of the Solr update handler (this handler should fetch updated source documents and index them). This handler is just a slightly modified version of the DataImportHandler: it will not revert changes if one record fails, and it postpones the reply (blocks) until all records have been processed. Besides that, it is just a normal http://wiki.apache.org/solr/DataImportHandler

importurl is the complete url of the Solr update handler (this handler should fetch new source documents and index them). We use the same handler as for updates. If we said blankrecords, empty lucene documents with the invenio recid<->lucene docid mapping would be created.

deleteurl is the complete url of the Solr update handler (this handler should remove deleted documents from the Solr index). blankrecords means that we simply delete the lucene docs.

inveniourl is the url of the Invenio search instance. When new records are discovered, they will be retrieved from there.

For example, with the following parameters:

last_recid: 90
inveniourl: http://invenio-server/search
updateurl: http://localhost:8983/solr/update-dataimport?command=full-import&dirs=/proj/fulltext/extracted
importurl: http://localhost:8983/solr/waiting-dataimport?command=full-import&arg1=val1&arg2=val2
deleteurl: blankrecords

We ping this url: http://localhost:8983/solr/invenio/update?last_recid=90

… the handler asks Invenio and discovers the following changes:

updated records: 53, 54, 55, 100
added records: 101,103
deleted records: 91,92,93,102

… which results in 2 requests and 1 local delete operation:

  1. http://localhost:8983/solr/update-dataimport?command=full-import&dirs=/proj/fulltext/extracted&url=http://invenio-server/search?p=recid:53->55 OR recid:100&rg=200&of=xm
  2. http://localhost:8983/solr/waiting-dataimport?command=full-import&arg1=val1&arg2=val2&url=http://invenio-server/search?p=recid:101 OR recid:103&rg=200&of=xm
  3. <local deletion of lucene docs that map to recids: 91,92,93,102>

However, the changes will not yet be committed to the index, unless you specify:

<lst name="defaults">
       <str name="importurl">http://localhost:8983/solr/invenio/import?command=full-import&amp;dirs=</str>
       <str name="updateurl">http://localhost:8983/solr/invenio/import?command=full-import&amp;dirs=</str>
       <str name="deleteurl">blankrecords</str>
       <str name="inveniourl">${solr.inveniourl:http://localhost/search}</str>
       <bool name="commit">true</bool>
  </lst>

But unfortunately, the commit configuration may not affect the index as you would expect, because the import of the documents is probably still running in parallel at the moment we call commit. Commit is therefore useful only when you use the blankrecords ‘urls’. In general, you should configure the commit policy inside your dataimport handler, or have a site-wide commit policy. Or invoke the commit manually at the end:

http://localhost:8983/solr/update?commit=true

Final note

There are many parameters, so it’s daunting. But once everything is configured, you can actually forget that any update handler exists. Just set a cron job to periodically invoke:

http://your/solr/invenio/update

… or even better, we could make a change inside the Invenio source to invoke a certain url once a record is changed…

Run several demos from one IP address

CERN provides us with virtual machines and that is very nice. It is also possible to convince the security guys to make one of those machines accessible from the Internet. However, it is not something they like to do very often.

So my problem is that I have one virtual machine visible to the world, but several demos to show (and Invenio requires a whole Apache virtual host – i.e. it blocks one port).

To solve this, I’ll hide several machines behind one – which is not that different from load balancing. The Apache config contains:

Include /opt/apache.conf

And there I played… (with the help of #cookies and #sticky-load-balance)

DocumentRoot /opt/static-web
<Directory /welcome>
 Options Indexes FollowSymLinks
</Directory>
LogLevel debug
#Set a cookie if BALANCER_ROUTE_CHANGED containing BALANCER_WORKER_ROUTE environment variable
Header add Set-Cookie "BALANCEID=hej.%{BALANCER_WORKER_ROUTE}e; path=/;" env=BALANCER_ROUTE_CHANGED
#Show balancer-manager
<Location /balancer-manager>
 SetHandler balancer-manager
 Order allow,deny
 Allow from all
</Location>
ProxyRequests Off
#Configure members for cluster
<Proxy balancer://rcabalancer>
 BalancerMember http://137.138.124.207:80 route=atlantis
 BalancerMember http://inspirehep.net:80 route=inspire
</Proxy>
#Do not proxy balancer-manager
ProxyPass /balancer-manager !
ProxyPass /welcome !
ProxyPass /robots.txt !
#The actual ProxyPass
ProxyPass / balancer://rcabalancer/ stickysession=BALANCEID nofailover=Off
#Do not forget ProxyPassReverse for redirects
ProxyPassReverse / balancer://rcabalancer
#ProxyPassReverse / http://inspirehep.net/

This is the Apache config at insdev01.cern.ch; when you visit insdev01.cern.ch/welcome you get to a static page where a small piece of JavaScript helps you choose from the available demos (and sets the cookie).

<html>
<head>
<script type="text/javascript">
var cName = 'BALANCEID';
var cTimeout = 0.5;
function createCookie(name,value,days) {
 if (days) {
 var date = new Date();
 date.setTime(date.getTime()+(days*24*60*60*1000));
 var expires = "; expires="+date.toGMTString();
 }
 else var expires = "";
 document.cookie = name+"="+value+expires+"; path=/";
}
function readCookie(name) {
 var nameEQ = name + "=";
 var ca = document.cookie.split(';');
 for(var i=0;i < ca.length;i++) {
 var c = ca[i];
 while (c.charAt(0)==' ') c = c.substring(1,c.length);
 if (c.indexOf(nameEQ) == 0) return c.substring(nameEQ.length,c.length);
 }
 return null;
}
function eraseCookie(name) {
 createCookie(name,"",-1);
}
</script>
</head>
<body>
Hi! <br/>
<p>
This is a testing/demo site. Below you can see the list of available servers. You can activate them by clicking.
<p>
Beware this demo might be broken, but if it is working, then you have a lucky day!
<ol>
 <li> <a href="http://insdev01.cern.ch" onclick="javascript:createCookie(cName, 'inspire', cTimeout)">INSPIRE search </a></li>
 <li> <a href="http://insdev01.cern.ch" onclick="javascript:createCookie(cName, 'atlantis', cTimeout)">Invenio demo search</a> </li>
</ol>
<p>
If it doesn't work, or you want to select another demo from the list, then please close your browser
<br/> ... or wait <script>document.write(cTimeout)</script> days ;) ... and visit this webpage again.
</body>
</html>

This will set the BALANCEID cookie to contain the name of the server which we want to display.

And I also want bots to stop bothering me, so I put this inside /opt/static-web/robots.txt:

User-agent: *
Disallow: /

Two things confused me:

1. After configuring the proxy and loading the site, I saw a blank page; the browser kept loading and trying to connect to the server behind the proxy.

First I thought I had an error in my proxy configuration, but it was a different error on my side – the browser was simply trying to load pictures and CSS referenced by the html page generated by the hidden machine (which wasn’t yet configured to pretend to be insdev01.cern.ch).

2. The server reached MaxClients.

This happens only when I want to use insdev01:80 as a public proxy and run Invenio on the same machine on a different port:

BalancerMember http://insdev01.cern.ch:8080 route=atlantis

I don’t know yet how to solve this; the returned html code is correct, but for some reason wsgi seems to be loaded in a loop (probably because the proxy establishes many connections) and this exhausts MaxClients, which was set to 250.

How to build and test Invenio query parser (II.)

Once we have an ANTLR grammar which produces an AST (Abstract Syntax Tree), we can translate the user input (a string query) into Lucene Query objects.

Recall, the AST will produce something like:

We’ll use Lucene’s modern query parser, especially its child, antlrqueryparser.

The new contrib parser is much more flexible than the traditional query parsing. But it is also rather confusing if you see it for the first time. Basically, there are 4 types of components:

  1. parser configuration
  2. syntax parser
  3. query node processor
  4. query builder

To illustrate it, I’ll use this (ugly) drawing

The query parser is the box into which we put the configuration and which also holds the user input (in our case: this OR that*).

In the first stage, the query parser takes “this OR that*” and parses it through the ANTLR syntax parser. We obtain the AST tree.

The AST tree is then transformed into QueryNodes by a series of NodeProcessors. We obtain a QueryNode tree which is much more compact (smaller) than the original AST tree. A lot of nodes of the original AST tree were consumed or discarded, but some new nodes may also be added.

The QueryNodes are then passed to a QueryBuilder – and from there we obtain a Lucene Query instance. The query parsing has finished, the searching starts…
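End to end, the whole flow fits into a few lines. A minimal sketch using the stock StandardQueryParser (with the Lucene 3.x package names this post is based on), rather than our Aqp parser:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.core.QueryNodeException;
import org.apache.lucene.queryParser.standard.StandardQueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class ParseFlow {
    public static void main(String[] args) throws QueryNodeException {
        // the configuration (analyzer, default operator, ...) goes into the parser "box";
        // the three stages (syntax parser -> node processors -> builders) all run inside parse()
        StandardQueryParser parser = new StandardQueryParser(new StandardAnalyzer(Version.LUCENE_36));
        Query query = parser.parse("this OR that*", "field"); // "field" is the default field
        System.out.println(query); // e.g. field:this field:that*
    }
}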

The most important parts of this query parser are the NodeProcessors and QueryNodeBuilders. The magic happens there. Let’s look at the processor pipeline first, the initialization:

  public AqpQueryNodeProcessorPipeline(QueryConfigHandler queryConfig) {
    super(queryConfig);

    add(new AqpDEFOPProcessor());
    add(new AqpTreeRewriteProcessor());
    add(new AqpMODIFIERProcessor());
    add(new AqpOPERATORProcessor());
    add(new AqpCLAUSEProcessor());
    add(new AqpTMODIFIERProcessor());
    add(new AqpBOOSTProcessor());
    add(new AqpFUZZYProcessor());
    add(new AqpQRANGEINProcessor());
    add(new AqpQRANGEEXProcessor());
    add(new AqpQNORMALProcessor());
    add(new AqpQPHRASEProcessor());
    add(new AqpQPHRASETRUNCProcessor());
    add(new AqpQTRUNCATEDProcessor());
    add(new AqpQRANGEINProcessor());
    add(new AqpQRANGEEXProcessor());
    add(new AqpQANYTHINGProcessor());
    add(new AqpFIELDProcessor());
    add(new AqpFuzzyModifierProcessor());
    add(new WildcardQueryNodeProcessor());
    add(new MultiFieldQueryNodeProcessor());
    add(new FuzzyQueryNodeProcessor());
    add(new MatchAllDocsQueryNodeProcessor());
    add(new LowercaseExpandedTermsQueryNodeProcessor());
    add(new ParametricRangeQueryNodeProcessor());
    add(new AllowLeadingWildcardProcessor());
    add(new AnalyzerQueryNodeProcessor());
    add(new PhraseSlopQueryNodeProcessor());
    add(new NoChildOptimizationQueryNodeProcessor());
    add(new RemoveDeletedQueryNodesProcessor());
    add(new RemoveEmptyNonLeafQueryNodeProcessor());
    add(new BooleanSingleChildOptimizationQueryNodeProcessor());
    add(new DefaultPhraseSlopQueryNodeProcessor());
    add(new BoostQueryNodeProcessor());
    add(new MultiTermRewriteMethodProcessor());
    add(new AqpGroupQueryOptimizerProcessor());
    add(new AqpOptimizationProcessor());
  }

Errrr, that looks scary, right? There is a number of transformers which massage the AST tree. For example, the AqpDEFOPProcessor takes the query node, looks at its “label” and, if the label equals “DEFOP”, replaces the node with a new QueryNode and sets the default operator (and how do we know the default operator? The configuration tells us).
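For illustration, here is a hypothetical skeleton of such a processor (not the actual AqpDEFOPProcessor source); the pipeline walks the QueryNode tree and hands every node to these three hooks:

import java.util.List;

import org.apache.lucene.queryParser.core.QueryNodeException;
import org.apache.lucene.queryParser.core.nodes.QueryNode;
import org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl;

public class MyDefopProcessor extends QueryNodeProcessorImpl {

    @Override
    protected QueryNode preProcessNode(QueryNode node) throws QueryNodeException {
        return node; // inspect nodes on the way down the tree
    }

    @Override
    protected QueryNode postProcessNode(QueryNode node) throws QueryNodeException {
        // on the way up: if this were "our" node (e.g. labelled DEFOP), we would return
        // a replacement QueryNode built with the default operator taken from
        // getQueryConfigHandler(); otherwise we pass the node through untouched
        return node;
    }

    @Override
    protected List<QueryNode> setChildrenOrder(List<QueryNode> children) throws QueryNodeException {
        return children; // a chance to reorder or drop children
    }
}

If you want to see what is actually happening in the processors, we can turn on the debugging (I’ll describe that later) and see this output on the stdout: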

query:        this OR that*
     0. starting
--------------------------------------------

<astOPERATOR label="DEFOP" name="OPERATOR" type="25" >
    <astOPERATOR label="OR" name="OPERATOR" type="25" >
        <astMODIFIER label="MODIFIER" name="MODIFIER" type="21" >
            <astTMODIFIER label="TMODIFIER" name="TMODIFIER" type="49" >
                <astFIELD label="FIELD" name="FIELD" type="14" >
                    <astQNORMAL label="QNORMAL" name="QNORMAL" type="33" >
                        <astTERM_NORMAL value="this" start="0" end="3" name="TERM_NORMAL" type="45" />
                    </astQNORMAL>
                </astFIELD>
            </astTMODIFIER>
        </astMODIFIER>
        <astMODIFIER label="MODIFIER" name="MODIFIER" type="21" >
            <astTMODIFIER label="TMODIFIER" name="TMODIFIER" type="49" >
                <astFIELD label="FIELD" name="FIELD" type="14" >
                    <astQTRUNCATED label="QTRUNCATED" name="QTRUNCATED" type="38" >
                        <astTERM_TRUNCATED value="that*" start="8" end="12" name="TERM_TRUNCATED" type="47" />
                    </astQTRUNCATED>
                </astFIELD>
            </astTMODIFIER>
        </astMODIFIER>
    </astOPERATOR>
</astOPERATOR>
     1. step class org.apache.lucene.queryParser.aqp.processors.AqpDEFOPProcessor
     Tree changed: YES
--------------------------------------------

<astOPERATOR label="OR" name="OPERATOR" type="25" >
    <astMODIFIER label="MODIFIER" name="MODIFIER" type="21" >
        <astTMODIFIER label="TMODIFIER" name="TMODIFIER" type="49" >
            <astFIELD label="FIELD" name="FIELD" type="14" >
                <astQNORMAL label="QNORMAL" name="QNORMAL" type="33" >
                    <astTERM_NORMAL value="this" start="0" end="3" name="TERM_NORMAL" type="45" />
                </astQNORMAL>
            </astFIELD>
        </astTMODIFIER>
    </astMODIFIER>
    <astMODIFIER label="MODIFIER" name="MODIFIER" type="21" >
        <astTMODIFIER label="TMODIFIER" name="TMODIFIER" type="49" >
            <astFIELD label="FIELD" name="FIELD" type="14" >
                <astQTRUNCATED label="QTRUNCATED" name="QTRUNCATED" type="38" >
                    <astTERM_TRUNCATED value="that*" start="8" end="12" name="TERM_TRUNCATED" type="47" />
                </astQTRUNCATED>
            </astFIELD>
        </astTMODIFIER>
    </astMODIFIER>
</astOPERATOR>
     2. step class org.apache.lucene.queryParser.aqp.processors.AqpTreeRewriteProcessor
     Tree changed: NO
--------------------------------------------

     3. step class org.apache.lucene.queryParser.aqp.processors.AqpMODIFIERProcessor
     Tree changed: YES
--------------------------------------------

<astOPERATOR label="OR" name="OPERATOR" type="25" >
    <astTMODIFIER label="TMODIFIER" name="TMODIFIER" type="49" >
        <astFIELD label="FIELD" name="FIELD" type="14" >
            <astQNORMAL label="QNORMAL" name="QNORMAL" type="33" >
                <astTERM_NORMAL value="this" start="0" end="3" name="TERM_NORMAL" type="45" />
            </astQNORMAL>
        </astFIELD>
    </astTMODIFIER>
    <astTMODIFIER label="TMODIFIER" name="TMODIFIER" type="49" >
        <astFIELD label="FIELD" name="FIELD" type="14" >
            <astQTRUNCATED label="QTRUNCATED" name="QTRUNCATED" type="38" >
                <astTERM_TRUNCATED value="that*" start="8" end="12" name="TERM_TRUNCATED" type="47" />
            </astQTRUNCATED>
        </astFIELD>
    </astTMODIFIER>
</astOPERATOR>
     4. step class org.apache.lucene.queryParser.aqp.processors.AqpOPERATORProcessor
     Tree changed: YES
--------------------------------------------
<boolean operation='OR'>
<modifier operation='MOD_NONE'>

<astTMODIFIER label="TMODIFIER" name="TMODIFIER" type="49" >
    <astFIELD label="FIELD" name="FIELD" type="14" >
        <astQNORMAL label="QNORMAL" name="QNORMAL" type="33" >
            <astTERM_NORMAL value="this" start="0" end="3" name="TERM_NORMAL" type="45" />
        </astQNORMAL>
    </astFIELD>
</astTMODIFIER>
</modifier>
<modifier operation='MOD_NONE'>

<astTMODIFIER label="TMODIFIER" name="TMODIFIER" type="49" >
    <astFIELD label="FIELD" name="FIELD" type="14" >
        <astQTRUNCATED label="QTRUNCATED" name="QTRUNCATED" type="38" >
            <astTERM_TRUNCATED value="that*" start="8" end="12" name="TERM_TRUNCATED" type="47" />
        </astQTRUNCATED>
    </astFIELD>
</astTMODIFIER>
</modifier>
</boolean>
     5. step class org.apache.lucene.queryParser.aqp.processors.AqpCLAUSEProcessor
     Tree changed: NO
--------------------------------------------

     6. step class org.apache.lucene.queryParser.aqp.processors.AqpTMODIFIERProcessor
     Tree changed: YES
--------------------------------------------
<boolean operation='OR'>
<modifier operation='MOD_NONE'>

<astFIELD label="FIELD" name="FIELD" type="14" >
    <astQNORMAL label="QNORMAL" name="QNORMAL" type="33" >
        <astTERM_NORMAL value="this" start="0" end="3" name="TERM_NORMAL" type="45" />
    </astQNORMAL>
</astFIELD>
</modifier>
<modifier operation='MOD_NONE'>

<astFIELD label="FIELD" name="FIELD" type="14" >
    <astQTRUNCATED label="QTRUNCATED" name="QTRUNCATED" type="38" >
        <astTERM_TRUNCATED value="that*" start="8" end="12" name="TERM_TRUNCATED" type="47" />
    </astQTRUNCATED>
</astFIELD>
</modifier>
</boolean>
     7. step class org.apache.lucene.queryParser.aqp.processors.AqpBOOSTProcessor
     Tree changed: NO
--------------------------------------------

     8. step class org.apache.lucene.queryParser.aqp.processors.AqpFUZZYProcessor
     Tree changed: NO
--------------------------------------------

     9. step class org.apache.lucene.queryParser.aqp.processors.AqpQRANGEINProcessor
     Tree changed: NO
--------------------------------------------

     10. step class org.apache.lucene.queryParser.aqp.processors.AqpQRANGEEXProcessor
     Tree changed: NO
--------------------------------------------

     11. step class org.apache.lucene.queryParser.aqp.processors.AqpQNORMALProcessor
     Tree changed: YES
--------------------------------------------
<boolean operation='OR'>
<modifier operation='MOD_NONE'>

<astFIELD label="FIELD" name="FIELD" type="14" ><field start='0' end='3' field='field' text='this'/>
</astFIELD>
</modifier>
<modifier operation='MOD_NONE'>

<astFIELD label="FIELD" name="FIELD" type="14" >
    <astQTRUNCATED label="QTRUNCATED" name="QTRUNCATED" type="38" >
        <astTERM_TRUNCATED value="that*" start="8" end="12" name="TERM_TRUNCATED" type="47" />
    </astQTRUNCATED>
</astFIELD>
</modifier>
</boolean>
     12. step class org.apache.lucene.queryParser.aqp.processors.AqpQPHRASEProcessor
     Tree changed: NO
--------------------------------------------

     13. step class org.apache.lucene.queryParser.aqp.processors.AqpQPHRASETRUNCProcessor
     Tree changed: NO
--------------------------------------------

     14. step class org.apache.lucene.queryParser.aqp.processors.AqpQTRUNCATEDProcessor
     Tree changed: YES
--------------------------------------------
<boolean operation='OR'>
<modifier operation='MOD_NONE'>

<astFIELD label="FIELD" name="FIELD" type="14" ><field start='0' end='3' field='field' text='this'/>
</astFIELD>
</modifier>
<modifier operation='MOD_NONE'>

<astFIELD label="FIELD" name="FIELD" type="14" ><wildcard field='field' term='that*'/>
</astFIELD>
</modifier>
</boolean>
     15. step class org.apache.lucene.queryParser.aqp.processors.AqpQRANGEINProcessor
     Tree changed: NO
--------------------------------------------

     16. step class org.apache.lucene.queryParser.aqp.processors.AqpQRANGEEXProcessor
     Tree changed: NO
--------------------------------------------

     17. step class org.apache.lucene.queryParser.aqp.processors.AqpQANYTHINGProcessor
     Tree changed: NO
--------------------------------------------

     18. step class org.apache.lucene.queryParser.aqp.processors.AqpFIELDProcessor
     Tree changed: YES
--------------------------------------------
<boolean operation='OR'>
<modifier operation='MOD_NONE'>
<field start='0' end='3' field='field' text='this'/>
</modifier>
<modifier operation='MOD_NONE'>
<wildcard field='field' term='that*'/>
</modifier>
</boolean>
     19. step class org.apache.lucene.queryParser.aqp.processors.AqpFuzzyModifierProcessor
     Tree changed: NO
--------------------------------------------

     20. step class org.apache.lucene.queryParser.standard.processors.WildcardQueryNodeProcessor
     Tree changed: YES
--------------------------------------------
<boolean operation='OR'>
<modifier operation='MOD_NONE'>
<field start='0' end='3' field='field' text='this'/>
</modifier>
<modifier operation='MOD_NONE'>
<prefixWildcard field='field' term='that*'/>
</modifier>
</boolean>
     21. step class org.apache.lucene.queryParser.standard.processors.MultiFieldQueryNodeProcessor
     Tree changed: NO
--------------------------------------------

     22. step class org.apache.lucene.queryParser.standard.processors.FuzzyQueryNodeProcessor
     Tree changed: NO
--------------------------------------------

     23. step class org.apache.lucene.queryParser.standard.processors.MatchAllDocsQueryNodeProcessor
     Tree changed: NO
--------------------------------------------

     24. step class org.apache.lucene.queryParser.standard.processors.LowercaseExpandedTermsQueryNodeProcessor
     Tree changed: NO
--------------------------------------------

     25. step class org.apache.lucene.queryParser.standard.processors.ParametricRangeQueryNodeProcessor
     Tree changed: NO
--------------------------------------------

     26. step class org.apache.lucene.queryParser.standard.processors.AllowLeadingWildcardProcessor
     Tree changed: NO
--------------------------------------------

     27. step class org.apache.lucene.queryParser.standard.processors.AnalyzerQueryNodeProcessor
     Tree changed: NO
--------------------------------------------

     28. step class org.apache.lucene.queryParser.standard.processors.PhraseSlopQueryNodeProcessor
     Tree changed: NO
--------------------------------------------

     29. step class org.apache.lucene.queryParser.core.processors.NoChildOptimizationQueryNodeProcessor
     Tree changed: NO
--------------------------------------------

     30. step class org.apache.lucene.queryParser.core.processors.RemoveDeletedQueryNodesProcessor
     Tree changed: NO
--------------------------------------------

     31. step class org.apache.lucene.queryParser.standard.processors.RemoveEmptyNonLeafQueryNodeProcessor
     Tree changed: NO
--------------------------------------------

     32. step class org.apache.lucene.queryParser.standard.processors.BooleanSingleChildOptimizationQueryNodeProcessor
     Tree changed: NO
--------------------------------------------

     33. step class org.apache.lucene.queryParser.standard.processors.DefaultPhraseSlopQueryNodeProcessor
     Tree changed: NO
--------------------------------------------

     34. step class org.apache.lucene.queryParser.standard.processors.BoostQueryNodeProcessor
     Tree changed: NO
--------------------------------------------

     35. step class org.apache.lucene.queryParser.standard.processors.MultiTermRewriteMethodProcessor
     Tree changed: NO
--------------------------------------------

     36. step class org.apache.lucene.queryParser.aqp.processors.AqpGroupQueryOptimizerProcessor
     Tree changed: NO
--------------------------------------------

     37. step class org.apache.lucene.queryParser.aqp.processors.AqpOptimizationProcessor
     Tree changed: NO
--------------------------------------------

final result:
--------------------------------------------
<boolean operation='OR'>
<modifier operation='MOD_NONE'>
<field start='0' end='3' field='field' text='this'/>
</modifier>
<modifier operation='MOD_NONE'>
<prefixWildcard field='field' term='that*'/>
</modifier>
</boolean>

query:        this OR that*
result:        field:this field:that*

The final result is what the QueryBuilder receives, and that part is actually the easier one. The standard query tree builder contains:

// maps each QueryNode type to the builder that turns it into a Lucene Query
public AqpStandardQueryTreeBuilder() {
    setBuilder(GroupQueryNode.class, new GroupQueryNodeBuilder());
    setBuilder(FieldQueryNode.class, new AqpFieldQueryNodeBuilder());
    setBuilder(BooleanQueryNode.class, new BooleanQueryNodeBuilder());
    setBuilder(FuzzyQueryNode.class, new FuzzyQueryNodeBuilder());
    setBuilder(BoostQueryNode.class, new BoostQueryNodeBuilder());
    setBuilder(ModifierQueryNode.class, new ModifierQueryNodeBuilder());
    setBuilder(WildcardQueryNode.class, new WildcardQueryNodeBuilder());
    setBuilder(TokenizedPhraseQueryNode.class, new PhraseQueryNodeBuilder());
    setBuilder(MatchNoDocsQueryNode.class, new MatchNoDocsQueryNodeBuilder());
    setBuilder(PrefixWildcardQueryNode.class,
        new PrefixWildcardQueryNodeBuilder());
    setBuilder(RangeQueryNode.class, new RangeQueryNodeBuilder());
    setBuilder(SlopQueryNode.class, new SlopQueryNodeBuilder());
    setBuilder(StandardBooleanQueryNode.class,
        new StandardBooleanQueryNodeBuilder());
    setBuilder(MultiPhraseQueryNode.class, new MultiPhraseQueryNodeBuilder());
    setBuilder(MatchAllDocsQueryNode.class, new MatchAllDocsQueryNodeBuilder());
  }

So, unless you introduce a completely new QueryNode type, the standard query parser will be able to translate the tree into a Lucene Query instance.
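And if you do introduce your own node type, you implement a builder for it and register it with setBuilder() just like the entries above. Here is a minimal sketch, assuming the Lucene 3.x flexible query parser API; MyTermQueryBuilder is a made-up name, and I use the standard FieldQueryNode so the example stays self-contained:

import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.core.QueryNodeException;
import org.apache.lucene.queryParser.core.nodes.FieldQueryNode;
import org.apache.lucene.queryParser.core.nodes.QueryNode;
import org.apache.lucene.queryParser.standard.builders.StandardQueryBuilder;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// A made-up builder that turns a FieldQueryNode into a plain TermQuery.
// It would be registered inside the tree builder constructor with:
//   setBuilder(FieldQueryNode.class, new MyTermQueryBuilder());
public class MyTermQueryBuilder implements StandardQueryBuilder {
  public Query build(QueryNode node) throws QueryNodeException {
    FieldQueryNode fieldNode = (FieldQueryNode) node;
    return new TermQuery(new Term(fieldNode.getFieldAsString(),
        fieldNode.getTextAsString()));
  }
}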

If you would like to study a working Lucene query parser for the Invenio grammar we wrote about in the first part, have a look at:

org/apache/lucene/queryParser/aqp/AqpInvenioSyntaxParser.java

The source code contains the complete configuration of a query parser for Invenio, including the setup of default values, the QueryNodeProcessor pipeline and the query builders.
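Very roughly, the configuration follows the usual flexible query parser pattern: plug in a syntax parser, a node processor pipeline and a tree builder. A minimal sketch, assuming the setters of Lucene's QueryParserHelper (the pipeline class name here is a hypothetical placeholder; AqpInvenioSyntaxParser and AqpStandardQueryTreeBuilder are the real names):

public AqpInvenioQueryParser() {
    // the ANTLR-generated syntax parser for the Invenio grammar
    setSyntaxParser(new AqpInvenioSyntaxParser());
    // the chain of Aqp*Processor steps from the debug trace above
    // (AqpQueryNodeProcessorPipeline is a hypothetical class name)
    setQueryNodeProcessor(new AqpQueryNodeProcessorPipeline(getQueryConfigHandler()));
    // the tree builder shown earlier
    setQueryBuilder(new AqpStandardQueryTreeBuilder());
}

The parser can then be run as: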

import org.apache.lucene.queryParser.aqp.AqpQueryParser;
import org.apache.lucene.queryParser.aqp.AqpInvenioQueryParser;

AqpQueryParser qp = new AqpInvenioQueryParser();
qp.setDebug(true); // to see the debug info on stderr

System.out.println(qp.parse("this or that*", "defaultField"));
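With setDebug(true) switched on, the parser prints to stderr the kind of step-by-step processor trace shown at the beginning of this post; the println should then output the built query, i.e. something like field:this field:that*.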

OK, in the next part we will see why Invenio parsing is syntactically ambiguous, how to solve that, and how to plug the Invenio grammar into SOLR.

How to build and test Invenio query parser (I.)

This entry shows how to mimic the exact behaviour of the Invenio query parser in Java, using the modern ANTLR parser library and the antlrqueryparser module of Lucene and MontySolr.

First, we need a query syntax grammar for ANTLR. As no formal grammar existed before and the Invenio syntax was defined only via examples (#1, #2), we must create a formal grammar ourselves.

cd contrib/antlrqueryparser/grammars
vim Invenio.g

You can download the complete grammar from: https://github.com/romanchyla/montysolr/blob/master/contrib/invenio/grammars/Invenio.g
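For orientation, an ANTLR 3 grammar of this kind has roughly the following shape. This is a heavily simplified sketch of mine, not an excerpt of the real Invenio.g; the rule and token names only mimic the real ones:

grammar Invenio;

options {
  output = AST;              // build an AST instead of a flat parse
}

// imaginary tokens used only as AST node labels
tokens { DEFOP; MODIFIER; }

mainQ  : clause EOF -> clause ;
clause : atom+ -> ^(DEFOP atom+) ;
atom   : TERM_NORMAL -> ^(MODIFIER TERM_NORMAL) ;

TERM_NORMAL : ('a'..'z'|'A'..'Z'|'0'..'9')+ ;
WS          : (' '|'\t'|'\r'|'\n')+ { $channel = HIDDEN; } ;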

So we open up ANTLRWorks:

java -jar antlrworks.jar &

And we load the grammar…

In my experience, ANTLRWorks is a very good GUI for designing grammars, but it has some serious problems. For example, you cannot rely on the Interpreter (depicted above) to always show the correct parse: sometimes it fails but pretends to work, sometimes the other way round.

So, when I edit the grammar, my routine is:

  1. Edit the grammar
  2. Ctrl+D (to check the grammar syntax)
  3. Run example in the interpreter tab
  4. Verify parsing on the command line:
ant try-view -Dgrammar=Invenio -Dquery="some #funny^0.4 token"
Buildfile: /dvt/workspace/montysolr/contrib/antlrqueryparser/build.xml
     [echo] Building antlrqueryparser...

antlr-generate:
     [echo]
     [echo]                 Regenerating: Invenio
     [echo]                 Output: /dvt/workspace/montysolr/contrib/antlrqueryparser/src/java/org/apache/lucene/queryParser/aqp/parser
     [echo]                 
    [javac] Compiling 2 source files to /dvt/workspace/montysolr/build/contrib/antlrqueryparser/classes/java
    [javac] InvenioParser.java:181: warning: [cast] redundant cast to java.lang.Object
    [javac]                     root_0 = (Object)adaptor.nil();
    [javac]                              ^
    ... a lot of running text
    [javac] 100 warnings

antlr:

compile:
    [javac] /dvt/workspace/montysolr/contrib/antlrqueryparser/build.xml:131: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
    [javac] Compiling 1 source file to /dvt/workspace/montysolr/build/contrib/antlrqueryparser/classes/java

compile-all:
    [javac] /dvt/workspace/montysolr/contrib/antlrqueryparser/build.xml:122: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds

dot:
     [echo]
     [echo]                     Generating DOT: Invenio  
     [echo]                     Query: some #funny^0.4 token
     [echo]                     Rule: mainQ       
     [echo]                 

tree:
     [echo]
     [echo]                 Generating TREE: Invenio  
     [echo]                 Query: some #funny^0.4 token
     [echo]                 Rule: mainQ       
     [echo]             
     [java] Grammar: Invenio rule:mainQ
     [java] query: some #funny^0.4 token
     [java] (DEFOP (DEFOP (MODIFIER (TMODIFIER (FIELD (QNORMAL some)))) (MODIFIER # (TMODIFIER (BOOST ^0.4) FUZZY (FIELD (QNORMAL funny))))) (MODIFIER (TMODIFIER (FIELD (QNORMAL token)))))

display:

If you have a dot viewer installed, you will see the following picture of the AST (Abstract Syntax Tree):

The picture shows the AST produced by ANTLR, and everything looks OK – if there were an error, you would notice it on the command line and see a blank picture.

If you can’t see any picture despite a correct grammar, verify that the contrib/antlrqueryparser/build.properties file contains the correct path to your viewer:

dot_viewer=/usr/bin/xdot

Notice that the debug output includes a textual representation of the AST:

(DEFOP (DEFOP (MODIFIER (TMODIFIER (FIELD (QNORMAL some)))) (MODIFIER # (TMODIFIER (BOOST ^0.4) FUZZY (FIELD (QNORMAL funny))))) (MODIFIER (TMODIFIER (FIELD (QNORMAL token)))))

This representation is the same as the picture, and we can use it to test whole batches of examples. Create a file called Invenio.gunit, save it inside the grammars folder (a minimal sketch of its shape follows), and run the ant target below.
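A gunit file pairs each input query with the expected AST, or with an expected failure. A minimal sketch – the first test reuses the AST printed above; the FAIL entry is only illustrative:

gunit Invenio;

mainQ:
"some #funny^0.4 token" -> "(DEFOP (DEFOP (MODIFIER (TMODIFIER (FIELD (QNORMAL some)))) (MODIFIER # (TMODIFIER (BOOST ^0.4) FUZZY (FIELD (QNORMAL funny))))) (MODIFIER (TMODIFIER (FIELD (QNORMAL token)))))"
"++ --" FAIL

Then run: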

ant gunit -Dgrammar=Invenio -Dquery="some #funny^0.4 token"

We get output similar to this:

gunit:
     [echo]
     [echo]         Running GUNIT: Invenio        
     [echo]         
     [java] -----------------------------------------------------------------------
     [java] executing testsuite for grammar:Invenio with 118 tests
     [java] -----------------------------------------------------------------------
     [java] 2 failures found:
     [java] test19 (mainQ, line67) -
     [java] expected: FAIL
     [java] actual: OK
     [java]
     [java] test65 (mainQ, line193) -
     [java] expected:
     [java] actual: line 1:4 no viable alternative at input '+'
     [java] line 1:9 no viable alternative at input '-'
     [java]
     [java]
     [java]
     [java] Tests run: 118, Failures: 2
     [java] Java Result: 2

This test is much faster than using ANTLRWorks, and you can see immediately whether a grammar change broke something else. In our case we got two failures; here is the offending portion of the test file:

"C++" -> "(MODIFIER (TMODIFIER (FIELD (QNORMAL C++))))"

"O\'Shea" -> "(MODIFIER (TMODIFIER (FIELD (QNORMAL O'Shea))))"

"$e^{+}e^{-}$" -> ""

"hep-ph/0204133" -> "(MODIFIER (TMODIFIER (FIELD (QNORMAL hep-ph/0204133))))"

"BlaCK hOlEs" -> "(DEFOP (MODIFIER (TMODIFIER (FIELD (QNORMAL BlaCK)))) (MODIFIER (TMODIFIER (FIELD (QNORMAL hOlEs)))))"

"пушкин" -> "(MODIFIER (TMODIFIER (FIELD (QNORMAL пушкин))))"

I’ll show you in one of the next entries how to deal with such problems in the Invenio grammar.