Tuesday, December 25, 2012

Solr Functions in Action

The DataImportHandler is a great contrib module that imports data into Solr from relational databases. It operates in two modes: "full build" and "incremental updates". A delta import first calculates the changed items and then executes a query to extract their data from the source. This spawns multiple round-trips between Solr and the data source, which is often undesirable. Moreover, it sometimes causes an "out of memory" exception. For that reason the authors suggest an alternative: use the same query for both full and delta updates, distinguishing the two by request and "dataimporter.*" parameters.
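As a rough illustration of that single-query approach, an entity along the following lines can serve both modes. This is only a sketch modeled on the pattern documented for DataImportHandler; the table "item" and the columns "ID" and "last_modified" are assumed names, not taken from this post.

```xml
<!-- Sketch only: table and column names are assumptions. -->
<entity name="item" pk="ID"
        query="SELECT * FROM item
               WHERE '${dataimporter.request.clean}' != 'false'
                  OR last_modified > '${dataimporter.last_index_time}'">
  <!-- Maps the assumed primary key column to the Solr id field. -->
  <field column="ID" name="id"/>
</entity>
```

With such a query, a request with command=full-import rebuilds everything, while command=full-import&clean=false matches only the rows modified since the last index time.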

What's wrong here? Certainly, an XML attribute isn't the best place for SQL text. It's better to keep SQL in files (no need to escape characters, syntax highlighting works, and so on). The second pain point is the rather complex query. Such constructions often lead to bad execution plans, especially when the query runs against complicated views. So there are two options: tune the query (it may soon become over-tuned or non-portable) or make the query simpler.

I've written a few functions. Each is very simple on its own, but together they produce a really great cumulative effect:

  • decode is a remake of Oracle's DECODE
  • load reads a query from a file
  • run executes statements
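For orientation, DataImportHandler lets custom functions be registered with "function" elements and called through the "dataimporter.functions" namespace. The sketch below assumes that mechanism; the class names under "com.example", the file name "full.sql", and the single-argument call to load are hypothetical placeholders, not the actual sources or signatures.

```xml
<dataConfig>
  <!-- Hypothetical class names; each class would extend DIH's Evaluator. -->
  <function name="decode" class="com.example.dih.DecodeEvaluator"/>
  <function name="load" class="com.example.dih.LoadEvaluator"/>
  <function name="run" class="com.example.dih.RunEvaluator"/>
  <!-- The dataSource definition is omitted from this fragment. -->
  <document>
    <!-- Assumed usage: the SQL text lives in a file, not in the attribute. -->
    <entity name="item"
            query="${dataimporter.functions.load('full.sql')}"/>
  </document>
</dataConfig>
```

Keeping the SQL in a separate file means it needs no XML escaping and can be edited with ordinary SQL tooling.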

Here is the rewritten configuration file:

This is the query for the full index. Note that 'before-full.sql' runs before the query is executed:

A completely different table can be used in a delta update:

Please find the sources of the functions.