Tuesday, March 20, 2012

Adventures with Solr Join

Solr is brilliant when you can denormalise all your data into a flat, document-by-document structure. It is not so brilliant when you can't, and you need to "join" searches across multiple independent indices.

For example, say you have an index of Parents and another index of Children, and you want to get a list of Parents who have a Child called Tom.

The normal Solr response to this is to add the Children's names as a multivalued field in the Parent document.

However, what if you can't? There are 4 million Parents in the database with hundreds of fields, and loading the index that way would mean making 4 million sub-query calls to get the Children. The Parents themselves don't really change much, but the world's sperm banks are issuing and correcting thousands of parental links a day, which would mean updating thousands of Parent documents. (I should mention that this whole example is entirely hypothetical!)

In this case you really need to maintain two indexes: an index of Parents and an index of Children (each Child document contains a mother id field and a father id field).

Both can be populated by a single query to the backing database.

The current trunk copy of Solr 4.0 has a new join feature that promises this functionality. However there are a few things to bear in mind -


  • You can only perform a join on one index with itself or across two cores in the same servlet container. No joining across multiple servlets (as far as I can tell).

  • You can join based on a criterion on the Child, but you'll only get fields back from the Parent. Nothing from the Child is returned.

  • There's not much documentation out there.
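The second point above can be illustrated with a toy, in-memory sketch in Python. This is not how Solr implements the join; it only mimics the visible behaviour (filter on the "from" side, return documents from the "to" side). The sample documents, the `name` and `childid` fields, and the `join` helper are all made up for illustration.

```python
# Toy stand-ins for the two indexes (hypothetical data, not real Solr documents).
parents = [
    {"parentid": 1, "name": "Alice", "alive": "yes"},
    {"parentid": 2, "name": "Bob", "alive": "yes"},
    {"parentid": 3, "name": "Carl", "alive": "no"},
]
children = [
    {"childid": 10, "childname": "Tom", "motherid": 1, "fatherid": 2},
    {"childid": 11, "childname": "Ann", "motherid": 1, "fatherid": 3},
]


def join(from_docs, from_field, to_docs, to_field, predicate):
    """Mimic the join's visible behaviour: filter the 'from' side,
    collect its join keys, and return only matching 'to' documents."""
    keys = {doc[from_field] for doc in from_docs if predicate(doc)}
    return [doc for doc in to_docs if doc[to_field] in keys]


# Fathers who have a child called Tom.
fathers_of_tom = join(
    children, "fatherid", parents, "parentid",
    lambda child: child["childname"] == "Tom",
)
print(fathers_of_tom)
```

Note that the result contains only Parent documents: the Child's own fields (such as `childname`) are used for filtering but never appear in the output.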


Some things I had to do to get my example to work -



If you want to join by filtering on a Child field that doesn't exist in the Parent, for now at least, you have to add the same field as a dummy field in the Parent schema.

If you want to perform a query on the Parents at the same time as joining with Children you need to write the join part as a nested query.


http://localhost:8983/solr/parents/select?q=alive:yes AND _query_:"{!join fromIndex=children from=fatherid to=parentid v='childname:Tom'}"


In English: Return all fathers that are alive and have a child called Tom.

Note: The field childname had to be added to the Parents schema.
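The query above contains spaces, quotes and braces, so it needs URL-encoding when sent from code rather than pasted into a browser. Here is a minimal Python sketch that builds the same request URL using only the standard library. The `build_join_url` helper is a hypothetical name of my own; the core names and field names are the ones from the example above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_join_url(base_url, parent_query, from_index, from_field, to_field, child_query):
    """Build a select URL that combines a query on the Parents with a
    nested join query on the Children (hypothetical helper)."""
    # The join itself, written with Solr's local-params syntax.
    join = "{{!join fromIndex={} from={} to={} v='{}'}}".format(
        from_index, from_field, to_field, child_query
    )
    # Wrap the join as a nested query so it can be ANDed with the parent query.
    q = '{} AND _query_:"{}"'.format(parent_query, join)
    return base_url + "?" + urlencode({"q": q})


url = build_join_url(
    "http://localhost:8983/solr/parents/select",
    "alive:yes",
    "children",
    "fatherid",
    "parentid",
    "childname:Tom",
)
print(url)

# Decoding the URL again shows the q parameter exactly as written in the post.
print(parse_qs(urlsplit(url).query)["q"][0])
```

The child query is passed in as a plain string here, so a value containing a single quote would need escaping before being placed inside v='...'.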

3 comments:

Sandeep Srinivasa said...

I've been curious about the same thing: how does Solr work in a distributed web services environment? For example, take an ecommerce store with separate catalog and price services. How do you do incremental indexing that is tolerant of the fact that these two services are not in sync with each other?

I wonder how a system like Amazon pushes updates to Solr.

Mauricio Matsuda said...

Hi Theo!

Thanks for this post!! Absolutely an amazing adventure.

We've been trying to use joins in Solr as well and we're facing many issues.

Would you mind sharing your configuration files? For example: data-config.xml and schema.xml?

That would be huge. Thanks a lot in advance.

Mark said...

Is this possible with SolrNet?