For example, say you have an index of Parents and another index of Children, and you want to get a list of Parents who have a Child called Tom.
The usual Solr answer to this is to add the Children's names as a multivalued field in the Parent document.
However, what if you can't? There are 4 million Parents in the database, each with hundreds of fields, and loading the index that way would mean making 4 million subquery calls to get the Children. The Parents themselves don't really change much, but the world's sperm banks are issuing and correcting thousands of parental links a day, which would mean updating thousands of Parent documents. (I should mention that this whole example is entirely hypothetical!)
In this case you really need to maintain two indexes: an index of Parents and an index of Children (where each Child document contains a mother id field and a father id field).
Both can be populated by a single query to the backing database.
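As a rough illustration of "one flat query per index", here is a minimal Python sketch using an in-memory SQLite database. The table names, the `name`, `childid` and `motherid` columns, and the sample rows are all assumptions made up for this example; only `parentid`, `fatherid`, `childname` and `alive` appear in the join query later on. Posting the resulting documents to Solr is left out.

```python
import sqlite3

# Hypothetical backing database. Table and column names are assumptions
# chosen to line up with the field names used in the join example.
db = sqlite3.connect(":memory:")
db.row_factory = sqlite3.Row
db.executescript("""
    CREATE TABLE parents  (parentid INTEGER PRIMARY KEY, name TEXT, alive TEXT);
    CREATE TABLE children (childid INTEGER PRIMARY KEY, childname TEXT,
                           motherid INTEGER, fatherid INTEGER);
    INSERT INTO parents  VALUES (1, 'Alice', 'yes'), (2, 'Bob', 'yes');
    INSERT INTO children VALUES (10, 'Tom', 1, 2);
""")

def rows_as_docs(sql):
    """Run one flat query and return each row as a dict (one dict per document)."""
    return [dict(row) for row in db.execute(sql)]

# One query per index, with no per-Parent subquery to fetch the Children.
parent_docs = rows_as_docs("SELECT parentid, name, alive FROM parents")
child_docs = rows_as_docs(
    "SELECT childid, childname, motherid, fatherid FROM children"
)
```

Because a corrected parental link only touches a row in the children table, only the matching Child document would need re-indexing in this layout.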
The current trunk copy of Solr 4.0 has a new join feature that promises this functionality. However, there are a few things to bear in mind -
- You can only perform a join on one index with itself or across two cores in the same servlet container. No joining across multiple servlets (as far as I can tell).
- You can join based on criteria on the Child, but you'll only get fields back from the Parent; nothing from the Child is returned.
- There's not much documentation out there.
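To make the second point concrete, here is a small in-memory model of what the join does, written in Python. It is only a conceptual sketch, not how Solr implements the feature: the function name `join` and the sample documents are invented for illustration, and the field names are borrowed from the example query further down.

```python
def join(children, parents, from_field, to_field, child_filter):
    """Conceptual model of a join: filter the Children, collect their
    from_field values, and return the Parents whose to_field matches."""
    keys = {c[from_field] for c in children if child_filter(c)}
    # Only Parent documents come back; nothing from the Child is included.
    return [p for p in parents if p[to_field] in keys]

# Hypothetical sample documents.
parents = [
    {"parentid": 1, "alive": "yes"},
    {"parentid": 2, "alive": "yes"},
    {"parentid": 3, "alive": "no"},
]
children = [
    {"childname": "Tom", "motherid": 1, "fatherid": 2},
    {"childname": "Ann", "motherid": 1, "fatherid": 3},
]

fathers_of_tom = join(
    children, parents,
    from_field="fatherid", to_field="parentid",
    child_filter=lambda c: c["childname"] == "Tom",
)
```

Here `fathers_of_tom` holds only the Parent document with `parentid` 2, and it carries no `childname` value, which mirrors the "Parent fields only" limitation.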
Some things I had to do to get my example to work -
- If you want to join by filtering on a Child field that doesn't exist in the Parent, for now at least, you have to add the same field as a dummy field in the Parent schema.
- If you want to perform a query on the Parents at the same time as joining with Children you need to write the join part as a nested query.
http://localhost:8983/solr/parents/select?q=alive:yes AND _query_:"{!join fromIndex=children from=fatherid to=parentid v='childname:Tom'}"
In English: Return all fathers that are alive and have a child called Tom.
Note: The field childname had to be added to the Parents schema.
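The example URL contains spaces, quotes and braces, so in practice it needs URL encoding before it is sent. A minimal Python sketch of building it with the standard library follows; the helper name `build_join_url` is made up for this example, and the host, core names and field names are simply the ones from the URL above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_join_url(base, parent_query, join_params, child_query):
    """Combine a Parent query with a nested {!join} query and URL-encode it."""
    local_params = " ".join(f"{k}={v}" for k, v in join_params.items())
    join_clause = f"{{!join {local_params} v='{child_query}'}}"
    q = f'{parent_query} AND _query_:"{join_clause}"'
    return q, base + "?" + urlencode({"q": q})

q, url = build_join_url(
    "http://localhost:8983/solr/parents/select",
    parent_query="alive:yes",
    join_params={"fromIndex": "children", "from": "fatherid", "to": "parentid"},
    child_query="childname:Tom",
)

# Decoding the query string again gives back the original q parameter.
decoded = parse_qs(urlsplit(url).query)["q"][0]
```

The unencoded `q` string produced here is identical to the one in the example URL; only the transport encoding differs.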