Slightly advanced Apache Solr config for dummies

This time I'll share some lessons learned while trying to optimize my Apache Solr configuration, from my n00b perspective.

First of all, I want to emphasize how cool and ridiculously easy it is to set up Apache Solr and get it up and running on a Drupal 8 site. This was a real plug & play pleasure! :-)

Installation and basic configuration

But ok, I have to admit that I "cheated" a little bit on the Solr installation itself. I have never installed Solr myself and never tried to, because this project is hosted on platform.sh. There you only need a handful of lines in two yml config files, where you just mention that you want to use Solr, which version, and how you want to name the cores and endpoints. These files and your Solr configuration files are committed to your Git repository, and platform.sh does the rest for you.

The second important factor in this plug & play game is the Search API Solr Search module for Drupal. The configuration is usability at its best. You don't have to know anything about Solr to get a basic setup up and running. You define your Search API server config just as you would for the database backend, and afterwards you can download the zipped config files, which you only have to copy into the appropriate directory on the Solr server. This and all other steps are well documented. In our case, I only had to add these files to our Git repository in the .platform/solr-conf/6.x directory and push the commit to trigger another build. This is f***ing amazing! (to quote Mike Skinner :D).

Playing around with synonyms

I was very happy that the first step got this up and running so quickly. Then of course I decided to have a closer look at the configuration in order to tweak it a little bit. One big advantage of using Solr over a normal database backend is that you are able to define synonyms. In our example, we have an Austria-based online shop with a lot of terms in the product catalogue that are more common in Germany than in Austria. E.g. in Austria nobody would call a tape measure a "Rollbandmaß", but rather a "Rollmeter". We also say "Kübel" for what Germans call an "Eimer". This is the perfect use case for the solr.SynonymFilterFactory. And as the search_api_solr module already ships with the synonym filter factory in mind, all you have to do is fill in the synonyms.txt (or the synonyms_LANGCODE.txt for a specific language)... at least theoretically. I entered a few examples, and it seemed to work at first glance. The next time I sat down to add more entries, I struggled, and I didn't have a clue why.

Debugging Solr

I needed a deeper insight into what went wrong, so I needed to debug Solr but had no idea how - especially since I had problems opening an SSH tunnel to the site from my Windows computer. So I asked on the platform.sh Slack channel how I could debug Solr. Only a few moments later, Larry Garfield pointed out that I could work around my tunnel problem by temporarily adding a route to the Solr core in my staging environment. This helped me a lot; at least it got me a few steps further on my journey. So big shout out to Larry Garfield for this :)

I was a little bit overwhelmed by the Solr admin UI though. I got a little insight into what is indexed, etc., but I still had no clue how best to debug what was going on with my synonyms. I found help on DrupalChat, where Markus Kalkbrenner gave me the missing links I needed. He's also the mastermind of the search_api_solr module, so another big shout out goes to Markus Kalkbrenner (and to Fritz and Paul, whose tracks I thought would be the right musical accompaniment for this journey :D).

So I verified the existence of the synonyms file in the "files" section, and afterwards used the analysis tool to see exactly what happens at index and query time. The analysis tool is quite easy to use. On the left-hand side you type in the value that gets indexed and choose the field type you are using, e.g. you type in the title of your product as it is entered in your Drupal database (e.g. "Rollbandmaß") and use "text_de" as field type. On the right-hand side, you type in the search term you want to analyze, e.g. "Rollmeter".

Pitfall number 1: the order of the filters matters!

Now I suddenly saw what was going on. The indexing process is configured so that in the end various terms get indexed, all of them in lowercase. At query time, a lower-case filter is used as well, alongside the synonym filter and others. The shipped configuration, however, runs the lower-case filter before the synonym filter. You can still configure the synonym filter to ignore case, but it always returns the values exactly as you defined them. As I had entered the words in the synonym file following correct German spelling, which means that nouns start with an upper-case letter, I didn't get any search results in the end, because Solr was looking for an upper-cased word in a lower-cased index!

I had two possibilities then: either switch the order of the filters or enter all synonyms in lower-case. I decided to do the latter, but that's just a matter of taste.
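To make the pitfall a bit more tangible, here is a tiny Python toy model of the filter chain as I described it above. It is not how Solr is implemented internally, and all function names are made up for this illustration. The only assumption is the behaviour I observed: the synonym lookup ignores case, but the filter emits the synonyms exactly as they are written in the file.

```python
def lowercase_filter(tokens):
    """Toy stand-in for Solr's lower-case filter."""
    return [t.lower() for t in tokens]


def synonym_filter(tokens, groups):
    """Toy synonym filter: case-insensitive lookup, but it emits the
    synonyms exactly as they are written in the (hypothetical) file."""
    out = []
    for token in tokens:
        for group in groups:
            if token.lower() in (s.lower() for s in group):
                out.extend(group)
                break
        else:
            # No synonym group matched, keep the token unchanged.
            out.append(token)
    return out


def query_matches(query, indexed_terms, groups, lowercase_first=True):
    """Run the query through both filters and look the result up in the index."""
    tokens = query.split()
    if lowercase_first:
        tokens = synonym_filter(lowercase_filter(tokens), groups)
    else:
        tokens = lowercase_filter(synonym_filter(tokens, groups))
    return any(t in indexed_terms for t in tokens)


# The index only ever contains lower-cased terms.
indexed = set(lowercase_filter(["Rollbandmaß"]))

capitalised = [["Rollbandmaß", "Rollmeter"]]
lowercased = [["rollbandmaß", "rollmeter"]]

print(query_matches("Rollmeter", indexed, capitalised))  # False
print(query_matches("Rollmeter", indexed, lowercased))   # True
print(query_matches("Rollmeter", indexed, capitalised, lowercase_first=False))  # True
```

The last two lines correspond to the two ways out: lower-case synonyms, or the lower-case filter running after the synonym filter.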

Solr synonyms filter and the expand attribute

There are two different syntaxes for defining synonyms in Solr:

term => replacement
term1, term2, term3

The first example is rather clear to read: everything on the left side gets replaced by the term(s) on the right side. So you could e.g. correct misspellings, like replacing "Flise" with "Fliese".

The comma-separated syntax is treated differently. It depends on whether or not you configure the synonym filter to expand search terms. If you decide to expand, then all terms are treated equally. So, if you enter "term1", then "term2" and "term3" will also be searched for. If you do not expand, then all terms are reduced to the first one: "term2" and "term3" get replaced by "term1", and only "term1" is searched for. That's a huge difference.
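The two syntaxes and the effect of expand can be sketched in a few lines of Python. Again, this is only a simplified illustration of the rules described above, not Solr's real parser, and parse_synonym_line is a name I made up for it.

```python
def parse_synonym_line(line, expand=True):
    """Map each input term of one synonyms.txt line to the terms it becomes.

    Simplified sketch: it ignores comments, escaping and multi-word terms.
    """
    if "=>" in line:
        # Explicit mapping: everything on the left becomes the right side.
        left, right = line.split("=>", 1)
        sources = [t.strip() for t in left.split(",") if t.strip()]
        targets = [t.strip() for t in right.split(",") if t.strip()]
        return {s: targets for s in sources}

    terms = [t.strip() for t in line.split(",") if t.strip()]
    if expand:
        # expand=true: every term stands for all terms of the line.
        return {t: terms for t in terms}
    # expand=false: every term is reduced to the first one.
    return {t: [terms[0]] for t in terms}


print(parse_synonym_line("Flise => Fliese"))
# {'Flise': ['Fliese']}
print(parse_synonym_line("term1, term2, term3", expand=True)["term2"])
# ['term1', 'term2', 'term3']
print(parse_synonym_line("term1, term2, term3", expand=False)["term2"])
# ['term1']
```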

In the default configuration of search_api_solr, the synonym filter is configured to expand. But this leads us directly to...

Pitfall number 2: the data type of the XML attributes in Solr filter configuration matters

By default, the search_api_solr module generates boolean attribute values in the XML configuration files for Solr 6.x as integers, like this: <filter class="solr.SynonymFilterFactory" synonyms="synonyms_de.txt" expand="1" ignoreCase="1"/>. But I observed that, on our Solr 6.6.5 instance at least, the "expand" attribute gets ignored this way. So I changed it to <filter class="solr.SynonymFilterFactory" synonyms="synonyms_de.txt" expand="true" ignoreCase="true"/> and it worked like a charm! Markus told me that this behaviour changed a lot in the past, but should have been stable since Solr 6.4. So I opened an issue in the Search API Solr Search issue queue. In the meantime you should test that your desired settings work as expected.
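If you want to check your exported config for this quickly, a small Python helper along these lines could do it. This is just a sketch: the function name and the sample XML wrapper (the fieldType and analyzer elements) are my own assumptions for the illustration, and it expects the text to be well-formed XML with a single root element, which may not hold for every exported file.

```python
import xml.etree.ElementTree as ET

# Attributes of the synonym filter that should hold "true" or "false".
BOOLEAN_ATTRIBUTES = ("expand", "ignoreCase")


def find_suspicious_synonym_filters(xml_text):
    """Return (synonyms file, attribute, value) for every synonym filter
    whose boolean attributes are not written as "true" or "false"."""
    root = ET.fromstring(xml_text)
    findings = []
    for element in root.iter("filter"):
        if element.get("class") != "solr.SynonymFilterFactory":
            continue
        for name in BOOLEAN_ATTRIBUTES:
            value = element.get(name)
            if value is not None and value not in ("true", "false"):
                findings.append((element.get("synonyms"), name, value))
    return findings


# Hypothetical sample; only the <filter> line is taken from the real config.
sample = """
<fieldType name="text_de" class="solr.TextField">
  <analyzer type="query">
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_de.txt" expand="1" ignoreCase="1"/>
  </analyzer>
</fieldType>
"""

print(find_suspicious_synonym_filters(sample))
# [('synonyms_de.txt', 'expand', '1'), ('synonyms_de.txt', 'ignoreCase', '1')]
```

An empty list only tells you that the attributes are spelled as "true" or "false"; you should still confirm the actual behaviour in the Solr analysis tool.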

Excursus: defining the Solr configuration the right way

Here's a thing Markus mentioned to me that I didn't realize at first: although you could do it, you normally shouldn't manipulate the exported Solr configuration files directly. Instead there's an admin UI in the Drupal module for that. Advanced users can of course also edit the Drupal configuration files directly. This way, the information on the Drupal side is always in sync with what happens on the Solr side, and whenever you export the Solr config files, you get the correct config without risking the loss of any modifications. The only exception: if you are affected by the issue mentioned above and it isn't fixed yet, or at least no working patch is available, then you have to make these small modifications directly in the Solr configuration files. But in general, it's advised to adjust everything on the Drupal side first!