How familiar habits from classic Search API indexing can silently sabotage your RAG setup
In our previous post, The Hidden Cost of Drupal Migrations — and How to Avoid Search API Overload, we explored how Drupal’s Search API can quietly undermine AI-driven search if it’s not configured with care.
This time, we’re diving even deeper into that rabbit hole — into what happens after you’ve built your AI-powered chatbot and connected it to a vector database like Milvus for Retrieval-Augmented Generation (RAG).
Because here’s the catch: even when your embeddings are perfect and your pipeline looks flawless, a few innocent-looking Search API processors — the same ones that used to make your old SQL or Solr indexes better — can completely break your AI’s reasoning.
Old habits, new paradigms
Anyone who’s spent years configuring Search API indexes knows the drill.
A new index? Of course you enable a few trusty processors — Ignore case, Transliteration, maybe Tokenizer and Highlight. They’re almost muscle memory at this point.
And for traditional database backends, those processors really do help — they compensate for the lack of built-in intelligence by smoothing over differences in case, punctuation, and formatting to improve match quality.
But if you’ve mostly worked with Solr or other dedicated search engines, you probably know better. Solr already handles things like tokenization, case normalization, and stemming internally. Adding Drupal’s own processors on top usually doesn’t make things better — and can even distort the results.
That same principle applies, only more dramatically, when you move into AI-driven vector search.
Vector databases simply don’t work that way: they don’t “search” in the textual sense — they compare meaning. Each piece of text is converted into a high-dimensional vector, and similarity is calculated as a distance metric, not a lexical match. Casing, tokenization, and even small punctuation changes are already normalized by the embedding model itself.
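To make “distance, not lexical match” a bit more concrete, here is a purely illustrative cosine-similarity helper in PHP. In a real setup Milvus (or whichever vector database you use) does this comparison for you; the point is only that once text has become numbers, similarity is arithmetic on vectors, not string matching:

// Illustrative only: cosine similarity between two embedding vectors.
// In practice the vector database computes this, and the embedding model
// has already normalized away casing, punctuation and tokenization.
function cosine_similarity(array $a, array $b): float {
  $dot = $norm_a = $norm_b = 0.0;
  foreach ($a as $i => $value) {
    $dot    += $value * $b[$i];
    $norm_a += $value ** 2;
    $norm_b += $b[$i] ** 2;
  }
  return $dot / (sqrt($norm_a) * sqrt($norm_b));
}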
So when you leave those processors on, you’re not improving recall or accuracy. You’re altering the very data that the AI model uses to understand your content.
The “helps more than it hurts” mindset — until it hurts
Let’s look at two real-world examples from recent Drupal AI projects:
The "ignorecase" trap
Sounds harmless, right? But this processor lowercases everything — including file names. That caused image rendering to break completely when the source files used uppercase letters in their names. (Yes, uppercase in filenames isn’t elegant — but it’s still valid.)
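A minimal sketch of how that plays out (the file name below is made up): on a case-sensitive filesystem, which is what most Linux web servers run, the lowercased value simply points at a file that does not exist.

// Hypothetical example: the editor uploaded a file with uppercase letters.
$stored_uri  = 'public://media/Header-Image.PNG';

// What ends up in the index once the "Ignore case" processor has run.
$indexed_uri = mb_strtolower($stored_uri);

// On a case-sensitive filesystem these are two different paths, so any
// rendering built from the indexed value serves a broken image.
var_dump($stored_uri === $indexed_uri); // bool(false)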
The “highlight” disaster (that wasn’t actually about highlighting)
This one was even trickier. In Drupal AI v1.1.4, a subtle but important bug was fixed inside the RAG Tool.
Originally, the tool always assumed that results were chunked — yet it never actually enabled the corresponding Search API query option, search_api_ai_get_chunks_result. Because of that, $get_chunked was always false, and the result item IDs were built only from the base entity ID:
$id = $get_chunked ? $match['drupal_entity_id'] . ':' . $match['id'] : $match['drupal_entity_id'];
This meant that all chunks belonging to the same entity overwrote each other. No matter how many chunks were found, you only ever got one result per entity.
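You can see the collapse with a few made-up matches for the same node: with $get_chunked stuck at false, every chunk lands on the same array key.

// Illustrative only: three chunks of one node coming back from the vector store.
$matches = [
  ['drupal_entity_id' => 'entity:node/11:de', 'id' => 'chunk-1'],
  ['drupal_entity_id' => 'entity:node/11:de', 'id' => 'chunk-2'],
  ['drupal_entity_id' => 'entity:node/11:de', 'id' => 'chunk-3'],
];

$get_chunked = FALSE;
$results = [];
foreach ($matches as $match) {
  $id = $get_chunked ? $match['drupal_entity_id'] . ':' . $match['id'] : $match['drupal_entity_id'];
  $results[$id] = $match; // Same key every time, so only the last chunk survives.
}

count($results); // 1, no matter how many chunks matched.

Conceptually, the missing half of the fix is simply to request chunked results (the search_api_ai_get_chunks_result option mentioned above) so the chunk-aware branch of that ternary is ever taken.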
The 1.1.4 release finally fixed that — but fixing one thing exposed another.
Once chunk-aware IDs started appearing (like entity:node/11:de:460601774751947103), the highlight processor turned out to be incompatible. It calls \Drupal\search_api\Item\Item::getOriginalObject(), which in turn invokes $this->index->loadItem($this->itemId).
And here’s the kicker: in Drupal’s Search API, the database table that tracks indexed items only stores un-chunked item IDs.
So when loadItem() received one of those new long IDs with a chunk suffix, it couldn’t find a match. The result? RAG retrieval broke, and logs filled with warnings about invalid or unresolvable item IDs.
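Simplified, the mismatch looks like this (all IDs except the one quoted above are made up):

// What the RAG Tool now passes around as an item ID…
$chunked_id  = 'entity:node/11:de:460601774751947103';

// …versus the un-chunked IDs the Search API tracking table knows about.
$tracked_ids = ['entity:node/11:de', 'entity:node/12:de'];

// getOriginalObject() effectively asks the index to loadItem($chunked_id),
// and that lookup only understands the un-chunked form, so it finds nothing.
in_array($chunked_id, $tracked_ids, TRUE); // FALSE, hence the warnings about unresolvable item IDs.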
In short: the 1.1.4 update fixed chunk handling — but exposed how fragile old processors like highlight can be when they still expect the pre-chunked world.
These kinds of failures are maddening — because they look like mysterious “AI bugs,” when in fact they’re classic Search API configuration ghosts haunting your shiny new RAG setup.
The takeaway: unlearn what you’ve learned
When working with vector search and RAG pipelines, less is more.
Your goal is to preserve the raw, meaningful text exactly as it appears in your content — not to “optimize” it through legacy text-processing filters.
If you must preprocess, keep it purely structural (like removing HTML tags or excessive whitespace). Any change that alters meaning, casing, or token boundaries risks throwing your embeddings off — and once that happens, no amount of prompt engineering will save you.
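If you do normalize something before embedding, a small structural pass like the sketch below is usually all you need: strip markup, collapse whitespace, and leave everything the author actually wrote untouched.

// Sketch of a "structural only" cleanup step before embedding.
function prepare_text_for_embedding(string $html): string {
  $text = strip_tags($html);                                  // Drop tags, keep the visible text.
  $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8'); // &amp; becomes &, etc.
  $text = preg_replace('/\s+/u', ' ', $text);                 // Collapse runs of whitespace.
  return trim($text);                                         // No lowercasing, no stemming, no tokenizing.
}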
Lessons learned
1. Start clean.
Before enabling any Search API processors, ask: “Does this transformation change meaning or structure?” If yes, keep it off.
2. Trust the embeddings.
Modern models already handle casing, punctuation, and tokenization internally. Don’t duplicate their work — it only increases the risk of distortion.
3. Beware of hidden markup.
Anything that injects or modifies HTML (like the highlight processor) can break RAG chunking or retrieval.
4. Test incrementally.
After each configuration change, rebuild your embeddings and re-test retrieval quality (a minimal re-index sketch follows this list). You’ll be surprised how fragile the pipeline can be once AI gets involved.
5. Embrace minimalism.
A smaller, cleaner preprocessing chain usually means a more stable, more semantically consistent AI search.
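For point 4, the rebuild itself can be as small as the sketch below, assuming a hypothetical index machine name of ai_content; on large sites you would let cron or Drush work through the queue instead of indexing everything in one request.

// Sketch: wipe and re-index so the vector store reflects the new configuration.
$index = \Drupal\search_api\Entity\Index::load('ai_content'); // Hypothetical index ID.
$index->clear();       // Drop indexed data and mark all items for re-indexing.
$index->indexItems();  // Re-run the pipeline (and the embedding calls) for the tracked items.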
Your Drupal AI chatbot doesn’t need old-school Search API “help.” It just needs you to stop helping it the old way.