Designing a Semantic Search Engine

A search and retrieval pipeline for thousands of records

Introduction

I was tasked with building a semantic search engine during my previous summer internship.

My company was developing agentic AI workflows for geospatial tasks (think of it like Cursor AI for data analysis). For example, a typical user's query would look something like this:

Analyse the vegetation cover in Singapore from 2017–2020

From that query, we needed to select the most appropriate dataset and generate Python code—packaged as a Jupyter notebook—to analyse the data and produce visualisations that could be exported or overlaid on a map.

Recommending the right dataset for a user’s geospatial task may sound simple. In practice, however, you’re juggling thousands of datasets from various data providers (Google Earth Engine, AWS Open Data, Planetary Computer, Human Data Exchange, etc.), each with wildly inconsistent metadata. Some declare precise spatial and temporal extents; others offer only a terse description or a handful of keywords. Our end goal was a system with as close to zero false positives as possible, while keeping end-to-end latency within 5 seconds.

In this blog post, I'll walk through the architecture of the search engine I developed: a four-stage, microservice-based pipeline that incrementally filters, retrieves, re-ranks and finally reasons over ~4,000 datasets.

1. Spatial & Temporal Pruning

First, we immediately discard any dataset whose bounding box (the spatial extent covered by the dataset) or date range does not overlap the user’s query. For the example query above, we would remove datasets that only cover the USA, or datasets that have not been updated since 2015.

In our 4,000-dataset catalogue, this removes roughly 20-30% of candidates on average, ensuring downstream models never process spatially or temporally irrelevant records.
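As a rough sketch, the pruning step boils down to two interval-overlap checks. The record schema below (a `bbox` of (min_lon, min_lat, max_lon, max_lat) and a `years` range) is an assumption for illustration, not our exact metadata format:

```python
def bbox_overlaps(a, b):
    """True if two (min_lon, min_lat, max_lon, max_lat) boxes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def years_overlap(a, b):
    """True if two (start_year, end_year) ranges intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

def prune(catalogue, query_bbox, query_years):
    """Drop datasets whose spatial or temporal extent misses the query entirely."""
    return [
        d for d in catalogue
        if bbox_overlaps(d["bbox"], query_bbox) and years_overlap(d["years"], query_years)
    ]

# Example: the Singapore vegetation query from the introduction
catalogue = [
    {"id": "usa_landcover", "bbox": (-125.0, 24.0, -66.9, 49.4), "years": (2001, 2021)},
    {"id": "sentinel2_ndvi", "bbox": (-180.0, -90.0, 180.0, 90.0), "years": (2015, 2024)},
]
candidates = prune(catalogue, query_bbox=(103.6, 1.2, 104.1, 1.5), query_years=(2017, 2020))
print([d["id"] for d in candidates])  # only the globally-covered dataset survives
```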

2. Field-Level Bi-Encoder Retrieval

Design & Trade-Offs

We wanted broad semantic coverage while keeping vector storage costs reasonable. After evaluating other options such as monolithic text blocks and spatial tiling, we settled on field-level semantic chunks: title, a description detailing the dataset's use cases, method details, quality caveats and a keyword list. Each chunk yields one vector, so each dataset contributes five embeddings to ChromaDB’s HNSW index (for background on HNSW approximate nearest-neighbour search, see Malkov & Yashunin, 2018). A sketch of the indexing step follows the pros and cons below.

Pros

  • Fine-grained control: manually up-weight the description field over the title
  • Clear interpretability: matches can be traced to specific fields

Cons

  • 5x more vectors, modestly higher storage and lookup cost
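
Here is a minimal sketch of the field-level indexing, assuming a simple dict-shaped metadata record; in production we would pass embeddings from our fine-tuned bi-encoder via ChromaDB's `embeddings=` argument rather than relying on its default embedding function, and the example field values are illustrative:

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("dataset_fields", metadata={"hnsw:space": "cosine"})

FIELDS = ["title", "description", "method_details", "quality_caveats", "keywords"]

def index_dataset(record: dict):
    """Add one entry per metadata field, so each dataset contributes five vectors."""
    for field in FIELDS:
        collection.add(
            ids=[f"{record['id']}::{field}"],
            documents=[record[field]],
            metadatas=[{"dataset_id": record["id"], "field": field}],
        )

index_dataset({
    "id": "COPERNICUS/S5P/OFFL/L3_CH4",
    "title": "Sentinel-5P offline methane (CH4) concentrations",
    "description": "Column-averaged methane mixing ratio, useful for emissions monitoring.",
    "method_details": "Derived from TROPOMI observations.",
    "quality_caveats": "Reduced coverage under cloudy conditions.",
    "keywords": "methane, CH4, air quality, Sentinel-5P, TROPOMI",
})
```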

As a future improvement, we could replace manual field weightings with a small neural aggregator—an MLP that takes the five field embeddings as input, computes dynamic attention-style weights via softmax, and is trained end-to-end with backpropagation to optimise our retrieval metrics.
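For illustration, a minimal version of that aggregator could look like the following PyTorch module. This is purely a sketch of the idea, not something we shipped; the layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class FieldAggregator(nn.Module):
    """Scores each field embedding with a tiny MLP, turns the scores into softmax
    weights, and returns the weighted sum as the dataset-level embedding."""

    def __init__(self, dim: int = 1536):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, field_embs: torch.Tensor) -> torch.Tensor:
        # field_embs: (batch, num_fields, dim)
        scores = self.scorer(field_embs).squeeze(-1)             # (batch, num_fields)
        weights = torch.softmax(scores, dim=-1)                  # attention-style weights
        return (weights.unsqueeze(-1) * field_embs).sum(dim=1)   # (batch, dim)

agg = FieldAggregator()
pooled = agg(torch.randn(8, 5, 1536))  # eight datasets, five field embeddings each
```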

Performance Evaluation

To ensure our bi-encoder stage strikes the right balance between recall and precision, we built a small but representative benchmark of 100 real-world geospatial tasks. Each entry in the suite pairs a user query with its ground-truth dataset IDs. For example:

```json
{
  "query": "Find the administrative boundaries of Chennai, India.",
  "description": "Administrative boundary analysis",
  "relevantDatasetIds": [
    "openstreetmap_nominatim_api",
    "geoboundaries_api_ADM2"
  ]
}
```

We used the recall@100 metric to evaluate how reliably the bi-encoder stage surfaces the ground-truth datasets within its top 100 results.

Why Recall@K?

At the filtering stage, our primary concern is coverage: we want to keep any dataset that could be relevant, rather than risk cutting it off too early. Recall@K measures exactly this:

\mathrm{Recall@K} \;=\; \frac{\bigl|\{\text{relevantDatasetIds in top }K\}\bigr|}{\bigl|\{\text{total relevantDatasetIds}\}\bigr|}

By averaging Recall@K across all 100 tasks, we can ask:

  • Is K too small? A significantly lower Recall@50 would suggest our filter is too aggressive, dropping perfectly valid datasets before they ever reach the reranker.

  • Is K too large? A high Recall@300 with only negligible gain over Recall@100 implies wasted computation on excessive candidates.
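
To make the metric concrete, here is a minimal Recall@K computation over the benchmark. It assumes each task is a dict shaped like the JSON example above, and that `retrieve(query, k)` returns the top-K dataset IDs from the bi-encoder stage (both names are placeholders):

```python
def recall_at_k(tasks, retrieve, k: int = 100) -> float:
    """Average per-query recall@K over the benchmark."""
    scores = []
    for task in tasks:
        relevant = set(task["relevantDatasetIds"])
        retrieved = set(retrieve(task["query"], k))
        scores.append(len(relevant & retrieved) / len(relevant))
    return sum(scores) / len(scores)

# Compare cut-offs when choosing K, e.g.:
# for k in (50, 100, 300):
#     print(k, recall_at_k(benchmark, retrieve, k))
```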

Fine-Tuning for Geospatial Domain

Base open-source encoders are rarely trained on geospatial-specific data and often lack niche terms like “TROPOMI” or “quadtree.” Hence, we decided to do supervised fine-tuning.

Data curation

  1. Positive pairs: 1,000 user-query ↔ ground-truth dataset titles or descriptions (e.g. “methane concentrations Melbourne” ↔ “COPERNICUS/S5P/OFFL/L3_CH4”).
  2. Hard negatives: For each query, 3–5 semantically similar but incorrect datasets (e.g. “L3_AER_LH”, “L3_CO”), mined via preliminary bi-encoder retrieval and manual spot-checking.
  3. Formatting: Wrap each example as InputExample(texts=[query, metadata], label=1.0) for positives, and label 0.0 for negatives.
```json
{
  "query": "methane concentrations around Melbourne",
  "positive": "COPERNICUS/S5P/OFFL/L3_CH4",      // correct: CH4 is methane
  "negatives": [
    "COPERNICUS/S5P/OFFL/L3_AER_LH",             // wrong: aerosol layer height
    "COPERNICUS/S5P/OFFL/L3_CO"                  // wrong: carbon monoxide
  ]
}
```

Model Architecture

Consider this user query and dataset description:

Query: “Analyse vegetation cover in Singapore from 2017 to 2020”
Dataset metadata: “COPERNICUS/S5P/OFFL/L3_CH4 (2018–2020, 0.1° resolution, global)”

We want one vector that captures both the fine distinctions (place, date, resolution) and the overall theme (vegetation analysis, methane concentration). Here is how the two pooling steps and their concatenation work in practice (a short code sketch follows this list):

  1. Attention Pooling
    • Tokens extracted from the query:
      ["Analyse", "vegetation", "cover", "in", "Singapore", "from", "2017", "to", "2020"]
    • The model learns to assign higher weights to:
      • “vegetation” (because it indicates the analysis type)
      • “Singapore” (the spatial focus)
      • “2017 to 2020” (the temporal window)
    • Less important words like “Analyse”, “in” or “to” receive lower weights
    • The weighted sum of all token embeddings yields a single 1,536-dimensional vector that emphasises the critical terms
  2. Mean Pooling
    • The same nine token embeddings are averaged equally.
    • This vector still knows it is about “vegetation cover in Singapore over a span of years” but without over-emphasising any single word.
  3. Concatenation (combine detail and gist)
    • We place the attention-pooled vector and the mean-pooled vector side by side, creating a 3,072-dimensional embedding
    • This combined vector can tell apart, for example,
      • A dataset tagged “2020 only, 1° resolution” (because attention pooling spotlights the dates and resolution)
      • Versus another tagged “2017–2020, 0.1° resolution” (because mean pooling retains the multi-year context)
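
Here is an illustrative sketch of the two pooling heads and their concatenation, operating on per-token embeddings from any transformer encoder. The module is an assumption for exposition (not our exact production head); the dimensions simply mirror the 1,536/3,072 figures above:

```python
import torch
import torch.nn as nn

class DualPooling(nn.Module):
    def __init__(self, dim: int = 1536):
        super().__init__()
        self.attn_scorer = nn.Linear(dim, 1)  # learned per-token importance score

    def forward(self, token_embs: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # token_embs: (batch, seq_len, dim); mask: (batch, seq_len), 1 for real tokens
        scores = self.attn_scorer(token_embs).squeeze(-1)
        scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)                     # attention pooling
        attn_vec = (weights.unsqueeze(-1) * token_embs).sum(dim=1)  # (batch, dim)
        mean_vec = (token_embs * mask.unsqueeze(-1)).sum(dim=1) / mask.sum(dim=1, keepdim=True)
        return torch.cat([attn_vec, mean_vec], dim=-1)              # (batch, 2 * dim) = 3,072-d

pooler = DualPooling()
out = pooler(torch.randn(2, 9, 1536), torch.ones(2, 9))  # nine query tokens -> 3,072-d embedding
```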

Loss Function & Training

To teach our model to pull correct datasets closer and push incorrect ones away, we use Batch-All Triplet Loss with margin 0.2.

Triplet Loss—What & Why

  • Anchor (A): the query embedding, e.g.
    “Analyse vegetation cover in Singapore from 2017 to 2020.”
  • Positive (P): the correct dataset embedding, e.g.
    “Sentinel-2 NDVI (2017–2020, vegetation index, 10 m resolution, Singapore).”
  • Negative (N): a misleading dataset embedding, e.g.
    “Sentinel-2 NDWI (2017–2020, water index, 10 m resolution, Singapore)”

Batch-All Triplet Loss considers all valid (A, P, N) triplets in a batch and encourages distance(A, P) + margin < distance(A, N) with margin = 0.2. In cosine-similarity terms, this pushes the model to score the correct dataset at least 0.2 higher than even the hardest negative.

Example: If cos_sim(A,P) = 0.85, and the hardest N has cos_sim(A,N) = 0.75, then 0.85 – 0.75 = 0.10 < 0.20 (margin), so the loss nudges the model to pull A and P closer or push N further away until that gap is at least 0.2.
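That worked example in code: with cosine similarities, each batch-all triplet term is max(0, margin − (sim(A, P) − sim(A, N))).

```python
margin = 0.2
sim_ap, sim_an = 0.85, 0.75
loss = max(0.0, margin - (sim_ap - sim_an))
print(loss)  # 0.10 -> still positive, so training keeps widening the gap towards 0.2
```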

Batch Construction

  • Batch size: 32 queries
  • Each batch contains 32 queries, each paired with one positive and multiple negatives; Batch-All forms every valid (A, P, N) combination within those 32.
  • This amplifies the learning signal, since a single positive example generates multiple triplet comparisons per batch.

Hard-Negative Mining

  1. Round 1 training: used the initial pool of 3–5 negatives per query.
  2. Retrieval pass: encoded 100 test queries, retrieved the top 10 candidates for each, and identified the false positives the model still ranked highly (e.g. NDWI for a vegetation query); a sketch of this pass follows the list.
  3. Augment negatives: appended these false positives to the training set as negatives.
  4. Round 2 training: repeated the fine-tuning with this stronger negative set.
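
A sketch of the retrieval pass in step 2, assuming `model` is the fine-tuned bi-encoder (a SentenceTransformer) and that `metadata_texts`, `dataset_ids` and `ground_truth` (a list of sets of relevant IDs per test query) already exist:

```python
from sentence_transformers import util

corpus_embs = model.encode(metadata_texts, convert_to_tensor=True)

def mine_false_positives(test_queries, ground_truth, top_k=10):
    """Return (query, dataset_id) pairs the model still ranks highly but shouldn't."""
    query_embs = model.encode(test_queries, convert_to_tensor=True)
    hits = util.semantic_search(query_embs, corpus_embs, top_k=top_k)
    mined = []
    for query, results, relevant in zip(test_queries, hits, ground_truth):
        for hit in results:
            candidate = dataset_ids[hit["corpus_id"]]
            if candidate not in relevant:  # confidently wrong -> new hard negative
                mined.append((query, candidate))
    return mined
```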

This fine-tuning process delivered a 30% boost in recall@100.

3. Cross-Encoder Reranking

After the bi-encoder narrows the corpus down to ~100 candidates, we use a cross-encoder to rerank them into a high-precision top 10 by jointly encoding the query and the metadata (see Nogueira & Cho, 2019). Though we initially considered reinforcement learning to fine-tune the model, in practice we achieved most of the gains with a far simpler pipeline.

Bi-encoder vs Cross-encoder models

In a bi-encoder model,

  • The query goes through the transformer, and its tokens attend to each other
  • The metadata of the dataset goes through the transformer, and its tokens attend to each other.
  • After that, each side is collapsed into a single vector, and a similarity function (like cosine similarity) is used to estimate the relevance of the two vectors.

In a cross-encoder model,

  • The query and dataset metadata are concatenated into one sequence.
  • The transformer's self-attention layers span across the entire joint sequence, so query tokens can attend to dataset tokens and vice versa.
  • This means the model can detect more nuanced correspondences between the query and metadata.

As a result, a cross-encoder is usually more accurate at scoring relevance, because it directly models interactions between query and dataset tokens.
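
Side by side, the two scoring styles look like this. The checkpoints here are off-the-shelf models used purely for illustration (our production models are the fine-tuned versions described above):

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "Analyse vegetation cover in Singapore from 2017 to 2020"
metadata = "Sentinel-2 NDVI (2017-2020, vegetation index, 10 m resolution, Singapore)"

# Bi-encoder: two independent encodings, then cosine similarity
bi = SentenceTransformer("all-MiniLM-L6-v2")
score_bi = util.cos_sim(bi.encode(query), bi.encode(metadata))

# Cross-encoder: one joint pass over the concatenated (query, metadata) pair
cross = CrossEncoder("BAAI/bge-reranker-base")
score_cross = cross.predict([(query, metadata)])
```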

The training pipeline

  1. Candidates per query — we use the bi-encoder from the previous step to retrieve ~100 candidate datasets for each query.
  2. LLM as a Judge → pairwise preferences — for each query, we sampled pairs (A, B) from the candidate pool and prompted the LLM: "Which is more relevant to the query?"
  3. Train with pairwise logistic loss — we optimise σ(s(q, A) − s(q, B)), where s(q, A) is the score the cross-encoder assigns to the more relevant dataset and s(q, B) is the score it assigns to the less relevant candidate.
  4. Hard-negative mining — after one round of training, we find the most over-scored false positives and add them back into training as fresh (query, positive vs hard negative) pairs, then retrain so the model learns to separate these confusions.

A little code sketch

```python
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# Start from a strong open-source reranker checkpoint
model = CrossEncoder("BAAI/bge-reranker-base", num_labels=1)

def to_examples(preference_triples):
    """Flatten (query, positive, negative) preferences into scored (query, doc) pairs.
    CrossEncoder trains on pairs; its default binary cross-entropy objective is a
    pointwise stand-in for the pairwise logistic loss described above."""
    examples = []
    for q, p, n in preference_triples:
        examples.append(InputExample(texts=[q, p], label=1.0))
        examples.append(InputExample(texts=[q, n], label=0.0))
    return examples

# llm_pairs holds the LLM-judged (query, positive, negative) preferences
model.fit(DataLoader(to_examples(llm_pairs), batch_size=16, shuffle=True), epochs=2)

# Hard-negative mining: rescore top-K candidates, add over-scored false positives
# as fresh (query, positive, hard-negative) preferences, then fine-tune again
hard = mine_hard_negatives(model, queries, cand_pool)  # project helper, not shown here
model.fit(DataLoader(to_examples(hard), batch_size=16, shuffle=True), epochs=1)
```

4. LLM Reasoning Layer

Despite our best efforts with the cross-encoder reranker, precision@5 plateaued at around 70-75%, lagging especially on certain challenging queries. Often the model would surface a dataset that might be 80–90% relevant, yet miss a better option hidden deeper in the shortlist (for example, a higher-resolution dataset might not be picked as the best option). To close this gap, we inserted a final reasoning step using a state-of-the-art LLM.

Why add LLM + CoT?

  • Cross-encoder limitations: even a fine-tuned reranker can overlook subtle factors like spatial/resolution detail or provenance notes.

  • Human-like reasoning: by prompting the LLM to “think out loud” as a geospatial analyst would, we get a systematic approach to reason through very similar datasets and capture the user's nuance-driven distinctions (an illustrative prompt skeleton follows below).
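
The prompt skeleton below is illustrative only; the exact wording, the candidate structure and the `build_prompt` helper are placeholders rather than our production prompt:

```python
REASONING_PROMPT = """You are a geospatial data analyst.

User task: {query}

Candidate datasets (top 10 from the reranker):
{candidates}

Think step by step about spatial coverage, temporal extent, resolution and
provenance, then answer with the single best dataset ID and a short justification."""

def build_prompt(query: str, candidates: list[dict]) -> str:
    """Render the shortlist into the prompt; each candidate has an 'id' and a 'summary'."""
    lines = [f"- {c['id']}: {c['summary']}" for c in candidates]
    return REASONING_PROMPT.format(query=query, candidates="\n".join(lines))
```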

This step adds an extra 1-2 s of latency and USD 0.02–0.05 per query, but lifts precision@5 consistently above 80% and precision@1 to around 95% on average.

5. Regression Testing & Monitoring

  • Metric SLAs
    • Recall@150 ≥ 90% for the bi-encoder
    • Precision@10 & MRR ≥ 80% for the reranker

  • Drift Detection: nightly one-sided Kolmogorov–Smirnov test on per-query precision@10 (scipy.stats.ks_2samp)

  • Latency SLA: P99 for the bi-encoder + reranker + LLM step < 5 s, enforced in CI

First, we measure P99 latency for the combined bi-encoder and cross-encoder stages. Our target is sub-5000 ms end-to-end, and in practice we hover around 2000-3000 ms.

Our evaluation suite is also built into our CI tests, to ensure any change in the codebase doesn't negatively impact the performance of the engine.

Finally, we test for "output drift" by running a nightly one-sided Kolmogorov–Smirnov test on per-query precision@10 distributions (see the sketch below). This helps detect subtle shifts in behaviour, say, a gradual drop in recall or precision for certain types of geospatial tasks.
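
A sketch of that drift check: compare today's per-query precision@10 values against a frozen baseline with a one-sided two-sample KS test. The threshold and the alerting hook are assumptions:

```python
from scipy.stats import ks_2samp

def check_drift(baseline: list[float], today: list[float], alpha: float = 0.05) -> bool:
    # alternative="less" flags the case where today's precision@10 values sit
    # systematically below the baseline distribution
    stat, p_value = ks_2samp(baseline, today, alternative="less")
    if p_value < alpha:
        print(f"Drift detected: KS statistic={stat:.3f}, p={p_value:.4f}")  # placeholder alert
        return True
    return False
```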

6. Future Improvements

GraphRAG to improve encoders

While traditional RAG pipelines retrieve documents (or datasets) in isolation, GraphRAG augments retrieval with an explicit knowledge graph that encodes relationships between entities—in our case, between geospatial tasks and datasets.

  • Modelling complementary datasets — Many analyses require multiple layers of data: for example, mapping urban air quality often combines various data sources, like aerosol height, cloud fraction and surface albedo. In a GraphRAG setting, edges can be created between such datasets to model these relationships and signal to the retrieval stack that they are often used in tandem for analysis.

  • Capturing Hierarchical & Thematic Links — Administrative‐boundary datasets come in multiple levels (ADM0 — country, ADM1 — state, ADM2 — city), denoting the various levels of boundaries. A task querying "boundaries of Jakarta districts” should traverse the subdivision hierarchy to recommend both district boundaries and parent city boundaries if needed. Theme‐based edges (e.g. pollution → aerosol, pollution → CO) let the system surface related data even when the query keyword doesn’t match exactly.

This could meaningfully boost the engine's performance, as many dataset interdependencies are not captured by semantic similarity alone. However, capturing and modelling relationships between datasets (and choosing which relationships to model) is a challenging and complex task in its own right!

An article I found cool regarding this topic

Reinforcement Feedback Loop

By capturing explicit end-user signals—such as clicks, ratings or thumbs-up/down—and treating them as reward feedback in a reinforcement-learning loop, we can iteratively update our retrievers so they learn from real interactions and continuously improve both precision and recall.

Using Apache Airflow to handle dataset metadata

Currently, we have scrapers running every few days to capture dataset metadata from the various data providers and bundle it together. In the future, Apache Airflow could be used to orchestrate this pipeline and make it production-ready; a minimal sketch of what that could look like follows.
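This is a hypothetical DAG layout rather than anything we run today; the task functions and provider names are placeholders for our existing scraper logic:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def scrape_provider(provider: str, **context):
    """Placeholder: fetch and normalise metadata from one data provider."""
    print(f"Scraping metadata from {provider}...")

def merge_catalogue(**context):
    """Placeholder: merge per-provider metadata into the unified catalogue."""
    print("Merging provider metadata into the catalogue...")

with DAG(
    dag_id="dataset_metadata_refresh",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",   # scheduled refresh instead of ad-hoc scrapes
    catchup=False,
) as dag:
    scrape_tasks = [
        PythonOperator(
            task_id=f"scrape_{provider}",
            python_callable=scrape_provider,
            op_kwargs={"provider": provider},
        )
        for provider in ["gee", "aws_open_data", "planetary_computer"]
    ]
    merge = PythonOperator(task_id="merge_catalogue", python_callable=merge_catalogue)
    scrape_tasks >> merge  # all scrapers must finish before the merge runs
```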

All in all, building this search engine has been an incredibly insightful experience for me, and I have deepened my understanding of retrieval architectures, fine-tuning and scalable AI workflows.

I hope you've found this writeup useful— if you're facing a similar problem or wanna know more, shoot me a message!