INTRODUCTION
During a recent engagement with a large-scale B2B industrial supply platform, we encountered a critical limitation in modern vector search. The client had millions of SKUs ranging from heavy machinery to precision tooling. We were tasked with modernizing their search infrastructure to move beyond rigid keyword matching and introduce semantic understanding.
The initial deployment of the vector search engine worked beautifully for broad queries. If a user searched for “backup power,” the system correctly retrieved diesel generators. However, the B2B reality is rarely that vague. Professional buyers search for specifications. When users searched for “12 kva diesel generator,” the system often returned “15 kva” or “10 kva” units mixed in with the correct results. To the embedding model, these products were semantically identical—they were all generators. To the buyer, the wrong voltage or power rating made the product useless.
We realized that standard embedding approaches were “washing out” the structured attribute data (key-value pairs) that defined the product’s utility. The model was prioritizing the general concept over the specific technical constraints. This challenge inspired us to rigorously test how structured attributes should be preprocessed and embedded to maximize precision in a reranking architecture.
PROBLEM CONTEXT
The system in question was a high-volume B2B ecommerce platform. The search architecture followed a standard two-stage pattern: a retrieval stage (candidates) and a reranking stage (precision).
The Data Structure:
Products consisted of a generic title (e.g., “Industrial Diesel Generator”) and a rich set of structured attributes stored as key-value pairs. For example:
power_rating: 12 kva
fuel_type: diesel
phase: 3
cooling_type: air cooled
The Query Pattern:
Analysis of search logs revealed that 70% of high-intent queries were short (4–5 tokens) and specification-heavy, often containing alphanumeric codes, specific units, and integers (e.g., “cnc milling machine 3 axis”).
The core issue surfaced during the reranking phase. We were using product attribute embeddings to reorder the candidate list. However, because we hadn’t optimized how these key-value pairs were fed into the model, the vector space didn’t adequately separate a “12 kva” generator from a “20 kva” generator. The semantic distance was too small.
WHAT WENT WRONG
Our initial investigation involved analyzing the vector space visualization. We noticed distinct clustering issues caused by three primary factors:
1. The “Bag of Words” Effect in Flat Concatenation
Initially, we simply concatenated values: 12 kva diesel 3 air cooled. Without the context of keys (like “power rating”), the model treated “12” as an arbitrary number. In some cases, it associated “3” with the number of units in a pack rather than the electrical phase.
2. Tokenization of Alphanumerics
Standard tokenizers often split technical terms aggressively. A search for a “12kva” generator might result in tokens like [12, k, ##va], while the attribute might be stored as 12 kva. This mismatch reduced the similarity score during reranking.
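One pragmatic mitigation (a sketch of the idea, not our exact production code) is to normalize alphanumeric spacing on both the query and the attribute side before embedding, so that "12kva" and "12 kva" collapse to a single surface form:

```python
import re

def normalize_alphanumerics(text: str) -> str:
    """Insert a space at digit/letter boundaries so '12kva' -> '12 kva'."""
    text = text.lower().strip()
    # Split "12kva" -> "12 kva" and "kva12" -> "kva 12"
    text = re.sub(r"(\d)([a-z])", r"\1 \2", text)
    text = re.sub(r"([a-z])(\d)", r"\1 \2", text)
    # Collapse any repeated whitespace introduced above
    return re.sub(r"\s+", " ", text)
```

Applying the same transform to queries and attributes means the tokenizer sees identical input on both sides, which keeps the similarity score honest.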
3. Unit Ambiguity
The model struggled to equate “12000 watts” with “12 kva” purely through embeddings without explicit normalization. The semantic model treated them as different entities unless the training data had seen that specific equivalence frequently.
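A minimal sketch of the unit-normalization idea follows; the unit table here is illustrative (a real map would cover far more units and handle power-factor caveats between watts and volt-amperes):

```python
import re

# Illustrative unit table: unit -> (canonical unit, multiplier to canonical)
UNIT_MAP = {
    "watts": ("kw", 0.001),
    "w": ("kw", 0.001),
    "kw": ("kw", 1.0),
    "kva": ("kva", 1.0),
    "va": ("kva", 0.001),
}

def normalize_unit(value: str) -> str:
    """Rewrite '12000 watts' as '12 kw' so equivalent ratings embed identically."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([a-z]+)\s*", value.lower())
    if not match or match.group(2) not in UNIT_MAP:
        return value.lower().strip()
    number, unit = float(match.group(1)), match.group(2)
    canonical, factor = UNIT_MAP[unit]
    # Format with %g so we emit "12 kw", not "12.0 kw"
    return f"{number * factor:g} {canonical}"
```

Once every value is rewritten to a canonical base unit, the model never has to perform the conversion implicitly.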
We realized that to fix this, we didn’t just need a better model; we needed a better data representation strategy before the data ever touched the neural network.
HOW WE APPROACHED THE SOLUTION
To solve this, our engineering team designed an experiment to compare different attribute serialization strategies. We needed to transform the structured JSON data into a format that the embedding model could digest while preserving the strict relationship between keys and values.
We evaluated four primary methods:
- Flat Concatenation: Dumping values into a string.
- JSON-style Text: Keeping syntax like key: value.
- Natural Language Generation (NLG): Using an LLM to write a sentence (“This generator has 12 kva power…”).
- Semantic Templating: A hybrid approach using verbose keys.
The Findings:
The NLG approach produced high-quality embeddings but was too slow and expensive for indexing millions of products. Flat concatenation lost too much context. The “JSON-style” approach was inconsistent because many embedding models are trained on natural text, not JSON syntax.
We determined that Semantic Templating was the optimal middle ground. This involves expanding abbreviated database keys into natural language phrases and appending the value and unit. This “Pseudo-Natural Language” structure provided the strongest signal for reranking.
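To make the comparison concrete, here is a small sketch (illustrative, not our benchmark harness) of what the same record looks like under three of the serialization strategies:

```python
attributes = {
    "power_rating": "12 kva",
    "fuel_type": "diesel",
    "phase": "3",
}

# Flat concatenation: values only, keys discarded
flat = " ".join(attributes.values())

# JSON-style text: keys kept, but with database syntax
json_style = ", ".join(f"{k}: {v}" for k, v in attributes.items())

# Semantic templating: keys expanded into natural language phrases
key_map = {
    "power_rating": "power rating is",
    "fuel_type": "runs on",
    "phase": "electrical phase",
}
templated = ", ".join(f"{key_map[k]} {v}" for k, v in attributes.items())

print(flat)        # 12 kva diesel 3
print(json_style)  # power_rating: 12 kva, fuel_type: diesel, phase: 3
print(templated)   # power rating is 12 kva, runs on diesel, electrical phase 3
```

Only the templated form reads like the natural text most embedding models were pre-trained on, while still preserving the key-value relationship.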
Companies looking to hire AI developers for search optimization often overlook this step, assuming the model will magically understand the database schema. In reality, explicit serialization is key.
FINAL IMPLEMENTATION
We implemented a pipeline that pre-processes attributes into a “dense semantic string” before embedding. We moved away from special separators (like pipes | or newlines) because standard BERT-based models sometimes treat these as noise or sentence breaks, which can dilute the context window.
Here is the sanitized logic we deployed for the attribute processor:
def serialize_attributes(attributes):
    # Map technical DB keys to verbose natural language prefixes
    # This guides the attention mechanism of the model
    key_map = {
        "power_rating": "power rating is",
        "fuel_type": "runs on",
        "phase": "electrical phase",
        "cooling_type": "cooling system is",
        "application": "designed for",
    }
    parts = []
    for key, value in attributes.items():
        if key in key_map:
            # Normalize units here (e.g., lowercase, spacing)
            clean_value = str(value).strip().lower()
            prefix = key_map[key]
            # Result: "power rating is 12 kva"
            parts.append(f"{prefix} {clean_value}")
    # Join with a comma and space to simulate a natural list
    return ", ".join(parts)
Architecture Updates:
- Unit Normalization: Before serialization, we normalized all units (e.g., converting all power ratings to a standard base unit or ensuring consistent spacing like “12 kva” instead of “12kva”).
- Cross-Encoder for Reranking: While we used bi-encoders for the initial vector retrieval, we implemented a lightweight cross-encoder for the top 50 results. The cross-encoder received the query and the generated semantic string. This significantly boosted precision for numeric queries.
- Model Choice: We utilized a domain-tuned checkpoint of a sentence-transformer optimized for semantic similarity, which performed better on short technical queries than generic large language models.
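The reranking wiring above can be sketched as follows. The `score_fn` here is a toy token-overlap stand-in so the snippet runs standalone; in production it would wrap the cross-encoder's predict call (e.g., `lambda q, d: cross_encoder.predict([(q, d)])[0]`):

```python
from typing import Callable, List

def rerank(query: str, candidates: List[dict],
           score_fn: Callable[[str, str], float], top_k: int = 50) -> List[dict]:
    """Re-order the top candidates by scoring (query, semantic string) pairs."""
    head, tail = candidates[:top_k], candidates[top_k:]
    scored = sorted(head, key=lambda c: score_fn(query, c["semantic_text"]), reverse=True)
    return scored + tail

# Toy scorer: count shared tokens between query and semantic string
overlap = lambda q, d: len(set(q.split()) & set(d.split()))

docs = [
    {"sku": "GEN-15", "semantic_text": "power rating is 15 kva, runs on diesel"},
    {"sku": "GEN-12", "semantic_text": "power rating is 12 kva, runs on diesel"},
]
ranked = rerank("12 kva diesel generator", docs, overlap)
print([d["sku"] for d in ranked])  # ['GEN-12', 'GEN-15']
```

Because the cross-encoder sees the query and the full semantic string together, it can reward the exact "12 kva" match in a way a bi-encoder's pre-computed vectors cannot.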
This approach allowed us to maintain the speed of vector search while injecting the precision of structured data. If you are looking to hire Python developers for data engineering to build similar preprocessing pipelines, ensure they understand the nuances of tokenization in vector spaces.
LESSONS FOR ENGINEERING TEAMS
For teams building search for B2B or technical domains, here are the key takeaways from our experience:
- Verbose Keys Improve Context: Don’t embed pwr: 12. Embed power rating is 12. The extra tokens cost almost nothing in storage but provide massive value to the model’s attention mechanism.
- Avoid JSON Syntax in Embeddings: Brackets, colons, and quotes often add noise. Models trained on Wikipedia or Common Crawl understand natural language sentences better than JSON dumps.
- Standardize Units Pre-Embedding: Embeddings cannot easily perform math. “12000 watts” and “12 kw” are semantically distant unless the model is specifically trained on unit conversion. Normalize your data first.
- Hybrid Search is Mandatory: Even with perfect embeddings, vector search struggles with exact number matching (e.g., Model 500 vs Model 5000). Always combine vector scores with a keyword-based matching score (BM25) for the final rank.
- Structured Reranking: When you hire software developers to build rerankers, ensure they treat attributes as first-class citizens, not just metadata appended to the description.
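The hybrid-search point above can be sketched as a simple weighted blend; the weight and normalization here are illustrative and should be tuned on your own relevance data:

```python
def hybrid_score(vector_score: float, bm25_score: float,
                 max_bm25: float, alpha: float = 0.6) -> float:
    """Blend a cosine-similarity score (0..1) with a max-normalized BM25 score."""
    bm25_norm = bm25_score / max_bm25 if max_bm25 > 0 else 0.0
    return alpha * vector_score + (1 - alpha) * bm25_norm

# A "Model 500" doc with an exact keyword hit outranks a fuzzier "Model 5000" match,
# even though its raw vector score is slightly lower
exact = hybrid_score(vector_score=0.82, bm25_score=9.0, max_bm25=9.0)
fuzzy = hybrid_score(vector_score=0.85, bm25_score=3.0, max_bm25=9.0)
print(exact > fuzzy)  # True
```

The keyword component guarantees that exact alphanumeric matches carry weight the vector component alone would miss.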
WRAP UP
Generating effective embeddings for structured attributes requires moving beyond simple data dumping. By transforming key-value pairs into semantically meaningful phrases and ensuring rigorous unit normalization, we transformed a “fuzzy” search experience into a precise tool for B2B buyers. The key is to treat your data preprocessing pipeline as part of the model architecture itself.
For organizations needing specialized engineering teams to architect scalable search and AI solutions, contact us to discuss your requirements.
Frequently Asked Questions
Should we generate a separate embedding for each attribute, or one combined text block?
For reranking and standard dense retrieval, a single combined text block is far more efficient and usually sufficiently accurate. Generating one vector per attribute explodes storage costs and complicates query logic (e.g., which attribute vector matches the query vector?).
Should attribute keys be included in the embedded text?
Yes, but format them as natural language descriptors rather than database keys. Using "Fuel Type: Diesel" or "runs on Diesel" is much better than just "Diesel," which could be a brand name, a fuel, or a product line.
Should we use special separator tokens between attributes?
Generally, no. Natural language separators like commas or periods work best with pre-trained models (like BERT or RoBERTa). Artificial tokens like [SEP] or pipes | can sometimes confuse models that weren't pre-trained with those specific delimiters.
How should we handle exact numeric constraints?
Embeddings are notoriously bad at exact arithmetic. The best practice is a dual approach: use embeddings for semantic matching and use hard filters or BM25 (keyword search) for exact numeric constraints. Do not rely on vectors alone to distinguish between "12V" and "24V".