INTRODUCTION
While working on a conversational banking interface for a large FinTech partner in the Middle East, our engineering team encountered a subtle but critical data entry challenge. The system allowed users to type money transfer amounts via a chat-based UI. We assumed users would type digits, but in reality, they typed how they spoke.
Instead of entering “1500000”, a user might type “ye million o poonsad” (one million and five hundred), mix digits with words like “1 million o 200 hezar,” or use the colloquial “Toman” unit instead of the official “Rial.”
The discrepancy between spoken Persian financial terms and the integer values required for the backend database created a high failure rate in transaction intent detection. Simple regex extraction wasn’t enough; we needed a semantic parser capable of understanding magnitude, colloquialisms, and currency conversion on the fly. This article details how we solved this problem, ensuring accurate financial processing.
PROBLEM CONTEXT
In the Iranian banking system, the official currency is the Rial, but the population almost exclusively uses the Toman (1 Toman = 10 Rials) in daily speech and commerce. Furthermore, Persian numbers can be written in formal Farsi, informal/colloquial Farsi, or mixed with English/Persian digits.
Our goal was to convert free-form strings into a clean Integer (Rial) for the transaction engine. The input variations included:
- Colloquial Spelling: “Ye” instead of “Yek” (One), “Poonsad” instead of “Pansad” (Five hundred).
- Mixed Scripts: “120 hezar” (120 thousand).
- Unit Confusion: A user specifying “Toman” implies the stated value must be multiplied by 10 before being stored in Rials.
- Connectors: The use of “o” (and) between numbers, often attached to the preceding word without spaces (e.g., “Sad-o-panjah”).
Standard NLP libraries often handle formal text well but choke on the messy, inconsistent nature of chat-based financial inputs. We needed to build a robust normalization and parsing pipeline.
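To make the target behavior concrete, a few of these variations can be pinned down as test fixtures. The transliterations and amounts below are illustrative examples, not customer data:

```python
# Expected parser output for representative chat inputs.
# Values are in Rials; "toman" inputs are multiplied by 10.
EXPECTED_RIALS = {
    "1500000": 1_500_000,            # plain digits, already in Rials
    "120 hezar toman": 1_200_000,    # mixed digits/words, Toman unit (x10)
    "sad o panjah hezar": 150_000,   # "one hundred and fifty thousand" Rials
}

for text, rials in EXPECTED_RIALS.items():
    print(f"{text!r} -> {rials:,} Rials")
```

Fixtures like these became the acceptance criteria for the pipeline described below.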
WHAT WENT WRONG
Initially, the team attempted a rule-based regex approach. This failed quickly for several reasons:
- Magnitude Errors: Extracting digits from “5 million and 200 thousand” and concatenating them yields “5200,” which is wrong by three orders of magnitude. The system needed to perform arithmetic (5 * 1,000,000 + 200 * 1,000 = 5,200,000).
- The Toman Factor: When a user typed “100 Toman,” the system read it as 100 Rials. In reality, it should be 1,000 Rials. This 10x error in financial software is catastrophic.
- Ambiguous Spacing: Words like “صدوبیست” (one hundred and twenty) often came as a single token, breaking standard tokenizers that look for whitespace.
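The magnitude error in particular is easy to reproduce. A minimal sketch contrasting naive regex extraction with the required arithmetic (the helper names are ours, not the production API):

```python
import re

def naive_concat(text):
    # What the first regex-based attempt effectively did:
    # pull out digit runs and glue them together.
    return int("".join(re.findall(r"\d+", text)))

def arithmetic_parse(text):
    # What is actually required: scale each number by its
    # magnitude word and accumulate the segments.
    scales = {"million": 1_000_000, "thousand": 1_000}
    total, current = 0, 0
    for token in text.lower().split():
        if token.isdigit():
            current = int(token)
        elif token in scales:
            total += current * scales[token]
            current = 0
    return total + current

text = "5 million and 200 thousand"
print(naive_concat(text))      # 5200 (wrong)
print(arithmetic_parse(text))  # 5200000 (correct)
```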
We realized that hiring Python developers for scalable data systems meant finding engineers who understood that this wasn’t just a coding problem: it was a linguistic-mathematical problem.
HOW WE APPROACHED THE SOLUTION
We broke the solution down into three distinct phases:
1. Normalization & Tokenization
Before parsing, the text had to be cleaned. We created a dictionary of colloquialisms to map them to their formal counterparts. For example, “ye” becomes “yek,” and “poonsad” becomes “pansad.” We also needed to normalize Persian/Arabic digits to English integers and handle the “zero-width non-joiner” characters common in Persian typing.
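A condensed sketch of that normalization step (the colloquial map shown here is a small excerpt; the production dictionary is far larger):

```python
def normalize(text: str) -> str:
    # Map Persian digits to ASCII digits.
    text = text.translate(str.maketrans('۰۱۲۳۴۵۶۷۸۹', '0123456789'))
    # Replace zero-width non-joiners (U+200C), common in Persian typing.
    text = text.replace('\u200c', ' ')
    # Map colloquial spellings to formal ones (excerpt only).
    colloquial = {'ye': 'yek', 'poonsad': 'pansad', 'chel': 'chehel'}
    tokens = [colloquial.get(t, t) for t in text.lower().split()]
    return ' '.join(tokens)

print(normalize('ye poonsad'))  # -> 'yek pansad'
print(normalize('۱۲۰ hezar'))   # -> '120 hezar'
```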
2. Lexical Analysis
We categorized words into three buckets:
- Atoms: Basic numbers (1 to 99).
- Multipliers: Magnitude words (hundred, thousand, million, billion).
- Currency Markers: Rial, Toman.
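A token classifier for those three buckets might look like the following (the word lists are abbreviated excerpts):

```python
ATOMS = {'yek': 1, 'bist': 20, 'pansad': 500}          # number words (excerpt)
MULTIPLIERS = {'hezar': 1_000, 'million': 1_000_000}   # magnitude words (excerpt)
CURRENCY = {'rial', 'toman'}

def classify(token: str) -> str:
    if token.isdigit() or token in ATOMS:
        return 'atom'
    if token in MULTIPLIERS:
        return 'multiplier'
    if token in CURRENCY:
        return 'currency'
    return 'other'  # connectors like "o"/"va" fall through here

print([classify(t) for t in 'bist hezar toman'.split()])
# -> ['atom', 'multiplier', 'currency']
```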
3. The Stack-Based Arithmetic Parser
We implemented a logic flow where the parser iterates through tokens. If it finds a number, it holds it. If it finds a multiplier (like “thousand”), it multiplies the held number and adds it to a “current total.” This handles complex structures like “One hundred (100) and fifty (50) thousand (x1000).”
FINAL IMPLEMENTATION
Below is a simplified, sanitized version of the Python implementation we deployed. It handles normalization, word-to-number mapping, and the final currency calculation.
import re

class PersianCurrencyParser:
    def __init__(self):
        # 1. Word to Integer Mapping (Formal & Colloquial)
        self.atom_map = {
            'yek': 1, 'ye': 1, 'do': 2, 'se': 3, 'chahar': 4, 'panj': 5,
            'shesh': 6, 'haft': 7, 'hasht': 8, 'noh': 9, 'dah': 10,
            'yaazdah': 11, 'davazdah': 12, 'sizdah': 13, 'chahardah': 14,
            'panzdah': 15, 'poonzdah': 15, 'shanzdah': 16, 'hevdah': 17,
            'hejdah': 18, 'noozdah': 19, 'bist': 20, 'si': 30, 'chel': 40,
            'chehel': 40, 'panjah': 50, 'shast': 60, 'haftad': 70,
            'hashtad': 80, 'navad': 90, 'sad': 100, 'devist': 200,
            'sisad': 300, 'chaharsad': 400, 'charsad': 400, 'pansad': 500,
            'poonsad': 500, 'sheshsad': 600, 'haftsad': 700, 'hashtsad': 800,
            'nohsad': 900
        }
        # Note: In production, we use a more comprehensive regex map
        # to handle Persian unicode characters directly.
        self.multipliers = {
            'hezar': 1000,
            'million': 1000000,
            'milliard': 1000000000,
            'billion': 1000000000,
            'trillion': 1000000000000
        }

    def _normalize(self, text):
        # Replace Persian digits with English
        persian_digits = '۰۱۲۳۴۵۶۷۸۹'
        english_digits = '0123456789'
        translation = str.maketrans(persian_digits, english_digits)
        text = text.translate(translation)
        # Remove "and" connectors (va/o) attached to words.
        # In a real scenario, this requires sophisticated regex to avoid false positives.
        text = text.replace('و', ' ')
        return text.lower()

    def parse(self, text):
        clean_text = self._normalize(text)
        tokens = clean_text.split()
        total_value = 0
        current_segment = 0

        # Keyword detection for currency
        is_toman = any(v in clean_text for v in ('toman', 'tomen', 'tomun'))

        for token in tokens:
            # Handle raw digits inside text (e.g. "150 hezar")
            if token.isdigit():
                current_segment += int(token)
                continue
            # Handle text-based numbers
            val = self.atom_map.get(token)
            if val:
                current_segment += val
            # Handle multipliers
            elif token in self.multipliers:
                multiplier = self.multipliers[token]
                # If "hezar" appears alone without a preceding number, treat as 1000
                if current_segment == 0:
                    current_segment = 1
                total_value += current_segment * multiplier
                current_segment = 0  # Reset for next segment

        # Add any remaining remainder (e.g., the "500" in "1 million and 500")
        total_value += current_segment

        # Final Currency Conversion (1 Toman = 10 Rials)
        if is_toman:
            total_value *= 10

        return total_value

# Usage Example
parser = PersianCurrencyParser()
# Note: Input would be in Persian script in production
result = parser.parse("2 million and 500 hezar toman")
print(f"Result in Rials: {result}")  # Result in Rials: 25000000
Note: The code above is a logic demonstration. The production version includes robust regex patterns for Persian unicode characters and specific handling for “Rial” vs “Toman” when they appear multiple times in one sentence.
LESSONS FOR ENGINEERING TEAMS
This challenge reinforced several key principles for our backend teams:
- Localization is not Translation: You cannot simply translate “Five Hundred” to Persian. You must account for the colloquial “Poonsad” and the “Toman” currency shift.
- Sanitize Early: The biggest win was normalizing mixed integers (Persian/English) before the parsing logic began. This reduced cyclomatic complexity significantly.
- Test with Dirty Data: Our unit tests included the most broken, grammatically incorrect inputs we could find. This made the system resilient to real user behavior.
- Context Matters: If you hire AI developers for production deployment, ensure they understand the domain. An NLP model might extract entities, but a financial parser must enforce mathematical precision.
WRAP UP
Parsing informal currency inputs requires a blend of linguistic normalization and arithmetic logic. By treating the input as a mathematical expression rather than just a string of characters, we were able to deliver a 99.8% accuracy rate for our FinTech client. Whether you are building for the Middle East or any other region with complex linguistic numerals, the stack-based parsing approach remains the most robust solution.
Social Hashtags
#FinTechEngineering #NaturalLanguageProcessing #PythonForFinTech #ConversationalAI #TextParsing #FinancialSoftware #AIinBanking
If you need help building complex FinTech parsers or need to hire software development teams with deep localization experience, contact us to discuss your architecture.
Frequently Asked Questions
How do you normalize Persian/Arabic digits before parsing?
We use a translation table method in Python to map all Persian/Arabic unicode digit characters (e.g., ۱, ۲, ۳) to their ASCII equivalents (1, 2, 3) before any processing takes place.
What happens when a user omits the currency unit entirely?
This is a UX problem, not a code problem. However, we often implement heuristics: if the number is suspiciously small (e.g., "50"), we might prompt the user for clarification, as 50 Rials is essentially zero value, whereas 50,000 Tomans is a common transaction.
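One such heuristic, sketched with an arbitrary threshold (the real cutoff is a product decision, and the function name is ours):

```python
def needs_unit_clarification(amount: int, unit_given: bool,
                             min_plausible_rials: int = 10_000) -> bool:
    # If the user omitted the unit and the amount is implausibly
    # small for a Rial-denominated transfer, ask rather than guess.
    return not unit_given and amount < min_plausible_rials

print(needs_unit_clarification(50, unit_given=False))       # True: prompt user
print(needs_unit_clarification(500_000, unit_given=False))  # False
```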
Can the parser handle amounts in the billions or beyond?
Yes. By defining "Milliard" (common in Persian) and "Billion" in the multiplier dictionary, the stack-based logic scales automatically to handle any magnitude supported by the integer type.
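Because Python integers are arbitrary precision, extending the magnitude range really is just another dictionary entry; a quick sketch:

```python
# Arbitrary-precision integers mean magnitude support is just data.
multipliers = {
    'hezar': 10**3,
    'million': 10**6,
    'milliard': 10**9,   # common in Persian for 10^9
    'trillion': 10**12,
}

# "do trillion" (two trillion) stays exact; no overflow or float rounding.
print(2 * multipliers['trillion'])  # 2000000000000
```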
Why not use an existing Persian NLP library?
Libraries like Hazm are excellent for formal text tokenization, but they often lack the specific "Toman-to-Rial" currency logic and robust handling for highly colloquial, misspelled chat-style inputs required in a FinTech context.