
    INTRODUCTION

    While working on a conversational banking interface for a large FinTech partner in the Middle East, our engineering team encountered a subtle but critical data entry challenge. The system allowed users to type money transfer amounts via a chat-based UI. We assumed users would type digits, but in reality, they typed how they spoke.

    Instead of entering “1500000”, a user might type “ye million o poonsad” (one million and five hundred), mix digits with words like “1 million o 200 hezar,” or use the colloquial “Toman” unit instead of the official “Rial.”

    The discrepancy between spoken Persian financial terms and the integer values required for the backend database created a high failure rate in transaction intent detection. Simple regex extraction wasn’t enough; we needed a semantic parser capable of understanding magnitude, colloquialisms, and currency conversion on the fly. This article details how we solved this problem, ensuring accurate financial processing.

    PROBLEM CONTEXT

    In the Iranian banking system, the official currency is the Rial, but the population almost exclusively uses the Toman (1 Toman = 10 Rials) in daily speech and commerce. Furthermore, Persian numbers can be written in formal Farsi, informal/colloquial Farsi, or mixed with English/Persian digits.

    Our goal was to convert free-form strings into a clean Integer (Rial) for the transaction engine. The input variations included:

    • Colloquial Spelling: “Ye” instead of “Yek” (One), “Poonsad” instead of “Pansad” (Five hundred).
    • Mixed Scripts: “120 hezar” (120 thousand).
    • Unit Confusion: When users specify “Toman,” the value must be multiplied by 10 before it is stored as Rials.
    • Connectors: The use of “o” (and) between numbers, often attached to the preceding word without spaces (e.g., “Sad-o-panjah”).
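
A few hand-computed samples illustrate the variations above. These inputs are Latin transliterations chosen for readability (the hypothetical strings and values here are illustrative, not taken from production data), each paired with the integer Rial value the transaction engine expects:

```python
# Each entry: (raw user input, assumed unit, expected integer Rials).
# Values are hand-computed from the rules described above.
samples = [
    ("120 hezar", "Rial", 120_000),                     # mixed digits and words
    ("120 hezar toman", "Toman", 1_200_000),            # Toman -> x10 Rials
    ("1 million o 200 hezar", "Rial", 1_200_000),       # connector "o" between segments
    ("sad o panjah hezar toman", "Toman", 1_500_000),   # (100 + 50) * 1000 * 10
]
for text, unit, rials in samples:
    print(f"{text!r} ({unit}) -> {rials:,} Rials")
```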

    Standard NLP libraries often handle formal text well but choke on the messy, inconsistent nature of chat-based financial inputs. We needed to build a robust normalization and parsing pipeline.

    WHAT WENT WRONG

    Initially, the team attempted a rule-based regex approach. This failed quickly for several reasons:

    1. Magnitude Errors: Extracting digits from “5 million and 200 thousand” and concatenating them results in “5200,” which is off by three orders of magnitude. The system needed to perform arithmetic (5 * 1,000,000 + 200 * 1,000 = 5,200,000).
    2. The Toman Factor: When a user typed “100 Toman,” the system read it as 100 Rials. In reality, it should be 1,000 Rials. This 10x error in financial software is catastrophic.
    3. Ambiguous Spacing: Words like “صدوبیست” (one hundred and twenty) often came as a single token, breaking standard tokenizers that look for whitespace.
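
The magnitude error is easy to reproduce. This minimal sketch (using English magnitude words for clarity) contrasts naive digit concatenation with the arithmetic the parser actually needs:

```python
import re

text = "5 million and 200 thousand"

# Naive approach: pull out the digits and concatenate them.
digits = re.findall(r"\d+", text)
naive = int("".join(digits))  # yields 5200 -- wildly wrong

# Correct approach: apply each magnitude word as a multiplier.
magnitudes = {"thousand": 1_000, "million": 1_000_000}
total, pending = 0, 0
for token in text.lower().split():
    if token.isdigit():
        pending = int(token)
    elif token in magnitudes:
        total += pending * magnitudes[token]
        pending = 0
total += pending  # trailing number with no multiplier

print(naive)  # 5200
print(total)  # 5200000
```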

    We realized that hiring Python developers for scalable data systems meant finding engineers who understood that this wasn’t just a coding problem; it was a linguistic-mathematical problem.

    HOW WE APPROACHED THE SOLUTION

    We broke the solution down into three distinct phases:

    1. Normalization & Tokenization

    Before parsing, the text had to be cleaned. We created a dictionary of colloquialisms to map them to their formal counterparts. For example, “ye” becomes “yek,” and “poonsad” becomes “pansad.” We also needed to normalize Persian/Arabic digits to English integers and handle the “zero-width non-joiner” characters common in Persian typing.
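
A minimal sketch of this normalization step, assuming a small illustrative subset of the colloquialism map (the production table is far larger):

```python
# Illustrative subset of the colloquial-to-formal map described above.
COLLOQUIAL = {"ye": "yek", "poonsad": "pansad", "chel": "chehel"}
PERSIAN_DIGITS = str.maketrans("۰۱۲۳۴۵۶۷۸۹", "0123456789")
ZWNJ = "\u200c"  # zero-width non-joiner, common in Persian typing

def normalize(text: str) -> str:
    text = text.translate(PERSIAN_DIGITS)  # "۲۰۰" -> "200"
    text = text.replace(ZWNJ, " ")         # split ZWNJ-joined compounds
    tokens = text.lower().split()
    return " ".join(COLLOQUIAL.get(t, t) for t in tokens)

print(normalize("ye million o ۲۰۰ hezar"))  # -> "yek million o 200 hezar"
```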

    2. Lexical Analysis

    We categorized words into three buckets:

    • Atoms: Basic numbers (1 to 99).
    • Multipliers: Magnitude words (hundred, thousand, million, billion).
    • Currency Markers: Rial, Toman.
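
The bucketing can be sketched as a small classifier. The vocabulary sets below are illustrative subsets, not the full production lexicon:

```python
from enum import Enum

class Kind(Enum):
    ATOM = "atom"
    MULTIPLIER = "multiplier"
    CURRENCY = "currency"
    OTHER = "other"

ATOMS = {"yek": 1, "bist": 20, "sad": 100, "pansad": 500}  # subset only
MULTIPLIERS = {"hezar": 1_000, "million": 1_000_000}
CURRENCIES = {"rial", "toman"}

def classify(token: str) -> Kind:
    if token.isdigit() or token in ATOMS:
        return Kind.ATOM
    if token in MULTIPLIERS:
        return Kind.MULTIPLIER
    if token in CURRENCIES:
        return Kind.CURRENCY
    return Kind.OTHER  # connectors like "o" fall through here

print([classify(t).value for t in "pansad hezar toman".split()])
# -> ['atom', 'multiplier', 'currency']
```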

    3. The Stack-Based Arithmetic Parser

    We implemented a logic flow where the parser iterates through tokens. If it finds a number, it holds it. If it finds a multiplier (like “thousand”), it multiplies the held number and adds it to a “current total.” This handles complex structures like “One hundred (100) and fifty (50) thousand (x1000).”
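
A worked trace of that accumulator logic on “sad o panjah hezar” (one hundred and fifty thousand), using an illustrative two-word vocabulary: atoms add into a running segment, and a multiplier scales the segment and folds it into the total.

```python
ATOMS = {"sad": 100, "panjah": 50}   # illustrative subset
MULTIPLIERS = {"hezar": 1_000}

total, segment = 0, 0
for token in "sad o panjah hezar".split():
    if token in ATOMS:
        segment += ATOMS[token]                 # sad: 100, panjah: 100 + 50
    elif token in MULTIPLIERS:
        total += segment * MULTIPLIERS[token]   # 150 * 1000
        segment = 0                             # reset for the next segment
total += segment  # trailing atoms with no multiplier

print(total)  # 150000
```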

    FINAL IMPLEMENTATION

    Below is a simplified, sanitized version of the Python implementation we deployed. It handles normalization, word-to-number mapping, and the final currency calculation.

    import re
    class PersianCurrencyParser:
        def __init__(self):
            # 1. Word to Integer Mapping (Formal & Colloquial)
            self.atom_map = {
                'yek': 1, 'ye': 1, 'do': 2, 'se': 3, 'chahar': 4, 'panj': 5,
                'shesh': 6, 'haft': 7, 'hasht': 8, 'noh': 9, 'dah': 10,
                'yaazdah': 11, 'davazdah': 12, 'sizdah': 13, 'chahardah': 14,
            'panzdah': 15, 'poonzdah': 15, 'shanzdah': 16, 'hevdah': 17,
                'hejdah': 18, 'noozdah': 19, 'bist': 20, 'si': 30, 'chel': 40,
                'chehel': 40, 'panjah': 50, 'shast': 60, 'haftad': 70,
                'hashtad': 80, 'navad': 90, 'sad': 100, 'devist': 200,
                'sisad': 300, 'chaharsad': 400, 'charsad': 400, 'pansad': 500,
                'poonsad': 500, 'sheshsad': 600, 'haftsad': 700, 'hashtsad': 800,
                'nohsad': 900
            }
            # Note: In production, we use a more comprehensive regex map 
            # to handle Persian unicode characters directly.
            self.multipliers = {
                'hezar': 1000,
                'million': 1000000,
                'milliard': 1000000000,
                'billion': 1000000000,
                'trillion': 1000000000000
            }
        def _normalize(self, text):
            # Replace Persian digits with English
            persian_digits = '۰۱۲۳۴۵۶۷۸۹'
            english_digits = '0123456789'
            translation = str.maketrans(persian_digits, english_digits)
            text = text.translate(translation)
            # Strip zero-width non-joiner characters common in Persian typing
            text = text.replace('\u200c', ' ')
            # Split off "and" connectors (va/o) attached to words
            # In a real scenario, this requires sophisticated regex to avoid false positives
            text = text.replace('و', ' ')
            return text.lower()
        def parse(self, text):
            clean_text = self._normalize(text)
            tokens = clean_text.split()
            total_value = 0
            current_segment = 0
            is_toman = False 
            # Keyword detection for currency
            if 'toman' in clean_text or 'tomen' in clean_text or 'tomun' in clean_text:
                is_toman = True
            for token in tokens:
                # Handle raw digits inside text (e.g. "150 hezar")
                if token.isdigit():
                    current_segment += int(token)
                    continue
                # Handle text-based numbers
                val = self.atom_map.get(token)
                if val:
                    current_segment += val
                # Handle Multipliers
                elif token in self.multipliers:
                    multiplier = self.multipliers[token]
                    # If "hezar" appears alone without a preceding number, treat as 1000
                    if current_segment == 0:
                        current_segment = 1
                    total_value += (current_segment * multiplier)
                    current_segment = 0 # Reset for next segment
            # Add any remaining remainder (e.g., the "500" in "1 million and 500")
            total_value += current_segment
            # Final Currency Conversion
            if is_toman:
                total_value *= 10
            return total_value
    # Usage Example
    parser = PersianCurrencyParser()
    # Note: Input would be in Persian script in production
    result = parser.parse("2 million and 500 hezar toman") 
    print(f"Result in Rials: {result}") 
    Note: The code above is a logic demonstration. The production version includes robust regex patterns for Persian Unicode characters and specific handling for “Rial” vs “Toman” when both appear in the same sentence.

    LESSONS FOR ENGINEERING TEAMS

    This challenge reinforced several key principles for our backend teams:

    • Localization is not Translation: You cannot simply translate “Five Hundred” to Persian. You must account for the colloquial “Poonsad” and the “Toman” currency shift.
    • Sanitize Early: The biggest win was normalizing mixed Persian/English digits before the parsing logic began. This reduced cyclomatic complexity significantly.
    • Test with Dirty Data: Our unit tests included the most broken, grammatically incorrect inputs we could find. This made the system resilient to real user behavior.
    • Context Matters: If you hire AI developers for production deployment, ensure they understand the domain. An NLP model might extract entities, but a financial parser must enforce mathematical precision.

    WRAP UP

    Parsing informal currency inputs requires a blend of linguistic normalization and arithmetic logic. By treating the input as a mathematical expression rather than just a string of characters, we were able to deliver a 99.8% accuracy rate for our FinTech client. Whether you are building for the Middle East or any other region with complex linguistic numerals, the stack-based parsing approach remains the most robust solution.

    Social Hashtags

    #FinTechEngineering #NaturalLanguageProcessing #PythonForFinTech #ConversationalAI #TextParsing #FinancialSoftware #AIinBanking

    If you need help building complex FinTech parsers or need to hire software development teams with deep localization experience, contact us to discuss your architecture.

    Frequently Asked Questions