
    INTRODUCTION

    While working on a conversational banking interface for a large FinTech partner in the Middle East, our engineering team encountered a subtle but critical data entry challenge. The system allowed users to type money transfer amounts via a chat-based UI. We assumed users would type digits, but in reality, they typed how they spoke.

    Instead of entering “1500000”, a user might type “ye million o poonsad” (one million and five hundred), mix digits with words like “1 million o 200 hezar,” or use the colloquial “Toman” unit instead of the official “Rial.”

    The discrepancy between spoken Persian financial terms and the integer values required for the backend database created a high failure rate in transaction intent detection. Simple regex extraction wasn’t enough; we needed a semantic parser capable of understanding magnitude, colloquialisms, and currency conversion on the fly. This article details how we solved this problem, ensuring accurate financial processing.

    PROBLEM CONTEXT

    In the Iranian banking system, the official currency is the Rial, but the population almost exclusively uses the Toman (1 Toman = 10 Rials) in daily speech and commerce. Furthermore, Persian numbers can be written in formal Farsi, informal/colloquial Farsi, or mixed with English/Persian digits.

    Our goal was to convert free-form strings into a clean Integer (Rial) for the transaction engine. The input variations included:

    • Colloquial Spelling: “Ye” instead of “Yek” (One), “Poonsad” instead of “Pansad” (Five hundred).
    • Mixed Scripts: “120 hezar” (120 thousand).
    • Unit Confusion: When users specify “Toman,” the value must be multiplied by 10 before it is stored as Rials.
    • Connectors: The use of “o” (and) between numbers, often attached to the preceding word without spaces (e.g., “Sad-o-panjah”).
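
A few hand-computed samples illustrate the variations above. These inputs are Latin transliterations chosen for readability (the hypothetical strings and values here are illustrative, not taken from production data), each paired with the integer Rial value the transaction engine expects:

```python
# Each entry: (raw user input, assumed unit, expected integer Rials).
# Values are hand-computed from the rules described above.
samples = [
    ("120 hezar", "Rial", 120_000),                     # mixed digits and words
    ("120 hezar toman", "Toman", 1_200_000),            # Toman -> x10 Rials
    ("1 million o 200 hezar", "Rial", 1_200_000),       # connector "o" between segments
    ("sad o panjah hezar toman", "Toman", 1_500_000),   # (100 + 50) * 1000 * 10
]
for text, unit, rials in samples:
    print(f"{text!r} ({unit}) -> {rials:,} Rials")
```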

    Standard NLP libraries often handle formal text well but choke on the messy, inconsistent nature of chat-based financial inputs. We needed to build a robust normalization and parsing pipeline.

    WHAT WENT WRONG

    Initially, the team attempted a rule-based regex approach. This failed quickly for several reasons:

    1. Magnitude Errors: Extracting digits from “5 million and 200 thousand” and concatenating them results in “5200,” which is off by three orders of magnitude. The system needed to perform arithmetic (5 * 1,000,000 + 200 * 1,000 = 5,200,000).
    2. The Toman Factor: When a user typed “100 Toman,” the system read it as 100 Rials. In reality, it should be 1,000 Rials. This 10x error in financial software is catastrophic.
    3. Ambiguous Spacing: Words like “صدوبیست” (one hundred and twenty) often came as a single token, breaking standard tokenizers that look for whitespace.
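
The magnitude error is easy to reproduce. This minimal sketch (using English magnitude words for clarity) contrasts naive digit concatenation with the arithmetic the parser actually needs:

```python
import re

text = "5 million and 200 thousand"

# Naive approach: pull out the digits and concatenate them.
digits = re.findall(r"\d+", text)
naive = int("".join(digits))  # yields 5200 -- wildly wrong

# Correct approach: apply each magnitude word as a multiplier.
magnitudes = {"thousand": 1_000, "million": 1_000_000}
total, pending = 0, 0
for token in text.lower().split():
    if token.isdigit():
        pending = int(token)
    elif token in magnitudes:
        total += pending * magnitudes[token]
        pending = 0
total += pending  # trailing number with no multiplier

print(naive)  # 5200
print(total)  # 5200000
```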

    We realized that hiring Python developers for scalable data systems meant finding engineers who understood that this wasn’t just a coding problem; it was a linguistic-mathematical problem.

    HOW WE APPROACHED THE SOLUTION

    We broke the solution down into three distinct phases:

    1. Normalization & Tokenization

    Before parsing, the text had to be cleaned. We created a dictionary of colloquialisms to map them to their formal counterparts. For example, “ye” becomes “yek,” and “poonsad” becomes “pansad.” We also needed to normalize Persian/Arabic digits to English integers and handle the “zero-width non-joiner” characters common in Persian typing.
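
A minimal sketch of this normalization step, assuming a small illustrative subset of the colloquialism map (the production table is far larger):

```python
# Illustrative subset of the colloquial-to-formal map described above.
COLLOQUIAL = {"ye": "yek", "poonsad": "pansad", "chel": "chehel"}
PERSIAN_DIGITS = str.maketrans("۰۱۲۳۴۵۶۷۸۹", "0123456789")
ZWNJ = "\u200c"  # zero-width non-joiner, common in Persian typing

def normalize(text: str) -> str:
    text = text.translate(PERSIAN_DIGITS)  # "۲۰۰" -> "200"
    text = text.replace(ZWNJ, " ")         # split ZWNJ-joined compounds
    tokens = text.lower().split()
    return " ".join(COLLOQUIAL.get(t, t) for t in tokens)

print(normalize("ye million o ۲۰۰ hezar"))  # -> "yek million o 200 hezar"
```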

    2. Lexical Analysis

    We categorized words into three buckets:

    • Atoms: Basic numbers (1 to 99).
    • Multipliers: Magnitude words (hundred, thousand, million, billion).
    • Currency Markers: Rial, Toman.
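
The bucketing can be sketched as a small classifier. The vocabulary sets below are illustrative subsets, not the full production lexicon:

```python
from enum import Enum

class Kind(Enum):
    ATOM = "atom"
    MULTIPLIER = "multiplier"
    CURRENCY = "currency"
    OTHER = "other"

ATOMS = {"yek": 1, "bist": 20, "sad": 100, "pansad": 500}  # subset only
MULTIPLIERS = {"hezar": 1_000, "million": 1_000_000}
CURRENCIES = {"rial", "toman"}

def classify(token: str) -> Kind:
    if token.isdigit() or token in ATOMS:
        return Kind.ATOM
    if token in MULTIPLIERS:
        return Kind.MULTIPLIER
    if token in CURRENCIES:
        return Kind.CURRENCY
    return Kind.OTHER  # connectors like "o" fall through here

print([classify(t).value for t in "pansad hezar toman".split()])
# -> ['atom', 'multiplier', 'currency']
```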

    3. The Stack-Based Arithmetic Parser

    We implemented a logic flow where the parser iterates through tokens. If it finds a number, it holds it. If it finds a multiplier (like “thousand”), it multiplies the held number and adds it to a “current total.” This handles complex structures like “One hundred (100) and fifty (50) thousand (x1000).”
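
A worked trace of that accumulator logic on “sad o panjah hezar” (one hundred and fifty thousand), using an illustrative two-word vocabulary: atoms add into a running segment, and a multiplier scales the segment and folds it into the total.

```python
ATOMS = {"sad": 100, "panjah": 50}   # illustrative subset
MULTIPLIERS = {"hezar": 1_000}

total, segment = 0, 0
for token in "sad o panjah hezar".split():
    if token in ATOMS:
        segment += ATOMS[token]                 # sad: 100, panjah: 100 + 50
    elif token in MULTIPLIERS:
        total += segment * MULTIPLIERS[token]   # 150 * 1000
        segment = 0                             # reset for the next segment
total += segment  # trailing atoms with no multiplier

print(total)  # 150000
```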

    FINAL IMPLEMENTATION

    Below is a simplified, sanitized version of the Python implementation we deployed. It handles normalization, word-to-number mapping, and the final currency calculation.

    import re
    class PersianCurrencyParser:
        def __init__(self):
            # 1. Word to Integer Mapping (Formal & Colloquial)
            self.atom_map = {
                'yek': 1, 'ye': 1, 'do': 2, 'se': 3, 'chahar': 4, 'panj': 5,
                'shesh': 6, 'haft': 7, 'hasht': 8, 'noh': 9, 'dah': 10,
                'yaazdah': 11, 'davazdah': 12, 'sizdah': 13, 'chahardah': 14,
            'panzdah': 15, 'poonzdah': 15, 'shanzdah': 16, 'hevdah': 17,
                'hejdah': 18, 'noozdah': 19, 'bist': 20, 'si': 30, 'chel': 40,
                'chehel': 40, 'panjah': 50, 'shast': 60, 'haftad': 70,
                'hashtad': 80, 'navad': 90, 'sad': 100, 'devist': 200,
                'sisad': 300, 'chaharsad': 400, 'charsad': 400, 'pansad': 500,
                'poonsad': 500, 'sheshsad': 600, 'haftsad': 700, 'hashtsad': 800,
                'nohsad': 900
            }
            # Note: In production, we use a more comprehensive regex map 
            # to handle Persian unicode characters directly.
            self.multipliers = {
                'hezar': 1000,
                'million': 1000000,
                'milliard': 1000000000,
                'billion': 1000000000,
                'trillion': 1000000000000
            }
        def _normalize(self, text):
            # Replace Persian digits with English
            persian_digits = '۰۱۲۳۴۵۶۷۸۹'
            english_digits = '0123456789'
            translation = str.maketrans(persian_digits, english_digits)
            text = text.translate(translation)
            # Strip zero-width non-joiner characters common in Persian typing
            text = text.replace('\u200c', ' ')
            # Split off "and" connectors (va/o) attached to words
            # In a real scenario, this requires sophisticated regex to avoid false positives
            text = text.replace('و', ' ')
            return text.lower()
        def parse(self, text):
            clean_text = self._normalize(text)
            tokens = clean_text.split()
            total_value = 0
            current_segment = 0
            is_toman = False 
            # Keyword detection for currency
            if 'toman' in clean_text or 'tomen' in clean_text or 'tomun' in clean_text:
                is_toman = True
            for token in tokens:
                # Handle raw digits inside text (e.g. "150 hezar")
                if token.isdigit():
                    current_segment += int(token)
                    continue
                # Handle text-based numbers
                val = self.atom_map.get(token)
                if val:
                    current_segment += val
                # Handle Multipliers
                elif token in self.multipliers:
                    multiplier = self.multipliers[token]
                    # If "hezar" appears alone without a preceding number, treat as 1000
                    if current_segment == 0:
                        current_segment = 1
                    total_value += (current_segment * multiplier)
                    current_segment = 0 # Reset for next segment
            # Add any remaining remainder (e.g., the "500" in "1 million and 500")
            total_value += current_segment
            # Final Currency Conversion
            if is_toman:
                total_value *= 10
            return total_value
    # Usage Example
    parser = PersianCurrencyParser()
    # Note: Input would be in Persian script in production
    result = parser.parse("2 million and 500 hezar toman") 
    print(f"Result in Rials: {result}") 
    Note: The code above is a logic demonstration. The production version includes robust regex patterns for Persian Unicode characters and specific handling for “Rial” vs “Toman” when both appear in the same sentence.

    LESSONS FOR ENGINEERING TEAMS

    This challenge reinforced several key principles for our backend teams:

    • Localization is not Translation: You cannot simply translate “Five Hundred” to Persian. You must account for the colloquial “Poonsad” and the “Toman” currency shift.
    • Sanitize Early: The biggest win was normalizing mixed Persian/English digits before the parsing logic began. This reduced cyclomatic complexity significantly.
    • Test with Dirty Data: Our unit tests included the most broken, grammatically incorrect inputs we could find. This made the system resilient to real user behavior.
    • Context Matters: If you hire AI developers for production deployment, ensure they understand the domain. An NLP model might extract entities, but a financial parser must enforce mathematical precision.

    WRAP UP

    Parsing informal currency inputs requires a blend of linguistic normalization and arithmetic logic. By treating the input as a mathematical expression rather than just a string of characters, we were able to deliver a 99.8% accuracy rate for our FinTech client. Whether you are building for the Middle East or any other region with complex linguistic numerals, the stack-based parsing approach remains the most robust solution.

    Social Hashtags

    #FinTechEngineering #NaturalLanguageProcessing #PythonForFinTech #ConversationalAI #TextParsing #FinancialSoftware #AIinBanking

    If you need help building complex FinTech parsers or need to hire software development teams with deep localization experience, contact us to discuss your architecture.

    Frequently Asked Questions