Automated Financial Statement Extraction from PDFs Using LLMs

Introduction
Corporate financial statements are crucial documents that show a company's financial health. However, when analyzing financial statements from numerous companies, manually opening each PDF document and finding the necessary numbers is tedious and time-consuming.
A bigger problem is that financial statement formats vary slightly between companies. Some companies label it as "Net Income," while others use "Net Loss." Even items with the same meaning can be expressed differently, like "Total Assets" versus "Assets."
To solve these problems, we built a system that automatically extracts and normalizes financial statements from PDFs using Large Language Models (LLMs). This article walks through that process.
Why LLM?
Traditional PDF data extraction required the following work:
- Converting PDF to text
- Parsing data based on regular expressions or rules
- Recognizing table formats and extracting data
- Writing numerous conditional statements for exception handling
However, financial statements have different formats across companies, and sometimes include scanned image PDFs, making this approach limited.

LLMs flexibly solve these problems:
- Contextual Understanding: LLMs already know what "Balance Sheet" is and how "Assets" and "Liabilities" are structured
- Flexibility: Even with slightly different table formats or item names, LLMs can extract by understanding the meaning
- Image Processing: Modern LLMs can read images directly, enabling processing of scanned PDFs
- Structured Output: Using the Structured Output feature, you can receive data directly in the desired format
Comparison with Traditional OCR Approaches
Many might wonder, "Since OCR can extract text from PDFs, why use LLMs?" There are significant differences between the two approaches.
OCR Workflow: PDF → image preprocessing → layout analysis → OCR → post-processing → rule-based parsing

LLM Workflow: PDF → LLM (single pass, with schema) → structured JSON
Specific Differences
1. Contextual Understanding Ability
OCR simply recognizes characters without understanding meaning:
```
OCR Result: "Total Assets 1,234,567 Total Liabilities 789,012"
→ Doesn't know what this means
→ Doesn't understand the relationship between assets and liabilities
→ Requires separate parsing logic
```
- While OCR can be combined with Layout Analysis to recognize document structure or use language models in post-processing to evaluate reliability, this approach has a pipeline structure where each stage (Layout Analysis → OCR → Post-processing) operates independently. Therefore, errors from earlier stages propagate and accumulate to later stages.
LLMs understand meaning:
```json
{
  "type": "Balance Sheet",
  "items": [
    {"account": "Assets", "amount": 1234567},
    {"account": "Liabilities", "amount": 789012}
  ]
}
```
→ Recognizes this as a balance sheet
→ Understands the hierarchical relationship between assets and liabilities
→ Immediately usable structured data
2. Table Structure Recognition
Financial statements have complex table formats. OCR:
```
Assets           Current     Prior
Current Assets   1,000,000   900,000
  Cash             500,000   400,000
  Receivables      500,000   500,000
Non-current        500,000   450,000
```
→ OCR recognizes this as a simple text listing
→ Doesn't know that indentation means "hierarchy"
→ Difficult to determine that "Current" and "Prior" are different columns
LLMs automatically understand table structure:
```json
{
  "periods": [
    {
      "period_name": "Current",
      "items": [
        {
          "account": "Current Assets",
          "amount": "1,000,000",
          "sub_accounts": [
            {"account": "Cash", "amount": "500,000"},
            {"account": "Receivables", "amount": "500,000"}
          ]
        }
      ]
    },
    {"period_name": "Prior", "items": [...]}
  ]
}
```
3. Unstructured Data Processing
Real financial statements vary widely:
- Company A: "Total Assets"
- Company B: "Total  Assets" (extra spaces)
- Company C: "I. Total Assets"
- Company D: "Assets Total"
- Company E: "Total Assets" (English)
OCR approach requires processing logic for each:
```python
# OCR post-processing example
if "Assets" in text and "Total" in text:
    account = "Assets"
elif "Assets Total" in text:
    account = "Assets"
elif "Total Assets" in text:
    account = "Assets"
# ... dozens of conditional statements
```
LLMs automatically understand these variations and extract in unified format.
4. Error Recovery
For PDFs with poor scan quality:
```
OCR Result: "Total Ass3ts 1,Z34,567"  (e→3, 2→Z misrecognized)
→ Parsing fails
→ Manual correction needed
```
LLMs recover errors from context:
```
LLM: "Ass3ts" is contextually "Assets"; "Z34" is likely "234"
→ {"account": "Total Assets", "amount": "1,234,567"}
→ Correctly extracted
```
5. Hierarchical Relationship Understanding
Financial statements have hierarchical structure:
OCR Characteristics:
- Requires implementing indentation parsing logic
- Needs table layout analysis
- Different processing needed for each company's format
LLM Characteristics:
- Automatically understands hierarchical relationships
- Can extract to desired depth via prompts
- No additional code needed
- Difficult to control for special cases outside typical situations
6. Development and Maintenance Complexity
OCR-based system:
- OCR preprocessing (image correction, noise removal)
- Table region detection
- Cell separation and text extraction
- Hierarchical structure parsing
- Number format normalization
- Exception case handling
- Company-specific special format handling
- ...
LLM-based system:
- Data model definition
- LLM API calls
- Basic normalization
- Utilities

Total project: about 200 lines.
Conclusion: When to Use Each Approach?
When OCR Approach is Suitable:
- Completely fixed format documents with existing working templates
- Simply extracting text is sufficient
- Very high document volume where cost is critical
- Processing must be done offline
When LLM Approach is Suitable:
- Documents with various formats
- Requires semantic understanding and structuring
- Needs rapid development and prototyping
- Complex documents like financial statements, contracts, reports
Our financial statement extraction case was typical of the latter.
System Architecture
Our system consists of five main stages:
Stage 1: Data Model Design
The first decision when designing a financial statement extraction system is "How to receive LLM output?" While multimodal LLMs can accept images as input, output remains text-based. We need to extract financial statements from PDFs and create structured data. If we received unconstrained free-form text, many difficulties could arise during parsing.
Why Choose Structured Output?
There are two main ways to get structured data from LLMs:
Method 1: Request JSON Generation via Prompt
```python
response = llm.generate("Extract financial statement as JSON")
result = json.loads(response.text)
```
Method 2: Use Structured Output
```python
response = client.generate(
    config={"response_schema": Account}
)
```
We chose Method 2 (Structured Output) for the following reasons.
Problems with Method 1:

- Format Inconsistency: LLMs sometimes wrap the output in Markdown code blocks or add explanatory text, e.g. `Here is the financial statement data: {"company": "ABC"} That's all.`
- Field Name Variation: uses both `company_name` and `companyName` with the same prompt
- Type Inconsistency: returns numbers sometimes as strings, sometimes as numbers
- Parsing Errors: the JSON format itself can break
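In practice, Method 1 forces defensive parsing code around every call. A minimal sketch of that workaround (the response string is a made-up example) that digs the JSON out of the surrounding chatter before `json.loads`:

```python
import json
import re

# A typical Method 1 response: valid JSON buried in explanatory text
raw = 'Here is the financial statement data:\n{"company": "ABC"}\nThat\'s all.'

# Defensively locate the JSON object inside the surrounding chatter
match = re.search(r"\{.*\}", raw, re.DOTALL)
data = json.loads(match.group(0)) if match else None
print(data)  # {'company': 'ABC'}
```

This kind of regex scraping is exactly the fragility that Structured Output removes.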
Structured Output solves these problems:
Structured Output acts as a constraint when LLMs generate tokens. When selecting the next token, LLMs only consider candidates conforming to the defined schema. Outputs violating the schema are fundamentally blocked at generation stage, always guaranteeing valid JSON. This ensures output consistency and prevents parsing errors.
```python
from pydantic import BaseModel  # Python data validation library

class Account(BaseModel):
    account: str
    amount: str | None

# LLM must follow this schema
response = client.generate(
    config={"response_schema": Account}
)
```
Result:
- Always correct JSON format
- Field name consistency guaranteed
- Required/optional fields clear
Why Use Loose Constraints?
Once decided on Structured Output, the next question is "How strictly to define data types?"
Why we defined number fields as str not int:
```python
class BaseAccount(BaseModel):
    account: str = Field(description="Account name")
    amount: str | None = Field(description="Account value")  # str, not int
```
Diversity in Number Notation:
- Company A: "1,234,567" (comma separator)
- Company B: "1.234.567" (dot separator, European style)
- Company C: "(1,234,567)" (parentheses mean negative)
- Company D: "1,234,567.00" (with decimal point)
- Company E: "△1,234,567" (△ means negative)
- Company F: "N/A" (text representing no value)
If we defined amount: int, LLMs would:
- Be confused how to handle commas
- Fail converting parentheses or special symbols to int
- Might misinterpret "N/A" as 0
We could request LLMs to also normalize number formats. However, experiments showed LLMs lacked consistency when converting "(1,234)" to "-1234" and "△567" to "-567". Implementing clear rule-based normalization logic in Python code was more stable.
Accurate Conversion in Post-processing:
```python
def _to_int(text: str | None) -> int | None:
    if text is None or text in {"NA", "N/A", "null"}:
        return None
    # Parentheses and the △ symbol both mean negative
    is_negative = text.startswith("(") or text.startswith("△")
    text = text.replace("(", "").replace(")", "").replace("△", "")
    # Remove thousands separators
    text = text.replace(",", "")
    # Truncate the decimal fraction
    if "." in text:
        text = text.split(".")[0]
    return -int(text) if is_negative else int(text)
```
Advantages of this approach:
- LLM only needs to extract "as-is" (simple task)
- Complex conversion logic handled clearly in Python code
- Easy debugging by preserving original data
- Only need to modify normalization logic when finding new formats
Financial Statement Structure Design
Having decided on Structured Output, we now must define how to represent financial statements. We designed the following hierarchical structure using Pydantic:
Defining structure this way clearly instructs LLMs to "extract data in this format."
Structured Output Limitations (Gemini API Case)
However, the Gemini API doesn't support all JSON Schema features. In particular, we ran into these constraints:
Unsupported Features:

- Self-Reference
  - Cannot define recursive structures
  - Example: nodes referencing themselves in tree structures
- `$ref` Keyword Limitations
  - JSON Schema reference functionality is limited
- `anyOf`, `oneOf`, `allOf`
  - Complex conditional schemas are not possible
- Schema Complexity Limits
  - Too-deep nesting or too many attributes may be rejected
Problem We Faced:
Initially we wanted to define hierarchical structure recursively:
```python
# Ideal, but impossible in Gemini
class Account(BaseModel):
    account: str
    amount: str | None
    sub_accounts: list['Account'] | None  # Self-reference not supported
```
This could represent a hierarchy of unlimited depth, but because Gemini doesn't support self-reference, it raises an error.
Solution:
We limited hierarchy to fixed depth and created separate classes:
```python
# Approach actually used
class SubAccount(BaseAccount):
    # Sub-accounts no longer have sub_accounts of their own
    pass

class Account(BaseAccount):
    # Allow only 2 levels
    sub_accounts: list[SubAccount] | None = Field(
        description="Sub-items of account",
        default=None
    )
```
Stage 2: LLM Prompt Design
Clearly telling LLMs what task to perform is crucial. Core content of our prompt:
```
Extract corporate financial statements.
- Financial statements consist of balance sheets and income statements
- Input files may be image-based PDFs
- Extract hierarchically in JSON format
- Generally consists of 2 periods (prior, current)

Extract the balance sheet to 2 levels only
- Example: Assets (level 1), Current Assets (level 2), Non-current Assets (level 2)

Extract the income statement to 1 level only
- Example: Revenue (level 1), Cost of Sales (level 1), Gross Profit (level 1)
```
Important points:
- Clear Scope Setting: Specific constraints like "only 2 levels"
- Providing Examples: Clearly show which items to extract through examples
- Considering Exceptions: Mention various cases like "may be image PDF"
Stage 3: Extraction Process
Actual extraction proceeds as follows using Google Gemini API:
- Upload PDF file to Gemini
- Wait for file processing
- Request extraction with predefined data model and prompt
- Receive results in JSON format
Core code is very simple:
```python
# Upload PDF file
handle = client.files.upload(file=pdf_file)

# Request extraction from LLM
response = client.models.generate_content(
    model="gemini-2.5-pro",
    config={
        "system_instruction": prompt,
        "response_schema": data_model,
        "response_mime_type": "application/json",
    },
    contents=[handle],
)
```
By specifying the Pydantic model defined in Stage 1 as the `response_schema`, the LLM always generates consistently structured JSON.
Stage 4: Data Normalization
Data extracted by LLMs isn't perfect yet. For example:
- "Total Assets" and "Assets" mean the same but expressed differently
- "Net Loss" should be expressed as negative "Net Income"
- Each company uses different account systems
Therefore separate normalization is needed:
```python
normalization_rules = {
    "Total Assets": ("Assets", False),
    "Total Liabilities": ("Liabilities", False),
    "Net Loss": ("Net Income", True),  # True means sign reversal
    "Operating Loss": ("Operating Income", True),
    # ...
}
```
Applying these rules:
- "Net Loss 10 million" → converts to "Net Income -10 million"
- All company data unified to same account system
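A minimal sketch of applying such rules (the rule table here is abbreviated, and the `normalize` function name is illustrative):

```python
# (canonical_name, flip_sign) pairs; abbreviated illustration
normalization_rules = {
    "Total Assets": ("Assets", False),
    "Net Loss": ("Net Income", True),  # True means sign reversal
}

def normalize(account: str, amount: int) -> tuple[str, int]:
    # Unknown accounts pass through unchanged
    canonical, flip_sign = normalization_rules.get(account, (account, False))
    return canonical, (-amount if flip_sign else amount)

print(normalize("Net Loss", 10_000_000))     # ('Net Income', -10000000)
print(normalize("Total Assets", 1_234_567))  # ('Assets', 1234567)
```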
Stage 5: Hierarchical Structure Recalculation
Financial statements have hierarchical structure. For example:
```
Assets = Current Assets + Non-current Assets
Current Assets = Cash and Cash Equivalents + Receivables + ...
```
If "Current Assets" value is missing, it can be calculated from sum of sub-items. We implemented this logic to improve data completeness.
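A sketch of that completeness logic, assuming accounts have already been converted to plain dictionaries with integer amounts (the field names follow the schema; the function name is made up):

```python
def fill_missing_parents(items: list[dict]) -> list[dict]:
    """Fill a parent's missing amount with the sum of its sub-accounts."""
    for item in items:
        subs = item.get("sub_accounts") or []
        if item.get("amount") is None and subs:
            known = [s["amount"] for s in subs if s.get("amount") is not None]
            if known:
                item["amount"] = sum(known)
    return items

accounts = [{
    "account": "Current Assets",
    "amount": None,  # missing in the PDF
    "sub_accounts": [
        {"account": "Cash", "amount": 500_000},
        {"account": "Receivables", "amount": 500_000},
    ],
}]
fill_missing_parents(accounts)
print(accounts[0]["amount"])  # 1000000
```

Amounts that are already present are left untouched, so the original extracted figures always take precedence over computed ones.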
Real Application Results
Processing hundreds of financial statement PDFs using this system:
- Time Savings: Completed in hours what would take days manually
- Accuracy: Over 95% accuracy in most cases
- Consistency: All data normalized to same format for easy analysis
Of course it's not perfect. Occasionally LLMs:
- Misinterpret unusually formatted financial statements
- Misrecognize number digits
But only about 10% of total work requires manual review, still a huge efficiency improvement.
Limitations of This Approach
Of course this approach isn't perfect:
- Cost: API costs for large-scale processing
- Dependency: Reliance on external API service
- Validation Needed: Cannot trust 100%, sample validation needed
- Special Cases: Very unstructured financial statements still difficult
Conclusion
Using LLMs enables document processing automation in ways previously unimaginable. Particularly effective for tasks like financial statements that:
- Have similar structure but varying detailed formats
- Require specialized knowledge
- Need large-scale processing
The key is not thinking of LLMs as "magic tools" but utilizing them appropriately:
- Clear data model design
- Specific prompt writing
- Systematic post-processing and validation
Combining these three creates systems actually usable in practice.
Technology Stack
For reference, technologies used in this project:
- LLM: Google Gemini 2.5 Pro
- Language: Python 3.11+
- Main Libraries:
- `google-genai`: Gemini API client
- `pydantic`: Data model definition and validation
- `pandas`: Data processing and analysis
The total code is about 200 lines, very concise, because the LLM handles the complex logic instead.