Automated Financial Statement Extraction from PDFs Using LLMs

Introduction
Corporate financial statements are crucial documents that show a company's financial health. However, when analyzing financial statements from numerous companies, manually opening each PDF document and finding the necessary numbers is tedious and time-consuming.
A bigger problem is that financial statement formats vary slightly between companies. Some companies label it as "Net Income," while others use "Net Loss." Even items with the same meaning can be expressed differently, like "Total Assets" versus "Assets."
To solve these problems, we built a system that automatically extracts and normalizes financial statements from PDFs using Large Language Models (LLMs). This article walks through that process.
Why LLM?
Traditional PDF data extraction required the following work:
- Converting PDF to text
- Parsing data based on regular expressions or rules
- Recognizing table formats and extracting data
- Writing numerous conditional statements for exception handling
However, financial statements have different formats across companies, and sometimes include scanned image PDFs, making this approach limited.

LLMs flexibly solve these problems:
- Contextual Understanding: LLMs already know what "Balance Sheet" is and how "Assets" and "Liabilities" are structured
- Flexibility: Even with slightly different table formats or item names, LLMs can extract by understanding the meaning
- Image Processing: Modern LLMs can read images directly, enabling processing of scanned PDFs
- Structured Output: Using the Structured Output feature, you can receive data directly in the desired format
Comparison with Traditional OCR Approaches
Many might wonder, "Since OCR can extract text from PDFs, why use LLMs?" There are significant differences between the two approaches.
OCR Workflow: PDF → image preprocessing → layout analysis → OCR → post-processing → rule-based parsing

LLM Workflow: PDF → LLM (single pass, with schema) → structured JSON
Specific Differences
1. Contextual Understanding Ability
OCR simply recognizes characters without understanding meaning:
```
OCR Result: "Total Assets 1,234,567 Total Liabilities 789,012"
→ Doesn't know what this means
→ Doesn't understand the relationship between assets and liabilities
→ Requires separate parsing logic
```
- While OCR can be combined with Layout Analysis to recognize document structure or use language models in post-processing to evaluate reliability, this approach has a pipeline structure where each stage (Layout Analysis → OCR → Post-processing) operates independently. Therefore, errors from earlier stages propagate and accumulate to later stages.
LLMs understand meaning:
```json
{
  "type": "Balance Sheet",
  "items": [
    {"account": "Assets", "amount": 1234567},
    {"account": "Liabilities", "amount": 789012}
  ]
}
```
→ Recognizes this as a balance sheet
→ Understands the hierarchical relationship between assets and liabilities
→ Immediately usable structured data
2. Table Structure Recognition
Financial statements have complex table formats. OCR:
```
Assets           Current     Prior
Current Assets   1,000,000   900,000
  Cash             500,000   400,000
  Receivables      500,000   500,000
Non-current        500,000   450,000
```
→ OCR recognizes this as a simple text listing
→ Doesn't know that indentation means "hierarchy"
→ Difficult to determine that "Current" and "Prior" are different columns
LLMs automatically understand table structure:
```json
{
  "periods": [
    {
      "period_name": "Current",
      "items": [
        {
          "account": "Current Assets",
          "amount": "1,000,000",
          "sub_accounts": [
            {"account": "Cash", "amount": "500,000"},
            {"account": "Receivables", "amount": "500,000"}
          ]
        }
      ]
    },
    {"period_name": "Prior", "items": [...]}
  ]
}
```
3. Unstructured Data Processing
Real financial statements vary widely:
- Company A: "Total Assets"
- Company B: "Total  Assets" (extra spaces)
- Company C: "I. Total Assets"
- Company D: "Assets Total"
- Company E: "Total Assets" (English)
OCR approach requires processing logic for each:
```python
# OCR post-processing example
if "Assets" in text and "Total" in text:
    account = "Assets"
elif "Assets Total" in text:
    account = "Assets"
elif "Total Assets" in text:
    account = "Assets"
# ... dozens of conditional statements
```
LLMs automatically understand these variations and extract in unified format.
4. Error Recovery
For PDFs with poor scan quality:
```
OCR Result: "Total Ass3ts 1,Z34,567"  (e→3, 2→Z misrecognized)
→ Parsing fails
→ Manual correction needed
```
LLMs recover errors from context:
```
LLM: "Ass3ts" is contextually "Assets"; "Z34" is likely "234"
→ {"account": "Total Assets", "amount": "1,234,567"}
→ Correctly extracted
```
5. Hierarchical Relationship Understanding
Financial statements have hierarchical structure:
OCR Characteristics:
- Requires implementing indentation parsing logic
- Needs table layout analysis
- Different processing needed for each company's format
LLM Characteristics:
- Automatically understands hierarchical relationships
- Can extract to desired depth via prompts
- No additional code needed
- Difficult to control for special cases outside typical situations
6. Development and Maintenance Complexity
OCR-based system:
- OCR preprocessing (image correction, noise removal)
- Table region detection
- Cell separation and text extraction
- Hierarchical structure parsing
- Number format normalization
- Exception case handling
- Company-specific special format handling
- ...
LLM-based system:
- Data model definition
- LLM API calls
- Basic normalization
- Utilities

Total project: about 200 lines.
Conclusion: When to Use Each Approach?
When OCR Approach is Suitable:
- Completely fixed format documents with existing working templates
- Simply extracting text is sufficient
- Very high document volume where cost is critical
- Processing must be done offline
When LLM Approach is Suitable:
- Documents with various formats
- Requires semantic understanding and structuring
- Needs rapid development and prototyping
- Complex documents like financial statements, contracts, reports
Our financial statement extraction case was typical of the latter.
System Architecture
Our system consists of five main stages:
Stage 1: Data Model Design
The first decision when designing a financial statement extraction system is "How to receive LLM output?" While multimodal LLMs can accept images as input, output remains text-based. We need to extract financial statements from PDFs and create structured data. If we received unconstrained free-form text, many difficulties could arise during parsing.
Why Choose Structured Output?
There are two main ways to get structured data from LLMs:
Method 1: Request JSON Generation via Prompt
```python
response = llm.generate("Extract financial statement as JSON")
result = json.loads(response.text)
```
Method 2: Use Structured Output
```python
response = client.generate(
    config={"response_schema": Account}
)
```
We chose Method 2 (Structured Output) for the following reasons.
Problems with Method 1:

- Format Inconsistency: LLMs sometimes wrap the output in Markdown code blocks or add explanatory text, e.g. `Here is the financial statement data: {"company": "ABC"} That's all.`
- Field Name Variation: uses both `company_name` and `companyName` with the same prompt
- Type Inconsistency: returns numbers sometimes as strings, sometimes as numbers
- Parsing Errors: the JSON format itself can break
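In practice, Method 1 forces defensive parsing code around every call. A minimal sketch of that workaround (the response string is a made-up example) that digs the JSON out of the surrounding chatter before `json.loads`:

```python
import json
import re

# A typical Method 1 response: valid JSON buried in explanatory text
raw = 'Here is the financial statement data:\n{"company": "ABC"}\nThat\'s all.'

# Defensively locate the JSON object inside the surrounding chatter
match = re.search(r"\{.*\}", raw, re.DOTALL)
data = json.loads(match.group(0)) if match else None
print(data)  # {'company': 'ABC'}
```

This kind of regex scraping is exactly the fragility that Structured Output removes.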
Structured Output solves these problems:
Structured Output acts as a constraint when LLMs generate tokens. When selecting the next token, LLMs only consider candidates conforming to the defined schema. Outputs violating the schema are fundamentally blocked at generation stage, always guaranteeing valid JSON. This ensures output consistency and prevents parsing errors.
```python
from pydantic import BaseModel  # Python data validation library

class Account(BaseModel):
    account: str
    amount: str | None

# LLM must follow this schema
response = client.generate(
    config={"response_schema": Account}
)
```
Result:
- Always correct JSON format
- Field name consistency guaranteed
- Required/optional fields clear
Why Use Loose Constraints?
Once decided on Structured Output, the next question is "How strictly to define data types?"
Why we defined number fields as str not int:
```python
class BaseAccount(BaseModel):
    account: str = Field(description="Account name")
    amount: str | None = Field(description="Account value")  # str, not int
```
Diversity in Number Notation:
- Company A: "1,234,567" (comma separator)
- Company B: "1.234.567" (dot separator, European style)
- Company C: "(1,234,567)" (parentheses mean negative)
- Company D: "1,234,567.00" (with decimal point)
- Company E: "△1,234,567" (△ means negative)
- Company F: "N/A" (text representing no value)
If we defined amount: int, LLMs would:
- Be confused how to handle commas
- Fail converting parentheses or special symbols to int
- Might misinterpret "N/A" as 0
We could request LLMs to also normalize number formats. However, experiments showed LLMs lacked consistency when converting "(1,234)" to "-1234" and "△567" to "-567". Implementing clear rule-based normalization logic in Python code was more stable.
Accurate Conversion in Post-processing:
```python
def _to_int(text: str | None) -> int | None:
    if text is None or text in {"NA", "N/A", "null"}:
        return None
    # Parentheses and the △ symbol both mean negative
    is_negative = text.startswith("(") or text.startswith("△")
    text = text.replace("(", "").replace(")", "").replace("△", "")
    # Remove thousands separators
    text = text.replace(",", "")
    # Truncate the decimal fraction
    if "." in text:
        text = text.split(".")[0]
    return -int(text) if is_negative else int(text)
```
Advantages of this approach:
- LLM only needs to extract "as-is" (simple task)
- Complex conversion logic handled clearly in Python code
- Easy debugging by preserving original data
- Only need to modify normalization logic when finding new formats
Financial Statement Structure Design
Having decided on Structured Output, we now must define how to represent financial statements. We designed the following hierarchical structure using Pydantic:
Defining structure this way clearly instructs LLMs to "extract data in this format."
Structured Output Limitations (Gemini API Case)
However, the Gemini API doesn't support all JSON Schema features. In particular, we ran into these constraints:
Unsupported Features:

- Self-Reference
  - Cannot define recursive structures
  - Example: nodes referencing themselves in tree structures
- `$ref` Keyword Limitations
  - JSON Schema reference functionality is limited
- `anyOf`, `oneOf`, `allOf`
  - Complex conditional schemas are not possible
- Schema Complexity Limits
  - Too-deep nesting or too many attributes may be rejected
Problem We Faced:
Initially we wanted to define hierarchical structure recursively:
```python
# Ideal, but impossible in Gemini
class Account(BaseModel):
    account: str
    amount: str | None
    sub_accounts: list['Account'] | None  # Self-reference not supported
```
This could represent a hierarchy of unlimited depth, but because Gemini doesn't support self-reference, it raises an error.
Solution:
We limited hierarchy to fixed depth and created separate classes:
```python
# Approach actually used
class SubAccount(BaseAccount):
    # Sub-accounts no longer have sub_accounts of their own
    pass

class Account(BaseAccount):
    # Allow only 2 levels
    sub_accounts: list[SubAccount] | None = Field(
        description="Sub-items of account",
        default=None
    )
```
Stage 2: LLM Prompt Design
Clearly telling LLMs what task to perform is crucial. Core content of our prompt:
```
Extract corporate financial statements.
- Financial statements consist of balance sheets and income statements
- Input files may be image-based PDFs
- Extract hierarchically in JSON format
- Generally consists of 2 periods (prior, current)

Extract the balance sheet to 2 levels only
- Example: Assets (level 1), Current Assets (level 2), Non-current Assets (level 2)

Extract the income statement to 1 level only
- Example: Revenue (level 1), Cost of Sales (level 1), Gross Profit (level 1)
```
Important points:
- Clear Scope Setting: Specific constraints like "only 2 levels"
- Providing Examples: Clearly show which items to extract through examples
- Considering Exceptions: Mention various cases like "may be image PDF"
Stage 3: Extraction Process
Actual extraction proceeds as follows using Google Gemini API:
- Upload PDF file to Gemini
- Wait for file processing
- Request extraction with predefined data model and prompt
- Receive results in JSON format
Core code is very simple:
```python
# Upload PDF file
handle = client.files.upload(file=pdf_file)

# Request extraction from LLM
response = client.models.generate_content(
    model="gemini-2.5-pro",
    config={
        "system_instruction": prompt,
        "response_schema": data_model,
        "response_mime_type": "application/json",
    },
    contents=[handle],
)
```
By specifying the Pydantic model defined in Stage 1 as the `response_schema`, the LLM always generates consistently structured JSON.
Stage 4: Data Normalization
Data extracted by LLMs isn't perfect yet. For example:
- "Total Assets" and "Assets" mean the same but expressed differently
- "Net Loss" should be expressed as negative "Net Income"
- Each company uses different account systems
Therefore separate normalization is needed:
```python
normalization_rules = {
    "Total Assets": ("Assets", False),
    "Total Liabilities": ("Liabilities", False),
    "Net Loss": ("Net Income", True),  # True means sign reversal
    "Operating Loss": ("Operating Income", True),
    # ...
}
```
Applying these rules:
- "Net Loss 10 million" → converts to "Net Income -10 million"
- All company data unified to same account system
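A minimal sketch of applying such rules (the rule table here is abbreviated, and the `normalize` function name is illustrative):

```python
# (canonical_name, flip_sign) pairs; abbreviated illustration
normalization_rules = {
    "Total Assets": ("Assets", False),
    "Net Loss": ("Net Income", True),  # True means sign reversal
}

def normalize(account: str, amount: int) -> tuple[str, int]:
    # Unknown accounts pass through unchanged
    canonical, flip_sign = normalization_rules.get(account, (account, False))
    return canonical, (-amount if flip_sign else amount)

print(normalize("Net Loss", 10_000_000))     # ('Net Income', -10000000)
print(normalize("Total Assets", 1_234_567))  # ('Assets', 1234567)
```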
Stage 5: Hierarchical Structure Recalculation
Financial statements have hierarchical structure. For example:
```
Assets = Current Assets + Non-current Assets
Current Assets = Cash and Cash Equivalents + Receivables + ...
```
If "Current Assets" value is missing, it can be calculated from sum of sub-items. We implemented this logic to improve data completeness.
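A sketch of that completeness logic, assuming accounts have already been converted to plain dictionaries with integer amounts (the field names follow the schema; the function name is made up):

```python
def fill_missing_parents(items: list[dict]) -> list[dict]:
    """Fill a parent's missing amount with the sum of its sub-accounts."""
    for item in items:
        subs = item.get("sub_accounts") or []
        if item.get("amount") is None and subs:
            known = [s["amount"] for s in subs if s.get("amount") is not None]
            if known:
                item["amount"] = sum(known)
    return items

accounts = [{
    "account": "Current Assets",
    "amount": None,  # missing in the PDF
    "sub_accounts": [
        {"account": "Cash", "amount": 500_000},
        {"account": "Receivables", "amount": 500_000},
    ],
}]
fill_missing_parents(accounts)
print(accounts[0]["amount"])  # 1000000
```

Amounts that are already present are left untouched, so the original extracted figures always take precedence over computed ones.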
Real Application Results
Processing hundreds of financial statement PDFs using this system:
- Time Savings: Completed in hours what would take days manually
- Accuracy: Over 95% accuracy in most cases
- Consistency: All data normalized to same format for easy analysis
Of course it's not perfect. Occasionally LLMs:
- Misinterpret unusually formatted financial statements
- Misrecognize number digits
But only about 10% of total work requires manual review, still a huge efficiency improvement.
Limitations of This Approach
Of course this approach isn't perfect:
- Cost: API costs for large-scale processing
- Dependency: Reliance on external API service
- Validation Needed: Cannot trust 100%, sample validation needed
- Special Cases: Very unstructured financial statements still difficult
Conclusion
Using LLMs enables document processing automation in ways previously unimaginable. Particularly effective for tasks like financial statements that:
- Have similar structure but varying detailed formats
- Require specialized knowledge
- Need large-scale processing
The key is not thinking of LLMs as "magic tools" but utilizing them appropriately:
- Clear data model design
- Specific prompt writing
- Systematic post-processing and validation
Combining these three creates systems actually usable in practice.
Technology Stack
For reference, technologies used in this project:
- LLM: Google Gemini 2.5 Pro
- Language: Python 3.11+
- Main Libraries:
- `google-genai`: Gemini API client
- `pydantic`: Data model definition and validation
- `pandas`: Data processing and analysis
The total code is about 200 lines, very concise, because the LLM handles the complex logic instead.