Data Flow

The Financial Data Extractor processes financial data through three main phases:

Phase 1: Scraping & Classification

User initiates extraction for a company
FastAPI creates job record → sends to Celery
Worker 1 scrapes investor relations website
Identifies all PDFs (annual reports, presentations, etc.)
Classifies documents by type using:
- Filename patterns
- URL patterns
- Document metadata
- Content sampling
Stores PDFs in MinIO object storage
Creates database records for each PDF with metadata

Scraping Process

The scraping worker uses Crawl4AI with LLM assistance to:

Navigate investor relations websites
Discover PDF links
Classify documents (Annual Report, Quarterly Report, Presentation, etc.)
Extract metadata (fiscal year, publication date)

Document Classification

Documents are classified using multiple strategies:

Filename Patterns: “annual-report-2023.pdf” → Annual Report
URL Patterns: “/annual-reports/” → Annual Report
Content Analysis: LLM samples document content to determine type
Metadata: Publication dates, document titles
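
The filename and URL strategies above can be sketched as a small pattern table. This is a minimal illustration, not the project's actual rule set: the regular expressions, the `PATTERNS` table, and the `classify_by_url` function are all hypothetical names chosen for the example.

```python
import re
from typing import Optional

# Hypothetical pattern table; the real rules are not given in this document.
# Each regex is matched against the lowercased URL (path and filename).
PATTERNS = [
    (re.compile(r"annual[-_ ]?reports?"), "Annual Report"),
    (re.compile(r"quarterly[-_ ]?reports?|\bq[1-4]\b"), "Quarterly Report"),
    (re.compile(r"presentations?"), "Presentation"),
]


def classify_by_url(url: str) -> Optional[str]:
    """Return a document type from filename/URL patterns, or None if nothing matches."""
    lowered = url.lower()
    for pattern, doc_type in PATTERNS:
        if pattern.search(lowered):
            return doc_type
    # No pattern matched: a caller would fall back to content analysis.
    return None
```

Pattern order decides ties, so a file named like both an annual report and a presentation is classified by whichever rule appears first. Returning None rather than a default type keeps the fallback to content analysis explicit.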

Phase 2: Parsing & Extraction

For each Annual Report PDF:
Worker 2 extracts text/tables using:
- PyMuPDF / pdfplumber for text-based PDFs with structured tables
- OCR (Tesseract/AWS Textract) for scanned documents
Sends extracted content + prompt to LLM (via OpenRouter):
- “Extract Income Statement, Balance Sheet, Cash Flow Statement”
- “Return as structured JSON with all line items”
LLM returns structured financial data
Validates data structure and completeness
Stores raw extraction in database (JSON column)

PDF Processing

Structured PDFs:

PyMuPDF extracts tables directly
pdfplumber extracts text with layout information
Tables are converted to structured data

Scanned PDFs:

OCR engine extracts text
Layout analysis identifies table structures
Text is sent to LLM for extraction

LLM Extraction

The extraction process:

Preprocessing: Extract relevant sections (find financial statements)
Prompt Engineering: Detailed prompts with examples and constraints
LLM Call: Send to OpenRouter with structured output format
Validation: Verify JSON structure and required fields
Storage: Save raw extraction with metadata
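
The validation step might look roughly like the following. It is a sketch under assumptions: the top-level keys in `REQUIRED_STATEMENTS` and the `name`/`value` fields on each line item are hypothetical, since the actual response schema is not specified here.

```python
import json

# Assumed top-level keys, loosely based on the three statements named in the prompt.
REQUIRED_STATEMENTS = ("income_statement", "balance_sheet", "cash_flow_statement")


def validate_extraction(raw: str) -> list:
    """Return a list of problems; an empty list means the payload passed."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top-level value must be an object"]

    problems = []
    for key in REQUIRED_STATEMENTS:
        items = payload.get(key)
        if not isinstance(items, list) or not items:
            problems.append(f"{key}: missing or empty")
            continue
        for i, item in enumerate(items):
            # Assumed line-item shape: an object with 'name' and 'value'.
            if not isinstance(item, dict) or "name" not in item or "value" not in item:
                problems.append(f"{key}[{i}]: needs 'name' and 'value'")
    return problems
```

Collecting every problem instead of stopping at the first one makes the stored failure record more useful when an extraction has to be reviewed or retried.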

Extraction Metadata

Each extraction includes:

Confidence Score: Estimated reliability of the extraction
Model Used: Which LLM produced the extraction
Tokens Used: Token count, recorded for cost tracking
Processing Time: Duration of the extraction, recorded for performance monitoring
Data Lineage: Source document information
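
One way to carry these fields alongside an extraction is a small dataclass. The class and field names below are hypothetical, and the 0 to 1 confidence range is an assumption taken from the example values shown in the compiled statement structure.

```python
from dataclasses import dataclass, asdict


@dataclass
class ExtractionMetadata:
    """Hypothetical container for the per-extraction metadata listed above."""
    confidence_score: float   # assumed to lie in [0, 1]
    model_used: str           # model identifier as passed to the LLM provider
    tokens_used: int          # for cost tracking
    processing_time_s: float  # wall-clock seconds
    source_document: str      # data lineage, e.g. "document_5"

    def __post_init__(self):
        if not 0.0 <= self.confidence_score <= 1.0:
            raise ValueError("confidence_score must be between 0 and 1")
        if self.tokens_used < 0 or self.processing_time_s < 0:
            raise ValueError("tokens_used and processing_time_s must be non-negative")
```

`asdict(meta)` turns an instance into a plain dict, which is convenient for storing it next to the raw extraction in a JSON column.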

Phase 3: Normalization & Compilation

For each statement type across all years:
Collect all line items from all reports
Normalize line item names:
- “Revenue” vs “Total Revenue” vs “Revenues”
- Apply fuzzy matching + manual mappings
Detect restated data:
- If the 2024 report restates 2022 and 2023 figures, use those values instead of the ones in the original 2022 and 2023 reports
Build a unified table with one column per fiscal year (10 years)
Fill in data prioritizing latest sources
Store compiled view in database
Generate metadata:
- Data lineage (which report each value came from)
- Confidence scores
- Gaps or inconsistencies

Line Item Normalization

Fuzzy Matching:

Uses rapidfuzz library for string similarity
Groups similar line items (e.g., “Revenue”, “Revenues”, “Total Revenue”)
Configurable similarity threshold

Manual Mappings:

User-defined overrides for specific mappings
Priority over fuzzy matching
Company-specific mappings
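
The interplay between manual mappings and fuzzy matching can be sketched as follows. To stay dependency-free, this example swaps in the standard library's `difflib.SequenceMatcher` for rapidfuzz; `rapidfuzz.fuzz.ratio` plays a similar role but reports scores on a 0 to 100 scale. The canonical names, the `MANUAL` table, and the 0.8 threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical canonical line items and manual overrides.
CANONICAL = ["Revenue", "Cost of Goods Sold", "Net Income"]
MANUAL = {
    "turnover": "Revenue",       # synonym that string similarity cannot catch
    "total revenue": "Revenue",  # scores below the 0.8 default used here
}


def _clean(name: str) -> str:
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(name.lower().split())


def normalize(name: str, threshold: float = 0.8) -> Optional[str]:
    """Map a raw line item name to a canonical one, or None if unmatched."""
    key = _clean(name)
    # Manual mappings take priority over fuzzy matching.
    if key in MANUAL:
        return MANUAL[key]
    best, best_score = None, 0.0
    for candidate in CANONICAL:
        score = SequenceMatcher(None, key, _clean(candidate)).ratio()  # 0.0 to 1.0
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None
```

Checking the manual table first means an override always wins, even when fuzzy matching would have picked a different canonical name. Unmatched names return None so they can be surfaced for review instead of being silently merged.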

Restatement Detection:

Newer reports often contain restated historical data
System detects and prioritizes restated values
Maintains data lineage for audit trail
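
A minimal sketch of the prioritization rule: for each fiscal year, take the value from the most recent report that covers it, and record which document it came from. The observation dict keys (`fiscal_year`, `report_year`, `document_id`, `value`) are hypothetical names for this example.

```python
from collections import defaultdict


def resolve_restatements(observations):
    """Pick one value per fiscal year, preferring the latest report.

    observations: iterable of dicts with the (assumed) keys
    fiscal_year, report_year, document_id, value.
    """
    by_year = defaultdict(list)
    for obs in observations:
        by_year[obs["fiscal_year"]].append(obs)

    resolved = {}
    for fiscal_year, candidates in by_year.items():
        latest = max(candidates, key=lambda o: o["report_year"])
        resolved[fiscal_year] = {
            "value": latest["value"],
            "source": latest["document_id"],  # lineage for the audit trail
            # Restated if any other report gave a different figure for this year.
            "restated": any(o["value"] != latest["value"] for o in candidates),
        }
    return resolved
```

For instance, if a 2022 report gives 100 for fiscal 2022 and a 2024 report gives 105 for the same year, the result for 2022 carries the value 105, the 2024 report's document ID, and `restated` set to True.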

Compilation Process

Collect: Gather all extractions for a company and statement type
Normalize: Apply fuzzy matching and manual mappings
Merge: Combine line items across all years
Prioritize: Use latest sources for each value
Validate: Check for consistency and completeness
Store: Save compiled statement with metadata

Compiled Statement Structure

{
  "company_id": 1,
  "statement_type": "Income Statement",
  "data": {
    "line_items": [
      {
        "name": "Revenue",
        "years": {
          "2015": 1000000,
          "2016": 1100000,
          ...
        },
        "metadata": {
          "sources": {
            "2015": "document_5",
            "2016": "document_6"
          },
          "confidence": {
            "2015": 0.95,
            "2016": 0.98
          }
        }
      }
    ]
  }
}

Error Handling & Retries

Scraping Errors

Network timeouts → Retry with exponential backoff
Rate limiting → Wait and retry
Invalid URLs → Log and skip
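
Retrying with exponential backoff could be sketched like this. The base delay, cap, retry count, and the exception types treated as retryable are all assumptions; the `sleep` parameter is injectable only so the function can be exercised without actually waiting.

```python
import time


def backoff_delays(retries, base=1.0, cap=60.0):
    """Delays that double on each attempt, limited to `cap` seconds."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def retry_with_backoff(func, retries=4, base=1.0, cap=60.0,
                       sleep=time.sleep,
                       retry_on=(TimeoutError, ConnectionError)):
    """Call func(); on a retryable error, wait and try again.

    Makes at most retries + 1 calls. The final call is not guarded,
    so its exception propagates to the caller.
    """
    for delay in backoff_delays(retries, base, cap):
        try:
            return func()
        except retry_on:
            sleep(delay)
    return func()
```

Production retry loops often add random jitter to each delay so that many workers do not retry in lockstep; it is left out here to keep the delays easy to follow.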

Extraction Errors

LLM API errors → Retry up to 3 times
Invalid JSON → Log error, mark extraction as failed
Timeout → Retry with longer timeout
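
The three extraction policies can be combined in one loop, sketched below. `call_llm` and `LLMApiError` are hypothetical stand-ins for whatever client and exception type the worker uses, and "retry up to 3 times" is read here as three retries after the first attempt. Doubling the timeout is likewise an assumed policy.

```python
import json


class LLMApiError(Exception):
    """Hypothetical error raised by the LLM client on API failures."""


def run_extraction(call_llm, prompt, max_retries=3, timeout=30.0):
    """call_llm(prompt, timeout) -> str is an assumed client signature."""
    last_error = None
    for _ in range(max_retries + 1):
        try:
            raw = call_llm(prompt, timeout)
        except TimeoutError as exc:
            last_error = exc
            timeout *= 2  # retry with a longer timeout
            continue
        except LLMApiError as exc:
            last_error = exc
            continue
        try:
            return {"status": "completed", "data": json.loads(raw)}
        except json.JSONDecodeError as exc:
            # Invalid JSON is not retried: record the failure and stop.
            return {"status": "failed", "error": f"invalid JSON: {exc.msg}"}
    return {"status": "failed", "error": f"gave up after retries: {last_error!r}"}
```

Returning a status dict rather than raising keeps the failed extraction recordable in the database alongside the reason it failed.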

Compilation Errors

Missing data → Mark as gap, continue compilation
Inconsistent data → Flag for manual review
Validation errors → Log and skip invalid entries
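
Marking gaps without stopping compilation can be as simple as listing the fiscal years that have no value. The function name is hypothetical; the string year keys follow the compiled statement structure shown earlier.

```python
def find_gaps(years, first_year, last_year):
    """Return fiscal years in [first_year, last_year] with no value.

    years: mapping of year strings to values, as in a line item's
    "years" object. A year counts as a gap if it is absent or None.
    """
    return [
        year
        for year in range(first_year, last_year + 1)
        if years.get(str(year)) is None
    ]
```

For example, `find_gaps({"2015": 1000000, "2016": 1100000}, 2015, 2017)` reports 2017 as a gap, and the compiled statement can still be stored with that gap recorded in its metadata.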

Performance Considerations

Caching

Redis caches frequently accessed data
PDF content cached after first extraction
Compiled statements cached for fast retrieval

Parallelization

Multiple Celery workers process tasks in parallel
Documents are processed independently, so document-level tasks can run in parallel
Statement compilation parallelized by statement type

Optimization

LLM responses cached to avoid duplicate extractions
Incremental compilation (only recompile changed statements)
Database indexes for fast queries
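
Caching LLM responses requires a key that changes whenever the model, prompt, or document changes. The sketch below hashes all three; a plain dict stands in for Redis, and the `llm:` key prefix and function names are assumptions.

```python
import hashlib
import json


def cache_key(model: str, prompt: str, document_bytes: bytes) -> str:
    """Derive a deterministic key from everything that affects the response."""
    digest = hashlib.sha256()
    for part in (model.encode(), prompt.encode(), document_bytes):
        # Length prefix so ("ab", "c") and ("a", "bc") hash differently.
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return "llm:" + digest.hexdigest()


def cached_extract(cache, model, prompt, document_bytes, extract):
    """Return a cached result if present; otherwise call extract() and store it.

    cache: any dict-like store (a stand-in for Redis here).
    extract(model, prompt, document_bytes) -> JSON-serializable result.
    """
    key = cache_key(model, prompt, document_bytes)
    if key in cache:
        return json.loads(cache[key])
    result = extract(model, prompt, document_bytes)
    cache[key] = json.dumps(result)
    return result
```

Including the prompt in the key means a prompt change invalidates old entries automatically, which avoids serving extractions produced under outdated instructions.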

Data Quality

Validation Steps

Schema Validation: Verify JSON structure matches expected schema
Business Logic Validation: Check calculations (e.g., Assets = Liabilities + Equity)
Consistency Checks: Compare values across years for anomalies
Completeness Checks: Verify required fields are present
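
Two of these checks are easy to illustrate: the balance sheet identity and a year-over-year anomaly flag. The tolerances and the 50% change threshold below are illustrative assumptions, not values from the system.

```python
import math


def balance_sheet_balances(assets, liabilities, equity,
                           rel_tol=1e-6, abs_tol=0.5):
    """Check Assets = Liabilities + Equity, allowing for rounding in reports."""
    return math.isclose(assets, liabilities + equity,
                        rel_tol=rel_tol, abs_tol=abs_tol)


def flag_anomalies(series, max_change=0.5):
    """Return years whose value moved more than max_change versus the prior year.

    series: mapping of year -> value. Years with a zero or missing
    prior value are skipped because the relative change is undefined.
    """
    flagged = []
    years = sorted(series)
    for prev, cur in zip(years, years[1:]):
        base = series[prev]
        if base and abs(series[cur] - base) / abs(base) > max_change:
            flagged.append(cur)
    return flagged
```

A flagged year is not necessarily wrong, since acquisitions or divestitures cause real jumps, so a flag is better treated as a prompt for manual review than as a rejection.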

Quality Metrics

Confidence Scores: Per-extraction and per-value confidence
Coverage: Percentage of expected line items extracted
Accuracy: Manual verification results (when available)
Consistency: Variance in values across reports
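
As an illustration of the coverage metric, the percentage can be computed from the set of expected line items and the set actually extracted. The function name and the choice to ignore unexpected extras are assumptions.

```python
def coverage(extracted, expected):
    """Percentage of expected line items that were extracted.

    Items in `extracted` that are not in `expected` are ignored.
    """
    expected_set = set(expected)
    if not expected_set:
        raise ValueError("expected line items must not be empty")
    found = expected_set & set(extracted)
    return 100.0 * len(found) / len(expected_set)
```

Comparing normalized names rather than raw ones makes the figure meaningful, since "Revenues" and "Revenue" should count as the same item.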

Related Documentation

Technology Decisions - Why we chose each component
Task Processing - Celery task system details
Database Schema - Data storage structure