Financial Data Extractor

An automated platform that scrapes, classifies, parses, and compiles multi-year financial statements (Income Statement, Balance Sheet, Cash Flow Statement) from European company investor relations websites.

Project Overview

The Financial Data Extractor automates the labor-intensive process of collecting and standardizing financial data from annual reports. It handles:

Web Scraping: Automated discovery and download of annual reports from investor relations websites
Document Classification: Intelligent categorization of PDFs (Annual Reports, Presentations, etc.)
Data Extraction: LLM-powered parsing of financial statements from PDF documents
Normalization: Fuzzy matching and deduplication of line items across multiple years
Compilation: Aggregation of 10 years of financial data into unified views

Core Objectives

Scrape & Classify: Identify and categorize PDFs from investor relations websites
Parse: Extract financial data from Annual Reports using LLM (via OpenRouter)
Compile: Aggregate 10 years of financial data into unified views
Deduplicate: Align and merge similarly-named line items across years
Prioritize Latest: Use restated data from newer reports when available

System Architecture

The Financial Data Extractor follows a modern, layered architecture with clear separation between frontend, API, processing, and data layers:

graph LR
    subgraph "User Interface"
        UI[Next.js Frontend<br/>React 19 + TypeScript]
    end

    subgraph "API & Services"
        API[FastAPI Backend<br/>REST API]
    end

    subgraph "Background Processing"
        QUEUE[Celery Task Queue]
        WORKERS[Celery Workers<br/>Scraping • Extraction • Compilation]
    end

    subgraph "Data Storage"
        DB[(PostgreSQL<br/>Companies, Documents,<br/>Extractions, Statements)]
        CACHE[(Redis<br/>Cache & Message Broker)]
        STORAGE[(MinIO<br/>PDF Storage)]
    end

    subgraph "External Services"
        LLM[OpenRouter<br/>LLM API]
    end

    UI -->|HTTP/REST| API
    API -->|Queue Tasks| QUEUE
    QUEUE --> WORKERS
    API --> DB
    API --> CACHE
    WORKERS --> LLM
    WORKERS --> STORAGE
    WORKERS --> DB
    WORKERS --> CACHE

    classDef ui fill:#e1f5ff,stroke:#01579b,stroke-width:2px
    classDef api fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef processing fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    classDef data fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px
    classDef external fill:#fce4ec,stroke:#880e4f,stroke-width:2px

    class UI ui
    class API api
    class QUEUE,WORKERS processing
    class DB,CACHE,STORAGE data
    class LLM external

Key Components:

Frontend: Next.js 15 application providing user interface and data visualization
Backend API: FastAPI REST API handling requests and business logic
Task Processing: Celery workers for async operations (scraping, extraction, compilation)
Data Layer: PostgreSQL for structured data, Redis for caching, MinIO for file storage
LLM Integration: OpenRouter API gateway for flexible model selection

For detailed architecture information, see the Architecture Overview documentation.

Target Companies

Initial Scope: 6 European companies seeded in database migrations
- AstraZeneca PLC (AZN - LSE, NASDAQ)
- SAP SE (SAP - XETRA, NYSE)
- Siemens AG (SIE - XETRA)
- ASML Holding N.V. (ASML - Euronext Amsterdam, NASDAQ)
- Unilever PLC (ULVR - LSE, UNA - Euronext Amsterdam, UL - NYSE)
- Allianz SE (ALV - XETRA)
Scalable: Architecture supports adding more companies dynamically via API

Data Flow

The system processes financial data through three main phases:

Scraping & Classification - Discover and categorize PDFs from investor relations websites
Parsing & Extraction - Extract financial statements using LLM
Normalization & Compilation - Normalize and compile 10 years of data

See the Data Flow documentation for detailed workflow information.

Technology Decisions

Key technology choices include:

OpenRouter - LLM API gateway for flexible model selection
PostgreSQL - JSONB support for flexible data structures
Celery - Distributed task queue for async processing
FastAPI - High-performance async API framework
Next.js 15 - Modern React framework with Server Components
React Query - Data fetching and caching
MinIO - S3-compatible object storage

See the Technology Decisions documentation for detailed rationale behind each choice.

Technology Stack

Backend

FastAPI - High-performance async web framework
Celery - Distributed task queue for background processing
PostgreSQL - Primary database with JSONB support
Redis - Caching layer and Celery message broker
SQLAlchemy - ORM for database operations
Alembic - Database migrations

Frontend

Next.js 15 - React framework with App Router
React - UI library
TailwindCSS - Utility-first CSS framework
shadcn/ui - Component library

Processing & AI

OpenRouter - LLM API gateway for accessing multiple models (GPT-4o, GPT-4o-mini, Claude 3.5 Sonnet)
PyMuPDF - PDF processing and table extraction
pdfplumber - Alternative PDF text extraction
rapidfuzz - Fuzzy string matching for line item normalization
Crawl4AI - LLM-friendly web crawler for investor relations websites

Monitoring & Observability

The platform includes a comprehensive observability stack for monitoring, metrics, and logging:

Monitoring Stack

Prometheus - Metrics collection and storage (port 9090)
Grafana - Metrics visualization and dashboards (port 3200)
Loki - Log aggregation (port 3100)
Promtail - Log shipper for container logs
Flower - Celery task monitoring (port 5555)
PostgreSQL Exporter - Database metrics (port 9187)
Redis Exporter - Cache and broker metrics (port 9121)

Key Metrics

Business Metrics:

Total companies processed
Total PDFs classified
Statements extracted per day
Data quality scores
Extraction success rates

Technical Metrics:

API latency (p50, p95, p99) - via Prometheus from FastAPI /metrics
Celery queue depth - monitored via Flower and Redis exporter
Task success/failure rates - tracked in Flower and Prometheus
LLM API costs and latency - tracked via custom metrics
Database query performance - PostgreSQL exporter metrics
Redis connection pool usage - Redis exporter metrics
Storage usage (MinIO) - via MinIO console

Alerts:

Task failure rate > 5%
LLM API errors (429, 500, timeout)
Queue backlog > 1000 tasks
Database connection pool exhaustion
Redis memory usage > 80%
Disk space < 10% free

Dashboards

Grafana Dashboards (Pre-configured):

API Performance - Request latency, throughput, error rates from FastAPI
Database Metrics - PostgreSQL connection pool, query performance, transaction rates
Redis Metrics - Memory usage, connection count, command rates
Celery Tasks - Task execution times, success/failure rates, queue depths (via Prometheus)
Infrastructure - CPU, memory, disk, network usage

Log Aggregation:

All container logs aggregated via Promtail → Loki
Query logs via Grafana Explore view
Structured logging from FastAPI with request IDs
Celery worker logs with task context

Access

Grafana: http://localhost:3200 (admin/admin)
Prometheus: http://localhost:9090
Flower: http://localhost:5555
Loki: http://localhost:3100

Security Considerations

Rate Limiting: Prevent abuse of expensive extraction endpoints
Authentication: OAuth2 with JWT tokens
API Keys: Secure storage of OpenRouter API keys (env vars)
Input Validation: Sanitize company URLs to prevent SSRF
File Validation: Verify PDFs, scan for malware
Data Privacy: GDPR compliance for European companies

Quick Start

Get started quickly with the Getting Started Guide which includes:

Installation - Complete setup instructions
First Steps - Tutorial for your first extraction

For detailed setup information, see the Infrastructure Development Setup.

Documentation

This documentation site provides comprehensive guides organized by category:

Getting Started

Getting Started Guide - Quick start and installation
- Installation - Detailed setup instructions
- First Steps - Tutorial for your first extraction

Architecture

Architecture Overview - System design and architecture
- Data Flow - Detailed workflow from scraping to compilation
- Technology Decisions - Rationale behind technology choices

Backend

Backend Overview - FastAPI backend architecture, database, services, and testing
- Backend Architecture - Connection pool management, dependency injection, repository pattern, and exception handling
- Backend Testing - pytest setup, unit tests, integration tests with testcontainers

Frontend

Frontend Overview - Next.js 15 frontend architecture, components, React Query, and testing
- Frontend Architecture - Next.js 15 architecture, React Query, components, and development guide
- Frontend Testing - Vitest unit testing, React Testing Library, and testing strategies
- Frontend DevTools - React Query DevTools, ESLint plugin, and frontend debugging tools

API

API Overview - REST API documentation and reference
- API Reference - Complete REST API documentation with all endpoints, request/response formats, and examples

Database

Database Overview - Database schema, migrations, queries, and operations
- Database Schema - Table structures, relationships, and JSONB formats
- Database Migrations - Alembic migration commands and workflows
- Database Queries - Useful SQL queries for data inspection

Infrastructure

Infrastructure Overview - Docker setup, development environment, and task processing
- Development Setup - Docker Compose setup, service management, and monitoring stack
- Task Processing - Celery task system, workers, Flower monitoring
- Object Storage - MinIO object storage setup and usage

Testing

Testing Overview - Overview of testing strategies for backend and frontend
- Backend Testing - pytest guide for FastAPI backend
- Frontend Testing - Vitest guide for Next.js frontend

Development Tools

Development Tools - IDE configuration and development environment
- Cursor IDE Configuration - Cursor rules, VS Code settings, debug configurations, and tasks

License

Financial Data Extractor is released under the Apache 2.0 License. See the LICENSE file for more details.