Technology Decisions

This document explains the rationale behind key technology choices in the Financial Data Extractor.

Why an LLM for Extraction?

The platform uses OpenRouter as an API gateway to access multiple LLM providers (OpenAI, Anthropic, etc.), allowing flexible model selection:

  1. Flexibility: Handles various report formats without custom parsers
  2. Accuracy: State-of-the-art text understanding from multiple providers
  3. Hierarchy: Understands nested line items and relationships
  4. Multi-language: Handles reports in different European languages without language-specific parsing rules
  5. Model Selection: Choose optimal models per task (e.g., GPT-4o-mini for scraping, GPT-4o for extraction)

Configuration

  • Scraping Model: openai/gpt-4o-mini (fast, cost-effective for URL discovery)
  • Extraction Model: openai/gpt-4o-mini (configurable, can use GPT-4o or Claude 3.5 Sonnet for better accuracy)
  • API Gateway: OpenRouter provides a unified, OpenAI-compatible interface to multiple providers (see the sketch below)
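
For illustration, the sketch below calls an OpenRouter-hosted model through the OpenAI Python SDK by pointing it at OpenRouter's base URL; the environment variable name and the prompt are illustrative, not the project's actual configuration.

    import os
    from openai import OpenAI

    # OpenRouter exposes an OpenAI-compatible API, so the standard SDK works
    # once the base URL points at OpenRouter (env var name is illustrative).
    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],
    )

    response = client.chat.completions.create(
        model="openai/gpt-4o-mini",  # default scraping/extraction model listed above
        messages=[{"role": "user", "content": "List the annual report URLs on this page: ..."}],
    )
    print(response.choices[0].message.content)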

Alternatives Considered

  • Traditional OCR + rule-based parsing: Too brittle; each report format needs its own custom parser
  • LayoutLM/DocAI: Requires labeled training data and model fine-tuning
  • AWS Textract: Good, but less flexible than modern LLMs
  • Direct OpenAI API: Works, but OpenRouter adds provider flexibility and cost management on top

Why OpenRouter?

  • Multi-Provider: Access to OpenAI, Anthropic, Google, and more
  • Cost Optimization: Choose cost-effective models per task
  • Unified API: Single interface for all providers
  • Model Selection: Easy switching between models
  • Analytics: Built-in usage tracking and cost monitoring

Why PostgreSQL?

  1. JSONB: Well suited to storing raw extractions and metadata
  2. Relational: Strong for company/document relationships
  3. Mature: Excellent tooling and performance
  4. ACID: Critical for financial data integrity
  5. Async Support: Excellent async driver (asyncpg) for FastAPI

JSONB Benefits

  • Store flexible financial data structures
  • Efficient querying with GIN indexes (see the example after this list)
  • No schema changes needed for new statement formats
  • Maintains relational benefits for structured data
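
A rough sketch of the JSONB-plus-GIN pattern with asyncpg, assuming a hypothetical extractions table with a raw_data JSONB column (the DSN, table, and column names are illustrative, not the project's schema):

    import asyncio
    import asyncpg

    async def main() -> None:
        # DSN, table, and column names are illustrative
        conn = await asyncpg.connect("postgresql://app:app@localhost:5432/findata")

        # A GIN index accelerates JSONB containment queries (@>) on raw extractions
        await conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_extractions_raw_data "
            "ON extractions USING GIN (raw_data)"
        )

        # Find extractions whose JSONB payload contains a given key/value pair
        rows = await conn.fetch(
            "SELECT id FROM extractions WHERE raw_data @> $1::jsonb",
            '{"statement_type": "income_statement"}',
        )
        print([r["id"] for r in rows])
        await conn.close()

    asyncio.run(main())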

Alternatives Considered

  • MongoDB: Good for JSON, but weaker relational capabilities
  • MySQL: Good relational, but weaker JSON support
  • SQLite: Good for development, but not suitable for production scale

Why Celery?

  1. Async Processing: PDFs take minutes to process; LLM calls alone can take 2-5 minutes per document
  2. Retries: Handles API failures and rate limits gracefully with exponential backoff
  3. Monitoring: Flower dashboard for real-time task tracking and history
  4. Workflows: Complex pipelines (scrape → classify → extract → compile) with task chaining (sketched after this list)
  5. Queue Management: Dedicated queues for different task types (scraping, extraction, compilation, orchestration)
  6. Scalability: Horizontal scaling with multiple workers across queues
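
A minimal sketch of such a chained pipeline, assuming a local Redis broker; the task bodies, names, and the company ID are placeholders rather than the project's real task signatures:

    from celery import Celery, chain

    app = Celery("extractor", broker="redis://localhost:6379/0")  # broker URL is illustrative

    @app.task(bind=True, autoretry_for=(Exception,), retry_backoff=True, max_retries=3)
    def scrape(self, company_id):
        # Placeholder: discover report URLs and pass them to the next task
        return ["https://example.com/annual-report.pdf"]

    @app.task
    def classify(pdf_urls):
        # Placeholder: keep only documents that look like financial reports
        return pdf_urls

    @app.task
    def extract(pdf_urls):
        # Placeholder: download PDFs and run LLM extraction
        return {"revenue": 100.0}

    @app.task
    def compile_statements(raw):
        # Placeholder: normalize the raw extraction
        return raw

    if __name__ == "__main__":
        # scrape -> classify -> extract -> compile; each result feeds the next step
        chain(scrape.s(42), classify.s(), extract.s(), compile_statements.s()).apply_async()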

Task Types

  • Scraping Queue: Web scraping tasks
  • Extraction Queue: PDF processing and LLM extraction
  • Compilation Queue: Normalization and compilation
  • Orchestration Queue: High-level workflow coordination
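
One way to wire tasks to these queues is Celery's task_routes setting, with a worker started per queue; the module paths below are illustrative, not the project's actual layout:

    from celery import Celery

    app = Celery("extractor", broker="redis://localhost:6379/0")

    # Route task name patterns to the dedicated queues; workers then subscribe
    # per queue, e.g. `celery -A extractor worker -Q extraction`
    app.conf.task_routes = {
        "app.tasks.scraping.*": {"queue": "scraping"},
        "app.tasks.extraction.*": {"queue": "extraction"},
        "app.tasks.compilation.*": {"queue": "compilation"},
        "app.tasks.orchestration.*": {"queue": "orchestration"},
    }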

Alternatives Considered

  • RQ (Redis Queue): Simpler but less feature-rich
  • Dramatiq: Good but less mature ecosystem
  • AWS SQS + Lambda: Good for cloud, but vendor lock-in
  • Direct async functions: Don't handle long-running tasks well; no retries, queueing, or persistence

Why FastAPI?

  1. Performance: One of the fastest Python web frameworks
  2. Async Support: Native async/await support
  3. Auto Documentation: Automatic OpenAPI/Swagger generation
  4. Type Safety: Pydantic models for validation
  5. Modern Python: Uses latest Python features (type hints, async)

Key Features Used

  • Dependency Injection: Clean service layer architecture (sketched after this list)
  • Background Tasks: For lightweight async operations
  • WebSockets: For real-time updates (future)
  • Middleware: CORS, request ID, timeout handling
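
A minimal sketch of the dependency-injection pattern with a service layer; the service class, route, and data are illustrative, not the project's actual API:

    from fastapi import Depends, FastAPI

    app = FastAPI()

    # Hypothetical service-layer object that would wrap database access
    class CompanyService:
        async def list_companies(self) -> list[dict]:
            return [{"id": 1, "name": "Example AG"}]

    def get_company_service() -> CompanyService:
        return CompanyService()

    # FastAPI injects the service into the handler via Depends
    @app.get("/companies")
    async def list_companies(service: CompanyService = Depends(get_company_service)):
        return await service.list_companies()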

Alternatives Considered

  • Flask: Solid, but slower and with weaker async support
  • Django: Too heavyweight for this use case and synchronous by default
  • Tornado: Good async support, but a less modern developer experience
  • Starlette: FastAPI is built on Starlette, but FastAPI adds more features

Why Next.js 15?

  1. App Router: Modern routing with Server Components
  2. Server Components: Better performance and SEO
  3. React 19: Latest React features
  4. TypeScript: Full type safety
  5. Built-in Optimizations: Image optimization, code splitting, etc.

Key Features Used

  • Server Components: Initial rendering on the server
  • Client Components: Interactivity with React Query
  • API Routes: Backend integration (if needed)
  • Static Generation: For documentation pages

Alternatives Considered

  • Create React App: Deprecated, no SSR
  • Vite + React: Good but requires manual SSR setup
  • Remix: Good but less mature ecosystem
  • SvelteKit: Good but smaller ecosystem

Why React Query?

  1. Caching: Automatic request caching and deduplication
  2. Background Updates: Automatic data synchronization
  3. Optimistic Updates: Better UX for mutations
  4. DevTools: Excellent debugging tools
  5. Error Handling: Built-in error states and retries

Benefits

  • Automatic Caching: No manual cache management
  • Request Deduplication: Multiple components share same request
  • Background Refetching: Keep data fresh automatically
  • Optimistic Updates: Update UI before server confirms

Alternatives Considered

  • SWR: Good but less feature-rich
  • Apollo Client: Good but GraphQL-focused
  • Redux + RTK Query: Good but more complex
  • Manual fetch: Too much boilerplate

Why MinIO?

  1. S3-Compatible: Easy migration to AWS S3
  2. Local Development: Run locally with Docker
  3. Scalability: Handles large files well
  4. Cost-Effective: Free for local development; can be cheaper than S3 when self-hosted in production
  5. API Compatibility: Works with any S3 client

Use Cases

  • PDF Storage: Store all scraped PDFs (upload sketched after this list)
  • Backup: Easy backup to S3
  • Development: Local development without cloud costs
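
A minimal upload sketch using the MinIO Python client against a local Docker instance; the endpoint, credentials, bucket, and paths are illustrative development defaults, and the same code targets AWS S3 by swapping the endpoint and credentials:

    from minio import Minio

    # Local development defaults; real deployments would pull these from config
    client = Minio(
        "localhost:9000",
        access_key="minioadmin",
        secret_key="minioadmin",
        secure=False,  # local instance without TLS
    )

    if not client.bucket_exists("reports"):
        client.make_bucket("reports")

    # Store a scraped PDF under a per-company prefix (names are illustrative)
    client.fput_object("reports", "acme/2023-annual-report.pdf", "/tmp/2023-annual-report.pdf")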

Alternatives Considered

  • Local File System: Simple but doesn’t scale
  • AWS S3: Good but vendor lock-in, costs
  • Google Cloud Storage: Good but vendor lock-in
  • Azure Blob Storage: Good but vendor lock-in

Why Redis?

  1. Caching: Fast in-memory caching
  2. Celery Broker: Message broker for task queue
  3. Performance: Sub-millisecond latency
  4. Persistence: Optional persistence for durability
  5. Data Structures: Rich data types (sets, lists, etc.)

Use Cases

  • API Response Caching: Cache expensive API calls (sketched after this list)
  • Session Storage: User sessions (future)
  • Rate Limiting: Track API usage (future)
  • Task Queue: Celery message broker
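
A cache-aside sketch with redis-py against a local Redis instance; the key scheme, TTL, and stubbed database call are illustrative:

    import json
    import redis

    r = redis.Redis(host="localhost", port=6379, db=0)

    def fetch_company_from_db(company_id: int) -> dict:
        # Placeholder for the real (expensive) database or API call
        return {"id": company_id, "name": "Example AG"}

    def get_company_cached(company_id: int) -> dict:
        # Serve from Redis when present, otherwise fetch and cache with a 5-minute TTL
        key = f"company:{company_id}"
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
        data = fetch_company_from_db(company_id)
        r.set(key, json.dumps(data), ex=300)
        return data

    print(get_company_cached(1))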

Alternatives Considered

  • Memcached: Good caching but no broker functionality
  • RabbitMQ: Good broker but heavier, not needed for caching
  • PostgreSQL: Can be used as a broker, but slower

Monitoring Stack

Why Prometheus + Grafana?

  • Prometheus: Industry standard for metrics
  • Grafana: De facto standard for visualizing Prometheus metrics
  • Integration: Easy integration with FastAPI, PostgreSQL, and Redis (FastAPI metrics sketched below)
  • Alerts: Built-in alerting capabilities
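
One way to expose FastAPI metrics for Prometheus to scrape, using prometheus_client's ASGI app; the metric and route names are illustrative, and the project may use a dedicated instrumentation library instead:

    from fastapi import FastAPI
    from prometheus_client import Counter, make_asgi_app

    app = FastAPI()

    # Mount a /metrics endpoint for Prometheus to scrape; Grafana then charts the series
    app.mount("/metrics", make_asgi_app())

    # Example application-level metric (name is illustrative)
    DOCUMENTS_EXTRACTED = Counter("documents_extracted_total", "Documents sent through extraction")

    @app.post("/extract")
    async def extract_document():
        DOCUMENTS_EXTRACTED.inc()
        return {"status": "queued"}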

Why Loki?

  • Log Aggregation: Centralized log storage
  • Grafana Integration: View logs in Grafana
  • Label-based Queries: Efficient log queries
  • Cost-Effective: Cheaper than cloud log services

Why Flower?

  • Celery-Specific: Built for Celery monitoring
  • Real-time: Live task monitoring
  • Task History: Persistent task history
  • Easy Setup: Simple Docker setup