API Reference

The Financial Data Extractor provides a RESTful API built with FastAPI for managing companies, documents, financial statement extractions, and compiled statements.

Base URL

http://localhost:3030/api/v1

API Documentation

The API includes comprehensive interactive documentation automatically generated by FastAPI:

Swagger UI: http://localhost:3030/docs
ReDoc: http://localhost:3030/redoc
OpenAPI Schema: http://localhost:3030/openapi.json

Common Endpoints

Health Check

GET /healthcheck

Returns the health status of the API.

Response:

{
  "status": "Healthy"
}

List All Endpoints

GET /endpoints/

Returns a list of all available routes in the application.

Response:

{
  "endpoints": [
    {
      "path": "/healthcheck",
      "name": "health_check",
      "methods": ["GET"]
    },
    ...
  ]
}
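As a small illustration of consuming this response, the sketch below filters the route list by HTTP method. It assumes only the `endpoints`, `path`, and `methods` fields shown above; the helper name `routes_for_method` is hypothetical and not part of the API.

```python
from typing import Any


def routes_for_method(payload: dict[str, Any], method: str) -> list[str]:
    """Return the paths of all routes that accept the given HTTP method."""
    wanted = method.upper()
    return [
        route["path"]
        for route in payload.get("endpoints", [])
        if wanted in route.get("methods", [])
    ]


# Sample payload in the documented shape (a single route, for brevity).
sample = {
    "endpoints": [
        {"path": "/healthcheck", "name": "health_check", "methods": ["GET"]},
    ]
}
print(routes_for_method(sample, "get"))
```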

Metrics

GET /metrics

Prometheus metrics endpoint for monitoring and observability.

API Resources

Companies

Manage company information and metadata.

List All Companies

GET /api/v1/companies

Query Parameters:

skip (int, default: 0): Number of records to skip
limit (int, default: 100, max: 100): Maximum number of records to return

Response: Array of company objects

Get Company by ID

GET /api/v1/companies/{company-id}

Path Parameters:

company-id (int): Unique company identifier

Response: Company object

Get Company by Ticker

GET /api/v1/companies/ticker/{ticker}

Path Parameters:

ticker (string): Stock ticker symbol (e.g., “ADYEN”)

Response: Company object

Create Company

POST /api/v1/companies

Request Body:

{
  "name": "AstraZeneca PLC",
  "ir_url": "https://www.astrazeneca.com/investor-relations/annual-reports.html",
  "primary_ticker": "AZN",
  "tickers": [
    {"ticker": "AZN", "exchange": "LSE"},
    {"ticker": "AZN", "exchange": "NASDAQ"}
  ]
}

Note:

primary_ticker is optional but recommended
tickers is an optional array of objects with ticker and exchange fields
ir_url is required and should point to the company’s investor relations page

Response: Created company object
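The field rules in the note above can be sketched as a small client-side payload builder. This is a hypothetical helper, not the API's own validation: it assumes `name` is also required (the note does not say so), and the server remains the authority on what is accepted.

```python
import json
from typing import Optional


def build_company_payload(
    name: str,
    ir_url: str,
    primary_ticker: Optional[str] = None,
    tickers: Optional[list[dict[str, str]]] = None,
) -> dict:
    """Assemble a Create Company body, omitting optional fields left unset."""
    if not ir_url:
        raise ValueError("ir_url is required")
    payload: dict = {"name": name, "ir_url": ir_url}
    if primary_ticker is not None:
        payload["primary_ticker"] = primary_ticker
    if tickers:
        for entry in tickers:
            # Each entry must carry exactly the two documented fields.
            if set(entry) != {"ticker", "exchange"}:
                raise ValueError("each ticker needs 'ticker' and 'exchange'")
        payload["tickers"] = tickers
    return payload


body = build_company_payload(
    name="AstraZeneca PLC",
    ir_url="https://www.astrazeneca.com/investor-relations/annual-reports.html",
    primary_ticker="AZN",
    tickers=[{"ticker": "AZN", "exchange": "LSE"}],
)
print(json.dumps(body, indent=2))
```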

Update Company

PUT /api/v1/companies/{company-id}

Path Parameters:

company-id (int): Unique company identifier

Request Body: Company update object

Response: Updated company object

Delete Company

DELETE /api/v1/companies/{company-id}

Path Parameters:

company-id (int): Unique company identifier

Response: 204 No Content

Documents

Manage PDF documents (annual reports, presentations, etc.) associated with companies.

List Documents by Company

GET /api/v1/documents/companies/{company-id}

Path Parameters:

company-id (int): Unique company identifier

Query Parameters:

skip (int, default: 0): Number of records to skip
limit (int, default: 100, max: 100): Maximum number of records to return

Response: Array of document objects

Get Document by ID

GET /api/v1/documents/{document-id}

Path Parameters:

document-id (int): Unique document identifier

Response: Document object

Get Documents by Company and Fiscal Year

GET /api/v1/documents/companies/{company-id}/fiscal-year/{fiscal-year}

Path Parameters:

company-id (int): Unique company identifier
fiscal-year (int, 1900-2100): Fiscal year

Response: Array of document objects

Get Documents by Company and Type

GET /api/v1/documents/companies/{company-id}/type/{document-type}

Path Parameters:

company-id (int): Unique company identifier
document-type (string): Document type (e.g., “Annual Report”, “Presentation”)

Query Parameters:

skip (int, default: 0): Number of records to skip
limit (int, default: 100, max: 100): Maximum number of records to return

Response: Array of document objects

Create Document

POST /api/v1/documents

Request Body:

{
  "company_id": 1,
  "document_type": "Annual Report",
  "fiscal_year": 2023,
  "filing_date": "2024-03-15",
  "file_path": "/path/to/document.pdf",
  "metadata": {
    "pages": 150,
    "language": "en",
    "source_url": "https://example.com/annual-report-2023.pdf"
  }
}

Response: Created document object

Update Document

PUT /api/v1/documents/{document-id}

Path Parameters:

document-id (int): Unique document identifier

Request Body: Document update object

Response: Updated document object

Delete Document

DELETE /api/v1/documents/{document-id}

Path Parameters:

document-id (int): Unique document identifier

Response: 204 No Content

List PDFs from Storage

GET /api/v1/documents/storage/companies/{company-id}

Path Parameters:

company-id (int): Unique company identifier

Query Parameters:

fiscal_year (int, optional, 1900-2100): Filter by fiscal year

Response: JSON object with list of PDF files from storage

{
  "company_id": 1,
  "fiscal_year": null,
  "prefix": "company_1/",
  "count": 5,
  "files": [
    {
      "object_key": "company_1/2023/annual_report_2023.pdf",
      "size": 5242880,
      "last_modified": "2024-01-15T10:30:00Z"
    }
  ]
}

Description: Lists all PDF files stored in MinIO/local storage for a company. This endpoint queries storage directly rather than the database, which makes it useful for checking which files are actually stored.
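A response in this shape can be summarized client-side, for example to compare the reported `count` with the number of entries returned and to total their sizes. The helper name `summarize_listing` is hypothetical, and the sketch assumes `size` is in bytes, as the sample value suggests.

```python
def summarize_listing(listing: dict) -> dict:
    """Summarize a storage listing: reported count, entries seen, total size."""
    files = listing.get("files", [])
    total_bytes = sum(entry["size"] for entry in files)
    return {
        "reported_count": listing.get("count"),
        "returned_files": len(files),
        "total_mib": round(total_bytes / (1024 * 1024), 2),
    }


# Sample mirrors the documented response, which shows one of its five files.
listing = {
    "company_id": 1,
    "fiscal_year": None,
    "prefix": "company_1/",
    "count": 5,
    "files": [
        {
            "object_key": "company_1/2023/annual_report_2023.pdf",
            "size": 5242880,
            "last_modified": "2024-01-15T10:30:00Z",
        }
    ],
}
print(summarize_listing(listing))
```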

Download PDF from Storage

GET /api/v1/documents/storage/download

Query Parameters:

object_key (string): Object key (path) of the file in storage (URL-encoded)

Response: PDF file as binary response with Content-Type: application/pdf

Description: Downloads a PDF file directly from MinIO/local storage by object key. The object key should be URL-encoded if it contains special characters.

Example:

curl "http://localhost:3030/api/v1/documents/storage/download?object_key=company_1%2F2023%2Fannual_report_2023.pdf" \
  --output document.pdf
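The URL-encoding step can also be done programmatically. The sketch below uses Python's standard `urllib.parse.quote` with `safe=""` so that the `/` separators in the object key are percent-encoded too; the helper name `download_url` is hypothetical.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:3030/api/v1"


def download_url(object_key: str) -> str:
    """Build the storage download URL with a fully percent-encoded object key."""
    # safe="" forces "/" to be encoded as %2F instead of being left as-is.
    encoded = quote(object_key, safe="")
    return f"{BASE_URL}/documents/storage/download?object_key={encoded}"


print(download_url("company_1/2023/annual_report_2023.pdf"))
```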

Extractions

Manage extracted financial statements from documents.

List Extractions by Document

GET /api/v1/extractions/documents/{document-id}

Path Parameters:

document-id (int): Unique document identifier

Response: Array of extraction objects

Get Extraction by ID

GET /api/v1/extractions/{extraction-id}

Path Parameters:

extraction-id (int): Unique extraction identifier

Response: Extraction object

Get Extraction by Document and Statement Type

GET /api/v1/extractions/documents/{document-id}/statement-type/{statement-type}

Path Parameters:

document-id (int): Unique document identifier
statement-type (string): Statement type (e.g., “Income Statement”, “Balance Sheet”, “Cash Flow Statement”)

Response: Extraction object

Create Extraction

POST /api/v1/extractions

Request Body:

{
  "document_id": 1,
  "statement_type": "Income Statement",
  "extracted_data": {
    "Revenue": 1000000,
    "Operating Expenses": 600000,
    "Net Income": 400000
  },
  "confidence_score": 0.95,
  "metadata": {
    "extraction_method": "LLM",
    "model": "openai/gpt-4o-mini",
    "tokens_used": 1500
  }
}

Response: Created extraction object

Update Extraction

PUT /api/v1/extractions/{extraction-id}

Path Parameters:

extraction-id (int): Unique extraction identifier

Request Body: Extraction update object

Response: Updated extraction object

Delete Extraction

DELETE /api/v1/extractions/{extraction-id}

Path Parameters:

extraction-id (int): Unique extraction identifier

Response: 204 No Content

Compiled Statements

Manage compiled multi-year financial statements.

List Compiled Statements by Company

GET /api/v1/compiled-statements/companies/{company-id}

Path Parameters:

company-id (int): Unique company identifier

Response: Array of compiled statement objects

Get Compiled Statement by ID

GET /api/v1/compiled-statements/{compiled-statement-id}

Path Parameters:

compiled-statement-id (int): Unique compiled statement identifier

Response: Compiled statement object

Get Compiled Statement by Company and Statement Type

GET /api/v1/compiled-statements/companies/{company-id}/statement-type/{statement-type}

Path Parameters:

company-id (int): Unique company identifier
statement-type (string): Statement type

Response: Compiled statement object

Create or Update Compiled Statement

POST /api/v1/compiled-statements

Request Body:

{
  "company_id": 1,
  "statement_type": "Income Statement",
  "compiled_data": {
    "Revenue": {
      "2020": 900000,
      "2021": 950000,
      "2022": 1000000,
      "2023": 1050000
    },
    "Net Income": {
      "2020": 350000,
      "2021": 370000,
      "2022": 400000,
      "2023": 420000
    }
  },
  "data_lineage": {
    "2023": {
      "source_document_id": 5,
      "source_extraction_id": 12
    }
  }
}

Response: Compiled statement object
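To make the `compiled_data` layout concrete, the sketch below pivots per-year extraction results into the line-item-by-year shape used in the request body. It only illustrates the data structure; it is not the service's normalization or compilation logic, and the helper name `compile_by_year` is hypothetical.

```python
def compile_by_year(
    extractions: dict[str, dict[str, float]],
) -> dict[str, dict[str, float]]:
    """Pivot {year: {line_item: value}} into {line_item: {year: value}}."""
    compiled: dict[str, dict[str, float]] = {}
    for year in sorted(extractions):
        for line_item, value in extractions[year].items():
            compiled.setdefault(line_item, {})[year] = value
    return compiled


# Values taken from the request body above, grouped by fiscal year.
per_year = {
    "2023": {"Revenue": 1050000, "Net Income": 420000},
    "2022": {"Revenue": 1000000, "Net Income": 400000},
}
print(compile_by_year(per_year))
```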

Update Compiled Statement

PUT /api/v1/compiled-statements/{compiled-statement-id}

Path Parameters:

compiled-statement-id (int): Unique compiled statement identifier

Request Body: Compiled statement update object

Response: Updated compiled statement object

Delete Compiled Statement

DELETE /api/v1/compiled-statements/{compiled-statement-id}

Path Parameters:

compiled-statement-id (int): Unique compiled statement identifier

Response: 204 No Content

Tasks

Manage asynchronous Celery tasks for long-running operations.

For detailed information about task processing, see Task Processing Documentation.

Trigger Company Financial Data Extraction

POST /api/v1/tasks/companies/{company_id}/extract

Path Parameters:

company_id (int): Unique company identifier

Response: Task response with task ID

{
  "task_id": "a00d8c65-c7fd-4360-8f4c-836b0df25f59",
  "status": "PENDING",
  "message": "Financial data extraction started for company 1"
}

Description: Triggers the complete financial data extraction workflow:

Scrapes investor relations website
Discovers and classifies documents
Downloads PDFs
Extracts financial statements
Normalizes and compiles statements

Estimated Duration: 10 minutes - 2 hours
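A minimal sketch of triggering this task from Python with the standard library follows. It builds the request without sending it, and assumes the endpoint takes no request body, since none is documented. The helper names are hypothetical.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:3030/api/v1"


def build_extract_request(company_id: int) -> Request:
    """Build (but do not send) the POST that starts a company extraction."""
    url = f"{BASE_URL}/tasks/companies/{company_id}/extract"
    return Request(url, method="POST")


def task_id_from(body: bytes) -> str:
    """Pull the Celery task ID out of a task response body."""
    return json.loads(body)["task_id"]


req = build_extract_request(1)
print(req.get_method(), req.full_url)
```

With the API running, the request could be sent via `urllib.request.urlopen(req)` and the response body passed to `task_id_from`. Given the estimated duration, the returned task ID is what a client would hold on to rather than waiting on the HTTP call.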

Trigger Investor Relations Scraping

POST /api/v1/tasks/companies/{company_id}/scrape

Path Parameters:

company_id (int): Unique company identifier

Response: Task response with task ID

{
  "task_id": "b12e9d76-d8ae-5471-9f5d-947c1ef36f60",
  "status": "PENDING",
  "message": "Scraping started for company 1"
}

Description: Scrapes the investor relations website to discover PDF documents.

Estimated Duration: 30 seconds - 5 minutes

Trigger Company Statements Recompilation

POST /api/v1/tasks/companies/{company_id}/recompile

Path Parameters:

company_id (int): Unique company identifier

Response: Task response with task ID

Description: Recompiles all financial statements after new extractions. Useful when new documents are processed and statements need updating.

Estimated Duration: 1-5 minutes

Trigger Batch Document Processing

POST /api/v1/tasks/companies/{company_id}/process-documents

Path Parameters:

company_id (int): Unique company identifier

Response: Task response with task ID

{
  "task_id": "c23f0e87-e9bf-6582-0a6e-a58d2ef47f71",
  "status": "PENDING",
  "message": "Batch document processing started for company 1"
}

Description: Processes all documents for a company through the classify, download, and extract steps. This is useful for processing multiple documents in one batch after scraping.

Estimated Duration: 10 minutes - 1 hour (depends on number of documents)

Trigger Document Processing

POST /api/v1/tasks/documents/{document_id}/process

Path Parameters:

document_id (int): Unique document identifier

Response: Task response with task ID

Description: Processes a document end-to-end:

Classifies document type
Downloads PDF (if needed)
Extracts financial statements (for annual reports)

Estimated Duration: 2-10 minutes

Trigger PDF Download

POST /api/v1/tasks/documents/{document_id}/download

Path Parameters:

document_id (int): Unique document identifier

Response: Task response with task ID

Description: Downloads the PDF document from its source URL and stores it locally.

Estimated Duration: 5-30 seconds

Trigger Document Classification

POST /api/v1/tasks/documents/{document_id}/classify

Path Parameters:

document_id (int): Unique document identifier

Response: Task response with task ID

Description: Classifies document by type (annual_report, quarterly_report, etc.) using filename patterns, URL patterns, and content analysis.

Estimated Duration: 1-5 seconds

Trigger Financial Statement Extraction

POST /api/v1/tasks/documents/{document_id}/extract

Path Parameters:

document_id (int): Unique document identifier

Response: Task response with task ID

Description: Extracts financial statements (Income Statement, Balance Sheet, Cash Flow Statement) from PDF using LLM.

Estimated Duration: 2-5 minutes per document

Get Task Status

GET /api/v1/tasks/{task_id}

Path Parameters:

task_id (string): Celery task identifier

Response: Task status with result or error

{
  "task_id": "a00d8c65-c7fd-4360-8f4c-836b0df25f59",
  "status": "SUCCESS",
  "result": {
    "task_id": "...",
    "company_id": 1,
    "status": "success",
    "discovered_count": 12,
    "created_count": 12,
    "documents": [...]
  },
  "error": null
}

Status Values:

PENDING - Task is waiting to be processed
STARTED - Task has started execution
PROGRESS - Task is in progress (check result.meta for details)
SUCCESS - Task completed successfully
FAILURE - Task failed (check error field)
RETRY - Task is being retried
REVOKED - Task was cancelled

Description: Checks the current status and result of a Celery task. Results expire after 1 hour. Use Flower dashboard for persistent task history.
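One way a client might wait on a task is a simple polling loop that stops at a terminal state. The sketch below treats SUCCESS, FAILURE, and REVOKED as terminal, which is an assumption based on the status list above. The helper `wait_for_task`, its parameters, and the injected `fetch_status` callable are all hypothetical; `fetch_status` stands in for a GET to `/api/v1/tasks/{task_id}` so the loop can be exercised without a running server.

```python
import time
from typing import Callable

# Assumed terminal states, based on the documented status values.
TERMINAL_STATES = {"SUCCESS", "FAILURE", "REVOKED"}


def wait_for_task(
    fetch_status: Callable[[], dict],
    interval: float = 2.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the task reaches a terminal state or max_polls is exhausted."""
    for attempt in range(max_polls):
        status = fetch_status()
        if status["status"] in TERMINAL_STATES:
            return status
        # Skip the pause after the final attempt; there is nothing left to wait for.
        if attempt < max_polls - 1:
            sleep(interval)
    raise TimeoutError(f"task still running after {max_polls} polls")
```

Because results expire after 1 hour, a client that polls infrequently would want to finish within that window. The interval and poll count here are placeholders rather than recommended values.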

Common Response Formats

Success Response

All successful responses follow a standard format:

GET/PUT Responses:

{
  "id": 1,
  "name": "Example",
  "created_at": "2024-01-01T00:00:00Z",
  "updated_at": "2024-01-01T00:00:00Z"
}

POST Responses:

{
  "id": 1,
  "name": "Created Example",
  "created_at": "2024-01-01T00:00:00Z",
  "updated_at": "2024-01-01T00:00:00Z"
}

DELETE Responses:

204 No Content (empty body)

Error Response

Error responses include detailed information:

{
  "detail": "Error description",
  "error_code": "ERROR_CODE",
  "timestamp": "2024-01-01T00:00:00Z"
}

Validation Error Response

{
  "detail": [
    {
      "loc": ["body", "name"],
      "msg": "field required",
      "type": "value_error.missing"
    }
  ]
}
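Both error shapes carry a `detail` field: a string for general errors and a list of objects for validation errors. A client can flatten either into readable messages, as in this sketch. The helper name `format_errors` is hypothetical.

```python
def format_errors(payload: dict) -> list[str]:
    """Turn an error or validation-error body into a list of messages."""
    detail = payload.get("detail")
    if not isinstance(detail, list):
        # General error response: detail is a plain description string.
        return [str(detail)]
    messages = []
    for err in detail:
        # loc is a path such as ["body", "name"]; join it into "body.name".
        location = ".".join(str(part) for part in err["loc"])
        messages.append(f"{location}: {err['msg']}")
    return messages


validation_body = {
    "detail": [
        {"loc": ["body", "name"], "msg": "field required", "type": "value_error.missing"}
    ]
}
print(format_errors(validation_body))
```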

HTTP Status Codes

200 OK: Request succeeded
201 Created: Resource created successfully
204 No Content: Resource deleted successfully
400 Bad Request: Invalid request data
404 Not Found: Resource not found
422 Unprocessable Entity: Validation error
500 Internal Server Error: Server error

Authentication

Currently, the API operates without authentication in development. Production deployments should implement OAuth2 with JWT tokens.

Rate Limiting

No rate limits are currently enforced. Production deployments should implement rate limiting to prevent abuse of expensive operations (especially extraction endpoints).

Monitoring

The API exposes Prometheus metrics at /metrics for monitoring:

Request latency (p50, p95, p99)
Request counts by endpoint and method
Error rates by status code
Request and response body sizes

CORS

CORS is configured to allow cross-origin requests only from a configured list of origins. In development, this list is typically set to ["*"], which permits all origins. Production deployments should restrict it to specific domains.

Request Timeout

Default request timeout is 60 seconds. Long-running operations should be handled asynchronously via Celery tasks. See Task Processing Documentation for details.

Examples

Complete Workflow: Adding a Company and Extracting Financial Data

# 1. Create a company
curl -X POST http://localhost:3030/api/v1/companies \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Adyen",
    "ir_url": "https://www.adyen.com/investor-relations",
    "primary_ticker": "ADYEN"
  }'

# 2. Get the company ID from response (e.g., company ID is 1)

# 3. Create a document record
curl -X POST http://localhost:3030/api/v1/documents \
  -H "Content-Type: application/json" \
  -d '{
    "company_id": 1,
    "document_type": "Annual Report",
    "fiscal_year": 2023,
    "filing_date": "2024-03-15",
    "file_path": "/data/pdfs/adyen_annual_report_2023.pdf",
    "metadata": {
      "pages": 150,
      "language": "en"
    }
  }'

# 4. Get the document ID from response (e.g., document ID is 1)

# 5. Create an extraction (typically done by Celery worker)
curl -X POST http://localhost:3030/api/v1/extractions \
  -H "Content-Type: application/json" \
  -d '{
    "document_id": 1,
    "statement_type": "Income Statement",
    "extracted_data": {
      "Revenue": 1000000000,
      "Cost of Revenue": 300000000,
      "Gross Profit": 700000000
    },
    "confidence_score": 0.98
  }'

# 6. Get all companies
curl http://localhost:3030/api/v1/companies

# 7. Get all documents for a company
curl http://localhost:3030/api/v1/documents/companies/1

# 8. Get compiled statements for a company
curl http://localhost:3030/api/v1/compiled-statements/companies/1

API Versioning

The API uses URL-based versioning with the /api/v1 prefix. Future versions will be added under new prefixes (/api/v2, etc.), and earlier versions will remain available so that existing clients keep working.

Pagination

List endpoints support pagination via skip and limit query parameters:

skip: Number of records to skip (default: 0)
limit: Maximum number of records to return (default: 100, max: 100)

Example:

GET /api/v1/companies?skip=20&limit=50
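A paginated URL like the one above can be built with the standard library. This sketch clamps `limit` to the documented maximum of 100 on the client side. The helper name `page_url` is hypothetical, and whether the server rejects or clamps larger values is not stated here.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:3030/api/v1"
MAX_LIMIT = 100  # documented maximum for the limit parameter


def page_url(resource: str, skip: int = 0, limit: int = 100) -> str:
    """Build a list-endpoint URL with skip/limit query parameters."""
    if skip < 0:
        raise ValueError("skip must be non-negative")
    query = urlencode({"skip": skip, "limit": min(limit, MAX_LIMIT)})
    return f"{BASE_URL}/{resource}?{query}"


print(page_url("companies", skip=20, limit=50))
```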

Filtering and Sorting

Currently, filtering and sorting are not implemented. Future versions will add support for:

Filtering by metadata fields
Sorting by creation date, updated date, or other fields
Full-text search on company names and document types

Related Documentation

Database Schema - Data models and relationships
Infrastructure Development - Docker setup and running the API
Backend Overview - Backend architecture and development guide
