API Reference

The Financial Data Extractor provides a RESTful API built with FastAPI for managing companies, documents, financial statement extractions, and compiled statements.

Base URL

http://localhost:3030/api/v1

API Documentation

The API includes comprehensive interactive documentation automatically generated by FastAPI:

  • Swagger UI: http://localhost:3030/docs
  • ReDoc: http://localhost:3030/redoc
  • OpenAPI Schema: http://localhost:3030/openapi.json

Common Endpoints

The paths in this section are given relative to the server root (http://localhost:3030), not the /api/v1 base URL.

Health Check

GET /healthcheck

Returns the health status of the API.

Response:

{
  "status": "Healthy"
}
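
A client can use this response as a readiness check. The Python sketch below shows one way to interpret it; the function name `is_healthy` is hypothetical, and it assumes the endpoint answers with HTTP 200 when healthy and that the caller has already decoded the JSON body.

```python
def is_healthy(status_code: int, body: dict) -> bool:
    """Return True only for HTTP 200 with the documented {"status": "Healthy"} body."""
    return status_code == 200 and body.get("status") == "Healthy"


# Documented healthy response.
print(is_healthy(200, {"status": "Healthy"}))
# Any other status code or body is treated as unhealthy.
print(is_healthy(503, {"status": "Healthy"}))
```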

List All Endpoints

GET /endpoints/

Returns a list of all available routes in the application.

Response:

{
  "endpoints": [
    {
      "path": "/healthcheck",
      "name": "health_check",
      "methods": ["GET"]
    },
    ...
  ]
}

Metrics

GET /metrics

Prometheus metrics endpoint for monitoring and observability.

API Resources

Companies

Manage company information and metadata.

List All Companies

GET /api/v1/companies

Query Parameters:

  • skip (int, default: 0): Number of records to skip
  • limit (int, default: 100, max: 100): Maximum number of records to return

Response: Array of company objects

Get Company by ID

GET /api/v1/companies/{company-id}

Path Parameters:

  • company-id (int): Unique company identifier

Response: Company object

Get Company by Ticker

GET /api/v1/companies/ticker/{ticker}

Path Parameters:

  • ticker (string): Stock ticker symbol (e.g., “ADYEN”)

Response: Company object

Create Company

POST /api/v1/companies

Request Body:

{
  "name": "AstraZeneca PLC",
  "ir_url": "https://www.astrazeneca.com/investor-relations/annual-reports.html",
  "primary_ticker": "AZN",
  "tickers": [
    {"ticker": "AZN", "exchange": "LSE"},
    {"ticker": "AZN", "exchange": "NASDAQ"}
  ]
}

Note:

  • primary_ticker is optional but recommended
  • tickers is an optional array of objects with ticker and exchange fields
  • ir_url is required and should point to the company’s investor relations page

Response: Created company object
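
The field rules in the note above can be checked on the client before sending the request. This is a minimal sketch, not the server's validation logic: `build_company_payload` is a hypothetical helper, and treating `name` as required is an assumption based on the example body.

```python
import json
from typing import Optional


def build_company_payload(
    name: str,
    ir_url: str,
    primary_ticker: Optional[str] = None,
    tickers: Optional[list] = None,
) -> dict:
    """Assemble a Create Company body, leaving out optional fields that are unset."""
    if not ir_url:
        raise ValueError("ir_url is required")
    payload = {"name": name, "ir_url": ir_url}
    if primary_ticker is not None:
        payload["primary_ticker"] = primary_ticker
    if tickers:
        for entry in tickers:
            # Each entry needs the two documented fields: ticker and exchange.
            if not {"ticker", "exchange"} <= set(entry):
                raise ValueError(f"ticker entry is missing a field: {entry!r}")
        payload["tickers"] = tickers
    return payload


body = build_company_payload(
    name="AstraZeneca PLC",
    ir_url="https://www.astrazeneca.com/investor-relations/annual-reports.html",
    primary_ticker="AZN",
    tickers=[{"ticker": "AZN", "exchange": "LSE"}],
)
print(json.dumps(body, indent=2))
```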

Update Company

PUT /api/v1/companies/{company-id}

Path Parameters:

  • company-id (int): Unique company identifier

Request Body: Company update object

Response: Updated company object

Delete Company

DELETE /api/v1/companies/{company-id}

Path Parameters:

  • company-id (int): Unique company identifier

Response: 204 No Content


Documents

Manage PDF documents (annual reports, presentations, etc.) associated with companies.

List Documents by Company

GET /api/v1/documents/companies/{company-id}

Path Parameters:

  • company-id (int): Unique company identifier

Query Parameters:

  • skip (int, default: 0): Number of records to skip
  • limit (int, default: 100, max: 100): Maximum number of records to return

Response: Array of document objects

Get Document by ID

GET /api/v1/documents/{document-id}

Path Parameters:

  • document-id (int): Unique document identifier

Response: Document object

Get Documents by Company and Fiscal Year

GET /api/v1/documents/companies/{company-id}/fiscal-year/{fiscal-year}

Path Parameters:

  • company-id (int): Unique company identifier
  • fiscal-year (int, 1900-2100): Fiscal year

Response: Array of document objects
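
Because fiscal-year is range-checked, a client may want to validate it before building the URL. The sketch below assumes the 1900-2100 range is inclusive at both ends; `documents_by_fiscal_year_url` is a hypothetical helper, not part of the API.

```python
BASE_URL = "http://localhost:3030/api/v1"


def documents_by_fiscal_year_url(company_id: int, fiscal_year: int) -> str:
    """Build the documents-by-fiscal-year URL, rejecting years outside 1900-2100."""
    if not 1900 <= fiscal_year <= 2100:
        raise ValueError(f"fiscal_year must be between 1900 and 2100, got {fiscal_year}")
    return f"{BASE_URL}/documents/companies/{company_id}/fiscal-year/{fiscal_year}"


print(documents_by_fiscal_year_url(1, 2023))
```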

Get Documents by Company and Type

GET /api/v1/documents/companies/{company-id}/type/{document-type}

Path Parameters:

  • company-id (int): Unique company identifier
  • document-type (string): Document type (e.g., “Annual Report”, “Presentation”)

Query Parameters:

  • skip (int, default: 0): Number of records to skip
  • limit (int, default: 100, max: 100): Maximum number of records to return

Response: Array of document objects

Create Document

POST /api/v1/documents

Request Body:

{
  "company_id": 1,
  "document_type": "Annual Report",
  "fiscal_year": 2023,
  "filing_date": "2024-03-15",
  "file_path": "/path/to/document.pdf",
  "metadata": {
    "pages": 150,
    "language": "en",
    "source_url": "https://example.com/annual-report-2023.pdf"
  }
}

Response: Created document object

Update Document

PUT /api/v1/documents/{document-id}

Path Parameters:

  • document-id (int): Unique document identifier

Request Body: Document update object

Response: Updated document object

Delete Document

DELETE /api/v1/documents/{document-id}

Path Parameters:

  • document-id (int): Unique document identifier

Response: 204 No Content

List PDFs from Storage

GET /api/v1/documents/storage/companies/{company-id}

Path Parameters:

  • company-id (int): Unique company identifier

Query Parameters:

  • fiscal_year (int, optional, 1900-2100): Filter by fiscal year

Response: JSON object with list of PDF files from storage

{
  "company_id": 1,
  "fiscal_year": null,
  "prefix": "company_1/",
  "count": 5,
  "files": [
    {
      "object_key": "company_1/2023/annual_report_2023.pdf",
      "size": 5242880,
      "last_modified": "2024-01-15T10:30:00Z"
    }
  ]
}

Description: Lists all PDF files stored in MinIO/local storage for a company. This queries storage directly rather than the database, which makes it useful for checking which files are actually stored.
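
A listing like the one above is easy to summarize on the client. The sketch below assumes `size` is reported in bytes (the documentation does not state the unit); `summarize_listing` and the sample data are illustrative only.

```python
def summarize_listing(listing: dict) -> dict:
    """Summarize a storage listing: number of files, total size in MiB, sorted keys."""
    files = listing.get("files", [])
    total_bytes = sum(entry["size"] for entry in files)
    return {
        "listed": len(files),
        "total_mib": round(total_bytes / (1024 * 1024), 2),
        "keys": sorted(entry["object_key"] for entry in files),
    }


# Sample shaped like the documented response, reduced to a single file.
sample = {
    "company_id": 1,
    "fiscal_year": None,
    "prefix": "company_1/",
    "count": 1,
    "files": [
        {
            "object_key": "company_1/2023/annual_report_2023.pdf",
            "size": 5242880,
            "last_modified": "2024-01-15T10:30:00Z",
        }
    ],
}
summary = summarize_listing(sample)
print(summary)
```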


Download PDF from Storage

GET /api/v1/documents/storage/download

Query Parameters:

  • object_key (string): Object key (path) of the file in storage (URL-encoded)

Response: PDF file as binary response with Content-Type: application/pdf

Description: Downloads a PDF file directly from MinIO/local storage by object key. The object key should be URL-encoded if it contains special characters.

Example:

curl "http://localhost:3030/api/v1/documents/storage/download?object_key=company_1%2F2023%2Fannual_report_2023.pdf" \
  --output document.pdf
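
The URL encoding used in the curl command can be produced programmatically. This Python sketch uses the standard library's `urllib.parse.quote`; `download_url` is a hypothetical helper name.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:3030/api/v1"


def download_url(object_key: str) -> str:
    """Build the download URL with the object key percent-encoded."""
    # safe="" makes quote() encode "/" as %2F as well, matching the curl example.
    return f"{BASE_URL}/documents/storage/download?object_key={quote(object_key, safe='')}"


url = download_url("company_1/2023/annual_report_2023.pdf")
print(url)
```

By default `quote` leaves "/" unescaped, which is why `safe=""` is passed explicitly here.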

Extractions

Manage extracted financial statements from documents.

List Extractions by Document

GET /api/v1/extractions/documents/{document-id}

Path Parameters:

  • document-id (int): Unique document identifier

Response: Array of extraction objects

Get Extraction by ID

GET /api/v1/extractions/{extraction-id}

Path Parameters:

  • extraction-id (int): Unique extraction identifier

Response: Extraction object

Get Extraction by Document and Statement Type

GET /api/v1/extractions/documents/{document-id}/statement-type/{statement-type}

Path Parameters:

  • document-id (int): Unique document identifier
  • statement-type (string): Statement type (e.g., “Income Statement”, “Balance Sheet”, “Cash Flow Statement”)

Response: Extraction object

Create Extraction

POST /api/v1/extractions

Request Body:

{
  "document_id": 1,
  "statement_type": "Income Statement",
  "extracted_data": {
    "Revenue": 1000000,
    "Operating Expenses": 600000,
    "Net Income": 400000
  },
  "confidence_score": 0.95,
  "metadata": {
    "extraction_method": "LLM",
    "model": "openai/gpt-4o-mini",
    "tokens_used": 1500
  }
}

Response: Created extraction object

Update Extraction

PUT /api/v1/extractions/{extraction-id}

Path Parameters:

  • extraction-id (int): Unique extraction identifier

Request Body: Extraction update object

Response: Updated extraction object

Delete Extraction

DELETE /api/v1/extractions/{extraction-id}

Path Parameters:

  • extraction-id (int): Unique extraction identifier

Response: 204 No Content


Compiled Statements

Manage compiled multi-year financial statements.

List Compiled Statements by Company

GET /api/v1/compiled-statements/companies/{company-id}

Path Parameters:

  • company-id (int): Unique company identifier

Response: Array of compiled statement objects

Get Compiled Statement by ID

GET /api/v1/compiled-statements/{compiled-statement-id}

Path Parameters:

  • compiled-statement-id (int): Unique compiled statement identifier

Response: Compiled statement object

Get Compiled Statement by Company and Statement Type

GET /api/v1/compiled-statements/companies/{company-id}/statement-type/{statement-type}

Path Parameters:

  • company-id (int): Unique company identifier
  • statement-type (string): Statement type

Response: Compiled statement object

Create or Update Compiled Statement

POST /api/v1/compiled-statements

Request Body:

{
  "company_id": 1,
  "statement_type": "Income Statement",
  "compiled_data": {
    "Revenue": {
      "2020": 900000,
      "2021": 950000,
      "2022": 1000000,
      "2023": 1050000
    },
    "Net Income": {
      "2020": 350000,
      "2021": 370000,
      "2022": 400000,
      "2023": 420000
    }
  },
  "data_lineage": {
    "2023": {
      "source_document_id": 5,
      "source_extraction_id": 12
    }
  }
}

Response: Compiled statement object
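
The `compiled_data` shape is a pivot of per-year figures: line item first, then year as a string key. The sketch below illustrates only that reshaping. It is not the service's compilation logic, which may also normalize line-item names; `compile_by_year` is a hypothetical helper.

```python
def compile_by_year(extractions: dict) -> dict:
    """Pivot {year: {line_item: value}} into {line_item: {"year": value}}."""
    compiled: dict = {}
    for year in sorted(extractions):
        for line_item, value in extractions[year].items():
            # Years become string keys, as in the documented request body.
            compiled.setdefault(line_item, {})[str(year)] = value
    return compiled


per_year = {
    2022: {"Revenue": 1000000, "Net Income": 400000},
    2023: {"Revenue": 1050000, "Net Income": 420000},
}
compiled_data = compile_by_year(per_year)
print(compiled_data)
```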

Update Compiled Statement

PUT /api/v1/compiled-statements/{compiled-statement-id}

Path Parameters:

  • compiled-statement-id (int): Unique compiled statement identifier

Request Body: Compiled statement update object

Response: Updated compiled statement object

Delete Compiled Statement

DELETE /api/v1/compiled-statements/{compiled-statement-id}

Path Parameters:

  • compiled-statement-id (int): Unique compiled statement identifier

Response: 204 No Content


Tasks

Manage asynchronous Celery tasks for long-running operations.

For detailed information about task processing, see Task Processing Documentation.

Trigger Company Financial Data Extraction

POST /api/v1/tasks/companies/{company_id}/extract

Path Parameters:

  • company_id (int): Unique company identifier

Response: Task response with task ID

{
  "task_id": "a00d8c65-c7fd-4360-8f4c-836b0df25f59",
  "status": "PENDING",
  "message": "Financial data extraction started for company 1"
}

Description: Triggers the complete financial data extraction workflow:

  1. Scrapes investor relations website
  2. Discovers and classifies documents
  3. Downloads PDFs
  4. Extracts financial statements
  5. Normalizes and compiles statements

Estimated Duration: 10 minutes - 2 hours
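
Because the workflow can run for hours, the POST returns immediately with a task ID rather than waiting for the result. The sketch below only prepares the request with Python's standard library so it can be inspected without a running server; `build_extract_request` is a hypothetical helper, and the commented lines show how it might be sent.

```python
import urllib.request

BASE_URL = "http://localhost:3030/api/v1"


def build_extract_request(company_id: int) -> urllib.request.Request:
    """Prepare (but do not send) the POST that triggers extraction for a company."""
    url = f"{BASE_URL}/tasks/companies/{company_id}/extract"
    return urllib.request.Request(url, method="POST")


request = build_extract_request(1)
print(request.get_method(), request.full_url)

# To send it against a running server (requires `import json`):
# with urllib.request.urlopen(request, timeout=60) as response:
#     task_id = json.load(response)["task_id"]
```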


Trigger Investor Relations Scraping

POST /api/v1/tasks/companies/{company_id}/scrape

Path Parameters:

  • company_id (int): Unique company identifier

Response: Task response with task ID

{
  "task_id": "b12e9d76-d8ae-5471-9f5d-947c1ef36f60",
  "status": "PENDING",
  "message": "Scraping started for company 1"
}

Description: Scrapes the investor relations website to discover PDF documents.

Estimated Duration: 30 seconds - 5 minutes


Trigger Company Statements Recompilation

POST /api/v1/tasks/companies/{company_id}/recompile

Path Parameters:

  • company_id (int): Unique company identifier

Response: Task response with task ID

Description: Recompiles all of a company's financial statements. Use it after new documents have been processed and the compiled statements need updating.

Estimated Duration: 1-5 minutes


Trigger Batch Document Processing

POST /api/v1/tasks/companies/{company_id}/process-documents

Path Parameters:

  • company_id (int): Unique company identifier

Response: Task response with task ID

{
  "task_id": "c23f0e87-e9bf-6582-0a6e-a58d2ef47f71",
  "status": "PENDING",
  "message": "Batch document processing started for company 1"
}

Description: Processes all documents for a company through classify, download, and extract steps. This is useful for batch processing multiple documents after scraping.

Estimated Duration: 10 minutes - 1 hour (depends on number of documents)


Trigger Document Processing

POST /api/v1/tasks/documents/{document_id}/process

Path Parameters:

  • document_id (int): Unique document identifier

Response: Task response with task ID

Description: Processes a document end-to-end:

  1. Classifies document type
  2. Downloads PDF (if needed)
  3. Extracts financial statements (for annual reports)

Estimated Duration: 2-10 minutes


Trigger PDF Download

POST /api/v1/tasks/documents/{document_id}/download

Path Parameters:

  • document_id (int): Unique document identifier

Response: Task response with task ID

Description: Downloads the PDF document from its URL and stores it locally.

Estimated Duration: 5-30 seconds


Trigger Document Classification

POST /api/v1/tasks/documents/{document_id}/classify

Path Parameters:

  • document_id (int): Unique document identifier

Response: Task response with task ID

Description: Classifies the document by type (annual_report, quarterly_report, etc.) using filename patterns, URL patterns, and content analysis.

Estimated Duration: 1-5 seconds


Trigger Financial Statement Extraction

POST /api/v1/tasks/documents/{document_id}/extract

Path Parameters:

  • document_id (int): Unique document identifier

Response: Task response with task ID

Description: Extracts financial statements (Income Statement, Balance Sheet, Cash Flow Statement) from the PDF using an LLM.

Estimated Duration: 2-5 minutes per document


Get Task Status

GET /api/v1/tasks/{task_id}

Path Parameters:

  • task_id (string): Celery task identifier

Response: Task status with result or error

{
  "task_id": "a00d8c65-c7fd-4360-8f4c-836b0df25f59",
  "status": "SUCCESS",
  "result": {
    "task_id": "...",
    "company_id": 1,
    "status": "success",
    "discovered_count": 12,
    "created_count": 12,
    "documents": [...]
  },
  "error": null
}

Status Values:

  • PENDING - Task is waiting to be processed
  • STARTED - Task has started execution
  • PROGRESS - Task is in progress (check result.meta for details)
  • SUCCESS - Task completed successfully
  • FAILURE - Task failed (check error field)
  • RETRY - Task is being retried
  • REVOKED - Task was cancelled

Description: Checks the current status and result of a Celery task. Results expire after 1 hour. Use the Flower dashboard for persistent task history.
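
Clients typically poll this endpoint until the task reaches a final state. The sketch below is one possible polling loop, not a prescribed client. It treats SUCCESS, FAILURE, and REVOKED as final, which is an assumption drawn from the status descriptions above. `wait_for_task` and `fetch_status` are hypothetical names; `fetch_status` stands in for a call to GET /api/v1/tasks/{task_id} that returns the decoded JSON.

```python
import time
from typing import Callable

# Assumed final states, based on the status descriptions above.
TERMINAL_STATES = {"SUCCESS", "FAILURE", "REVOKED"}


def wait_for_task(
    fetch_status: Callable[[], dict],
    interval_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the task reaches a final state or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        payload = fetch_status()
        if payload["status"] in TERMINAL_STATES:
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task still {payload['status']} after {timeout_seconds}s")
        sleep(interval_seconds)


# Demo with canned responses in place of real HTTP calls.
responses = iter([
    {"status": "PENDING"},
    {"status": "STARTED"},
    {"status": "SUCCESS", "result": {"company_id": 1}, "error": None},
])
final = wait_for_task(lambda: next(responses), sleep=lambda _: None)
print(final["status"])
```

Passing `sleep` as a parameter keeps the loop testable without real delays.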


Common Response Formats

Success Response

All successful responses follow a standard format:

GET/PUT Responses:

{
  "id": 1,
  "name": "Example",
  "created_at": "2024-01-01T00:00:00Z",
  "updated_at": "2024-01-01T00:00:00Z"
}

POST Responses:

{
  "id": 1,
  "name": "Created Example",
  "created_at": "2024-01-01T00:00:00Z",
  "updated_at": "2024-01-01T00:00:00Z"
}

DELETE Responses:

  • 204 No Content (empty body)

Error Response

Error responses include detailed information:

{
  "detail": "Error description",
  "error_code": "ERROR_CODE",
  "timestamp": "2024-01-01T00:00:00Z"
}

Validation Error Response

{
  "detail": [
    {
      "loc": ["body", "name"],
      "msg": "field required",
      "type": "value_error.missing"
    }
  ]
}
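
The two error shapes differ in the type of `detail`: a string for general errors, a list of objects for validation errors. A client can normalize both into plain messages, as in the sketch below. The helper name `error_messages` and the output format are illustrative choices, not part of the API.

```python
def error_messages(body: dict) -> list:
    """Flatten either documented error shape into a list of readable strings."""
    detail = body.get("detail")
    if isinstance(detail, str):
        # General error: {"detail": "...", "error_code": "...", "timestamp": "..."}
        code = body.get("error_code")
        return [f"{code}: {detail}" if code else detail]
    if isinstance(detail, list):
        # Validation error: one entry per invalid field.
        messages = []
        for item in detail:
            location = ".".join(str(part) for part in item.get("loc", []))
            messages.append(f"{location}: {item.get('msg', '')}")
        return messages
    return []


validation_body = {
    "detail": [
        {"loc": ["body", "name"], "msg": "field required", "type": "value_error.missing"}
    ]
}
print(error_messages(validation_body))
```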

HTTP Status Codes

  • 200 OK: Request succeeded
  • 201 Created: Resource created successfully
  • 204 No Content: Resource deleted successfully
  • 400 Bad Request: Invalid request data
  • 404 Not Found: Resource not found
  • 422 Unprocessable Entity: Validation error
  • 500 Internal Server Error: Server error

Authentication

Currently, the API operates without authentication in development. Production deployments should implement OAuth2 with JWT tokens.

Rate Limiting

No rate limits are currently enforced. Production deployments should implement rate limiting to prevent abuse of expensive operations (especially extraction endpoints).

Monitoring

The API exposes Prometheus metrics at /metrics for monitoring:

  • Request latency (p50, p95, p99)
  • Request counts by endpoint and method
  • Error rates by status code
  • Request and response body sizes

CORS

CORS is configured to allow cross-origin requests from specified origins. In development, this is typically set to ["*"] for all origins. Production should restrict to specific domains.

Request Timeout

The default request timeout is 60 seconds. Long-running operations should be handled asynchronously via Celery tasks. See Task Processing Documentation for details.

Examples

Complete Workflow: Adding a Company and Extracting Financial Data

# 1. Create a company
curl -X POST http://localhost:3030/api/v1/companies \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Adyen",
    "ir_url": "https://www.adyen.com/investor-relations",
    "primary_ticker": "ADYEN"
  }'

# 2. Note the company ID in the response (this example assumes it is 1)

# 3. Create a document record
curl -X POST http://localhost:3030/api/v1/documents \
  -H "Content-Type: application/json" \
  -d '{
    "company_id": 1,
    "document_type": "Annual Report",
    "fiscal_year": 2023,
    "filing_date": "2024-03-15",
    "file_path": "/data/pdfs/adyen_annual_report_2023.pdf",
    "metadata": {
      "pages": 150,
      "language": "en"
    }
  }'

# 4. Note the document ID in the response (this example assumes it is 1)

# 5. Create an extraction (typically done by a Celery worker)
curl -X POST http://localhost:3030/api/v1/extractions \
  -H "Content-Type: application/json" \
  -d '{
    "document_id": 1,
    "statement_type": "Income Statement",
    "extracted_data": {
      "Revenue": 1000000000,
      "Cost of Revenue": 300000000,
      "Gross Profit": 700000000
    },
    "confidence_score": 0.98
  }'

# 6. Get all companies
curl http://localhost:3030/api/v1/companies

# 7. Get all documents for a company
curl http://localhost:3030/api/v1/documents/companies/1

# 8. Get compiled statements for a company
curl http://localhost:3030/api/v1/compiled-statements/companies/1

API Versioning

The API uses URL-based versioning with the /api/v1 prefix. Future versions will be added under their own prefix (/api/v2, and so on) so that clients of earlier versions keep working.

Pagination

List endpoints support pagination via skip and limit query parameters:

  • skip: Number of records to skip (default: 0)
  • limit: Maximum number of records to return (default: 100, max: 100)

Example:

GET /api/v1/companies?skip=20&limit=50
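
Since list endpoints return a plain array with no total count, a client that wants every record has to keep requesting pages until one comes back short. The sketch below shows that loop in Python. `iter_all` and `fetch_page` are hypothetical names, and the in-memory list stands in for real calls such as GET /api/v1/companies?skip=0&limit=100.

```python
from typing import Callable, Iterator

MAX_LIMIT = 100  # documented maximum for the limit parameter


def iter_all(fetch_page: Callable[[int, int], list], limit: int = MAX_LIMIT) -> Iterator:
    """Yield every record by advancing skip until a page is shorter than limit."""
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
    skip = 0
    while True:
        page = fetch_page(skip, limit)
        yield from page
        if len(page) < limit:
            return
        skip += limit


# Demo: slicing a list imitates the server's skip/limit behaviour.
records = list(range(230))
fetched = list(iter_all(lambda skip, limit: records[skip:skip + limit]))
print(len(fetched))
```

When the total is an exact multiple of `limit`, the loop issues one extra request that returns an empty page and then stops.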

Filtering and Sorting

Currently, filtering and sorting are not implemented. Future versions will add support for:

  • Filtering by metadata fields
  • Sorting by creation date, updated date, or other fields
  • Full-text search on company names and document types