API Reference
The Financial Data Extractor provides a RESTful API built with FastAPI for managing companies, documents, financial statement extractions, and compiled statements.
Base URL
http://localhost:3030/api/v1
API Documentation
The API includes comprehensive interactive documentation automatically generated by FastAPI:
- Swagger UI: http://localhost:3030/docs
- ReDoc: http://localhost:3030/redoc
- OpenAPI Schema: http://localhost:3030/openapi.json
Common Endpoints
Health Check
GET /healthcheck
Returns the health status of the API.
Response:
{
"status": "Healthy"
}
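A readiness probe built on this endpoint could look like the sketch below. It assumes the host and port used throughout this reference and treats any network error as unhealthy; the helper names `is_healthy` and `check_health` are illustrative and not part of the API.

```python
import json
import urllib.request

BASE = "http://localhost:3030"  # host/port as used in this reference


def is_healthy(body: str) -> bool:
    """Return True when the healthcheck JSON body reports status "Healthy"."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "Healthy"


def check_health(timeout: float = 5.0) -> bool:
    """Call GET /healthcheck; connection errors and timeouts count as unhealthy."""
    try:
        with urllib.request.urlopen(f"{BASE}/healthcheck", timeout=timeout) as resp:
            return resp.status == 200 and is_healthy(resp.read().decode("utf-8"))
    except OSError:  # URLError, HTTPError and socket timeouts all derive from OSError
        return False
```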
List All Endpoints
GET /endpoints/
Returns a list of all available routes in the application.
Response:
{
"endpoints": [
{
"path": "/healthcheck",
"name": "health_check",
"methods": ["GET"]
},
...
]
}
Metrics
GET /metrics
Prometheus metrics endpoint for monitoring and observability.
API Resources
Companies
Manage company information and metadata.
List All Companies
GET /api/v1/companies
Query Parameters:
- skip (int, default: 0): Number of records to skip
- limit (int, default: 100, max: 100): Maximum number of records to return
Response: Array of company objects
Get Company by ID
GET /api/v1/companies/{company-id}
Path Parameters:
- company-id (int): Unique company identifier
Response: Company object
Get Company by Ticker
GET /api/v1/companies/ticker/{ticker}
Path Parameters:
- ticker (string): Stock ticker symbol (e.g., “ADYEN”)
Response: Company object
Create Company
POST /api/v1/companies
Request Body:
{
"name": "AstraZeneca PLC",
"ir_url": "https://www.astrazeneca.com/investor-relations/annual-reports.html",
"primary_ticker": "AZN",
"tickers": [
{"ticker": "AZN", "exchange": "LSE"},
{"ticker": "AZN", "exchange": "NASDAQ"}
]
}
Note:
- primary_ticker is optional but recommended
- tickers is an optional array of objects with ticker and exchange fields
- ir_url is required and should point to the company’s investor relations page
Response: Created company object
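The rules in the note above can be checked on the client before sending the request. The following is a minimal sketch, not the server's validation logic: the function name `build_company_payload` is hypothetical, and only the constraints stated in this section (required ir_url, optional primary_ticker, tickers entries carrying ticker and exchange) are enforced.

```python
from typing import Optional


def build_company_payload(
    name: str,
    ir_url: str,
    primary_ticker: Optional[str] = None,
    tickers: Optional[list] = None,
) -> dict:
    """Assemble a Create Company request body from the documented fields."""
    if not ir_url:
        raise ValueError("ir_url is required")
    payload = {"name": name, "ir_url": ir_url}
    if primary_ticker:
        payload["primary_ticker"] = primary_ticker
    if tickers:
        for entry in tickers:
            missing = {"ticker", "exchange"} - set(entry)
            if missing:
                raise ValueError(f"ticker entry is missing fields: {sorted(missing)}")
        # Copy only the two documented fields of each entry.
        payload["tickers"] = [
            {"ticker": t["ticker"], "exchange": t["exchange"]} for t in tickers
        ]
    return payload
```

The returned dictionary can be serialized with `json.dumps` and sent as the body of POST /api/v1/companies.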
Update Company
PUT /api/v1/companies/{company-id}
Path Parameters:
- company-id (int): Unique company identifier
Request Body: Company update object
Response: Updated company object
Delete Company
DELETE /api/v1/companies/{company-id}
Path Parameters:
- company-id (int): Unique company identifier
Response: 204 No Content
Documents
Manage PDF documents (annual reports, presentations, etc.) associated with companies.
List Documents by Company
GET /api/v1/documents/companies/{company-id}
Path Parameters:
- company-id (int): Unique company identifier
Query Parameters:
- skip (int, default: 0): Number of records to skip
- limit (int, default: 100, max: 100): Maximum number of records to return
Response: Array of document objects
Get Document by ID
GET /api/v1/documents/{document-id}
Path Parameters:
- document-id (int): Unique document identifier
Response: Document object
Get Documents by Company and Fiscal Year
GET /api/v1/documents/companies/{company-id}/fiscal-year/{fiscal-year}
Path Parameters:
- company-id (int): Unique company identifier
- fiscal-year (int, 1900-2100): Fiscal year
Response: Array of document objects
Get Documents by Company and Type
GET /api/v1/documents/companies/{company-id}/type/{document-type}
Path Parameters:
- company-id (int): Unique company identifier
- document-type (string): Document type (e.g., “Annual Report”, “Presentation”)
Query Parameters:
- skip (int, default: 0): Number of records to skip
- limit (int, default: 100, max: 100): Maximum number of records to return
Response: Array of document objects
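Document types such as “Annual Report” contain spaces, so the path segment needs percent-encoding when the URL is built by hand. A small sketch, assuming the base URL from this reference; the helper name `documents_by_type_url` is illustrative only.

```python
from urllib.parse import quote, urlencode

API = "http://localhost:3030/api/v1"  # base URL from this reference


def documents_by_type_url(
    company_id: int, document_type: str, skip: int = 0, limit: int = 100
) -> str:
    """Build the Get Documents by Company and Type URL with an encoded type."""
    if limit > 100:
        raise ValueError("limit may not exceed the documented maximum of 100")
    segment = quote(document_type, safe="")  # "Annual Report" becomes "Annual%20Report"
    query = urlencode({"skip": skip, "limit": limit})
    return f"{API}/documents/companies/{company_id}/type/{segment}?{query}"
```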
Create Document
POST /api/v1/documents
Request Body:
{
"company_id": 1,
"document_type": "Annual Report",
"fiscal_year": 2023,
"filing_date": "2024-03-15",
"file_path": "/path/to/document.pdf",
"metadata": {
"pages": 150,
"language": "en",
"source_url": "https://example.com/annual-report-2023.pdf"
}
}
Response: Created document object
Update Document
PUT /api/v1/documents/{document-id}
Path Parameters:
- document-id (int): Unique document identifier
Request Body: Document update object
Response: Updated document object
Delete Document
DELETE /api/v1/documents/{document-id}
Path Parameters:
- document-id (int): Unique document identifier
Response: 204 No Content
List PDFs from Storage
GET /api/v1/documents/storage/companies/{company-id}
Path Parameters:
- company-id (int): Unique company identifier
Query Parameters:
- fiscal_year (int, optional, 1900-2100): Filter by fiscal year
Response: JSON object with list of PDF files from storage
{
"company_id": 1,
"fiscal_year": null,
"prefix": "company_1/",
"count": 5,
"files": [
{
"object_key": "company_1/2023/annual_report_2023.pdf",
"size": 5242880,
"last_modified": "2024-01-15T10:30:00Z"
}
]
}
Description: Lists all PDF files stored in MinIO/local storage for a company. This queries storage directly, not the database. Useful for checking what files are actually stored.
Download PDF from Storage
GET /api/v1/documents/storage/download
Query Parameters:
- object_key (string): Object key (path) of the file in storage (URL-encoded)
Response: PDF file as binary response with Content-Type: application/pdf
Description: Downloads a PDF file directly from MinIO/local storage by object key. The object key should be URL-encoded if it contains special characters.
Example:
curl "http://localhost:3030/api/v1/documents/storage/download?object_key=company_1%2F2023%2Fannual_report_2023.pdf" \
--output document.pdf
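The same download can be scripted in Python, letting the standard library handle the URL-encoding of the object key. This is a sketch under the assumptions of this reference (base URL, `object_key` query parameter); the function names are illustrative.

```python
import urllib.request
from urllib.parse import urlencode

API = "http://localhost:3030/api/v1"  # base URL from this reference


def storage_download_url(object_key: str) -> str:
    """Build the download URL; urlencode percent-encodes "/" in the key as %2F."""
    query = urlencode({"object_key": object_key})
    return f"{API}/documents/storage/download?{query}"


def download_pdf(object_key: str, destination: str, timeout: float = 60.0) -> int:
    """Fetch the PDF and write it to `destination`; returns the byte count."""
    with urllib.request.urlopen(storage_download_url(object_key), timeout=timeout) as resp:
        data = resp.read()
    with open(destination, "wb") as fh:
        fh.write(data)
    return len(data)
```

For example, `download_pdf("company_1/2023/annual_report_2023.pdf", "document.pdf")` requests the same URL as the curl command above.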
Extractions
Manage extracted financial statements from documents.
List Extractions by Document
GET /api/v1/extractions/documents/{document-id}
Path Parameters:
- document-id (int): Unique document identifier
Response: Array of extraction objects
Get Extraction by ID
GET /api/v1/extractions/{extraction-id}
Path Parameters:
- extraction-id (int): Unique extraction identifier
Response: Extraction object
Get Extraction by Document and Statement Type
GET /api/v1/extractions/documents/{document-id}/statement-type/{statement-type}
Path Parameters:
- document-id (int): Unique document identifier
- statement-type (string): Statement type (e.g., “Income Statement”, “Balance Sheet”, “Cash Flow Statement”)
Response: Extraction object
Create Extraction
POST /api/v1/extractions
Request Body:
{
"document_id": 1,
"statement_type": "Income Statement",
"extracted_data": {
"Revenue": 1000000,
"Operating Expenses": 600000,
"Net Income": 400000
},
"confidence_score": 0.95,
"metadata": {
"extraction_method": "LLM",
"model": "openai/gpt-4o-mini",
"tokens_used": 1500
}
}
Response: Created extraction object
Update Extraction
PUT /api/v1/extractions/{extraction-id}
Path Parameters:
- extraction-id (int): Unique extraction identifier
Request Body: Extraction update object
Response: Updated extraction object
Delete Extraction
DELETE /api/v1/extractions/{extraction-id}
Path Parameters:
- extraction-id (int): Unique extraction identifier
Response: 204 No Content
Compiled Statements
Manage compiled multi-year financial statements.
List Compiled Statements by Company
GET /api/v1/compiled-statements/companies/{company-id}
Path Parameters:
- company-id (int): Unique company identifier
Response: Array of compiled statement objects
Get Compiled Statement by ID
GET /api/v1/compiled-statements/{compiled-statement-id}
Path Parameters:
- compiled-statement-id (int): Unique compiled statement identifier
Response: Compiled statement object
Get Compiled Statement by Company and Statement Type
GET /api/v1/compiled-statements/companies/{company-id}/statement-type/{statement-type}
Path Parameters:
- company-id (int): Unique company identifier
- statement-type (string): Statement type
Response: Compiled statement object
Create or Update Compiled Statement
POST /api/v1/compiled-statements
Request Body:
{
"company_id": 1,
"statement_type": "Income Statement",
"compiled_data": {
"Revenue": {
"2020": 900000,
"2021": 950000,
"2022": 1000000,
"2023": 1050000
},
"Net Income": {
"2020": 350000,
"2021": 370000,
"2022": 400000,
"2023": 420000
}
},
"data_lineage": {
"2023": {
"source_document_id": 5,
"source_extraction_id": 12
}
}
}
Response: Compiled statement object
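To make the relationship between per-document extracted_data and the multi-year compiled_data shape concrete, the sketch below pivots single-year extractions into the nested structure shown in the request body. It illustrates the data shape only: the service's own compilation step (which also normalizes statements) may work differently, and the function name `compile_statement` is hypothetical.

```python
def compile_statement(
    extractions_by_year: dict[int, dict[str, float]],
) -> dict[str, dict[str, float]]:
    """Pivot {year: {line_item: value}} into {line_item: {"year": value}}."""
    compiled: dict[str, dict[str, float]] = {}
    for year in sorted(extractions_by_year):
        for line_item, value in extractions_by_year[year].items():
            # Years become string keys, matching the compiled_data example.
            compiled.setdefault(line_item, {})[str(year)] = value
    return compiled


by_year = {
    2022: {"Revenue": 1000000, "Net Income": 400000},
    2023: {"Revenue": 1050000, "Net Income": 420000},
}
compiled_data = compile_statement(by_year)
```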
Update Compiled Statement
PUT /api/v1/compiled-statements/{compiled-statement-id}
Path Parameters:
- compiled-statement-id (int): Unique compiled statement identifier
Request Body: Compiled statement update object
Response: Updated compiled statement object
Delete Compiled Statement
DELETE /api/v1/compiled-statements/{compiled-statement-id}
Path Parameters:
- compiled-statement-id (int): Unique compiled statement identifier
Response: 204 No Content
Tasks
Manage asynchronous Celery tasks for long-running operations.
For detailed information about task processing, see Task Processing Documentation.
Trigger Company Financial Data Extraction
POST /api/v1/tasks/companies/{company_id}/extract
Path Parameters:
- company_id (int): Unique company identifier
Response: Task response with task ID
{
"task_id": "a00d8c65-c7fd-4360-8f4c-836b0df25f59",
"status": "PENDING",
"message": "Financial data extraction started for company 1"
}
Description: Triggers the complete financial data extraction workflow:
- Scrapes investor relations website
- Discovers and classifies documents
- Downloads PDFs
- Extracts financial statements
- Normalizes and compiles statements
Estimated Duration: 10 minutes - 2 hours
Trigger Investor Relations Scraping
POST /api/v1/tasks/companies/{company_id}/scrape
Path Parameters:
- company_id (int): Unique company identifier
Response: Task response with task ID
{
"task_id": "b12e9d76-d8ae-5471-9f5d-947c1ef36f60",
"status": "PENDING",
"message": "Scraping started for company 1"
}
Description: Scrapes the investor relations website to discover PDF documents.
Estimated Duration: 30 seconds - 5 minutes
Trigger Company Statements Recompilation
POST /api/v1/tasks/companies/{company_id}/recompile
Path Parameters:
- company_id (int): Unique company identifier
Response: Task response with task ID
Description: Recompiles all financial statements after new extractions. Useful when new documents are processed and statements need updating.
Estimated Duration: 1-5 minutes
Trigger Batch Document Processing
POST /api/v1/tasks/companies/{company_id}/process-documents
Path Parameters:
- company_id (int): Unique company identifier
Response: Task response with task ID
{
"task_id": "c23f0e87-e9bf-6582-0a6e-a58d2ef47f71",
"status": "PENDING",
"message": "Batch document processing started for company 1"
}
Description: Processes all documents for a company through classify, download, and extract steps. This is useful for batch processing multiple documents after scraping.
Estimated Duration: 10 minutes - 1 hour (depends on number of documents)
Trigger Document Processing
POST /api/v1/tasks/documents/{document_id}/process
Path Parameters:
- document_id (int): Unique document identifier
Response: Task response with task ID
Description: Processes a document end-to-end:
- Classifies document type
- Downloads PDF (if needed)
- Extracts financial statements (for annual reports)
Estimated Duration: 2-10 minutes
Trigger PDF Download
POST /api/v1/tasks/documents/{document_id}/download
Path Parameters:
- document_id (int): Unique document identifier
Response: Task response with task ID
Description: Downloads the PDF document from its source URL and stores it locally.
Estimated Duration: 5-30 seconds
Trigger Document Classification
POST /api/v1/tasks/documents/{document_id}/classify
Path Parameters:
- document_id (int): Unique document identifier
Response: Task response with task ID
Description: Classifies document by type (annual_report, quarterly_report, etc.) using filename patterns, URL patterns, and content analysis.
Estimated Duration: 1-5 seconds
Trigger Financial Statement Extraction
POST /api/v1/tasks/documents/{document_id}/extract
Path Parameters:
- document_id (int): Unique document identifier
Response: Task response with task ID
Description: Extracts financial statements (Income Statement, Balance Sheet, Cash Flow Statement) from the PDF using an LLM.
Estimated Duration: 2-5 minutes per document
Get Task Status
GET /api/v1/tasks/{task_id}
Path Parameters:
- task_id (string): Celery task identifier
Response: Task status with result or error
{
"task_id": "a00d8c65-c7fd-4360-8f4c-836b0df25f59",
"status": "SUCCESS",
"result": {
"task_id": "...",
"company_id": 1,
"status": "success",
"discovered_count": 12,
"created_count": 12,
"documents": [...]
},
"error": null
}
Status Values:
- PENDING - Task is waiting to be processed
- STARTED - Task has started execution
- PROGRESS - Task is in progress (check result.meta for details)
- SUCCESS - Task completed successfully
- FAILURE - Task failed (check error field)
- RETRY - Task is being retried
- REVOKED - Task was cancelled
Description: Checks the current status and result of a Celery task. Results expire after 1 hour. Use Flower dashboard for persistent task history.
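A client typically polls this endpoint until the task reaches a terminal state. The sketch below treats SUCCESS, FAILURE, and REVOKED as terminal, which is an interpretation of the status list above rather than something the API states explicitly. The polling interval, the `max_wait` ceiling, and the function names are assumptions; `fetch` and `sleep` are injectable so the loop can be exercised without a running server.

```python
import json
import time
import urllib.request
from typing import Callable

API = "http://localhost:3030/api/v1"  # base URL from this reference
TERMINAL_STATES = {"SUCCESS", "FAILURE", "REVOKED"}  # assumed terminal


def fetch_task(task_id: str) -> dict:
    """GET /api/v1/tasks/{task_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{API}/tasks/{task_id}", timeout=60) as resp:
        return json.load(resp)


def wait_for_task(
    task_id: str,
    fetch: Callable[[str], dict] = fetch_task,
    interval: float = 5.0,
    max_wait: float = 7200.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the task is terminal; raise TimeoutError after max_wait seconds."""
    waited = 0.0
    while True:
        status = fetch(task_id)
        if status.get("status") in TERMINAL_STATES:
            return status
        if waited >= max_wait:
            raise TimeoutError(f"task {task_id} still {status.get('status')!r}")
        sleep(interval)
        waited += interval
```

Because results expire after 1 hour, a caller should read the `result` or `error` field soon after the terminal state is observed; Celery generally reports unknown or expired task IDs as PENDING.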
Common Response Formats
Success Response
All successful responses follow a standard format:
GET/PUT Responses:
{
"id": 1,
"name": "Example",
"created_at": "2024-01-01T00:00:00Z",
"updated_at": "2024-01-01T00:00:00Z"
}
POST Responses:
{
"id": 1,
"name": "Created Example",
"created_at": "2024-01-01T00:00:00Z",
"updated_at": "2024-01-01T00:00:00Z"
}
DELETE Responses:
- 204 No Content (empty body)
Error Response
Error responses include detailed information:
{
"detail": "Error description",
"error_code": "ERROR_CODE",
"timestamp": "2024-01-01T00:00:00Z"
}
Validation Error Response
{
"detail": [
{
"loc": ["body", "name"],
"msg": "field required",
"type": "value_error.missing"
}
]
}
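Since `detail` is a plain string in the general error format and a list of objects in the validation format, client code has to handle both. A minimal sketch of flattening either shape into readable messages; the function name `format_error_detail` is illustrative.

```python
def format_error_detail(payload: dict) -> list[str]:
    """Turn an error response body into a list of human-readable messages."""
    detail = payload.get("detail")
    if isinstance(detail, str):
        # General error format: "detail" is a single description.
        return [detail]
    messages = []
    for err in detail or []:
        # Validation format: join the location path, e.g. ["body", "name"] -> "body.name".
        location = ".".join(str(part) for part in err.get("loc", []))
        messages.append(f"{location}: {err.get('msg', '')} ({err.get('type', '')})")
    return messages
```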
HTTP Status Codes
- 200 OK: Request succeeded
- 201 Created: Resource created successfully
- 204 No Content: Resource deleted successfully
- 400 Bad Request: Invalid request data
- 404 Not Found: Resource not found
- 422 Unprocessable Entity: Validation error
- 500 Internal Server Error: Server error
Authentication
Currently, the API operates without authentication in development. Production deployments should implement OAuth2 with JWT tokens.
Rate Limiting
No rate limits are currently enforced. Production deployments should implement rate limiting to prevent abuse of expensive operations (especially extraction endpoints).
Monitoring
The API exposes Prometheus metrics at /metrics for monitoring:
- Request latency (p50, p95, p99)
- Request counts by endpoint and method
- Error rates by status code
- Request and response body sizes
CORS
CORS is configured to allow cross-origin requests from specified origins. In development, this is typically set to ["*"] to allow all origins. Production deployments should restrict it to specific domains.
Request Timeout
Default request timeout is 60 seconds. Long-running operations should be handled asynchronously via Celery tasks. See Task Processing Documentation for details.
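In practice this means a client starts long-running work through a task endpoint, which returns a task ID immediately, instead of holding a request open. The sketch below triggers the company extraction task; it mirrors the 60-second figure as a client-side timeout, which is an assumption about how the limit applies, and the function names are illustrative.

```python
import json
import urllib.request

API = "http://localhost:3030/api/v1"  # base URL from this reference


def extraction_request(company_id: int) -> urllib.request.Request:
    """Build the POST request that starts extraction for a company (no body)."""
    url = f"{API}/tasks/companies/{company_id}/extract"
    return urllib.request.Request(url, method="POST")


def trigger_extraction(company_id: int, timeout: float = 60.0) -> str:
    """Send the request and return the Celery task ID from the response."""
    with urllib.request.urlopen(extraction_request(company_id), timeout=timeout) as resp:
        return json.load(resp)["task_id"]
```

The returned ID can then be passed to GET /api/v1/tasks/{task_id} to follow progress.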
Examples
Complete Workflow: Adding a Company and Extracting Financial Data
# 1. Create a company
curl -X POST http://localhost:3030/api/v1/companies \
-H "Content-Type: application/json" \
-d '{
"name": "Adyen",
"primary_ticker": "ADYEN",
"sector": "Financial Services",
"country": "Netherlands",
"ir_url": "https://www.adyen.com/investor-relations",
"currency": "EUR"
}'
# 2. Get the company ID from response (e.g., company ID is 1)
# 3. Create a document record
curl -X POST http://localhost:3030/api/v1/documents \
-H "Content-Type: application/json" \
-d '{
"company_id": 1,
"document_type": "Annual Report",
"fiscal_year": 2023,
"filing_date": "2024-03-15",
"file_path": "/data/pdfs/adyen_annual_report_2023.pdf",
"metadata": {
"pages": 150,
"language": "en"
}
}'
# 4. Get the document ID from response (e.g., document ID is 1)
# 5. Create an extraction (typically done by Celery worker)
curl -X POST http://localhost:3030/api/v1/extractions \
-H "Content-Type: application/json" \
-d '{
"document_id": 1,
"statement_type": "Income Statement",
"extracted_data": {
"Revenue": 1000000000,
"Cost of Revenue": 300000000,
"Gross Profit": 700000000
},
"confidence_score": 0.98
}'
# 6. Get all companies
curl http://localhost:3030/api/v1/companies
# 7. Get all documents for a company
curl http://localhost:3030/api/v1/documents/companies/1
# 8. Get compiled statements for a company
curl http://localhost:3030/api/v1/compiled-statements/companies/1
API Versioning
The API uses URL-based versioning with the /api/v1 prefix. Future versions will be added as /api/v2, etc., maintaining backward compatibility with previous versions.
Pagination
List endpoints support pagination via skip and limit query parameters:
- skip: Number of records to skip (default: 0)
- limit: Maximum number of records to return (default: 100, max: 100)
Example:
GET /api/v1/companies?skip=20&limit=50
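To walk an entire collection, a client can advance skip by limit until a page comes back short. This sketch assumes list endpoints return a plain JSON array with no total count, as the response descriptions in this reference suggest; `fetch_page` is a caller-supplied function (hypothetical here) that performs the actual request.

```python
from typing import Callable, Iterator


def iter_all(fetch_page: Callable[[int, int], list], limit: int = 100) -> Iterator:
    """Yield every record by requesting pages of `limit` items until one is short."""
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and the documented maximum of 100")
    skip = 0
    while True:
        page = fetch_page(skip, limit)
        yield from page
        if len(page) < limit:  # a short (or empty) page means the end was reached
            return
        skip += limit
```

When the collection size is an exact multiple of limit, the loop makes one extra request that returns an empty page before stopping.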
Filtering and Sorting
Currently, filtering and sorting are not implemented. Future versions will add support for:
- Filtering by metadata fields
- Sorting by creation date, updated date, or other fields
- Full-text search on company names and document types
Related Documentation
- Database Schema - Data models and relationships
- Infrastructure Development - Docker setup and running the API
- Backend Overview - Backend architecture and development guide