First Steps - Your First Extraction

This tutorial will guide you through extracting financial data for the first time.

Prerequisites

All services running (see Installation Guide)
Backend API running on http://localhost:3030
Celery worker running
Frontend running on http://localhost:3000
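
Before starting, it can help to confirm that the HTTP services answer at all. The following is a minimal sketch, not part of the project: the helper names http_code, check_service, and check_all are hypothetical, and it only tests reachability of the URLs mentioned in this tutorial (the Flower URL is the one used in Step 3). The Celery worker has no HTTP endpoint, so it is not covered here.

```shell
# Print the HTTP status code for a URL; curl reports 000 if no response arrives.
http_code() {
  curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$1"
}

# Report whether a service answered at all (any code other than 000).
# $1 = display name, $2 = URL to probe.
check_service() {
  code=$(http_code "$2")
  if [ -n "$code" ] && [ "$code" != "000" ]; then
    echo "$1: reachable (HTTP $code)"
  else
    echo "$1: NOT reachable"
    return 1
  fi
}

# Probe the three HTTP services referenced in this tutorial.
check_all() {
  check_service "Backend API" http://localhost:3030/api/v1/companies
  check_service "Frontend" http://localhost:3000
  check_service "Flower" http://localhost:5555
}

# Usage (requires the services to be running):
# check_all
```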

Step 1: View Companies

The system comes with 6 pre-seeded European companies:

AstraZeneca PLC (AZN)
SAP SE (SAP)
Siemens AG (SIE)
ASML Holding N.V. (ASML)
Unilever PLC (ULVR)
Allianz SE (ALV)

Via Frontend:

Open http://localhost:3000
Navigate to the Companies page
You should see all 6 companies listed

Via API:

curl http://localhost:3030/api/v1/companies

Step 2: Trigger Extraction

Option A: Via Frontend

Select a company from the list (e.g., AstraZeneca)
Click the “Extract Financial Data” button
Monitor the task status in the task monitor component

Option B: Via API

# Trigger extraction for company ID 1 (AstraZeneca)
curl -X POST http://localhost:3030/api/v1/tasks/companies/1/extract

# Response:
# {
#   "task_id": "a00d8c65-c7fd-4360-8f4c-836b0df25f59",
#   "status": "PENDING",
#   "message": "Financial data extraction started for company 1"
# }
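
You will need the task ID in Step 3, so it is convenient to capture it when triggering the extraction. This sketch assumes the response has the flat shape shown above; extract_task_id is a hypothetical helper, and it uses sed only to avoid extra dependencies (a real JSON tool such as jq would be more robust).

```shell
# Hypothetical helper: read a JSON response on stdin and print the value
# of its "task_id" field. Assumes the flat response shape shown above.
extract_task_id() {
  sed -n 's/.*"task_id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage (requires the running API):
# task_id=$(curl -s -X POST http://localhost:3030/api/v1/tasks/companies/1/extract | extract_task_id)
# echo "Started task $task_id"
```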

Step 3: Monitor Progress

Via Frontend

The task status monitor shows real-time progress:

Scraping - Discovering PDFs from investor relations website
Classifying - Categorizing documents
Downloading - Fetching PDF files
Extracting - Extracting financial statements using LLM
Compiling - Normalizing and compiling statements

Via API

# Check task status (replace TASK_ID with the task_id returned in Step 2)
curl http://localhost:3030/api/v1/tasks/TASK_ID

# Response:
# {
#   "task_id": "a00d8c65-c7fd-4360-8f4c-836b0df25f59",
#   "status": "SUCCESS",
#   "result": { ... },
#   "error": null
# }
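
Because an extraction can run for a long time, you may prefer to poll the status endpoint rather than check it by hand. The sketch below is one way to do that under stated assumptions: the API is assumed to pass Celery's standard state names through unchanged (only PENDING and SUCCESS appear in this tutorial; FAILURE and REVOKED are Celery's other terminal states), and the function names and the POLL_SECONDS variable are hypothetical.

```shell
# True if the status is one of Celery's terminal states.
# Assumption: the API returns Celery state names unchanged.
is_terminal_status() {
  case "$1" in
    SUCCESS|FAILURE|REVOKED) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the current "status" value for a task ID ($1).
fetch_status() {
  curl -s "http://localhost:3030/api/v1/tasks/$1" |
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Poll until the task reaches a terminal state.
# Waits POLL_SECONDS (default 30) between checks; succeeds only on SUCCESS.
wait_for_task() {
  while :; do
    status=$(fetch_status "$1")
    echo "status: ${status:-unknown}"
    if is_terminal_status "$status"; then break; fi
    sleep "${POLL_SECONDS:-30}"
  done
  [ "$status" = "SUCCESS" ]
}

# Usage (requires the running API):
# wait_for_task a00d8c65-c7fd-4360-8f4c-836b0df25f59 && echo "Extraction finished"
```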

Via Flower Dashboard

Open http://localhost:5555
Navigate to the “Tasks” tab
Find your task by ID or company name
View detailed execution timeline

Step 4: View Results

View Documents

Via Frontend:

Navigate to the company’s detail page
Click the “Documents” tab
See all discovered annual reports

Via API:

# List documents for company 1
curl http://localhost:3030/api/v1/documents/companies/1

View Extractions

Via Frontend:

Navigate to the company’s detail page
Click the “Extractions” tab
See raw financial statement extractions

Via API:

# List extractions for a document (replace DOCUMENT_ID with an ID from the document list)
curl http://localhost:3030/api/v1/extractions/documents/DOCUMENT_ID

View Compiled Statements

Via Frontend:

Navigate to the company’s detail page
Click the “Statements” tab
Select a statement type (Income Statement, Balance Sheet, or Cash Flow)
View the compiled 10-year financial data

Via API:

# Get compiled income statement for company 1
curl http://localhost:3030/api/v1/compiled-statements/companies/1/statement-type/Income%20Statement
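
The statement type is part of the URL path, so the space in its name must be percent-encoded as %20, as in the request above. The sketch below builds the URL for each type. It assumes the API accepts the other two type names exactly as they appear in the frontend list ("Balance Sheet" and "Cash Flow"), which this tutorial does not confirm; encode_spaces and statement_url are hypothetical helpers.

```shell
# Percent-encode spaces so a statement type can be used as a URL path segment.
encode_spaces() {
  printf '%s' "$1" | sed 's/ /%20/g'
}

# Build the compiled-statement URL for a company ID ($1) and statement type ($2).
statement_url() {
  printf 'http://localhost:3030/api/v1/compiled-statements/companies/%s/statement-type/%s\n' \
    "$1" "$(encode_spaces "$2")"
}

# Print the URL for each statement type of company 1.
for type in "Income Statement" "Balance Sheet" "Cash Flow"; do
  statement_url 1 "$type"
done

# Usage (requires the running API):
# curl "$(statement_url 1 "Balance Sheet")"
```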

Step 5: Understand the Data

Compiled Statement Structure

Each compiled statement contains:

Line Items - Normalized financial line items (e.g., “Revenue”, “Total Assets”)
Years - Columns for each fiscal year (e.g., 2015-2024)
Values - Financial values in the company’s reporting currency
Metadata - Data lineage, confidence scores, gaps

Data Quality Indicators

Confidence Scores - How certain the extraction is
Data Lineage - Which report each value came from
Restatements - Newer reports may contain restated historical data
Gaps - Missing years or line items

Common Tasks

Extract Data for All Companies

# Trigger extraction for each of the six pre-seeded companies
for company_id in 1 2 3 4 5 6; do
  curl -X POST "http://localhost:3030/api/v1/tasks/companies/${company_id}/extract"
done
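
When triggering several extractions at once, it is useful to record which task ID belongs to which company so you can check each one later. This is a rough sketch: trigger_extraction and trigger_all are hypothetical helpers, and the task ID is parsed with sed on the assumption that the response has the flat shape shown in Step 2.

```shell
# POST the extraction request for one company ID ($1) and print the raw response.
trigger_extraction() {
  curl -s -X POST "http://localhost:3030/api/v1/tasks/companies/$1/extract"
}

# Trigger every company ID given as an argument and print one
# "<company_id> <task_id>" line each; prints FAILED if no task ID came back.
trigger_all() {
  for company_id in "$@"; do
    task_id=$(trigger_extraction "$company_id" |
      sed -n 's/.*"task_id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1)
    echo "$company_id ${task_id:-FAILED}"
  done
}

# Usage (requires the running API):
# trigger_all 1 2 3 4 5 6 > task_ids.txt
```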

Recompile Statements

If you’ve added new extractions, recompile the statements for that company:

curl -X POST http://localhost:3030/api/v1/tasks/companies/1/recompile

Download a PDF

# Get document info
curl http://localhost:3030/api/v1/documents/DOCUMENT_ID

# Download PDF from storage
curl "http://localhost:3030/api/v1/documents/storage/download?object_key=company_1/2023/annual_report.pdf" \
  --output document.pdf
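
Object keys contain slashes and may contain other characters that need encoding in a query string. As a hedged alternative to writing the URL by hand, curl can encode the parameter itself with -G and --data-urlencode. The helper names below are hypothetical; the object key is passed through exactly as given, and -f makes curl fail on an HTTP error instead of saving the error body as a PDF.

```shell
# Derive a local file name from an object key (its last path component).
local_name_for_key() {
  basename "$1"
}

# Download the object with key $1 into the current directory.
# -G sends the --data-urlencode value as a URL-encoded query parameter.
download_pdf() {
  curl -sS -f -G "http://localhost:3030/api/v1/documents/storage/download" \
    --data-urlencode "object_key=$1" \
    --output "$(local_name_for_key "$1")"
}

# Usage (requires the running API):
# download_pdf "company_1/2023/annual_report.pdf"
```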

Next Steps

Now that you’ve completed your first extraction:

Explore the API - Learn all available endpoints
Understand the Architecture - Learn how the system works
Database Schema - Understand data structures
Task Processing - Learn about Celery tasks

Troubleshooting

Extraction Takes Too Long:

This is usually normal: a full extraction can take from 10 minutes to 2 hours per company
Check Celery worker logs for progress
Monitor Flower dashboard for task status

No Documents Found:

Check that the company’s investor relations URL is correct
Verify that the website is accessible
Check scraping worker logs

LLM Extraction Fails:

Verify that the OpenRouter API key is valid
Check API credits/balance
Review extraction worker logs

Compilation Shows Missing Data:

Some years may not have reports
Check document list for available fiscal years
Verify extractions were successful

Tips

Start Small: Test with one company first
Monitor Resources: Watch Docker container resources during extraction
Use Flower: The Flower dashboard shows each task’s state and execution timeline
Check Logs: Review worker logs for detailed execution information
Verify Data: Spot-check extracted values against the source reports before relying on them