First Steps - Your First Extraction
This tutorial walks you through your first financial data extraction, from triggering the task to viewing the compiled statements.
Prerequisites
- All services running (see Installation Guide)
- Backend API running on http://localhost:3030
- Celery worker running
- Frontend running on http://localhost:3000
Step 1: View Companies
The system comes with 6 pre-seeded European companies:
- AstraZeneca PLC (AZN)
- SAP SE (SAP)
- Siemens AG (SIE)
- ASML Holding N.V. (ASML)
- Unilever PLC (ULVR)
- Allianz SE (ALV)
Via Frontend:
- Open http://localhost:3000
- Navigate to the Companies page
- You should see all 6 companies listed
Via API:
curl http://localhost:3030/api/v1/companies
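The same request can be made from a script. The following is a minimal sketch using only the Python standard library; `BASE_URL` and the `fetch_json` helper are names chosen for illustration, and because the response shape of the companies endpoint is not documented here, the helper simply returns whatever JSON the API sends back.

```python
import json
from urllib.request import urlopen

# Assumption: the backend is reachable at the address used in this tutorial.
BASE_URL = "http://localhost:3030/api/v1"


def fetch_json(path, opener=urlopen):
    """GET BASE_URL + path and return the decoded JSON body.

    `opener` is injectable so the helper can be exercised without a server.
    """
    with opener(f"{BASE_URL}{path}") as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires the backend to be running):
#   companies = fetch_json("/companies")
#   print(json.dumps(companies, indent=2))
```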
Step 2: Trigger Extraction
Option A: Via Frontend
- Select a company from the list (e.g., AstraZeneca)
- Click the “Extract Financial Data” button
- Monitor the task status in the task monitor component
Option B: Via API
# Trigger extraction for company ID 1 (AstraZeneca)
curl -X POST http://localhost:3030/api/v1/tasks/companies/1/extract
# Response:
# {
# "task_id": "a00d8c65-c7fd-4360-8f4c-836b0df25f59",
# "status": "PENDING",
# "message": "Financial data extraction started for company 1"
# }
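If you plan to script the workflow, the useful part of the response is the `task_id`, since Step 3 needs it. Here is a hedged sketch that sends the same POST and returns the ID; the function name `trigger_extraction` is hypothetical, and the code assumes the response body matches the example shown above.

```python
import json
from urllib.request import Request, urlopen

# Assumption: the backend is reachable at the address used in this tutorial.
BASE_URL = "http://localhost:3030/api/v1"


def trigger_extraction(company_id, opener=urlopen):
    """Start an extraction for one company and return the task ID."""
    request = Request(
        f"{BASE_URL}/tasks/companies/{company_id}/extract",
        method="POST",
    )
    with opener(request) as response:
        body = json.loads(response.read().decode("utf-8"))
    # Assumes the response contains "task_id", as in the example response.
    return body["task_id"]


# Usage (requires the backend to be running):
#   task_id = trigger_extraction(1)
```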
Step 3: Monitor Progress
Via Frontend
The task status monitor shows real-time progress:
- Scraping - Discovering PDFs from investor relations website
- Classifying - Categorizing documents
- Downloading - Fetching PDF files
- Extracting - Extracting financial statements using LLM
- Compiling - Normalizing and compiling statements
Via API
# Check task status (replace TASK_ID with the task_id returned in Step 2)
curl http://localhost:3030/api/v1/tasks/TASK_ID
# Response:
# {
# "task_id": "a00d8c65-c7fd-4360-8f4c-836b0df25f59",
# "status": "SUCCESS",
# "result": { ... },
# "error": null
# }
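Because an extraction can run for a long time, a script will usually poll this endpoint rather than call it once. The sketch below is one way to do that. It assumes the `status` field carries Celery's standard state names (the `PENDING` and `SUCCESS` values in the examples suggest this, but it is not confirmed here), and the function names, polling interval, and timeout are illustrative choices.

```python
import json
import time
from urllib.request import urlopen

# Assumption: the backend is reachable at the address used in this tutorial.
BASE_URL = "http://localhost:3030/api/v1"

# Celery's terminal states; PENDING, STARTED and RETRY mean "keep waiting".
TERMINAL_STATES = {"SUCCESS", "FAILURE", "REVOKED"}


def get_task(task_id):
    """Fetch the current task record from the status endpoint."""
    with urlopen(f"{BASE_URL}/tasks/{task_id}") as response:
        return json.loads(response.read().decode("utf-8"))


def wait_for_task(task_id, fetch=get_task, interval=30.0, timeout=7200.0,
                  sleep=time.sleep):
    """Poll until the task reaches a terminal state or `timeout` seconds pass."""
    waited = 0.0
    while True:
        task = fetch(task_id)
        if task["status"] in TERMINAL_STATES:
            return task
        if waited >= timeout:
            raise TimeoutError(f"task {task_id} still {task['status']} after {waited:.0f}s")
        sleep(interval)
        waited += interval
```

The default two-hour timeout mirrors the upper end of the extraction time mentioned under Troubleshooting; for a quick check, pass a shorter `timeout`.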
Via Flower Dashboard
- Open http://localhost:5555
- Navigate to the “Tasks” tab
- Find your task by ID or company name
- View detailed execution timeline
Step 4: View Results
View Documents
Via Frontend:
- Navigate to the company’s detail page
- Click the “Documents” tab
- See all discovered annual reports
Via API:
# List documents for company 1
curl http://localhost:3030/api/v1/documents/companies/1
View Extractions
Via Frontend:
- Navigate to the company’s detail page
- Click the “Extractions” tab
- See raw financial statement extractions
Via API:
# List extractions for a document (replace DOCUMENT_ID with an ID from the document list above)
curl http://localhost:3030/api/v1/extractions/documents/DOCUMENT_ID
View Compiled Statements
Via Frontend:
- Navigate to the company’s detail page
- Click the “Statements” tab
- Select statement type (Income Statement, Balance Sheet, Cash Flow)
- View 10-year compiled financial data
Via API:
# Get compiled income statement for company 1
curl http://localhost:3030/api/v1/compiled-statements/companies/1/statement-type/Income%20Statement
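Note that the statement type is part of the URL path, so the space in "Income Statement" must be percent-encoded as `%20`. A small sketch of building that URL programmatically follows; the helper name is hypothetical, and only "Income Statement" is confirmed by the example above. Whether the API accepts "Balance Sheet" and "Cash Flow" in exactly that spelling is an assumption based on the frontend labels.

```python
from urllib.parse import quote

# Assumption: the backend is reachable at the address used in this tutorial.
BASE_URL = "http://localhost:3030/api/v1"


def compiled_statement_url(company_id, statement_type):
    """Build the compiled-statement URL, percent-encoding the statement type."""
    # safe="" also encodes "/", so the type always stays a single path segment.
    encoded_type = quote(statement_type, safe="")
    return (
        f"{BASE_URL}/compiled-statements/companies/{company_id}"
        f"/statement-type/{encoded_type}"
    )


print(compiled_statement_url(1, "Income Statement"))
```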
Step 5: Understand the Data
Compiled Statement Structure
Each compiled statement contains:
- Line Items - Normalized financial line items (e.g., “Revenue”, “Total Assets”)
- Years - Columns for each fiscal year (e.g., 2015-2024)
- Values - Financial values in the company’s reporting currency
- Metadata - Data lineage, confidence scores, gaps
Data Quality Indicators
- Confidence Scores - How certain the extraction is
- Data Lineage - Which report each value came from
- Restatements - Newer reports may contain restated historical data
- Gaps - Missing years or line items
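To make the idea of gaps concrete, the sketch below scans a compiled statement for missing values. The in-memory shape used here (a dict mapping each line item to a dict of year to value) is purely an assumption for illustration, not the API's documented response format, and the numbers are made-up placeholders.

```python
def find_gaps(statement, years):
    """Return (line_item, year) pairs that have no value.

    `statement` is assumed to map line item -> {year: value}; a year that is
    absent or set to None counts as a gap.
    """
    gaps = []
    for line_item, values_by_year in statement.items():
        for year in years:
            if values_by_year.get(year) is None:
                gaps.append((line_item, year))
    return gaps


# Placeholder data, not real financial figures.
example_statement = {
    "Revenue": {2022: 100.0, 2023: 110.0, 2024: 120.0},
    "Total Assets": {2022: 500.0, 2023: None},
}

print(find_gaps(example_statement, [2022, 2023, 2024]))
```

Here the result lists 2023 and 2024 for "Total Assets", which are the cells worth checking against the document list before relying on the statement.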
Common Tasks
Extract Data for All Companies
# For each company, trigger extraction
for company_id in 1 2 3 4 5 6; do
  curl -X POST "http://localhost:3030/api/v1/tasks/companies/${company_id}/extract"
done
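The shell loop above discards the task IDs, which you need for monitoring. A hedged Python sketch of the same loop that keeps them is shown below; `trigger_extraction` and `trigger_all` are illustrative names, and the code assumes each response contains a `task_id` field as in Step 2.

```python
import json
from urllib.request import Request, urlopen

# Assumption: the backend is reachable at the address used in this tutorial.
BASE_URL = "http://localhost:3030/api/v1"


def trigger_extraction(company_id):
    """POST to the extract endpoint and return the task ID."""
    request = Request(f"{BASE_URL}/tasks/companies/{company_id}/extract", method="POST")
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))["task_id"]


def trigger_all(company_ids, trigger=trigger_extraction):
    """Trigger every company; return ({company_id: task_id}, {company_id: error})."""
    task_ids, errors = {}, {}
    for company_id in company_ids:
        try:
            task_ids[company_id] = trigger(company_id)
        except OSError as exc:  # urllib's URLError and HTTPError subclass OSError
            errors[company_id] = str(exc)
    return task_ids, errors


# Usage (requires the backend to be running):
#   task_ids, errors = trigger_all(range(1, 7))
```

Collecting errors per company means one unreachable endpoint or failed request does not stop the remaining extractions from being queued.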
Recompile Statements
If you’ve added new extractions and want to recompile:
curl -X POST http://localhost:3030/api/v1/tasks/companies/1/recompile
Download a PDF
# Get document info
curl http://localhost:3030/api/v1/documents/DOCUMENT_ID
# Download PDF from storage
curl "http://localhost:3030/api/v1/documents/storage/download?object_key=company_1/2023/annual_report.pdf" \
--output document.pdf
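When the object key comes from a script rather than being typed by hand, it is safer to URL-encode it, since keys contain slashes and may contain spaces. The following sketch builds the download URL with the standard library; the helper name is hypothetical and the object key is the example from the command above.

```python
from urllib.parse import urlencode

# Assumption: the backend is reachable at the address used in this tutorial.
BASE_URL = "http://localhost:3030/api/v1"


def download_url(object_key):
    """Build the storage download URL with a properly encoded object_key."""
    query = urlencode({"object_key": object_key})
    return f"{BASE_URL}/documents/storage/download?{query}"


print(download_url("company_1/2023/annual_report.pdf"))

# To save the file (requires the backend to be running):
#   from urllib.request import urlretrieve
#   urlretrieve(download_url("company_1/2023/annual_report.pdf"), "document.pdf")
```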
Next Steps
Now that you’ve completed your first extraction:
- Explore the API - Learn all available endpoints
- Understand the Architecture - Learn how the system works
- Database Schema - Understand data structures
- Task Processing - Learn about Celery tasks
Troubleshooting
Extraction Takes Too Long:
- This is usually normal: a full extraction can take anywhere from 10 minutes to 2 hours per company
- Check Celery worker logs for progress
- Monitor Flower dashboard for task status
No Documents Found:
- Check that the company's investor relations URL is correct
- Verify website is accessible
- Check scraping worker logs
LLM Extraction Fails:
- Verify OpenRouter API key is valid
- Check API credits/balance
- Review extraction worker logs
Compilation Shows Missing Data:
- Some years may not have reports
- Check document list for available fiscal years
- Verify extractions were successful
Tips
- Start Small: Test with one company first
- Monitor Resources: Watch Docker container resources during extraction
- Use Flower: Flower dashboard provides excellent visibility into task execution
- Check Logs: Review worker logs for detailed execution information
- Verify Data: Spot-check extracted values against the source report before relying on them