Object Storage with MinIO
This document describes how the financial data extractor uses MinIO for object storage of PDF documents.
Overview
The platform uses MinIO (S3-compatible object storage) for storing PDF files instead of local file system storage. This provides:
- Scalability: Object storage is designed for large-scale data
- Reliability: Redundancy and data protection when MinIO is deployed across multiple drives or nodes
- Portability: S3-compatible, can be migrated to AWS S3 or other providers
- Development: Easy local development with Docker
Architecture
Storage Service
The platform includes a unified storage service (app.core.storage) that supports:
- MinIOStorageService: For object storage (S3-compatible)
- LegacyLocalStorageService: Fallback for local file system
- StorageService: Unified wrapper with automatic fallback
Configuration
Storage is configured via environment variables in .env:
# MinIO (S3-compatible) object storage configuration
MINIO_ENABLED=true
MINIO_ENDPOINT=localhost:9000
MINIO_ACCESS_KEY=minioadmin
MINIO_SECRET_KEY=minioadmin
MINIO_BUCKET_NAME=financial-documents
MINIO_USE_SSL=false
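As an illustration of how these variables might be read at startup, here is a minimal sketch. The `MinioSettings` dataclass and `load_minio_settings` function are hypothetical names, not part of app.core.storage, and the fallback defaults are assumptions based on the development values above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class MinioSettings:
    """Hypothetical container for the MINIO_* variables (not the project's class)."""
    enabled: bool
    endpoint: str
    access_key: str
    secret_key: str
    bucket_name: str
    use_ssl: bool


def _as_bool(value: str) -> bool:
    # Treat common truthy spellings as True; anything else is False.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_minio_settings(env: Mapping[str, str] = os.environ) -> MinioSettings:
    # Credentials deliberately have no usable default, so a missing key
    # shows up as an empty string instead of silently using dev credentials.
    return MinioSettings(
        enabled=_as_bool(env.get("MINIO_ENABLED", "false")),
        endpoint=env.get("MINIO_ENDPOINT", "localhost:9000"),
        access_key=env.get("MINIO_ACCESS_KEY", ""),
        secret_key=env.get("MINIO_SECRET_KEY", ""),
        bucket_name=env.get("MINIO_BUCKET_NAME", "financial-documents"),
        use_ssl=_as_bool(env.get("MINIO_USE_SSL", "false")),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.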
Object Key Structure
PDF files are stored using the following key structure:
company_{company_id}/{fiscal_year}/{filename}.pdf
Example:
company_1/2023/annual_report_2023.pdf
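A key following this structure could be built with a small helper like the one below. This is a sketch only: `build_object_key` is a hypothetical name, and stripping directory parts and a trailing ".pdf" from the input is an assumption about how filenames arrive.

```python
def build_object_key(company_id: int, fiscal_year: int, filename: str) -> str:
    """Build a key of the form company_{company_id}/{fiscal_year}/{filename}.pdf."""
    # Keep only the last path component so the key has exactly three segments.
    stem = filename.replace("\\", "/").rsplit("/", 1)[-1]
    # Drop an existing ".pdf" suffix so the extension is not doubled.
    if stem.lower().endswith(".pdf"):
        stem = stem[:-4]
    if not stem:
        raise ValueError("filename must not be empty")
    return f"company_{company_id}/{fiscal_year}/{stem}.pdf"
```

For example, `build_object_key(1, 2023, "annual_report_2023")` yields the key shown in the example above.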
Local Development
Starting MinIO with Docker Compose
MinIO is included in the development docker-compose setup:
cd infrastructure/docker
docker-compose -f docker-compose.dev.yml up -d minio
MinIO web console: http://localhost:9001
- Username: minioadmin
- Password: minioadmin
MinIO S3 API: http://localhost:9000
Accessing Files
You can access files via:
- MinIO Console: http://localhost:9001
- Storage Service API: Use the IStorageService interface in code
- Direct S3 API: Compatible with any S3 client
Bucket Creation
The bucket is automatically created on first use if it doesn’t exist. This is handled by the MinIOStorageService initialization.
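The create-if-missing check might look roughly like the sketch below. It is not the actual MinIOStorageService code: `ensure_bucket` is a hypothetical helper that relies only on the `bucket_exists` and `make_bucket` methods of the MinIO Python SDK client.

```python
def ensure_bucket(client, bucket_name: str) -> bool:
    """Create bucket_name if it is missing; return True when it was created."""
    # `client` is expected to behave like minio.Minio (assumption).
    if client.bucket_exists(bucket_name=bucket_name):
        return False
    client.make_bucket(bucket_name=bucket_name)
    return True


# Possible usage against the local development server (requires the `minio` package):
#
#   from minio import Minio
#   client = Minio(endpoint="localhost:9000", access_key="minioadmin",
#                  secret_key="minioadmin", secure=False)
#   ensure_bucket(client, "financial-documents")
```

Note that two processes starting at the same time could both see the bucket as missing; a real implementation may need to tolerate the "bucket already exists" error from the second `make_bucket` call.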
Production Deployment
Using AWS S3
To use AWS S3 instead of MinIO, update .env:
MINIO_ENABLED=true
MINIO_ENDPOINT=s3.amazonaws.com
MINIO_ACCESS_KEY=your_aws_access_key
MINIO_SECRET_KEY=your_aws_secret_key
MINIO_BUCKET_NAME=financial-documents-prod
MINIO_USE_SSL=true
Using MinIO in Production
For production MinIO deployment, consider:
- High Availability: Use distributed MinIO mode
- SSL/TLS: Enable SSL for secure connections
- Backup: Implement regular backups
- Monitoring: Set up MinIO monitoring
Migration from Local Storage
Automatic Fallback
The storage service supports automatic fallback:
- If MINIO_ENABLED=false, files are stored locally
- Local storage uses the same object key structure
- All code is compatible with both storage types
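The idea behind the points above can be sketched with two interchangeable backends selected by a flag. All names here (`Storage`, `InMemoryObjectStorage`, `LocalStorage`, `select_storage`) are hypothetical stand-ins rather than the classes in app.core.storage, and the in-memory class only imitates an object store.

```python
from pathlib import Path
from typing import Dict, Protocol


class Storage(Protocol):
    """Minimal interface shared by both backends (illustrative subset)."""
    async def save_file(self, file_content: bytes, object_key: str) -> str: ...
    async def get_file(self, object_key: str) -> bytes: ...


class InMemoryObjectStorage:
    """Stand-in for an object-storage backend; keeps objects in a dict."""
    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    async def save_file(self, file_content: bytes, object_key: str) -> str:
        self._objects[object_key] = file_content
        return f"memory://{object_key}"

    async def get_file(self, object_key: str) -> bytes:
        return self._objects[object_key]


class LocalStorage:
    """Stores each object at root/<object_key>, reusing the same key structure."""
    def __init__(self, root: Path) -> None:
        self._root = root

    async def save_file(self, file_content: bytes, object_key: str) -> str:
        path = self._root / object_key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(file_content)  # blocking I/O, kept for brevity
        return str(path)

    async def get_file(self, object_key: str) -> bytes:
        return (self._root / object_key).read_bytes()


def select_storage(minio_enabled: bool, local_root: Path) -> Storage:
    # Callers pass the same object keys whichever backend is returned.
    return InMemoryObjectStorage() if minio_enabled else LocalStorage(local_root)
```

Because both classes expose the same methods and accept the same keys, calling code does not need to know which backend is active.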
Data Migration
To migrate existing local PDFs to MinIO:
- Keep MINIO_ENABLED=false initially
- Download/process existing files
- Set MINIO_ENABLED=true for new uploads
- Optionally write a migration script to copy existing files
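A migration script for the optional last step could follow the outline below. This is a sketch under assumptions: `migrate_local_pdfs` is a hypothetical function, the local directory is assumed to already mirror the object key structure, and `save_file` is any async callable that uploads bytes under a key.

```python
from pathlib import Path
from typing import Awaitable, Callable, List


async def migrate_local_pdfs(
    local_root: Path,
    save_file: Callable[[bytes, str], Awaitable[object]],
) -> List[str]:
    """Upload every PDF under local_root and return the object keys that were copied."""
    migrated: List[str] = []
    for path in sorted(local_root.rglob("*.pdf")):
        if not path.is_file():
            continue
        # The path relative to the root doubles as the object key,
        # e.g. company_1/2023/annual_report.pdf
        object_key = path.relative_to(local_root).as_posix()
        await save_file(path.read_bytes(), object_key)
        migrated.append(object_key)
    return migrated
```

To use this with the storage service, `save_file` could be a small async wrapper that calls `storage.save_file(file_content=..., object_key=..., content_type="application/pdf")`. The sketch does not delete local files, so the originals remain available until the copy has been verified.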
Code Usage
Basic Operations
from app.core.storage import StorageServiceConfig, create_storage_service
# Create storage service
config = StorageServiceConfig(
    enabled=True,
    endpoint="localhost:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
    bucket_name="financial-documents",
    use_ssl=False,
)
storage = create_storage_service(config)
# Save file
object_key = "company_1/2023/annual_report.pdf"
storage_path = await storage.save_file(
    file_content=pdf_bytes,
    object_key=object_key,
    content_type="application/pdf",
)
# Get file
pdf_content = await storage.get_file(object_key)
# Delete file
await storage.delete_file(object_key)
# Check existence
exists = await storage.file_exists(object_key)
# Calculate hash
file_hash = await storage.calculate_file_hash(object_key)
# Get URL
url = storage.get_file_url(object_key)
In Workers
Workers automatically receive a storage service instance:
def __init__(self, session, progress_callback, storage_service):
    self.storage_service = storage_service
Monitoring
Health Checks
MinIO includes built-in health checks:
curl http://localhost:9000/minio/health/live
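The same liveness check can be done from Python, for instance in a startup script. The sketch below uses only the standard library; `minio_is_live` and its `opener` parameter are hypothetical names, with `opener` included so the function can be tested without a running server.

```python
import urllib.request


def minio_is_live(
    base_url: str = "http://localhost:9000",
    timeout: float = 2.0,
    opener=urllib.request.urlopen,
) -> bool:
    """Return True if the MinIO liveness endpoint answers with HTTP 200."""
    url = base_url.rstrip("/") + "/minio/health/live"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection errors, timeouts, and urllib's URLError/HTTPError.
        return False
```

Returning False instead of raising keeps the check convenient for retry loops that wait for the container to come up.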
Metrics
Monitor MinIO using:
- MinIO Console dashboard
- Prometheus metrics endpoint
- S3-compatible API metrics
Troubleshooting
Connection Issues
- Check MinIO is running: docker ps | grep minio
- Verify endpoint: curl http://localhost:9000/minio/health/live
- Check credentials in .env
Bucket Not Found
- Buckets are auto-created on first use
- Check permissions on access keys
- Review MinIO logs
File Not Found
- Verify object key format
- Check bucket name matches configuration
- Review MinIO console for file existence
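When checking the object key format, a small parser can help spot malformed keys quickly. This is a sketch only: `parse_object_key` is a hypothetical helper, and it assumes a numeric company ID and a four-digit fiscal year, which the example key suggests but the project may not enforce.

```python
import re
from typing import Optional, Tuple

# Assumed shape: company_<digits>/<4-digit year>/<name>.pdf
_KEY_PATTERN = re.compile(r"company_(\d+)/(\d{4})/([^/]+)\.pdf")


def parse_object_key(object_key: str) -> Optional[Tuple[int, int, str]]:
    """Return (company_id, fiscal_year, filename) or None if the key is malformed."""
    match = _KEY_PATTERN.fullmatch(object_key)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)
```

A result of None for a key such as `company_1/annual_report.pdf` (missing fiscal year) points to a key-construction problem rather than a missing object.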
Security
Development
Default credentials are fine for local development but should never be used in production.
Production
- Use strong access keys
- Enable SSL/TLS
- Implement bucket policies
- Use IAM roles (AWS) or access policies (MinIO)
- Enable encryption at rest