Scale ingestion
Use Filedge at scale by controlling three things:
- how many files each Run scans
- how much work each file contains
- how many workers write to the destination at the same time
The safest pattern is to split incoming data into bounded partitions, run one or more workers per partition, and let the Audit DB coordinate which worker owns each File.
Quick recipe
- Land complete files into time-partitioned prefixes:
s3://warehouse-landing/orders/dt=2026-05-24/hour=00/
s3://warehouse-landing/orders/dt=2026-05-24/hour=01/
- Compact tiny files before loading:
filedge compact \
--watched-dir s3://warehouse-landing/orders/dt=2026-05-24/hour=00/ \
--output s3://warehouse-landing/orders-compacted/dt=2026-05-24/hour=00/ \
--max-files 5000 \
--compress
- Run workers against bounded prefixes:
filedge run \
--dir s3://warehouse-landing/orders-compacted/dt=2026-05-24/hour=00/ \
--config pipeline.yaml \
--audit-db-url "$FILEDGE_AUDIT_DB_URL" \
--json \
--log-format json \
--no-progress
- Use a shared PostgreSQL Audit DB for parallel workers:
- Set
stale_timeout_minuteslonger than the slowest normal file load:
Scale dimensions
| Need | Use |
|---|---|
| Too many tiny files | filedge compact before filedge run |
| Too many files to scan | Partition the Watched Directory by time, tenant, region, or source |
| Very large files | Increase batch_size within memory limits, or split upstream into multiple complete files |
| More throughput | Run parallel workers on separate prefixes or non-overlapping file patterns |
| Shared worker coordination | Use a PostgreSQL Audit DB |
| Cheap local development | SQLite destination and audit DB, single worker |
Partition the Watched Directory
filedge run scans one flat directory or object-store prefix. Keep each Run bounded by making the prefix itself the unit of scheduling.
Good partition keys:
- time:
dt=YYYY-MM-DD/hour=HH - source:
stripe/,salesforce/,orders/ - tenant or region:
tenant=acme/,region=us/ - workload size:
large/,small/,backfill/
Example scheduler shape:
00:05 filedge run --dir s3://landing/orders/dt=2026-05-24/hour=00/ ...
01:05 filedge run --dir s3://landing/orders/dt=2026-05-24/hour=01/ ...
02:05 filedge run --dir s3://landing/orders/dt=2026-05-24/hour=02/ ...
This avoids repeatedly listing a huge bucket prefix and keeps retries localized to the partition where a failure happened.
Compact small files
Many small files are expensive because each File has its own hash, audit record, destination write, and commit. If an upstream system produces event-sized or page-sized NDJSON files, compact them first.
filedge compact \
--watched-dir s3://landing/events/dt=2026-05-24/hour=00/ \
--output s3://landing/events-compacted/dt=2026-05-24/hour=00/ \
--max-files 5000 \
--compress
Then load the compacted prefix:
filedge run \
--dir s3://landing/events-compacted/dt=2026-05-24/hour=00/ \
--config pipeline.yaml \
--audit-db-url "$FILEDGE_AUDIT_DB_URL"
See Compact small files for manifest mode, delete-source mode, and crash recovery.
Parallel workers
Parallel workers are safe when they share an Audit DB and the destination connector can handle concurrent writes. Each worker tries to claim PENDING files by Content Hash; only one worker can move a File to PROCESSING.
Prefer one of these shapes:
Parallel by prefix
Run one worker per partition:
filedge run --dir s3://landing/orders/dt=2026-05-24/hour=00/ --config pipeline.yaml
filedge run --dir s3://landing/orders/dt=2026-05-24/hour=01/ --config pipeline.yaml
filedge run --dir s3://landing/orders/dt=2026-05-24/hour=02/ --config pipeline.yaml
This is the easiest mode to operate because workers do not repeatedly scan the same files.
Parallel by file pattern
If files must land in one flat prefix, shard them by filename and run non-overlapping patterns:
Then schedule both:
filedge run --dir s3://landing/orders/ --config pipeline-a.yaml
filedge run --dir s3://landing/orders/ --config pipeline-b.yaml
Use file patterns only when they are truly non-overlapping. Overlap is still protected by the Audit DB, but each worker pays the scan and hash cost for files it can see.
Parallel against the same prefix
Multiple workers can point at the same --dir and the same pipeline.yaml, but this is usually less efficient because each worker scans and hashes the same files before claiming work. Use it only when destination write time dominates scan time.
When doing this:
- use PostgreSQL for the Audit DB
- set
stale_timeout_minuteslonger than the slowest normal File load - use
--no-progress --log-format json --jsonin schedulers - watch destination connector limits, quotas, and table locks
Large files
Filedge streams rows and writes them in batches, so memory is bounded by batch_size, not full file size.
Start with:
Tune from there:
- increase
batch_sizewhen database round trips dominate and memory is stable - decrease
batch_sizewhen workers use too much memory - split very large upstream exports into several complete files if one file takes longer than the desired retry window
- avoid making files so small that destination transaction overhead dominates
As a rule of thumb, aim for files that take minutes, not hours, to load. Long-running files reduce parallelism and make stale-lock tuning harder.
Connector guidance
| Destination | Parallel guidance |
|---|---|
| PostgreSQL | Good for parallel workers; monitor table indexes, locks, and connection limits |
| BigQuery | Good for parallel bulk loads; monitor load-job quotas and the 7-day job ID idempotency window |
| Databricks | Good for staged warehouse loads; monitor warehouse concurrency and staging throughput |
| DuckDB | Single-writer; run one worker per destination file |
| SQLite destination | Single-writer; use for local dev or small single-worker deployments |
The Audit DB can be SQLite for local single-worker runs. Use PostgreSQL as the Audit DB when multiple workers run at the same time.
Operational checklist
Before increasing parallelism:
filedge healthcheckpasses for the Audit DB and destination connector- prefixes are bounded by time or another partition key
stale_timeout_minutesis greater than the expected slowest File load- workers emit JSON logs and JSON run summaries
filedge status --jsonis monitored forFAILEDand long-livedPROCESSINGfiles- destination quotas, lock behavior, and connection limits are understood
After increasing parallelism:
- compare
files_scanned,committed,failed,rows_committed, andduration_sfrom--json - watch for repeated retries of the same Content Hash
- watch destination-side latency and throttling
- reduce worker count if scan cost or destination contention grows faster than committed throughput
Backfills
For backfills, keep historical data isolated from the normal ingestion path:
Run backfill workers with their own schedule and a conservative worker count. They may share the same Audit DB and destination table as normal ingestion; Content Hash idempotency still prevents duplicate file loads.
If the backfill uses a different schema or destination table, use a separate pipeline.yaml.