Ingesting API-Sourced Files
Filedge does not fetch from SaaS APIs directly. It ingests Files.
For API sources such as Stripe, Salesforce, HubSpot, Jira, or GitHub, use an upstream Fetcher to land complete NDJSON files in a Watched Directory. Then run Filedge against that directory. This keeps API-sourced data on the same ingestion path as file drops: Content Hash deduplication, PENDING -> COMMITTED audit state, strict validation, row-level provenance, and filedge status visibility.
See ADR-0006.
The boundary:
The Fetcher owns API behavior. Filedge starts when complete files appear in the Watched Directory.
Fetchers
A Fetcher can be any tool or job that writes complete files:
- a dlt pipeline configured to write NDJSON files
- a vendor export job
- a custom script
- Airbyte or Meltano configured to write files
- an existing internal ingestion job
dlt is a useful option because it already handles many SaaS APIs, but it is not a Filedge dependency and it is not the loader of record. The loader of record is still filedge run.
The Fetcher must guarantee:
- only complete files are promoted into the Watched Directory
- failed or partial fetches remain in staging or are deleted
- filenames are unique enough for operators to understand where they came from
- file contents are stable once visible to Filedge
Example: Stripe Events -> S3 -> BigQuery
1. Fetch to files
Configure your Fetcher to write complete NDJSON files to a staging prefix, then promote them to the Watched Directory:
s3://my-bucket/api-staging/stripe/
stripe_events_20260522T140000_0001.ndjson.tmp
s3://my-bucket/landing/stripe/
stripe_events_20260522T140000_0001.ndjson
The exact Fetcher command is intentionally outside Filedge's contract. For dlt, that may be a small project-specific Python script. For another organization, it may be a scheduled vendor export or an internal platform job.
2. Ingest
filedge run \
--dir s3://my-bucket/landing/stripe/ \
--config pipeline.yaml \
--audit-db-url $FILEDGE_AUDIT_DB_URL
3. Configure the file schema
pipeline.yaml describes the files that the Fetcher lands:
format: ndjson
connector:
type: bigquery
project: my-gcp-project
dataset: raw
destination_table: stripe_events
write_mode: append
columns:
- source: id
dest: id
type: string
required: true
- source: type
dest: type
type: string
required: true
- source: created
dest: created
type: timestamp
required: true
- source: data
dest: data
type: string
required: false
Example: Salesforce -> local files -> PostgreSQL
An internal job or external Fetcher lands complete files:
Then Filedge ingests those files:
filedge run \
--dir /data/landing/salesforce/ \
--config pipeline.yaml \
--audit-db-url $FILEDGE_AUDIT_DB_URL
Scheduling
Schedule the Fetcher before filedge run. Different API sources can run on different cadences. Filedge does not need to know which tool produced the files.
Cloud Scheduler / EventBridge:
Every 15 min:
├── run-stripe-fetcher --output s3://.../landing/stripe/
Every hour:
├── run-salesforce-fetcher --output s3://.../landing/salesforce/
Every hour + 10 min:
├── filedge run --dir s3://.../landing/stripe/ --config pipeline.yaml ...
└── filedge run --dir s3://.../landing/salesforce/ --config pipeline.yaml ...
Airflow:
fetch_stripe = BashOperator(
task_id="fetch_stripe",
bash_command="run-stripe-fetcher --output s3://.../landing/stripe/",
)
ingest_stripe = BashOperator(
task_id="ingest_stripe",
bash_command="filedge run --dir s3://.../landing/stripe/ --config pipeline.yaml ...",
)
fetch_stripe >> ingest_stripe
Audit visibility
Because API-sourced data passes through the same pipeline:
filedge status --audit-db-url $FILEDGE_AUDIT_DB_URL
PENDING: 0
PROCESSING: 0
COMMITTED: 142
FAILED: 1
Recent failures:
stripe_20260522T140000_0001.ndjson: schema mismatch on column 'amount_decimal'
Every destination row carries _source_file_hash linking it back to the exact fetched file. An auditor asking "what Stripe data landed on 2026-05-22 and which rows did it produce?" can answer from Filedge's Audit DB and destination provenance columns without knowing which Fetcher produced the file.
Responsibility boundary
| Concern | Owner |
|---|---|
| API authentication | Fetcher |
| Pagination | Fetcher |
| Rate limiting | Fetcher |
| Incremental cursor (last fetched ID/timestamp) | Fetcher |
| Concurrent fetch prevention | Fetcher or scheduler |
| Partial-fetch atomicity (staging -> Watched Directory) | Fetcher |
| File-level deduplication (content hash) | filedge run |
| PENDING → COMMITTED state machine | filedge run |
Row-level provenance (_source_file_hash) |
filedge run |
| Retry on ingestion failure | filedge run |
Operator visibility (filedge status) |
filedge run |