Design decisions
These architecture decision records (ADRs) capture the non-obvious choices made during development: decisions that anyone reading the code would otherwise have to rediscover from scratch.
ADR-0001: Single-transaction commit
Ingested records and the file-level audit marker are written in a single database transaction. Either both land or neither does.
The alternative, a separate audit store updated after the data write, needs a two-phase commit or saga pattern to be safe, and even then a failure window remains between the two writes. The cost of a single transaction is that the destination must be a transactional database, which is an acceptable trade-off for a reliability-first system.
Superseded by: ADR-0004 (the audit/connector split later revisited this constraint).
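As an illustration of the original single-transaction design, here is a minimal sketch using sqlite3 from the Python standard library as a stand-in destination. The table names (records, file_audit), the COMMITTED status value, and the helper names are hypothetical and are not taken from the Filedge codebase.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT, source_file_hash TEXT)"
    )
    conn.execute("CREATE TABLE file_audit (file_hash TEXT PRIMARY KEY, status TEXT)")
    return conn


def commit_file(conn: sqlite3.Connection, file_hash: str, rows: list[tuple[int, str]]) -> None:
    """Write the ingested rows and the file-level audit marker atomically."""
    # `with conn` commits on success and rolls back if any statement raises,
    # so the rows and the marker either both land or neither does.
    with conn:
        conn.executemany(
            "INSERT INTO records (id, payload, source_file_hash) VALUES (?, ?, ?)",
            [(row_id, payload, file_hash) for row_id, payload in rows],
        )
        conn.execute(
            "INSERT INTO file_audit (file_hash, status) VALUES (?, 'COMMITTED')",
            (file_hash,),
        )
```

If any insert fails (for example on a duplicate primary key), the rollback also removes the rows already written for that file, so no audit marker ever exists without its data.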
ADR-0002: Content hash as idempotency key
Files are identified by SHA-256 of their bytes, not by filename.
This prevents a specific failure mode: if filename were the key, a file re-dropped under the same name with corrected content would be silently skipped. The inverse also holds: two files with identical content but different names are treated as one load.
The trade-off is that two genuinely distinct files with byte-identical content produce only one set of rows. That is the intended behavior: an audit system should not double-count identical data.
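A minimal sketch of computing such a key with hashlib from the Python standard library. The function name content_hash and the chunk size are illustrative assumptions, not Filedge's actual API.

```python
import hashlib
from pathlib import Path


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file's bytes, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the filename never enters the digest, renaming a file leaves the key unchanged, while changing a single byte produces a different key.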
ADR-0003: Whole-file failure on validation error (strict mode)
If any row in a file fails schema validation, the entire file fails — no records are committed.
Lenient mode (commit valid rows, discard bad ones) was rejected because partial commits make destination completeness unverifiable. From the destination alone, a table that received 9,800 of 10,000 rows is indistinguishable from one that received all 10,000. FAILED is an unambiguous signal: nothing landed, a retry is safe, and the fix belongs in the source data.
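The all-or-nothing rule can be sketched as "validate everything, then commit once". In this sketch the names load_strict, is_valid, and commit are hypothetical, as is the COMMITTED status string; only FAILED comes from the text above.

```python
from typing import Callable, Iterable

Row = dict[str, object]


def load_strict(
    rows: Iterable[Row],
    is_valid: Callable[[Row], bool],
    commit: Callable[[list[Row]], None],
) -> str:
    """Commit every row of a file, or none of them."""
    buffered = list(rows)
    # Validate the whole file before touching the destination.
    bad_indexes = [i for i, row in enumerate(buffered) if not is_valid(row)]
    if bad_indexes:
        # A single invalid row fails the file; commit() is never called.
        return "FAILED"
    commit(buffered)
    return "COMMITTED"
```

Buffering the whole file keeps the sketch short; an implementation handling large files might instead validate in a first streaming pass and write in a second.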
ADR-0004: Separate audit DB from destination connector
The audit database (state machine, attempt counts, provenance) and the destination (ingested rows) are separate systems with separate connections and transactions.
This enables pluggable destinations (BigQuery, DuckDB, etc.) that cannot share a transaction with a SQL audit DB. The connector is responsible for making write_rows idempotent per file_hash; the stale-lock reclaim handles the failure window between the two writes.
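One way a connector could make write_rows idempotent per file_hash is delete-then-insert inside a single destination transaction. The sketch below uses sqlite3 as a stand-in destination; the records table, the row shape, and make_destination are assumptions for illustration, and a real connector may use a different strategy (for example a merge statement).

```python
import sqlite3


def make_destination() -> sqlite3.Connection:
    """Create an in-memory stand-in destination with a hypothetical table."""
    dest = sqlite3.connect(":memory:")
    dest.execute("CREATE TABLE records (id INTEGER, payload TEXT, _source_file_hash TEXT)")
    return dest


def write_rows(dest: sqlite3.Connection, file_hash: str, rows: list[tuple[int, str]]) -> None:
    """Write a file's rows so that repeating the call leaves the same end state."""
    with dest:  # one destination-side transaction
        # Remove whatever an earlier, possibly unacknowledged attempt wrote.
        dest.execute("DELETE FROM records WHERE _source_file_hash = ?", (file_hash,))
        dest.executemany(
            "INSERT INTO records (id, payload, _source_file_hash) VALUES (?, ?, ?)",
            [(row_id, payload, file_hash) for row_id, payload in rows],
        )
```

With this property, a retry after a crash between the destination write and the audit update re-writes the same rows instead of duplicating them.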
ADR-0005: SFTP out of scope
SFTP is not a supported Watched Directory source. Use a dedicated sync layer (rclone, lftp, AWS Transfer Family) to land files in a local directory or cloud bucket first.
Two reasons. First, SFTP protocol diversity and auth complexity are better handled by purpose-built tools. Second, reliable incremental SFTP sync is a project in itself that would dominate the codebase without contributing to the core value.
ADR-0006: API sources materialized as files before ingestion
API data (Stripe, Salesforce, HubSpot, etc.) is materialized as complete NDJSON files by an upstream Fetcher before filedge run ingests it. dlt can be one such Fetcher, but it is not a Filedge dependency and does not load directly to the destination in the Filedge model.
This preserves a single audit model for all data sources: Filedge starts at the File boundary, then applies Content Hash deduplication, strict validation, row-level provenance, and the same audit state machine. For fintech operators, audit uniformity across all sources is non-negotiable.
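A minimal sketch of what an upstream Fetcher might do to hand over a complete NDJSON file. Everything here is an assumption for illustration: the function name, the temp-file-then-rename technique, and the sorted keys are not prescribed by Filedge. The rename step relies on os.replace being atomic when source and target are on the same filesystem, so a reader never sees a half-written file under the final name.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Iterable


def materialize_ndjson(records: Iterable[dict], dest: Path) -> Path:
    """Write records as NDJSON to a temp file, then rename it into place."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, prefix=".partial-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            for record in records:
                # sort_keys gives stable bytes for the same data, and therefore a stable hash.
                fh.write(json.dumps(record, sort_keys=True) + "\n")
        os.replace(tmp_name, dest)
    except BaseException:
        # Clean up the partial file if writing or renaming failed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return dest
```

In this sketch the watcher would need to ignore the dot-prefixed temporary files, or the Fetcher would need to stage them in a sibling directory on the same filesystem.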
ADR-0007: Queue sources materialized as files before ingestion
Queue data (Kafka, SQS, Kinesis, etc.) is materialized as complete NDJSON or Parquet files by an upstream Queue Materializer before filedge run ingests it. Kafka Connect, Flink, Spark, Vector, Benthos, cloud delivery services, and custom consumers can all play this role.
This keeps Filedge's boundary consistent with SFTP and API sources: external tools handle transport-specific complexity, then Filedge applies Content Hash deduplication, strict validation, row-level provenance, retry behavior, and the same audit state machine to complete Files.
ADR-0008: Schema inference confidence tiers
filedge inspect annotates each inferred column with a confidence tier (high / low / ambiguous) rather than silently picking the most specific type or defaulting everything to string.
Aggressive inference misleads operators when sparse nulls or format exceptions appear beyond the sample window. Conservative inference produces configs full of string columns that defeat the tool's purpose. Annotated tiers give operators exactly the signal they need: "this column is fine, that one needs your eyes."
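The idea of annotating an inferred type with a tier can be sketched as follows. The rules used here are invented for illustration and are not the rules filedge inspect applies: a column is "high" when every sampled value is present and parses as one type, "low" when the present values agree but some samples are missing, and "ambiguous" when values disagree or nothing is present.

```python
from typing import Optional


def _kind(value: str) -> str:
    """Classify one sample as integer, float, or string (illustrative rules only)."""
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "string"


def infer_column(samples: list[Optional[str]]) -> tuple[str, str]:
    """Return (inferred_type, confidence_tier) for one column's sampled values."""
    present = [s for s in samples if s is not None and s != ""]
    if not present:
        return ("string", "ambiguous")  # nothing to infer from
    kinds = {_kind(s) for s in present}
    if len(kinds) > 1:
        return ("string", "ambiguous")  # mixed evidence: flag for the operator
    tier = "high" if len(present) == len(samples) else "low"
    return (kinds.pop(), tier)
```

For example, under these assumed rules a column sampled as "1", "2", "3" is reported as an integer with high confidence, while "1", a missing value, "3" is still an integer but flagged low because of the gap.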
ADR-0009: Warehouse CDC uses Destination-side Applied File Markers
Warehouse CDC Connectors use a Destination-side Applied File Marker keyed by destination table and Content Hash to make retries safe across the Audit DB / Destination failure window.
This is needed because replaying a CDC File re-applies business-key mutations, which cannot be made safe by row-level _source_file_hash alone. The Applied File Marker complements the Audit DB; it does not replace the Audit Record.
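A minimal sketch of the marker check, using sqlite3 as a stand-in warehouse. The applied_files and customers tables, the mutation tuple shape, and the function names are hypothetical. The marker is written in the same destination transaction as the mutations, so a replayed File is either fully applied with its marker or skipped.

```python
import sqlite3
from typing import Optional


def make_warehouse() -> sqlite3.Connection:
    """Create an in-memory stand-in warehouse with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE applied_files (dest_table TEXT, content_hash TEXT, "
        "PRIMARY KEY (dest_table, content_hash))"
    )
    return conn


def apply_cdc_file(
    conn: sqlite3.Connection,
    content_hash: str,
    mutations: list[tuple[str, int, Optional[str]]],
    dest_table: str = "customers",
) -> bool:
    """Apply a CDC File once; return False if its marker already exists."""
    with conn:  # mutations and marker commit together or not at all
        seen = conn.execute(
            "SELECT 1 FROM applied_files WHERE dest_table = ? AND content_hash = ?",
            (dest_table, content_hash),
        ).fetchone()
        if seen is not None:
            return False  # replay: skip so older values do not overwrite newer ones
        for op, key, name in mutations:
            if op == "delete":
                conn.execute("DELETE FROM customers WHERE id = ?", (key,))
            else:  # treat anything else as an upsert by business key
                conn.execute(
                    "INSERT OR REPLACE INTO customers (id, name) VALUES (?, ?)",
                    (key, name),
                )
        conn.execute(
            "INSERT INTO applied_files (dest_table, content_hash) VALUES (?, ?)",
            (dest_table, content_hash),
        )
    return True
```

Without the marker, retrying an older File after a newer one had been applied would reset the business key to its older value, which is the failure this ADR guards against.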