Content hash (SHA-256) as the file idempotency key

Files are identified for deduplication by the SHA-256 hash of their content, not by their filename. The filename is stored as metadata but is not the identity. This prevents a specific failure mode: if the filename were the key, a file re-dropped under the same name with corrected content would be silently skipped, and the correction would never land. The inverse case is also handled safely: two files with identical content but different names are treated as one load. The tradeoff is that two distinct files that share a name but differ in content both commit, producing two sets of rows in the destination. This is correct behavior for an audit system; deduplicating those rows downstream, using the provenance columns, is the operator's responsibility.
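The keying rule can be illustrated with a minimal Python sketch. It is not the system's actual implementation: the `LoadLedger` class, its `commit` method, and the in-memory dictionary are hypothetical stand-ins for whatever durable store records committed loads, and file content is passed as bytes to keep the example self-contained.

```python
import hashlib


def content_key(data: bytes) -> str:
    """Return the SHA-256 hex digest of the file content (the idempotency key)."""
    return hashlib.sha256(data).hexdigest()


class LoadLedger:
    """Hypothetical in-memory ledger of committed loads, keyed by content hash."""

    def __init__(self) -> None:
        # Maps content hash -> filename of the first file seen with that content.
        # The filename is kept as metadata only; it plays no part in identity.
        self._committed: dict[str, str] = {}

    def commit(self, filename: str, data: bytes) -> bool:
        """Record the file if its content is new; return False if it is a duplicate."""
        key = content_key(data)
        if key in self._committed:
            return False  # same bytes already loaded, whatever the name
        self._committed[key] = filename
        return True


ledger = LoadLedger()
print(ledger.commit("orders.csv", b"id,total\n1,10\n"))       # True: first load
print(ledger.commit("orders.csv", b"id,total\n1,12\n"))       # True: same name, corrected content
print(ledger.commit("orders_copy.csv", b"id,total\n1,10\n"))  # False: new name, identical content
```

The second call shows the correction landing despite the reused filename, and the third shows a renamed copy being skipped. In a real pipeline the hash would more likely be computed by streaming the file in chunks, and the ledger would live in the destination database so the check survives restarts; both are assumptions beyond what the text above specifies.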