
The challenges of soft delete

01.19.26

Software projects often implement "soft delete", maybe with a deleted boolean or an archived_at timestamp column.
If customers accidentally delete their data, they can recover it, which makes work easier for customer support teams.
Perhaps archived records are even required for compliance or audit reasons.
I've run into some trouble with soft delete designs. I'll cover those, and ponder ideas for how I'd build this in the future.
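For reference, the archived_at flavor usually looks something like this – a minimal sketch with a hypothetical documents table:
CREATE TABLE documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    title TEXT NOT NULL,
    archived_at TIMESTAMPTZ  -- NULL means the row is live
);

-- "Deleting" a document just stamps it
UPDATE documents SET archived_at = NOW() WHERE title = 'old report';

-- Every read of live data now has to remember this filter
SELECT * FROM documents WHERE archived_at IS NULL;
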
Adding an archived_at column seems to ooze complexity out into queries, operations, and applications.
Recovering deleted records does happen, but 99% of archived records are never going to be read.
So, the database tables will have a lot of dead data. Depending on access patterns, that might even be a significant amount of data.
I've seen APIs that didn't work well with Terraform, so Terraform would delete + recreate records on every run, and over time that led
to millions of dead rows. Your database can probably handle the extra bytes, and storage is fairly cheap, so it's not necessarily a problem, at first.
Hopefully, the project decided on a retention period in the beginning, and set up a periodic job to clean up those rows.
Unfortunately, I'd bet that a significant percentage of projects did neither – it's really easy to ignore the archived data for a long time.
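The cleanup job itself is small – a sketch, assuming a 90-day retention period and the hypothetical documents table above:
-- Run periodically (cron, pg_cron, a scheduled task, ...):
-- permanently remove rows archived more than 90 days ago
DELETE FROM documents
WHERE archived_at IS NOT NULL
  AND archived_at < NOW() - INTERVAL '90 days';
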
At some point, someone might want to restore a database backup. Hopefully that's for fun and profit and not because you lost the production database at 11 am.
If your project is popular, you might have a giant database full of dead data that takes a long time to recreate from a dump file.
archived_at columns also complicate queries, operations, and application code. Applications need to make sure they always avoid the archived data that's sitting
right next to the live data. Indexes need to exclude archived rows (for example with partial indexes) to stay efficient. Manual queries run for debugging or analytics get longer and more complicated.
There's always a risk that archived data accidentally leaks in when it's not wanted. The complexity grows when there are mapping tables involved.
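A partial index helps keep the live-data path fast, but every query still has to remember the filter – a sketch, again against the hypothetical documents table:
-- Index only the live rows, so archived data doesn't bloat hot indexes
CREATE INDEX idx_documents_live_title
    ON documents (title)
    WHERE archived_at IS NULL;

-- Forgetting the filter on even one query (or one join through a
-- mapping table) silently pulls archived rows back in
SELECT * FROM documents WHERE title = 'quarterly report';  -- includes archived rows
SELECT * FROM documents WHERE title = 'quarterly report' AND archived_at IS NULL;
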
Migrations have to deal with archived data too. Migrations may involve more than just schema changes – perhaps you need to fix a mistake with default values, or add a new column and backfill values.
Is that going to work on records from 2 years ago? I've done migrations where these questions were not trivial to answer.
Restoring an archived record is not always as simple as just running SET archived_at = null – creating a record may involve making calls to external systems as well.
I've seen complex restoration code that was always a buggy, partial implementation of the "create" API endpoint. In the end, we removed the specialized restoration code
and required all restoration to go through the standard APIs – that simplified the server implementation and ensured that old data which had since become invalid could not
be restored incorrectly, because it has to pass the current validation rules.
I'm not a fan of the archived_at column approach. It's simple at first, but in my experience, it's full of pitfalls down the line.
Let's look at some alternatives (in PostgreSQL): application events, triggers, and logical replication.
All these approaches store archived data separately from live data – that may be a separate database table, a separate database, object storage, etc.
Application-level archiving
One team I worked with took the approach of emitting an event at the application layer when a record was deleted. The event was sent to SQS, and another service would archive that object to S3 (among other things).
This had a few big benefits:

The primary database and application code were substantially simpler.
Deleting a resource involved cleaning up resources in various external systems; handling this work in an async background system improved performance and reliability.
The record and all its related records could be serialized to JSON in an application-friendly layout, rather than a raw database-table layout, which made the data easier to work with.

The tradeoffs:

Bugs in the application code are more likely, and indeed this happened more than once – archived records were lost, and manual cleanup of external resources was necessary.
It's more infrastructure to understand and operate: multiple services, a message queue, etc.
Archived objects in S3 were not easy to query – finding records to restore required extra tooling for the customer support teams.

Triggers
A trigger can copy a row to an archive table before it's deleted. The archive table can be a single, generic table that stores JSON blobs:
CREATE TABLE archive (
    id UUID PRIMARY KEY,
    table_name TEXT NOT NULL,
    record_id TEXT NOT NULL,
    data JSONB NOT NULL,
    archived_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    caused_by_table TEXT,
    caused_by_id TEXT
);

CREATE INDEX idx_archive_table_record ON archive(table_name, record_id);
CREATE INDEX idx_archive_archived_at ON archive(archived_at);

The trigger function converts the deleted row to JSON:
CREATE OR REPLACE FUNCTION archive_on_delete()
RETURNS TRIGGER AS $$
BEGIN
    INSERT INTO archive (id, table_name, record_id, data)
    VALUES (
        gen_random_uuid(),
        TG_TABLE_NAME,
        OLD.id::TEXT,
        to_jsonb(OLD)
    );
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;

Attach this trigger to any table you want to archive:
CREATE TRIGGER archive_users
BEFORE DELETE ON users
FOR EACH ROW EXECUTE FUNCTION archive_on_delete();

CREATE TRIGGER archive_documents
BEFORE DELETE ON documents
FOR EACH ROW EXECUTE FUNCTION archive_on_delete();

Handling foreign key cascades
When a parent record is deleted, PostgreSQL cascades the delete to child records (assuming foreign keys with ON DELETE CASCADE). These child deletes also fire triggers, but in the context of a cascade, you often want to know why a record was deleted.
One approach is to use a session variable to track the root cause:
CREATE OR REPLACE FUNCTION archive_on_delete()
RETURNS TRIGGER AS $$
DECLARE
    cause_table TEXT;
    cause_id TEXT;
BEGIN
    -- Check if we're in a cascade context
    cause_table := current_setting('archive.cause_table', true);
    cause_id := current_setting('archive.cause_id', true);

    -- If this is a top-level delete, set ourselves as the cause.
    -- (current_setting may return an empty string rather than NULL if the
    -- variable was set earlier in the same session, so check for both.)
    IF cause_table IS NULL OR cause_table = '' THEN
        PERFORM set_config('archive.cause_table', TG_TABLE_NAME, true);
        PERFORM set_config('archive.cause_id', OLD.id::TEXT, true);
        cause_table := TG_TABLE_NAME;
        cause_id := OLD.id::TEXT;
    END IF;

    INSERT INTO archive (id, table_name, record_id, data, caused_by_table, caused_by_id)
    VALUES (
        gen_random_uuid(),
        TG_TABLE_NAME,
        OLD.id::TEXT,
        to_jsonb(OLD),
        cause_table,
        cause_id
    );
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;

Now when you delete a user, you can see which archived documents were deleted because of that user:
SELECT * FROM archive
WHERE caused_by_table = 'users'
AND caused_by_id = '123e4567-e89b-12d3-a456-426614174000';

Tradeoffs
Triggers add some overhead to deletes, and the archive table will grow. But:

Live tables stay clean – no archived_at columns, no dead rows
Cleaning up the archive table is trivial with WHERE archived_at < NOW() - INTERVAL '90 days'.
Queries don't need to filter out archived records
Indexes stay efficient
Applications and migrations only deal with live data
Backups of the main tables are smaller

The archive table can even live in a separate tablespace or be partitioned by time if it grows large.
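As a sketch (assuming a 90-day retention period), cleanup and time-based partitioning might look like this:
-- Periodic cleanup of the archive table itself
DELETE FROM archive WHERE archived_at < NOW() - INTERVAL '90 days';

-- If the archive grows large, it can instead be declared as partitioned
-- by time (this has to be done when the table is created)
CREATE TABLE archive (
    id UUID NOT NULL,
    table_name TEXT NOT NULL,
    record_id TEXT NOT NULL,
    data JSONB NOT NULL,
    archived_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    caused_by_table TEXT,
    caused_by_id TEXT,
    PRIMARY KEY (id, archived_at)  -- the partition key must be part of the key
) PARTITION BY RANGE (archived_at);

CREATE TABLE archive_2026_01 PARTITION OF archive
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');

-- Retention then becomes dropping whole partitions, which is much cheaper
DROP TABLE archive_2026_01;
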
WAL-based Change Data Capture
PostgreSQL's write-ahead log (WAL) records every change to the database. Change data capture (CDC) tools can read the WAL and stream those changes to external systems. For archiving, you'd filter for DELETE events and write the deleted records to another datastore.
Debezium is the most well-known tool for this. It connects to PostgreSQL's logical replication slot, reads changes, and publishes them to Kafka. From there, a consumer writes the data wherever you want – S3, Elasticsearch, another database, etc.
PostgreSQL → Debezium → Kafka → Consumer → Archive Storage

For simpler setups, there are lighter-weight alternatives:

pgstream – streams WAL changes directly to webhooks or message queues without Kafka
wal2json – a PostgreSQL plugin that outputs WAL changes as JSON, which you can consume with a custom script
pg_recvlogical – PostgreSQL's built-in tool for reading logical replication streams
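
As a rough sketch of the lighter-weight route (assuming wal_level = logical and the wal2json plugin installed on the server), the slot and its pending changes can even be inspected with plain SQL:
-- Create a logical replication slot that decodes WAL with wal2json
SELECT * FROM pg_create_logical_replication_slot('archive_slot', 'wal2json');

-- Peek at pending changes without consuming them; each "data" value is a
-- JSON document whose changes carry "kind": "insert", "update" or "delete"
SELECT data FROM pg_logical_slot_peek_changes('archive_slot', NULL, NULL);

A real consumer would read with pg_logical_slot_get_changes (or pg_recvlogical), keep the delete changes, and ship them to the archive store.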

Operational complexity
The main downside is operational overhead. You're running additional services that need to be monitored, maintained, and made fault-tolerant. Debezium with Kafka is a significant infrastructure investment – Kafka alone requires careful tuning and monitoring.
The lighter-weight alternatives reduce this burden but shift reliability concerns to your custom code. If your consumer crashes or falls behind, you need to handle that gracefully.
WAL retention and max_slot_wal_keep_size
A critical concern is WAL retention. The database retains WAL segments until all replication slots have consumed them – max_wal_size does not cap this. If your CDC consumer stops processing – due to a bug, network issue, or downstream failure – WAL segments accumulate on the primary.
If this continues unchecked, the primary database can run out of disk space and crash.
PostgreSQL 13+ has max_slot_wal_keep_size to limit how much WAL a slot can retain:
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';

If a slot falls too far behind, PostgreSQL invalidates it rather than filling the disk. This protects the primary but means your CDC pipeline loses data and needs to be re-synced from a snapshot.
You need monitoring and alerting on replication slot lag. If a slot starts falling behind, you want to know before it becomes a crisis.
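A starting point (assuming PostgreSQL 13+ for the wal_status column) is to watch how much WAL each slot is holding back:
-- How much WAL is each slot holding back, and is it at risk?
SELECT slot_name,
       active,
       wal_status,  -- 'lost' means the slot has been invalidated
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
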
Tradeoffs
The benefits of WAL-based CDC:

Captures all changes without modifying application code or adding triggers
Can stream to any destination (object storage, data warehouses, search indexes)
The primary database has no additional query load – it just writes WAL as normal

But:

Significant operational complexity, especially with Kafka-based setups
Risk to primary database stability if consumers fall behind
Schema changes require careful coordination between source and consumers
More infrastructure to understand, deploy, and debug

This approach makes the most sense when you already have Kafka or similar infrastructure, or when you need to stream changes to multiple destinations beyond just archiving.
Replica that doesn't process deletes
This is an idea I had never considered until I wrote this post – I haven't tested this, it's just an idea.
What if you kept a PostgreSQL replica (e.g. using logical replication) that just didn't process DELETE queries? Would it
effectively accumulate records and updates without conflict over time?
One potential benefit of this is that the archive can be easily queried, so finding old data is simple.
Would the replica have any information about deletes? Could it separate live from deleted data? Would you be able to find a record that was "deleted 2 hours ago in account 123" for a customer? Perhaps instead of ignoring DELETE queries entirely, you could have a specialized replica that transforms DELETE events into an archived_at column.
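Untested, but the first half of the idea might be as simple as a publication that doesn't publish deletes – a sketch:
-- On the primary: publish inserts and updates, but not deletes or truncates
CREATE PUBLICATION archive_pub FOR ALL TABLES
    WITH (publish = 'insert, update');

-- On the replica (the schema has to exist there already):
CREATE SUBSCRIPTION archive_sub
    CONNECTION 'host=primary dbname=app user=replicator'
    PUBLICATION archive_pub;

With this setup the replica keeps deleted rows, but it gets no signal about the delete itself, so the "transform deletes into archived_at" variant would need something extra on top.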
One potential pitfall here could be schema migrations – would the archive run into difficulty applying migrations over time?
Another downside is cost – running a replica and keeping all of that data around isn't free, and it adds operational overhead.
Wrapping up
If I were starting a new project today and needed soft delete, I'd reach for the trigger-based approach first. It's simple to set up, keeps live tables clean, and doesn't require extra infrastructure. The archive table is easy to query when you need it, and easy to ignore when you don't.
If you have thoughts, comments, feedback, shoot me an email at atlas9@eabuc.com.

