One Email. Two Invoices. The 10-Month Bug Hiding Behind a Slack Alert.
A Slack alert said “duplicate notification.”
A 2-line fix would have shipped it the same day.
Both were wrong.
What the trace actually said — once I bothered to pull the queries — was: one email, two invoices in production, for ten months, ~7,119 emails deep. One of them had been reprocessed 346 times in a 20-minute window. Another had spawned 128 invoices off a single PDF.
The gap between “a bit of extra Slack noise” and “we are silently creating duplicate payables in a procurement system” was one SQL query and the discipline not to ship the obvious fix.
This is the story of that query, the framing that almost cost me the right answer, and the four-line PostgreSQL trick that let us add a database-level guarantee to a table with thousands of existing violations — without a data migration, without a maintenance window, without touching the historical mess.
The pipeline
The B2B procurement SaaS I was working with ingests vendor invoices the way most do: vendors email PDFs to a shared mailbox, Gmail Pub/Sub pushes a notification, a Spring Batch job picks it up, OCR runs, an invoice gets created.
Vendor PDF → Gmail Pub/Sub (at-least-once) → Webhook
→ Spring Batch gmailJob
→ Download + OCR
→ Invoice in the procurement system
I’d just merged a notification refactor — instead of every retry attempt firing its own Slack alert, one afterJob listener decides what to emit per email. The plan that morning was small: send a few test emails (valid PDF, password-protected, corrupt, non-invoice), verify the right Slack channels lit up the right way, close the ticket, get coffee.
The fourth test broke the plan.
The symptom
A single valid PDF, sent once. Two Slack alerts came back, ~2.3 seconds apart. Same content, same payload, slightly different receivedAt.
The cheap explanation was right there: “the listener is double-emitting.” I’d just rewritten that listener. It was the obvious suspect, the change was fresh in my head, the fix would have been one Set<String> of already-emitted keys and a 12-line PR.
I opened the file. Then I checked the alert’s receivedAt field. The listener stamps it at emit time — LocalDateTime.now() the moment the Slack message is built. The listener also dedups within a single afterJob call; it cannot emit the same alert twice within one invocation.
Two alerts seconds apart, then, can’t be one loop running twice. They’re two afterJob invocations. Two job executions. The same email was being processed by two different Spring Batch jobs.
That is a completely different bug. The listener was emitting once per job, exactly as designed. The duplicate wasn’t in the notification layer at all.
I almost shipped the listener dedup anyway, because at this point the duplicate Slack alert was technically fixable — it would have stopped the noise. It would also have left whatever the second job was doing entirely intact, in production, invisible.
Lesson, embedded: when the symptom is “two of X”, the first question is not how to dedup X. It’s whether the two X’s are actually independent. Timestamps, IDs, and DB rows answer that. The source code at the symptom layer almost never does.
The query that changed the bug
Spring Batch keeps its own bookkeeping. Every run lands in batch_job_execution, with parameters in batch_job_execution_params. For the time window of the double-alert test:
SELECT je.job_instance_id,
       je.status,
       je.start_time,
       je.end_time,
       max(CASE WHEN p.parameter_name = 'historyId'
                THEN p.parameter_value END) AS history_id
FROM batch_job_execution je
JOIN batch_job_execution_params p
  ON p.job_execution_id = je.job_execution_id
WHERE je.start_time BETWEEN '2026-05-27 11:59:25'
                        AND '2026-05-27 12:00:10'
GROUP BY je.job_instance_id, je.status, je.start_time, je.end_time
ORDER BY je.start_time;
Two rows. Two distinct job_instance_ids. Two different historyIds. Both COMPLETED. Overlapping start/end times — the second job had started while the first one was still running.
One email. Two concurrent jobs. Both green.
Then the per-attachment table, for that same message_id:
4 rows, all message_id 19e694dc751e2617:
- 2 tracking rows (attachment_id IS NULL), both PROCESSED
- 2 per-attachment rows for the valid PDF, with two different
file_public_ids — two separate Cloudinary uploads:
att_3287208765481938993.pdf
att_15335496159877274327.pdf
And finally:
SELECT count(*) FROM invoice
WHERE public_id IN ('att_3287208765481938993.pdf',
                    'att_15335496159877274327.pdf');
-- 2
Two invoices. In production. For one email I had sent thirty seconds ago.
In a procurement system, an “invoice” is a payable. A duplicate invoice is a double-payment risk. The “duplicate notification” framing the Slack alert had handed me was hiding a financial data-integrity bug. The bug is rarely where the alert points.
Why the dedup that was supposed to prevent this didn’t
The team’s design was sound — they had thought about this exact failure mode. The intent was to rely on Spring Batch’s own job-instance dedup: same job + same identifying parameters → same JobInstance → the second jobLauncher.run(...) is rejected with JobInstanceAlreadyCompleteException. It’s a textbook pattern. It works.
It works if you key the parameters on something stable.
The call site looked like this:
new JobParametersBuilder()
    .addString("inboxEmail", inbox.getEmail())
    .addString("email", message.getId()) // Gmail messageId
    .addString("historyId", String.valueOf(historyId))
    .toJobParameters();
The bug is one of those three lines.
historyId is not an identifier of the email. It’s an identifier of the Pub/Sub push event — the “something changed in this mailbox” notification Gmail just sent you. For a single email you can get many history events: message added pushes one, a label flip pushes another, Pub/Sub at-least-once redelivery pushes a third, and each of them carries a different historyId.
Different historyId → different JobInstance → both pass the “already complete?” check → both run → both create invoices.
The mechanism wasn’t broken. The key was. Spring Batch’s job-instance dedup is only as good as the parameters you feed it. Param choices are design choices, not implementation details. One commit, ten months earlier, had been quietly bleeding duplicate payables into production ever since.
I floated the natural fix on Slack: re-key the job on messageId instead of historyId. Same email → same JobInstance → no duplicate job. Three-line change.
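To make the keying problem concrete, here is a minimal sketch in plain Java. It is not Spring Batch code: `instanceKey` is a hypothetical stand-in for "derive the job instance from the identifying parameters", and the inbox address and historyId values are made up for illustration.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Simplified model of job-instance identity (hypothetical, not the Spring Batch
// implementation): the instance key is built from identifying parameters only.
final class JobInstanceKeyDemo {

    // Builds a deterministic key from the parameters named in `identifying`.
    static String instanceKey(Map<String, String> params, Set<String> identifying) {
        TreeMap<String, String> key = new TreeMap<>();
        for (Map.Entry<String, String> p : params.entrySet()) {
            if (identifying.contains(p.getKey())) {
                key.put(p.getKey(), p.getValue());
            }
        }
        return key.toString();
    }

    public static void main(String[] args) {
        // One email, two push events. The historyId values are invented.
        Map<String, String> push1 = Map.of(
                "inboxEmail", "ap@example.com",
                "email", "19e694dc751e2617",
                "historyId", "1001");
        Map<String, String> push2 = Map.of(
                "inboxEmail", "ap@example.com",
                "email", "19e694dc751e2617",
                "historyId", "1002");

        Set<String> original = Set.of("inboxEmail", "email", "historyId");
        Set<String> rekeyed = Set.of("inboxEmail", "email");

        System.out.println("keyed with historyId, same instance?    "
                + instanceKey(push1, original).equals(instanceKey(push2, original)));
        System.out.println("keyed without historyId, same instance? "
                + instanceKey(push1, rekeyed).equals(instanceKey(push2, rekeyed)));
    }
}
```

In real Spring Batch code, one way to get the second behavior while still recording the value is the `addString(key, value, identifying)` overload of `JobParametersBuilder` with `identifying` set to `false`. Whether that fits this codebase is an assumption on my part.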
Another tech lead pushed back, and his pushback is the reason this article is worth writing.
The reframe: message dedup vs. attachment dedup
The pushback was one sentence:
“Then we need to go down to attachment IDs. A processed message does not mean we haven’t lost any invoices on the way.”
It’s correct, and it’s the part most “build idempotent pipelines” advice skips.
The job uses an AtLeastOneCompletedAggregator: if an email has two attachments and one succeeds while the other fails, the job still ends COMPLETED. The tracking row is marked PROCESSED. But the failed attachment never produced an invoice.
If you dedup at the message level, then on the next Pub/Sub redelivery:
- You correctly refuse to reprocess the message (no duplicate invoice — good).
- You also refuse to retry the failed attachment, which never created its invoice. You have silently lost a payable (bad).
The two goals — no duplicates and no losses — have different natural granularities. Duplicate-protection wants the broadest “should run once” unit. Loss-protection wants the narrowest “independent work” unit. In this pipeline those are different, and the right answer for both is the smaller one: the attachment.
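The granularity argument can be sketched as a tiny simulation. Everything here is hypothetical: each delivery is modeled as a map from attachment name to "did processing succeed this time", and the two methods stand in for message-level and attachment-level dedup. It is a toy model, not the pipeline's code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model: compares which invoices exist after a redelivery under
// message-level dedup versus attachment-level dedup.
final class GranularityDemo {

    // Message-level: once any attachment succeeded, the whole message is
    // treated as processed and later deliveries are skipped entirely.
    static List<String> messageLevel(List<Map<String, Boolean>> deliveries) {
        List<String> invoices = new ArrayList<>();
        boolean messageProcessed = false;
        for (Map<String, Boolean> delivery : deliveries) {
            if (messageProcessed) {
                continue; // redelivery refused, including any failed attachment
            }
            for (Map.Entry<String, Boolean> attachment : delivery.entrySet()) {
                if (attachment.getValue()) {
                    invoices.add(attachment.getKey());
                }
            }
            messageProcessed = delivery.containsValue(Boolean.TRUE);
        }
        return invoices;
    }

    // Attachment-level: an attachment is skipped only if it already
    // produced an invoice; failed ones are retried on redelivery.
    static List<String> attachmentLevel(List<Map<String, Boolean>> deliveries) {
        List<String> invoices = new ArrayList<>();
        Set<String> invoiced = new HashSet<>();
        for (Map<String, Boolean> delivery : deliveries) {
            for (Map.Entry<String, Boolean> attachment : delivery.entrySet()) {
                if (attachment.getValue() && invoiced.add(attachment.getKey())) {
                    invoices.add(attachment.getKey());
                }
            }
        }
        return invoices;
    }

    public static void main(String[] args) {
        // First delivery: b.pdf fails. Redelivery: both would succeed.
        List<Map<String, Boolean>> deliveries = List.of(
                Map.of("a.pdf", true, "b.pdf", false),
                Map.of("a.pdf", true, "b.pdf", true));
        System.out.println("message-level invoices:    " + messageLevel(deliveries));
        System.out.println("attachment-level invoices: " + attachmentLevel(deliveries));
    }
}
```

In this model, neither strategy creates a duplicate for a.pdf. Only the attachment-level one ever invoices b.pdf.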
Most idempotency pain I’ve seen in production traces back to this exact mistake — getting the mechanism right at the wrong granularity. The right unit matters more than the right tool.
The second surprise: Gmail’s attachmentId is also unstable
Fine, attachment-level dedup. The codebase already had one:
existsByAttachmentIdAndStatusIn(attachmentId, statuses);
Reasonable. Gmail returns an attachmentId with every messages.get(...) response — it identifies a specific MIME part of a specific message. Natural to treat it as stable.
Then I looked at the two per-attachment rows for the same PDF, side by side:
attachment_id: ANGjdJ-ICuErbtUjKIFYPh7Vo2ph...
attachment_id: ANGjdJ8N1fm4v9uRrmbrTv1UOiX3...
Same file. Same MIME part. Same bytes. Two different Gmail-issued identifiers.
Gmail’s attachmentId is per-fetch, not per-attachment. Every existing dedup in the codebase was keyed on a value that changed every time.
External identifiers are external. Just because a third-party API hands you something that looks like an ID doesn’t mean it’s stable across calls. The fastest, highest-leverage thing you can do in any pipeline like this is a 30-line script that fetches the same event twice and prints which fields move and which don’t. That one script would have saved this codebase ten months.
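The core of that script is small. A minimal sketch, assuming the two fetches have already been flattened into field-name to value maps: the fetching itself is left out, and the field names and the contentMd5 placeholder are illustrative, not taken from the Gmail API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Compares two fetches of the same entity and reports which fields moved.
final class FieldStabilityCheck {

    // Returns the names of fields whose values differ between the two
    // fetches, including fields present in only one of them.
    static Set<String> unstableFields(Map<String, String> first, Map<String, String> second) {
        Set<String> names = new TreeSet<>(first.keySet());
        names.addAll(second.keySet());
        Set<String> unstable = new TreeSet<>();
        for (String name : names) {
            if (!Objects.equals(first.get(name), second.get(name))) {
                unstable.add(name);
            }
        }
        return unstable;
    }

    public static void main(String[] args) {
        // Hard-coded stand-ins for two real fetches of the same attachment.
        Map<String, String> fetch1 = new LinkedHashMap<>();
        fetch1.put("messageId", "19e694dc751e2617");
        fetch1.put("attachmentId", "ANGjdJ-ICuErbtUjKIFYPh7Vo2ph...");
        fetch1.put("contentMd5", "same-digest-placeholder");

        Map<String, String> fetch2 = new LinkedHashMap<>();
        fetch2.put("messageId", "19e694dc751e2617");
        fetch2.put("attachmentId", "ANGjdJ8N1fm4v9uRrmbrTv1UOiX3...");
        fetch2.put("contentMd5", "same-digest-placeholder");

        System.out.println("fields that moved between fetches: "
                + unstableFields(fetch1, fetch2));
    }
}
```

Any field this reports as moving is disqualified as a dedup key, whatever its name suggests.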
The only thing that is stable per-attachment is the content itself — the MD5 of the bytes. The kicker: the code was already computing it on every download. It was used for an in-memory dedup set, and then thrown away. The DB column existed (file_md5_checksum). The JPA entity didn’t even map it. The answer was sitting on the floor.
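For reference, computing that content fingerprint needs nothing beyond the JDK. This is a sketch, not the project's code: `md5Hex` and `idempotencyKey` are hypothetical helper names, and the `:` separator is an arbitrary choice.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Content-derived identity for an attachment: stable across fetches because
// it depends only on the bytes, not on any identifier the API hands back.
final class ContentKey {

    // Hex-encoded MD5 of the raw bytes, zero-padded to 32 characters.
    static String md5Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available on this JVM", e);
        }
    }

    // The (messageId, content MD5) pair flattened into one string.
    static String idempotencyKey(String messageId, byte[] content) {
        return messageId + ":" + md5Hex(content);
    }

    public static void main(String[] args) {
        byte[] pdfBytes = "pretend these are PDF bytes".getBytes(StandardCharsets.UTF_8);
        System.out.println(idempotencyKey("19e694dc751e2617", pdfBytes));
    }
}
```

MD5 is used here purely as a fingerprint for accidental duplicates, not as a defense against someone crafting collisions on purpose.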
How big was this, really
One last query before designing the fix:
SELECT message_id,
       count(*) AS rows
FROM integration.email_attachment
WHERE attachment_id IS NULL
GROUP BY message_id
HAVING count(*) > 1
ORDER BY rows DESC;
The numbers reframed the ticket:
- ~7,119 distinct emails in production with more than one processing row.
- One email reprocessed 346 times in a 20-minute window (an older retry-storm pattern).
- One email had produced 128 invoices off a single PDF.
Ten months. Latent. Because the surface symptom had always been “a bit of extra Slack noise.”
The fix
Per-attachment idempotency, keyed on (messageId, content MD5), enforced atomically by the database, with conflict handling that doesn’t corrupt the surrounding Spring Batch step. Five concrete pieces, each doing exactly one thing.
1. Stop discarding the MD5
The MD5 was already computed on download. It just wasn’t reaching the persistence layer. The change was plumbing: add md5 to the DownloadedAttachment DTO, pass it into AttachmentProcessingDescriptor, put it on the worker ExecutionContext in the partitioner, read it on the tasklet with @Value("#{stepExecutionContext['fileMd5']}").
ExecutionContext ctx = new ExecutionContext();
ctx.putString("attachmentId", descriptor.attachmentId());
ctx.putString("fileMd5", descriptor.md5()); // carried, not dropped
ctx.putString("messageId", descriptor.messageId());
No new computation. The value already existed; the diff just stops throwing it away.
2. Persist it
@Column(name = "file_md5_checksum")
private String fileMd5Checksum;
That’s the whole change. The column was waiting.
3. A dedup query on stable identity
@Query("""
    select case when count(a) > 0 then true else false end
    from EmailAttachment a
    join IntegrationRecord ir on ir.emailAttachment = a
    where a.messageId = :messageId
      and a.fileMd5Checksum = :md5
    """)
boolean existsInvoicedByMessageIdAndMd5(@Param("messageId") String messageId,
                                        @Param("md5") String md5);
Two things worth noticing. It keys on the stable pair (messageId, fileMd5Checksum), not the unstable attachmentId. And it joins integration_record: “exists” here means “has actually produced an invoice”, not just “a row exists”. A previously failed attachment is not considered already processed, so a redelivery reprocesses it. This is what closes the no-loss concern raised by Andrei, the tech lead who pushed back.
4. The atomic guard — and the four-line PostgreSQL trick
An application-level existence check is necessary but not sufficient. Two concurrent jobs (exactly what was happening in production) can both pass the check and both insert. Only the database can prevent that race.
CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS uc_email_attachment_message_md5
    ON integration.email_attachment (message_id, file_md5_checksum)
    WHERE file_md5_checksum IS NOT NULL;
Three details, each load-bearing:
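- CONCURRENTLY: no writes are blocked while the index builds. The statement has to run outside a transaction block, so a migration tool that wraps each script in a transaction needs that turned off for this one.
- (message_id, file_md5_checksum): the actual idempotency key. Same content in different emails is still allowed; the same content twice in one email is not.
- WHERE file_md5_checksum IS NOT NULL: the whole reason this migration is risk-free.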
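That last clause is the move. The historical duplicate rows, spread across ~7,119 emails, all have file_md5_checksum = NULL, because nobody was persisting it before. Any unique index that compared those rows as equal (a backfilled placeholder value, or NULLS NOT DISTINCT on PostgreSQL 15+) would conflict with every one of them and the migration would fail. PostgreSQL's default treats NULLs as distinct, so a plain unique index would technically build too, but the partial index states the exclusion explicitly and keeps the legacy rows out of the index entirely. It builds over zero rows. The guarantee applies to every new row going forward, and the historical mess gets cleaned on its own schedule by a separate, careful task involving real invoices and audit trails.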
Partial unique indexes are how you add a guarantee to a table that already has violations — WHERE <new-key-column> IS NOT NULL is the standard escape hatch. It’s one of the cleanest patterns Postgres gives you, and it almost never makes the “intro to indexes” articles.
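If the semantics are easier to read as code, here is an in-memory model of what the partial unique index accepts and rejects. It is purely illustrative: the class and method names are hypothetical, and a `HashSet` stands in for the index.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// In-memory model of a unique index on (message_id, file_md5_checksum)
// restricted to rows WHERE file_md5_checksum IS NOT NULL.
final class PartialUniqueIndexModel {

    private final List<String[]> rows = new ArrayList<>();
    private final Set<String> index = new HashSet<>();

    // Returns true if the row was stored, false if the index rejected it.
    boolean insert(String messageId, String md5) {
        if (md5 != null) {
            // Only rows matching the index predicate are checked and indexed.
            String key = messageId + "\u0000" + md5;
            if (!index.add(key)) {
                return false; // unique violation
            }
        }
        rows.add(new String[] {messageId, md5});
        return true;
    }

    int rowCount() {
        return rows.size();
    }

    int indexedCount() {
        return index.size();
    }

    public static void main(String[] args) {
        PartialUniqueIndexModel table = new PartialUniqueIndexModel();
        // Legacy duplicates with no checksum are outside the index.
        System.out.println("legacy row 1:            " + table.insert("m1", null));
        System.out.println("legacy row 2:            " + table.insert("m1", null));
        // New rows carry the checksum and are enforced.
        System.out.println("new row:                 " + table.insert("m1", "abc"));
        System.out.println("same content, same mail: " + table.insert("m1", "abc"));
        System.out.println("same content, other mail: " + table.insert("m2", "abc"));
        System.out.println("rows=" + table.rowCount() + " indexed=" + table.indexedCount());
    }
}
```

The legacy rows never enter the index, which is why creating it touches none of them.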
5. REQUIRES_NEW + saveAndFlush — and why both are load-bearing
This is the part most “use a unique constraint” advice skips, and it’s the part that bites every Spring backend the first time.
If you call repository.save(...) inside the Spring Batch step’s transaction and it violates the unique index, the constraint violation poisons the outer transaction. Spring marks it rollback-only. The rest of the step rolls back too — including unrelated bookkeeping you wanted to keep. By the time you catch the DataIntegrityViolationException, the outer transaction is already dead.
The fix is one annotation and one method call:
@Transactional(propagation = Propagation.REQUIRES_NEW)
public void addAttachmentInNewTransaction(EmailInbox inbox,
                                          EmailAttachment attachment) {
    attachment.setEmailInbox(inbox);
    attachmentRepository.saveAndFlush(attachment);
}
REQUIRES_NEW suspends the outer step transaction and runs the insert in its own. If it rolls back, only this insert rolls back; the outer transaction is untouched. saveAndFlush forces the constraint check to fire synchronously, here, inside this method. Without the flush, the INSERT can be deferred until the persistence context flushes at commit, so the violation surfaces later and outside this method, where it is much harder to catch cleanly.
The caller treats the duplicate as the routine outcome it actually is:
try {
inboxService.addAttachmentInNewTransaction(inbox, attachment);
} catch (DataIntegrityViolationException dup) {
log.info("Content (md5 {}) already persisted for message {} — skipping duplicate",
fileMd5, message.getId());
}
The full control flow is now:
- App-level check skips the obvious duplicates cheaply.
- The DB constraint catches the race-condition duplicates the app check can’t.
- The race-loss path is fast, isolated, observable — not an error, just a log line.
@Transactional(REQUIRES_NEW) + saveAndFlush is the canonical “let the DB enforce idempotency without poisoning my outer transaction” pattern. Every Spring backend that uses a unique constraint as a guard ends up needing it. Both pieces are needed: the propagation isolates the failure, the flush forces it to surface in time to be caught.
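The race itself is easy to reproduce without a database. The sketch below is an analogy, not Spring or JPA code: `ConcurrentHashMap.putIfAbsent` stands in for the atomic unique-index check, and the nested `DuplicateKeyException` is a hypothetical stand-in for `DataIntegrityViolationException`. Several workers insert the same (messageId, md5) at once, and the losers treat the rejection as a routine skip.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Concurrent workers racing to insert the same key; an atomic guard lets
// exactly one win and turns every other attempt into a skipped duplicate.
final class RaceGuardDemo {

    // Hypothetical stand-in for a constraint-violation exception.
    static final class DuplicateKeyException extends RuntimeException {
        DuplicateKeyException(String key) {
            super("duplicate key: " + key);
        }
    }

    private final ConcurrentHashMap<String, Boolean> uniqueIndex = new ConcurrentHashMap<>();

    // Atomic check-and-insert, playing the role of the unique index.
    void insert(String messageId, String md5) {
        String key = messageId + ":" + md5;
        if (uniqueIndex.putIfAbsent(key, Boolean.TRUE) != null) {
            throw new DuplicateKeyException(key);
        }
    }

    // Starts `workers` threads at once; returns {inserted, skipped}.
    static int[] race(int workers) {
        RaceGuardDemo store = new RaceGuardDemo();
        AtomicInteger inserted = new AtomicInteger();
        AtomicInteger skipped = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    start.await(); // release all workers together
                    store.insert("19e694dc751e2617", "md5-placeholder");
                    inserted.incrementAndGet();
                } catch (DuplicateKeyException dup) {
                    skipped.incrementAndGet(); // routine outcome, not an error
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        start.countDown();
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return new int[] {inserted.get(), skipped.get()};
    }

    public static void main(String[] args) {
        int[] result = race(8);
        System.out.println("inserted=" + result[0] + ", skipped=" + result[1]);
    }
}
```

A separate "does it exist?" lookup followed by an insert would not give this guarantee, because two workers can both see "absent" before either writes. That gap is the one the database constraint closes.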
What this looked like in the data
| Scenario | What happens |
|---|---|
| First delivery | Row inserted with MD5, invoice created. |
| Pub/Sub redelivery, same PDF | App check returns “already invoiced.” Skipped. |
| Two concurrent jobs racing | One commits, the other catches DataIntegrityViolationException and logs. |
| Previously failed attachment, then redelivered | No integration_record → app check returns false. Reprocessed. No loss. |
| Same content in a different email | Different message_id. Index doesn’t fire. Different invoice. |
The last row matters: the key is (message_id, file_md5_checksum), not the MD5 alone. Two vendors sending the same boilerplate page is not a duplicate. The same content twice in one email is.
What I’d carry to the next pipeline
Less a list of lessons, more a single shape — because this pattern is everywhere once you’ve seen it once.
A third party delivers an event at-least-once. Your code treats the delivery event as if it were the underlying entity. Some piece of identity that looks stable turns out to be event-scoped, not entity-scoped. Your dedup keys on it. The surface symptom — when there finally is one — is small enough to ignore. The underlying cost compounds for as long as nobody looks twice.
You can find this in Stripe webhook redeliveries, in S3 event notifications, in Kafka at-least-once consumers, in SQS without dedup IDs, in GitHub webhooks, in Microsoft Graph subscriptions — and, yes, in Gmail Pub/Sub. The fix is always the same three moves:
- Verify your idempotency keys against a real redelivery, not against the API docs. “Stable in the docs” and “stable in practice across at-least-once delivery” are not the same sentence. A 30-line script that prints the candidate keys over two deliveries is the highest-leverage thing you can write that week.
- Pick the right granularity, not just the right mechanism. Message-level dedup is correct for no-duplicate and wrong for no-loss. Most production idempotency pain is the wrong granularity, not the wrong tool.
- Move the guarantee to the database. Application checks are necessary, never sufficient. A partial unique index plus REQUIRES_NEW + saveAndFlush is a tiny amount of code for an enormous amount of confidence, and it’s the only thing that survives a race.
A Slack alert that fires twice isn’t a notification bug.
It might be a data-integrity bug wearing a notification costume.
The gap between “extra noise in Slack” and “we are creating duplicate invoices in production” is one query — and the discipline to follow the trace past the first plausible explanation.
Imad Alilat is a freelance backend engineer working on Spring Boot / Spring Batch / data-integrity systems. More war stories at devadvisor.io.



