Building Event Reliability at Monesize Core
As Monesize Core evolved into a modular business platform, more parts of the system started depending on events.
Accounting needed to react to sales. Inventory needed to react to purchases. Activity logs needed to capture important actions. Notifications needed to respond to business events.
The architecture naturally moved toward an event-driven approach. A service performs a business operation, emits an event, and other modules react.
At first glance, this looks simple. A transaction happens. An event is emitted. Listeners process the event.
The problem is that distributed systems are not defined by what happens when everything works. They are defined by what happens when something fails.
The question I eventually had to answer was: how do I guarantee that a business event is never silently lost?
The Gap Between Database Writes and Events
The first version of event handling looked simple:
await prisma.sale.create({
data: saleData
});
eventBus.emit('sale:created', {
saleId
});
The assumption is that both operations happen successfully. But they are separate operations. The database knows about the sale. The event system knows about the sale. There is no guarantee they remain consistent.
Imagine this sequence:
- Sale transaction commits successfully.
- The process crashes.
- The event is never emitted.
Now the database contains a sale. But other modules never know it happened. Inventory is not updated. Accounting does not create the necessary records. Downstream workflows never execute. The business state is now inconsistent.
This is the first reliability gap: the gap between writing data and publishing the event.
The Second Gap: Event Processing
Fixing the first problem is not enough.
Imagine the event is successfully emitted. A listener receives it:
sale:created → ACCOUNTING
The listener starts processing. It creates accounting entries. Then the process crashes before completion.
The event was already consumed. The system believes it was handled. But the actual work was incomplete.
This creates a second failure mode: the gap between receiving an event and completing the work.
Reliable event systems need to solve both problems. Delivery reliability. And processing correctness.
Why Simple Solutions Were Not Enough
Before implementing the final architecture, several approaches were considered.
Retrying inside listeners. If a listener fails, just retry. The problem is that the event may already be gone. The listener needs a reliable way to know that the event existed. Retries alone do not solve event persistence.
Dual writes. Write to the database and separately write the event. The problem is that you now have two operations that can fail independently. You create another consistency problem.
Synchronous processing. Process everything immediately inside the request. The problem is coupling. A slow listener increases request latency. A failing listener can affect unrelated operations. The system becomes fragile.
The core problem remained: once an event leaves the process, reliability becomes difficult unless the event itself is persisted.
The Solution: Transactional Outbox
The solution was the Event Outbox pattern. The idea is simple: do not immediately emit the event. Store it first.
The event is written inside the same database transaction as the business operation:
await prisma.$transaction(async (tx) => {
await tx.sale.create({
data: saleData
});
await writeEvent(tx, {
event: 'sale:created',
payload: {
saleId
}
});
});
Now the business operation and event creation are atomic. Either both happen. Or neither happens. If the application crashes after the transaction commits, the event still exists. A background process can recover it. The event cannot disappear.
Fast Path + Recovery Path
The outbox is not the primary delivery mechanism. Normal operation should be fast. So Monesize Core uses a two-layer approach.
The fast path: business transaction completes, event is emitted immediately, outbox record is removed after successful delivery.
The recovery path: if something fails before deletion, the poller discovers the event and retries.
The poller is insurance. Not the main highway.
Multi-Process Safety
Monesize Core runs with PM2 clustering. Multiple processes can exist at the same time. That creates another problem: what prevents two processes from delivering the same event?
The solution is row locking. The outbox processing uses database locking with FOR UPDATE SKIP LOCKED. This allows multiple workers to safely process events without competing for the same records. Only one process handles a specific event instance.
The Second Layer: Listener Idempotency
Outbox guarantees delivery. It does not guarantee that a listener executes only once. A listener can still crash after starting work. The same event may be delivered again.
For some actions, duplicate execution is harmless. For accounting entries, it is not.
The solution is per-listener idempotency. Each listener gets a processing record:
EventIdempotency {
listenerModule,
event,
eventInstanceId,
processedAt
}
The unique key is (listenerModule, event, eventInstanceId).
This matters because multiple modules can consume the same event. Accounting processing a sale is different from inventory processing a sale. Each listener needs its own guarantee.
The Idempotency Mistake
The first implementation had a subtle problem. The idempotency check happened before the business transaction:
checkIdempotency();
processBusinessLogic();
This looked correct. But there was a failure case. The process could crash after the idempotency record was created but before the business transaction completed. On retry, the system would see the event as already processed. The business action would never happen. The event would be silently lost.
The fix was moving idempotency inside the same transaction:
await prisma.$transaction(async (tx) => {
await checkIdempotency(tx);
await tx.journalEntry.create({
data
});
});
Now the idempotency record rolls back with the failed transaction. A retry gets a clean state. The event is processed correctly.
Not Every Side Effect Has the Same Guarantee
Email is a different case. Once an email is sent, it cannot be undone. If the system crashes after sending but before recording completion, a retry may send a duplicate email.
For email, the tradeoff is different. A duplicate email is acceptable. A missing password reset email is not.
Different side effects require different reliability decisions.
Cleanup
Idempotency records should not grow forever. Processed events are cleaned periodically. Currently, records older than three days are removed. Simple. Predictable. Enough history for recovery.
Final Architecture
The complete flow looks like this:
Business Transaction
│
▼
Write Event to Outbox
│
▼
Commit Transaction
│
▼
Immediate Event Delivery (Fast Path)
│
▼
Listener Receives Event
│
▼
Idempotency Check
│
▼
Business Action Executes
If anything fails along the way, the poller recovers the outbox event. Retries happen automatically. Failed events eventually move to a dead-letter state. Nothing silently disappears.
What This Changed
The implementation resulted in:
- 19 service modules connected to the outbox
- 23 listener idempotency checks
- 0 direct event emissions inside business logic
- 5 retry attempts before dead-lettering
- Automatic crash recovery
The biggest improvement was not a new feature. It was removing an entire category of uncertainty.
Reliable systems are not created by assuming failures will not happen. They are created by designing what happens when they do.
In enterprise business software, reliability is not optional. A missing event can become a missing transaction. A missing transaction can become a missing business record. The system has to protect the truth.
Comments
Post a Comment