Whoops, I Did It Again

Over the years, one of the largest questions that comes up while teaching about Event Sourcing is, “What happens if I make a mistake?” This is a sensible enough question in an append-only model. Students can normally figure out relatively quickly how to version events forward, but that does not really help them with a bug or a mistake. This is also one place where Event Sourced systems can actually run into additional costs compared to a typical CRUD-based system.

In a typical SQL database-type system, the solution to having a bug in the software would be to open SQL Server Management Studio or Toad and edit the data. It may be that it is just one row in a funny state, in which case you may just open that table and edit the data. Some cases may be more sinister; perhaps there are 900 customers who have been affected by this bug. No problem! Write an update statement that selects them!


Alas, this runs into many of the problems discussed previously in “Why can’t I update an event?” Though it can work on a single database system (often a monolith), it starts to fall apart when applied to a many-service environment.

What about the audit log of the system? Hopefully, there are triggers on those tables that will have captured this change, who made it, and, most importantly, why. We can quickly imagine a court case where we find out that the DBA issued an update that affected a user that it should not have and which the software would not allow based on some invariants (say, the terrorist flag in a banking system).

These types of edits petrify auditors. If the DBA has permission levels high enough to do this blanket update, how do we ensure he doesn’t have permissions to do lots of other insidious things (or undo them)?

What happens when there’s a data warehouse that consumes and aggregates our data with data from other services? How will the warehouse be alerted to these changes?

What happens if our admin made a mistake when they were building their update statement to update the hundreds of customers that had this problem? What if some customers were included that should not have been, or vice versa?

What if, as a business owner, I decide six months later that this was actually a major issue and I want to know all the customers who were affected by it? Where do I find this information?

Sometimes, these types of issues are important; other times they are not. In the types of systems where Event Sourcing is applied, they tend to be important. You can get typical CRUD-type systems to handle many of these cases, but the simplicity of “we are just doing a quick edit” then goes completely out the window.

These ideas are by no means new. Pat Helland has also put forward similar ideas and this blog post of his is worth reading, but the concepts it outlines are much older. Mainframes were commonly built this way in the 70s and 80s. The smalltalk image format used the same principles. Ultimately, these ideas go back centuries. This is not something new to learn. It is something old to research.

Accountants Use Pens

In 2006, I gave the first talk on Event Sourcing and CQRS at QCon San Francisco. At the time, I was not a regular conference speaker, though I had spoken at a few user group meetups. My talk was about extending Domain Driven Design with messaging patterns for high throughput systems (algorithmic trading in our case). My front row was Martin Fowler, Eric Evans, and Gregor Hohpe. I was terrified. I went through all my slides in about twenty minutes for a sixty minute talk.

I cannot tell you the feeling of watching Martin Fowler give you a red card. After, Eric told me, “That was a bad talk. I understood maybe 20% of what you were saying. I can only guess the rest.” For those who have met Eric Evans, to hear a negative word come out of his mouth is rare.

Luckily, I was invited back the following year, following which Eric said, “that was a very good talk.”

I have been using a slide similar to the image above since 2007. What is so terribly wrong with the slide is that accountants don’t erase things in the middle of their ledgers. Event Sourced systems work in exactly the same way. I have always joked that if you can’t figure out how to do something in an Event Sourced system, go ask your company’s CPA why and they will likely be able to tell you. This is one situation accountants understand well.

Let’s imagine that the accountant makes a fat-finger mistake and accidentally transfers $10,000 instead of $1,000 from Account A to Account B. Will the accountant just erase the zero? No, they will employ a Compensating Action. In other words, they will add new transactions to compensate for the one in error. There are two options here.

Partial Reversal

The accountant decides to balance the books by adding a journal entry that moves $9,000 from Account B back to Account A. They include a note with the journal entry that this is due to a mistake with the original transaction.

ID  From       To         Amount   Reason
13  Account A  Account B  $10,000  NONE
27  Account B  Account A  $9,000   ERROR 13

Accountants, however, prefer not to do this. If the numbers are perfect numbers like $10,000 and $9,000, you can probably calculate in your head that the accountant originally intended to transfer $1,000. But what if the numbers were $6,984.82 and $4,119.14? What if there were six accounts involved?

Full Reversal

In such a case, accountants tend to use what is known as a full reversal instead. With a full reversal, the accountant will correct the books by adding a new transaction that moves the $10,000 from Account B back to Account A, noting in the journal entry that the entire transaction was a mistake. The accountant will then add another transaction transferring $1,000 from Account A to Account B as they originally intended.

ID  From       To         Amount   Reason
13  Account A  Account B  $10,000  NONE
27  Account B  Account A  $10,000  REV 13
29  Account A  Account B  $1,000   NONE
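
The full reversal above can be sketched as a fold over an append-only ledger (a minimal sketch; the post and balance helpers are assumptions for illustration, not part of any real accounting system):

```python
def post(ledger, entry_id, src, dst, amount, reason="NONE"):
    """Append a journal entry; entries are never edited or deleted."""
    ledger.append({"id": entry_id, "from": src, "to": dst,
                   "amount": amount, "reason": reason})

def balance(ledger, account):
    """Derive an account's balance by folding over every entry."""
    return (sum(e["amount"] for e in ledger if e["to"] == account)
            - sum(e["amount"] for e in ledger if e["from"] == account))

ledger = []
post(ledger, 13, "Account A", "Account B", 10_000)            # fat-finger mistake
post(ledger, 27, "Account B", "Account A", 10_000, "REV 13")  # full reversal
post(ledger, 29, "Account A", "Account B", 1_000)             # intended transfer

print(balance(ledger, "Account B"))  # 1000
```

The books balance as originally intended, yet all three entries remain for anyone looking back at what happened.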

Accountants generally prefer full reversals since they make it much easier to figure out what was originally intended than partial reversals do. It is easy to forget to think of things from the perspective of someone who comes back to look at the situation later and is trying to figure out what happened after the fact.

Similar examples of partial and full reversals in accounting can be found on the Accounting Coach website under Adjusting Entries and Reversing Entries, respectively.

Types of Compensating Actions

When dealing with Event Sourced systems, things can often be done just as an accountant would do them. The example I have used in the past is a shopping cart. You can mistakenly add an item to a shopping cart and then remove it to end up with an equivalent (from the user's perspective) shopping cart afterward. This is the same as dealing with the ledger example. From an audit perspective, the removal could be flagged as correcting an error.
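
The cart example can be sketched as a replay over its events (a minimal sketch; the event and field names are assumptions for illustration):

```python
# Rebuild the cart by replaying its events; a removal flagged as a
# correction compensates for the mistaken addition.
events = [
    {"type": "ItemAdded", "sku": "sku-1"},
    {"type": "ItemAdded", "sku": "sku-2"},                       # mistake
    {"type": "ItemRemoved", "sku": "sku-2", "reason": "ERROR"},  # compensation
]

cart = set()
for e in events:
    if e["type"] == "ItemAdded":
        cart.add(e["sku"])
    elif e["type"] == "ItemRemoved":
        cart.discard(e["sku"])

print(cart)  # {'sku-1'}
```

The resulting cart is equivalent from the user's perspective, while the history still shows that a mistake was made and corrected.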

This does not work in all cases, though. Accounting ledger systems are lucky in that they have natural compensating actions for all of their transactions. If you put in a credit of $400, you can do an opposing debit of $400. If you add an item, the item can be removed. While it is necessary to think about things like the ability to mark the debit as actually being due to an error, handling the case does not require the introduction of new events.

However, not all Event Sourced systems have natural compensating actions. When there is no use-case representing a credit, how do you reverse a debit? In systems where there are no natural compensating actions providing a way to easily back out of something, compensating actions either need to be added up front or they can be added on the fly. Both of these options come with a cost; the difference is when you pay it.

Introducing compensating actions can be a long but valuable exercise in better understanding the domain you are trying to model. “What happens if somebody makes a mistake here?” is a question that can lead towards deep knowledge discoveries. Quite often, when a SQL DBA handles risk mitigation for these types of issues, the knowledge sits with the DBA team and the domain model is never exposed to it until the task is complete. Whether with manual tasks or “clean up” batch jobs, there is often significant risk mitigation handled outside of the model.

For instance, it is common to have a system that has come up with a use-case of “TruckLeft” but nothing to handle the situation where the person at the gate scanned the wrong truck. The introduction of analysis to discuss all of these edge conditions might also be quite costly. A significant portion of domains exist where it is not worth discussing these edge conditions, as they either happen rarely or their effects are small. If you have a significant number of these types of conditions, however, this is something you likely want to consider.

Another option is to create these events ad-hoc when/if they occur. Many Event Stores support the ability to write an event to a stream ad-hoc, say, from a JSON file with curl or from a small script. The problem with using these ad-hoc events is that there are consumers, most commonly projections, that do not yet understand this event. In such a case, you would need to update those consumers and then put in the ad-hoc correction.

For smaller systems, ad-hoc compensations work very well. As your number of consumers goes up and/or the difficulty of changing your consumers increases, it often becomes a bad idea to use this type of ad-hoc correction. This is true not only for Event Sourced systems but also for systems that use other push-based integrations.

A third option, a hybrid of the previous two, is to introduce a special type of event into your system known as a Corrected event or a Cancelled event. In the case of a Cancelled event, for instance, it would contain a link to the original event that it was cancelling.

Cancelled {
	event : '54@mystream'
}

or something like:

Cancelled {
	eventData : {
		id : '54',
		stream : 'mystream',
		type : 'AccountDebit',
		body : {
			account : 'account-64748484',
			amount : 65.78
		}
	}
}

Depending on how you handle subscriptions, you can either send over a Cancelled event with a link back to the original event or you can include the body of the original event in the Cancelled event. These are just semantic differences, as the consumer can get the event data for 54@mystream if it wants to anyway. Both are valid implementations.
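
A consumer that understands the Cancelled event might be sketched like this (a minimal sketch; the projection remembers what each event contributed to the read model so that a later Cancelled event can undo it; field names follow the examples above but are assumptions):

```python
def project_balance(events):
    balance = 0.0
    applied = {}  # event id -> the signed amount that event contributed
    for e in events:
        if e["type"] == "AccountDebit":
            applied[e["id"]] = -e["body"]["amount"]
            balance += applied[e["id"]]
        elif e["type"] == "Cancelled":
            # reverse whatever the referenced event did, if we applied it
            balance -= applied.pop(e["event"], 0.0)
    return balance

stream = [
    {"type": "AccountDebit", "id": "54@mystream",
     "body": {"account": "account-64748484", "amount": 65.78}},
    {"type": "Cancelled", "event": "54@mystream"},
]
print(project_balance(stream))  # 0.0
```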

This advertises to consumers that any event can possibly be cancelled; if they really care about the cancellation of a previous event beyond notifying someone, they should handle the Cancelled event.

How Do I Find What Needs Fixing?

This is one of the biggest struggles for people in Event Sourced systems, especially those who may be new to Event Sourcing. We can send a compensating action, but how do we figure out which streams need it sent?

A commonly heard approach is to bring up a one-off instance of the domain model that will iterate through all of the possibly affected aggregates one by one, emitting the compensation as it finds domain objects that may be affected. This strategy is not a terrible one, but it can run into a few issues.

How, for one, does your code know what the IDs of all of the streams of that type are? Assuming that you found a problem in accounts, how do you know all the streams that are accounts? There are ways of working around this, such as using an Announcement Stream that tracks all of the account streams, but this must be in place already. If not, it can be a hurdle.
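
An Announcement Stream can be as simple as a second append on every stream creation (a minimal sketch; the in-memory store and the stream and event names are assumptions for illustration):

```python
store = {}

def append(stream, event):
    store.setdefault(stream, []).append(event)

def open_account(account_id):
    stream = f"account-{account_id}"
    append(stream, {"type": "AccountOpened", "id": account_id})
    # announce the new stream so later one-off fixes can enumerate all accounts
    append("accounts-announcements", {"type": "StreamAnnounced", "stream": stream})

open_account("1001")
open_account("1002")
print([e["stream"] for e in store["accounts-announcements"]])
# ['account-1001', 'account-1002']
```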

Does your domain object have enough data in it at the moment to actually identify whether it currently has the problem? It is common in Event Sourced systems to have the domain state/aggregate contain only the things required to maintain the invariants it protects. Quite often, the domain object in such a scenario does not include the state it needs to detect that it requires the change.

As a concrete example, suppose you realize that the customers from Florida have had the wrong sales tax rate associated with them. The domain object, however, does not include a state field, as the field is not used in any of the invariants for the customer object. It is indeed possible at this point to change the representation of the domain object to include this information, but there can be other complications with changing domain objects.

It is important here to remember that the state your domain model uses, whether in a functional or object-oriented style, is itself a projection. It is a projection that happens to be used for validating incoming messages. Why not just use another projection? If, for instance, you are using Oracle as a read model, why not just build another temporary projection in Oracle?
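
For the Florida example, a throwaway projection might be sketched like this (the event names, field names, and the 6% rate are assumptions for illustration):

```python
def affected_streams(all_events, correct_rate=0.06):
    """Single pass over every event; returns the customer streams needing a fix."""
    views = {}  # stream -> partial view: home state and assigned tax rate
    for stream, e in all_events:
        view = views.setdefault(stream, {})
        if e["type"] == "CustomerRegistered":
            view["state"] = e["state"]
        elif e["type"] == "TaxRateAssigned":
            view["rate"] = e["rate"]
    return [s for s, v in views.items()
            if v.get("state") == "FL" and v.get("rate") != correct_rate]

events = [
    ("customer-1", {"type": "CustomerRegistered", "state": "FL"}),
    ("customer-1", {"type": "TaxRateAssigned", "rate": 0.07}),  # wrong rate
    ("customer-2", {"type": "CustomerRegistered", "state": "GA"}),
    ("customer-2", {"type": "TaxRateAssigned", "rate": 0.07}),
]
print(affected_streams(events))  # ['customer-1']
```

Once the compensating actions have run, the projection is simply thrown away.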

What’s more, bringing up a temporary read model has some other benefits. It is easier to just throw it away after the procedure than to make changes to the domain objects. Those changes tend to end up sticking. Also, once you have a read model you can use that read model’s full power to analyze and simulate what the effect of your operation will be before you do it. If the streams involved happen to be related in a complex graph, such as in a financial portfolio, you can use a graph database to analyze the possible repercussions.

While it may often seem like the right way to get things done, using the existing domain model is often not the cleanest solution. Instead, write a one-off projection, analyze your data, and determine what the scope of your change will be. Understand it, discuss its consequences, and finally, if you determine it’s still the right move, run the compensating actions. Trust me, you don’t want to do it twice.

More Complex Example

Cases dealing with invalidating a previous event are not always so simple. There are many cases where invalidating a previous event can cause a cascade of invalidations through the system, affecting things that were intrinsically related to it. A good real-world example of handling facts (events) that were written and later found out not to be true, with the resulting cascade, can be found in the trading domain.

A common use case in the trading domain is to have a small service that listens to executions (trades occurring in the market) and produces position updates. As an example:

Op  Symbol  Volume  Amount
B   SYM     100     $88.90
B   SYM     100     $88.95
S   SYM     100     $88.98

Here the sequence of executions states that there is a buy of 100 at $88.90, then another buy of 100 at $88.95, then a sale of 100 at $88.98. Our service would like to calculate position change events. A position is how much I own and at what price. There are, however, multiple ways of calculating a position.

The two most commonly used are LIFO (Last In First Out) and FIFO (First In First Out). Calculating based on LIFO, the profit made and the overall position would come out as:

PositionUpdated 100@88.90
PositionUpdated 200@88.925  //the price is calculated across the whole position
ProfitTaken 3.00            //the trade made $3.00
PositionUpdated 100@88.90

If, however, this were done with a FIFO strategy, the calculation would come out differently.

PositionUpdated 100@88.90
PositionUpdated 200@88.925  //the price is calculated across the whole position
ProfitTaken 8.00            //the trade made $8.00
PositionUpdated 100@88.95
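
The difference between the two strategies comes down to which open buy a sell is matched against, which can be sketched as follows (a minimal sketch; the helper name and data shapes are assumptions for illustration):

```python
def take_profit(buys, volume, price, strategy):
    """Match a sell against open buy lots; returns the realized profit."""
    profit = 0.0
    while volume > 0:
        idx = -1 if strategy == "LIFO" else 0  # newest lot vs oldest lot
        lot_volume, lot_price = buys.pop(idx)
        matched = min(lot_volume, volume)
        profit += matched * (price - lot_price)
        volume -= matched
        if lot_volume > matched:  # put the unmatched remainder back
            remainder = (lot_volume - matched, lot_price)
            if strategy == "LIFO":
                buys.append(remainder)
            else:
                buys.insert(0, remainder)
    return profit

for strategy in ("LIFO", "FIFO"):
    buys = [(100, 88.90), (100, 88.95)]
    print(strategy, round(take_profit(buys, 100, 88.98, strategy), 2))
# LIFO 3.0
# FIFO 8.0
```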

How you match up buys and sells depends heavily on the order in which they come in. Where this can become a serious issue is when a previous trade is invalidated: every position update and profit taken that has occurred since the invalidated trade is now wrong.

If in the LIFO example the buy at $88.95 were later invalidated, the ProfitTaken should have been $8.00, not $3.00. It is also easy to imagine that there are not only 3 of these events but 300, and that they have all been invalidated. The position updates are also what feeds the PnL (Profit and Loss) report; get this wrong and I can assure you angry traders will be showing up at your desk.

Not only can this situation be handled in an Event Sourced system, the Event Sourced system will likely handle it better than a system that used updates would.

When the invalidation comes in, the system can look up the exact trade that is being invalidated. It can then read backwards in the position stream until it comes to a 0 cross (position hits zero or switches sign).

The system now knows that the trade is invalidated and plays forward from the last 0 cross. It operates as normal, producing results from its calculation. When it hits the invalidated trade, it ignores it. If multiple invalidations are supported, it checks its list of invalidations first and ignores all of the invalidated trades.

In the scenario above for LIFO this process would look like this:

Op  Symbol  Volume  Amount
B   SYM     100     $88.90
B   SYM     100     $88.95
S   SYM     100     $88.98
I   SYM     100     $88.95

The invalidation removes the buy of 100 at $88.95. As such, the $3.00 profit generated before is wrong under the LIFO strategy; it should have been $8.00, with 0 shares remaining in the position at the end. The position stream would look like this:

PositionUpdated 100@88.90
PositionUpdated 200@88.925  //the price is calculated across the whole position
ProfitTaken 3.00            //the trade made $3.00
PositionUpdated 100@88.90
PositionInvalidated
PositionUpdated 100@88.90
ProfitTaken 8.00            //the trade made $8.00
PositionUpdated 0
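
The replay itself can be sketched as follows (a minimal sketch of the LIFO calculation; names and data shapes are assumptions for illustration):

```python
def replay_lifo(executions, invalidated=frozenset()):
    """Replay executions from a 0 cross, skipping invalidated trade indexes."""
    events, buys = [], []
    for i, (op, volume, price) in enumerate(executions):
        if i in invalidated:
            continue  # ignore invalidated trades during the replay
        if op == "B":
            buys.append([volume, price])
            events.append(("PositionUpdated", sum(v for v, _ in buys)))
        else:  # sell: LIFO matches the most recent buys first
            profit, left = 0.0, volume
            while left > 0:
                lot = buys[-1]
                take = min(lot[0], left)
                profit += take * (price - lot[1])
                lot[0] -= take
                left -= take
                if lot[0] == 0:
                    buys.pop()
            events.append(("ProfitTaken", round(profit, 2)))
            events.append(("PositionUpdated", sum(v for v, _ in buys)))
    return events

trades = [("B", 100, 88.90), ("B", 100, 88.95), ("S", 100, 88.98)]
print(replay_lifo(trades))                   # ProfitTaken 3.0, position ends at 100
print(replay_lifo(trades, invalidated={1}))  # $88.95 buy skipped: 8.0, ends at 0
```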

This has a few benefits associated with it. Most notably, all downstream consumers of the position, such as the PnL report, will automatically receive the position updates. They also do not need to have any special recalculation logic, as they just listen to the Position events.

While this could also be implemented using an update statement across the original positions, what happens when the trader wants to know why his position changed? Implementing it through a mechanism such as the one above provides a full audit history of what occurred and, more importantly, why it occurred. Via the causation id pattern, it can be seen that the change of the ProfitTaken to $8.00 and the final PositionUpdated to 0 happened because of the invalidation of the trade at $88.95. With such a simple scenario this might be easy to work out by hand, but imagine there are 100 executions after the invalidated trade, and thus 100 position updates that follow, or that there are 5 invalidations instead of one; the complexity can quickly spiral out of control.

This same methodology can also be used to backdate a change to how the position is calculated, if wanted; as an example, the trader decides they want to switch from LIFO to FIFO.

But I Really Screwed Up

Sometimes things are just so screwed up it’s not worth trying to bring them forward. Nor is it worthwhile in such cases to maintain any kind of auditing functionality over the mistake, as you may not have a regulatory requirement to keep it and it may just confuse the auditors anyway.

A common example of this would be, “I just found out my import of 27GB of events on a new database is broken. Do I need to preserve the streams?” The short answer: no. Unless you have some odd data retention and auditability rules associated with your system, just delete the database and start over.

If this were later in the life of the system, many Event Stores (EventStore included) offer varying ways of deleting streams. Deleting a whole stream is a safe operation, whereas deleting a single event or editing an event is not. Deleting from the beginning of a stream is also a safe operation (TruncateBefore in EventStore). You can utilize these mechanisms to delete full streams from an existing system and still be in the clear, provided the given Event Store supports them. Note that this will not work if you are running on a WORM drive.

A versioning-related comment I hear a lot here is, “We have data retention legislation that says we can only keep X information for T time.” This can also be handled with stream deletion, specifically deleting events past a certain age.

Another possibility is to keep two streams of information, say user-555-private and user-555-public, and delete the user-555-private stream if you must cease to retain private data while still maintaining the events that were in the public stream. Note that if you allow consumers to retain such information, you should also write a RemovedPrivateData event so that any consumers are notified that they should no longer retain the information.
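
The split-stream approach might be sketched like this (a minimal sketch; the in-memory store and the stream and event names are assumptions for illustration):

```python
store = {
    "user-555-public":  [{"type": "UserRegistered", "id": "555"}],
    "user-555-private": [{"type": "EmailProvided", "email": "user@example.com"}],
}

def forget_private_data(user_id):
    # deleting a whole stream is a safe operation
    del store[f"user-{user_id}-private"]
    # tell consumers they must also stop retaining the private data
    store[f"user-{user_id}-public"].append({"type": "RemovedPrivateData"})

forget_private_data("555")
print(sorted(store))  # ['user-555-public']
```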

As mentioned, there are situations where physical deletes may not be possible, yet legislation such as the GDPR (General Data Protection Regulation) requires information to be removed. On a system that supports deletes, a delete can obviously just be issued, but the requirement can be met even when working with a WORM drive. Instead of deleting, an alternative allowable under many rule sets is to encrypt the event data and forget the key, which is kept in another system. In practice, deleting the key makes the data no longer readable while still allowing the system to run on write-once hardware.
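
The encrypt-and-forget approach can be sketched like this (the SHA-256 XOR keystream below is purely illustrative, not real cryptography; a production system would use a vetted cipher such as AES-GCM from a proper library):

```python
import hashlib
import secrets

def keystream(key, length):
    """Derive a deterministic keystream of the requested length from the key."""
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def crypt(key, data):
    # XOR with the keystream; applying it twice recovers the original bytes
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

keys = {"user-555": secrets.token_bytes(32)}  # key store lives outside the events
body = b'{"email":"user@example.com"}'
stored = crypt(keys["user-555"], body)        # what the append-only store keeps

print(crypt(keys["user-555"], stored) == body)  # True while the key exists
del keys["user-555"]  # "forget the key": stored is now unreadable
```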

In this chapter, the discussion has centered around mistakes dealing with a single event. Other types of versioning issues, however, can be less obvious to fix, particularly those arising when entire or multiple streams have a versioning issue.