The dangers of event-based messaging

It’s common now to prefer communicating asynchronously between services via a message bus, rather than synchronously via some kind of HTTP API. Immature companies that lack experience with this method tend to make the same mistakes.

Non-transactional message publishing

Often you’ll see code written like this:

PersistOrder(Order newOrder) {
    database.Save(newOrder);
    messageBus.Publish(new OrderCreatedEvent(newOrder));
}

Can you see what the issue is?

Imagine that the message bus is briefly unavailable. The reason doesn’t matter, only that some kind of exception is thrown when publishing. We’d have written a new order to the database, but then failed to let anyone else know it had been created. The poor customer has paid for their goods, but the company seems to have lost track of what’s happening!
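To make the failure concrete, here is a minimal simulation of that scenario. Everything in it is a stand-in I've made up for illustration: FlakyBus is a hypothetical in-memory bus, and a plain dict plays the role of the database.

```python
class FlakyBus:
    """Hypothetical in-memory bus that can be switched to fail on publish."""

    def __init__(self, available=True):
        self.available = available
        self.published = []

    def publish(self, event):
        if not self.available:
            raise ConnectionError("message bus unreachable")
        self.published.append(event)


def persist_order(database, bus, order):
    # Same shape as PersistOrder above: two independent writes, no shared transaction.
    database[order["id"]] = order
    bus.publish({"type": "OrderCreated", "order_id": order["id"]})


database = {}
bus = FlakyBus(available=False)

try:
    persist_order(database, bus, {"id": 1, "state": "New"})
except ConnectionError:
    pass  # the caller sees an error, but the order is already stored

order_saved = 1 in database
event_published = len(bus.published) > 0
print(order_saved, event_published)  # True False
```

The order exists, yet no consumer will ever hear about it unless someone replays the event by hand.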

It seems to be a common failing amongst developers that they believe in the infallibility of their software infrastructure. Perhaps they’re lulled into a false sense of security because cloud resources are available most of the time, and are then surprised when they’re suddenly gone. You need to consider what will happen when any external service is unreachable, and what you need to do to mitigate the resulting issues. Here are some example (incorrect) solutions:

  1. The message bus will probably only briefly be down, if it’s ever down at all, so why should we worry about it? Other teams that depend on this type of event will have to tell us that something’s going wrong and we’ll just scramble to put some SQL scripts together that can recreate the event and then publish it manually.
  2. Publish the event before saving to the database.
  3. Enlist saving to the database and publishing to the message bus in a distributed transaction.

Let’s break down what’s wrong with each approach:

  1. This actually happened to me! Although I’d previously warned my team about this possibility, nothing was done about it. The consequence was that at some point we failed to publish to Azure Service Bus and I had to spend a few hours at the end of the day manually pulling together the data to replay the missing events. You do not want to do this! It’s error-prone, unprofessional, a huge waste of time for everyone involved and, what’s more, it’s shit work that is beneath you.
  2. If you publish the event first, you’ve entered into a race with yourself. Which will happen first: the downstream consumer seeing the message, or you writing to the database? If the consumer then reads from that same database, it’s going to have an issue because the records don’t exist yet. Also, what happens if you fail to write to the database after publishing the message? Either it’s a poison message that the consumer can never hope to handle successfully or, even worse, the consumers carry out further actions even though the original order has been lost.
  3. In the past, if you had to write to multiple databases, you might enlist them in a distributed transaction. This solution has fallen out of vogue with the advent of microservices, and you can read more about why in the famous analogy with Starbucks. In my opinion, a bigger limiting factor than the loss of throughput is that the database and message bus must both support distributed transactions, which isn’t always the case.

The actual solution is to use the outbox pattern. In short, you write everything that needs to be published as an event to the database, in the same single (non-distributed) transaction as the data change itself. A second process runs separately on a loop, publishes any unsent messages, and then marks them as sent in the database. The caveat is that downstream consumers must be able to handle duplicate messages: if that process fails after publishing a message but before marking it as sent, the message will be published again on the next pass.

Either an “events” table can be created specifically for this task, or the whole database can be modelled as an event store, so that the entities are the aggregation of all their events and these can be replayed at a later time.
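As a rough sketch of the first variant (a dedicated table), here is the outbox pattern using Python's built-in sqlite3 module. The table names, column names and the relay_once function are my own assumptions for illustration, and the publish callable stands in for whatever client your message bus provides.

```python
import json
import sqlite3


def create_schema(conn):
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, state TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " sent INTEGER NOT NULL DEFAULT 0)"
    )


def persist_order(conn, order_id):
    # One local transaction: the order and its event are committed together or not at all.
    with conn:
        conn.execute("INSERT INTO orders (id, state) VALUES (?, 'New')", (order_id,))
        event = {"type": "OrderCreated", "order_id": order_id}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def relay_once(conn, publish):
    """Publish every unsent event in insertion order; return how many were sent."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, payload in rows:
        publish(json.loads(payload))  # may raise; the row then stays unsent
        # A crash between the publish and this update causes a duplicate on the next run.
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent


def broken_publish(event):
    raise ConnectionError("message bus unreachable")


conn = sqlite3.connect(":memory:")
create_schema(conn)
persist_order(conn, 1)

published = []
try:
    relay_once(conn, broken_publish)  # the bus is down: nothing is lost, only delayed
except ConnectionError:
    pass
first_attempt_pending = conn.execute(
    "SELECT COUNT(*) FROM outbox WHERE sent = 0"
).fetchone()[0]
second_attempt_sent = relay_once(conn, published.append)  # the bus is back
```

In a real service relay_once would run on a timer or in a background worker. The important property is that a bus outage only delays the event, because the pending row is still in the database when the relay next runs.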

Non-idempotent message handlers

“Idempotent” describes an action that, when carried out more than once, results in the same outcome as carrying it out once. A good example is a switch transitioning from “off” to “on”. The command can be modelled either as “toggle” (change the switch’s value to the opposite of its current value) or as explicitly passing the desired state, “on” or “off”. The difference is that “toggle” is not idempotent: the end state depends on how many times it was carried out.
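The switch example can be written down in a few lines. The function names here are purely illustrative:

```python
def toggle(switch_on):
    """Not idempotent: each application flips the state."""
    return not switch_on


def set_state(switch_on, desired):
    """Idempotent: the result depends only on the desired state."""
    return desired


initial = False  # the switch starts "off"

toggled_once = toggle(initial)
toggled_twice = toggle(toggle(initial))  # a duplicate delivery undoes the change

set_once = set_state(initial, True)
set_twice = set_state(set_state(initial, True), True)  # duplicates are harmless
```

After a duplicate "toggle" the switch is off again, whereas a duplicate "set to on" leaves it on.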

As mentioned in the solution to the “non-transactional publishing” issue, consumers need to handle duplicate messages. If a handler is not idempotent, it may carry out side-effects more often than necessary. It may also fail to carry them out at all, because an earlier attempt at handling the message changed the database state to a value that the handler now considers inconsistent.

Here’s some suspicious code:

HandleMessage(OrderCreatedEvent event) {
    var order = database.GetOrder(event.OrderId);
    if (order.State != State.New) {
        throw new Exception(event.OrderId + " isn't a new order!");
    }
    order.State = State.Processing;
    database.SaveOrder(order);
    messageBus.Publish(new PickingStartedEvent(order));
}

This issue is actually very similar to the previous one. Imagine what happens if we fail to publish the PickingStartedEvent. Usually messages can and should be retried, which makes handling them inherently transactional: either the whole handler succeeds, or the message is delivered again. Unfortunately, in this case the retry won’t carry out the remaining actions. The order has already been saved as Processing, so the state check throws and the PickingStartedEvent is never published.

Here’s the example modified to be idempotent:

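HandleMessage(OrderCreatedEvent event) {
    var order = database.GetOrder(event.OrderId);
    // Any state other than New or Processing means the order has already moved on.
    if (order.State != State.New && order.State != State.Processing) {
        log.Warning(event.OrderId + " has already been handled");
        return;
    }
    // Only write on the first attempt; a retry finds the order already in Processing.
    if (order.State == State.New) {
        order.State = State.Processing;
        database.SaveOrder(order);
    }
    // Runs on the first attempt and on every retry, so a failed publish is repeated.
    messageBus.Publish(new PickingStartedEvent(order));
}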

One of the changes is to continue even in the case that the message handler has run previously. This ensures that either the message is acknowledged and the whole handler has been executed, or it ends up on the dead-letter queue after repeated failure.

The other change is to simply exit and implicitly acknowledge the message if the order state is some unexpected value. The implication is that we’ve already seen and handled this message, and it would be pointless to carry out any further actions. Whether this is the desired behaviour depends on the use case. Perhaps the message should instead be transferred to the dead-letter queue for manual inspection.
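To see the retry behaviour end to end, here is a small simulation of the idempotent handler. UnreliableBus and deliver_with_retry are hypothetical stand-ins for a real bus and its redelivery mechanism, and a dict of order states replaces the database.

```python
class UnreliableBus:
    """Hypothetical bus whose first `failures` publish calls raise."""

    def __init__(self, failures):
        self.failures = failures
        self.published = []

    def publish(self, event):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("message bus unreachable")
        self.published.append(event)


def handle_order_created(orders, bus, order_id):
    state = orders[order_id]
    if state not in ("New", "Processing"):
        return "skipped"  # already moved on; acknowledge without side-effects
    if state == "New":
        orders[order_id] = "Processing"
    # Reached on the first attempt and on every retry until the publish succeeds.
    bus.publish({"type": "PickingStarted", "order_id": order_id})
    return "handled"


def deliver_with_retry(orders, bus, order_id, max_attempts=3):
    """Stand-in for the bus's redelivery: retry until the handler stops raising."""
    for attempt in range(1, max_attempts + 1):
        try:
            return attempt, handle_order_created(orders, bus, order_id)
        except ConnectionError:
            continue
    return max_attempts, "dead-lettered"


orders = {42: "New"}
bus = UnreliableBus(failures=1)
attempts, outcome = deliver_with_retry(orders, bus, 42)
```

The first attempt saves the new state and then fails to publish. The second attempt skips the save and publishes successfully. Note that a duplicate delivery of the same OrderCreatedEvent after this point would publish a second PickingStartedEvent, so the consumers of that event need to be idempotent as well.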

Receiving messages out of order

Some automatically assume that a message bus is a FIFO (first-in, first-out) queue, i.e. that messages will be consumed in the same order that they’re published. This is a misconception for two reasons:

  1. Taking Azure Service Bus as an example, FIFO delivery is not guaranteed by default. To get it, messages need to be published with an associated ID (a session ID in Service Bus terms), e.g. an order ID, so that messages sharing that ID are delivered in order. This is not the default because leaving the ordering unconstrained allows optimisations that keep throughput as high as possible.
  2. The competing consumer pattern is often used. Again, this is a feature used to ensure high throughput. The logical consequence of processing messages concurrently is that the order in which they are handled no longer necessarily matches the order in which they were published.
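The idea behind publishing with an ID can be illustrated with a simplified simulation. This is not how any particular message bus is implemented; the partition_for and route functions are assumptions of mine that only demonstrate the principle that messages sharing a key are handled by one consumer, in order.

```python
import zlib
from collections import defaultdict


def partition_for(key, partition_count):
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % partition_count


def route(messages, partition_count):
    """Assign each (order_id, event_type) message to a partition by order_id."""
    partitions = defaultdict(list)
    for order_id, event_type in messages:
        index = partition_for(order_id, partition_count)
        partitions[index].append((order_id, event_type))
    return partitions


published = [
    ("order-1", "OrderCreated"),
    ("order-2", "OrderCreated"),
    ("order-1", "PickingStarted"),
    ("order-2", "PickingStarted"),
    ("order-1", "OrderShipped"),
]

partitions = route(published, partition_count=4)

# Each partition is drained by exactly one consumer, so the sequence within an
# order survives even though nothing is guaranteed between different partitions.
seen_per_order = defaultdict(list)
for partition in partitions.values():
    for order_id, event_type in partition:
        seen_per_order[order_id].append(event_type)
```

Consumers can still work in parallel across orders, but the events of a single order are never interleaved out of sequence. Without a key, any of the competing consumers might pick up any message, and the sequence within an order is lost.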

Conclusion

In all fairness, some of these concerns, such as idempotency, apply equally to HTTP APIs. The difference is that the introduction of a message bus is akin to the introduction of a second database, although people often don’t consider it as such.

Nick Lydon

British software developer working as a freelancer in Berlin. Mainly dotnet, but happy to try new things! https://github.com/NickLydon