Event-Driven Architecture for Microservices – Part 2

Encoding Event Data

The event data needs to contain all the domain data that could be required by the services that will consume the events. This will differ for every event. For example, the “Transaction Added” event in the virtual wallet could have transaction ID, transaction type, amount, account ID and transaction date fields.

The data needs to be encoded / serialized for efficient transfer using the Pub/Sub system and subsequently decoded / de-serialized in the Event Consumer.

Considering that the event data also needs to be persisted to a relational database, it makes sense to encode the event data before it is persisted, as this avoids having to implement a different table for each event type. Although storing data in an encoded format in a database is not ideal, it is unlikely that there will be a need to support querying for event data in the originating service.

Events effectively replace the role of traditional IPC mechanisms for inter-service communications. For this reason it is important that the event data conforms to a known structure. Knowing this structure, or schema, allows the Consuming Services to utilize and rely on the data contained in events.

The following would thus be important considerations when deciding on an encoding mechanism for event data:

  1. A set schema is used to encode and decode event data. The schema details the names and types of the fields contained in the encoded data structure.
  2. The schema is well documented so that there is a clear understanding between the Originating Service and the Consuming Services.
  3. The schema is available to all services that require it, ideally in a central schema repository.
  4. The schemas can evolve without breaking forward or backward compatibility.
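
To make the first of these points concrete, the sketch below shows what a schema for the “Transaction Added” event could look like, together with an encode/decode round trip. It uses Avro (covered later in this post) via the Python fastavro library; the field names, types and namespace are assumptions based on the example above, not a prescribed format.

```python
# A sketch of a "Transaction Added" schema and an encode/decode round trip,
# using the Python fastavro library. Field names and types are illustrative.
import io
import fastavro

TRANSACTION_ADDED_SCHEMA = fastavro.parse_schema({
    "type": "record",
    "name": "TransactionAdded",
    "namespace": "com.example.wallet",  # hypothetical namespace
    "fields": [
        {"name": "transaction_id",   "type": "string"},
        {"name": "transaction_type", "type": "string"},
        {"name": "amount",           "type": "long"},   # e.g. cents, to avoid floats
        {"name": "account_id",       "type": "string"},
        {"name": "transaction_date", "type": "string"}, # ISO-8601 for simplicity
    ],
})

def encode(event: dict) -> bytes:
    """Serialize the event dict into compact Avro binary data."""
    buffer = io.BytesIO()
    fastavro.schemaless_writer(buffer, TRANSACTION_ADDED_SCHEMA, event)
    return buffer.getvalue()

def decode(data: bytes) -> dict:
    """Deserialize the Avro binary data back into a dict, using the same schema."""
    return fastavro.schemaless_reader(io.BytesIO(data), TRANSACTION_ADDED_SCHEMA)
```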

A Schema Registry

A Schema Registry is an independent service that is solely responsible for storing and retrieving event schemas. Originating Services are able to register an event schema with the Schema Registry service so that Consuming Services are able to retrieve it. The Consuming Services subsequently use the schemas to decode the events published by the Originating Services.

[Figure: Schema Registry]

A Database Schema for Events in the Originating Service

Once the event data is encoded it can be persisted to the database. The following schema can be used to persist events in the local relational database used by the originating service (any service that publishes events).

* – denotes a required field

*id – A GUID / UUID that identifies the event. The ID is essential for removing duplicates in any consumers that subscribe to this particular event type.

*sequence – An auto-incrementing sequence number that helps to maintain the order of events. The Event Publisher must always sort by the sequence when retrieving events from the database to publish to the Pub/Sub system.

*data  – The encoded binary data of the event itself. This encoded data forms the core of the message that is sent by the Event Publisher to the Pub/Sub system.

*event_source_id – An identifier of the object that caused the event to be fired. One event_source_id could be linked to many events.  In Domain Driven Design terms, this field refers to the unique identifier of the Aggregate Root. Effectively, it serves as a grouping mechanism for events. In Event Sourcing systems, if all the events linked to a specific event_source_id were replayed, it would be possible to reconstruct the state of the Aggregate Root.

For example, in a virtual wallet application, an event “Transaction Added” could define an event_source_id as the primary key of the account that triggered the event. If all the events for an account were replayed, you could calculate the account balance at any given time.

This field will not always link to a Domain object, hence it is simply referred to as the Event Source (e.g. an Email Service publishing an Email Failed event won’t necessarily link to any domain data).

This field can be used by the Pub/Sub system to group related events so that a single Event Consumer can consume related events in order. The order can be extremely important – in our Virtual Wallet, you cannot apply a debit transaction to an account that does not exist – you would need to process an Account Created event first.

event_source_name – A string field that indicates the type of the event source. In our “Transaction Added” example, the event_source_name would be “Account”. This column is not strictly necessary for the Event System, but could provide useful traceability – together with the event_source_id, it allows us to query for data related to the event.

schema_version – The version of the schema that was used to encode the data for an event. Only needed if the event data needs to be decoded at a later stage.

schema_name – The name of the schema that was used to encode the data for an event. Only needed if the event data needs to be decoded at a later stage.

created_at – A date/time that indicates when an event was created. While not strictly necessary for the event system, it could be helpful for determining scaling requirements for the Event Publisher by comparing the created_at time with the published_at time. It can also be used by an archiving system if we wish to remove or archive older events.

published_at – The time at which the event was published to the Pub/Sub system. This allows the Event Publisher to query for unpublished events. Once an event is published to the Pub/Sub system, the published_at date/time is updated, indicating that the event has been published successfully.
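
For illustration, a possible DDL for this table is sketched below using Python’s built-in sqlite3 module. The exact column types are assumptions – a production service would more likely use PostgreSQL or MySQL with native UUID, binary and timestamp types, and a proper sequence/identity column.

```python
# A sketch of the events table described above, using Python's built-in sqlite3.
# Column types are illustrative; a production database would offer native UUID,
# BLOB/BYTEA, TIMESTAMP and SERIAL/IDENTITY types.
import sqlite3

EVENTS_TABLE_DDL = """
CREATE TABLE IF NOT EXISTS events (
    id                TEXT PRIMARY KEY,        -- GUID / UUID of the event
    sequence          INTEGER NOT NULL UNIQUE, -- ordering key (SERIAL/IDENTITY in PostgreSQL)
    data              BLOB NOT NULL,           -- encoded (e.g. Avro) event payload
    event_source_id   TEXT NOT NULL,           -- e.g. the Aggregate Root identifier
    event_source_name TEXT,                    -- e.g. "Account"
    schema_name       TEXT,                    -- schema used to encode "data"
    schema_version    INTEGER,
    created_at        TEXT NOT NULL,           -- when the event was created
    published_at      TEXT                     -- NULL until published to the Pub/Sub system
);
"""

connection = sqlite3.connect("originating_service.db")
connection.execute(EVENTS_TABLE_DDL)
connection.commit()
```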

Event Lifetime in an Originating Service

There is no need to keep events once they have been published by the Event Publisher to the Pub/Sub system. There are two options to consider when implementing the Event Producer:

  1. Delete the event from the event table once it has been published or
  2. Set the “published_at” date/time once the event has been published

There are some advantages to keeping the events after they have been published:

  1. It is possible to query the domain data that caused the event to be triggered (provided the event_source_id and event_source_name are saved)
  2. It could be helpful for scaling purposes to check the difference between the created_at and published_at date/times. Large differences could indicate that the publisher(s) are overloaded.

However, a script to archive events older than a specific date will need to be implemented.
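
A minimal sketch of option 2 follows: an Event Publisher loop that reads unpublished events in sequence order, publishes them to Kafka (the Pub/Sub system discussed later) and only then sets published_at. The broker address and table layout are assumptions carried over from the earlier sketch, and the confluent_kafka Python client is used for illustration.

```python
# A sketch of an Event Publisher implementing option 2: publish unpublished
# events in sequence order, then set published_at. Assumes the events table
# from the earlier sketch and the confluent_kafka Python client.
import sqlite3
import time
from datetime import datetime, timezone

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed broker address
connection = sqlite3.connect("originating_service.db")

def publish_pending_events() -> None:
    rows = connection.execute(
        "SELECT id, data, event_source_id, schema_name "
        "FROM events WHERE published_at IS NULL ORDER BY sequence"
    ).fetchall()
    for event_id, data, source_id, schema_name in rows:
        # One topic per event type (the schema name stands in for the event name).
        # Keying by event_source_id keeps related events in the same partition,
        # so a single consumer sees them in order.
        producer.produce(topic=schema_name, key=source_id, value=data)
        producer.flush()  # wait for the broker to acknowledge before marking as published
        connection.execute(
            "UPDATE events SET published_at = ? WHERE id = ?",
            (datetime.now(timezone.utc).isoformat(), event_id),
        )
        connection.commit()

if __name__ == "__main__":
    while True:
        publish_pending_events()
        time.sleep(1)  # simple polling interval
```

Note that if the service crashes after the broker has acknowledged the message but before published_at is updated, the event will be published again on restart – which is exactly the at-least-once behaviour the Event Consumer is designed to handle.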

A Database Schema for Events in the Consuming Service

The Event Consumer needs to keep track of which events have already been processed as events could be duplicated. For this reason, we need to persist the GUID / UUID of the event that has been processed. A possible schema:

* – denotes a required field

*id – A GUID / UUID that identifies the event. The ID is essential to prevent events which have been processed already from being processed again. This ensures that the consumption of events and subsequent update of domain data is an idempotent operation. Note: It is important to implement a unique constraint on this field. This caters for race conditions where multiple consumers could attempt to process duplicates of the same event simultaneously. The unique constraint will fail the transaction, preventing the domain data from being duplicated.

received_at – A date/time that indicates when the event was received. This is required for archiving older events, but can be omitted if events are never archived.

Note – If the database or storage system of the Consuming Service is not a relational database, it is still possible to add the event ID and the domain data atomically. The only requirement is that the event ID is stored in the same record / document / row as the domain data and added during the same single operation. For example, in MongoDB the event ID could be added to the document that contains the domain data.
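
The sketch below illustrates how the unique constraint makes the consumption of a “Transaction Added” event idempotent. The processed_events and account_balances table names are assumptions, and sqlite3 again stands in for the relational database.

```python
# A sketch of idempotent event handling in a Consuming Service. The unique
# constraint on processed_events.id makes a duplicate insert fail, which rolls
# back the whole transaction so the domain update is applied at most once.
import sqlite3
from datetime import datetime, timezone

connection = sqlite3.connect("consuming_service.db")
connection.executescript("""
CREATE TABLE IF NOT EXISTS processed_events (
    id          TEXT PRIMARY KEY,   -- GUID / UUID of the consumed event (unique)
    received_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS account_balances (
    account_id TEXT PRIMARY KEY,
    balance    INTEGER NOT NULL
);
""")

def handle_transaction_added(event_id: str, account_id: str, amount: int) -> None:
    """Record the event ID and apply the domain change in a single transaction."""
    try:
        with connection:  # BEGIN ... COMMIT, or ROLLBACK on exception
            connection.execute(
                "INSERT INTO processed_events (id, received_at) VALUES (?, ?)",
                (event_id, datetime.now(timezone.utc).isoformat()),
            )
            connection.execute(
                "UPDATE account_balances SET balance = balance + ? WHERE account_id = ?",
                (amount, account_id),
            )
    except sqlite3.IntegrityError:
        # Duplicate event ID: the event was already processed, so it is ignored.
        pass
```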

Event Lifetime in a Consuming Service

Events in the Consuming Service must be persisted for a minimum period of time. This period is determined by two factors:

  1. The period during which one could reasonably expect duplicates of the same event to still be received in the event of communication or catastrophic failures.
  2. Whether or not events from either the Originating Service or the Pub/Sub system could be replayed.

In event sourcing systems, events could conceivably be replayed from service inception in order to rebuild the data in dependent services. For this use case, it might be required to store events indefinitely.

If event replay is not a use case to consider, the Consuming Service, as with the producing service, needs to implement an Event Archiver:

[Figure: Consuming Service event archiver]
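
A minimal sketch of such an archiver is shown below, assuming the processed_events table from the previous sketch and an arbitrary 30-day retention period; the same idea applies to the Originating Service’s events table using created_at / published_at.

```python
# A sketch of an Event Archiver for the Consuming Service: remove processed
# events older than a retention period. The table name and retention period are
# assumptions; events could equally be copied to cold storage before deletion.
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # assumed retention period

def archive_old_events(connection: sqlite3.Connection) -> int:
    """Delete processed events older than the retention period; return the count."""
    cutoff = (datetime.now(timezone.utc) - RETENTION).isoformat()
    with connection:
        cursor = connection.execute(
            "DELETE FROM processed_events WHERE received_at < ?", (cutoff,)
        )
    return cursor.rowcount
```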

Events and Pub/Sub Topics

It is generally recommended to use a separate topic in the Pub/Sub system for every schema used to encode data. The schema of each event will be different, so it makes sense to have a one-to-one mapping between events and topics. The topic names can match the names of the events.

Kafka as the Pub/Sub system

Kafka is a proven, scalable publish-subscribe messaging system. In comparison to most messaging systems, Kafka has better throughput, built-in partitioning, replication and fault tolerance, which makes it a good solution for large-scale message processing applications.

Advantages:

  • Very good write performance
  • Multiple consumers per topic
  • Cluster-level fault tolerance: Kafka has a distributed design that replicates data per-topic and allows for seamless handling of individual node failure.
  • Data retention for message replay: Kafka topics are built on a distributed commit log, which makes it possible to replay old messages even if they have already been delivered to downstream consumers. This allows for various data recovery techniques that are typically unavailable in other pub/sub systems.
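
As a point of reference, the sketch below shows a minimal Kafka consumer using the confluent_kafka Python client. The broker address, topic name and consumer group are assumptions; consumers sharing a group.id split a topic’s partitions between them, while separate groups each receive every message – which is how multiple services can consume the same events.

```python
# A minimal Kafka consumer sketch using the confluent_kafka Python client.
# Broker address, topic name and group.id are assumptions.
from confluent_kafka import Consumer

def process_event(key: bytes, value: bytes) -> None:
    """Hypothetical handler: decode the Avro payload and update domain data."""
    ...

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "anti-fraud-service",    # consumers sharing a group split the partitions
    "auto.offset.reset": "earliest",     # start from the beginning if no offset is stored
    "enable.auto.commit": False,         # commit only after the event has been processed
})
consumer.subscribe(["TransactionAdded"])  # one topic per event type

try:
    while True:
        message = consumer.poll(timeout=1.0)
        if message is None:
            continue
        if message.error():
            print(f"Consumer error: {message.error()}")
            continue
        process_event(message.key(), message.value())
        consumer.commit(message=message)  # mark the offset as processed
finally:
    consumer.close()
```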

Apache Avro as an event data encoding format

Avro is a data serialization framework which relies on schemas. The schemas are required for both read and write operations.

Benefits:

  • Libraries available for most major languages
  • Data is serialized into a very compact binary format
  • Schemas can be evolved in a backward compatible fashion
  • Does not rely on code generation like some other systems
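
The sketch below illustrates the third point: a new schema version adds a field with a default value, so data written with the old schema can still be read with the new one. The field names and the default value are purely illustrative.

```python
# A sketch of backward-compatible schema evolution with Avro (via fastavro).
# Version 2 adds a "currency" field with a default, so data written with
# version 1 can still be read using the version 2 schema.
import io
import fastavro

V1 = fastavro.parse_schema({
    "type": "record", "name": "TransactionAdded",
    "fields": [
        {"name": "transaction_id", "type": "string"},
        {"name": "amount", "type": "long"},
    ],
})

V2 = fastavro.parse_schema({
    "type": "record", "name": "TransactionAdded",
    "fields": [
        {"name": "transaction_id", "type": "string"},
        {"name": "amount", "type": "long"},
        {"name": "currency", "type": "string", "default": "USD"},  # new field with a default
    ],
})

# Write a record with the old schema...
buffer = io.BytesIO()
fastavro.schemaless_writer(buffer, V1, {"transaction_id": "t-1", "amount": 100})

# ...and read it back with the new schema: the default fills the missing field.
record = fastavro.schemaless_reader(io.BytesIO(buffer.getvalue()), V1, V2)
print(record)  # {'transaction_id': 't-1', 'amount': 100, 'currency': 'USD'}
```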

Confluent as a Schema Registry

Confluent is an open-source streaming platform built to run alongside Apache Kafka. It includes a Schema Registry which exposes a REST API.

The Schema Registry allows configurable compatibility rules and exposes an endpoint for checking the compatibility of new schemas before they are registered. Schemas are automatically versioned when registered, and clients are able to request either the latest schema or a specific version of the schema for a subject.
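
The sketch below shows what interacting with the Schema Registry’s REST API could look like from Python using the requests library. The registry URL, subject name and schema are assumptions, and error handling is omitted.

```python
# A sketch of checking, registering and retrieving a schema via the Schema
# Registry's REST API, using the "requests" library. The registry URL, subject
# name and schema are assumptions; error handling is omitted for brevity.
import json
import requests

REGISTRY_URL = "http://localhost:8081"   # assumed Schema Registry address
SUBJECT = "TransactionAdded-value"       # assumed subject name
HEADERS = {"Content-Type": "application/vnd.schemaregistry.v1+json"}

schema = {
    "type": "record",
    "name": "TransactionAdded",
    "fields": [
        {"name": "transaction_id", "type": "string"},
        {"name": "amount", "type": "long"},
    ],
}
payload = json.dumps({"schema": json.dumps(schema)})

# Check the new schema's compatibility against the latest registered version.
compat = requests.post(
    f"{REGISTRY_URL}/compatibility/subjects/{SUBJECT}/versions/latest",
    headers=HEADERS, data=payload,
).json()
print(compat)  # e.g. {"is_compatible": True}

# Register the schema; versions are assigned automatically.
registered = requests.post(
    f"{REGISTRY_URL}/subjects/{SUBJECT}/versions", headers=HEADERS, data=payload,
).json()
print(registered)  # e.g. {"id": 1}

# Retrieve the latest version (or a specific version) of the subject's schema.
latest = requests.get(f"{REGISTRY_URL}/subjects/{SUBJECT}/versions/latest").json()
print(latest["version"], latest["schema"])
```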

Confluent also includes the Kafka REST Proxy, which provides a RESTful interface to a Kafka cluster in order to produce and consume messages, view the state of the cluster, and perform administrative actions without using the native Kafka protocol or clients.

Event-Driven Architecture for Microservices – Part 1

In order to support the development of independently deployable services, we need a mechanism that allows us to perform business transactions that span multiple services. Event-Driven Architecture supported by a Pub/Sub system can help us achieve this.

[Figure: Pub/Sub]

Whenever a significant change occurs in the domain data of a service, an event is published to the Pub/Sub system. Any interested service is able to subscribe to these events and perform any manner of actions in response, each of which could trigger additional events.

The Anti-Fraud Service

[Figure: Pub/Sub use case – anti-fraud service]

In this example we have a virtual wallet service. An anti-fraud service could subscribe to “Transaction Added” events published by the virtual wallet service in order to detect fraudulent activity. If the anti-fraud service determines that a specific transaction appears suspicious, it can in turn publish a “Fraud Detected” event. This allows the Virtual Wallet to mark the account in question as flagged and disallow any further transactions.

Data Consistency

An event is the result of a significant change in the domain data that a service manages. In the virtual wallet example above, adding a transaction would warrant an event to be published, since other services are likely to be interested in this type of event. This means that the operation to add a transaction in the virtual wallet involves the following two operations:

  1. Updating the domain data (Persist the transaction and update the account balance)
  2. Publishing an Event

In order to ensure reliable communication between services, it is essential that no events are lost should a catastrophic failure occur at any point in the process. If the domain data is updated but the publishing of the event fails, the data becomes inconsistent.

[Figure: Failure – domain data updated, but event publish fails]

Similarly, if the event is published first, but the operation to update the domain data fails, the data is inconsistent.

[Figure: Failure – event published, but domain data update fails]

Adding Domain Data and Events Atomically

Where a relational database is employed as the primary data store for a service, local database transactions can be leveraged to ensure that the events are added atomically. Domain data and events can be added in the same database transaction, ensuring that the data remains consistent:

[Figure: Domain data and event added in a single transaction]

If there is a failure at any point of the operation, the entire transaction fails – the domain data remains in the state it was in before the transaction started and the event is not added to the event table.

However, if there is no failure during the transaction, the domain data is updated and the event is added to the event table atomically.
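
A sketch of this pattern is shown below, reusing the sqlite3-based events table from Part 2. The transactions and accounts tables, and the MAX(sequence) + 1 trick, are simplifications for illustration – a production system would use a proper sequence or identity column.

```python
# A sketch of adding domain data and the corresponding event in a single local
# database transaction. Assumes the events table created in the Part 2 sketch;
# the transactions and accounts tables are illustrative.
import sqlite3
import uuid
from datetime import datetime, timezone

connection = sqlite3.connect("originating_service.db")
connection.executescript("""
CREATE TABLE IF NOT EXISTS accounts (
    id      TEXT PRIMARY KEY,
    balance INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS transactions (
    id         TEXT PRIMARY KEY,
    account_id TEXT NOT NULL,
    amount     INTEGER NOT NULL,
    created_at TEXT NOT NULL
);
""")

def add_transaction(account_id: str, amount: int, encoded_event: bytes) -> None:
    """Persist the transaction, update the balance and queue the event atomically."""
    now = datetime.now(timezone.utc).isoformat()
    with connection:  # one transaction: everything commits, or everything rolls back
        connection.execute(
            "INSERT INTO transactions (id, account_id, amount, created_at) VALUES (?, ?, ?, ?)",
            (str(uuid.uuid4()), account_id, amount, now),
        )
        connection.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, account_id),
        )
        connection.execute(
            "INSERT INTO events (id, sequence, data, event_source_id, event_source_name,"
            " schema_name, schema_version, created_at) VALUES"
            " (?, (SELECT COALESCE(MAX(sequence), 0) + 1 FROM events), ?, ?, 'Account',"
            " 'TransactionAdded', 1, ?)",
            (str(uuid.uuid4()), encoded_event, account_id, now),
        )
```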

An Event Producer, which operates in a separate thread from the main service application, reads events from the event table and publishes them to the Pub/Sub system.

On the other hand, any service that wishes to be notified of events can implement an Event Consumer. The Event Consumer subscribes to any events it is interested in and updates the domain data of the Consuming Service. This could in turn trigger more events.

The Consumer is responsible for keeping track of Events which have been processed, since duplication could occur. It does so by storing the GUID / UUID of the published event in its data store. For each event it consumes, it first checks the data store to see whether the event has already been processed, before applying the domain data changes and recording the event ID in a single transaction.

Message Delivery Guarantees

Exactly-once message delivery guarantees come at the cost of performance and complexity. However, it is possible to implement an Event Consumer that is able to discard duplicate events. This means that the Event Publisher as well as the Pub/Sub system can implement at-least-once delivery guarantees and remain performant.

[Figure: Message delivery guarantees]

Debugging a Fridge

No, I am not referring to some new IoT wifi smart fridge of the future. As a Software Engineer, I like applying the same problem solving techniques I use in my day job to other domains – like fixing fridges!

Recently, our fridge started giving problems – it just never cooled down and kept running. The first suspect was the door rubber seals. The door seemed to have drooped slightly over the years, but the seals were ok.

Next, I checked the ventilation. I threw out some preserves that were old and moldy and repacked everything to assist the air flow. I placed a thermometer inside so I could monitor improvements.

The fridge purge didn’t have much effect. Then I realized something – because the door had drooped slightly, the switch that controls the interior fridge light, which is mounted above the door, was no longer triggered when the door closed.

But why would the light switch affect the cooling of the fridge? I figured that modern, energy-efficient fridges would likely want to disable the interior fan that circulates cold air inside the fridge when the door opens, preventing the cold air from escaping.

A quick test by taping the switch in the closed position confirmed that this was indeed the problem.

I attached a small spacer to the top of the door with double-sided tape to compensate for the door droop, so that the switch is once again triggered when the door closes.

For now, the fix will do. Time to start saving for a new IoT wifi smart fridge! 😀