As engineers, we spend most of our time building systems with the happy path in mind. We build a service assuming our datastore will always be available, dependent systems will always be healthy, and everything will work together like a well-oiled machine.
Have you ever spent time thinking through how your application might behave if that wasn’t the case? Would it recover gracefully or get stuck in a bad state? Did you build your application prepared for unintended failures?
A lot of the information returned from GET calls to Dwolla’s API is backed by event-based read models written in Scala with Akka. Using a read model allows us to store data in the shape our API consumes it. For example, when a partner asks for details of a specific transfer, we are able to return the information with little to no transformation or additional lookups. Depending on the type of service, this also helps reduce load on our primary datastore.
These services subscribe to specific events, then collect and store the information that is later served through the API. Frequently, this process involves calling dependent services to look up relevant information about that event (white label customer details, transfer details, etc.) and persisting it in the service’s own datastore. This seems straightforward, but what happens when one of the services encounters an unexpected issue? There are several possible undesirable side effects:
- Internal queues back up
- Downstream processing could be delayed
- Services become overloaded
As Dwolla continues to evolve, we have used a few techniques to help harden our event-based services. We want to share them here and hopefully save you a few headaches when building your first few event handlers. These include message retrying, backoff strategies, throttling, and quality logging. Before these strategies can work effectively, specific points of failure should be identified and updated to handle exceptions. For example, when a block of code results in a future, make sure some type of recovery logic is in place to handle failures.
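To make the point about futures concrete, here is a minimal sketch using only the Scala standard library. It is not our production code: `lookupTransfer`, `Outcome`, `Ack`, and `Retry` are hypothetical names invented for illustration. The idea is simply that a failed future is turned into an explicit outcome the caller can act on, rather than being left unobserved.

```scala
import scala.concurrent.{ExecutionContext, Future}
import scala.util.control.NonFatal

object RecoverySketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical outcomes a handler reports back to whatever consumes the queue.
  sealed trait Outcome
  case object Ack extends Outcome
  final case class Retry(reason: String) extends Outcome

  // Hypothetical dependent-service call; an empty id simulates an outage.
  def lookupTransfer(id: String): Future[String] =
    if (id.isEmpty) Future.failed(new IllegalStateException("transfer service unavailable"))
    else Future.successful(s"transfer-$id")

  // Any non-fatal failure becomes an explicit Retry instead of a silently failed future.
  def handle(id: String): Future[Outcome] =
    lookupTransfer(id)
      .map[Outcome](_ => Ack)
      .recover { case NonFatal(e) => Retry(e.getMessage) }
}
```

With this shape, `RecoverySketch.handle("42")` completes with `Ack`, while `RecoverySketch.handle("")` completes with a `Retry` carrying the failure message.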
The idea behind retrying a message is pretty straightforward: if your service encounters an error, it should retry. When implementing retry logic, it is important to ensure your service is left in a good state to handle additional messages. The goal is to put the message that could not be processed back on the queue to be handled later.
Our first few handlers were modeled on the idea that one actor should handle multiple messages. While this concept certainly works, we found that it added the complexity that comes with state management. Which message should be acknowledged or retried? It wasn’t long before we found that using one actor per message simplified this. Spinning up a new actor per message guaranteed that each actor was responsible for exactly one message. Choosing this approach also forced us to ensure these actors were shut down properly.
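It is also important to realize that some types of exceptions do not require the message to be retried. For example, if a unique constraint is enforced to guard against duplicate events, a violation of that constraint means the event has already been handled, so retrying the message would accomplish nothing.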
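One way to picture this decision is as a small function from the result of processing to what happens to the message next. The sketch below assumes a hypothetical `DuplicateEventException` that the persistence layer raises on a unique-constraint violation; `Decision`, `Acknowledge`, and `Requeue` are likewise illustrative names, not part of any library.

```scala
import scala.util.{Failure, Success, Try}

object RetryDecisionSketch {
  // Hypothetical exception raised when the datastore rejects a duplicate event.
  final class DuplicateEventException(msg: String) extends RuntimeException(msg)

  sealed trait Decision
  case object Acknowledge extends Decision // done with the message, remove it from the queue
  case object Requeue extends Decision     // put the message back for a later attempt

  def decide(result: Try[Unit]): Decision = result match {
    case Success(_)                          => Acknowledge
    // A duplicate means the event was already stored, so retrying cannot help.
    case Failure(_: DuplicateEventException) => Acknowledge
    // Everything else is treated as transient in this sketch.
    case Failure(_)                          => Requeue
  }
}
```

Treating every other failure as transient is a simplification; a real handler would likely distinguish more cases.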
The retry frequency and total number of retries should also be considered. Retrying too often could increase load on other services, and not retrying soon enough could cause unnecessary delays.
The type of message and the impact caused by a processing delay help determine the appropriate retry policy. For example, Dwolla’s API V2 webhooks have a retry policy that includes eight additional attempts over a 72-hour period. However, the read models that back our API have a much more aggressive retry policy, with a delay usually measured in milliseconds.
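A retry policy like this can be captured in a few lines. The following is a sketch of one common backoff strategy, exponential backoff with a cap, and not a description of our actual policies: `RetryPolicy`, `base`, `cap`, and `maxRetries` are hypothetical names, and the numbers used with it are purely illustrative.

```scala
import scala.concurrent.duration._

object BackoffSketch {
  // base: delay before the first retry; cap: upper bound on any single delay;
  // maxRetries: how many retries are allowed before giving up.
  final case class RetryPolicy(base: FiniteDuration, cap: FiniteDuration, maxRetries: Int) {
    // Delay before retry number `attempt` (1-based), or None once retries are exhausted.
    def delayFor(attempt: Int): Option[FiniteDuration] =
      if (attempt < 1 || attempt > maxRetries) None
      else {
        // Double the delay on each attempt, never exceeding the cap.
        val doubled = Iterator.iterate(base)(d => (d * 2L).min(cap)).drop(attempt - 1).next()
        Some(doubled.min(cap))
      }
  }
}
```

For instance, `RetryPolicy(50.millis, 1.second, 8)` would wait 50 ms before the first retry, 100 ms before the second, and never more than one second, then stop after eight retries. A slower policy for something like webhooks would use the same shape with much larger values.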
Building a fault-tolerant service also means taking into consideration the impact your service has on the overall system. When we created our first few event-driven services at Dwolla, it felt like that old home improvement show where we were enamored with how many messages we could process concurrently, turning dials to push the limits.
It was fun in the beginning, but we quickly realized it wouldn’t scale. The more messages we processed concurrently, the more unnecessary stress we put on shared downstream services, degrading performance of various parts of the system. For example, consider a large mass payout that produces tens of thousands of events in a relatively short time period. For each event, the service would need to look up additional details of the transfers and the account(s) involved. We could try to handle as many messages concurrently as possible, but now consider that there are N other services that care about similar information for a subset of those events for unrelated reasons. Eventually we would reach a point where requests would start to take longer and possibly even time out.
As our services have matured, we’ve learned that handling messages at scale doesn’t always mean bigger, faster, stronger. We have found that, in some cases, reducing the number of messages a service handles concurrently has actually improved throughput and increased stability.
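To illustrate what capping concurrency can look like, here is a deliberately simple sketch built on standard library futures. It is an assumption-laden illustration rather than how our services are implemented: `processThrottled` and `maxConcurrent` are hypothetical names, and the approach (working through the messages in fixed-size batches) is just one of several ways to bound the number of calls in flight.

```scala
import scala.concurrent.{ExecutionContext, Future}

object ThrottleSketch {
  // Runs `f` over `items` with at most `maxConcurrent` calls in flight at a time.
  // Each batch must finish before the next one starts; results keep input order.
  def processThrottled[A, B](items: Seq[A], maxConcurrent: Int)(f: A => Future[B])(
      implicit ec: ExecutionContext): Future[Seq[B]] = {
    require(maxConcurrent > 0, "maxConcurrent must be positive")
    items.grouped(maxConcurrent).foldLeft(Future.successful(Vector.empty[B])) { (acc, batch) =>
      // Start the next batch only after every earlier batch has completed.
      acc.flatMap(done => Future.traverse(batch)(f).map(results => done ++ results))
    }
  }
}
```

The trade-off in this version is that one slow message holds up the rest of its batch, and a single failure fails the whole run. It keeps the load on downstream services predictable, which is the property that matters here.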
As we continue to add features and functionality to our API, our services will continue to mature, building for scale and quick recovery. While thinking through the “edge cases” and imagining what could fail isn’t always the easiest exercise, it’s well worth the investment.
At Dwolla, we are experts in facilitating ACH transfers. If you’re curious how your business, new application or platform could integrate our ACH API, reach out to our sales team.