Here at Dwolla, one of our core beliefs is that money = data. But there is more to this than dollars and cents: it takes careful investigation from a dedicated team working to answer questions and measure products. Over the last year we’ve been building a data platform to facilitate this measurement, and to do that we needed to collect and analyze data effectively.
Gathering atomic events
Database design has traditionally treated atomicity in terms of database transactions: operations that are all or nothing, indivisible and irreducible. For data to be semantically atomic, however, its meaning must be indivisible or irreducible. This goes beyond the current state of something to include when and how that state transitioned or changed. We need semantic atomicity because:
- Questions need to be answered, but the data to answer them is not yet known.
- The data needed to answer a question is known, but in its current form it cannot answer the question easily.
Events are a natural fit for the flexibility and data we need. Events describe what happened through their identity and structure, and when it happened; as time-based facts, they are a concept that others have also embraced. Since events are atomic, the smallest pieces of data, we can compose them to answer future questions.
Once we’ve defined these specific and immutable (unchanging) events, we simply need to gather them at scale and make them queryable. This allows us to transform the data quickly and answer questions easily.
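To make this concrete, here is a minimal sketch of what an atomic, immutable event might look like in Python. The class and field names are illustrative only, not our actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# frozen=True makes instances immutable: once recorded, an event never changes.
@dataclass(frozen=True)
class Event:
    name: str              # what happened, e.g. "transfer.completed"
    occurred_at: datetime  # when it happened
    payload: dict = field(default_factory=dict)  # the indivisible facts

# Hypothetical example: a completed transfer recorded as a time-based fact.
event = Event(
    name="transfer.completed",
    occurred_at=datetime(2015, 9, 1, 12, 30, tzinfo=timezone.utc),
    payload={"amount_cents": 2500, "source": "alice", "destination": "bob"},
)
```

Attempting to reassign a field on a frozen dataclass raises an error, which is exactly the guarantee we want from an event: it describes something that already happened and can never be rewritten.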
Analyzing events at scale
We now have a proliferation of data, a metaphorical explosion of events in “big data”. Because we have built our data platform on Amazon Web Services, we are able to leverage the following infrastructure:
- Low cost, ubiquitous, and flexible storage of events in S3
- Batch based aggregation and analysis with Elastic MapReduce
- Near real-time aggregation and analysis with Redshift
This suite of tools allows us to analyze data across the entire spectrum of structure, from raw events to aggregated, queryable results.
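As a rough illustration of the batch-based aggregation that Elastic MapReduce performs for us, here is what a MapReduce-style job over events looks like in plain Python. The toy data and variable names are my own, not one of Dwolla’s actual jobs:

```python
from collections import defaultdict
from functools import reduce

# Toy events: (event_name, amount_cents)
events = [
    ("transfer.completed", 2500),
    ("transfer.completed", 1000),
    ("transfer.failed", 500),
]

# Map phase: emit a (key, value) pair for each event.
mapped = [(name, amount) for name, amount in events]

# Shuffle phase: group values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each group, here with a simple sum.
totals = {key: reduce(lambda a, b: a + b, values)
          for key, values in groups.items()}
# totals == {"transfer.completed": 3500, "transfer.failed": 500}
```

EMR distributes the same map, shuffle, and reduce phases across a cluster, but the shape of the computation is the same.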
Our pipeline design rests on a few principles:
- Immutable events: immutability changes everything
- Idempotent pipelines: applying an operation twice produces the same result as applying it once
- Transformations as state machines: small, composable steps with well-defined transitions from one state (data value) to another
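A toy sketch of what idempotence buys us (illustrative code, not our actual pipeline): deduplicating events by id is a transformation where applying it twice gives the same result as applying it once, so re-running a step after a failure cannot corrupt the output.

```python
def dedupe(events):
    """Keep the first occurrence of each event id; order is preserved."""
    seen = set()
    out = []
    for event in events:
        if event["id"] not in seen:
            seen.add(event["id"])
            out.append(event)
    return out

events = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 1, "name": "a"}]
once = dedupe(events)
twice = dedupe(dedupe(events))
assert once == twice  # idempotent: f(f(x)) == f(x)
```

Because every step in a pipeline has this property, a partially failed run can simply be retried from the top without any special recovery logic.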
Events are immutable, and data transformation is a one-way street. Because of this, we can archive, tear down, replace, and recalculate any value derived from our events. As long as we have the original atomic events, data at its source is never lost.
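Because the original events are never lost, any derived value can be rebuilt by replaying them. A minimal sketch (with hypothetical event shapes) of recomputing an account balance from archived transfer events:

```python
def apply(balance, event):
    """Fold one event into the derived state (a running balance)."""
    if event["type"] == "credit":
        return balance + event["amount_cents"]
    if event["type"] == "debit":
        return balance - event["amount_cents"]
    return balance  # unknown event types leave the state untouched

def replay(events, initial=0):
    """Rebuild the derived value from scratch by folding over all events."""
    balance = initial
    for event in events:
        balance = apply(balance, event)
    return balance

archived = [
    {"type": "credit", "amount_cents": 5000},
    {"type": "debit", "amount_cents": 1500},
]
balance = replay(archived)  # tearing down and replaying reproduces the value
```

Any downstream table or report is just a fold like this over the event archive, so it can be deleted and regenerated at will.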
We’re especially excited to release parts of our data stack as open source and will be presenting some of our experiences at upcoming events like Tableau’s TC15 conference: “Dwolla, Tableau, & AWS: Building Scalable Event-Driven Data Pipelines for Payments”.
We’ll be taking a deep dive into some of the concepts I have shared, as well as the nuts and bolts behind our data architecture, in particular how we’ve automated our pipelines with a soon-to-be-released open source project, Arbalest. Finally, we’ll show how applications (like Tableau’s business intelligence platform) can leverage our data platform.
This blog post shares insights from Fredrick Galoso, a software developer and technical lead for the data and analytics team here at Dwolla. Fred has led the creation of our data platform, including its data pipeline, predictive analytics, and business intelligence platform. In his free time he contributes to a number of open source software projects, including Gauss, a statistics and data analytics library.