
Matt:
Thanks for coming out to Des Moines JS. As most of you probably know, we meet the second Tuesday of every month, so thanks for coming, and I hope to see you more often. But this is our first time out at the Dwolla office.

Matt:
At least at this new location. It's a great space for an event like this.

Matt:
Really appreciate Dwolla hosting us tonight. We'll learn from Rocky about serverless webhooks, so I'll turn it over to him.

Rocky:
Thanks Matt.

Rocky:
So we are recording this, so no outbursts or anything. My name's Rocky Warren. I'm a principal software engineer at Dwolla, and we'll just jump right in. Here's a quick outline of what we're going to be talking about tonight. First, the original architecture: how we were previously sending webhooks, and what the limitations of that were. Then a brief overview of the new architecture, and then we'll start walking through the code. It's all open source, and I'll have links at the end of this talk for the code that's out there. We'll go over the rollout strategy: we have a Sandbox environment where our clients can begin testing, and since we're moving money, that's pretty important for us, so we rolled it out there first and then into production. Then a lot of lessons learned through this process, and then the results: did we actually get what we were expecting?

Rocky:
So first, some background: we're Dwolla, a payment platform.

Rocky:
The principal thing that we offer is an API to move money, so bank transfers, and we also offer user management and Instant Bank Verification. While you're working against our API, almost every POST action generates at least one webhook. A webhook, if you're not familiar, is also called a web callback or a push API. Essentially it's us calling your service, your backend server, and then you can react to that to, say, update your UI or send an email to your client. That's a popular use case for us. All of that is an HTTP POST that we send to our partners' APIs, and it usually includes an ID that they can take back to our API to get additional information if they need to. The main benefit of webhooks is that it prevents polling against our API, which makes it cheaper for us to host if people aren't constantly polling, looking for status updates of transfers. So think of it this way: we call them customers in our API, so one of our partners will create a customer, and those can go through different verification states. If you want to know when a customer goes into a verified state, you just subscribe to that webhook and get notified immediately when that happens. So this is the original architecture. I'll come over here.
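(For illustration, here's roughly the shape of such a webhook delivery in TypeScript. The field names below are hypothetical, not Dwolla's documented schema.)

```typescript
// Hypothetical shape of a webhook delivery (field names are illustrative).
// The partner's server receives this as the body of an HTTP POST.
interface WebhookEvent {
  id: string;          // unique ID for this event
  topic: string;       // e.g. "customer_verified"
  resourceId: string;  // ID to look up full details back in the API
  timestamp: string;   // ISO-8601 time the event occurred
}

const example: WebhookEvent = {
  id: "6e8c1a2b-0000-0000-0000-000000000000",
  topic: "customer_verified",
  resourceId: "a84d0000-0000-0000-0000-000000000000",
  timestamp: "2019-06-11T19:30:00.000Z",
};
```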

Rocky:
This is webhook subscriptions, and this is our Dwolla API down here. Webhook subscriptions is responsible for listening for events throughout the system. It subscribes to those events, looks up any subscriptions that it has for a specific URL, and puts those onto a queue. There's just one queue here, shared between all of our partners. Then there's a fleet of workers, written in Scala, that would listen to that queue and actually call the partner APIs over here on this side. From an end user's perspective, they would create a subscription: they'd POST to our API to create a subscription and just provide us with a URL, and we would then create a subscription in the webhook subscription service. Then, say they create a customer over in Service A; that would publish an event. The webhook subscription service would listen for that event, look for any subscribers to it, find that URL, and then publish an event to the handlers' queue, which would be picked up and forwarded on. These dotted lines at the bottom are retries. We have an exponential backoff strategy, so if the first attempt fails, the message gets thrown onto one of those queues.

Rocky:
So there's a 15-minute one, a 45-minute one, all the way up to 24 hours. That just continues to retry the webhook, up to eight retries, and then it ends up on an error queue that's not shown. Once the call is made from the handler, it publishes an event back onto another queue; that's why the arrow points back to the subscription service. The subscription service listens for that and saves the result off into a database. So, pretty standard. The limitations, however, and you can probably already guess this: if all of our partners share one single queue, at peak load deliveries are quite delayed, because there are so many messages in the queue. We have a big batch process that runs, and webhooks can get delayed up to 60 minutes. The batch process runs and dumps a bunch of messages onto the queue, and then if you create a customer, say, that event is at the very end of the queue and has to wait for all the others to process first. That essentially defeats the purpose for someone who really wants to know when something has happened. If they create a customer and they're expecting to send an email based on that event, they may go back to long polling our API, which isn't giving them what they want, and isn't giving us what we expected.
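(A rough sketch of that retry schedule; the specific intervals below are illustrative, not the exact production values.)

```typescript
// Illustrative exponential backoff schedule (in minutes). The real
// intervals differ, but the shape is the same: growing delays up to
// 24 hours, then the message lands on an error queue after 8 retries.
const RETRY_DELAYS_MIN = [15, 45, 120, 360, 720, 1080, 1440, 1440];

function nextDelayMinutes(attempt: number): number | null {
  // attempt is zero-based; null means "give up, send to the error queue"
  return attempt < RETRY_DELAYS_MIN.length ? RETRY_DELAYS_MIN[attempt] : null;
}
```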

Rocky:
So partners' notifications are delayed. What was happening was that the partners' customers were complaining to them that their services seemed slow, and they would turn around and complain to us, because they were only seemingly slow: we weren't getting them the notifications in time. So, two big things: high-volume partners, again, would dump a bunch of messages onto the queue, and there were also slow-to-respond partners. If each request to a particular partner's API is timing out after 10 seconds, they're again going to slow down everyone else behind them in that queue. One way we could get around this is by scaling the Scala workers that are listening to that queue, but then we're scaling them for all of our partners. We're making parallel requests even for smaller customers that aren't doing anything wrong and aren't slowing down the queue, and maybe their API can't handle a bunch of requests in parallel. But that's really our only knob to turn: just spin up more workers to handle the queue.

Rocky:
So if we wanted per-partner configuration for how quickly we're sending messages to each partner's API, this architecture doesn't really get it for us; it's non-trivial.

Rocky:
So the new architecture isn't drastically different. The whole left side, the subscription service, is the same. The difference is that we created new queues for every partner. Now, when a partner creates a subscription against our API, if they don't already have a queue and a worker, one gets dynamically provisioned for them. In the prior architecture we were using RabbitMQ; this one uses SQS. That's more of an implementation detail, but we also switched from Scala handlers to Node TypeScript handlers. Again, the language is much less of a concern than the actual architecture was; it was just an easy way to get spun up quickly and test this idea. So now when I create a webhook subscription, the webhook subscription service actually calls into webhook provisioner, and we'll get into the code in a little bit, and it dynamically generates both the SQS queue and the Lambda handler that handles that queue, and creates the event source, if you're familiar with how that integration works on AWS. Then that Lambda just serves off that one queue and calls the partner's API, and it handles its own retries, so it throws a message back onto its own queue on each failure with the same backoff schedule that we had before.

Rocky:
And then, again, it publishes back onto a different SQS queue that the webhook subscription service listens to, and it writes the result to the database. So first: why did we choose SQS and Lambda? Since we're dynamically generating these, trying to dynamically provision a RabbitMQ queue and a Scala worker, which were deployed on ECS previously, would be more difficult than it was probably worth when there's really excellent tooling around AWS. So that was definitely the first reason. Second, this is a perfect use case for Lambda. When our batch process runs, a bunch of Lambdas can spin up, and those can be individually configured to send messages in parallel: as many in parallel as a partner can take, or one after another if they can't handle as many. And then the other benefits of Lambda: there's no server management, so it maximizes the time we're spending actually writing code instead of handling all the underlying scaling and things like that. It decreases the attack surface, because we don't have to worry about patching servers and things like that. It's also really fast to deploy. Our deploys for the old system weren't necessarily slow; they were certainly under 10 minutes, probably under five minutes.

Rocky:
But with these serverless functions, since they're so small, they go out in under a minute, so we can really shorten the lifecycle of deployments. So we'll go through it. Are there any questions before we jump into the code?

Rocky:
All right. Again, I said we open sourced this. The parts that we open sourced, if we go back to the diagram here, are basically from the handlers on: the things that provision the handlers, and then the handlers themselves. Webhook subscriptions is still a Scala service that calls out to the provisioner. So we'll be going through this; we're in the webhook provisioner now. Imagine an API request coming in to create a subscription, and we would invoke the create Lambda function. This uses the Serverless Framework, if anyone is familiar with that. You create a serverless config file, and here are all the different functions that this serverless service exposes: create, delete, disable, update, and update code. We'll go over the interesting ones.
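(A minimal sketch of what such a config can look like; the handler paths and runtime below are assumptions, not the repository's exact contents.)

```typescript
// serverless.js (sketch): the provisioner service exposing one Lambda
// per operation. Handler paths and the runtime are illustrative.
module.exports = {
  service: "webhook-provisioner",
  provider: { name: "aws", runtime: "nodejs10.x" },
  functions: {
    create:     { handler: "src/create.handler" },
    delete:     { handler: "src/delete.handler" },
    disable:    { handler: "src/disable.handler" },
    update:     { handler: "src/update.handler" },
    updateCode: { handler: "src/updateCode.handler" },
  },
};
```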

Rocky:
The first one is create, and here are all the things it creates. It creates alarms: CloudWatch alarms for queue depth, or now, actually, for the age of the messages. If the age of a message reaches over 50 minutes on any queue, we'll get an alarm, because that usually means either that there's a ton of messages and they're not being processed quickly enough, or that maybe there's an issue with the downstream Lambda function that's handling those messages.
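(A sketch of that message-age alarm using the AWS SDK for JavaScript v3; the names and thresholds below are illustrative.)

```typescript
import { CloudWatchClient, PutMetricAlarmCommand } from "@aws-sdk/client-cloudwatch";

const cw = new CloudWatchClient({});

// Alarm when the oldest message on a partner's queue is older than
// 50 minutes (3000 seconds). Queue and topic names are illustrative.
export const createAgeAlarm = async (queueName: string, alarmTopicArn: string) =>
  cw.send(new PutMetricAlarmCommand({
    AlarmName: `${queueName}-message-age`,
    Namespace: "AWS/SQS",
    MetricName: "ApproximateAgeOfOldestMessage",
    Dimensions: [{ Name: "QueueName", Value: queueName }],
    Statistic: "Maximum",
    Period: 300,             // evaluate over 5-minute windows
    EvaluationPeriods: 1,
    Threshold: 3000,         // seconds
    ComparisonOperator: "GreaterThanThreshold",
    AlarmActions: [alarmTopicArn], // e.g. the SNS topic that feeds Slack
  }));
```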

Rocky:
There are also alarms for errors in the Lambda function, and we also set up a custom metric filter, a regular expression matching '[error]' in any CloudWatch log, that will fire off another alert to us so that we can investigate. Then it creates a Lambda function, and it creates a log group. Log groups are where CloudWatch logs reside, so any console.log in your code will go off to CloudWatch. It creates the SQS queue, and then the roles to make sure that the Lambda can pull messages off of the queue. And then here is the code that creates the Lambda; again, this is just using the AWS SDKs to create Lambda functions. The nice thing about all these handlers, and there's one for however many partners we have, all just sitting out there, is that they all share the exact same code. The only differences are these variables that we pass in. So we pass in a different concurrency, and we can control that per Lambda function based on the downstream partner API.
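(Roughly what that creation call looks like with the AWS SDK v3. This is a sketch: the bucket, role, and variable names are assumptions that mirror what was just described.)

```typescript
import { LambdaClient, CreateFunctionCommand, PutFunctionConcurrencyCommand } from "@aws-sdk/client-lambda";

const lambda = new LambdaClient({});

// One function per partner; all share the same deployed code bundle.
// Only the environment variables (and reserved concurrency) differ.
export const createHandler = async (consumerId: number, version: string) => {
  const name = `webhook-handler-${consumerId}`; // convention-based name (illustrative)
  await lambda.send(new CreateFunctionCommand({
    FunctionName: name,
    Runtime: "nodejs10.x",
    Role: "arn:aws:iam::123456789012:role/webhook-handler-role", // assumed role
    Handler: "handler.handle",
    Code: { S3Bucket: "webhook-handler-code", S3Key: `${version}.zip` }, // assumed layout
    Environment: {
      Variables: {
        PARTNER_QUEUE_URL: `https://sqs.us-west-2.amazonaws.com/123456789012/webhook-queue-${consumerId}`, // per-partner
        RESULT_QUEUE_URL: "https://sqs.us-west-2.amazonaws.com/123456789012/webhook-results", // shared
        ERROR_QUEUE_URL: "https://sqs.us-west-2.amazonaws.com/123456789012/webhook-errors",   // shared
        POST_CONCURRENCY: "5", // how many POSTs one invocation sends in parallel
        VERSION: version,      // lets the provisioner find stale functions later
      },
    },
  }));
  // Cap how many instances of this function can run at once.
  await lambda.send(new PutFunctionConcurrencyCommand({
    FunctionName: name,
    ReservedConcurrentExecutions: 2,
  }));
};
```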

Rocky:
If they can't handle requests very quickly, then we'll turn the concurrency down for them. They all share the same error queue and result queue, but they each have a different partner queue that they're listening for messages on. And then we put the version in here; why we do that will become apparent a little bit later.

Rocky:
Actually, that's right now. I'm going to have to kind of jump back and forth here, but here is the Lambda code, the handler code. When we make any changes to the actual webhook handler that's pulling messages off of the queue, the deploy job of that handler code actually invokes a function, this function, within the webhook provisioner code, and says: OK, this Lambda function has been deployed, it's new code now; go and look up all the old functions that exist out there with the old code and update them to the new code. That's what this Lambda function does. Going back to what the other ones in here are: delete is only called if the create fails. Right now it just cleans up; maybe certain resources got created and then it failed on creating the alarm, for instance. It would go and delete all the other resources that were created prior to that.
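(A sketch of that update-code pass, assuming convention-based function names and a code bundle in S3 like the sketch above.)

```typescript
import {
  LambdaClient, ListFunctionsCommand, UpdateFunctionCodeCommand,
} from "@aws-sdk/client-lambda";

const lambda = new LambdaClient({});

// After a new handler bundle is deployed, walk every provisioned handler
// function and point it at the new code. Names and keys are assumed.
export const updateAll = async (newVersion: string) => {
  let Marker: string | undefined;
  do {
    const page = await lambda.send(new ListFunctionsCommand({ Marker }));
    const handlers = (page.Functions ?? []).filter(
      (f) => f.FunctionName?.startsWith("webhook-handler-"), // convention-based
    );
    for (const f of handlers) {
      await lambda.send(new UpdateFunctionCodeCommand({
        FunctionName: f.FunctionName!,
        S3Bucket: "webhook-handler-code", // assumed bucket
        S3Key: `${newVersion}.zip`,       // assumed key layout
      }));
      // (The VERSION env var would be bumped too, via
      // UpdateFunctionConfigurationCommand; omitted here.)
    }
    Marker = page.NextMarker;
  } while (Marker);
};
```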

Rocky:
Disable I can get into a little bit more. Essentially, since these are dynamically provisioned, if someone were to delete their subscription, we would actually call delete; we would call this function and delete everything. We noticed that was causing quite a bit of churn, and these things take quite a while to provision, over a minute, sometimes two minutes. We actually had our own automated test that was creating a subscription and deleting it, and it was throwing a bunch of errors because of all kinds of race conditions in trying to do that. So instead, we disable now. There's an SQS queue and a Lambda function that's handling that queue, and there's what AWS calls an event source in between that connects the two together. All disable does is disable that event source, so the queue and the function are essentially just sitting there; they still exist, but they're unused and we're not getting charged for them. That really sped things up: now disables take seconds. And I don't know if you caught it back in create, but the first thing it does is look to see whether there's already an existing queue for this partner. If there is, it just enables that event source and bails out right away.
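(Disabling boils down to flipping the event source mapping off; a sketch:)

```typescript
import {
  LambdaClient, ListEventSourceMappingsCommand, UpdateEventSourceMappingCommand,
} from "@aws-sdk/client-lambda";

const lambda = new LambdaClient({});

// Disable (or re-enable) the SQS -> Lambda event source mapping.
// The queue and function stay provisioned but sit idle, unbilled.
export const setEnabled = async (functionName: string, enabled: boolean) => {
  const { EventSourceMappings } = await lambda.send(
    new ListEventSourceMappingsCommand({ FunctionName: functionName }),
  );
  for (const mapping of EventSourceMappings ?? []) {
    await lambda.send(new UpdateEventSourceMappingCommand({
      UUID: mapping.UUID!,
      Enabled: enabled,
    }));
  }
};
```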

Rocky:
Update will update the concurrency.

Rocky:
So via our admin console we can update the concurrency for a specific partner, if their queue is backing up and they think they can handle messages faster, or maybe they're coming too fast and they can't handle them. This function gets invoked via our admin console to tune that concurrency. And then update code is what I discussed: it goes and looks at anything that's on an older version and updates it. I can come back to this in a little bit.

Rocky:
So we'll hop over to the handler code, with a little tangent first. These slides are using GitPitch. It's an open source project where you just create your slides in a text document and it handles formatting them and showing them. It's really cool, so check it out. It's called GitPitch. I don't work for them or anything; I don't have a stake in the company. It's just really nice, because while I was coming up with this talk I'd be writing, and this is just in the open source repository, so the slides are open source along with everything else, and if I thought, "oh, I learned this too," I could just open the file in vi, drop another bullet point into one of these slides, and close it really quickly.

Rocky:
So now we're in the handler code. This is a different repository, because when we deploy this out, it has a last-updated version that the webhook provisioner can then go look for and update the other functions to, so we created it as a separate repository to make that a little simpler to manage. And here is the handler function. If you're familiar with how AWS Lambda works, you export a handler, and this is essentially the entry point for your function. Anything that's created outside of this handler sticks around during warm starts. On a cold start, when AWS generates your Lambda function for the very first time, it runs all this initialization code, then keeps it in memory while the container is warm and only calls into the handler. So you'll want to put anything that's time-intensive outside of your handler function. This one just pulls environment variables, which isn't super time-intensive, but other things, like connecting to DynamoDB or S3, you'll want outside of the handler so that your warm starts are as fast as possible.
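(The pattern, as a sketch: anything at module scope runs once per cold start, while the handler body runs on every invocation.)

```typescript
import { SQSClient } from "@aws-sdk/client-sqs";
import type { SQSEvent } from "aws-lambda";

// Module scope: runs once per cold start, then stays in memory
// while the container is warm. Put expensive setup here.
const sqs = new SQSClient({});
const partnerQueueUrl = process.env.PARTNER_QUEUE_URL!;
const postConcurrency = Number(process.env.POST_CONCURRENCY ?? "5");

// Handler: runs on every invocation, warm or cold. Keep it lean.
export const handle = async (event: SQSEvent) => {
  for (const record of event.Records) {
    // ...process each queued webhook message
  }
};
```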

Rocky:
So we come in here, and there's some error handling code. You can set a batch size for how many messages you want to pull off of an SQS queue at once; we have the batch size set at 10. The tradeoff with that is that if you throw an exception out of your Lambda function, it won't delete any of those 10, so the whole batch fails together.

Rocky:
So we want to make sure that we're not resending the same webhook to partners, which means we need to be very careful about how we're handling errors. We want to essentially never throw: we want to catch everything and insert it onto the error queue ourselves rather than throw it out, because otherwise the messages would just continually churn through and we'd resend the same ones.
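(In outline, that defensive shape looks something like this sketch: catch per message, route failures to the error queue, and never let an exception escape to fail the whole batch. The env var name is an assumption.)

```typescript
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";
import type { SQSEvent, SQSRecord } from "aws-lambda";

const sqs = new SQSClient({});
const errorQueueUrl = process.env.ERROR_QUEUE_URL!; // assumed env var

export const handle = async (event: SQSEvent) => {
  for (const record of event.Records) {
    try {
      await sendWebhook(record); // POST to the partner API (not shown)
    } catch (err) {
      // Never rethrow: a thrown error would fail the whole batch of 10
      // and cause every message in it to be redelivered and re-sent.
      await sqs.send(new SendMessageCommand({
        QueueUrl: errorQueueUrl,
        MessageBody: JSON.stringify({ body: record.body, error: String(err) }),
      }));
    }
  }
};

// The actual poster lives elsewhere in the handler code.
declare function sendWebhook(record: SQSRecord): Promise<void>;
```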

Rocky:
So you'll see some kind of weird error handling code that you probably wouldn't normally care as much about if you didn't care about sending the same messages multiple times. Otherwise this is pretty simple. It starts up, it sends the webhooks if it needs to, and then it publishes the results to SQS. Where it gets a little bit tricky is the concurrency. There are two concurrency dials that we can turn per partner queue.

Rocky:
There is the AWS reserved concurrency, and that is the number of Lambda functions that are serving your queue. We have that set pretty low. If you don't set anything, it will go up to your account limit, which is a thousand by default. So if you drop a million messages onto the queue, AWS will ramp up as many Lambda functions as possible in parallel and just drain the queue as quickly as possible. Again, we don't want to do that, because it would overwhelm the majority of our partners' APIs. So we set that to a pretty low number.

Rocky:
And then we have what is in this code called the post concurrency. That is: once you're in one of those Lambda functions (we have reserved concurrency defaulted to two), how many of the batch of 10 do we send in parallel? We set that by default to five. They're multiplicative: we have two Lambda functions running at the same time, each sending five at a time, so our default is 10 events that we're sending to a partner's API at once. And again, we can tune both of those via our admin console to essentially whatever we want. As high as we can go, if they can handle it; or set both to 1 and then we'll just be sending one by one.
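(The two dials multiply. A sketch of the in-function half, sending a batch in chunks of the post concurrency:)

```typescript
// Reserved concurrency (2 functions) x post concurrency (5 per function)
// = up to 10 webhooks in flight to one partner at once, by default.
const postConcurrency = Number(process.env.POST_CONCURRENCY ?? "5");

// Send a batch of items in chunks of `postConcurrency` at a time.
async function sendAll<T>(items: T[], send: (item: T) => Promise<void>) {
  for (let i = 0; i < items.length; i += postConcurrency) {
    const chunk = items.slice(i, i + postConcurrency);
    await Promise.all(chunk.map(send)); // each chunk runs in parallel
  }
}
```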

Rocky:
So that's what this limit stuff is. It loops through all of them and then posts the hook.

Rocky:
This requeue handles the retries. Another little quirk: with SQS, you can only delay a message up to 15 minutes, and some of the delays in our exponential backoff strategy, when webhooks keep failing, go up to 24 hours. So we use SQS message attributes to say: basically, requeue this message until the specified time when it should be sent again. That's what this is checking. If requeue is set to true, it doesn't even try to send the webhook; it just immediately requeues it in the publish results. We'll get to that publish results method.
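(A sketch of that requeue trick, using a message attribute to carry the target send time and hopping in 15-minute steps, since SQS caps per-message delay at 900 seconds.)

```typescript
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});
const MAX_SQS_DELAY_SEC = 900; // SQS caps per-message delay at 15 minutes

// Requeue a message so it's finally processed at `sendAtMs` (epoch ms),
// even if that's hours away: hop back onto the queue in 15-minute steps.
export const requeueUntil = async (queueUrl: string, body: string, sendAtMs: number) => {
  const remainingSec = Math.max(0, Math.ceil((sendAtMs - Date.now()) / 1000));
  await sqs.send(new SendMessageCommand({
    QueueUrl: queueUrl,
    MessageBody: body,
    DelaySeconds: Math.min(remainingSec, MAX_SQS_DELAY_SEC),
    MessageAttributes: {
      // If the target time is still beyond this hop, the handler sees
      // requeue=true on delivery and requeues again instead of POSTing.
      requeue: { DataType: "String", StringValue: String(remainingSec > MAX_SQS_DELAY_SEC) },
      sendAt: { DataType: "Number", StringValue: String(sendAtMs) },
    },
  }));
};
```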

Rocky:
We could have done this with Dynamo as well. We could have dropped something into Dynamo instead of continually requeueing messages over and over again, but SQS is pretty cheap and it simplified the architecture, so that's what we went with. postHook is pretty simple: we make a POST request with essentially the body of the SQS event that we received, and then we do a hard timeout after 10 seconds. If they haven't responded within 10 seconds, we kill it and send the result back as an error to the webhook subscription service that's listening.
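(A sketch of a dependency-free POST with a hard 10-second timeout, using Node's built-in https module.)

```typescript
import * as https from "https";

// POST the webhook body with a hard 10-second timeout. Resolves with the
// status code; rejects on timeout or network error.
export const postHook = (url: string, body: string): Promise<number> =>
  new Promise((resolve, reject) => {
    const req = https.request(url, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      timeout: 10_000, // fires 'timeout' if the socket is idle this long
    }, (res) => {
      res.resume(); // drain the body we don't care about (avoids a leak)
      resolve(res.statusCode ?? 0);
    });
    req.on("timeout", () => req.destroy(new Error("timed out after 10s")));
    req.on("error", reject);
    req.end(body);
  });
```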

Rocky:
And then publishing results. Any webhooks that actually were sent end up on the result queue; any that were requeued end up back on the partner queue. So it's publishing to both. If either of those fails in sendBatch, it sends the ones that did fail to the error queue. Another gotcha with publishing messages in a batch to SQS is that they can fail individually: the whole batch won't necessarily fail, so some of them can succeed while others don't. You have to make sure you check the result, and if any do fail, make sure they get to the error queue so we can inspect them later. Nothing's listening on the error queue; it's just sitting out there, and again we have an alarm, so if any messages get sent to that queue, we get notified and can go take a look and see what happened. And then there's another repository that we created to move those back to the specific partner queues to be retried, if we fixed the bug or it was a temporary error or something like that.
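(SendMessageBatch reports per-message failures in its result rather than throwing; a sketch of checking for them:)

```typescript
import { SQSClient, SendMessageBatchCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

// Publish up to 10 results in one batch; individual entries can fail
// even when the call as a whole succeeds, so inspect `Failed`.
export const sendBatch = async (queueUrl: string, bodies: string[]) => {
  const entries = bodies.map((body, i) => ({ Id: String(i), MessageBody: body }));
  const result = await sqs.send(new SendMessageBatchCommand({
    QueueUrl: queueUrl,
    Entries: entries,
  }));
  // Return only the individually failed bodies so the caller can
  // route them to the error queue.
  return (result.Failed ?? []).map((f) => bodies[Number(f.Id)]);
};
```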

Rocky:
All right. Questions about the code before we move on?

Rocky:
The question was: what strategy did we use to namespace all those resources? It's convention based, and this has already bitten us, so I don't know that we would necessarily do it this way going forward, though I think it's worked really well. You just have to know that if you change anything in this mapper file in the webhook provisioner, you could break some things. Each of these names is created in this file. They use what we call a consumer ID, which is basically what distinguishes each partner, and everything in webhook provisioner comes to this file to get the names of all the different resources. So all we need is that ID and we can figure out the names of all of these different things. Then it's a matter of making sense of the AWS SDKs: sometimes they expect an ARN and sometimes they expect a name, so you're shuffling back and forth and getting what you need to be able to do these things. The nice thing about this is that when I go to delete something, I don't need to know the ARNs of anything. All I need to know is that ID; once I have it, I know the names of everything that's been provisioned for that user. And what the delete code does, you'll see ignore404: if the resource doesn't exist, it just catches that error and continues on. It tries to delete everything, and if something's not there, OK.
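(The idea, sketched: one pure function from consumer ID to every resource name. The patterns below are illustrative, not the mapper file's exact conventions.)

```typescript
// Convention-based names: given only a consumer (partner) ID, every
// provisioned resource's name is derivable. Patterns are illustrative.
export const names = (consumerId: number) => ({
  queue: `webhook-queue-${consumerId}`,
  lambda: `webhook-handler-${consumerId}`,
  logGroup: `/aws/lambda/webhook-handler-${consumerId}`,
  ageAlarm: `webhook-queue-${consumerId}-message-age`,
  errorAlarm: `webhook-handler-${consumerId}-errors`,
});

// Delete can then just try everything by name and ignore 404s:
// for (const name of Object.values(names(id))) await ignore404(deleteIt(name));
```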

Rocky:
Any other questions?

Rocky:
All right. So the rollout. We first (and this slide is actually wrong; I updated it wrong) essentially enabled it for our internal applications in Sandbox via feature flags in the webhook subscription service. What that looks like is: if the consumer ID is this one, then when they try to create a subscription, provision the new stuff for them; otherwise, do it the old way. Then we enabled it globally in the Sandbox. Again, the Sandbox is our test environment; when partners are initially integrating with us, they'll integrate in the Sandbox. Then we monitored, obviously, and listened for any comments from our partners as they were integrating. There were a few, so we fixed those. We then found a handful of partners in our production environment that were complaining about the webhooks taking too long and asked them if they were willing to be beta testers for our new version.

Rocky:
We then did the same thing: we white-listed them via feature flag in Production, monitored, and gathered feedback from them. Then we started migrating our Production customers in batches. We took our highest-volume beta partner's volume as the cutoff: anything lower than that volume, we migrated first. We monitored, obviously, and then started taking our larger partners one by one, making sure that we were able to handle the load after each migration.

Rocky:
He asked, “What type of feedback were you getting from partners? Was it more around load or…” Surprisingly, we didn't get a ton of feedback. Some of it was about the HTTP libraries: since we were moving from Scala to Node, and I was using some popular HTTP libraries, they were doing things with the request that partners weren't expecting. One of them was lowercasing the headers, for instance. One of them was not handling an HTTPS-to-HTTP, or vice versa, redirect properly. Things like that.

Rocky:
Yes we did also have some.

Rocky:
I think once you explain that you're going to be making changes, it seems like people start concentrating on it more and finding issues that maybe were there prior to that. So we had a couple of those too, and we were able to quickly pull logs and show that it might be a problem on their side and not ours. And then, yes, we did have issues with partners receiving them, usually too quickly. Even though the old strategy sent them all in parallel, there were usually a lot of messages in the queue, so the chances of your messages landing right next to each other were lower. Whereas now, with our default of always sending 10 at once, we had to slow some down for certain partners.

Rocky:
And then, with the amount of monitoring that we now have in place, it allows us to see across the board: OK, this partner is having a ton of 10-second timeouts; we should probably dial that one down, without them ever reaching out to us and saying so.

Rocky:
So, any other questions before I move on to lessons learned? There are a lot of these; there are two slides of lessons learned.

Rocky:
So, getting back to HTTP libraries. I think people in the Node community are already used to trying to keep their bundle sizes low, and just because this is server-side code with Lambda, it's not as essential, but it's still important to keep your bundle sizes low and keep an eye on your memory usage, because it impacts your cold start times. For this it's not a huge deal: if your cold start time doubles to a second versus 500 milliseconds, that's not a huge deal. But for other things it could be; if you're basing your API on Lambda functions, then that is a big deal. Keeping the bundle size down is one of the ways to speed up Lambda cold start times, as is what I mentioned before: keeping initialization code out of the handlers and in the global scope.

Rocky:
Another issue that we ran into with HTTP libraries: I mentioned the header lowercasing, and there was the redirect issue, so we ended up just not using a library at all, since this is only making a POST. I think that was my fault; I'm just used to pulling in Axios because it's really popular. It's small, but not using it is even smaller, especially when you have a really constrained use case. Unfortunately, when I did that I created a memory leak, because I didn't realize that we weren't consuming the body of the response at all. We don't care about it.

Rocky:
We only care about the response code, but you need to call something like response.resume(). I didn't know that, so I was leaking every response body. Some of our partners had big response bodies for webhooks, some of them returning HTML, and that would all stay in memory, over and over, and these execute a lot. So any sort of memory leak, you're going to see it really easily. I think that's actually really interesting, too: we probably wouldn't have caught this if we had deployed it on ECS, or we wouldn't have caught it as soon, because you'd just restart the instance and everything's fine. Here, you can see the memory ticking up on every execution, and that, especially for someone like me, really annoys me, and I have to go figure out what the problem is. But after learning that, I know it going forward. I always know: OK, here's a problem that I could easily make again.
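(For reference, the fix: with Node's https API, an unconsumed response body stays buffered until you read it or call resume().)

```typescript
import * as https from "https";

https.request("https://partner.example.com/webhooks", { method: "POST" }, (res) => {
  // We only care about the status code...
  console.log("status:", res.statusCode);
  // ...but the body still arrives. Without this, every response body
  // (some partners return full HTML pages) accumulates in memory:
  res.resume(); // discard the data so the stream can be released
}).end();
```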

Rocky:
Second one: CloudWatch. You saw that we were creating a CloudWatch log group for each of these. The default retention period on a log group is forever, and these are probably the second most expensive piece. This is all really cheap, and we'll get into that in a little bit, but the most expensive thing behind the SQS queues is the CloudWatch logs. So make sure you either ship them somewhere else and don't keep them in CloudWatch, or set that retention period to something lower than infinity.
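(Capping retention is one SDK call; a sketch:)

```typescript
import { CloudWatchLogsClient, PutRetentionPolicyCommand } from "@aws-sdk/client-cloudwatch-logs";

const logs = new CloudWatchLogsClient({});

// Default retention on a new log group is "never expire"; cap it so old
// webhook logs stop accruing storage cost. 30 days is illustrative.
export const capRetention = async (logGroupName: string) =>
  logs.send(new PutRetentionPolicyCommand({ logGroupName, retentionInDays: 30 }));
```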

Rocky:
Another thing: there are a lot of best practices in the AWS documentation. They're there for a reason; they're best practices, and they're best followed, or you're going to run into some issues. While I was working on this, I would check the same best practices document occasionally, because I'd stumbled upon it, and the numbers in it kept changing. So I think AWS is still kind of figuring this out. The SQS-to-Lambda event source is relatively new, maybe within the last five to seven months, so I think they're still trying to tune it and get it correct. Things like: your Lambda functions can get throttled. If there are more messages on the queue, AWS has little workers trying to push messages to your Lambda functions, and if there aren't enough Lambda functions, because you set the reserved concurrency low, you'll start getting throttled. There are best practices for avoiding that. We're not really following those, because we don't want to overwhelm our partner APIs, but at least we know what the best practice is and why we're seeing all these throttles: because we're not following it. Dead letter queues: if you have a bad message, say you deploy an update and your Lambda function can't process it, you need some sort of dead letter queue, or otherwise it'll keep spinning, trying to process the same message and failing over and over again.

Rocky:
Idempotency. SQS is at-least-once delivery, so you can get the same message twice, and we have on a number of occasions, so you have to be able to handle that. And then the batch size, which I already went over: that's the number of messages the Lambda function will pull off the queue at once. You just have to understand the ramifications of picking a batch size of 10 and what that means: that they fail as a batch, and things like that.

Rocky:
Lambda errors. You'll get an alert saying that your Lambda erred out, and you won't be pointed in any direction on where to find that error or what happened. CloudWatch Insights (if anyone's familiar with Splunk, where you can write queries over your logs, Insights is similar to that) is really helpful there. I shared a query with the developer relations team that they can run if a partner is complaining, and see what's going on. You have to look for 'timed out' and 'process exited' and keywords like that, which you can only really know from experience, to see why a specific Lambda function erred out. There's also a bunch of monitoring tools, like Dashbird and IOpipe; I think they're all trying to rush into the space because it's taking off right now.

Rocky:
We haven't used any of those. I think they have free tiers, which seem to cover the most common cases, like alerting if your Lambda function errors out. But I would definitely look into them if we were going to continue to expand and use this for more production-critical things.

Rocky:
And then log messages. Going back to Lambda errors being elusive: make sure that you have high-cardinality values in your log messages. Think of a GUID, a UUID, for instance: something that, if you search for it, you're likely to find it, so you're not having to write some crazy regular expression to pull out the information that you need. We had one partner complaining about 404s: we were saying we were getting a 404 back, and the only way they return a 404 is in one really specific use case. So we looked through our logs, and within probably three minutes in the CloudWatch console I was able to find exactly the webhook ID that we were sending to them and the response that we were getting from them, and very quickly provide that information back to them. I think if we had been more lax about what we put in the log messages, it would have been a lot harder, maybe impossible, to get that information back to them and say: OK, we don't think we have an issue, because of this data, but we'll keep looking; if we do, here's what we think we're getting back from your API.

Rocky:
Another thing here: with serverless, it kind of puts the onus of monitoring in the developer's hands, which is actually empowering and a good thing, I think, because the person developing it knows what errors are likely to happen and what information they'd need if they do. You saw all the CloudWatch alarms; in a larger organization, or in a non-serverless architecture, getting that information could very easily be someone else's, a different team's, responsibility. Whereas with serverless, it's very easy to do it yourself and deploy it yourself with your serverless stack. We've got a lot of alarms around this, and they typically alert us with valuable information. We actually have a support ticket open with AWS about some things that hopefully are their problem and not mine.

Rocky:
This came up as a question the last time I gave this talk. You can configure a Lambda to serve multiple queues, so we could have had just one Lambda handler and a queue per partner, and that Lambda handler could serve all of the different queues; it could spin up 700 Lambda functions or whatever it needed to keep up with all of them. We considered doing that. Why we didn't go in that direction is that it limited the configuration options: being able to fine-tune the parallelism, how often we're hitting a specific partner's API. So yes, it would have simplified things, but it wouldn't have given us one of the main goals that we wanted to accomplish.

Rocky:
So we used TypeScript, the Serverless Framework, and then the AWS CDK. Those all worked really, really well. I'll show you a little bit of what the AWS CDK is; it stands for Cloud Development Kit. The Serverless Framework is an open source framework that allows you to configure basically your whole Lambda service in a YAML file and then deploy it out, and it'll handle all the CloudFormation, setting up CloudWatch alarms and all that stuff for you. There's also a bunch of plugins that make it really, really handy.

Rocky:
For example, serverless-webpack can bundle your Lambda functions, again to keep the size down so that they cold start quicker. And then TypeScript: I'm sure most of you are familiar with it, but it's essentially a superset of JavaScript that adds types, which are then removed when it's transpiled down to JavaScript. So it's mostly helping you at build time, making sure that a string's a string and not a number. Those have all helped a great deal, especially in combination. The AWS CDK is actually written in TypeScript natively; they have bindings for Java, and I think they're working on Python. The nice thing about TypeScript is that it really improves the IDE experience as well, because you can control-click on any type and get exactly what you need. It's essentially like having the AWS documentation within your Visual Studio Code or IntelliJ, WebStorm, whatever you use to develop. That made it really easy to speed up that cycle. So, quickly, AWS CDK versus Serverless Framework: the Serverless Framework is specifically about Lambda functions and the connections to and from Lambda functions.

Rocky:
So, for instance, it makes it really easy to set up an SQS event source and a Lambda function handling that SQS queue, or SNS, or API Gateway. It'll generate all of your API Gateway configuration for you with one line of YAML. That's where it comes in really handy. Whereas the AWS CDK, I think, is trying to do a lot of what the Serverless Framework does, but also everything else: essentially it's codifying CloudFormation. So we use that in webhook provisioner to create our topic.

Rocky:
Our topic is an SNS topic; all the CloudWatch alarms get sent to this topic, which then gets forwarded into a Slack channel. So, this is the AWS CDK: you create a stack. You extend the CDK's Stack class with MyStack, and then you can start dropping stuff in. A lot of these names are very similar to what they are in the CloudFormation documentation; for instance, when you create an alarm, you have a namespace, a statistic, and a threshold. And they're iterating on this; I think some of it has already changed if I were to update these libraries. Where it gets nice is when you have things that aren't in the CDK. This log group was created by the Serverless Framework, and the CDK allows you to import those, so I can use it as if I'd created it natively in the Cloud Development Kit. Now I can reference that externally created log group in here, and attach dependsOn flags and whatever onto it. So I'd recommend looking into it if you can; I think it's really nice. How this all operates: I create a bunch of metrics, and I think the result queue and the error queue are all created in here too, and then the output of this, in my package.json file, is to run cdk synth. What that does is drop it into a YAML file, stack.yaml.

Rocky:
So it outputs a regular CloudFormation YAML file. That's the output of the CDK, and then the Serverless Framework allows you to import CloudFormation. So I import that stack.yaml in my serverless.js file and then just deploy it all out as one stack. That's how this all fits together.
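(A compressed sketch of that pattern with the current CDK API. The library has changed since this talk, so treat the exact calls and names as illustrative.)

```typescript
import { App, Stack } from "aws-cdk-lib";
import * as sns from "aws-cdk-lib/aws-sns";
import * as cloudwatch from "aws-cdk-lib/aws-cloudwatch";
import * as cwActions from "aws-cdk-lib/aws-cloudwatch-actions";

class MyStack extends Stack {
  constructor(app: App, id: string) {
    super(app, id);
    // SNS topic that all alarms feed (forwarded on to Slack elsewhere).
    const topic = new sns.Topic(this, "AlarmTopic");
    // Same namespace/statistic/threshold vocabulary as CloudFormation.
    const alarm = new cloudwatch.Alarm(this, "ErrorQueueAlarm", {
      metric: new cloudwatch.Metric({
        namespace: "AWS/SQS",
        metricName: "ApproximateNumberOfMessagesVisible",
        dimensionsMap: { QueueName: "webhook-errors" }, // illustrative name
        statistic: "Maximum",
      }),
      threshold: 1,
      evaluationPeriods: 1,
    });
    alarm.addAlarmAction(new cwActions.SnsAction(topic));
  }
}

// `cdk synth` turns this stack into plain CloudFormation (stack.yaml),
// which the serverless.js config then imports and deploys as one stack.
new MyStack(new App(), "webhook-provisioner-alarms");
```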

Rocky:
You can deploy with the CDK itself, but since we're using the Serverless Framework as well for the Lambda function niceties, you can also use them together, and that's worked really well for us.

Rocky:
I think I briefly discussed this (it isn't full screen; I have an extension that's not letting that happen), but since we're dynamically provisioning these resources, I would think twice before exposing something through your API that then goes and hits an AWS endpoint and starts provisioning things. If you want to know the reason CloudFormation takes so long, try doing it yourself: you have to retry a lot. We were having an issue where the event sources take a really long time to connect the SQS queue to a Lambda function, and we were getting an error. I searched for it in quotation marks, and the only other place I found it was in a bug against Terraform, because Terraform was having the same issue and just had to add a retry. They retry every 15 seconds for up to a minute, which is what we decided to do too. We also had the idea of just using CloudFormation: generate a stack for each partner. The reason we didn't do that is AWS account limits. There's a 200-stack limit, and we would have one of these per partner. So you also have to be cognizant of what those account limits are.

Rocky:
Yes, you can get them raised, but if overnight we sign up a bunch of partners, or a bunch of different applications get created, we could exhaust that limit, and it would impact our production deploys because of something that's dynamically happening through the API. So understand AWS account limits. There's good documentation around these, but when you're starting to use a service, especially at scale, make sure you understand what the limits are and put alarms in place. Trusted Advisor is an AWS service that allows you to keep an eye on which service limits you're getting close to, to make sure that you're not blowing through them.

Rocky:
And then tagging. We tag all these resources, and it makes it really easy to see exactly what webhooks are costing us in AWS, because you can filter by tag. It also makes it easy to create different views in the console; or, and we don't have this yet, if we were to create a dashboard in our own admin tool or to display on a TV or something, you could easily pull all resources with the webhooks tag and see where they all are and how they're all connected.

Rocky:
So, getting into the results. It scales essentially up to whatever our partners are able to scale up to, within our AWS account limits. We can send these as fast as a partner is able to consume them, on a per-partner level.

Rocky:
This has already been really cool to see: if there are complaints about webhooks, we have a lot of tools at our disposal now to help that partner get around their issues. We can say, "It looks like your API is taking a little while to respond; we can increase the concurrency, but I don't know that that's really going to help your throughput." We have that information now and we can tell them, versus scaling up a giant cluster of workers on one queue and hoping that there aren't a bunch of slow consumers in front of you.

Rocky:
So it's taken that 60-minute delay down to, now during our batch jobs, essentially zero. Again, it's per partner, but for our larger partners that can handle them really quickly, it's pretty cool to see how fast we can fire them off. Again, configured individually.

Rocky:
And then low cost, and free when it's not in use. So yeah, if these are just sitting out there, say for a low-volume customer that has very few webhooks, it's completely free. For the other partners that have a lot of volume, it's still really, really cheap. I won't give out exactly how many we have, but I think for last month the bill for Lambda was $7.80, and we're talking millions and millions of executions of these Lambda functions. So it's very, very cheap. SQS is a little more expensive because we're requeueing every 15 minutes; if that became a high cost, which it definitely is not, we could look into ways of reducing it as well. And then the CloudWatch logs: we are keeping those in CloudWatch for now, and you're paying for both retention (stored data) and ingest of data. Again, we're talking under $100 a month for CloudWatch logs. It's really, really cheap, and we're sending millions and millions of these webhooks.

Rocky:
Here are links. These slides are also in the repository. So, again, getting back to my marketing for GitPitch: you create a PITCHME.md file in your repository, then you just go to GitPitch plus the name of the repository, and that's what this is. I'm just using their website to generate these, and we can share the links out, but you can also get a PDF version of the slides and everything right here. They just do it for you. It's really cool. So all of this is out on GitHub, on the Dwolla page. The first one we went over is the provisioner, and we also went over the webhook handler.

Rocky:
Webhook receiver is for the other side; we open sourced this for our partners. Webhook receiver is another Lambda function that uses the Serverless Framework and just stands up an API Gateway and a Lambda function, listens for messages, and essentially just logs them out and dies. So it's kind of a proof of concept for a partner. It also handles checking, since we sign our webhooks, that the signature is valid.

Rocky:
And since the Serverless Framework is cloud agnostic, you can deploy this out to any cloud: AWS, Google Cloud Functions, IBM OpenWhisk, Cloudflare. There's a bunch of different providers that they support. So we have that available for our partners.

Rocky:
And then we have cloudwatch-alarm-to-slack, which listens for CloudWatch alarms on a topic and forwards them on to a Slack API; it's all configurable through environment variables. I briefly discussed the SQS move: it's another Lambda function that, once executed, you give it a source queue and a destination queue and it'll go and move the messages to the other queue.

Rocky:
And then serverless-generator is a Yeoman generator for generating TypeScript and Node (and Scala's in there too) serverless functions that use the Serverless Framework. This is how all of these were created really quickly: you can just run 'yo serverless' and it'll ask you which language you want to generate for and give you a skeleton to get started really quickly. Are there any questions?

Rocky:
Yeah.

Rocky:
The question was: how do we handle our CI/CD process? We use Jenkins. There are two services that get deployed here: the webhook provisioner and the webhook handler. Webhook provisioner just runs a command via the Serverless Framework called 'serverless deploy', and that will deploy everything out via CloudFormation to AWS for you.

Rocky:
And so if we go back to the code, it uses this serverless.js file, and here is where I laid out all the functions. It will deploy each of these functions out in one CloudFormation stack, and we just call that 'sls deploy' from Jenkins. The other one is the same idea, it uses Jenkins as well, but the difference is that once it's deployed out there, Jenkins also calls this script, which executes the webhook provisioner's update-all. That then goes and looks up all the handler functions in AWS and updates them to the new version of code that just got deployed.

Rocky:
So it kind of deploys itself, and then it runs a second script that calls a different Lambda function, which goes and updates everything with the code that it just deployed. So it's kind of back and forth.

Rocky:
We have it capped right now at 50 in parallel. That can go up, but we have no real need to go up because the queues don't back up. That's with five reserved concurrency, so there are five Lambda functions serving the queue, and they each send 10 at a time, which is the max batch size. So they each get a batch and send them all, and there are five of them doing that.

Rocky:
And it seems like what the receiver must be doing is spinning up something, maybe a Lambda function, just dropping those onto a queue, and then processing them later.

Rocky:
The question was the cost of SQS versus running your own queue. I don't actually know. I know the SQS numbers, but I don't know what we pay for our Rabbit infrastructure. I don't think we have it tagged down to that level of granularity, so I honestly don't know.

Rocky:
Other questions?

Rocky:
All right. Well, like I said, the code's all out there. I'd be happy to answer more. I lived this for the past two to three months, so I know way more than I should about all this stuff and the gotchas, so if you have other questions I'll be here afterwards. Otherwise, thanks very much for coming.

Matt:
Yeah. Thanks, everybody, for coming. Thanks again to Dwolla for hosting. But yes, please stick around and pick Rocky's brain a little bit.

Matt:
I think there's a little more pizza if somebody wants another slice. Next month, the second Tuesday of the month, we'll be meeting again. We're always looking for speakers, so we have a talk-proposals repo; feel free to file an issue there, and we're happy to open up the conversation for more speakers. Also, if you want to chat JavaScript in between, we're on the Des Moines Web Collective Slack at dsmwebcollective.com. So join us out there too. Thanks again for coming.


Rocky Warren
Former Principal Software Engineer