At Dwolla, we take security and reliability seriously. Our production infrastructure is defined in code. Immutable computing resources mean operations staff do not manually log into servers to make changes, a practice that is risky and gives operators the opportunity to circumvent two-phase control procedures. Networking rules, subnets, security groups and other cloud infrastructure are committed to source control. Changes are made using pull requests and code reviews, and applied automatically using AWS CloudFormation.
This has worked well, but sometimes things can go wrong and manual tweaks need to be made. A CloudFormation stack might fail to update or a legacy security group (created before we went all-in on infrastructure-as-code) might need to be retired. Traditionally, trusted members of staff with elevated permissions would step in to complete those tasks.
This has the same pitfalls as making manual changes to servers:
- Reviewing changes ahead of time is challenging or impossible
- Operators under stress can make mistakes
- Descriptions of the necessary changes might be vague
- Credentials of specific employees become especially sensitive
Several months ago, we decided to change how ad hoc commands were run. Rather than give certain staff members elevated permissions around specific commands, we built a workflow using Slack, whereby any Dwolla engineering team member can propose commands to run. When one of our operations experts approves the proposal (under 2FA control), our bot automatically runs the proposed commands. Because we like adorable animals, the Dwolla Automation Koala was born!
Avatar artwork by Emily Griffin (@emilywithcurls)
To propose a command, Dwollans use an /aws-koala Slack command, followed by the commands they would like to run. Currently this takes the form of a bash script and runs in an environment with elevated permissions and tools like the AWS CLI preinstalled.
The command is sent to Dwolla’s backend server. The server ensures the message wasn’t mangled en route and then posts the proposal to the channel. The team reviews the proposal and either approves or denies the command.
If the command is approved, it is uploaded to S3 so a permanent record of the command can be kept for audit purposes. The S3 path and a SHA-256 hash of the command are then sent to a Jenkins job which is ultimately responsible for running the command with the necessary permissions. The SHA-256 ensures that nothing has modified the command between submission and execution.
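The integrity check described above can be sketched in a few lines of Python. This is a minimal illustration, not the bot's actual code: the function names `command_digest` and `verify_before_run` are hypothetical, and it assumes the script is hashed as UTF-8 text and the digest is passed along as a hex string.

```python
import hashlib
import hmac


def command_digest(script: str) -> str:
    """Return the SHA-256 hex digest of a proposed script (assumed UTF-8)."""
    return hashlib.sha256(script.encode("utf-8")).hexdigest()


def verify_before_run(stored_script: str, expected_digest: str) -> None:
    """Raise ValueError if the stored script no longer matches the digest
    recorded at approval time. Hypothetical helper for illustration."""
    actual = command_digest(stored_script)
    # compare_digest does a constant-time comparison of the two hex strings
    if not hmac.compare_digest(actual, expected_digest):
        raise ValueError("script was modified after approval")
```

In a flow like this, the digest would be computed when the proposal is approved, and the job that runs the command would call something like `verify_before_run` on the copy it fetched from storage before executing anything.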
This has worked very well for our teams. Developers across Dwolla have proposed commands to be executed, which lets them specify exactly what needs to be done and reduces the turnaround time for execution. Everyone has benefited from improved visibility into changes as well, allowing for gentle oversight and serendipitous discovery of overlapping efforts.
For example, when Rocky was rolling out our serverless webhooks, he used 🐨 to trigger AWS Lambda executions to enable beta customers, restart tasks after recovering from failures and otherwise roll out the new system iteratively. We eventually built an administrative dashboard to handle these functions, but not until we’d proven we needed them.
Another team used the Koala to create a new Amazon RDS database instance from a snapshot of one of our production databases in order to debug a specific issue the team was working through. Now that a couple of similar requests have been issued, Dwolla’s Platform team has a project to further simplify the process. Both debugging the problems faced by the product team and justifying the work to simplify the data restore process were made easier by the 🐨.
As we developed the 🐨, we ran into a few issues along the way. By default, Slack mangles the contents of commands and messages. (E.g., text Slack auto-detects as a URL is converted from example.dwolla.net to <http://example.dwolla.net|example.dwolla.net>, which causes problems with commands that reference hostnames.) We also had trouble with smart quotes in the main Slack input field. Both issues were solved by allowing users to share snippets of code with the bot.
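To make the mangling concrete, here is a rough Python sketch of what undoing it would involve. The bot sidestepped the problem by accepting snippets, so this is only an illustration of the failure mode, not the approach Dwolla took. The `unmangle` function is hypothetical, and the sketch assumes that when a link has a label, the label is the text the user originally typed.

```python
import re

# Matches Slack link markup: <http://host|label> or <http://host>
_SLACK_LINK = re.compile(r"<(https?://[^|>\s]+)(?:\|([^>]+))?>")

# Map curly "smart" quotes back to plain ASCII quotes
_SMART_QUOTES = str.maketrans({
    "\u201c": '"', "\u201d": '"',
    "\u2018": "'", "\u2019": "'",
})


def unmangle(text: str) -> str:
    """Best-effort reversal of Slack's link markup and smart quotes."""
    # Prefer the label (what the user typed); fall back to the bare URL
    text = _SLACK_LINK.sub(lambda m: m.group(2) or m.group(1), text)
    return text.translate(_SMART_QUOTES)
```

Reversing the markup this way relies on guessing what the user meant, which is one reason reading the raw contents of a snippet is the more dependable choice.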
Creating snippets in the Slack desktop app raised another issue, though. It saves its files with Windows-style line separators, but we run the commands on a Linux instance. We added logic to the backend of the bot to look for that case and strip out the problematic characters.
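A minimal Python sketch of that cleanup step follows. The function name is hypothetical and the real backend logic may differ; this version only converts Windows-style CRLF pairs to LF and leaves everything else untouched.

```python
def normalize_line_endings(script: str) -> str:
    """Convert Windows-style CRLF line separators to Unix-style LF."""
    return script.replace("\r\n", "\n")
```

Without a step like this, bash on Linux treats the trailing carriage return as part of the last word on each line, so a command such as `ls` would be looked up as `ls\r` and fail.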
We believe the 🐨 bot is a safe way to move fast, and it’s working well for Dwolla teams. There are some enhancements we’d like to make. One thing that might be interesting would be a domain-specific language, so users don’t have to be bash scripting experts. Giving the bot some knowledge of the commands being run would also allow context-specific groups to become approvers. Improved workflow would allow someone with questions to temporarily block approval without issuing an outright denial.
We also plan to open source the backend server code, so anyone can use the same system with their Slack account. Look for announcements relating to this in the upcoming months.