Getting a phone call in the middle of the night when your servers are on fire is a necessary evil for many developers and network administrators. If your site is used around the world, it needs to be available 24/7. I thought it'd be fun to see how easy it would be to get a simple incident alarm going using Twilio and AWS CloudWatch, SNS, and Lambda. Hint: it's very easy. In this post I'll walk you through how to set this up yourself. Best of all, it's serverless, so there is nothing to maintain. You don't have to worry about your incident response server going down!
Of course, paid incident management tools exist, like PagerDuty, OpsGenie, and VictorOps. Cabot and OpenDuty are open source alternatives you can host yourself. They'll handle escalating incidents through your team, notification via multiple channels, and more. But that's no fun!
Before we jump in, here's what you'll need:
- an AWS account
- a Twilio account with a voice-capable phone number
- 15 minutes!
What you'll get: A voice call and a text message when your service is down / degraded.
The TL;DR: create an SNS topic, create a Lambda using the gist below which is triggered by that topic, and notify the SNS topic with CloudWatch.
1. Set up a new SNS topic
Simple Notification Service, or SNS, is Amazon's push messaging service. It is here that we create a "Topic" which describes how to notify us.
- Open up the AWS console and head to SNS.
- Click "Topics", then the "Create new Topic" button.
- Name it "Incident_Response" and give it a description like "Notifies the CTO via Lambda and text message".
- Click on your newly created Topic.
- Click the "Create Subscription" button.
- Click the "Protocol" dropdown, and choose "SMS"
- Type in your phone number, including area code.
Easy! You have an SNS topic which, when triggered, will send you a text message. The text message will always contain the name of the CloudWatch alarm, which provides some context to tell you what's on fire. Fun note: your message might even be powered by Twilio, as AWS use them as one of their delivery partners!
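If you'd rather script this step than click through the console, here is a minimal sketch in Python. It assumes you pass in a boto3 SNS client (`boto3.client("sns")`); the function name and the phone number are placeholders of my own, not anything AWS defines.

```python
def create_incident_topic(sns, phone_number, topic_name="Incident_Response"):
    """Create the SNS topic and subscribe a phone number to it via SMS.

    `sns` is expected to behave like a boto3 SNS client.
    `phone_number` should be in E.164 format, e.g. "+15555550100".
    Returns the ARN of the topic.
    """
    # create_topic is idempotent: it returns the existing ARN if the topic exists.
    topic_arn = sns.create_topic(Name=topic_name)["TopicArn"]
    # An "sms" subscription delivers every published message as a text.
    sns.subscribe(TopicArn=topic_arn, Protocol="sms", Endpoint=phone_number)
    return topic_arn
```

With boto3 installed and AWS credentials configured, calling `create_incident_topic(boto3.client("sns"), "+15555550100")` should give you back the topic ARN, which is handy to keep around for the later steps.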
We could leave it at SMS messages, but they aren't enough to wake me up at night. I need a phone call buzzing.
2. Create the Voice Call Lambda
Let's head over to AWS Lambda so we can trigger voice calls.
- Open Lambda in the AWS console.
- Click the "Create a Lambda function" button.
- Choose the "Blank template" blueprint.
- Configure an SNS trigger by clicking the grey box outline, and selecting "SNS" from the bottom of the dropdown menu.
- Select your "Incident_Response" topic.
- Tick the "Enable trigger" checkbox, which will configure all the necessary Lambda permissions and create the "subscription" in your SNS topic.
- Click Next.
- Call your function "notifyCTOWithVoiceCall"
Time for some code! Copy and paste this into your Lambda:
You'll need to update the toNumber variable in the code above, and add three environment variables which contain your Twilio credentials. You can also encrypt these variables using the encryption helpers.
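To give a feel for what a function like this involves, here is a rough Python sketch (it is not the gist's code). It posts to Twilio's Calls REST endpoint using only the standard library. The environment variable names `TWILIO_ACCOUNT_SID`, `TWILIO_AUTH_TOKEN` and `TWILIO_FROM_NUMBER`, the `TO_NUMBER` constant and the spoken message are all assumptions of mine, and this version only speaks a message rather than playing a song.

```python
import base64
import os
import urllib.parse
import urllib.request
from xml.sax.saxutils import escape

TO_NUMBER = "+15555550100"  # placeholder: the phone number to wake up


def build_twiml(message):
    """Return TwiML that reads `message` aloud twice."""
    return '<Response><Say loop="2">{}</Say></Response>'.format(escape(message))


def build_call_request(account_sid, auth_token, from_number, to_number, twiml):
    """Build (but do not send) the HTTP request that starts a Twilio call."""
    url = "https://api.twilio.com/2010-04-01/Accounts/{}/Calls.json".format(account_sid)
    body = urllib.parse.urlencode(
        {"To": to_number, "From": from_number, "Twiml": twiml}
    ).encode("utf-8")
    credentials = "{}:{}".format(account_sid, auth_token).encode("utf-8")
    request = urllib.request.Request(url, data=body, method="POST")
    # Twilio's REST API uses HTTP Basic auth with the account SID and auth token.
    request.add_header(
        "Authorization", "Basic " + base64.b64encode(credentials).decode("ascii")
    )
    return request


def handler(event, context):
    """Lambda entry point: place the call and return Twilio's raw response."""
    request = build_call_request(
        os.environ["TWILIO_ACCOUNT_SID"],   # assumed variable name
        os.environ["TWILIO_AUTH_TOKEN"],    # assumed variable name
        os.environ["TWILIO_FROM_NUMBER"],   # assumed variable name
        TO_NUMBER,
        build_twiml("Wake up. One of your alarms has fired."),
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Keeping the request building separate from the sending makes it possible to check the URL, headers and form fields without placing a real call.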
At the top of the page, hit the "Save and Test" button. You should get a voice call which talks to you, then plays a song! If not, take a look at the "Log output" area. It should have completed successfully and show Twilio's response, or any error messages. If that looks OK, log in to Twilio and check the debugger there.
At this point you have an SNS topic which will send you an SMS message and trigger a Lambda function which calls you. Now it's time to put it to use!
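If you want a quick sanity check before involving CloudWatch, you could publish a message to the topic yourself. The sketch below assumes a boto3 SNS client and the topic ARN from step 1; the function name and the message fields are my own illustration, loosely shaped like the JSON a CloudWatch alarm sends.

```python
import json


def publish_test_alarm(sns, topic_arn, alarm_name="TestAlarm"):
    """Publish a hand-made, alarm-like message to the incident topic.

    `sns` is expected to behave like a boto3 SNS client.
    """
    message = {
        "AlarmName": alarm_name,
        "NewStateValue": "ALARM",
        "NewStateReason": "Manual test publish",
    }
    return sns.publish(
        TopicArn=topic_arn,
        Subject='ALARM: "{}"'.format(alarm_name),
        Message=json.dumps(message),
    )
```

Every subscriber on the topic receives the message, so expect both the text and the call if everything is wired up.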
3. Configure your CloudWatch alarms
The easiest way I found to do an end-to-end test is to create an alarm you know will fail. Head over to AWS CloudWatch.
- Click "Alarms" in the sidebar, then the blue "Create Alarm" button.
- Search for a "CPU" metric, and choose the "CPUUtilization" metric for one of your instances.
- Click "Next".
- Call it "TestAlarm"
- Configure a Threshold "whenever CPUUtilization is <= 100 for 1 consecutive period(s)".
- In the "Actions" section, send the notification to your "Incident_Response" topic.
- Click "Create Alarm"
Since your CPU utilization will always be at or below 100%, the alarm fires as soon as CloudWatch evaluates its first period after you click "Create Alarm", and you'll get a phone call and a text. The phone call will wake you up, and the text message will give you a little bit of context before you get to your emails. Too easy!
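The same test alarm can be created from code. This is a sketch that assumes a boto3 CloudWatch client (`boto3.client("cloudwatch")`), an EC2 instance ID of yours, and the ARN of the "Incident_Response" topic; the helper names and the 5-minute period are my own choices.

```python
def build_test_alarm_params(instance_id, topic_arn):
    """Parameters for an alarm that is practically guaranteed to fire."""
    return {
        "AlarmName": "TestAlarm",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,             # seconds per evaluation period (assumed)
        "EvaluationPeriods": 1,    # "for 1 consecutive period(s)"
        "Threshold": 100.0,
        "ComparisonOperator": "LessThanOrEqualToThreshold",  # "<= 100"
        "AlarmActions": [topic_arn],  # notify the SNS topic on ALARM
    }


def create_test_alarm(cloudwatch, instance_id, topic_arn):
    """Create (or overwrite) the alarm using a boto3-style CloudWatch client."""
    cloudwatch.put_metric_alarm(**build_test_alarm_params(instance_id, topic_arn))
```

Remember to delete or loosen this alarm once you've had your test call, or it will keep sitting in the ALARM state.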
Congratulations! You've successfully set up a simple incident response tool which is effectively free, and you don't have any extra servers to maintain. Now you can create new alarms, or update existing ones, for those critical times when you need to be woken up.
Where to from here?
To take this further, an easy win should be getting more context from the CloudWatch alarm into the voice call. It'd involve looking up how CloudWatch passes those attributes into the Lambda, and then using that in our dynamically generated TwiML.
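As a starting point, here is a sketch of how that could look in Python. When SNS invokes a Lambda, the published message arrives as a string under `Records[0].Sns.Message`, and for CloudWatch alarms that string is JSON with fields such as `AlarmName`, `NewStateValue` and `NewStateReason`. The function names and the spoken wording are my own.

```python
import json
from xml.sax.saxutils import escape


def alarm_summary(event):
    """Turn an SNS-triggered Lambda event into one speakable sentence."""
    sns = event["Records"][0]["Sns"]
    try:
        alarm = json.loads(sns["Message"])
    except ValueError:
        alarm = {}  # not JSON, e.g. a hand-typed test message
    if not isinstance(alarm, dict):
        alarm = {}
    name = alarm.get("AlarmName", sns.get("Subject") or "unknown alarm")
    state = alarm.get("NewStateValue", "ALARM")
    reason = alarm.get("NewStateReason", "")
    return "{} is in state {}. {}".format(name, state, reason).strip()


def twiml_for_event(event):
    """Build TwiML that reads the alarm summary aloud."""
    # Escape the text so characters like "<" don't break the XML.
    return "<Response><Say>{}</Say></Response>".format(escape(alarm_summary(event)))
```

The fallback to the SNS subject means a manually published message still produces something sensible to say.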
You could look at storing "on call" information in a DynamoDB table, and get the Lambda to look up who it should call based on the day. You could also use the Twilio API to make sure the call is answered / acknowledged, and escalate to another developer when the first line of defense doesn't respond.
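Here is one way the lookup and escalation decision might be sketched. It assumes a hypothetical DynamoDB table keyed by a lowercase weekday name in a `day` attribute, with `primary` and `secondary` phone numbers on each item, accessed through a boto3 `Table` resource. The status strings are the ones Twilio reports for calls that did not connect.

```python
import datetime

DAYS = ["monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"]

# Twilio call statuses that mean nobody picked up.
UNANSWERED = {"no-answer", "busy", "failed", "canceled"}


def on_call_numbers(table, today):
    """Return [primary, secondary] numbers for `today`, or [] if none is set.

    `table` is expected to behave like a boto3 DynamoDB Table resource;
    the item layout (day / primary / secondary) is a made-up schema.
    """
    day = DAYS[today.weekday()]
    item = table.get_item(Key={"day": day}).get("Item")
    if not item:
        return []
    return [item["primary"], item["secondary"]]


def should_escalate(call_status):
    """True when the first call did not reach a human and we should try the next."""
    return call_status in UNANSWERED
```

A status of "completed" only tells you the call connected, so a real acknowledgement step (for example, asking the person to press a key) would still be worth adding.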
You could create an API Gateway endpoint which other services (like external uptime monitoring) can POST to and trigger your Lambda.
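A handler for that endpoint could be sketched like this, assuming API Gateway's Lambda proxy integration (where the POST body arrives as a string in `event["body"]`). The expected `message` field and the `notify` callback, which would place the call or publish to the SNS topic, are assumptions of mine.

```python
import json


def parse_webhook(event):
    """Pull the alert text out of an API Gateway proxy event, or return None."""
    try:
        payload = json.loads(event.get("body") or "{}")
    except ValueError:
        return None
    if not isinstance(payload, dict):
        return None
    message = payload.get("message")
    return message if isinstance(message, str) and message else None


def make_handler(notify):
    """Build a Lambda handler that forwards valid alerts to `notify`."""
    def handler(event, context):
        message = parse_webhook(event)
        if message is None:
            return {
                "statusCode": 400,
                "body": json.dumps({"error": "expected JSON with a 'message' field"}),
            }
        notify(message)
        return {"statusCode": 202, "body": json.dumps({"status": "accepted"})}
    return handler
```

Since anyone who finds the URL could trigger a 3am phone call, you would want to put some form of authentication in front of it.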
Lots of room for improvement, but I hope this has given you a taste of how easy it is to use Lambdas. Let me know how you get on!
Blog cover image: "On Fire" by Gunshow © kc green.