One of the challenges organizations face when migrating services to serverless technologies like AWS Lambda or Azure Functions is how to monitor those functions effectively and ensure they remain accessible and performant.
In this blog, I'm going to discuss some of the challenges with monitoring serverless APIs, and why traditional monitoring tools don't work with this technology. I'll also explain how to develop a reliability plan for these services and demonstrate how to use the tools provided by Runscope to implement this plan.
Challenges with Traditional Monitoring and Serverless Solutions
Traditionally, organizations have monitored the performance of their applications and services by installing an agent on each server, which gathers metrics and reports them to a central location. In a serverless world, without access to the servers themselves, this option is no longer available. We need to find a better way of monitoring performance and ensuring the uptime of our services.
Cloud providers do offer some reporting for their serverless solutions, but harnessing it takes effort, and attaining accurate and timely results can be a challenge.
Developing a Serverless Reliability Plan
As service providers, we should be concerned about a few key measures related to the user experience of our customers. Our services or functions need to:
■ Be available and accessible to those who need to access them.
■ Respond within a reasonable amount of time.
■ Provide accurate data processing.
Our serverless reliability plan should address each of these requirements and evaluate them on a regular basis. If for any reason a requirement is not being met, we need a system in place that can mitigate the issue or alert support personnel so they can investigate and resolve any problems.
Designing a Plan
Serverless functions are used for many different tasks. This plan focuses on functions that are exposed to consumers through an API. I'll be using an AWS Lambda function connected through AWS API Gateway for this example, but the approach can be applied to any implementation.
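To make the setup concrete, here is a minimal sketch of what such a Lambda function might look like behind an API Gateway proxy integration. The payload fields and response shape are illustrative assumptions, not the exact API built later in this post:

```python
import json

def lambda_handler(event, context):
    """Minimal handler for an API Gateway proxy integration.

    API Gateway delivers the HTTP request as `event`; the dict we
    return is translated back into an HTTP response.
    """
    body = json.loads(event.get("body") or "{}")
    # `userId` is a hypothetical field used for illustration.
    user_id = body.get("userId", "unknown")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"userId": user_id, "status": "ok"}),
    }
```

API Gateway maps the `statusCode`, `headers` and `body` keys of the returned dict directly onto the HTTP response the consumer sees.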
Our plan should include periodic tests which validate that the API is reachable and responds favorably. With our implementation, a successful request should return a response with an HTTP status code in the 200 range.
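A periodic availability check boils down to very little code. This Python sketch shows the underlying logic (the endpoint URL is a placeholder; in practice Runscope runs these checks for you):

```python
import urllib.error
import urllib.request

API_URL = "https://api.example.com/preferences"  # placeholder endpoint

def is_success(status_code):
    """A 'favorable' response is any status code in the 200 range."""
    return 200 <= status_code < 300

def is_available(url, timeout=5):
    """Return True if the endpoint answers with a 2xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_success(resp.status)
    except (urllib.error.URLError, OSError):
        return False
```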
Additionally, we should validate that any data processing which is done by the function results in an accurate response. We should be able to validate that calculations are correct and that storage and retrieval functions complete as expected. The tests should confirm that our functions are working as expected, and can be invaluable if an update is made which breaks existing functionality.
Finally, we'll want to ensure that response times fall within an acceptable range. This is a metric which may vary depending on the calling location of the consumer and the type of operation being performed. These types of tests require that a baseline of performance is established and an acceptable range of variance determined.
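One simple way to encode a baseline and an acceptable range of variance is a mean/standard-deviation model, sketched below. This is only one of several reasonable choices; the sample values and the three-sigma threshold are illustrative:

```python
import statistics

def baseline(samples_ms):
    """Derive a baseline (mean, standard deviation) from past response times."""
    return statistics.mean(samples_ms), statistics.stdev(samples_ms)

def within_range(latency_ms, mean_ms, stdev_ms, n_sigma=3):
    """Accept a response time unless it drifts more than n_sigma above baseline."""
    return latency_ms <= mean_ms + n_sigma * stdev_ms
```

You would typically compute a separate baseline per calling location and per operation, since both affect expected latency.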
Implementing Your Plan with Runscope
I'll be using the Runscope platform to put together an implementation for this plan. Runscope provides all the necessary infrastructure to meet the needs of our tests and makes the process incredibly simple. If you don't already have a Runscope account, you can sign up for a free trial here.
I designed a simple API that allows clients to create, update, retrieve and delete preference information. You can find the API here. I implemented the API using a couple of Lambda functions and then tied them all together with AWS API Gateway. In the next section, I'll show you how easy it is to create tests to validate the functionality and keep track of response times from around the globe.
Determining Availability and Accuracy
After logging in to my Runscope account, I'll start by clicking on the Create Test link in the top navigation bar, and selecting the New Test option.
Figure 1. Create a New Test
The platform will prompt you for a name and description. This information is displayed at the top of your test and is used throughout the platform, so make it descriptive and specific to the application you're testing.
Figure 2. Enter a Name and Description for Your Test
Now we can add the tests for the endpoints on our API. I'll be validating the POST, PUT, GET and DELETE endpoints, and I'll want to validate them from North America, Australia and Western Europe. The first thing I'll do is expand the Test Settings section under the ENVIRONMENT header.
I'll select Locations, and then enable tests to be executed from US Virginia, Australia and Germany.
Figure 3. Enable the Test to be Executed from Around the Globe
The first step will be to create a new user preference. Under the Steps header, I'll change the HTTP method to POST, and then I'll paste my URL into the space provided. You'll see that you can also add various authentication methods, headers and parameters. I'll be adding a simple JSON payload to the request and leaving all the other settings at their default values.
Figure 4. Setting up the POST Endpoint Test
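Outside of Runscope, the same POST step could be reproduced with Python's standard library. The URL and the payload fields (`userId`, `theme`) below are hypothetical stand-ins for my API's actual values:

```python
import json
import urllib.request

# Hypothetical API Gateway URL and payload fields.
API_URL = "https://abc123.execute-api.us-east-1.amazonaws.com/prod/preferences"

payload = json.dumps({"userId": "u-123", "theme": "dark"}).encode()
request = urllib.request.Request(
    API_URL,
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# Uncomment to actually send the request:
# with urllib.request.urlopen(request) as resp:
#     print(resp.status)
```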
The second part of the test is to validate that the response is what we expect it to be. Clicking on the Assertions tab, we'll see an HTTP response assertion validating that an HTTP 200 response is returned, and we can add additional assertions as well. I'm going to validate that the response body contains my userId value, and also ensure that the response time is less than 500ms.
Figure 5. Adding Assertions to Validate Your Test
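Expressed in plain Python, those three assertions amount to checks like the following. The function and its parameters are illustrative, not Runscope's internals:

```python
def check_response(status_code, body, elapsed_ms, user_id):
    """Collect failures for the three assertions: HTTP 200, the userId
    echoed in the body, and a response time under 500 ms."""
    failures = []
    if status_code != 200:
        failures.append(f"expected HTTP 200, got {status_code}")
    if user_id not in body:
        failures.append("response body does not contain the userId")
    if elapsed_ms >= 500:
        failures.append(f"response took {elapsed_ms} ms (limit 500 ms)")
    return failures
```

An empty list means the test passes; anything else is a failed assertion to investigate.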
That's all there is to setting up a single test. You can validate that it works by clicking on the Save & Run button and observing the results in the left-hand panel. If your tests fail, you can click on the failing test to see which assertions failed and why.
I'm going to add additional steps and repeat this process to call the PUT endpoint with an update, the GET endpoint to retrieve the data, and the DELETE endpoint to clean everything up for the next test execution.
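The four-step sequence can be sketched as code like this; the paths and the `client` interface are hypothetical, standing in for whatever HTTP helper you use:

```python
def crud_sequence(client, user_id="u-123"):
    """Run POST -> PUT -> GET -> DELETE against a preferences API
    and collect the HTTP status codes.

    `client` is any object exposing request(method, path, body=None)
    and returning (status_code, body) -- e.g. a thin wrapper around
    urllib. Each step reuses the userId created by the POST so later
    steps operate on the same resource.
    """
    results = []
    results.append(client.request("POST", "/preferences",
                                  {"userId": user_id, "theme": "dark"}))
    results.append(client.request("PUT", f"/preferences/{user_id}",
                                  {"theme": "light"}))
    results.append(client.request("GET", f"/preferences/{user_id}"))
    results.append(client.request("DELETE", f"/preferences/{user_id}"))
    return [status for status, _ in results]
```

Ending with DELETE matters: it cleans up the created resource so the next scheduled run starts from a known state.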
Running my tests produced the results below: all executions succeeded except the DELETE test, which I can now troubleshoot to determine why it returns an HTTP 500 response.
Figure 6. Test Results
Scheduling and Notifications
The steps above provide a great way to validate assumptions about your API. However, the real power of Runscope comes in the ability to schedule your tests and have the system notify you when tests fail due to processing errors or degraded performance.
Notification methods include email, Slack, PagerDuty, and others. There is also the option to generate a custom webhook if you would like to trigger an action in a different system. The notification options are included under the Environment header.
Scheduling is configured by clicking on the Schedules link in the left-hand navigation menu. Scheduled tests are executed based on the selected schedule, and notifications are triggered depending on the results of the test, and how you have your notifications configured.
You can learn more about customizing your tests, notifications and other advanced features from the Runscope documentation.