Build an AWS EC2 Monitoring Dashboard with CloudWatch using CDK: Step-By-Step

13 min read · Dec 11, 2021

EC2 + CloudWatch + EventBridge + CDK

Disclaimer: All thoughts and opinions are my own and do not reflect those of my employer.

An AWS monitoring dashboard helps you visualize system performance and interpret metrics for your AWS services and workloads. It provides a single view of your resources and aggregates information across your deployment. In this article, we’ll explain how to create an EC2 monitoring dashboard with AWS CloudWatch using CDK.

In this post we will look at why we need this, then how to add the metrics manually, and finally how to build everything with CDK.

Existing Metrics for EC2

AWS EC2 has lots of metrics already built. Some of the metrics are:

CPUUtilization
DiskReadOps
DiskWriteOps
DiskReadBytes
DiskWriteBytes
NetworkIn
NetworkOut

For a more complete list with explanations, see the AWS documentation page List the available CloudWatch metrics for your instances.

AWS also has a prebuilt dashboard with these metrics for your running instances, although these dashboards can be difficult to look at if you have hundreds of EC2 instances running. Here’s a sample of what that prebuilt dashboard looks like.

Note: I’ve removed the instance IDs for privacy reasons

EC2 prebuilt dashboard showing existing metrics

You can also click on a graph to see it in greater detail, but it still isn’t great when you have hundreds of EC2 instances.

CPU Utilization graph

You can’t tell which instance types have higher or lower CPU utilization, and if you want information about a specific instance, forget it.

In fact, there are no metrics telling us how many EC2 instances we are using or how long they’ve been running. Knowing how many EC2 instances are being used can be really helpful for estimating your AWS bill. This type of statistic is not available. Let’s build it!

Creating the metrics

We could look at the EC2 console and then manually put the metric into CloudWatch using the AWS CLI command put-metric-data. For example:

aws cloudwatch put-metric-data --metric-name InstanceCount --namespace AWS/EC2 --value 100 --timestamp 2022-01-01T12:00:00.000Z

However, if we are constantly launching and terminating instances, the number of running instances is going to change, and we want to capture those changes. We don’t want to run a CLI command manually every day: that would leave gaps in the data, and it would be tedious.

Amazon EventBridge

Enter EventBridge. With Amazon EventBridge, we can trigger a routine to run automatically. The trigger can be almost anything, such as an EMR state change. However to keep things simple, we will use a scheduled trigger, where the routine will run on a regular schedule. We can decide how frequently we want to trigger the routine. How frequently the routine runs will depend on how frequently users are launching and terminating new EC2 instances. If the number of instances doesn’t change that much, then we can have the schedule be once per hour. If it changes frequently, you may want it to be every 20 minutes. Here’s what that looks like in the AWS EventBridge console:

You’ll notice that further down on the page you need to select a target. There are so many targets to choose from that it’s overwhelming. We choose the simplest option, which is a lambda. But before we can do this, let’s write the lambda function. It’s going to be a simple function. We will write it directly in the AWS Lambda console so we can test it immediately. First we need to create a new function. I called mine MonitorEC2, although a better name might be MonitorEC2Usage.

Note: You’ll want to create a new role for the lambda so that it has permission to describe EC2 instances and to put metrics into CloudWatch. More info

Now there are three metrics that are important to us:

Number of instances running
Number of hours they have been running
Number of instances in each availability zone

You may only care about the number of instances running or the number of hours they have been running or something else entirely. Decide what is important to you and your org. For this, we are going to stick with the three things mentioned above. So how can we get those? describe-instances. Test it out using the AWS CLI first.

$ aws ec2 describe-instances

This is a sample of what it shows for me (some items are blank for privacy reasons):

{
    "Reservations": [
        {
            "Groups": [],
            "Instances": [
                {
                    "AmiLaunchIndex": 0,
                    "ImageId": "ami-",
                    "InstanceId": "i-",
                    "InstanceType": "r5.4xlarge",
                    "KeyName": "",
                    "LaunchTime": "2021-11-18T23:53:47+00:00",
                    "Monitoring": {
                        "State": "disabled"
                    },
                    "Placement": {
                        "AvailabilityZone": "us-east-1c",
                        "GroupName": "",
                        "Tenancy": "default"
                    },
                    "PrivateDnsName": "",
                    "PrivateIpAddress": "",
                    "ProductCodes": [],
                    "PublicDnsName": "",

Now you can see the structure of the data. Since we created a Python 3.9 lambda, we will use Boto3 both to query EC2 and to put the metrics into CloudWatch. The Boto3 documentation for the EC2 and CloudWatch clients covers both calls.

Normally I would copy the entire code for the lambda, but I will actually leave this as a task for the reader. We will use Python for our lambda function. Python is an easy language to learn if you don’t already know it and there are a ton of resources to help you get started. If you already know it, great! I’ll help you get started:

import boto3
from collections import defaultdict


def lambda_handler(event, context):
    ec2 = boto3.client('ec2')
    cloudwatch_client = boto3.client('cloudwatch')

    # Counters for the three metrics
    instance_counts = defaultdict(int)
    availability_zones_counts = defaultdict(int)
    instance_runtimes = defaultdict(int)

    result = ec2.describe_instances()
    # iterate through result
    #     increment the count of instances
    #     increment the count of hours
    #     increment the count of instances in an AZ
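If you get stuck on the iteration, here is one possible sketch, not the code we run in production. It assumes you only want instances in the running state, that counts and hours are keyed by instance type, and that LaunchTime arrives as a timezone-aware datetime (which is how Boto3 returns it). The function name summarize_instances is made up for this example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def summarize_instances(response, now=None):
    """Tally running instances from a describe_instances-style response.

    Returns three dicts: instance count per instance type, running hours per
    instance type, and instance count per availability zone.
    """
    now = now or datetime.now(timezone.utc)
    instance_counts = defaultdict(int)
    instance_runtimes = defaultdict(float)
    availability_zones_counts = defaultdict(int)

    for reservation in response.get('Reservations', []):
        for instance in reservation.get('Instances', []):
            # Assumption: stopped or terminated instances are not counted.
            if instance.get('State', {}).get('Name') != 'running':
                continue
            instance_type = instance['InstanceType']
            zone = instance['Placement']['AvailabilityZone']
            # Hours since the instance was launched.
            hours = (now - instance['LaunchTime']).total_seconds() / 3600

            instance_counts[instance_type] += 1
            instance_runtimes[instance_type] += hours
            availability_zones_counts[zone] += 1

    return instance_counts, instance_runtimes, availability_zones_counts
```

Inside the handler you would call it as summarize_instances(result). With a large fleet, describe_instances returns its results in pages, so you would likely need to loop over the pages (Boto3 offers a paginator for this call) and merge the tallies.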

Once we have the counts, we need to put them into CloudWatch. We will name our new metrics InstanceCount, InstanceRuntime, AvailabilityZoneInstanceCount. If you have better names, feel free to use those instead. You may also want to incorporate other metrics that are more meaningful to you and your org.

put_metrics_data_in_cloud_watch(cloudwatch_client, instance_counts, 'InstanceCount')
put_metrics_data_in_cloud_watch(cloudwatch_client, instance_runtimes, 'InstanceRuntime')
put_metrics_data_in_cloud_watch(cloudwatch_client, availability_zones_counts, 'AvailabilityZoneInstanceCount')

The put_metrics_data_in_cloud_watch function looks like this:

def put_metrics_data_in_cloud_watch(cloudwatch_client, counts, metrics_name):
    for instance_type, count in counts.items():
        cloudwatch_client.put_metric_data(
            Namespace='AWS/EC2',
            MetricData=[
                {
                    'MetricName': metrics_name,
                    'Dimensions': [
                        {
                            'Name': 'InstanceType',
                            'Value': instance_type
                        },
                    ],
                    'Unit': 'Count',
                    'Value': count
                }
            ]
        )
        print(f'{instance_type}: {count}')

I have included the print call so you can see the function working. In production code, you should remove it or replace it with logging. Once the code looks good, we should deploy it and test it.
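One variation worth considering: the function above labels every key with the InstanceType dimension, which is not quite right for the availability zone counts, and it makes one API call per key. The sketch below passes the dimension name in and sends the entries in batches. The helper names and the batch size of 20 are my own assumptions, so check the current CloudWatch limits before relying on them.

```python
def build_metric_data(counts, metric_name, dimension_name, unit='Count'):
    """Turn a {key: value} dict into a list of MetricData entries."""
    return [
        {
            'MetricName': metric_name,
            'Dimensions': [{'Name': dimension_name, 'Value': key}],
            'Unit': unit,
            'Value': value,
        }
        for key, value in counts.items()
    ]


def put_metrics_in_batches(cloudwatch_client, namespace, metric_data, batch_size=20):
    """Send the entries to CloudWatch, batch_size entries per API call."""
    for start in range(0, len(metric_data), batch_size):
        cloudwatch_client.put_metric_data(
            Namespace=namespace,
            MetricData=metric_data[start:start + batch_size],
        )
```

For the availability zone metric you would build the entries with build_metric_data(availability_zones_counts, 'AvailabilityZoneInstanceCount', 'AvailabilityZone') and pass them to put_metrics_in_batches together with the Boto3 CloudWatch client and your namespace.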

This is how that looks for me (some items are hidden for privacy reasons):

Two important things to note:

1. We have multiple different instance types running. I hid the full list, but you can see we have r5.4xlarge and r5.2xlarge. You can imagine that we could have r3.4xlarge, r4.4xlarge, r5.4xlarge, r5a.4xlarge and more! We will want to make sure we capture all of them in our CloudWatch Dashboard. If you are not familiar with the different instance types, you can read more about Amazon EC2 Instance Types.

2. Even though the code is simple, it took 6170ms for the lambda to complete. If we are running the lambda every 20 mins, that’s 72 times per day. That could get expensive! Let’s use the lambda pricing calculator to estimate our costs:

Fortunately, it looks like it will be free, so that’s nice. Thanks AWS!
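If you want to sanity-check the calculator, the arithmetic is simple enough to do yourself. This is a rough sketch under some assumptions: the function is configured with 128 MB of memory, a month has 30 days, and the Lambda free tier is 1 million requests and 400,000 GB-seconds per month. Verify the free tier numbers on the Lambda pricing page, as they may change.

```python
def monthly_lambda_usage(interval_minutes, duration_ms, memory_mb, days=30):
    """Return (invocations, GB-seconds) for a function run on a fixed schedule."""
    invocations = (24 * 60 // interval_minutes) * days
    gb_seconds = invocations * (duration_ms / 1000) * (memory_mb / 1024)
    return invocations, gb_seconds


# Assumed free tier allowances per month; check the pricing page.
FREE_TIER_REQUESTS = 1_000_000
FREE_TIER_GB_SECONDS = 400_000

# Every 20 minutes, 6170 ms per run, 128 MB of memory.
invocations, gb_seconds = monthly_lambda_usage(20, 6170, 128)
within_free_tier = invocations <= FREE_TIER_REQUESTS and gb_seconds <= FREE_TIER_GB_SECONDS
print(invocations, round(gb_seconds, 1), within_free_tier)
```

That works out to 2,160 invocations and roughly 1,666 GB-seconds per month, far below both allowances, which matches what the calculator reports.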

Note: Putting Metrics in CloudWatch is not free. Check Amazon CloudWatch Pricing for more details. There is also an AWS Pricing Calculator for CloudWatch

Metrics in CloudWatch

Let’s check that we can see these metrics on a CloudWatch dashboard. The AWS doc, Viewing Available Metrics, does a great job explaining how to view the metrics, so please go read that doc before continuing. You should also look at the AWS doc, Publishing Single Data Points.

Note: When you create a metric, it can take up to 2 minutes before you can retrieve statistics for the new metric using the get-metric-statistics command. However, it can take up to 15 minutes before the new metric appears in the list of metrics retrieved using the list-metrics command.

What this means is that even after you have successfully tested your lambda function, it may take up to 15 minutes before you can see the metric in CloudWatch. Do not expect it to show up right away. Set a timer and come back.
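If you would rather have a script do the waiting, something like the following could work. It is only a sketch: wait_for_metric is a name I made up, it expects a Boto3 CloudWatch client to be passed in, and the 15 minute timeout and 60 second polling interval are arbitrary defaults.

```python
import time


def wait_for_metric(cloudwatch_client, namespace, metric_name,
                    timeout_seconds=900, poll_seconds=60, sleep=time.sleep):
    """Poll list_metrics until the metric is listed or the timeout is reached."""
    waited = 0
    while True:
        response = cloudwatch_client.list_metrics(
            Namespace=namespace, MetricName=metric_name
        )
        if response.get('Metrics'):
            return True  # the metric is now listed
        if waited >= timeout_seconds:
            return False  # gave up waiting
        sleep(poll_seconds)
        waited += poll_seconds
```

You would call it as wait_for_metric(boto3.client('cloudwatch'), 'AWS/EC2', 'InstanceCount'). The sleep parameter exists only so the function can be exercised in a test without actually waiting.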

Creating Lambda and Dashboard in CDK

We have created and tested our lambda code manually, but when it comes to deploying it into our three AWS accounts (Beta, Gamma, and Prod), we want to use CDK. CDK lets us define our infrastructure in code. The CDK code is synthesized into CloudFormation templates, which are then used to create a CloudFormation stack and its resources. If you are not familiar with CDK, specifically with deploying Lambda functions with CDK, you can read this Medium post, Deploying Lambda Functions with AWS CDK Python.

However it looks like they are writing their CDK code in Python and while our lambda is in Python, we prefer TypeScript for CDK as it’s much more featureful. Initially you could only use TypeScript for CDK. The other languages are relatively new additions.

CDK Code for Lambda and Dashboard

The official AWS CDK GitHub also has an amazing example that uses CDK 2.0.0. We will refer to the code in this repo heavily going forward, so clone it at this point. The directory structure is also very nice. Copy their example, create a lambda-handler.py file in the lambda directory, and copy the code of the Python lambda we created earlier into it. For your convenience I have created a gist of the code in the lib directory, but you should probably look at the source, as it’s updated regularly and my gist will not be.

Note: Our lambda function only needs 128 MB of memory. It spends most of its time waiting on API calls, so more memory is unlikely to make it noticeably faster. Additionally, we should set a 30-second timeout.

I will not take time to explain this code as they do that in the GitHub repo. However now you have an idea of how to create a lambda and dashboard in CDK.

EventBridge Rule in CDK

One thing missing from the code is the Rule to trigger the lambda function. You can see an example of how to do that below (this code is untested):

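// Assumes the CDK v2 imports:
//   import { Duration } from 'aws-cdk-lib';
//   import { Rule, Schedule } from 'aws-cdk-lib/aws-events';
//   import { LambdaFunction } from 'aws-cdk-lib/aws-events-targets';

/**
 * Trigger the lambda function and emit the metrics to CloudWatch.
 * Written as a method of the stack class that owns this.lambdaFunction.
 * @param frequency - how frequently the lambda function should run
 */
private addRule(frequency: Duration) {
  // Fixed construct ID: functionName is an unresolved token at synth time,
  // and CDK does not accept tokens in construct IDs.
  new Rule(this, 'MonitorEC2Rule', {
    ruleName: `${this.lambdaFunction.functionName}Rule`,
    description: `Triggers ${
      this.lambdaFunction.functionName
    } lambda every ${frequency.toHumanString()} to count EC2 usage and emit metrics to CloudWatch`,
    enabled: true,
    schedule: Schedule.rate(frequency),
    targets: [
      new LambdaFunction(this.lambdaFunction),
    ],
  });
}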

Note: I prefer static functions, so ideally you would pass in the Stack and props. To keep it simple, I did not make the function static in the example above. If I did, its signature would look like this:

public static addRule(stack: DeploymentStack, props: LambdaRuleProps)

LambdaRuleProps would be a new interface with properties such as frequency.

Add new CloudWatch Metric to Dashboard

While the example above shows you how to create a Dashboard in CDK, it does not add the new metrics we have created. As previously mentioned those metrics are InstanceCount, InstanceRuntime, AvailabilityZoneInstanceCount. Also previously mentioned, we have multiple different instance types. We do not want to have to list each instance type in order to display them on the CloudWatch dashboard. That would not be fun and it would also restrict the instances we could show. We want users to be able to launch EC2 instances of any type and have those shown on the dashboard.

Note: Initially I did manually add the metrics for each instance type. 🤦‍♂️

One way to avoid that is to use a CloudWatch metric search expression. Our search expression would look like this:

SEARCH('{AWS/EC2,InstanceType} MetricName="InstanceCount" ', 'Maximum', props.frequency.toSeconds())
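Since we have three metrics, it can help to see how the same expression is assembled for each of them. The snippet below is just string building in Python, with a made-up helper name, and it assumes a 20 minute schedule (1200 seconds) as the period in place of props.frequency.toSeconds().

```python
def search_expression(namespace, dimension, metric_name, statistic, period_seconds):
    """Build a CloudWatch SEARCH expression for one metric name."""
    return (
        f"SEARCH('{{{namespace},{dimension}}} "
        f"MetricName=\"{metric_name}\" ', '{statistic}', {period_seconds})"
    )


metric_names = ['InstanceCount', 'InstanceRuntime', 'AvailabilityZoneInstanceCount']
expressions = [
    search_expression('AWS/EC2', 'InstanceType', name, 'Maximum', 20 * 60)
    for name in metric_names
]
for expression in expressions:
    print(expression)
```

The part in curly braces is the schema (namespace plus dimension names), so each expression matches every instance type that has reported the metric without listing the types one by one.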

Now we just need to translate this to CDK and add it to our Dashboard.

Adding New CloudWatch Metric to Dashboard in CDK

Initially I wasn’t sure how to write this. It’s definitely not obvious. The interface could be simplified. I would have figured it out eventually, but I prefer the easy route. Enter aws-samples. See this example. For your convenience I have created a gist of the code. I did add in the expression above (See line 18), but you should probably look at the example as it’s updated regularly and my gist will not be updated.

On line 67, I call the function:

/* Add each metric to the dashboard */
    props.metricNames.forEach((metricName: string) => addMetrics(props, metricName));

metricNames are set like this:

const metricNames = ['InstanceCount', 'InstanceRuntime', 'AvailabilityZoneInstanceCount'];

The example is a bit confusing because, starting from line 92, they create a ton of widgets that you probably don’t need, so feel free to disregard them. Also confusing, both in the example and when using SEARCH in general, is the requirement that you include usingMetrics: {}. That requirement is a workaround for a known CDK issue. The issue is resolved and the fix will be added to CDK in the future, at which point you may not need to include that bit of code.

Setup and Deployment

There are so many resources on how to compile and deploy CDK that it doesn’t make sense to discuss them here. I encourage you to look at the best practices for developing cloud applications with AWS CDK. There’s even a workshop! However, if you’re short on time, here are the commands:

Setup

npm install -g aws-cdk
cdk init sample-app --language typescript
npm install
npm run build

Deployment

The first time you deploy an AWS CDK app into an environment (account/region), you install a bootstrap stack. This stack includes resources that are used in the toolkit’s operation.

cdk bootstrap aws://account-id/aws-region

At this point you can now synthesize the CloudFormation template for this code:

cdk synth

And proceed to deployment of the stack:

cdk deploy

For example, you may see a warning like this:

This deployment will make potentially sensitive changes according to your current security approval level (--require-approval broadening).
Please confirm you intend to make the following modifications:

IAM Statement Changes
┌───┬────────────────────────────────┬────────┬─────────────────┬────────────────────────────────┬────────────────────────────────┐
│   │ Resource                       │ Effect │ Action          │ Principal                      │ Condition                      │
├───┼────────────────────────────────┼────────┼─────────────────┼────────────────────────────────┼────────────────────────────────┤
│ + │ ${CdkWorkshopQueue.Arn}        │ Allow  │ sqs:SendMessage │ Service:sns.amazonaws.com      │ "ArnEquals": {                 │
│   │                                │        │                 │                                │   "aws:SourceArn": "${CdkWorks │
│   │                                │        │                 │                                │ hopTopic}"                     │
│   │                                │        │                 │                                │ }                              │
└───┴────────────────────────────────┴────────┴─────────────────┴────────────────────────────────┴────────────────────────────────┘
(NOTE: There may be security-related changes not in this list. See https://github.com/aws/aws-cdk/issues/1299)

Do you wish to deploy these changes (y/n)?

The warning above is merely an example. It is warning you that deploying the app entails some risk. That’s fine because we are deploying this to our AWS Beta account. We would never deploy directly to production. Press y to deploy the stack and create the resources.

Output should look something like this:

CdkWorkshopStack: deploying...
CdkWorkshopStack: creating CloudFormation changeset...

 ✅  CdkWorkshopStack

Stack ARN:
arn:aws:cloudformation:REGION:ACCOUNT-ID:stack/CdkWorkshopStack/STACK-ID

I haven’t tested these deployment commands as we use a different process internally, but they should work as expected.

CloudWatch Dashboard

Once everything is deployed, you can take a look at your dashboard. Here’s how mine looks (some items are hidden for privacy reasons):

Yours should look something like this. I’d recommend having the statistics on the lambda on a separate dashboard assuming you chose to have them at all.

Summary

In summary, you have created a lambda which puts new metrics about your EC2 instances into CloudWatch. You scheduled that lambda to run every 20 minutes using EventBridge. You can visualize these metrics in a newly created CloudWatch Dashboard. You deployed all these resources (lambda, rule, dashboard) with CDK. You should be proud of yourself!

Future Work

You can of course add more metrics based on the information returned by the describe_instances API. You can also create CloudWatch alarms that fire when the number of instances of a particular instance type goes over a threshold you set. For example, you may not want more than 100 EC2 instances running at one time. It’d look something like this:

metric.createAlarm(this, 'Alarm', {
  threshold: 100,
  evaluationPeriods: 3,
  datapointsToAlarm: 2,
});
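The combination of evaluationPeriods: 3 and datapointsToAlarm: 2 means "alarm when 2 of the last 3 datapoints breach the threshold". The toy model below illustrates that rule in plain Python. It is a simplification, not how CloudWatch is implemented: it ignores missing data, and it assumes the comparison is greater-than-or-equal, which I believe is the CDK default when you don't set comparisonOperator.

```python
def would_alarm(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Return True if enough of the most recent datapoints breach the threshold."""
    recent = datapoints[-evaluation_periods:]
    # Assumption: a datapoint breaches when it is >= the threshold.
    breaching = sum(1 for value in recent if value >= threshold)
    return breaching >= datapoints_to_alarm


# Instance counts from the last few runs of the lambda, oldest first.
print(would_alarm([90, 120, 101], threshold=100, evaluation_periods=3, datapoints_to_alarm=2))
print(would_alarm([120, 90, 95, 99], threshold=100, evaluation_periods=3, datapoints_to_alarm=2))
```

The first call has two breaching datapoints in the window and would alarm. The second has none in its last three, so the single spike at the start does not count.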

You could also set up an anomaly alarm with CfnAnomalyDetector. aws-samples also has examples.

Exclusions

It should be pointed out that we never created a role for our lambda using CDK. We may have done so manually using the AWS console, but we should have used CDK as we want each AWS account to have a role for the lambda. I will leave this as an exercise for the reader. You could also create an inline policy and attach it directly to the lambda function. It’d look something like this:

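const lambdaPolicy = new Policy(this, 'LambdaPolicy', {
  policyName: 'LambdaPolicy',
  statements: [
    new PolicyStatement({
      effect: Effect.ALLOW,
      resources: ['*'],
      // The function reads EC2 instance data and writes CloudWatch metrics
      actions: ['ec2:DescribeInstances', 'cloudwatch:PutMetricData'],
    }),
  ],
});
// role can be undefined on some function types, hence the optional chaining
this.lambdaFunction.role?.attachInlinePolicy(lambdaPolicy);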

Note: This code is untested and incomplete

There’s probably a better example in the aws-samples repo.

Misc

To keep your CDK TypeScript code looking good, I highly recommend you use Prettier. You can add it to your package.json like this:

"prettier": "^2.5.1",

I have this configuration for Prettier in my package.json:

"prettier": {
  "printWidth": 120,
  "semi": true,
  "singleQuote": true,
  "bracketSpacing": true,
  "trailingComma": "all"
}

More info on basic Prettier configuration

Final Notes

I could go on forever about testing, compiling, deploying the CDK code we have written together and best practices, but then this post would never end. The code in the post is not production quality and is merely meant to act as an example to give you an idea of how to write your code. You should not copy it blindly. Furthermore, I had to remove some of the cleanliness to present a simple example for instructional purposes. The code we use in production is actually much cleaner.
