May 18, 2024

We’re heavy users of Amazon Elastic Compute Cloud (EC2) at Slack: we run approximately 60,000 EC2 instances across 17 AWS regions while operating hundreds of AWS accounts. A multitude of teams own and manage our various instances.

The Instance Metadata Service (IMDS) is an on-instance component that can be used to gain insight into the instance’s current state. Since it first launched over 10 years ago, AWS customers have used this service to gather useful information about their instances. At Slack, IMDS is used heavily for instance provisioning, and also by tools that need to understand their running environments.

Information exposed by IMDS includes IAM credentials, metrics about the instance, security group IDs, and a whole lot more. This information can be highly sensitive: if an instance is compromised, an attacker may be able to use instance metadata to gain access to other Slack services on the network.

In 2019, AWS released a new version of IMDS (IMDSv2) where every request is protected by session authentication. As part of our commitment to high security standards, Slack moved the entire fleet and tools to IMDSv2. In this article, we’re going to discuss the pitfalls of using IMDSv1 and our journey towards fully migrating to IMDSv2.

The v2 difference

IMDSv1 uses a simple request-and-response pattern that can magnify the impact of Server-Side Request Forgery (SSRF) vulnerabilities: if an application deployed on an instance is vulnerable to SSRF, an attacker can exploit the application to make requests on their behalf. Since IMDSv1 supports simple GET requests, they can extract credentials using its API.

IMDSv2 eliminates this attack vector by using session-oriented requests. IMDSv2 works by requiring these two steps:

  1. Make a PUT request with the header X-aws-ec2-metadata-token-ttl-seconds, and receive a token that’s valid for the TTL provided in the request
  2. Use that token in an HTTP GET request with the header named X-aws-ec2-metadata-token to make any follow-up IMDS calls

With IMDSv2, rather than simply making HTTP GET requests, an attacker needs to exploit vulnerabilities to make PUT requests with headers. Then they have to use the obtained session token to make follow-up GET requests with headers to access IMDS data. This makes it much more challenging for attackers to access IMDS via vulnerabilities such as SSRF.
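The two steps can be sketched in Python with only the standard library. This is a minimal illustration, not our production code: the link-local address 169.254.169.254 and the /latest/api/token path are the standard IMDS locations, while the helper names (build_token_request, build_metadata_request, fetch_metadata) are made up for this example.

```python
import urllib.request

# Standard link-local IMDS endpoint, reachable only from the instance itself.
IMDS_BASE = "http://169.254.169.254/latest"


def build_token_request(ttl_seconds: int = 21600) -> urllib.request.Request:
    """Step 1: a PUT request asking IMDSv2 for a session token."""
    return urllib.request.Request(
        f"{IMDS_BASE}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def build_metadata_request(path: str, token: str) -> urllib.request.Request:
    """Step 2: a GET request that presents the session token."""
    return urllib.request.Request(
        f"{IMDS_BASE}/meta-data/{path.lstrip('/')}",
        method="GET",
        headers={"X-aws-ec2-metadata-token": token},
    )


def fetch_metadata(path: str, ttl_seconds: int = 21600, timeout: float = 2.0) -> str:
    """Run both steps; this only works when executed on an EC2 instance."""
    with urllib.request.urlopen(build_token_request(ttl_seconds), timeout=timeout) as resp:
        token = resp.read().decode()
    with urllib.request.urlopen(build_metadata_request(path, token), timeout=timeout) as resp:
        return resp.read().decode()
```

On an instance, fetch_metadata("instance-id") would return the instance ID. A plain GET without the token header is exactly what IMDSv1 accepted and what an instance with tokens set to required rejects.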

Our journey towards IMDSv2

At Slack there are multiple instance provisioning mechanisms at play, such as Terraform, CloudFormation and various in-house tools that call the AWS EC2 API. As an organization, we rely heavily on IMDS to get insights into our instances during provisioning and throughout the lifecycle of these instances.

We create AWS accounts per environment (Sandbox, Dev and Prod), per service team, and sometimes even per application, so we have hundreds of AWS accounts.

We have a single root AWS organization account. All our child accounts are members of this organization. When we create an AWS account, the account creation process writes information about the account (such as the account ID, owner details, and account tags) to a DynamoDB table. Information in this table is accessible via an internal API called Archipelago for account discovery.

Figuring out the scale of the problem

Before migrating, we first needed to understand how many instances in our fleet used IMDSv1. For this we used the EC2 CloudWatch metric called MetadataNoToken, which counts how often the IMDSv1 API was used for a given instance.

We created an application called imds-cw-metric-collector to map the metrics and instance IDs we collected to the various service teams and applications. The application used our internal Archipelago API to get a list of our AWS accounts, read the aforementioned MetadataNoToken metric, and talked to our instance provisioning services to collect information like owner IDs and Chef roles (for instances that are configured using Chef). Our custom app sent all these metrics to our Prometheus monitoring system.

A dashboard aggregated these metrics to track all instances that made IMDSv1 calls. This information was then used to connect with service teams, and work with them to update their services to use IMDSv2.
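As a rough sketch of the metric lookup described above (not the actual imds-cw-metric-collector code), the helpers below build a CloudWatch GetMetricStatistics query for MetadataNoToken and total the returned datapoints. The function names, the one-hour period, and the 24-hour window are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def metadata_no_token_query(instance_id, hours=24, now=None):
    """Keyword arguments for CloudWatch GetMetricStatistics on one instance."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "MetadataNoToken",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 3600,  # one datapoint per hour (assumed granularity)
        "Statistics": ["Sum"],
    }


def imdsv1_call_count(response):
    """Total IMDSv1 calls in a GetMetricStatistics response."""
    return sum(point["Sum"] for point in response.get("Datapoints", []))


def instances_still_on_imdsv1(counts_by_instance):
    """Instance IDs with at least one token-less (IMDSv1) call, sorted."""
    return sorted(i for i, count in counts_by_instance.items() if count > 0)


# With boto3 installed and credentials configured, the query could be sent as:
#   boto3.client("cloudwatch").get_metric_statistics(**metadata_no_token_query("i-..."))
```

Keeping the query construction and the aggregation as pure functions makes them easy to test without AWS access; only the commented boto3 call touches the network.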

IMDSv1 usage dashboard

However, the list of EC2 instance IDs and their owners was only a part of the equation. We also needed to understand which processes on these instances were making these calls to the IMDSv1 API.

At Slack, for the most part, we use Ubuntu and Amazon Linux on our EC2 instances. For IMDSv1 call detection, AWS provides a tool called AWS ImdsPacketAnalyzer. We decided to build the tool and package it up as a Debian package (*.deb) in our APT repository. This allowed the service teams to install this tool on demand and investigate IMDSv1 calls.

This worked perfectly for our Ubuntu 22.04 (Jammy Jellyfish) and Amazon Linux instances. However, the ImdsPacketAnalyzer does not work on our legacy Ubuntu 18.04 (Bionic Beaver) instances, so we had to resort to using tools such as lsof and netlogs in some cases.

As a last resort, on some of our dev instances we simply turned off IMDSv1 and listed the things that broke.

Stop calling IMDSv1

Once we had a list of instances and processes on those instances that were making the IMDSv1 calls, it was time for us to get to work and update each one to use IMDSv2 instead.

Updating our bash scripts was the easy part, as AWS provides very clear steps on switching from IMDSv1 to IMDSv2 for these. We also upgraded our AWS CLI to the latest version to get IMDSv2 support. However, doing this for services that are written in other languages was a bit more complicated. Luckily AWS has a comprehensive list of libraries that we should be using to implement IMDSv2 for various languages. We worked with service teams to upgrade their applications to IMDSv2-supported versions of libraries and roll these out across our fleet.

Once we had rolled out these changes, the number of instances using IMDSv1 dropped precipitously.

Turning off IMDSv1 for new instances

Stopping our services from using the IMDSv1 API only solved part of the problem. We also needed to turn off IMDSv1 on all future instances. To solve this problem, we turned to our provisioning tools.

First we looked at our most commonly used provisioning tool, Terraform. Our team provides a set of standard Terraform modules for service teams to use to create things such as AutoScaling groups, S3 buckets, and RDS instances. These common modules enable us to make a change in a single place and roll it out to many teams. Service teams that just want to build an AutoScaling group do not need to know the nitty-gritty configurations of Terraform to use one of these modules.

However, we did not want to roll out this change to all our AWS child accounts at the same time, as there were service teams that were still actively working on switching to IMDSv2 at that time. Therefore we needed a way to exclude these teams and their child accounts. We came up with a custom Terraform module called accounts_using_imdsv1 as the solution. We were then able to use this module in our shared Terraform modules to keep or disable IMDSv1, as per the example below:

module "accounts_using_imdsv1" {
  source = "../slack/accounts_using_imdsv"
}

resource "aws_instance" "example" {
  ami           = data.aws_ami.amzn-linux-2023-ami.id
  instance_type = "c6a.2xlarge"
  subnet_id     = aws_subnet.example.id

  metadata_options {
    http_endpoint = "enabled"
    # Accounts still migrating keep IMDSv1 available; all others require tokens (IMDSv2 only)
    http_tokens   = module.accounts_using_imdsv1.is_my_account_using_imdsv1 ? "optional" : "required"
  }
}

We started with a large list of accounts in the accounts_using_imdsv1 module marked as using IMDSv1, but we were slowly able to remove them as service teams migrated to IMDSv2.

Blocking instances with IMDSv1 from launching

The next step for us was to block launching instances with IMDSv1 enabled. For this we turned to AWS Service control policies (SCPs). We updated our SCPs to block launching IMDSv1-supported instances across all our child accounts. However, like the AutoScaling group changes we discussed earlier, we wanted to exclude some accounts at first while the service owners were working to switch to IMDSv2. Our accounts_using_imdsv1 Terraform module came to the rescue here too. We were able to use this module in our SCPs as below. We blocked the ability to launch instances with IMDSv1 support and also blocked the ability to turn on IMDSv1 on existing instances.

  # Block launching instances with IMDSv1 enabled
  statement {
    effect = "Deny"

    actions = [
      "ec2:RunInstances",
    ]

    resources = [
      "arn:aws:ec2:*:*:instance/*",
    ]

    condition {
      test     = "StringNotEquals"
      variable = "ec2:MetadataHttpTokens"
      values   = ["required"]
    }

    condition {
      test     = "StringNotEquals"
      variable = "aws:PrincipalAccount"
      values   = module.accounts_using_imdsv1.accounts_list_using_imdsv1
    }
  }

  # Block turning on IMDSv1 if it is already turned off
  statement {
    effect = "Deny"

    actions = [
      "ec2:ModifyInstanceMetadataOptions",
    ]

    resources = [
      "arn:aws:ec2:*:*:instance/*",
    ]

    condition {
      test     = "StringNotEquals"
      variable = "ec2:Attribute/HttpTokens"
      values   = ["required"]
    }

    condition {
      test     = "StringNotEquals"
      variable = "aws:PrincipalAccount"
      values   = module.accounts_using_imdsv1.accounts_list_using_imdsv1
    }
  }
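To make the interaction of the two conditions easier to follow, here is a small Python model of the first Deny statement. It is a sketch of how we understand IAM to evaluate the policy, not AWS code: all conditions in a statement must hold for the Deny to apply, and StringNotEquals with several values holds only when the key matches none of them. The function names and the sample account IDs are hypothetical.

```python
def scp_denies_launch(http_tokens, principal_account, exempt_accounts):
    """Model of the RunInstances Deny statement.

    http_tokens is the requested value ("optional", "required", or None when
    the caller did not specify it). A missing key satisfies StringNotEquals,
    so None is treated the same as "optional".
    """
    tokens_not_required = http_tokens != "required"
    account_not_exempt = principal_account not in exempt_accounts
    # Both conditions must be true for the Deny to take effect.
    return tokens_not_required and account_not_exempt


def http_tokens_for(account_id, exempt_accounts):
    """Model of the Terraform ternary used in the shared modules."""
    return "optional" if account_id in exempt_accounts else "required"


# Hypothetical exemption list, standing in for accounts_list_using_imdsv1.
exempt = ["111111111111"]
print(scp_denies_launch("optional", "222222222222", exempt))  # denied: not exempt
print(scp_denies_launch("optional", "111111111111", exempt))  # allowed: still migrating
```

As accounts are removed from the exemption list, both the Terraform default and the SCP tighten together, which is why a single shared module was convenient.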

How effective are these SCPs?

SCPs are effective when it comes to blocking most IMDSv1 usage. However, there are some places where they don’t work.

SCPs don’t apply to the AWS root organization’s account, and only apply to child accounts that are members of the organization. Therefore, SCPs don’t prevent launching instances with IMDSv1 enabled, nor turning on IMDSv1 on an existing instance, in the root AWS account.

SCPs also don’t apply to service-linked roles. For example, if an autoscaling group launches an instance in response to a scaling event, under the hood the AutoScaling service is using a service-linked IAM role managed by AWS, and those instance launches are not impacted by the above SCPs.

We looked at preventing teams from creating AWS Launch Templates that don’t enforce IMDSv2, but AWS Launch Template policy condition keys currently do not offer support for ec2:Attribute/HttpTokens.

What other safety mechanisms are in place?

As there is no 100%-foolproof way to stop someone from launching an IMDSv1-enabled EC2 instance, we put in a notification system using AWS EventBridge and Lambda.

We created two EventBridge rules in each of our child accounts using CloudTrail events for EC2 events. One rule captures requests to the EC2 API and the second captures responses from the EC2 API, telling us when someone is making an EC2:RunInstances call with IMDSv1 enabled.

Rule 1: Capturing the requests

{
  "detail": {
    "eventName": ["RunInstances"],
    "eventSource": ["ec2.amazonaws.com"],
    "requestParameters": {
      "metadataOptions": {
        "httpTokens": ["optional"]
      }
    }
  },
  "detail-type": ["AWS API Call via CloudTrail"],
  "source": ["aws.ec2"]
}

Rule 2: Capturing the responses

{
  "detail": {
    "eventName": ["RunInstances"],
    "eventSource": ["ec2.amazonaws.com"],
    "responseElements": {
      "instancesSet": {
        "items": {
          "metadataOptions": {
            "httpTokens": ["optional"]
          }
        }
      }
    }
  },
  "detail-type": ["AWS API Call via CloudTrail"],
  "source": ["aws.ec2"]
}
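For readers unfamiliar with event patterns, the following Python function is a deliberately simplified model of how a pattern like the ones above selects events. It handles only exact-value matching on nested keys and treats arrays in the event as "any element may match"; real EventBridge supports many more operators. The function name and the sample event are illustrative assumptions.

```python
def matches(pattern, event):
    """Return True if `event` satisfies a simplified EventBridge-style pattern."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: descend. If the event holds an array of objects
            # (such as instancesSet.items), any element may satisfy the pattern.
            candidates = actual if isinstance(actual, list) else [actual]
            if not any(isinstance(c, dict) and matches(expected, c) for c in candidates):
                return False
        else:
            # Leaf: `expected` is a list of allowed values.
            values = actual if isinstance(actual, list) else [actual]
            if not any(v in expected for v in values):
                return False
    return True


response_rule = {
    "source": ["aws.ec2"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventName": ["RunInstances"],
        "eventSource": ["ec2.amazonaws.com"],
        "responseElements": {
            "instancesSet": {"items": {"metadataOptions": {"httpTokens": ["optional"]}}}
        },
    },
}
```

An event whose launched instance reports httpTokens as "optional" matches response_rule, while one reporting "required" does not, so only IMDSv1-enabled launches are forwarded.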

These event rules have a target set up to point them at a central event bus living in an account managed by our team.

AWS Eventbridge Targets

Events matching these rules are sent to the central event bus. The central event bus captures these events via a similar set of rules. Next it sends them through an Input Transformer to format the event similar to the following:

Input path:

{
  "account": "$.account",
  "instanceid": "$.detail.responseElements.instancesSet.items[0].instanceId",
  "region": "$.region",
  "time": "$.time"
}

Input template:

{
  "source" : "slack",
  "detail-type": "slack.api.postMessage",
  "version": 1,
  "account_id": "<account>",
  "channel_tag": "event_alerts_channel_imdsv1",
  "detail": {
    "text": ":importantred: :provisioning: instance `<instanceid> (<region>)` in the AWS account `<account>` was launched with `IMDSv1` support"
  }
}

Finally the transformed events get sent to a Lambda function in our account.
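The input path and input template work together roughly as in the Python sketch below: each input path pulls one value out of the event, and the template substitutes those values for the angle-bracket placeholders. This is only an approximation for illustration (it supports a small JSONPath subset and plain string substitution), and the function names are our own.

```python
import re


def resolve(path, event):
    """Resolve a small JSONPath subset such as `$.a.b[0].c` against a dict."""
    value = event
    # Each match is either a `.name` step or an `[index]` step.
    for name, index in re.findall(r"\.([A-Za-z_][\w-]*)|\[(\d+)\]", path[1:]):
        value = value[name] if name else value[int(index)]
    return value


def transform(input_paths, template, event):
    """Fill `<placeholder>` markers in `template` with values taken from `event`."""
    variables = {key: resolve(path, event) for key, path in input_paths.items()}
    return re.sub(r"<(\w+)>", lambda m: str(variables[m.group(1)]), template)


input_paths = {
    "account": "$.account",
    "instanceid": "$.detail.responseElements.instancesSet.items[0].instanceId",
    "region": "$.region",
}
```

Note that items[0] means only the first instance of a multi-instance launch is named in the alert text.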

AWS Eventbridge Targets

This Lambda function uses the account ID from the event and our internal Archipelago API to determine the Slack channel, then sends this event to Slack.

IMDSv1 Slack Alerts

This flow looks like the following:

IMDSv1 Slack Alert Flow

We also have a similar alert in place for when IMDSv1 is turned on for an existing instance.

IMDSv1 Enabled Slack Alert

What about the instances with IMDSv1 enabled?

Launching new instances with IMDSv2 is cool and all, but what about our thousands of existing instances? We needed a way to enforce IMDSv2 on them as well. As we saw above, SCPs don’t block launching instances with IMDSv1 entirely.

This is why we created a service called IMDSv1 Terminator. It’s deployed on EKS and uses an IAM OIDC provider to obtain IAM credentials. These credentials allow it to assume a highly restricted role, created for this very purpose, in all our child accounts.

The policy attached to the role assumed by IMDSv1 Terminator in child accounts is as below:

{
    "Statement": [
        {
            "Action": "ec2:ModifyInstanceMetadataOptions",
            "Condition": {
                "StringEquals": {
                    "ec2:Attribute/HttpTokens": "required"
                }
            },
            "Effect": "Allow",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Sid": ""
        },
        {
            "Action": [
                "ec2:DescribeRegions",
                "ec2:DescribeInstances"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Sid": ""
        }
    ],
    "Version": "2012-10-17"
}


Similar to our earlier metric collector application, this also uses the internal Archipelago API to get a list of our AWS accounts, lists our EC2 instances in batches, and checks each one to see if IMDSv1 is enabled. If it is, the service will enforce IMDSv2 on the instance.
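The core check can be sketched as follows. This is a hypothetical illustration rather than the IMDSv1 Terminator source: it assumes the response shape of the EC2 DescribeInstances API (Reservations, Instances, MetadataOptions.HttpTokens), and skipping terminated instances is our own assumption about what a remediator would want.

```python
def instances_allowing_imdsv1(describe_instances_page):
    """Yield IDs of instances whose metadata options still allow IMDSv1."""
    for reservation in describe_instances_page.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            state = instance.get("State", {}).get("Name")
            if state in ("shutting-down", "terminated"):
                continue  # assumption: nothing to remediate on dying instances
            options = instance.get("MetadataOptions", {})
            if options.get("HttpTokens") != "required":
                yield instance["InstanceId"]


def remediation_request(instance_id):
    """Keyword arguments for EC2 ModifyInstanceMetadataOptions enforcing IMDSv2."""
    return {"InstanceId": instance_id, "HttpTokens": "required"}


# With boto3 and an assumed role in the child account, the loop could look like:
#   ec2 = boto3.client("ec2", region_name=region)
#   for page in ec2.get_paginator("describe_instances").paginate():
#       for instance_id in instances_allowing_imdsv1(page):
#           ec2.modify_instance_metadata_options(**remediation_request(instance_id))
```

Because the request always sets HttpTokens to "required", it satisfies the StringEquals condition in the restricted role's policy; the same role could not be used to switch IMDSv1 back on.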

When the service remediates an instance, we get notified in Slack.

IMDSv1 Terminator Slack Alert

Initially we saw hundreds of these messages for existing instances, but as they were remediated and only new instances were launched with IMDSv2, we stopped seeing these messages. Now if an instance gets launched with IMDSv1 support enabled, we have the comfort of knowing that it will get remediated and we will get notified.

This service also sends metrics to our Prometheus monitoring system about the IMDS status of our instances. We can easily visualize which AWS accounts and regions are still running IMDSv1-enabled instances, if there are any.

IMDSv1 Usage Dashboard

Some final words

Being able to enforce IMDSv2 across Slack’s vast network was a challenging but rewarding experience for the Cloud Foundations team. We worked with our large number of service teams to accomplish this goal, especially our SecOps team, who went above and beyond to help us complete the migration.

Want to help us build out our cloud infrastructure? We’re hiring! Apply now