Another new AutoSpotting release in less than a week
Fixing ECS load balancer draining
Over the last 7 years I've been working hard to make AutoSpotting the best solution for running Spot Instances in Autoscaling groups.
Just a few days ago I shipped our biggest release ever, which makes AutoSpotting-managed groups of Spot instances even more resilient to low-capacity situations than the vast majority of Autoscaling groups configured with a single OnDemand instance type.
As I always keep improving the experience based on user feedback, I just released a new version of AutoSpotting that improves reliability for ECS users, making AutoSpotting with ECS a more highly available solution than plain ECS itself!
A few weeks back an AutoSpotting user reported an interesting issue with their ECS setup: each Spot termination in their ECS cluster with Spot capacity managed by AutoSpotting resulted in dropped connections and 5xx errors returned to users.
After a lot of investigation, today I was finally able to reproduce and fix the issue.
For a little background, AutoSpotting never handled Spot instance load balancer draining itself but relied on the AutoScaling group or ECS to drain the connections from the load balancers when Spot instances are terminated.
While Autoscaling worked pretty well, it turns out that ECS is very slow at triggering the draining, sometimes starting the load balancer draining just a few seconds before (or sometimes even after!) the Spot instance has been shut down.
This leads to dropped connections and users getting 5xx errors when Spot instances are terminated.
To be clear, this affects not only AutoSpotting users but anyone using ECS with Spot instances, so if you use ECS with Spot instances I'd recommend you look into this.
So today I implemented early deregistration, available in the next release of AutoSpotting.
AutoSpotting makes the deregistration API calls less than 10 seconds after the Spot termination event fires, which should leave plenty of time for connections to be drained cleanly from ECS tasks without users getting errors.
And as we now also immediately launch the replacement Spot instances with diversified failover to OnDemand instances, we give the new instances even more time to start running the application, and we keep the time spent running at reduced capacity as short as possible.
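To illustrate the early deregistration described above, here is a minimal Python sketch. It is not AutoSpotting's actual code: it assumes the termination notice arrives as an EventBridge "EC2 Spot Instance Interruption Warning" event, that `elbv2` is a boto3 "elbv2" client, and that the list of target group ARNs (a hypothetical `target_group_arns` argument) is already known to the caller.

```python
def instance_from_interruption(event):
    """Return the instance ID from a Spot interruption warning, else None."""
    if event.get("detail-type") != "EC2 Spot Instance Interruption Warning":
        return None
    return event.get("detail", {}).get("instance-id")


def deregister_on_interruption(elbv2, event, target_group_arns):
    """Deregister the interrupted instance from each target group right away.

    Assumptions: `elbv2` behaves like a boto3 "elbv2" client, and
    `target_group_arns` is a caller-supplied list (hypothetical here).
    Returns the ARNs for which a deregistration call was made.
    """
    instance_id = instance_from_interruption(event)
    if instance_id is None:
        return []  # not a Spot interruption warning, nothing to drain

    drained = []
    for arn in target_group_arns:
        # Deregistering starts connection draining on the load balancer
        # without waiting for ECS to notice the termination.
        elbv2.deregister_targets(
            TargetGroupArn=arn,
            Targets=[{"Id": instance_id}],
        )
        drained.append(arn)
    return drained
```

The point of the sketch is timing: the call is issued as soon as the warning is received, so the load balancer can use most of the notice period for draining.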
After the AutoSpotting release we had just a few days ago, I was planning to wait a few more weeks until the next one, but this is such a big issue that I changed my plans and released the fix immediately.
Check out the latest version of AutoSpotting, 1.2.1-0, already available on the AWS Marketplace.
If you run the current release, you can install it using this CloudFormation template. Just make sure you also use the SourceImageTag "stable-1.2.1-0".
And as always, stay tuned, there's more stuff coming soon in the next release.
We noticed that version 1.2.1-0 only considered the listener port (such as 80) and failed to drain traffic if the instance was listening on other ports.
We just released version 1.2.1-2, which correctly handles instances listening on ports other than the load balancer listener port, such as when instances listen on a variety of dynamic ports.
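One way to handle dynamic ports, sketched here in Python, is to ask the target group which ports the instance is actually registered on and deregister every one of them. This is only an illustration, not necessarily how AutoSpotting does it; `elbv2` is assumed to be a boto3 "elbv2" client, and the function names are made up for the example.

```python
def registered_targets(elbv2, target_group_arn, instance_id):
    """Return every {Id, Port} target registered for the given instance."""
    resp = elbv2.describe_target_health(TargetGroupArn=target_group_arn)
    return [
        {"Id": d["Target"]["Id"], "Port": d["Target"]["Port"]}
        for d in resp["TargetHealthDescriptions"]
        if d["Target"]["Id"] == instance_id
    ]


def drain_all_ports(elbv2, target_group_arn, instance_id):
    """Deregister the instance on every port it is registered with.

    With ECS dynamic port mapping, one instance can appear several times
    in a target group, once per task port, so each entry must be
    deregistered with its own port.
    """
    targets = registered_targets(elbv2, target_group_arn, instance_id)
    if targets:
        elbv2.deregister_targets(TargetGroupArn=target_group_arn, Targets=targets)
    return targets
```

Looking up the registered ports first avoids assuming that the instance listens on the same port as the load balancer listener.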
Here's what it looks like in action:
There's also a small non-functional fix related to the deletion of SQS messages after being processed, which used to be shown in the logs like this:
SCE:2023-04-18T12:10:00 2023/04/18 12:11:09 region.go:522: us-east-1 Error deleting on-demand instance i-09e980848cc071f5a launch event message from the SQS Queue https://sqs.us-east-1.amazonaws.com/xxxxx/AutoSpotting.fifo: MissingParameter: The request must contain the parameter ReceiptHandle.
The current version instead shows this message:
SQS:i-0e2ffa0829eb9167c 2023/04/18 10:28:05 region.go:527: us-east-1 Successfully deleted spot instance i-0e2ffa0829eb9167c launch event message from the SQS Queue https://sqs.us-east-1.amazonaws.com/xxxxx/AutoSpotting.fifo
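For context on the MissingParameter error above: SQS deletes a message by the receipt handle returned when the message was received, not by its ID or body. The Python sketch below shows that pattern under stated assumptions: `sqs` is a boto3 "sqs" client, `handle` is a hypothetical callback that processes the message body, and the actual fix in AutoSpotting may look different.

```python
def process_and_delete(sqs, queue_url, handle):
    """Receive messages, process each one, then delete it from the queue.

    Returns the number of messages processed and deleted.
    """
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
    processed = 0
    for msg in resp.get("Messages", []):
        handle(msg["Body"])
        # The receipt handle from this receive call must be passed along;
        # omitting it, or sending an empty one, makes the delete fail.
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=msg["ReceiptHandle"],
        )
        processed += 1
    return processed
```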