
SQS Visibility Timeout - When Concurrency Suddenly Stopped

Amazon SQS is a message queue service that allows you to decouple and scale software components, processing messages concurrently. It's great, for example, if you work with multiple Docker containers all processing message data in parallel.
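To make that concrete, here's a rough sketch of the kind of consumer loop each container might run - the queue URL and process_message are placeholders rather than our actual code:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/example-queue"  # placeholder


def process_message(body: str) -> None:
    ...  # whatever work your service does with the message


while True:
    # Long-poll for a batch of messages
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,
    )
    for message in response.get("Messages", []):
        process_message(message["Body"])
        # Deleting the message is what marks it as successfully handled
        sqs.delete_message(
            QueueUrl=QUEUE_URL,
            ReceiptHandle=message["ReceiptHandle"],
        )
```

Run a few of these in parallel and SQS takes care of handing different messages to different consumers - at least, that's the theory.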

Last week, we noticed that a particular SQS queue had started to quickly build up a backlog of visible messages and, more concerningly, the age of our oldest message was stuck at 346,000 seconds. The message retention period for our queue was 346,000 seconds - roughly 4 days.
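As an aside, the retention period is just a queue attribute - if you want to check what yours is set to, something like this works (queue URL is a placeholder):

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/example-queue"  # placeholder

attrs = sqs.get_queue_attributes(
    QueueUrl=QUEUE_URL,
    AttributeNames=["MessageRetentionPeriod"],
)
# Seconds a message is kept before SQS discards it (the SQS default is 345,600 = 4 days)
print(attrs["Attributes"]["MessageRetentionPeriod"])
```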

We'd hit the limit, and we were now losing messages - a major problem.

Looking at our queue metrics, this is what we were presented with...

Our age of oldest message hits 346,000 seconds

Something had clearly gone wrong, and we were suddenly no longer able to process messages at the volume we were receiving.

These metrics essentially meant one of two situations had occurred:

  1. The volume of messages on the queue had dramatically increased, and we were unable to scale up consumption of these messages to match; OR
  2. Our message consumers were no longer able to process messages and remove them from the queue at the rate they previously could.

What Happened?

A quick investigation of our ingest logs revealed that the volume of data hitting our ingest endpoints had not increased - we were not receiving any more requests, and therefore not placing any more messages on the queue, than before. This ruled out situation 1.
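If you'd rather sanity-check this from the queue's side than from application logs, the SQS CloudWatch metrics tell the same story - a rough sketch, with the queue name and time window as placeholders:

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
QUEUE_NAME = "example-queue"  # placeholder

end = datetime.now(timezone.utc)
start = end - timedelta(days=7)


def queue_metric(metric_name: str, stat: str):
    """Fetch an hourly SQS metric for the queue from CloudWatch."""
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/SQS",
        MetricName=metric_name,
        Dimensions=[{"Name": "QueueName", "Value": QUEUE_NAME}],
        StartTime=start,
        EndTime=end,
        Period=3600,
        Statistics=[stat],
    )
    return sorted(response["Datapoints"], key=lambda d: d["Timestamp"])


# Incoming volume - flat for us, so the producers weren't the problem
sent = queue_metric("NumberOfMessagesSent", "Sum")
# Backlog age - climbing steadily towards the retention limit
oldest = queue_metric("ApproximateAgeOfOldestMessage", "Maximum")
```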

That left only one explanation: our capacity to process messages had fallen. Investigating further, we eventually discovered the culprit - the SQS visibility timeout. So, what is the visibility timeout? Let's look at Amazon's docs...

Because Amazon SQS is a distributed system, there's no guarantee that the consumer actually receives the message (for example, due to a connectivity issue, or due to an issue in the consumer application). Thus, the consumer must delete the message from the queue after receiving and processing it.

Immediately after a message is received, it remains in the queue. To prevent other consumers from processing the message again, Amazon SQS sets a visibility timeout, a period of time during which Amazon SQS prevents other consumers from receiving and processing the message. The default visibility timeout for a message is 30 seconds.

The SQS visibility timeout is the configurable period for which a message is placed "on hold" by the queue, giving your consumer a chance to process it.

When you ask the queue for a message, a timer starts running. If you do not process and delete the message before that timer expires, the message becomes visible again, ready for another consumer to pick it up.
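In code terms, the timeout is a property of the queue, but it can also be overridden per receive and extended mid-processing if a message is taking longer than expected - a rough sketch, again with a placeholder queue URL:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/example-queue"  # placeholder

response = sqs.receive_message(
    QueueUrl=QUEUE_URL,
    MaxNumberOfMessages=1,
    # Per-receive override; omit this and the queue's configured default applies
    VisibilityTimeout=30,
)

for message in response.get("Messages", []):
    # If processing looks like it will overrun, extend the hold rather than
    # letting the message become visible to another consumer
    sqs.change_message_visibility(
        QueueUrl=QUEUE_URL,
        ReceiptHandle=message["ReceiptHandle"],
        VisibilityTimeout=120,
    )
    # ... do the work ...
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```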

This is exactly what we were seeing - messages on the queue were being received and processed, but the backlog continued to rise. We had tried introducing more processing containers, and interestingly discovered this had virtually no impact on queue times. Our concurrency was not working.

Let's read that last quoted line again...

The default visibility timeout for a message is 30 seconds.

A few weeks ago, we migrated to using Crossplane. Whilst I'll save the discussion of what Crossplane is for another day, a quick primer: it allows us to provision and orchestrate our infrastructure in a platform-independent way, using a nice declarative API.

In setting that up and migrating from our previous solution, a simple but important mistake was made. Our previous setup relied on the default visibilityTimeout as defined by SQS - 30 seconds. Moving to Crossplane required us to set this value explicitly - and instead of entering 30 seconds, it was set to 0.

The net impact was that as soon as an SQS message was delivered to a consumer, the visibilityTimeout immediately expired - giving that consumer no time to process and delete the message before it became visible again for another consumer to pick up.

This meant the same message was being requested and processed by almost every consumer - rather than just one. There goes our concurrency!
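You can see the failure mode with a couple of back-to-back receives against a queue whose visibility timeout is 0 - this is an illustrative sketch rather than our production code:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/example-queue"  # placeholder

sqs.send_message(QueueUrl=QUEUE_URL, MessageBody="hello")


def receive_one():
    """Simulate one consumer asking the queue for a message."""
    response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1)
    messages = response.get("Messages", [])
    return messages[0]["MessageId"] if messages else None


# With a visibility timeout of 0 the message is never hidden, so two
# "consumers" can each be handed the same message before either deletes it
first = receive_one()
second = receive_one()
print(first, second)  # with a 0 timeout, these can easily be the same MessageId
```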

A quick fix was to set visibilityTimeout: 30 back in Crossplane, and away we went. The results speak for themselves...

Our age of oldest message returns to 11,000 seconds

And with a few more minor amendments to revert some debugging changes we had made, the age of the oldest message continued to fall to 0 seconds and the number of visible messages returned to normal.
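If a tool like Crossplane sits between you and AWS, it's also worth confirming the live queue actually picked up the new value - the raw SQS equivalent of the change looks roughly like this (placeholder queue URL again):

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/example-queue"  # placeholder

# Check the queue's current default visibility timeout
attrs = sqs.get_queue_attributes(
    QueueUrl=QUEUE_URL,
    AttributeNames=["VisibilityTimeout"],
)
print(attrs["Attributes"]["VisibilityTimeout"])  # should now read "30"

# The imperative equivalent of the declarative Crossplane fix - normally you'd
# let Crossplane own this; it's shown here just to make clear which attribute changed
sqs.set_queue_attributes(
    QueueUrl=QUEUE_URL,
    Attributes={"VisibilityTimeout": "30"},
)
```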

Elliot from Scrubs wipes her brow
