How long is too long? The answer may surprise you!
tl;dr: It’s 30 minutes.
To support tasks that take a long time, AWS Elastic Beanstalk provides Worker Environments, which make it easier to develop applications that consume an SQS queue. There are a lot of benefits to this, but it comes with some caveats if your jobs take longer than Beanstalk expects out of the box.
The worker tier is laid out like so: between SQS and your application sits a daemon (sqsd) that reads messages from the queue and POSTs each one to your app through an nginx proxy. This provides some really nice abstractions for development, but when it comes to long-running jobs the devil is in the timeouts.
SQS
Visibility timeout is the main setting here. It needs to be set to something greater than how long you actually expect processing to take. One notable omission out of the box for the worker environment is the ability to scale based on queue depth. Beanstalk supports creating additional resources via CloudFormation, so this can be set up with a config file in the .ebextensions folder. (source)
Resources:
  QueueDepthAlarmHigh:
    Type: AWS::CloudWatch::Alarm
    Properties:
      Namespace: "AWS/SQS"
      MetricName: ApproximateNumberOfMessagesVisible
      Dimensions:
        - Name: QueueName
          Value: { "Fn::GetAtt" : ["AWSEBWorkerQueue", "QueueName"] }
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 1
      Threshold: 1
      ComparisonOperator: GreaterThanOrEqualToThreshold
      AlarmActions:
        - Ref: ScaleOutPolicy
  QueueDepthAlarmLow:
    Type: AWS::CloudWatch::Alarm
    Properties:
      Namespace: "AWS/SQS"
      MetricName: ApproximateNumberOfMessagesVisible
      Dimensions:
        - Name: QueueName
          Value: { "Fn::GetAtt" : ["AWSEBWorkerQueue", "QueueName"] }
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 6
      Threshold: 0
      ComparisonOperator: LessThanOrEqualToThreshold
      AlarmActions:
        - Ref: ScaleInPolicy
  ScaleOutPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AdjustmentType: ChangeInCapacity
      AutoScalingGroupName:
        Ref: AWSEBAutoScalingGroup
      ScalingAdjustment: 1
  ScaleInPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AdjustmentType: ChangeInCapacity
      AutoScalingGroupName:
        Ref: AWSEBAutoScalingGroup
      ScalingAdjustment: -1
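If Beanstalk is managing the worker queue for you, the visibility timeout itself can also live in the same folder via the aws:elasticbeanstalk:sqsd option namespace instead of being changed by hand on the queue. A minimal sketch; the 2100-second value is only an example budget for a slow job, not a recommendation:
option_settings:
  aws:elasticbeanstalk:sqsd:
    # Example value only: give each message more time than your slowest job needs.
    VisibilityTimeout: 2100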
Nginx
For really long-running jobs you may also need to extend the proxy timeouts in nginx. You can do that with a config file placed in .ebextensions, like so. (source)
files:
  "/tmp/proxy.conf":
    mode: "000644"
    owner: root
    group: root
    content: |
      proxy_connect_timeout 1800;
      proxy_send_timeout 1800;
      proxy_read_timeout 1800;
      send_timeout 1800;

container_commands:
  00-add-config:
    command: cat /tmp/proxy.conf > /var/elasticbeanstalk/staging/nginx/conf.d/00_elastic_beanstalk_proxy.conf
  01-restart-nginx:
    command: /sbin/service nginx restart
SQSD
The next timeout you’ll want to visit is the inactivity timeout setting in the worker details. This determines how long the SQS daemon will wait for the application to respond to a given message. The caveat here is that the maximum value for this setting is 30 minutes, which means that if your jobs regularly take longer than that, you are S.O.L. You could, of course, construct the application to fork off from the request thread, but then you lose the ability to properly respond with a success or failure for each message.
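As with the other settings, this one can also be pinned in source control rather than set through the console; a minimal sketch that raises it to the 30-minute ceiling (1800 seconds):
option_settings:
  aws:elasticbeanstalk:sqsd:
    # 1800 seconds is the hard maximum; jobs that run longer than this will
    # time out at the daemon no matter how generous the nginx timeouts are.
    InactivityTimeout: 1800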
Conclusion
The Elastic Beanstalk Worker Environment is nice if it fits your workloads. If, like me, you have jobs that take longer than 30 minutes, it might not be the best fit. In my case I ended up creating a Terraform module to create custom autoscaling groups and integrated directly with SQS in the application. I do, however, miss the decoupling of the application from SQS that Beanstalk provides. Luckily there are a number of open source alternatives to aws-sqsd that I may explore in the future.