images/airflow/2.10.1/python/mwaa/celery/sqs_broker.py (842 lines of code) (raw):
# Copyright (c) 2015-2016 Ask Solem & contributors. All rights reserved.
# Copyright (c) 2012-2014 GoPivotal Inc & contributors. All rights reserved.
# Copyright (c) 2009-2012, Ask Solem & contributors. All rights reserved.
# Modifications Copyright 2022 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of Ask Solem nor the
# names of its contributors may be used to endorse or promote products
# derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL Ask Solem OR CONTRIBUTORS
# BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.
"""Amazon SQS transport module for Kombu.
This package implements an AMQP-like interface on top of Amazons SQS service,
with the goal of being optimized for high performance and reliability.
The default settings for this module are focused now on high performance in
task queue situations where tasks are small, idempotent and run very fast.
SQS Features supported by this transport
========================================
Long Polling
------------
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-long-polling.html
Long polling is enabled by setting the `wait_time_seconds` transport
option to a number > 1. Amazon supports up to 20 seconds. This is
enabled with 10 seconds by default.
Batch API Actions
-----------------
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-batch-api.html
The default behavior of the SQS Channel.drain_events() method is to
request up to the 'prefetch_count' messages on every request to SQS.
These messages are stored locally in a deque object and passed back
to the Transport until the deque is empty, before triggering a new
API call to Amazon.
This behavior dramatically speeds up the rate that you can pull tasks
from SQS when you have short-running tasks (or a large number of workers).
When a Celery worker has multiple queues to monitor, it will pull down
up to 'prefetch_count' messages from queueA and work on them all before
moving on to queueB. If queueB is empty, it will wait up until
'polling_interval' expires before moving back and checking on queueA.
Other Features supported by this transport
==========================================
Predefined Queues
-----------------
The default behavior of this transport is to use a single AWS credential
pair in order to manage all SQS queues (e.g. listing queues, creating
queues, polling queues, deleting messages).
If it is preferable for your environment to use multiple AWS credentials, you
can use the 'predefined_queues' setting inside the 'transport_options' map.
This setting allows you to specify the SQS queue URL and AWS credentials for
each of your queues. For example, if you have two queues which both already
exist in AWS) you can tell this transport about them as follows:
.. code-block:: python
transport_options = {
'predefined_queues': {
'queue-1': {
'url': 'https://sqs.us-east-1.amazonaws.com/xxx/aaa',
'access_key_id': 'a',
'secret_access_key': 'b',
'backoff_policy': {1: 10, 2: 20, 3: 40, 4: 80, 5: 320, 6: 640}, # optional
'backoff_tasks': ['svc.tasks.tasks.task1'] # optional
},
'queue-2.fifo': {
'url': 'https://sqs.us-east-1.amazonaws.com/xxx/bbb.fifo',
'access_key_id': 'c',
'secret_access_key': 'd',
'backoff_policy': {1: 10, 2: 20, 3: 40, 4: 80, 5: 320, 6: 640}, # optional
'backoff_tasks': ['svc.tasks.tasks.task2'] # optional
},
}
'sts_role_arn': 'arn:aws:iam::<xxx>:role/STSTest', # optional
'sts_token_timeout': 900 # optional
}
Note that FIFO and standard queues must be named accordingly (the name of
a FIFO queue must end with the .fifo suffix).
backoff_policy & backoff_tasks are optional arguments. These arguments
automatically change the message visibility timeout, in order to have
different times between specific task retries. This would apply after
task failure.
AWS STS authentication is supported, by using sts_role_arn, and
sts_token_timeout. sts_role_arn is the assumed IAM role ARN we are trying
to access with. sts_token_timeout is the token timeout, defaults (and minimum)
to 900 seconds. After the mentioned period, a new token will be created.
If you authenticate using Okta_ (e.g. calling |gac|_), you can also specify
a 'session_token' to connect to a queue. Note that those tokens have a
limited lifetime and are therefore only suited for short-lived tests.
.. _Okta: https://www.okta.com/
.. _gac: https://github.com/Nike-Inc/gimme-aws-creds#readme
.. |gac| replace:: ``gimme-aws-creds``
Client config
-------------
In some cases you may need to override the botocore config. You can do it
as follows:
.. code-block:: python
transport_option = {
'client-config': {
'connect_timeout': 5,
},
}
For a complete list of settings you can adjust using this option see
https://botocore.amazonaws.com/v1/documentation/api/latest/reference/config.html
Features
========
* Type: Virtual
* Supports Direct: Yes
* Supports Topic: Yes
* Supports Fanout: Yes
* Supports Priority: No
* Supports TTL: No
""" # noqa: E501
import base64
import json
import socket
import string
import uuid
from datetime import datetime
from queue import Empty
from botocore.client import Config
from botocore.exceptions import ClientError
from botocore.serialize import Serializer
from vine import ensure_promise, promise, transform
from kombu.asynchronous import get_event_loop
from kombu.asynchronous.aws.ext import boto3, exceptions, AWSRequest
from kombu.asynchronous.aws.sqs.connection import AsyncSQSConnection
from kombu.asynchronous.aws.sqs.message import AsyncMessage
from kombu.log import get_logger
from kombu.utils import scheduling
from kombu.utils.encoding import bytes_to_str, safe_str
from kombu.utils.json import dumps, loads
from kombu.utils.objects import cached_property
from kombu.transport import virtual
# 2022-11-25: Amazon addition.
# Airflow Stats object.
from airflow.stats import Stats
from enum import Enum
from multiprocessing import shared_memory
from threading import Lock
import os
from mwaa.logging.utils import throttle
# End of Amazon addition
logger = get_logger(__name__)
# dots are replaced by dash, dash remains dash, all other punctuation
# replaced by underscore.
CHARS_REPLACE_TABLE = {ord(c): 0x5F for c in string.punctuation if c not in "-_."}
CHARS_REPLACE_TABLE[0x2E] = 0x2D # '.' -> '-'
#: SQS bulk get supports a maximum of 10 messages at a time.
SQS_MAX_MESSAGES = 10
def maybe_int(x):
"""Try to convert x' to int, or return x' if that fails."""
try:
return int(x)
except ValueError:
return x
# Monkey patch the implementation of ...
def _create_query_request(self, operation, params, queue_url, method):
params = params.copy()
if operation:
params["Action"] = operation
# defaults for non-get
param_payload = {"data": params}
if method.lower() == "get":
# query-based opts
param_payload = {"params": params}
return AWSRequest(method=method, url=queue_url, **param_payload)
def _create_json_request(self, operation, params, queue_url):
params = params.copy()
params["QueueUrl"] = queue_url
service_model = self.sqs_connection.meta.service_model
operation_model = service_model.operation_model(operation)
url = self.sqs_connection._endpoint.host
headers = {}
# Content-Type
json_version = operation_model.metadata["jsonVersion"]
content_type = f"application/x-amz-json-{json_version}"
headers["Content-Type"] = content_type
# X-Amz-Target
target = "{}.{}".format(
operation_model.metadata["targetPrefix"],
operation_model.name,
)
headers["X-Amz-Target"] = target
param_payload = {"data": json.dumps(params), "headers": headers}
method = operation_model.http.get("method", Serializer.DEFAULT_METHOD)
return AWSRequest(method=method, url=url, **param_payload)
def make_request(self, operation_name, params, queue_url, verb, callback=None): # noqa
"""
Overide make_request to support different protocols.
botocore is soon going to change the default protocol of communicating
with SQS backend from 'query' to 'json', so we need a special
implementation of make_request for SQS. More information on this can
be found in: https://github.com/celery/kombu/pull/1807.
"""
signer = self.sqs_connection._request_signer
service_model = self.sqs_connection.meta.service_model
protocol = service_model.protocol
if protocol == "query":
request = self._create_query_request(operation_name, params, queue_url, verb)
elif protocol == "json":
request = self._create_json_request(operation_name, params, queue_url)
else:
raise Exception(f"Unsupported protocol: {protocol}.")
signing_type = "presign-url" if request.method.lower() == "get" else "standard"
signer.sign(operation_name, request, signing_type=signing_type)
prepared_request = request.prepare()
return self._mexe(prepared_request, callback=callback)
# Override the implementation of make_request to bring the fix in this PR:
# https://github.com/celery/kombu/pull/1807
AsyncSQSConnection._create_query_request = _create_query_request
AsyncSQSConnection._create_json_request = _create_json_request
AsyncSQSConnection.make_request = make_request
class UndefinedQueueException(Exception):
"""Predefined queues are being used and an undefined queue was used."""
class InvalidQueueException(Exception):
"""Predefined queues are being used and configuration is not valid."""
class QoS(virtual.QoS):
"""Quality of Service guarantees implementation for SQS."""
def reject(self, delivery_tag, requeue=False):
super().reject(delivery_tag, requeue=requeue)
routing_key, message, backoff_tasks, backoff_policy = (
self._extract_backoff_policy_configuration_and_message(delivery_tag)
)
if routing_key and message and backoff_tasks and backoff_policy:
self.apply_backoff_policy(
routing_key, delivery_tag, backoff_policy, backoff_tasks
)
def _extract_backoff_policy_configuration_and_message(self, delivery_tag):
try:
message = self._delivered[delivery_tag]
routing_key = message.delivery_info["routing_key"]
except KeyError:
return None, None, None, None
if not routing_key or not message:
return None, None, None, None
queue_config = self.channel.predefined_queues.get(routing_key, {})
backoff_tasks = queue_config.get("backoff_tasks")
backoff_policy = queue_config.get("backoff_policy")
return routing_key, message, backoff_tasks, backoff_policy
def apply_backoff_policy(
self, routing_key, delivery_tag, backoff_policy, backoff_tasks
):
queue_url = self.channel._queue_cache[routing_key]
task_name, number_of_retries = self.extract_task_name_and_number_of_retries(
delivery_tag
)
if not task_name or not number_of_retries:
return None
policy_value = backoff_policy.get(number_of_retries)
if task_name in backoff_tasks and policy_value is not None:
c = self.channel.sqs(routing_key)
c.change_message_visibility(
QueueUrl=queue_url,
ReceiptHandle=delivery_tag,
VisibilityTimeout=policy_value,
)
def extract_task_name_and_number_of_retries(self, delivery_tag):
message = self._delivered[delivery_tag]
message_headers = message.headers
task_name = message_headers["task"]
number_of_retries = int(
message.properties["delivery_info"]["sqs_message"]["Attributes"][
"ApproximateReceiveCount"
]
)
return task_name, number_of_retries
class Channel(virtual.Channel):
"""SQS Channel."""
default_region = "us-east-1"
default_visibility_timeout = 1800 # 30 minutes.
default_wait_time_seconds = 10 # up to 20 seconds max
domain_format = "kombu%(vhost)s"
_asynsqs = None
_predefined_queue_async_clients = {} # A client for each predefined queue
_sqs = None
_predefined_queue_clients = {} # A client for each predefined queue
_queue_cache = {}
_noack_queues = set()
QoS = QoS
eof_token = "EOF_TOKEN"
# The SQS channel needs to maintain data regarding the SQS messages that it is currently consuming. This data can be used by the
# MWAA worker task monitor to check if a worker is idle or not. The data is stored in the shared memory blocks defined below.
# The shared memory blocks will have a definite size which is calculated here.
# This per tasks buffer size allows us to store data for each incoming SQS message like the airflow task command contained inside
# and the SQS message receipt handle. The airflow task command helps with correlating an SQS message with its corresponding Airflow
# task process and the receipt handle helps with sending the SQS message back to the queue if needed.
# Furthermore, if celery fails to remove the message from the queue, the message data will still be present in the shared memory blocks
# defined below. This limit will provide the needed flexibility to go beyond what is actually needed for the happy case scenario and
# allow the cleanup process to remove the data from the shared memory blocks without running out of space.
buffer_size_per_task = 2500
celery_worker_task_limit = int(
os.environ.get("AIRFLOW__CELERY__WORKER_AUTOSCALE", "20,20").split(",")[0]
)
celery_tasks_buffer_size = celery_worker_task_limit * buffer_size_per_task
# A simple enum to define the type of operations that can be carried out when updating the celery state (memory block containing
# the current in-flight tasks related data).
class CeleryStateUpdateAction(Enum):
# Add data specific to a single Airflow task to the celery state.
ADD = 1
# Remove data specific to a single Airflow task from the celery state.
REMOVE = 2
def __init__(self, *args, **kwargs):
if boto3 is None:
raise ImportError("boto3 is not installed")
super().__init__(*args, **kwargs)
self._validate_predifined_queues()
# SQS blows up if you try to create a new queue when one already
# exists but with a different visibility_timeout. This prepopulates
# the queue_cache to protect us from recreating
# queues that are known to already exist.
self._update_queue_cache(self.queue_name_prefix)
self.hub = kwargs.get("hub") or get_event_loop()
# MWAA__CORE__TASK_MONITORING_ENABLED is set to 'true' for workers where we want to monitor count of tasks currently getting
# executed on the worker. This will be used to determine if idle worker checks are to be enabled.
self.idle_worker_monitoring_enabled = (
os.environ.get("MWAA__CORE__TASK_MONITORING_ENABLED", "false") == "true"
)
if self.idle_worker_monitoring_enabled:
logger.info('Idle working monitoring will be enabled because '
'MWAA__CORE__TASK_MONITORING_ENABLED is set to true.')
# These are the shared memory blocks which the Worker Task Monitor and the Celery SQS Channel uses to share the internal
# state of current work load across the two processes.
# 'celery_state' is maintained by the SQS channel and has information about the current in-flight tasks.
celery_state_block_name = f'celery_state_{os.environ.get("AIRFLOW_ENV_ID", "")}'
self.celery_state = (
shared_memory.SharedMemory(name=celery_state_block_name)
if self.idle_worker_monitoring_enabled
else None
)
# Create a shared memory block which the Worker Task Monitor and the Celery SQS Channel will use to signal the toggle of a
# flag which tells the Celery SQS channel to pause/unpause further consumption of available SQS messages.
# It is maintained by the worker monitor.
celery_work_consumption_block_name = (
f'celery_work_consumption_{os.environ.get("AIRFLOW_ENV_ID", "")}'
)
self.celery_work_consumption_flag_block = (
shared_memory.SharedMemory(name=celery_work_consumption_block_name)
if self.idle_worker_monitoring_enabled
else None
)
# 'cleanup_celery_state' is maintained by the Worker Task Monitor and has information about the current in-flight tasks
# which needs to be cleaned up from 'celery_state'. The second blob is used because worker task monitor cannot write into
# 'celery_state'. If worker task monitor was to directly update the 'celery_state', then chances are that changes happening
# concurrently at the worker task monitor and the SQS channel, can cause changes to be overwritten by one another.
cleanup_celery_state_block_name = (
f'cleanup_celery_state_{os.environ.get("AIRFLOW_ENV_ID", "")}'
)
self.cleanup_celery_state = (
shared_memory.SharedMemory(name=cleanup_celery_state_block_name)
if self.idle_worker_monitoring_enabled
else None
)
self.celery_lock = Lock() if self.idle_worker_monitoring_enabled else None
# If celery fails to remove the message from the queue but the associated airflow process has wrapped up, then the message details
# will be stuck in the memory blocks defined above. This is the abandoned sqs messages scenario and this flag determine if we need
# to intentionally create this scenario for testing purposes.
self.abandoned_messages_test_enabled = (
os.environ.get(
"AIRFLOW__MWAA__TEST_ABANDONED_SQS_MESSAGE_SCENARIOS", "false"
)
== "true"
and self.idle_worker_monitoring_enabled
)
# If celery removes the message from the queue but the associated airflow process has not wrapped up, then the process will continue
# to eat worker resources. This is the undead airflow processes scenario and this flag determine if we need to intentionally create
# this scenario for testing purposes.
self.undead_processes_test_enabled = (
os.environ.get("AIRFLOW__MWAA__TEST_UNDEAD_PROCESS_SCENARIOS", "false")
== "true"
and self.idle_worker_monitoring_enabled
)
def _get_padded_bytes_from_str(self, raw_data: str):
data = raw_data + self.eof_token
data_bytes = bytes(data, "utf-8")
data_bytes += b"0" * (self.celery_tasks_buffer_size - len(data_bytes))
return data_bytes
def _get_str_from_padded_bytes(self, raw_data: bytes):
data = str(raw_data, "utf-8")
return data[: data.index(self.eof_token)]
def _get_tasks_from_state(self, celery_state):
return loads(
self._get_str_from_padded_bytes(
celery_state.buf[: self.celery_tasks_buffer_size]
)
)
def _get_celery_task_index(self, celery_task, celery_tasks):
for index, task in enumerate(celery_tasks):
if (
task["command"] == celery_task["command"]
and task["receipt_handle"] == celery_task["receipt_handle"]
):
return index
return -1
@throttle(seconds=60, log_throttling_msg=False)
def _report_celery_status_update_no_failure(self):
# This method is used to report a zero value for the celery_state_update_failure
# metric. It is throttled with an interval of 60 seconds, to avoid spamming
# the metric with lots of values.
Stats.incr("mwaa.celery.celery_state_update_failure", 0)
def _update_state_with_tasks(
self, celery_task_tuples, update_action: CeleryStateUpdateAction
):
"""
Update celery_state (memory block containing all in-flight SQS message data) with the provided data related to in-flight SQS
messages. This method also performs cleanup of celery_state if it finds any task data common between celery_state and
cleanup_celery_state. Check the definition of cleanup_celery_state above for more details.
:param celery_task_tuples: List of tuples where each tuple contains the Airflow task command contained in an SQS message being
consumed by the SQS channel and the SQS message receipt handle.
:param update_action: Whether to add the provided SQS tasks data to the celery_state (memory block containing all in-flight
SQS message data) or remove it from the celery_state.
"""
if self.idle_worker_monitoring_enabled:
self.celery_lock.acquire()
try:
cleanup_celery_tasks = self._get_tasks_from_state(
self.cleanup_celery_state
)
current_celery_tasks = self._get_tasks_from_state(self.celery_state)
for cleanup_celery_task in cleanup_celery_tasks:
index_for_cleanup = self._get_celery_task_index(
cleanup_celery_task, current_celery_tasks
)
if index_for_cleanup != -1:
current_celery_tasks.pop(index_for_cleanup)
for command, receipt_handle in celery_task_tuples:
celery_task = {"command": command, "receipt_handle": receipt_handle}
index_for_update = self._get_celery_task_index(
celery_task, current_celery_tasks
)
if (
update_action == self.CeleryStateUpdateAction.ADD
and index_for_update == -1
):
current_celery_tasks.append(celery_task)
elif (
update_action == self.CeleryStateUpdateAction.REMOVE
and index_for_update != -1
):
current_celery_tasks.pop(index_for_update)
self.celery_state.buf[: self.celery_tasks_buffer_size] = (
self._get_padded_bytes_from_str(dumps(current_celery_tasks))
)
self._report_celery_status_update_no_failure()
except Exception:
Stats.incr("mwaa.celery.celery_state_update_failure", 1)
finally:
self.celery_lock.release()
def _is_task_consumption_paused(self):
"""
celery_work_consumption_block represents the toggle switch for accepting any more incoming SQS message from the
celery queue which will be used during the shutdown procedure. If this value of the flag is set to 1, then no more SQS messages
should be consumed by the SQS channel.
"""
return (
self.idle_worker_monitoring_enabled
and self.celery_work_consumption_flag_block.buf[0] == 1
)
def _get_task_command_from_sqs_message(self, encoded_sqs_message_body):
decoded_message_body = loads(base64.b64decode(encoded_sqs_message_body))
celery_payload = loads(base64.b64decode(decoded_message_body["body"]))[0]
celery_command_list = [
command for command_list in celery_payload for command in command_list
]
return " ".join(celery_command_list).strip()
def _validate_predifined_queues(self):
"""Check that standard and FIFO queues are named properly.
AWS requires FIFO queues to have a name
that ends with the .fifo suffix.
"""
for queue_name, q in self.predefined_queues.items():
fifo_url = q["url"].endswith(".fifo")
fifo_name = queue_name.endswith(".fifo")
if fifo_url and not fifo_name:
raise InvalidQueueException(
"Queue with url '{}' must have a name " "ending with .fifo".format(
q["url"]
)
)
elif not fifo_url and fifo_name:
raise InvalidQueueException(
"Queue with name '{}' is not a FIFO queue: " "'{}'".format(
queue_name, q["url"]
)
)
def _update_queue_cache(self, queue_name_prefix):
if self.predefined_queues:
for queue_name, q in self.predefined_queues.items():
self._queue_cache[queue_name] = q["url"]
return
resp = self.sqs().list_queues(QueueNamePrefix=queue_name_prefix)
for url in resp.get("QueueUrls", []):
queue_name = url.split("/")[-1]
self._queue_cache[queue_name] = url
def basic_consume(self, queue, no_ack, *args, **kwargs):
if no_ack:
self._noack_queues.add(queue)
if self.hub:
self._loop1(queue)
return super().basic_consume(queue, no_ack, *args, **kwargs)
def basic_cancel(self, consumer_tag):
if consumer_tag in self._consumers:
queue = self._tag_to_queue[consumer_tag]
self._noack_queues.discard(queue)
return super().basic_cancel(consumer_tag)
def drain_events(self, timeout=None, callback=None, **kwargs):
"""Return a single payload message from one of our queues.
Raises:
Queue.Empty: if no messages available.
"""
# If we're not allowed to consume or have no consumers, raise Empty
if not self._consumers or not self.qos.can_consume():
raise Empty()
# At this point, go and get more messages from SQS
self._poll(self.cycle, callback, timeout=timeout)
def _reset_cycle(self):
"""Reset the consume cycle.
Returns:
FairCycle: object that points to our _get_bulk() method
rather than the standard _get() method. This allows for
multiple messages to be returned at once from SQS (
based on the prefetch limit).
"""
self._cycle = scheduling.FairCycle(
self._get_bulk,
self._active_queues,
Empty,
)
def entity_name(self, name, table=CHARS_REPLACE_TABLE):
"""Format AMQP queue name into a legal SQS queue name."""
if name.endswith(".fifo"):
partial = name[: -len(".fifo")]
partial = str(safe_str(partial)).translate(table)
return partial + ".fifo"
else:
return str(safe_str(name)).translate(table)
def canonical_queue_name(self, queue_name):
return self.entity_name(self.queue_name_prefix + queue_name)
def _new_queue(self, queue, **kwargs):
"""Ensure a queue with given name exists in SQS."""
if not isinstance(queue, str):
return queue
# Translate to SQS name for consistency with initial
# _queue_cache population.
queue = self.canonical_queue_name(queue)
# The SQS ListQueues method only returns 1000 queues. When you have
# so many queues, it's possible that the queue you are looking for is
# not cached. In this case, we could update the cache with the exact
# queue name first.
if queue not in self._queue_cache:
self._update_queue_cache(queue)
try:
return self._queue_cache[queue]
except KeyError:
if self.predefined_queues:
raise UndefinedQueueException(
(
"Queue with name '{}' must be "
"defined in 'predefined_queues'."
).format(queue)
)
attributes = {"VisibilityTimeout": str(self.visibility_timeout)}
if queue.endswith(".fifo"):
attributes["FifoQueue"] = "true"
resp = self._create_queue(queue, attributes)
self._queue_cache[queue] = resp["QueueUrl"]
return resp["QueueUrl"]
def _create_queue(self, queue_name, attributes):
"""Create an SQS queue with a given name and nominal attributes."""
# Allow specifying additional boto create_queue Attributes
# via transport options
if self.predefined_queues:
return None
attributes.update(
self.transport_options.get("sqs-creation-attributes") or {},
)
return self.sqs(queue=queue_name).create_queue(
QueueName=queue_name,
Attributes=attributes,
)
def _delete(self, queue, *args, **kwargs):
"""Delete queue by name."""
if self.predefined_queues:
return
super()._delete(queue)
self._queue_cache.pop(queue, None)
def _put(self, queue, message, **kwargs):
"""Put message onto queue."""
q_url = self._new_queue(queue)
if self.sqs_base64_encoding:
body = AsyncMessage().encode(dumps(message))
else:
body = dumps(message)
kwargs = {"QueueUrl": q_url, "MessageBody": body}
if queue.endswith(".fifo"):
if "MessageGroupId" in message["properties"]:
kwargs["MessageGroupId"] = message["properties"]["MessageGroupId"]
else:
kwargs["MessageGroupId"] = "default"
if "MessageDeduplicationId" in message["properties"]:
kwargs["MessageDeduplicationId"] = message["properties"][
"MessageDeduplicationId"
]
else:
kwargs["MessageDeduplicationId"] = str(uuid.uuid4())
c = self.sqs(queue=self.canonical_queue_name(queue))
if message.get("redelivered"):
# 2022-11-25: Amazon addition.
# This branch is executed when a task is returned to the queue, e.g.
# a worker shutdown:
# https://github.com/celery/kombu/blob/v4.6.11/kombu/transport/virtual/base.py#L732
Stats.incr("mwaa.celery.task_returned", 1)
self._update_state_with_tasks(
[
(
self._get_task_command_from_sqs_message(body),
message["properties"]["delivery_tag"],
)
],
self.CeleryStateUpdateAction.REMOVE,
)
# End of Amazon addition
c.change_message_visibility(
QueueUrl=q_url,
ReceiptHandle=message["properties"]["delivery_tag"],
VisibilityTimeout=0,
)
else:
# 2022-11-25: Amazon addition.
# This branch is executed when the scheduler puts a task in the
# queue so it can be picked by a Celery worker.
Stats.incr("mwaa.celery.task_queued", 1)
# End of Amazon addition
c.send_message(**kwargs)
@staticmethod
def _optional_b64_decode(byte_string):
try:
data = base64.b64decode(byte_string)
if base64.b64encode(data) == byte_string:
return data
# else the base64 module found some embedded base64 content
# that should be ignored.
except Exception: # pylint: disable=broad-except
pass
return byte_string
def _message_to_python(self, message, queue_name, queue):
body = self._optional_b64_decode(message["Body"].encode())
payload = loads(bytes_to_str(body))
if queue_name in self._noack_queues:
queue = self._new_queue(queue_name)
# 2022-11-25: Amazon addition.
# This branch won't be called with our current configuration for
# Airflow/Celery. It will be called only when task_acks_late Celery
# configuration is set to False, resulting in task messages being
# deleted from the SQS immediately, rather than waiting for the
# worker to finish the task. However, we still report a metric to
# make sure the code is future-proof, e.g. if Airflow or MWAA decide
# to disable the task_acks_late configuration.
Stats.incr("mwaa.celery.task_pulled", 1)
self._update_state_with_tasks(
[
(
self._get_task_command_from_sqs_message(message["Body"]),
message["ReceiptHandle"],
)
],
self.CeleryStateUpdateAction.ADD,
)
# End of Amazon addition.
self.asynsqs(queue=queue_name).delete_message(
queue,
message["ReceiptHandle"],
)
else:
try:
properties = payload["properties"]
delivery_info = payload["properties"]["delivery_info"]
except KeyError:
# json message not sent by kombu?
delivery_info = {}
properties = {"delivery_info": delivery_info}
payload.update(
{
"body": bytes_to_str(body),
"properties": properties,
}
)
# set delivery tag to SQS receipt handle
delivery_info.update(
{
"sqs_message": message,
"sqs_queue": queue,
}
)
properties["delivery_tag"] = message["ReceiptHandle"]
return payload
def _messages_to_python(self, messages, queue):
"""Convert a list of SQS Message objects into Payloads.
This method handles converting SQS Message objects into
Payloads, and appropriately updating the queue depending on
the 'ack' settings for that queue.
Arguments:
messages (SQSMessage): A list of SQS Message objects.
queue (str): Name representing the queue they came from.
Returns:
List: A list of Payload objects
"""
q = self._new_queue(queue)
return [self._message_to_python(m, queue, q) for m in messages]
def _get_bulk(self, queue, max_if_unlimited=SQS_MAX_MESSAGES, callback=None):
"""Try to retrieve multiple messages off ``queue``.
Where :meth:`_get` returns a single Payload object, this method
returns a list of Payload objects. The number of objects returned
is determined by the total number of messages available in the queue
and the number of messages the QoS object allows (based on the
prefetch_count).
Note:
Ignores QoS limits so caller is responsible for checking
that we are allowed to consume at least one message from the
queue. get_bulk will then ask QoS for an estimate of
the number of extra messages that we can consume.
Arguments:
queue (str): The queue name to pull from.
Returns:
List[Message]
"""
# drain_events calls `can_consume` first, consuming
# a token, so we know that we are allowed to consume at least
# one message.
# Note: ignoring max_messages for SQS with boto3
max_count = self._get_message_estimate()
if max_count:
q_url = self._new_queue(queue)
# 2022-11-25: Amazon addition.
# Send a heartbeat metric each time we try to receive messages from SQS.
# I didn't notice this branch being executed as it seems that async client
# is used in our case, but adding this nevertheless to be future-proof.
Stats.incr("mwaa.celery.heartbeat", 1)
# End of Amazon addition.
resp = self.sqs(queue=queue).receive_message(
QueueUrl=q_url,
MaxNumberOfMessages=max_count,
WaitTimeSeconds=self.wait_time_seconds,
)
if resp.get("Messages"):
# 2022-11-25: Amazon addition.
# Since we pulled some messages from the SQS queue, we report
# a metric indicating the number of tasks pulled for execution.
# I didn't notice this branch being executed as it seems that
# async client is used in our case, but adding this nevertheless
# to be future-proof.
Stats.incr("mwaa.celery.task_pulled", len(resp.get("Messages")))
celery_task_tuples = []
# End of Amazon addition.
for m in resp["Messages"]:
celery_task_tuples.append(
(
self._get_task_command_from_sqs_message(m["Body"]),
m["ReceiptHandle"],
)
)
m["Body"] = AsyncMessage(body=m["Body"]).decode()
self._update_state_with_tasks(
celery_task_tuples, self.CeleryStateUpdateAction.ADD
)
for msg in self._messages_to_python(resp["Messages"], queue):
self.connection._deliver(msg, queue)
return
raise Empty()
def _get(self, queue):
"""Try to retrieve a single message off ``queue``."""
q_url = self._new_queue(queue)
# 2022-11-25: Amazon addition.
# Send a heartbeat metric each time we try to receive messages from SQS.
# I didn't notice this branch being executed as it seems that async client
# is used in our case, but adding this nevertheless to be future-proof.
Stats.incr("mwaa.celery.heartbeat", 1)
# If worker monitoring is enabled and the worker has been told to pause consumption, then we return Empty here to
# simulate the situation where there are no messages in the celery queue.
if self._is_task_consumption_paused():
raise Empty()
# End of Amazon addition.
resp = self.sqs(queue=queue).receive_message(
QueueUrl=q_url,
MaxNumberOfMessages=1,
WaitTimeSeconds=self.wait_time_seconds,
)
if resp.get("Messages"):
# 2022-11-25: Amazon addition.
# Since we pulled a message from the SQS queue, we report a metric
# indicating that.
# I didn't notice this branch being executed as it seems that async client
# is used in our case, but adding this nevertheless to be future-proof.
Stats.incr("mwaa.celery.task_pulled", 1)
self._update_state_with_tasks(
[
(
self._get_task_command_from_sqs_message(
resp["Messages"][0]["Body"]
),
resp["Messages"][0]["ReceiptHandle"],
)
],
self.CeleryStateUpdateAction.ADD,
)
# End of Amazon addition.
body = AsyncMessage(body=resp["Messages"][0]["Body"]).decode()
resp["Messages"][0]["Body"] = body
return self._messages_to_python(resp["Messages"], queue)[0]
raise Empty()
def _loop1(self, queue, _=None):
self.hub.call_soon(self._schedule_queue, queue)
def _schedule_queue(self, queue):
if queue in self._active_queues:
if self.qos.can_consume():
self._get_bulk_async(
queue,
callback=promise(self._loop1, (queue,)),
)
else:
self._loop1(queue)
def _get_message_estimate(self, max_if_unlimited=SQS_MAX_MESSAGES):
# 2024-02-01: Amazon addition.
# If worker monitoring is enabled and the worker has been told to pause consumption, then we return zero here to
# simulate the situation where there are no messages in the celery queue.
if self._is_task_consumption_paused():
return 0
# End of Amazon addition.
maxcount = self.qos.can_consume_max_estimate()
return min(
max_if_unlimited if maxcount is None else max(maxcount, 1),
max_if_unlimited,
)
def _get_bulk_async(self, queue, max_if_unlimited=SQS_MAX_MESSAGES, callback=None):
maxcount = self._get_message_estimate()
if maxcount:
return self._get_async(queue, maxcount, callback=callback)
# Not allowed to consume, make sure to notify callback..
callback = ensure_promise(callback)
callback([])
return callback
def _get_async(self, queue, count=1, callback=None):
q = self._new_queue(queue)
qname = self.canonical_queue_name(queue)
return self._get_from_sqs(
qname,
count=count,
connection=self.asynsqs(queue=qname),
callback=transform(self._on_messages_ready, callback, q, queue),
)
def _on_messages_ready(self, queue, qname, messages):
if "Messages" in messages and messages["Messages"]:
callbacks = self.connection._callbacks
# 2022-11-25: Amazon addition.
# This code is called after messages are retrieved from SQS via
# the async client. From our perspective, this indicates tasks
# pulled from the queue, so we report a metric.
Stats.incr("mwaa.celery.task_pulled", len(messages["Messages"]))
celery_task_tuples = []
for msg in messages["Messages"]:
celery_task_tuples.append(
(
self._get_task_command_from_sqs_message(msg["Body"]),
msg["ReceiptHandle"],
)
)
msg_parsed = self._message_to_python(msg, qname, queue)
callbacks[qname](msg_parsed)
should_add_all_messages = True
if self.undead_processes_test_enabled and len(celery_task_tuples) > 1:
# In order to test for undead processes, we will not add the data regarding the first message fetched from the queue
# in 50% of the cases. This will cause the cleanup process to treat the corresponding Airflow process as undead and that
# Airflow process will be terminated.
import random
if random.randint(0, 9) < 5:
should_add_all_messages = False
if should_add_all_messages:
self._update_state_with_tasks(
celery_task_tuples, self.CeleryStateUpdateAction.ADD
)
else:
self._update_state_with_tasks(
celery_task_tuples[1:], self.CeleryStateUpdateAction.ADD
)
# End of Amazon addition.
def _get_from_sqs(self, queue, count=1, connection=None, callback=None):
"""Retrieve and handle messages from SQS.
Uses long polling and returns :class:`~vine.promises.promise`.
"""
connection = connection if connection is not None else queue.connection
if self.predefined_queues:
if queue not in self._queue_cache:
raise UndefinedQueueException(
(
"Queue with name '{}' must be defined in "
"'predefined_queues'."
).format(queue)
)
queue_url = self._queue_cache[queue]
else:
queue_url = connection.get_queue_url(queue)
# 2022-11-25: Amazon addition.
# Send a heartbeat metric each time we try to receive messages from SQS.
Stats.incr("mwaa.celery.heartbeat", 1)
# End of Amazon addition.
return connection.receive_message(
queue,
queue_url,
number_messages=count,
wait_time_seconds=self.wait_time_seconds,
callback=callback,
)
def _restore(self, message, unwanted_delivery_info=("sqs_message", "sqs_queue")):
for unwanted_key in unwanted_delivery_info:
# Remove objects that aren't JSON serializable (Issue #1108).
message.delivery_info.pop(unwanted_key, None)
return super()._restore(message)
def basic_ack(self, delivery_tag, multiple=False):
try:
message = self.qos.get(delivery_tag).delivery_info
sqs_message = message["sqs_message"]
except KeyError:
super().basic_ack(delivery_tag)
else:
queue = None
if "routing_key" in message:
queue = self.canonical_queue_name(message["routing_key"])
try:
# 2022-11-25: Amazon addition.
# This code is executed when a task finishes execution and
# thus its message can be removed from the SQS queue. We thus
# report a task_executed metric.
Stats.incr("mwaa.celery.task_executed", 1)
should_remove_sqs_message = True
celery_task_tuple = (
self._get_task_command_from_sqs_message(sqs_message["Body"]),
sqs_message["ReceiptHandle"],
)
if self.abandoned_messages_test_enabled:
# In order to test for abandoned tasks, we will not remove the message from the SQS queue in 50% of the cases.
# This will cause the cleanup process to treat the corresponding SQS message as abandoned and the message will be
# returned to the queue by setting its visibility timeout to zero.
# Also checking that the message data should be present in the internal state (memory blocks) because the
# abandoned SQS message test scenario and undead Airflow process test scenario are not possible to occur
# simultaneously based on definition.
import random
celery_task = {
"command": celery_task_tuple[0],
"receipt_handle": celery_task_tuple[1],
}
celery_task_index = self._get_celery_task_index(
celery_task, self._get_tasks_from_state(self.celery_state)
)
if random.randint(0, 9) < 5 and celery_task_index != -1:
should_remove_sqs_message = False
if should_remove_sqs_message:
self._update_state_with_tasks(
[celery_task_tuple], self.CeleryStateUpdateAction.REMOVE
)
self.sqs(queue=queue).delete_message(
QueueUrl=message["sqs_queue"],
ReceiptHandle=sqs_message["ReceiptHandle"],
)
# End of Amazon addition.
except ClientError:
super().basic_reject(delivery_tag)
else:
super().basic_ack(delivery_tag)
def _size(self, queue):
"""Return the number of messages in a queue."""
url = self._new_queue(queue)
c = self.sqs(queue=self.canonical_queue_name(queue))
resp = c.get_queue_attributes(
QueueUrl=url, AttributeNames=["ApproximateNumberOfMessages"]
)
return int(resp["Attributes"]["ApproximateNumberOfMessages"])
def _purge(self, queue):
"""Delete all current messages in a queue."""
q = self._new_queue(queue)
# SQS is slow at registering messages, so run for a few
# iterations to ensure messages are detected and deleted.
size = 0
for i in range(10):
size += int(self._size(queue))
if not size:
break
self.sqs(queue=queue).purge_queue(QueueUrl=q)
return size
def close(self):
# 2024-02-01: Amazon addition.
# If worker monitoring is enabled, then we make use of shared memory blocks to share the internal state from this
# SQS Channel with the worker task monitor. When closing the channel, we also close the shared memory blocks in order to
# prevent a warning message to show up in the logs. We are not calling unlink here intentionally because both the task monitor and
# the Celery SQS channel references these blocks and when the worker is shutting down, we do not control which process will be
# killed first. Calling unlink may result in a warning message showing up in the customer side logs causing unnecessary confusion.
if self.idle_worker_monitoring_enabled:
self.celery_state.close()
self.celery_work_consumption_flag_block.close()
self.cleanup_celery_state.close()
# End of Amazon addition.
super().close()
# if self._asynsqs:
# try:
# self.asynsqs().close()
# except AttributeError as exc: # FIXME ???
# if "can't set attribute" not in str(exc):
# raise
def new_sqs_client(
self, region, access_key_id, secret_access_key, session_token=None
):
session = boto3.session.Session(
region_name=region,
aws_access_key_id=access_key_id,
aws_secret_access_key=secret_access_key,
aws_session_token=session_token,
)
is_secure = self.is_secure if self.is_secure is not None else True
client_kwargs = {"use_ssl": is_secure}
if self.endpoint_url is not None:
client_kwargs["endpoint_url"] = self.endpoint_url
client_config = self.transport_options.get("client-config") or {}
config = Config(**client_config)
return session.client("sqs", config=config, **client_kwargs)
def sqs(self, queue=None):
if queue is not None and self.predefined_queues:
if queue not in self.predefined_queues:
raise UndefinedQueueException(
f"Queue with name '{queue}' must be defined"
" in 'predefined_queues'."
)
q = self.predefined_queues[queue]
if self.transport_options.get("sts_role_arn"):
return self._handle_sts_session(queue, q)
if not self.transport_options.get("sts_role_arn"):
if queue in self._predefined_queue_clients:
return self._predefined_queue_clients[queue]
else:
c = self._predefined_queue_clients[queue] = self.new_sqs_client(
region=q.get("region", self.region),
access_key_id=q.get("access_key_id", self.conninfo.userid),
secret_access_key=q.get(
"secret_access_key", self.conninfo.password
),
)
return c
if self._sqs is not None:
return self._sqs
c = self._sqs = self.new_sqs_client(
region=self.region,
access_key_id=self.conninfo.userid,
secret_access_key=self.conninfo.password,
)
return c
def _handle_sts_session(self, queue, q):
if not hasattr(self, "sts_expiration"): # STS token - token init
sts_creds = self.generate_sts_session_token(
self.transport_options.get("sts_role_arn"),
self.transport_options.get("sts_token_timeout", 900),
)
self.sts_expiration = sts_creds["Expiration"]
c = self._predefined_queue_clients[queue] = self.new_sqs_client(
region=q.get("region", self.region),
access_key_id=sts_creds["AccessKeyId"],
secret_access_key=sts_creds["SecretAccessKey"],
session_token=sts_creds["SessionToken"],
)
return c
# STS token - refresh if expired
elif self.sts_expiration.replace(tzinfo=None) < datetime.utcnow():
sts_creds = self.generate_sts_session_token(
self.transport_options.get("sts_role_arn"),
self.transport_options.get("sts_token_timeout", 900),
)
self.sts_expiration = sts_creds["Expiration"]
c = self._predefined_queue_clients[queue] = self.new_sqs_client(
region=q.get("region", self.region),
access_key_id=sts_creds["AccessKeyId"],
secret_access_key=sts_creds["SecretAccessKey"],
session_token=sts_creds["SessionToken"],
)
return c
else: # STS token - ruse existing
return self._predefined_queue_clients[queue]
def generate_sts_session_token(self, role_arn, token_expiry_seconds):
sts_client = boto3.client("sts")
sts_policy = sts_client.assume_role(
RoleArn=role_arn,
RoleSessionName="Celery",
DurationSeconds=token_expiry_seconds,
)
return sts_policy["Credentials"]
def asynsqs(self, queue=None):
if queue is not None and self.predefined_queues:
if queue in self._predefined_queue_async_clients and not hasattr(
self, "sts_expiration"
):
return self._predefined_queue_async_clients[queue]
if queue not in self.predefined_queues:
raise UndefinedQueueException(
(
"Queue with name '{}' must be defined in "
"'predefined_queues'."
).format(queue)
)
q = self.predefined_queues[queue]
c = self._predefined_queue_async_clients[queue] = AsyncSQSConnection(
sqs_connection=self.sqs(queue=queue),
region=q.get("region", self.region),
)
return c
if self._asynsqs is not None:
return self._asynsqs
c = self._asynsqs = AsyncSQSConnection(
sqs_connection=self.sqs(queue=queue), region=self.region
)
return c
@property
def conninfo(self):
return self.connection.client
@property
def transport_options(self):
return self.connection.client.transport_options
@cached_property
def visibility_timeout(self):
return (
self.transport_options.get("visibility_timeout")
or self.default_visibility_timeout
)
@cached_property
def predefined_queues(self):
"""Map of queue_name to predefined queue settings."""
return self.transport_options.get("predefined_queues", {})
@cached_property
def queue_name_prefix(self):
return self.transport_options.get("queue_name_prefix", "")
@cached_property
def supports_fanout(self):
return False
@cached_property
def region(self):
return (
self.transport_options.get("region")
or boto3.Session().region_name
or self.default_region
)
@cached_property
def regioninfo(self):
return self.transport_options.get("regioninfo")
@cached_property
def is_secure(self):
return self.transport_options.get("is_secure")
@cached_property
def port(self):
return self.transport_options.get("port")
@cached_property
def endpoint_url(self):
if self.conninfo.hostname is not None:
scheme = "https" if self.is_secure else "http"
if self.conninfo.port is not None:
port = f":{self.conninfo.port}"
else:
port = ""
return "{}://{}{}".format(scheme, self.conninfo.hostname, port)
@cached_property
def wait_time_seconds(self):
return self.transport_options.get(
"wait_time_seconds", self.default_wait_time_seconds
)
@cached_property
def sqs_base64_encoding(self):
return self.transport_options.get("sqs_base64_encoding", True)
class Transport(virtual.Transport):
"""SQS Transport.
Additional queue attributes can be supplied to SQS during queue
creation by passing an ``sqs-creation-attributes`` key in
transport_options. ``sqs-creation-attributes`` must be a dict whose
key-value pairs correspond with Attributes in the
`CreateQueue SQS API`_.
For example, to have SQS queues created with server-side encryption
enabled using the default Amazon Managed Customer Master Key, you
can set ``KmsMasterKeyId`` Attribute. When the queue is initially
created by Kombu, encryption will be enabled.
.. code-block:: python
from kombu.transport.SQS import Transport
transport = Transport(
...,
transport_options={
'sqs-creation-attributes': {
'KmsMasterKeyId': 'alias/aws/sqs',
},
}
)
.. _CreateQueue SQS API: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/API_CreateQueue.html#API_CreateQueue_RequestParameters
""" # noqa: E501
Channel = Channel
polling_interval = 1
wait_time_seconds = 0
default_port = None
connection_errors = virtual.Transport.connection_errors + (
exceptions.BotoCoreError,
socket.error,
)
channel_errors = virtual.Transport.channel_errors + (exceptions.BotoCoreError,)
driver_type = "sqs"
driver_name = "sqs"
implements = virtual.Transport.implements.extend(
asynchronous=True,
exchange_type=frozenset(["direct"]),
)
@property
def default_connection_params(self):
return {"port": self.default_port}