Mentor: Yuval Lifshitz
Persistent bucket notifications are essential in the object storage ecosystem: they make the system resilient against broker outages. Traditionally, these notifications have relied on RADOS queues, which, while effective, place a significant load on the RADOS cluster and restrict operation to environments with a RADOS backend. This project aims to demonstrate the use of Redis for implementing persistent bucket notifications, and to make bucket notifications backend-agnostic. Additionally, it seeks to overcome the performance limitations inherent in writing to a single RADOS object that maintains the queue. The Redis integration promises to maintain the robustness of bucket notifications while significantly enhancing the flexibility and adaptability of the system across various storage environments, such as POSIX-based backends.
Ceph is an open-source, distributed storage system that runs on commodity hardware. It is a unified storage system: a single cluster can serve file, block, and object storage. Read more about the Ceph architecture.
Redis is an in-memory data structure store. It is referred to as a data structure server because it supports various data structures such as strings, hashes, lists, sets, sorted sets, and more.
It provides atomic operations on these data structures through optimistic locking, allowing for safe concurrency control. It supports replication and clustering for high availability and scalability.
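To make the optimistic-locking idea concrete, here is an illustrative in-memory sketch of the check-and-set retry pattern that Redis's WATCH/MULTI/EXEC commands provide. This is plain Python for illustration only, not Redis client code; the dictionary stands in for the Redis keyspace.

```python
# In-memory analogue of Redis optimistic locking (WATCH/MULTI/EXEC):
# read a value with its version, compute a new value, and retry if
# another writer modified the key in the meantime.

store = {"counter": (0, 0)}  # key -> (value, version)

def compare_and_set(key, expected_version, new_value):
    current, version = store[key]
    if version != expected_version:
        return False                       # another writer got in first
    store[key] = (new_value, version + 1)  # atomic in a real Redis EXEC
    return True

def increment(key):
    while True:  # retry loop, as with a failed WATCH/EXEC transaction
        value, version = store[key]
        if compare_and_set(key, version, value + 1):
            return

for _ in range(5):
    increment("counter")
```

In real Redis, the same effect is achieved by WATCHing the key before reading it and letting EXEC fail if the key changed, or by running the whole read-modify-write inside a single Lua script, which Redis executes atomically.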
Redis provides simple data structures that can be used to solve various problems. We aim to use Redis queues to implement a two-phase commit queue analogous to the RADOS one, and simple key-value pairs to implement advisory distributed locking.
The RADOS Gateway (the radosgw daemon) is a Ceph component that offers a RESTful HTTP API for storing objects and associated metadata. It runs on top of the Ceph Storage Cluster, using its own data structures and managing user authentication and access control through its own user database.
Ceph strives to remain compatible with the S3 API.
Bucket notifications are a mechanism by which the RADOS Gateway sends event messages to defined topics whenever an operation happens on a bucket. More about bucket notification compatibility with the AWS S3 Notification API.
A bucket is a container for objects stored in the RADOS Gateway. Topics define an endpoint and its associated configuration; they act as logical communication channels.
The bucket notification API for topics and the supported operations is documented at https://docs.ceph.com/en/latest/radosgw/notifications/#topics
Notification configurations (referred to simply as "notifications" in the text) are entities that bind a topic to a subset of a bucket's events.
A bucket can have multiple notification configurations for multiple topics and a topic can be configured as the destination for multiple bucket notifications.
The delivery of messages out of the RADOS Gateway can be done in two modes, defined by the Persistence property of the topic.
- When set to False, a single synchronous delivery attempt is made during the bucket operation itself; notification delivery and acknowledgement become part of the operation.
- When set to True, an internal queue is created to ensure delivery of the message, and delivery to this topic becomes asynchronous: persisting the notification to storage becomes part of the triggering operation, and delivery is retried asynchronously until it succeeds.
Persistent notifications enable "at least once" delivery of bucket notifications. In the synchronous delivery model, the operation's success or failure is independent of message delivery: delivery is attempted "at most once", which is prone to failure when, for example, the topic endpoint is down. Persistent bucket notifications, by contrast, use persistent storage and a two-phase commit algorithm to ensure that the message is committed after the operation succeeds and aborted if the operation fails. Delivery is then retried until the topic endpoint acknowledges the message.
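The two-phase commit flow can be sketched as follows. This is an illustrative in-memory model, not the actual Ceph/Redis implementation; the `reserve`/`commit`/`abort` names mirror the operations of the queue, and `put_object` is a hypothetical stand-in for a bucket operation.

```python
# Illustrative in-memory model of the two-phase commit used by
# persistent bucket notifications (not the actual Ceph/Redis code).

class NotificationQueue:
    def __init__(self):
        self.reserved = {}   # reservation_id -> reserved byte size
        self.committed = []  # messages awaiting delivery
        self._next_id = 0

    def reserve(self, size):
        # Phase 1: claim queue space before the bucket operation runs.
        self._next_id += 1
        self.reserved[self._next_id] = size
        return self._next_id

    def commit(self, res_id, message):
        # Phase 2: the bucket operation succeeded; enqueue the message.
        self.reserved.pop(res_id)
        self.committed.append(message)

    def abort(self, res_id):
        # The bucket operation failed; release the reservation.
        self.reserved.pop(res_id)


def put_object(queue, succeed):
    # Hypothetical bucket operation wrapped in the two-phase commit.
    res_id = queue.reserve(size=512)
    if succeed:  # the actual object write would happen here
        queue.commit(res_id, {"eventName": "ObjectCreated:Put"})
    else:
        queue.abort(res_id)


q = NotificationQueue()
put_object(q, succeed=True)   # leaves one message in the queue
put_object(q, succeed=False)  # leaves no trace
```

Only the successful operation leaves a message behind; a delivery worker can then retry sending each committed message until the endpoint acknowledges it.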
We use Lua scripting in Redis to model a distributed locking mechanism and a two-phase commit queue. The following sections outline the key modules for managing locks and the queue system, which includes essential operations such as reserving, committing, acknowledging (ack), and cleanup.
The lock module manages distributed locks using Redis keys, allowing for resource protection across multiple processes. Locks are identified by a unique name and are held by a "cookie," a unique identifier per lock owner. Key functions include:
- `assert_lock`: Checks whether a lock exists and is held by a given cookie. If the lock is held by another owner, it returns an error; otherwise, it confirms lock ownership.
- `lock`: Acquires a lock if it is not already held, or extends it if the cookie matches the current holder. If the lock exists under a different cookie, it denies access.
- `unlock`: Releases the lock if the caller presents the correct cookie, allowing others to acquire it.
Each function returns a standardized response, indicating success or an error with details.
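The lock semantics above can be sketched with a small in-memory model. This is illustrative Python, not the actual Lua scripts; a plain dictionary stands in for the Redis keyspace, and expiry handling is omitted for brevity.

```python
# Illustrative model of cookie-based advisory locks; the dict stands
# in for the Redis keys manipulated by the Lua lock module.

locks = {}  # lock name -> cookie of the current holder

def lock(name, cookie):
    holder = locks.get(name)
    if holder is None or holder == cookie:
        locks[name] = cookie  # acquire, or extend if we already hold it
        return {"ok": True}
    return {"ok": False, "error": "lock held by another owner"}

def assert_lock(name, cookie):
    if locks.get(name) == cookie:
        return {"ok": True}   # confirms ownership
    return {"ok": False, "error": "not the lock owner"}

def unlock(name, cookie):
    if locks.get(name) == cookie:
        del locks[name]       # released; others may now acquire it
        return {"ok": True}
    return {"ok": False, "error": "not the lock owner"}

a = lock("queue-1", "cookie-a")   # acquired
b = lock("queue-1", "cookie-b")   # denied: held by cookie-a
unlock("queue-1", "cookie-a")
c = lock("queue-1", "cookie-b")   # now succeeds
```

In the real module, each of these runs as a single Lua script, so the check and the write happen atomically inside Redis.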
This module enables message queuing with a two-phase commit system. Queues are defined by a unique name and a maximum size, with reservations and message additions managed separately to support concurrent handling of items.
- `init_queue`: Initializes a queue if it doesn't already exist, storing the maximum size used to manage space constraints.
- `reserve`: Prepares space in the queue by creating a reserved entry of a specified byte size. A reservation ID is generated to track the entry, with the data populated by a random string for testability.
- `commit`: Completes the two-phase commit by taking a reserved entry and adding it to the queue.
- `abort`: Cancels a reservation if needed, removing the entry from the reserved space.
- `read`: Retrieves the latest message in the queue without removing it, allowing multiple processes to view pending items.
- `locked_read` and `locked_read_multi`: Variations of `read` that use a lock to ensure exclusive access to the messages in the queue, particularly useful in environments with shared resources.
- `ack` and `locked_ack`: Acknowledge the processing of a message by removing it from the queue; `locked_ack` ensures that only the process holding the lock can acknowledge the message, preventing race conditions.
- `cleanup`: Removes stale reserved entries based on a timeout threshold, freeing up space in case of abandoned or expired reservations.
Each function in the queue module also returns a response in the same standard format, ensuring consistency across operations.
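Putting the operations together, the full lifecycle of a message (reserve, commit, read, ack, plus cleanup of stale reservations) might look like the following in-memory sketch. This is illustrative Python, not the Lua queue module; the size accounting and timeout handling are simplified.

```python
import time

# Illustrative in-memory model of the queue lifecycle; stands in for
# the Redis structures managed by the Lua queue module.

class Queue:
    def __init__(self, max_size):
        self.max_size = max_size
        self.reserved = {}   # res_id -> (byte size, reservation time)
        self.messages = []   # committed messages awaiting delivery
        self._next_id = 0

    def _used(self):
        # Space consumed by reservations plus committed messages.
        return (sum(s for s, _ in self.reserved.values())
                + sum(len(m) for m in self.messages))

    def reserve(self, size):
        if self._used() + size > self.max_size:
            return None      # queue full: reject the reservation
        self._next_id += 1
        self.reserved[self._next_id] = (size, time.time())
        return self._next_id

    def commit(self, res_id, message):
        self.reserved.pop(res_id)
        self.messages.append(message)

    def read(self):
        # Peek at the head of the queue without removing it.
        return self.messages[0] if self.messages else None

    def ack(self):
        # Remove the head once the endpoint has acknowledged it.
        if self.messages:
            self.messages.pop(0)

    def cleanup(self, timeout):
        now = time.time()
        stale = [r for r, (_, t) in self.reserved.items()
                 if now - t > timeout]
        for r in stale:
            del self.reserved[r]  # reclaim abandoned reservations


q = Queue(max_size=1024)
rid = q.reserve(64)
q.commit(rid, "notification-1")
msg = q.read()        # delivery worker peeks at the message
q.ack()               # removed after successful delivery
q.cleanup(timeout=30)
```

The `locked_read`/`locked_ack` variants simply wrap `read` and `ack` with the lock module from the previous section, so only one delivery worker at a time consumes from a given queue.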
To realise the project, the following changes were made to the Ceph codebase: ceph/ceph#57842
- Start the Ceph cluster with at least 1 RGW
MON=1 OSD=1 MDS=0 MGR=0 RGW=1 ../src/vstart.sh -n -d
- Start a Redis Server
docker run -d --rm --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest
- Start an HTTP endpoint. This serves as the destination for the notifications
wget https://gist.githubusercontent.com/mdonkers/63e115cc0c79b4f6b8b3a6b797e485c7/raw/a6a1d090ac8549dac8f2bd607bd64925de997d40/server.py
python server.py 8080
- With the AWS CLI tool installed and configured
4.1 Create a bucket
aws s3 mb s3://fish --endpoint-url http://localhost:8000
4.2 Create a topic
aws sns create-topic --name fishtopic --attributes='{"push-endpoint": "http://localhost:8080", "persistent": "true"}' --endpoint-url http://localhost:8000
4.3 Create a notification configuration
aws s3api put-bucket-notification-configuration --bucket fish --notification-configuration='{"TopicConfigurations": [{"Id": "notif1", "TopicArn": "arn:aws:sns:default::fishtopic", "Events": []}]}' --endpoint-url http://localhost:8000
- Upload an object to the bucket
head -c 512 </dev/urandom > myfile
aws --endpoint-url http://localhost:8000 s3 cp myfile s3://fish
- On the HTTP Terminal you should see the notification message
Body:
{
  "Records": [
    {
      "eventVersion": "2.2",
      "eventSource": "ceph:s3",
      "awsRegion": "default",
      "eventTime": "2024-11-04T05:55:19.575526Z",
      "eventName": "ObjectCreated:Put",
      "userIdentity": {"principalId": "testid"},
      "requestParameters": {"sourceIPAddress": ""},
      "responseElements": {
        "x-amz-request-id": "808b3989-1cfc-474c-bb29-bdb72765bb2e.4179.1952379431697694961",
        "x-amz-id-2": "4179-default-default"
      },
      "s3": {
        "s3SchemaVersion": "1.0",
        "configurationId": "notif_persist",
        "bucket": {
          "name": "fish",
          "ownerIdentity": {"principalId": "testid"},
          "arn": "arn:aws:s3:default::fish",
          "id": "808b3989-1cfc-474c-bb29-bdb72765bb2e.4179.1"
        },
        "object": {
          "key": "myfile",
          "size": 512,
          "eTag": "9229f59fbfcb16b237b89306ce273b24",
          "versionId": "",
          "sequencer": "C7612867A15AFF23",
          "metadata": [],
          "tags": []
        }
      },
      "eventId": "1730699719.603937.9229f59fbfcb16b237b89306ce273b24",
      "opaqueData": ""
    }
  ]
}
- Add an abstraction layer to support multiple backends for persistent bucket notifications.
- Add configuration options to allow users to choose between RADOS and Redis for persistent bucket notifications.
I would like to earnestly acknowledge the dedicated efforts and valuable time given by my mentor, Yuval Lifshitz, whose guidance and support have been instrumental throughout this project. I am deeply humbled by Yuval’s patience and understanding as he answered my questions—sometimes repetitive, often detailed—and helped me navigate challenges.
I truly appreciate the time he spent with me during our debugging sessions; they were incredibly helpful in moving our work forward.
Thank you, Yuval, for making this an exceptional learning journey. I hope to continue learning from you and contributing to the Ceph community in the future.