Project Emmy: Dapr Actors v2

This document contains the design for a new Actors service for Dapr, which:

  1. Provides the address for actors running in the cluster, replacing the current Placement service and table distribution with a more efficient and reliable design.
  2. Manages, stores, and executes actor reminders in a centralized service.

The new service requires a relational (aka "SQL") database, such as Azure SQL (Microsoft SQL Server), PostgreSQL, or MySQL, as well as SQLite.

Because the Dapr workflow building block is based on actors, improvements in this space will benefit workflow users too.

State of the world

Actor placement

The placement subsystem is a critical piece of infrastructure to enable the use of actors in Dapr, ensuring that each actor runs in "single-threaded" mode, processing a single message at a time, on one host only.

Current design

The placement table contains the list of all actor hosts as well as the types of actors they can host. Every time a sidecar that can host actors comes online or goes offline, the placement table is updated. Note that the placement table does not contain the list of active actors, but just the list of active actor hosts and the types of actors they support.

The current implementation is based on a Placement service which maintains the placement table (the state), either in a single-node or in a 3-node setup which leverages Raft for replicating the state. In single-node mode, the state is also persisted on disk synchronously; when using Raft, persistence is optional.

The placement table is then disseminated to all Dapr sidecars on each change, with a push-based approach. Each sidecar maintains a persistent connection with one of the instances of the Placement service, which is used for both health-checks (to remove unhealthy sidecars from being able to host actors) and for receiving updates to the placement table.

Because each Dapr sidecar contains a local copy of the placement tables, finding the address of a specific actor is a local-only operation. The sidecar looks up the full list of hosts that can serve actors of the given type, and then uses a consistent hashing algorithm to determine which one will run the specific actor; this method resolves instantly as it only needs to perform a local computation.
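
For illustration, here is a minimal Go sketch of a consistent-hash lookup of this kind. The types and the use of FNV hashing are assumptions for the example and do not mirror the actual dapr/dapr implementation, which uses a more sophisticated virtual-node ring.

// Minimal, hypothetical sketch of a consistent-hash lookup over actor hosts.
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type ring struct {
	hashes []uint64          // sorted hashes of the virtual nodes
	hosts  map[uint64]string // virtual-node hash -> host address
}

func newRing(hosts []string, virtualNodes int) *ring {
	r := &ring{hosts: map[uint64]string{}}
	for _, h := range hosts {
		for i := 0; i < virtualNodes; i++ {
			sum := hashString(fmt.Sprintf("%s-%d", h, i))
			r.hashes = append(r.hashes, sum)
			r.hosts[sum] = h
		}
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

// lookup returns the host responsible for the given actor:
// the first virtual node on the ring at or after the key's hash.
func (r *ring) lookup(actorType, actorID string) string {
	sum := hashString(actorType + "||" + actorID)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= sum })
	if i == len(r.hashes) {
		i = 0 // wrap around the ring
	}
	return r.hosts[r.hashes[i]]
}

func hashString(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

func main() {
	r := newRing([]string{"10.0.0.1:50002", "10.0.0.2:50002"}, 100)
	fmt.Println(r.lookup("myactortype", "actor-42"))
}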

Limitations

Today's design suffers from a few major issues:

  1. Scalability as the number of actor types increases. After a certain number of actor types is reached (~600), the connection with the Placement service can fail. This has been observed by users in production.
  2. Operational complexity when running in HA mode due to the use of Raft. We have observed issues where the entire actor subsystem was unavailable for users because of a corrupted Raft state. Additionally, heavy actor users can experience resource exhaustion (because Raft maintains the entire history of all state revisions in-memory) as well as operational issues such as significant network traffic between the nodes.
    • One very large Dapr user, who makes heavy use of actors, reports needing to restart the Placement service on a weekly basis.
  3. The push-based approach of distributing state tables to all sidecars suffers from scalability and reliability issues, as notifications are sent to sidecars one-by-one. To ensure consistency, this is a 3-step process, where a lock is acquired, the table is disseminated, and then the lock is removed; however, this is a very costly operation, during which the actor subsystem is blocked.
  4. Limited scalability of the Placement service, which can either run with 1 instance, or 3. Any other number of instances (including 2) is not allowed. Even when running with 3 instances, Placement operates with only one active primary, and two replicas that are standby-only.
  5. Each time a sidecar comes online or goes offline, the list of active actor hosts changes and causes a rebalancing to happen across all connected actor hosts. Actors are kept active if the expected host (based on the consistent hash) is the same; otherwise, they are terminated. When there are many sidecars that can host a given actor type and/or they scale frequently (such as in an environment that uses auto-scaling), this can cause many actors to be deactivated prematurely and shuffled around.
  6. The subsystem is very complex, due to the need to keep the placement tables in sync across all sidecars. Over time, we have discovered bugs, including subtle ones that could cause an actor to be active on two separate hosts at the same time (e.g. dapr/dapr#6968).

Proposed changes

The new Actors service addresses the problems above by maintaining the state in a relational database. Dapr sidecars retrieve information on a given actor in a "pull" fashion, requesting the address of the host for a given actor on demand from the Actors service (with a short-lived, local cache).

The benefits of this approach are:

  1. Scalability increases "linearly" with the number of instances of the Actors service that are deployed, and there are no limitations to the number of replicas.
  2. Operations are simpler and more reliable thanks to not having to rely on distributed consensus algorithms like Raft.
  3. The implementation is simpler and potentially more reliable.

The biggest downside of this approach is that there's an increase in latency when an actor is invoked, because the caller needs to resolve the address of the actor host by making a network call (rather than performing a local lookup). This can be mitigated by using caching, both in the actor clients and in the Actors service.
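
As a rough illustration of the client-side mitigation, the sketch below wraps the address lookup in a short-lived cache. The types, TTL handling, and lookup callback are assumptions for the example, standing in for the actual gRPC call to the Actors service.

package actorcache

import (
	"sync"
	"time"
)

// Address of an actor host, as returned by the Actors service.
type Address struct {
	AppID   string
	Address string
}

type cacheEntry struct {
	addr    Address
	expires time.Time
}

// LookupCache is a short-lived cache in front of the address lookup.
// The lookup callback is a stand-in for the call to the Actors service.
type LookupCache struct {
	mu      sync.Mutex
	entries map[string]cacheEntry
	ttl     time.Duration
	lookup  func(actorType, actorID string) (Address, error)
}

func New(ttl time.Duration, lookup func(actorType, actorID string) (Address, error)) *LookupCache {
	return &LookupCache{entries: map[string]cacheEntry{}, ttl: ttl, lookup: lookup}
}

// Get returns the cached address if present and not expired; otherwise it
// queries the Actors service and caches the result for the TTL.
func (c *LookupCache) Get(actorType, actorID string) (Address, error) {
	key := actorType + "||" + actorID
	c.mu.Lock()
	e, ok := c.entries[key]
	c.mu.Unlock()
	if ok && time.Now().Before(e.expires) {
		return e.addr, nil
	}
	addr, err := c.lookup(actorType, actorID)
	if err != nil {
		return Address{}, err
	}
	c.mu.Lock()
	c.entries[key] = cacheEntry{addr: addr, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return addr, nil
}

// Invalidate removes a cached entry, e.g. after a failed connection to the host.
func (c *LookupCache) Invalidate(actorType, actorID string) {
	c.mu.Lock()
	delete(c.entries, actorType+"||"+actorID)
	c.mu.Unlock()
}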

Reminders

The current implementation of actor reminders has proved to have some significant limits, especially with regard to scalability when many reminders are scheduled and/or when there are many different actors.

Dapr also supports actor timers. The difference between the two is explained in the Dapr docs:

The main difference is that Dapr actor runtime is not retaining any information about timers after deactivation, while persisting the information about reminders using Dapr actor state provider.

Timers continue to be executed by the Dapr runtime that hosts the active actor and are outside of the scope of this document.

Current design and limitations

Although we have made, and continue to make, incremental improvements to the actor reminder subsystem, its design has some natural upper bounds due to the way persistent storage for reminders works.

At a high level, all reminders for actors of a given actor type are serialized in a single JSON document (or a limited number thereof), which is retrieved from the persistent storage and committed in full on every change. This design was necessary to allow reminders to be stored on all state stores supported by Dapr, as long as they offer "transactional" capabilities.
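
As a rough illustration of this storage shape (the Go types below are hypothetical, not the actual dapr/dapr ones):

// Hypothetical illustration of the v1 storage shape: all reminders for an
// actor type live in one serialized record, read and written in full on
// every change, which creates a natural upper bound.
type reminderRecord struct {
	ActorID string `json:"actorID"`
	Name    string `json:"name"`
	DueTime string `json:"dueTime"`
	Period  string `json:"period"`
	Data    []byte `json:"data,omitempty"`
}

// Stored under a single key per actor type in the actor state store; creating,
// updating, or deleting one reminder requires re-writing the whole slice
// (in a transaction, guarded by an ETag).
type remindersDocument struct {
	Reminders []reminderRecord `json:"reminders"`
}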

In addition to performance issues, due to its complexity, the design for the reminders storage suffers from race conditions which are hard to debug, and sometimes even to address. Over the last Dapr releases, we have identified and fixed at least half a dozen race conditions in the reminders subsystem, including some subtle ones.

When multiple Dapr sidecars try to update the document(s) containing the reminders for a given actor, we leverage optimistic concurrency control to prevent race conditions, using ETags. Users frequently complain that operations on reminders fail with ETag conflicts, forcing them to retry the operation manually.

Lastly, the current reminders implementation does not offer strong delivery guarantees, and reminders are delivered "at most once". If the sidecar and/or app crash while processing the reminder, it is not retried later on.

Proposed changes

We aim to solve these concerns by designing a new solution that is based on a separate Dapr control plane service that is independent from Dapr runtimes and can be scaled horizontally as needed. The use of a relational database addresses the issues related to scalability of the storage of reminders.

Initial investigation was performed in dapr/dapr#5403.

Requirements

The solution is designed to meet the following goals:

  • Users must be able to scale the number of instances of the Actors service horizontally as needed, in or out, with zero downtime.
  • The solution must work on Kubernetes and in standalone ("self-hosted") setups of Dapr.
  • The solution must be resilient. In case an instance of the Actors service goes down, the solution must be able to recover without losing any reminder data, and with no downtime as long as there's another replica active.
  • The number of supported actors, reminders and connected sidecars must scale linearly or near-linearly with the number of instances of the Actors service.
  • Additional requirements for reminders:
    • Reminders need to be executed with "at least once" delivery guarantee to the app
    • Reminders must be executed as close as possible to the time they are scheduled for, and certainly not before.

High-level design

The Actors service is a new Dapr control plane service, alongside Operator, Injector, and Sentry (it replaces the Placement service). Being a separate control plane service allows it to be scaled horizontally depending on the load, such as the number of actors and/or actor reminders, or the number of connected Dapr sidecars.

Although the Actors service will be deployed automatically with every Dapr installation, users who do not need support for actors can disable it if needed.

The Actors service requires a relational (aka "SQL") database. We will support multiple databases, including PostgreSQL, MySQL, Azure SQL (MS SQL Server), etc. For simple environments, where there's one instance (or possibly a small number of instances but on the same physical host), SQLite is supported too. Support for non-relational databases such as MongoDB could be investigated for the future; however, Azure Cosmos DB will not be usable.

We understand that the requirement of a relational database is the single biggest breaking change compared with the current solution; however, we believe that the benefits we gain from this choice vastly outweigh the lack of support for some databases that are currently available as Dapr actor state stores. Additionally, relational databases are offered as a service by every cloud provider, allowing users a simple way to connect to one.

Actor placement

Actor host registration

Dapr sidecars that are actor hosts register themselves with the Actors service; a client-side sketch of this flow follows the list below.

  1. Actor hosts' sidecars establish the ConnectHost gRPC bi-di stream with one instance of the Actors service:
    • The first message must contain information on the list of supported actor types (aka "entities"), as well as the app's ID and address, and the actor API level (see dapr/dapr#6838).
    • Sidecars can send updated data at any time.
    • Optional namespacing is supported, to address dapr/dapr#4711.
  2. The ConnectHost gRPC stream is also used for healthchecks, so the Actors service can detect if an actor host has failed.
    • Sidecars must send a message (even empty) at least every N seconds.
    • If the Actors service does not receive a healthcheck within N seconds, it must assume that the actor host is offline.
    • The Actors service must respond too, with an empty message, for the sidecar to detect if the gRPC stream is in a failed state, in which case the gRPC library should re-establish it automatically. If the connection fails to be re-established, and health-checks continue to fail, the sidecar must deactivate all local actors.
  3. Apps that are registered are added to the "hosts" table, alongside the last healthcheck time. Apps are considered healthy if they are in the table and the last successful healthcheck was less than N seconds ago.
    • The Actors service performs a periodic garbage collection of unhealthy sidecars.
    • When an unhealthy sidecar is removed, associated active actors are purged too.
  4. When a sidecar shuts down, it gracefully disconnects from the ConnectHost stream, which makes the Actors service unregister the sidecar right away (without waiting for the healthcheck).
  5. If the connection is forcefully severed (which includes the case of the instance of the Actors service crashing), the registration is kept in the database, so upon reconnection (to the same or different instance of the Actors service), the operation can resume.
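
A minimal sketch of the sidecar side of this flow, assuming Go code generated from the proto in the APIs section below. The generated package path and type names, the addresses, and the ping interval are all assumptions, and mTLS is omitted for brevity.

package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	// Hypothetical generated package for the proto in the APIs section below.
	actorsv1 "github.com/dapr/dapr/pkg/proto/actors/v1"
)

// connectHost registers the sidecar as an actor host and keeps the stream
// alive with periodic pings, per the flow described above.
func connectHost(ctx context.Context, client actorsv1.ActorsClient) error {
	stream, err := client.ConnectHost(ctx)
	if err != nil {
		return err
	}

	// The first message must contain the registration details.
	err = stream.Send(&actorsv1.ConnectHostClientStream{
		Message: &actorsv1.ConnectHostClientStream_RegisterActorHost{
			RegisterActorHost: &actorsv1.RegisterActorHost{
				Address:  "10.0.0.5:50002", // example values
				AppId:    "myapp",
				ApiLevel: 20,
				ActorTypes: []*actorsv1.ActorHostType{
					{ActorType: "myactortype", IdleTimeout: 300},
				},
			},
		},
	})
	if err != nil {
		return err
	}

	// Receive messages from the Actors service (configuration, reminders to
	// execute, deactivation requests) and dispatch them.
	go func() {
		for {
			msg, err := stream.Recv()
			if err != nil {
				return // stream failed: deactivate local actors and reconnect
			}
			_ = msg // dispatch based on the oneof field
		}
	}()

	// Send an empty message as a health-check ping at a fixed interval
	// (the real interval comes from ActorHostConfiguration).
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return stream.CloseSend()
		case <-ticker.C:
			if err := stream.Send(&actorsv1.ConnectHostClientStream{}); err != nil {
				return err
			}
		}
	}
}

func main() {
	// mTLS is required in practice; insecure credentials are used here only to
	// keep the sketch short. The address is an example.
	conn, err := grpc.Dial("dapr-actors-service:51101", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	if err := connectHost(context.Background(), actorsv1.NewActorsClient(conn)); err != nil {
		log.Fatal(err)
	}
}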

Actor address lookup

Dapr sidecars can make requests to the Actors service to get the address of an actor. When a request comes in to look up an actor's host's address:

  • If the actor is inactive–if there's nothing in the "actors" table, or if the host associated with that actor is unhealthy based on the data from the "hosts" table:
    • The Actors service picks one sidecar that can host actors of that kind, at random (from the "hosts" table).
    • The new actor is added to the "actors" table alongside the name of the host sidecar and the idle timeout.
    • The Actors service responds with the address of the sidecar that was picked.
  • If there's an active actor–if there's an entry in the "actors" table and the associated host is healthy:
    • The Actors service responds with the address of the app and the actor idle time.

When a new actor needs to be activated, the Actors service picks a host at random. This seems to be the most efficient method on average, according to research from the Orleans team as well.

Sidecars that invoke actors can cache the response for up to 5s or the actor's idle time, whichever comes first. When using cached data, it's possible that a sidecar connects to a host that has crashed: in this case, the connection fails, and the sidecar must request a new address and retry automatically.
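
A minimal sketch of the lookup-or-activate logic described above, assuming PostgreSQL and hypothetical "hosts", "host_actor_types", and "actors" tables (only the reminders schema is defined in this document):

package placement

import (
	"context"
	"database/sql"
	"errors"
	"time"
)

// LookupActor returns the address of the host for the given actor, activating
// the actor on a random healthy host if needed. Table and column names are
// assumptions; a unique constraint on (actor_type, actor_id) is assumed on
// the "actors" table.
func LookupActor(ctx context.Context, db *sql.DB, actorType, actorID string, healthTTL time.Duration) (appID, address string, err error) {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return "", "", err
	}
	defer tx.Rollback()

	// 1. If the actor is already active and its host is healthy, return that host.
	err = tx.QueryRowContext(ctx, `
		SELECT h.app_id, h.address
		  FROM actors a
		  JOIN hosts h ON h.host_id = a.host_id
		 WHERE a.actor_type = $1 AND a.actor_id = $2
		   AND h.last_healthcheck >= now() - ($3 * interval '1 second')`,
		actorType, actorID, healthTTL.Seconds()).Scan(&appID, &address)
	if err == nil {
		return appID, address, tx.Commit()
	} else if !errors.Is(err, sql.ErrNoRows) {
		return "", "", err
	}

	// 2. Otherwise, pick a random healthy host that supports this actor type...
	var hostID string
	err = tx.QueryRowContext(ctx, `
		SELECT h.host_id, h.app_id, h.address
		  FROM hosts h
		  JOIN host_actor_types t ON t.host_id = h.host_id
		 WHERE t.actor_type = $1
		   AND h.last_healthcheck >= now() - ($2 * interval '1 second')
		 ORDER BY random()
		 LIMIT 1`,
		actorType, healthTTL.Seconds()).Scan(&hostID, &appID, &address)
	if err != nil {
		return "", "", err // includes the case where no host can serve this actor type
	}

	// ...and record the activation, replacing a stale row if one exists.
	_, err = tx.ExecContext(ctx, `
		INSERT INTO actors (actor_type, actor_id, host_id)
		VALUES ($1, $2, $3)
		ON CONFLICT (actor_type, actor_id) DO UPDATE SET host_id = EXCLUDED.host_id`,
		actorType, actorID, hostID)
	if err != nil {
		return "", "", err
	}
	return appID, address, tx.Commit()
}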

Rebalancing

The new design does not include explicit support for automatic rebalancing every time a sidecar that can host actors comes online.

If a sidecar that can host actors goes offline, the actors hosted on it are all deactivated. If those actors are invoked again, they are automatically placed on one of the actor hosts at random.

We expect that as the number of actor hosts increases due to scaling, actors are "naturally" rebalanced due to the random placement across actor hosts. Users that intend to rely heavily on auto-scaling should set a lower "idle timeout" for actors, so they are more likely to be deactivated and re-activated on different hosts. (The default value of 1 hour in Dapr Actors v1 is perhaps too high.)

In a "break the glass" situation, users can scale the app horizontally and then restart a pod that is hosting actors, which causes all actors hosted there to be deactivated and eventually rebalanced across all active actor hosts, automatically.

Should we identify the need for a more "forced" rebalancing of actors in the future, we could implement it as follows:

  • The Actors service tracks the last time an actor was invoked by tracking requests for addresses (the value is cached by sidecars, but only for a limited time).
  • Actors that haven't been invoked for the longest time can be deactivated: the Actors service can request a sidecar to forcefully deactivate an actor with the DeactivateActor message.

Reminders

This design is based on ItalyPaleAle/dapr-actor-reminders-v2-demo, which includes a proof of concept. Unlike that document, this proposal does move reminder execution into a control plane service (the Actors service), to reduce the number of active connections to the database.

The Actors service is also responsible for executing actor reminders.

By storing reminders in a centralized, relational database, multiple instances of the Actors service can process reminders concurrently, in a conflict-free way, and horizontal (auto-)scaling is possible. There's a "natural" load balancing thanks to the fact that all processors are competing to fetch reminders from the database.

This solution should allow for very high throughput in executing, scheduling, or re-scheduling (modifying or deleting) reminders. In itself, it's not impacted by the total number of actor types and/or actor IDs, and it scales horizontally well when there are many reminders to be executed. The goal is that the limiting factor for performance and scalability should only be the database, and not Dapr or the Actors service themselves.

Table schema

Draft table schema in DBML format:

Table reminders {
  // Random ID for the reminder
  reminder_id uuid [primary key]
  // Actor type
  actor_type text [not null]
  // Actor id
  actor_id text [not null]
  // Name
  name text [not null]
  // Next execution time
  execution_time timestamp [not null]
  // If set, indicates the period the reminder repeats for
  period text
  // Reminders (including repeating reminders) are deleted after this timestamp, if set
  ttl timestamp
  // Data associated with the reminder
  data bytea
  // Active lease
  lease_id uuid
  lease_time timestamp
  lease_pid text

  Indexes {
    (actor_type, actor_id, name) [unique]
    execution_time
    lease_pid
  }
}

Equivalent SQL (Postgres):

CREATE TABLE reminders (
  reminder_id uuid PRIMARY KEY NOT NULL DEFAULT gen_random_uuid(), 
  actor_type text NOT NULL,
  actor_id text NOT NULL,
  reminder_name text NOT NULL,
  reminder_execution_time timestamp with time zone NOT NULL,
  reminder_period text,
  reminder_ttl timestamp with time zone,
  reminder_data bytea,
  reminder_lease_id uuid,
  reminder_lease_time timestamp with time zone,
  reminder_lease_pid text
);

CREATE UNIQUE INDEX ON reminders (actor_type, actor_id, reminder_name);
CREATE INDEX ON reminders (reminder_execution_time);
CREATE INDEX ON reminders (reminder_lease_pid);

Operations on reminders

The Actors service offers APIs to create (or replace), delete, and get reminders, which can be invoked by sidecars over gRPC:

  • CreateReminder
  • GetReminder
  • DeleteReminder

See the APIs section below for more details on the gRPC methods.

Any instance of the Actors service can respond to the methods above, for any reminder in the namespace.

Reminder execution

  • Each instance of the Actors service maintains an in-memory queue with the reminders that are scheduled to be executed in the immediate future.
  • Periodically, every pollInterval (default: 1.5s), each instance of the Actors service polls the database to retrieve the next reminders that need to be executed within the fetchAhead interval (default: 4s).
    • At most batchSize (default: 40) reminders are retrieved, as long as they are scheduled to be executed within fetchAhead.
      • The query that retrieves the reminders also atomically updates the rows, storing the current lease (lease ID, time, and PID); these values act as a "lease token". (A sketch of this query follows the list.)
      • Rows that have a lease_time newer than the current time minus leaseDuration (default: 20s; this must be much larger than fetchAhead) are skipped. This ensures that only one instance retrieves a given reminder, and if that instance is terminated before the reminder is executed, the reminder can be picked up by another instance after leaseDuration.
    • The reminders that are retrieved are added to the in-memory queue, to be executed at the time they're scheduled for.
  • When it's time to execute the reminder:
    1. First, the instance of the Actors service loads the reminder to confirm that the record hasn't changed since it was fetched and that the in-memory data is up to date.
    2. The reminder is sent to the actor host for execution. In the meanwhile, the Actors service keeps the reminder in-memory and continues to renew its lease.
    3. The actor host executes the reminder (by invoking the app).
    4. Upon completion, the actor host notifies the Actors service by invoking the CompleteReminder method.
    5. The Actors service then considers the execution as complete, and proceeds to delete the reminder from the database (or update its execution_time for repeating reminders).
  • When a new reminder is added, it's saved in the database. If it's scheduled to be executed "immediately", the first instance of the Actors service that is polling for reminders will pick it up.
    • If the reminder's scheduled time is within fetchAhead from now (default: 4s), and it can be served by a sidecar connected to the current instance of the Actors service, then it's stored in the database in a way that is already owned by the current instance of the Actors service (i.e. with a lease already set). It's then directly enqueued in the queue managed by the current instance.
  • When a reminder is updated (same actor type, actor ID, and reminder name), it's replaced in the database. This also removes any lease that may exist (but may create a new lease owned by the current instance if applicable, per the previous point).
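
A minimal sketch of the lease-acquisition query described in the list above, against the PostgreSQL schema defined earlier. The exact query and the Go wrapper are illustrative, not the final implementation.

package reminders

import (
	"context"
	"database/sql"
	"time"
)

// fetchUpcomingReminders atomically acquires a lease on the next batch of due
// reminders. Rows whose lease is still "fresh" (lease_time newer than
// now - leaseDuration) are skipped, so only one instance owns a reminder at a time.
func fetchUpcomingReminders(ctx context.Context, db *sql.DB, pid string, fetchAhead, leaseDuration time.Duration, batchSize int) (*sql.Rows, error) {
	return db.QueryContext(ctx, `
		UPDATE reminders
		   SET reminder_lease_id = gen_random_uuid(),
		       reminder_lease_time = now(),
		       reminder_lease_pid = $1
		 WHERE reminder_id IN (
		    SELECT reminder_id
		      FROM reminders
		     WHERE reminder_execution_time <= now() + ($2 * interval '1 second')
		       AND (
		           reminder_lease_time IS NULL
		           OR reminder_lease_time < now() - ($3 * interval '1 second')
		       )
		     ORDER BY reminder_execution_time
		     LIMIT $4
		     FOR UPDATE SKIP LOCKED
		 )
		RETURNING reminder_id, actor_type, actor_id, reminder_name,
		          reminder_execution_time, reminder_period, reminder_data,
		          reminder_lease_id`,
		pid, fetchAhead.Seconds(), leaseDuration.Seconds(), batchSize)
}

The returned rows are then added to the in-memory queue; the lease is renewed while the reminder is pending and released (or the row deleted or re-scheduled) once CompleteReminder is received.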

Reminders data

Important: This section of the proposal is not currently implemented; associated data is currently stored alongside the reminder in the Actors service.

An important difference with "actor reminders v1" is that data associated with a reminder is stored out-of-band, and not in the Actors service. This is important for reasons that include both performance and security/privacy (otherwise, a control plane service would have access to data that should be scoped to specific apps).

For reminders that have associated data:

  • Associated data is stored by the sidecar before the reminder is sent to the Actors service; the sidecar then sends only a reference to the Actors service.
    • The data can be stored by Dapr sidecars in any state store.
    • The reference is the key of the data in the state store, which is random.
  • Sidecars store the data before invoking the Actors service. Then, they send the request to create the reminder.
    • If they receive an error, they delete the data.
    • Note: this could leave garbage behind if the sidecar crashes in between these steps. We will need to figure out a way to clean that up if it becomes a problem.
  • When reminders are executed, sidecars load the data from the state store. Upon successful execution, the data is deleted from the state store.
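
A minimal sketch of this flow on the sidecar side. The stateStore and actorsClient interfaces, the Reminder type, and the DataRef field are hypothetical stand-ins for the Dapr state store APIs and the Actors service client.

package reminderdata

import (
	"context"
	"time"

	"github.com/google/uuid"
)

// Hypothetical stand-ins for the Dapr state store and the Actors service client.
type stateStore interface {
	Set(ctx context.Context, key string, value []byte) error
	Delete(ctx context.Context, key string) error
}

type Reminder struct {
	ActorType, ActorID, Name string
	DueTime                  time.Time
	DataRef                  string // reference to the out-of-band data
}

type actorsClient interface {
	CreateReminder(ctx context.Context, r Reminder) error
}

// createReminderWithData stores the payload out-of-band, then creates the
// reminder carrying only a reference to it.
func createReminderWithData(ctx context.Context, store stateStore, actors actorsClient, rem Reminder, data []byte) error {
	// 1. Store the payload under a random key in the state store.
	dataKey := uuid.New().String()
	if err := store.Set(ctx, dataKey, data); err != nil {
		return err
	}

	// 2. Create the reminder with the reference instead of the data.
	rem.DataRef = dataKey
	if err := actors.CreateReminder(ctx, rem); err != nil {
		// Best-effort cleanup; if the sidecar crashes before this point the key
		// is orphaned (the garbage noted above) until a cleanup mechanism runs.
		_ = store.Delete(ctx, dataKey)
		return err
	}
	return nil
}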

Alternative approach to storing reminders data

We can continue to store the full data associated with a reminder in the database, through the Actors service. In order to protect the privacy of the data, it could be stored encrypted (either by the sidecar beforehand, to achieve "E2E" encryption, or by the Actors service before storing it at rest).

To prevent users potentially storing data of unbounded size, we should consider setting a limit on the amount of data that can be stored with a reminder, for example 1KB. Users who intend to store larger amounts of data should consider storing it out-of-band themselves.

APIs

The Actors service accepts requests from other apps in the cluster over gRPC. When mTLS is enabled, we rely on it to ensure the connection is secure and that calls are authenticated as coming from Dapr runtimes (ideally, users' apps should not be able to call the Actors service directly, although doing so would not risk data corruption).

TODO: Add namespace support to these APIs, to allow namespaced apps.

The following gRPC methods are defined:

/*
Copyright 2023 The Dapr Authors
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

syntax = "proto3";

package dapr.proto.actors.v1;

import "google/protobuf/timestamp.proto";
import "google/protobuf/duration.proto";

option go_package = "github.com/dapr/dapr/pkg/proto/actors/v1;actors";

service Actors {
    // ServiceInfo returns information on the actors service, including the version.
    rpc ServiceInfo(ServiceInfoRequest) returns (ServiceInfoResponse) {}

    // ConnectHost is used by the Dapr sidecar to register itself as an actor host.
    // It remains active as a long-lived bi-di stream to allow for the Actors service
    // to communicate with the sidecar, including for health-checks.
    rpc ConnectHost(stream ConnectHostClientStream) returns (stream ConnectHostServerStream) {}

    // LookupActor returns the address of an actor.
    // If the actor is not active yet, it returns the address of an actor host capable of hosting it.
    rpc LookupActor(LookupActorRequest) returns (LookupActorResponse) {}

    // ReportActorDeactivation is sent to report an actor that has been deactivated.
    rpc ReportActorDeactivation(ReportActorDeactivationRequest) returns (ReportActorDeactivationResponse) {}

    // CompleteReminder is sent to report that a reminder has been completed.
    // Hosts are expected to inform the Actors service when a reminder is completed, so it's possible to guarantee at-least-once delivery.
    // Repeating reminders are re-queued for the next execution, and non-repeating reminders are deleted.
    rpc CompleteReminder(CompleteReminderRequest) returns (CompleteReminderResponse) {}

    // CreateReminder creates a new reminder.
    // If a reminder with the same ID (actor type, actor ID, name) already exists, it's replaced.
    rpc CreateReminder(CreateReminderRequest) returns (CreateReminderResponse) {}

    // GetReminder returns details about an existing reminder.
    rpc GetReminder(GetReminderRequest) returns (GetReminderResponse) {}

    // DeleteReminder removes an existing reminder before it fires.
    rpc DeleteReminder(DeleteReminderRequest) returns (DeleteReminderResponse) {}
}

message ServiceInfoRequest {
    // Empty for now
}

message ServiceInfoResponse {
    // Version of the actors service.
    // This is indicated as an integer.
    uint32 version = 1;
}

// ConnectHostClientStream is sent by the Dapr sidecar to the Actors service.
// The first message in the stream must contain the required fields; subsequent messages could be empty, but including fields is allowed to provide updates.
message ConnectHostClientStream {
    // Message to include.
    // This is optional, and no message indicates a simple ping (for health checks).
    // However, the first message sent must include a RegisterActorHost.
    oneof message {
        // The first message sent in ConnectHost by the sidecar must contain RegisterActorHost.
        // The sidecar can re-send this message at any time to update its registration.
        RegisterActorHost register_actor_host = 1;

        // Instructs the Actors service to temporarily pause delivering reminders.
        ReminderBackOff reminder_back_off = 2;
    }
}

// RegisterActorHost is sent by the Dapr sidecar to the Actors service.
// It includes information on the current sidecar's actor hosting capabilities.
message RegisterActorHost {
    // Address, including port
    // Required on the first message; cannot be updated
    string address = 1;
    // Dapr App ID
    // Format is 'namespace/app-id' or just 'app-id'
    // Required on the first message; cannot be updated
    string app_id = 2;
    // Version of the Actor APIs supported by the Dapr runtime
    // Required on the first message; cannot be updated
    uint32 api_level = 3;
    // List of supported actor types.
    repeated ActorHostType actor_types = 4;
}

// ReminderBackOff is sent by the Dapr sidecar to the Actors service.
// When the Dapr sidecar sends this message to the Actors service, the Actors service pauses delivering reminders to this Dapr sidecar for a period of time.
message ReminderBackOff {
    // Requested pause duration.
    // This is optional and defaults to 1s if empty.
    google.protobuf.Duration pause = 1;
}

// ActorHostType references a supported actor type.
message ActorHostType {
    // Actor type name
    string actor_type = 1;
    // Actor idle timeout, in seconds
    uint32 idle_timeout = 2;
    // Maximum number of reminders concurrently active on a host for the given actor type
    // A value of 0 means no limit
    uint32 concurrent_reminders_limit = 3;
}

// ConnectHostServerStream is sent by the Actors service to the Dapr sidecar.
// The message could be empty, in which case it acts as a response to a "ping" message.
message ConnectHostServerStream {
    // Message to include.
    // This is optional, and no message indicates a simple ping.
    oneof message {
        // Send certain configuration options for the actor subsystem to the actor host.
        // This is normally sent in response to the first message from the actor host, but can be sent as update at any time.
        ActorHostConfiguration actor_host_configuration = 1;
        // Start the execution of a reminder.
        ExecuteReminder execute_reminder = 2;
        // Deactivate an actor
        DeactivateActor deactivate_actor = 3;
    }
}

// ActorHostConfiguration is one of the messages that can be sent by ConnectHostServerStream.
// It contains certain configuration options for the actor subsystem.
// This is normally sent in response to the first message from the actor host, but can be sent as update at any time.
message ActorHostConfiguration {
    // Maximum interval for the actor host to send pings to the actors service.
    uint32 health_check_interval = 1;
}

// ExecuteReminder is one of the messages that can be sent by ConnectHostServerStream.
// It is sent to tell the actor host to execute a reminder.
// The actor host is expected to respond right away, and process the reminder asynchronously.
message ExecuteReminder {
    // Reminder that is to be executed.
    Reminder reminder = 1; 
    // Token that the actor host needs to send back to confirm the reminder was executed completely.
    string completion_token = 2;
}

// ActorRef contains the reference to an actor.
message ActorRef {
  string actor_type = 1;
  string actor_id = 2;
}

// DeactivateActor is one of the messages that can be sent by ConnectHostServerStream.
// It is sent to tell the sidecar to deactivate an actor.
message DeactivateActor {
    ActorRef actor = 1;
}

message LookupActorRequest {
    // Actor reference.
    ActorRef actor = 1;
    // Always fetch from the database, and do not return cached values if present.
    bool no_cache = 2;
}

message LookupActorResponse {
    // Dapr App ID of the host
    string app_id = 1;
    // Host address (including port)
    string address = 2;
    // Actor idle timeout, in seconds
    // (Note that this is the absolute idle timeout, and not the remaining lifetime of the actor)
    uint32 idle_timeout = 3;
}

message ReportActorDeactivationRequest {
    ActorRef actor = 1;
}

message ReportActorDeactivationResponse {
    // Empty for now
}

message ReminderRef {
    string actor_type = 1;
    string actor_id = 2;
    // Name of the reminder
    string name = 3;
}

message Reminder {
    string actor_type = 1;
    string actor_id = 2;
    // Name of the reminder
    string name = 3;
    // Execution time - either one of execution_time or delay is required
    google.protobuf.Timestamp execution_time = 4;
    // Delay from current time; will be parsed in any format supported by Dapr actors - either one of execution_time or delay is required
    google.protobuf.Duration delay = 5;
    // Can be empty; will be parsed in any format supported by Dapr actors
    string period = 6;
    // Can be empty
    google.protobuf.Timestamp ttl = 7;
    // Can be empty
    bytes data = 8;
}

message CompleteReminderRequest {
    // Reminder reference.
    ReminderRef ref = 1;
    // Token that was sent with the ExecuteReminder message.
    string completion_token = 2;
    // When true, repeating reminders are deleted after being executed.
    // This is a no-op if the reminder doesn't repeat.
    bool stop_reminder = 3;
}

message CompleteReminderResponse {
    // Empty for now
}

message CreateReminderRequest {
    Reminder reminder = 1;
}

message CreateReminderResponse {
    // Empty for now
}

message GetReminderRequest {
    ReminderRef ref = 1;
}

message GetReminderResponse {
    Reminder reminder = 1;
}

message DeleteReminderRequest {
    ReminderRef ref = 1;
}

message DeleteReminderResponse {
    // Empty for now
}

Metrics

Each instance of the Actors service exposes a Prometheus-compatible metrics endpoint that includes the following (in addition to standard Go metrics):

  • Latency of writes to database
  • Average time it takes for an app to acknowledge the reminder (from the time the call is sent to the response only)
  • Average reminder execution time (from the moment the transaction is started to when it's committed; it is a superset of the previous)
    • Could indicate that the target apps and/or the Actors service are overworked.
  • Average delay between a reminder's due time and when it's sent to the app
    • Could indicate the need to scale the Actors service
  • Number of reminders currently in-memory in the instance
  • Number of reminders executed per actor type (cumulative)
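
A sketch of how these could be defined with the Prometheus Go client; the metric names, types, and labels below are illustrative assumptions, not the final ones.

package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical metric definitions covering the list above.
var (
	dbWriteLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "actors_db_write_latency_seconds",
		Help: "Latency of writes to the database.",
	})
	reminderTriggerDelay = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "actors_reminder_trigger_delay_seconds",
		Help: "Delay between a reminder's due time and when it is sent to the app.",
	})
	remindersInMemory = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "actors_reminders_queued",
		Help: "Number of reminders currently in-memory in this instance.",
	})
	remindersExecuted = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "actors_reminders_executed_total",
		Help: "Cumulative number of reminders executed, per actor type.",
	}, []string{"actor_type"})
)

func init() {
	prometheus.MustRegister(dbWriteLatency, reminderTriggerDelay, remindersInMemory, remindersExecuted)
}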

Additional notes for reminders

Missed repeated reminders

The current ("v1") implementation of Actor Reminders executes all past-due repeating reminders as soon as the service starts. For example, if a reminder is scheduled to be repeated every 5 seconds and the sidecar hosting that reminder goes offline for 1 minute, when it comes back up it will fire 12 reminders, one after the other.

The v2 proposal behaves differently: it "condenses" all past-due occurrences into one. In the same example, where the instance of the Actors service that owns the reminder goes offline for 1 minute, the actor would receive only 1 reminder. This is a natural consequence of how the queue works and how periodic reminders are re-enqueued, as shown in the sketch below.
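
A minimal sketch of the re-enqueuing logic that produces this behavior, assuming a fixed-duration period (the real implementation has to handle the richer period formats supported by Dapr):

package reminders

import "time"

// nextExecution returns the next execution time for a repeating reminder.
// If one or more occurrences were missed, they are "condensed" into a single
// future occurrence instead of being fired one after the other.
func nextExecution(last time.Time, period time.Duration, now time.Time) time.Time {
	next := last.Add(period)
	if next.After(now) {
		return next
	}
	// Skip all occurrences that are already in the past.
	missed := int64(now.Sub(next)/period) + 1
	return next.Add(time.Duration(missed) * period)
}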

Although this is a change in behavior, the authors of this document consider it as a welcome and desirable one.

Upgrade path

We expect Actors v1 and v2 to co-exist for the foreseeable future, and users will be able to choose which implementation to use by setting a value in the Helm chart.

We plan on offering an upgrade path in the future that allows migrating from Actors v1 to v2, by performing a migration of all the reminders in the state store. This requires:

  • A coordination mechanism as proposed in dapr/dapr#6838
  • An internal actor will probably be leveraged to perform the actual data migration
  • The migration should be possible with zero downtime for the placement subsystem; for reminders, there will be a brief pause in their execution while they are migrated to the new storage.

Performance tests

Select performance tests for Actor Reminders and Workflow, using a technology preview build of Project Emmy.

  • Tests were run using Dapr "master" branch and Project Emmy (based on Dapr "master" branch) on 2023-12-07
  • Tests run against an Azure Kubernetes Service cluster with 2 x Standard_B4ms nodes
Actors "v1" Project Emmy Improvement
TestWorkflowWithConstantVUs: [T_30_300]: Test duration (s)
Lower is better
34.589 17.624 -49.05%
TestWorkflowWithConstantVUs: [T_30_300]: Req Duration P90 (ms)
Lower is better
3.535 1.910 -45.97%
WorkflowWithConstantIterations: [T_30_300]: Test duration (s)
Lower is better
38.057 16.392 -56.93%
WorkflowWithConstantIterations: [T_30_300]: Req Duration P90 (ms)
Lower is better
4.276 1.789 -58.16%
WorkflowWithConstantIterations: [T_60_300]: Test duration (s)
Lower is better
42.580 18.920 -55.57%
WorkflowWithConstantIterations: [T_60_300]: Req Duration P90 (ms)
Lower is better
9.524 4.476 -53.00%
WorkflowWithConstantIterations: [T_90_300]: Test duration (s)
Lower is better
45.435 18.530 -59.22%
WorkflowWithConstantIterations: [T_90_300]: Req Duration P90 (ms)
Lower is better
13.930 6.077 -56.37%
SeriesWorkflowWithMaxVUs: [T_280_1400]: Test duration (s)
Lower is better
424.598 93.288 -78.03%
SeriesWorkflowWithMaxVUs: [T_280_1400]: Req Duration P90 (ms)
Lower is better
108.445 26.851 -75.24%
ParallelWorkflowWithMaxVUs: [T_90_450]: Test duration (s)
Lower is better
113.608 22.527 -80.17%
ParallelWorkflowWithMaxVUs: [T_90_450]: Req Duration P90 (ms)
Lower is better
28.825 5.978 -79.26%
WorkflowOnMultipleInstances-2: [T_80_800]: Test duration (s)
Lower is better
Test failed 94.757
WorkflowOnMultipleInstances-2: [T_80_800]: Req Duration P90 (ms)
Lower is better
Test failed 24.585
WorkflowOnMultipleInstances-3: [T_80_800]: Test duration (s)
Lower is better
Test failed 96.246
WorkflowOnMultipleInstances-3: [T_80_800]: Req Duration P90 (ms)
Lower is better
Test failed 24.953
ActorReminder: QPS (target=500)
Higher is better
30.96 499.91
(max test target is 500)
+1,614%