This document is intended to be informative and not authoritative. The library may change its internal behavior regarding error handling without notice.
This document describes how the library internally deals with errors, so that you can get a better idea of what may be happening when debugging various issues.
The first thing to note about errors is that the library categorizes errors into two distinct types: Data Errors and Network Errors.
A data error is a negative reply returned by the server when an operation has failed (e.g. ENOENT), or an error returned by the library itself due to bad user input (e.g. OPTIONS_CONFLICT). Note that this definition differs slightly from the one in the API documentation, since this document describes errors from the library's point of view.
When a data error is encountered, it is simply returned to the user, and it is up to the application to determine how to proceed. Simply put: the library merely passes any kind of data error over to the application and does no further internal processing.
A Network Error is a failure which may potentially be the result of either
a slow network or a stale cluster topology. Examples of network errors are
closed sockets, timeouts and the internal NOT_MY_VBUCKET protocol responses.
In all these cases, the client will internally attempt a corrective action
with the goal that eventually a new configuration will arrive and will
instruct it as to the next step to take.
In conclusion, the library will only perform internal error handling for errors which may be the result of a stale configuration.
A cluster configuration is a JSON object which describes the nodes in the cluster. Specifically, it contains:
- The nodes that are part of the cluster. Each node is assigned a numeric index starting at 0.
- A vBucket map which instructs the client about which node is to receive commands for a given key.
The cluster configuration is received during the "bootstrap" phase of the client,
either via the memcached protocol, or via HTTP. You may inspect a cluster's current
configuration by retrieving the JSON at http://yourcluster:8091/pools/default/b/default,
where b/default requests the bucket named default.
Internally a bucket is initialized with a constant number of vBuckets (the number of vBuckets depends on the platform the cluster is running on, among other variations; the default on Linux is 1024 vBuckets).
Throughout the lifetime of the bucket, any given key will always map to one of those vBuckets.
vBuckets themselves are mapped to nodes, where each node in the cluster is considered
active for a certain number of vBuckets (typically the number of active vBuckets per
node will be distributed evenly, using the formula
vbuckets_per_node = total_vbuckets/number_of_nodes).
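The formula above can be written out as a sketch. The remainder handling here is an illustrative assumption (the cluster's actual assignment also balances replica vBuckets, which is not shown):

```cpp
#include <vector>

// Sketch of the even-distribution formula:
//   vbuckets_per_node = total_vbuckets / number_of_nodes
// When the total does not divide evenly, the remainder is spread
// one extra vBucket per node (an assumption for this sketch).
std::vector<int> vbucketsPerNode(int totalVbuckets, int numNodes) {
    std::vector<int> counts(numNodes, totalVbuckets / numNodes);
    for (int i = 0; i < totalVbuckets % numNodes; ++i)
        counts[i] += 1; // distribute the remainder
    return counts;
}
```

For the Linux default of 1024 vBuckets on a four-node cluster, each node is active for 256 vBuckets.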
As nodes are added or removed, a rebalance takes place which changes the mappings of vBuckets to nodes. If a node is added, for example, it will take ownership of a given number of vBuckets. If a node is removed, another node (or typically several nodes) will take ownership of those vBuckets previously owned by the removed node.
The client must receive a cluster configuration in order to know where to send given operations for keys. To determine where to send a command, it will inspect the key for the command, map the key to a vBucket, determine which node index is active for that vBucket, and finally send the command to that node.
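This key-to-node lookup can be sketched in isolation. The plain CRC32 hash, the toy vBucket map, and the function names below are simplifications for illustration, not the library's actual implementation:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Plain CRC32 (reflected polynomial 0xEDB88320), computed bitwise.
static uint32_t crc32_of(const std::string &key) {
    uint32_t crc = 0xFFFFFFFFu;
    for (std::string::size_type i = 0; i < key.size(); ++i) {
        crc ^= (unsigned char)key[i];
        for (int b = 0; b < 8; ++b)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// Map a key to one of the bucket's vBuckets (hypothetical helper).
int vbucket_for_key(const std::string &key, int num_vbuckets) {
    return (int)(crc32_of(key) % (uint32_t)num_vbuckets);
}

// Map the key's vBucket to the node index active for it, given a
// vBucket map in which entry i is the node index for vBucket i.
int node_for_key(const std::string &key, const std::vector<int> &vbucket_map) {
    int vb = vbucket_for_key(key, (int)vbucket_map.size());
    return vbucket_map[vb]; // node index, per the cluster configuration
}
```

The key property is that the same key always hashes to the same vBucket; only the vBucket-to-node mapping changes as the topology changes.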
During a rebalance, the cluster will move vBucket ownerships between nodes (as
discussed above). Each time a vBucket changes nodes, the cluster creates a new
configuration, and the previous configuration is considered stale. When the
client makes use of a stale configuration, it may end up sending a command to
the wrong node (i.e. the previous owner of the vBucket); if the node is still
in the cluster it will respond with a NOT_MY_VBUCKET error, indicating that
the node is no longer considered active for the vBucket (and thus the key which
is mapped to the vBucket).
A cluster configuration changes when the cluster topology changes. A topology change happens when:
- Nodes are added
- Nodes are removed
- Nodes are failed over
In all the above cases, the cluster will update the configuration to add/remove the affected nodes, and change the vBucket mapping accordingly.
In the C library, however, a new configuration is only retrieved reactively: once the client detects a possible configuration issue, it will attempt to retrieve a new configuration.
It may be tempting to ask "why doesn't the cluster just notify the client when a new configuration is available?". The answer is as follows:
In order for the client to pre-emptively get configuration updates from the cluster, it must be able to receive a configuration update from any node, which means a connection for each node. Because the memcached protocol does not support unsolicited responses, it would mean that the client would need to maintain connections to the HTTP streaming API for each individual node, thereby doubling the number of connections at the client (and increasing the overall number of connections to the main cluster process, for which the cost of a single connection is not trivial).
Note that it is still possible to get semi-preemptive configuration updates
using a single connection to the HTTP endpoint; however this is not always
reliable since the node which is serving the configuration to the current
client instance may itself have been failed over by the cluster, or may
itself be experiencing network difficulties; nevertheless, it is possible
to enable this mode of operation by passing the bootstrap_on=http option
in the connection string.
The library does not regularly poll for a new configuration, nor does it perform any kind of regular "probing" to determine if a given node is alive or dead. This limitation is specific to the typical usage modes of the C library: synchronous blocking applications.
In synchronous applications, the library is only able to perform
network operations from within the context of the lcb_wait() API; and
thus any polling would be dependent on when the application actually calls
this API function.
With some minor effort, applications can configure the library to manually
poll for a new configuration, or to manually poll and determine whether all
the nodes are alive. Manually retrieving a new configuration may be done
with the lcb_refresh_config() API, while manually checking that all
the nodes are alive may be done with the lcb_server_versions3() or
lcb_server_versions() API. The "Server Versions" command will effectively
check each node for connectivity, because it will send a request to each
node requesting its version.
Of course, polling with a synchronous application is only useful if your application is also threaded (in which case you can have a "polling thread"):
pthread_mutex_lock(&instance_lock);
lcb_refresh_config(instance);
lcb_server_versions3(instance, NULL, NULL);
lcb_wait3(instance, LCB_WAIT_NOCHECK);
pthread_mutex_unlock(&instance_lock);

When the client deems it necessary to request a new configuration from the cluster, it will go through an internal subsystem known as "Configuration Providers". This is a collection of various methods which can all asynchronously retrieve an updated configuration from the cluster and supply it back to the library core. Providers are tried in order of priority, and a configuration is considered to be updated when the current provider indicates so.
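The provider chain can be sketched as follows. The Provider type and the callback shape here are assumptions for illustration; they do not mirror the library's internal structures:

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical provider: a name plus a fetch callback that fills in a
// configuration string and returns true on success.
struct Provider {
    std::string name;
    std::function<bool(std::string &)> fetch;
};

// Try each provider in priority order; the first success wins.
std::string fetchConfig(const std::vector<Provider> &providers) {
    std::string config;
    for (std::vector<Provider>::const_iterator it = providers.begin();
         it != providers.end(); ++it) {
        if (it->fetch(config)) return config; // provider supplied a config
    }
    return ""; // all providers exhausted
}
```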
The default and primary provider is called CCCP and makes use of the internal
memcached GET_CLUSTER_CONFIG command. This simply requests a given server to
provide its current copy of the configuration in the response. When the client
requests a configuration via the CCCP provider, the provider will simply send
a command to a given server (the server is selected using a round-robin
algorithm) and await the response. If the first response fails, it will
try the next server, until all the servers have been exhausted.
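A sketch of that fetch loop, with a hypothetical per-server callback standing in for the GET_CLUSTER_CONFIG request:

```cpp
#include <functional>
#include <vector>

// Hypothetical CCCP fetch loop: starting from a round-robin position,
// try each server in turn until one supplies a config (callback returns
// true) or all servers have been exhausted. Returns the index of the
// server that responded, or -1 on total failure.
int fetchViaCccp(int startIndex,
                 const std::vector<std::function<bool()> > &servers) {
    int n = (int)servers.size();
    for (int i = 0; i < n; ++i) {
        int ix = (startIndex + i) % n; // round-robin from current position
        if (servers[ix]()) return ix;  // this server supplied a config
    }
    return -1; // all servers exhausted
}
```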
The next provider (and the one which is used on memcached buckets, and
on couchbase buckets on clusters version 2.2 and lower) is HTTP. This
opens a persistent streaming connection to a given node (selected at
random during initialization). The HTTP connection uses a special chunked
encoding mechanism which will "push" new configuration updates as they
arrive. When the client requests a configuration via this provider, the
provider will wait a given amount of time for a new response; and if no
response has been received within that interval, will close the current
streaming connection and open a new streaming connection with the next
node, using a round-robin algorithm.
In all cases, the configuration refresh does not guarantee a new configuration (for example, the "new" configuration may be identical to the current one), but merely uses a best-effort logic in requesting the server to provide the client with what the server deems current.
Note that in most cases, the configuration request mechanism is subject to throttling, meaning that the number of times the library's core may request a new configuration is rate limited, so that a new configuration is fetched at most once per a given interval (this interval is configurable, and is 10 seconds by default). The reason for this is that excessive configuration requests, especially if the HTTP provider is being used, can slow down the network and cluster. See below for more information about throttling.
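The throttling behavior can be sketched as a simple rate limiter. The class and its names are hypothetical, with times in seconds:

```cpp
// Hypothetical refresh throttle: allow at most one configuration
// request per interval (the text's default interval is 10 seconds).
class ConfigThrottle {
  public:
    explicit ConfigThrottle(long interval)
        : interval_(interval), last_(-interval) {}

    // Returns true if a refresh may proceed at time `now` (seconds),
    // and records it; returns false while still inside the interval.
    bool shouldRefresh(long now) {
        if (now - last_ < interval_) return false; // throttled
        last_ = now;
        return true;
    }

  private:
    long interval_; // minimum seconds between refreshes
    long last_;     // time of the last permitted refresh
};
```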
There are a number of ways the client may internally trigger a configuration request.
When a NOT_MY_VBUCKET is received, the client takes this as a clear indication
that the current configuration it is using is no longer the most recent one, and
thus requests a new configuration from the cluster. If CCCP is enabled (which is
the default for clusters 2.5 and greater), then the NOT_MY_VBUCKET reply already
contains the updated configuration inside of it, and therefore the library
does not need to explicitly fetch a new configuration.
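That fast path might look like the following sketch; the Reply type, the status constant's value, and the function are all hypothetical stand-ins:

```cpp
#include <string>

// Hypothetical reply structure: a status code plus a body which, with
// CCCP, may carry the updated JSON configuration.
struct Reply {
    int status;
    std::string body;
};

const int NOT_MY_VBUCKET = 7; // illustrative status value

// Returns the new configuration to apply directly, or "" if a separate
// fetch (e.g. via the HTTP provider) is needed.
std::string configFromReply(const Reply &r) {
    if (r.status == NOT_MY_VBUCKET && !r.body.empty()) {
        return r.body; // CCCP: config piggybacked on the error reply
    }
    return ""; // no embedded config; fetch explicitly
}
```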
A NOT_MY_VBUCKET is the typical way the client detects that the current
configuration is stale. However a configuration may be invalidated by other
means. Specifically, the NOT_MY_VBUCKET reply will happen only when:
- The node responding with the NOT_MY_VBUCKET is still part of the cluster, and
- The client actually requested a key mapped to that node.
To satisfy the first condition, the node must either be a node that is being added
to the cluster, or a node that is being removed, but the rebalance is not yet
completed. In the case when a node is failed over, for example, the node is
immediately removed from the cluster and will not send a NOT_MY_VBUCKET
(but will typically close the socket connection).
To satisfy the second condition, a client must actually request a key which is mapped to that vBucket (and thus, that node). This of course depends on how active the client is, and on the diversity of the keys the client is operating on: if the client is retrieving/modifying a single key, then it is entirely possible that all the other nodes in the cluster may have already been removed.
The likelihood of the client detecting a stale configuration from a NOT_MY_VBUCKET
is:
- Proportional to the frequency at which the client does operations
- Proportional to the diversity of the keys on which it operates
- Inversely proportional to the number of nodes in the cluster.
In some failover cases (specifically, when the cluster has no more replicas),
when a node is failed over, the cluster will send a new,
interim configuration in which all the vBuckets owned by that node will
be assigned to the node of index -1. This index indicates that there is no
node to serve that given vBucket. This places the cluster in a degraded
state, in which a subset of its data is to be unavailable until either the
failed-over node is fully ejected from the cluster, or comes back online.
When the client encounters such a condition, it will return the error
LCB_NO_MATCHING_SERVER and trigger a configuration refresh. This flow
will be executed by the client until the index for the vBucket is something
other than -1. Keep in mind that configuration refresh throttling is in
effect here, so if a node is failed over for a long time, the configuration
refreshes will still only happen as often as the throttle setting
allows it to. It is ultimately expected that the cluster administrator either
bring the node back online, or manually rebalance the cluster.
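A sketch of the -1 check described above; the names are illustrative, not the library's internals:

```cpp
#include <vector>

// Hypothetical lookup: a vBucket mapped to index -1 has no active node,
// so the lookup reports failure (mirroring LCB_NO_MATCHING_SERVER)
// rather than returning a node index.
bool lookupNode(const std::vector<int> &vbucketMap, int vbucket,
                int *nodeIndex) {
    int ix = vbucketMap[vbucket];
    if (ix == -1) {
        // Degraded state: the caller should fail the command and
        // schedule a (throttled) configuration refresh.
        return false;
    }
    *nodeIndex = ix;
    return true;
}
```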
Note that NO_MATCHING_SERVER errors will be returned in the callback, either immediately or after some delay, depending on whether there is an active configuration refresh (in which case the command waits until the configuration has been received), or a new refresh cannot yet be obtained due to throttling (in which case the callback is invoked immediately, but asynchronously, with the error). The reason for internally scheduling a command with no server in the retry queue is essentially to yield to the library's event loop so that it may request a new configuration transparently. Note that this is an implementation detail. In practice, applications should be prepared to handle LCB_NO_MATCHING_SERVER either in the callback or as a return code from the scheduling function.
Another trigger for a configuration refresh is a timeout or a socket failure. In the case of socket errors, this happens when the connection between the client and a given node is broken. The client takes this as a clear indication that the node it is talking to is no longer part of the cluster, and requests a new configuration.
Timeouts work much the same way, with the understanding that the client cannot be certain why the timeout took place. Specifically, timeouts may happen if the network is simply slow, or if there are insufficient CPU resources, either on the client or the server, to handle the current load. A timeout may also happen if there is a kernel panic on the cluster, or if the physical ethernet connection between client and server has been broken.
A timeout in the client only happens when there is a lack of activity within a certain timeframe (specifically, the value of LCB_CNTL_OP_TIMEOUT); there is no ETIMEDOUT status code that the server will provide for a given operation. It is recommended that applications use detailed_errcodes=1 in the connection string, or enable the LCB_CNTL_DETAILED_ERRCODES setting, to get more detailed error codes in the case of network failure.
In most cases, if a command has failed because of a network-type error, the client will internally retry it until it either receives a different error code or until the timeout for the given command has been exceeded. For this reason it is normally not recommended for applications to retry a command after they receive a network error, unless the application has other knowledge which suggests the operation may succeed on the next attempt.
Retrying may be configured through the various lcb_cntl() APIs. By default, all data commands (i.e. lcb_store3(), lcb_get3(), etc.) are retried.
There are some inherent race conditions (for example, if the network connection is broken during the execution of an INCR command, the client may end up reconnecting to the same server and re-sending the command, resulting in the INCR being applied twice). Race conditions may be avoided in most cases by using the cas parameter, though it is recommended to disable retries for INCR manually (the library does not do this by default, in order to remain compatible with the behavior of older versions)
using the LCB_CNTL_RETRYMODE setting:
int setting = LCB_RETRYOPT_CREATE(LCB_RETRY_ON_SOCKERR, LCB_RETRY_CMDS_SAFE);
lcb_cntl(instance, LCB_CNTL_SET, LCB_CNTL_RETRYMODE, &setting);

The LCB_RETRY_CMDS_SAFE setting will only allow a command to be retried if it is certain that retrying the command will not result in unintended modification of data.
Commands to be retried are placed in an internal queue. Commands from the retry
queue are retried either after a certain increasing interval has elapsed (which
throttles how often a command can be retried; see LCB_CNTL_RETRY_INTERVAL
and LCB_CNTL_RETRY_BACKOFF) or when a new configuration has been received.
Throttling is in effect for retried commands, so as to ensure that if the
network is under load, the client does not add to the load by retrying the
command in a busy loop.
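The increasing retry interval can be sketched as follows. The names mirror LCB_CNTL_RETRY_INTERVAL and LCB_CNTL_RETRY_BACKOFF, but the exact growth formula here is an assumption, not the library's implementation:

```cpp
#include <vector>

// Sketch of an increasing retry schedule: the first retry waits
// `interval`, and each subsequent wait is multiplied by `backoff`,
// so a command is retried less and less frequently over time.
std::vector<double> retrySchedule(double interval, double backoff,
                                  int attempts) {
    std::vector<double> waits;
    double w = interval;
    for (int i = 0; i < attempts; ++i) {
        waits.push_back(w);
        w *= backoff; // each retry waits longer than the previous one
    }
    return waits;
}
```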
Couchbase is a high performance distributed database. It maintains this high performance throughout topology changes, and even in the face of node failure.
High performance is attained because each node is mostly independent of other nodes (indeed, the node will replicate its content to the relevant targets, but this does not make an operation's success depend on the health of any given replica, unless durability constraints are employed).
Problems in throughput come about when attempting to perform operations against
a failed node. Failure in this context is understood to mean a node that is still
part of the cluster topology, but is otherwise unreachable; for example, a node
that has been disconnected from the network. For synchronous applications which
deal with operations in strict sequence, if any of the operations within the
logical sequence are mapped to the failed node, then the entire pipeline will
be blocked on that node. Consider this application sequence of a user
(user:gwf_hegel) "pairing" with several contacts, where there are 4 nodes in the
cluster, and node #2 has failed:
- Get user:gwf_hegel => Node #1 => OK (2ms)
- Add user:sfreud to a list within user:gwf_hegel => Node #1 => OK (1ms)
- Get user:sfreud => Node #2 (SLOWDOWN!!) => TIMEOUT (2.5s)
Despite the fact that the key user:gwf_hegel is very much available on node #1,
the other key, user:sfreud is mapped to the failed node (Node #2). Because the
application's logic depends on both items being accessible, the throughput may be
much slower.
The application may still maintain high throughput in such situations by following several suggestions and/or guidelines. Keep in mind these guidelines are dependent on the application logic and surrounding infrastructure.
To prevent an unreachable node from slowing down the application, the application must avoid operations to that node. This way, all operations will be performed only against known good nodes. How this is accomplished depends on the nature of the failure and the overall application/infrastructure logic.
If a node is unreachable because of a catastrophic failure (for example, a critical hardware failure), then failing over the node is the ideal option. The node may still be added back later. Failover is a relatively expensive operation, as it reduces the capacity of the cluster and may result in reverting some items to an older state (if they had not yet been replicated).
The Couchbase Web Console allows you to configure Auto Failover. This will make the cluster automatically fail over nodes which have been unreachable for a period of time. The minimum threshold interval for this feature is 30 seconds. Use this feature with care, as this may sometimes result in a node being failed over simply because it is slow (for example, during an immense flood of traffic). In this case, reducing the cluster capacity will typically increase the severity of the issue!
Couchbase's auto-failover detection is rather limited. Because Couchbase can only discover rudimentary statistics of another system over the network, it is by design rather conservative about auto-failover.
Applications may leverage existing infrastructure telemetry to preemptively fail over a node. Existing telemetry may be obtained via a variety of sources, most commonly server monitoring frameworks.
A clear case in which an application may utilize infrastructure information to invoke a failover would be in a virtualized environment. Assuming a virtual machine has been forcefully destroyed, infrastructure scripts could then query the hypervisor of the virtualized environment for the node's status. If the node is down, the infrastructure could fail over the node in a matter of seconds.
Avoiding a failed node is another way of ensuring high throughput in applications. However, this requires that the application:
- Knows which nodes are unreachable
- Has a defined policy for handling operations which cannot execute immediately.
Knowing which nodes are reachable may be done in several ways. The application may use existing infrastructure methods (per the above) to determine which nodes are down.
An alternate mechanism may be to maintain a client-local "blacklist" of nodes which may be unreachable, periodically refreshing this blacklist after a specified interval has elapsed. Consider the following code:
#include <libcouchbase/couchbase.h>
#include <ctime>
#include <map>
#include <string>

using std::map;
using std::string;

class Blacklist {
public:
    Blacklist(lcb_t instance) : instance(instance) {}

    bool isBlacklisted(const string& key) {
        string node = nodeForKey(key);
        if (entries.find(node) == entries.end()) {
            return false; // No blacklist entry for this node
        }
        time_t now = time(NULL);
        if (entries[node] < now) {
            entries.erase(node);
            return false; // Entry expired
        }
        return true;
    }

    void updateBlacklist(const string& key) {
        // Mark the node as blacklisted for 60 seconds
        entries[nodeForKey(key)] = time(NULL) + 60;
    }

private:
    lcb_t instance;
    map<string, time_t> entries;

    string nodeForKey(const string& key) {
        lcbvb_CONFIG *vbucketConfig;
        lcb_cntl(instance, LCB_CNTL_GET, LCB_CNTL_VBCONFIG, &vbucketConfig);
        int vbucketForKey, serverForVBucket;
        lcbvb_map_key(vbucketConfig, key.c_str(), key.size(),
                      &vbucketForKey, &serverForVBucket);
        return lcb_get_node(instance, LCB_NODE_DATA, serverForVBucket);
    }
};

// pendingOpQueue and retrievedError are application-defined (not shown)
void doOperation(Blacklist& bl, string& key) {
    if (bl.isBlacklisted(key)) {
        pendingOpQueue.put(key);
        return;
    }
    // Do operation here, and retrieve the error code via the callback
    if (LCB_EIFNET(retrievedError)) {
        pendingOpQueue.put(key);
        bl.updateBlacklist(key);
    }
}
In the above code, the application maintains a blacklist of nodes which the application assumes to be down, because of previous network errors received on them. In order to allow for the possibility that the node may recover, the entry is only valid for 60 seconds.
The sample code also shows how to obtain the node to which the key is mapped. This
may also be used to compare with an external topology/telemetry source to determine
if a node has failed (not shown here). Bear in mind that the node returned is in
host:port format, so in most cases you will likely want to strip the :port section.
For operations which cannot be completed because the node has been blacklisted, the application places them in a special pending queue to be handled later. This document deliberately does not show a sample implementation of such a queue, since its logic is highly dependent on the nature of the application (for example one may write pending operations to disk; discard them immediately; attempt to perform this operation against another cluster or database, etc.).