@mnunberg
Last active August 29, 2015 14:10
LCB Error handling

Library Internals: How the library deals with errors

This document is intended to be informative and not authoritative. The library may change its internal behavior regarding error handling without notice.

This document describes how the library internally deals with errors, so that you can get a better idea of what may be happening when debugging various issues.

Error Types

The first thing to note about errors is that the library categorizes errors into two distinct types: Data Errors and Network Errors.

A data error is a negative reply returned by the server when an operation has failed (e.g. ENOENT), or an error returned by the library due to bad user input (e.g. OPTIONS_CONFLICT). Note that this definition is slightly different from the one in the API documentation, since this describes the error from the library's point of view.

When a data error is encountered, it is simply returned to the user and it is up to the application to determine how to proceed. Simply put: the library merely passes any kind of data error over to the application and does no further internal processing.

A Network Error is a failure which may potentially be the result of either a slow network or a stale cluster topology. Examples of network errors are closed sockets, timeouts and the internal NOT_MY_VBUCKET protocol responses. In all these cases, the client will internally attempt a corrective action with the goal that eventually a new configuration will arrive and will instruct it as to the next step to take.

In conclusion, the library will only perform internal error handling for errors which may be the result of a stale configuration.

Cluster configurations

A cluster configuration is a JSON object which describes the nodes in the cluster. Specifically, it contains:

  • The nodes that are part of the cluster. Each node is assigned a numeric index starting at 0.
  • a vBucket map which instructs the client about which node is to receive commands for a given key.

The cluster configuration is received during the "bootstrap" phase of the client, either via the memcached protocol, or via HTTP. You may inspect a cluster's current configuration by retrieving the JSON at http://yourcluster:8091/pools/default/b/default, where b/default requests the bucket named default.
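For reference, an abbreviated sketch of the shape of such a configuration follows (host names and map values here are hypothetical). Each entry in vBucketMap lists the active node index for that vBucket first, followed by replica indexes:

```json
{
  "name": "default",
  "vBucketServerMap": {
    "hashAlgorithm": "CRC",
    "numReplicas": 1,
    "serverList": ["node1.example.com:11210", "node2.example.com:11210"],
    "vBucketMap": [[0, 1], [1, 0], [0, 1]]
  }
}
```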

vBucket mapping

Internally a bucket is initialized with a constant number of vBuckets (the number depends on the platform the cluster is running on and some other variations; the default on Linux is 1024 vBuckets).

Throughout the lifetime of the bucket, any given key will always map to one of those vBuckets.

vBuckets themselves are mapped to nodes, where each node in the cluster is considered active for a certain number of vBuckets (typically the number of active vBuckets per node will be distributed evenly, using the formula vbuckets_per_node = total_vbuckets/number_of_nodes).

As nodes are added or removed, a rebalance takes place which changes the mappings of vBuckets to nodes. If a node is added, for example, it will take ownership of a given number of vBuckets. If a node is removed, another node (or typically several nodes) will take ownership of those vBuckets previously owned by the removed node.

Client handling of Configuration

The client must receive a cluster configuration in order to know where to send given operations for keys. To determine where to send a command, it will inspect the key for the command, map the key to a vBucket, determine which node index is active for that vBucket, and finally send the command to that node.
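The lookup described above can be sketched as follows. This is a toy simulation, not the library's actual code: toyHash stands in for the real hashing function (the cluster uses a CRC-based algorithm), and the vBucket map is modeled as a plain vector of node indexes.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for the real hash (the cluster uses a CRC-based algorithm).
static unsigned toyHash(const std::string &key) {
    unsigned h = 0;
    for (char c : key) h = h * 31 + static_cast<unsigned char>(c);
    return h;
}

// Map a key to a vBucket, then to the node index that is active for it.
// vbucketMap[vb] holds the active node index for vBucket vb.
int nodeForKey(const std::string &key, const std::vector<int> &vbucketMap) {
    unsigned vb = toyHash(key) % vbucketMap.size(); // key -> vBucket
    return vbucketMap[vb];                          // vBucket -> node index
}
```

Because the hash is deterministic, a given key always maps to the same vBucket; only the vBucket-to-node assignment changes as the configuration changes.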

During a rebalance, the cluster will move vBucket ownerships between nodes (as discussed above). Each time a vBucket changes nodes, the cluster creates a new configuration, and the previous configuration is considered stale. When the client makes use of a stale configuration, it may end up sending a command to the wrong node (i.e. the previous owner of the vBucket); if the node is still in the cluster it will respond with a NOT_MY_VBUCKET error, indicating that the node is no longer considered active for the vBucket (and thus the key which is mapped to the vBucket).

Configuration Changes

A cluster configuration changes when the cluster topology changes. A topology change happens when:

  • Nodes are added
  • Nodes are removed
  • Nodes are failed over

In all the above cases, the cluster will update the configuration to add/remove the affected nodes, and change the vBucket mapping accordingly.

In the C library however, a new configuration is only retrieved reactively: Once the client detects a possible configuration issue, it will attempt to retrieve a new configuration.

Why Reactive

It may be tempting to ask "why doesn't the cluster just notify the client when a new configuration is available?". The answer is as follows:

In order for the client to pre-emptively get configuration updates from the cluster, it must be able to receive a configuration update from any node, which means a connection for each node. Because the memcached protocol does not support unsolicited responses, it would mean that the client would need to maintain connections to the HTTP streaming API for each individual node, thereby doubling the number of connections at the client (and increasing the overall number of connections to the main cluster process, for which the cost of a single connection is not trivial).

Note that it is still possible to get semi-preemptive configuration updates using a single connection to the HTTP endpoint. However, this is not always reliable, since the node serving the configuration to the current client instance may itself have been failed over by the cluster, or may itself be experiencing network difficulties. Nevertheless, it is possible to enable this mode of operation by passing the bootstrap_on=http option in the connection string.

Polling for most recent configuration

The library does not regularly poll for a new configuration, nor does it perform any kind of regular "probing" to determine if a given node is alive or dead. This limitation is specific to the typical usage modes of the C library: synchronous blocking applications.

In synchronous applications, the library is only able to perform network operations from within the context of the lcb_wait() API; and thus any polling would be dependent on when the application actually calls this API function.

With some minor effort, applications can configure the library to manually poll for a new configuration, or to manually check whether all the nodes are alive. Manually retrieving a new configuration may be done with the lcb_refresh_config() API, while manually checking that all the nodes are alive may be done with the lcb_server_versions3() or lcb_server_versions() API. The "Server Versions" command effectively checks each node for connectivity, because it sends a request to each node asking for its version.

Of course, polling with a synchronous application is only useful if your application is also threaded (in which case you can have a "polling thread"):

pthread_mutex_lock(&instance_lock);         /* serialize access with worker threads */
lcb_refresh_config(instance);               /* request an updated configuration */
lcb_server_versions3(instance, NULL, NULL); /* probe each node for connectivity */
lcb_wait3(instance, LCB_WAIT_NOCHECK);      /* run the event loop until done */
pthread_mutex_unlock(&instance_lock);

What happens when a client requests a configuration

When the client deems it necessary to request a new configuration from the cluster, it will go through an internal subsystem known as "Configuration Providers". This is a collection of various methods which can all asynchronously retrieve an updated configuration from the cluster and supply it back to the library core. Providers are tried in order of priority, and a configuration is considered to be updated when the current provider indicates so.

The default and primary provider is called CCCP and makes use of the internal memcached GET_CLUSTER_CONFIG command. This simply requests that a given server provide its current copy of the configuration in the response. When the client requests a configuration via the CCCP provider, the provider will send a command to a given server (selected via a round-robin algorithm) and await the response. If the first response fails, it will try the next server, until all the servers have been exhausted.
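The CCCP provider's fallback loop can be sketched as follows. This is a toy model with hypothetical names: fetch stands in for sending GET_CLUSTER_CONFIG to a single server and returns true if a configuration was received.

```cpp
#include <functional>
#include <string>
#include <vector>

// Sketch of the CCCP provider's fallback loop: starting from a round-robin
// cursor, ask each server in turn for its configuration until one answers
// or every server has been tried.
bool fetchConfig(const std::vector<std::string> &servers, size_t &cursor,
                 const std::function<bool(const std::string &)> &fetch) {
    for (size_t attempt = 0; attempt < servers.size(); ++attempt) {
        const std::string &srv = servers[cursor];
        cursor = (cursor + 1) % servers.size(); // advance round-robin cursor
        if (fetch(srv)) return true;            // got a configuration
    }
    return false; // all servers exhausted
}
```

The persistent cursor means successive refresh attempts start at different servers, spreading the load across the cluster.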

The next provider (and the one which is used on memcached buckets, and on couchbase buckets on clusters version 2.2 and lower) is HTTP. This opens a persistent streaming connection to a given node (selected at random during initialization). The HTTP connection uses a special chunked encoding mechanism which will "push" new configuration updates as they arrive. When the client requests a configuration via this provider, the provider will wait a given amount of time for a new response; and if no response has been received within that interval, will close the current streaming connection and open a new streaming connection with the next node, using a round-robin algorithm.

In all cases, the configuration refresh does not guarantee a new configuration (for example, the "new" configuration may be identical to the current one), but merely uses a best-effort logic in requesting the server to provide the client with what the server deems current.

Note that in most cases, the configuration request mechanism is subject to throttling, meaning that the number of times the library's core may request a new configuration is rate limited, so that a new configuration is fetched at most once per a given interval (this interval is configurable, and is 10 seconds by default). The reason for this is that excessive configuration requests, especially if the HTTP provider is being used, can slow down the network and cluster. See below for more information about throttling.
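The throttling behavior amounts to a simple timestamp check. The sketch below is a toy model (the names are not the library's), assuming the 10-second default interval mentioned above:

```cpp
#include <ctime>

// Sketch of configuration-request throttling: a refresh is permitted at most
// once per `intervalSec`; requests arriving sooner are simply dropped.
struct ConfigThrottle {
    time_t lastRefresh;
    int intervalSec;
    explicit ConfigThrottle(int interval) : lastRefresh(0), intervalSec(interval) {}

    // Returns true if a refresh may proceed now, and records the attempt.
    bool shouldRefresh(time_t now) {
        if (now - lastRefresh < intervalSec) return false; // throttled
        lastRefresh = now;
        return true;
    }
};
```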

Triggering configuration requests

There are a number of ways the client may internally trigger a configuration request.

NOT_MY_VBUCKET responses

When a NOT_MY_VBUCKET is received, the client takes this as a clear indication that the current configuration it is using is no longer the most recent one, and thus requests a new configuration from the cluster. If CCCP is enabled (which is the default for clusters 2.5 and greater), then the NOT_MY_VBUCKET reply already contains the updated configuration inside of it, and therefore the library does not need to explicitly fetch a new configuration.

A NOT_MY_VBUCKET is the typical way the client detects that the current configuration is stale. However, a configuration may be invalidated by other means. Specifically, the NOT_MY_VBUCKET reply will happen only when:

  • The node responding with the NOT_MY_VBUCKET is still part of the cluster, and
  • The client actually requested a key mapped to that node

To satisfy the first condition, the node must either be a node that is being added to the cluster, or a node that is being removed while the rebalance is not yet complete. When a node is failed over, for example, it is immediately removed from the cluster and will not send a NOT_MY_VBUCKET (but will typically close the socket connection).

To satisfy the second condition, a client must actually request a key which is mapped to that vBucket (and thus, that node). This of course depends on how active the client is, and on the diversity of the keys the client is operating on: if the client is retrieving/modifying only a single key, then it is entirely possible that all the other nodes in the cluster may have already been removed without the client noticing.

The likelihood of the client detecting a stale configuration from a NOT_MY_VBUCKET is:

  1. Proportional to the frequency at which the client does operations
  2. Proportional to the diversity of the keys on which it operates
  3. Inversely proportional to the number of nodes in the cluster.
Missing node for vBucket

In some failover cases (specifically, when the cluster has no more replicas), when a node is failed over, the cluster will send a new, interim configuration in which all the vBuckets owned by that node are assigned to the node index -1. This index indicates that there is no node to serve the given vBucket. This places the cluster in a degraded state, in which a subset of its data will be unavailable until the failed-over node is either fully ejected from the cluster or comes back online.

When the client encounters such a condition, it will return the error LCB_NO_MATCHING_SERVER and trigger a configuration refresh. This flow will be executed by the client until the index for the vBucket is something other than -1. Keep in mind that configuration refresh throttling is in effect here, so if a node is failed over for a long time, the configuration refreshes will still only happen as often as the throttle setting allows it to. It is ultimately expected that the cluster administrator either bring the node back online, or manually rebalance the cluster.
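A minimal sketch of this check, with hypothetical names standing in for the real lcb_* status codes, might look like:

```cpp
#include <vector>

// Hypothetical status codes standing in for LCB_SUCCESS / LCB_NO_MATCHING_SERVER.
enum Status { OK, NO_MATCHING_SERVER };

// Sketch of the degraded-cluster check: an index of -1 in the vBucket map
// means no node currently serves the vBucket, so the operation fails (and,
// in the real library, a throttled configuration refresh is triggered).
Status lookupServer(const std::vector<int> &vbucketMap, int vb, int *serverOut) {
    int idx = vbucketMap[vb];
    if (idx < 0) {
        return NO_MATCHING_SERVER; // degraded: caller should refresh config
    }
    *serverOut = idx;
    return OK;
}
```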

Note that NO_MATCHING_SERVER errors will be returned in the callback either immediately or after some delay, depending on whether there is an active config refresh (in which case the command waits a bit until the configuration has been received), or whether a new refresh cannot be obtained yet due to throttling (in which case the callback is invoked immediately, but asynchronously, with the error). The reason for internally scheduling a command with no server in the retry queue is essentially to yield to the library's event loop so that it may request a new configuration transparently. Note that this is an implementation detail. In practice, applications should be prepared to handle LCB_NO_MATCHING_SERVER either in the callback or as a return code from the scheduling function.

Socket errors and timeouts

Another trigger for a configuration refresh is a timeout or a socket failure. In the case of socket errors, this happens when the connection between the client and a given node is broken. The client takes this as a clear indication that the node it is talking to is no longer part of the cluster, and requests a new configuration.

Timeouts work much the same way, with the understanding that the client cannot be certain why the timeout took place. Timeouts may happen if the network is simply slow, or if there are insufficient CPU resources (either on the client or the server) to handle the current load. A timeout may also happen if there is a kernel panic on the cluster, or if the physical ethernet connection between client and server has been broken.

A timeout in the client occurs only when there is a lack of activity within a certain timeframe (the value of LCB_CNTL_OP_TIMEOUT); there is no ETIMEDOUT status code that the server will provide for a given operation. It is recommended that applications use detailed_errcodes=1 in the connection string, or enable the LCB_CNTL_DETAILED_ERRCODES setting, to get more detailed error codes in the case of network failure.

Retrying commands

In most cases, if a command has failed because of a network-type error, the client will internally retry it until it either receives a different error code or the timeout for the given command has been exceeded. For this reason it is normally not recommended for applications to retry a command after they receive a network error, unless the application has other knowledge which suggests the operation may succeed on the next attempt.

Retrying may be configured through the various lcb_cntl() APIs. By default all data commands (i.e. lcb_store3(), lcb_get3(), etc.) are retried. There are some inherent race conditions (for example, if the network connection was broken during the execution of an INCR command, the client may end up reconnecting to the same server and re-sending the command, resulting in the INCR being applied twice). Race conditions may be avoided in most cases by using the cas parameter, though it is recommended to disable retries for INCR manually (the library does not do this by default, in order to be compatible with behavior in older versions) using the LCB_CNTL_RETRYMODE setting:

int setting = LCB_RETRYOPT_CREATE(LCB_RETRY_ON_SOCKERR, LCB_RETRY_CMDS_SAFE);
lcb_cntl(instance, LCB_CNTL_SET, LCB_CNTL_RETRYMODE, &setting);

The LCB_RETRY_CMDS_SAFE setting will only allow a command to be retried if it is certain that retrying the command will not result in unintended modification of data.

Commands to be retried are placed in an internal queue. Commands from the retry queue are retried either after a certain increasing interval has elapsed (which throttles how often a command can be retried; see LCB_CNTL_RETRY_INTERVAL and LCB_CNTL_RETRY_BACKOFF) or when a new configuration has been received. Throttling is in effect for retried commands, so as to ensure that if the network is under load, the client does not add to the load by retrying the command in a busy loop.
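The growing retry interval can be sketched as follows. This is a toy model mirroring the roles of LCB_CNTL_RETRY_INTERVAL (the base wait) and LCB_CNTL_RETRY_BACKOFF (the multiplier); the function name and the actual defaults are assumptions, not the library's:

```cpp
// Sketch of the growing retry interval: the first retry waits `baseUs`
// microseconds and each subsequent wait is multiplied by `backoff`.
long nextRetryInterval(long baseUs, double backoff, int attempt) {
    long interval = baseUs;
    for (int i = 0; i < attempt; ++i) {
        interval = static_cast<long>(interval * backoff);
    }
    return interval;
}
```

The effect is that a command stuck against an unresponsive node is retried less and less frequently until its own operation timeout expires.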

Throughput considerations

Couchbase is a high performance distributed database. It maintains this high performance throughout topology changes, and even in the face of node failure.

High performance is attained because each node is mostly independent of other nodes (indeed, the node will replicate its content to the relevant targets, but this does not make an operation's success depend on the health of any given replica, unless durability constraints are employed).

Problems in throughput come about when attempting to perform operations against a failed node. Failure in this context means a node that is still part of the cluster topology but is otherwise unreachable; for example, a node that has been disconnected from the network. For synchronous applications which deal with operations in strict sequence, if any of the operations within the logical sequence are mapped to the failed node, then the entire pipeline will be blocked on that node. Consider this application sequence for a user (user:gwf_hegel) "pairing" with several contacts, where there are 4 nodes in the cluster and node #2 has failed:

  • Get user:gwf_hegel => Node: #1 => OK (2ms)
  • Add user:sfreud to a list within user:gwf_hegel => Node #1 => OK (1ms)
  • Get user:sfreud => Node #2 (SLOWDOWN!!) => TIMEOUT (2.5s)

Despite the fact that the key user:gwf_hegel is very much available on node #1, the other key, user:sfreud, is mapped to the failed node (node #2). Because the application's logic depends on both items being accessible, throughput may be much slower.

The application may still maintain high throughput in such situations by following several suggestions and/or guidelines. Keep in mind these guidelines are dependent on the application logic and surrounding infrastructure.

Maintaining high throughput during node unreachability

To prevent an unreachable node from slowing down the application, the application must avoid operations to that node. This way, all operations will be performed only against known good nodes. How this is accomplished depends on the nature of the failure and the overall application/infrastructure logic.

Failing over the unreachable node

If a node is unreachable because of a catastrophic failure (for example, a critical hardware failure), then failing over the node is the ideal option. The node may still be added back later. Failover is a relatively expensive operation, as it reduces the capacity of the cluster and may result in reverting some items to an older state (if they had not yet been replicated).

The Couchbase Web Console allows you to configure Auto Failover. This will make the cluster automatically fail over nodes which have been unreachable for a period of time. The minimum threshold interval for this feature is 30 seconds. Use this feature with care, as this may sometimes result in a node being failed over simply because it is slow (for example, during an immense flood of traffic). In this case, reducing the cluster capacity will typically increase the severity of the issue!

Discovering unreachable nodes

Couchbase's auto-failover detection is rather limited. Because Couchbase can only discover rudimentary statistics of another system over the network, it is by design rather conservative about auto-failover.

Applications may leverage existing infrastructure telemetry to preemptively fail over a node. Existing telemetry may be obtained via a variety of sources, most commonly server monitoring frameworks.

A clear case in which an application may utilize infrastructure information to invoke a failover would be in a virtualized environment. Assuming a virtual machine has been forcefully destroyed, infrastructure scripts could then query the hypervisor of the virtualized environment for the node's status. If the node is down, the infrastructure could fail over the node in a matter of seconds.

Avoiding failed nodes

Avoiding a failed node is another way of ensuring high throughput in applications. However, this requires that the application:

  • Knows which nodes are unreachable
  • Has a defined policy for handling operations which cannot execute immediately.

Knowing which nodes are reachable may be done in several ways. The application may use existing infrastructure methods (per above) to determine which nodes are down.

An alternate mechanism may be to maintain a client-local "blacklist" of nodes which may be unreachable, periodically refreshing this blacklist after a specified interval has elapsed. Consider the following code:

class Blacklist {
public:
	Blacklist(lcb_t instance) : instance(instance) {}

	bool isBlacklisted(const string& key) {
		string node = nodeForKey(key);
		if (entries.find(node) == entries.end()) {
			return false; // No blacklist entry for this node
		} else {
			time_t now = time(NULL);
			if (entries[node] < now) {
				entries.erase(node);
				return false; // Entry expired
			} else {
				return true;
			}
		}
	}

	void updateBlacklist(const string& key) {
		string node = nodeForKey(key);
		// Mark the node as blacklisted for 60 seconds
		entries[node] = time(NULL) + 60;
	}

private:
	lcb_t instance;
	map<string, time_t> entries;

	string nodeForKey(const string& key) {
		lcbvb_CONFIG *vbucketConfig;
		lcb_cntl(instance, LCB_CNTL_GET, LCB_CNTL_VBCONFIG, &vbucketConfig);
		int vbucketForKey, serverForVBucket;
		lcbvb_map_key(vbucketConfig, key.c_str(), key.size(), &vbucketForKey, &serverForVBucket);
		return lcb_get_node(instance, LCB_NODE_DATA, serverForVBucket);
	}
};

void doOperation(Blacklist& bl, string& key) {
	if (bl.isBlacklisted(key)) {
		pendingOpQueue.put(key);
		return;
	}
	// Do the operation here, and retrieve the error code via the callback
	if (LCB_EIFNET(retrievedError)) {
		pendingOpQueue.put(key);
		bl.updateBlacklist(key);
	}
}

In the above code, the application maintains a blacklist of nodes which the application assumes to be down, because of previous network errors received on them. In order to allow for the possibility that the node may recover, the entry is only valid for 60 seconds.

The sample code also shows how to obtain the node to which the key is mapped. This may also be used to compare with an external topology/telemetry source to determine if a node has failed (not shown here). Bear in mind that the node returned is in host:port format; so in most cases you will likely want to strip the :port section.

For operations which cannot be completed because the node has been blacklisted, the application places them in a special pending queue to be handled later. This document deliberately does not show a sample implementation of such a queue, since its logic is highly dependent on the nature of the application (for example one may write pending operations to disk; discard them immediately; attempt to perform this operation against another cluster or database, etc.).
