The original design of external snapshots allows esnap clone lvols to successfully open before the external snapshot is present. A logical volume with a missing external snapshot is referred to as degraded. It was done this way because:
- It is hard for the SPDK administrator to control the order in which `examine_config` callbacks will be called.
- The consumer of an esnap clone may not actually need to perform reads from the external snapshot, because the clusters containing the requested blocks may have already been allocated in the esnap clone's blob.
- Esnap clones may be at the root of a deep tree of snapshots and clones. It would be complex to delay the online of all of these until the external snapshot is available.
- Immediate registration of bdevs simplifies management of degraded lvols.
  - `bdev_get_bdevs` allows degraded lvols to be seen. The module-specific section indicates when an lvol is degraded.
  - `bdev_lvol_delete` allows degraded esnap clones and their clones to be deleted. Note that there is no "snapshot of an esnap clone": when an esnap clone vol1 is snapshotted as vol2, vol2 becomes a read-only esnap clone and vol1 becomes a clone of vol2. This is important for cleaning up an lvolstore when an external snapshot is missing, especially if the external snapshot has been destroyed.
  - `bdev_lvol_rename` allows degraded esnap clones to be renamed. This may be important to make the name available for a replacement while the original is being debugged.
- Immediate registration reserves the lvol name and UUID so that no one else can squat on it.
The initial solution was to have reads that depend on a missing external snapshot immediately call the completion callback with `-EIO`. During review, concern was expressed that consumers will not handle `EIO` well, perhaps offlining sectors or even the entire device.
It is believed that external snapshot devices will usually be added shortly after the lvol is registered. Rather than failing the IOs, it is better to queue them. If the IOs cannot be completed within a timeout period (e.g. 30 seconds), they will be completed with an error.
The `blob_eio` blobstore device is no longer needed and will be removed.
A new `blob_queue` blobstore device will be added. It will interact with the lvol load, missing-esnap handling, and related parts of vbdev_lvol in much the same way as `blob_eio` does. The key difference is that the `read`, `readv`, and `readv_ext` callbacks will queue the IO. The replacement of a `blob_queue` with a `blob_bdev` is handled in `lvs_esnap_missing_hotplug()` as follows:
- The new `bs_dev` is created with `lvs->esnap_bs_dev_create()`.
- The blob is frozen to prevent any new IOs from being submitted to the blob.
- The queued IOs are submitted to the `bs_dev`.
- The blob's `back_bs_dev` is replaced with `spdk_blob_set_esnap_bs_dev()`.
- The blob is unfrozen.
Each channel that has queued IOs will have a poller that is set to fire when the next queued IO times out. This poller's callback will complete all expired IOs in the queue with `-EIO`.
In some cases, queueing would be detrimental. The most apparent case is that of `examine_disk()` callbacks on a newly registered degraded lvol. Today, `examine_disk()` is called asynchronously, with all bdev modules that registered `examine_disk()` callbacks racing. This is subject to change: see spdk#2855.

The fix for spdk#2855 discussed during a bug scrub or community call (or shortly thereafter) was to still call `examine_disk()` asynchronously, but to call the callbacks sequentially. That is, one module's call to `spdk_bdev_module_examine_done()` would trigger the next module to begin its `examine_disk()`. If an lvol's external snapshot is unavailable for a long period of time, this could mean (e.g.) 30 seconds of delay for each bdev module's `examine_disk()`. This would mean that `spdk_bdev_wait_for_examine()` could be waiting for several minutes, which is likely to be undesirable.
To avoid this, after opening a bdev, each module's `examine_disk()` callback should call:

```c
spdk_bdev_desc_set_flag(bdev_desc, SPDK_BDEV_IO_FAILFAST);
```
The following will be added to `lvol_read()` and `lvol_write()`:

```c
lvol_io->ext_io_opts.io_flags = spdk_bdev_io_get_desc_flags(bdev_io);
```

Note that it is needed in `lvol_write()` because a write can trigger a read when copy-on-write is needed.
`bs_dev_queue_readv_ext()` will contain:

```c
if (io_opts->io_flags & SPDK_BDEV_IO_FAILFAST) {
	cb_args->cb_fn(cb_args->channel, cb_args->cb_arg, -EIO);
	return;
}
/* queue IO */
```
This design strives to keep lvol's handling of missing esnaps out of the blobstore code. This means that queueing could happen both in blobstore.c (queued in an `spdk_bs_channel`) and in a channel allocated by `bs_dev_queue_create_channel()`. I don't think this is a problem, but it calls for multiple queue implementations.
With this option, lvol bdevs are not created until the external snapshot is present.
As an lvolstore is being loaded, two passes are taken. The first iterates all blobs, creating per-blob lvol structures that are added to the `lvs->lvols` list. During the second pass, all blobs are opened and the corresponding `lvol`s and `vbdev_lvol`s are created.
In the first pass, the opens happen in blob ID order using `spdk_bs_iter_first()` and `spdk_bs_iter_next()`. When a clone blob is opened, the open of its parent is triggered so that the snapshot (or external snapshot) finishes opening before the clone does. After a blob is opened, blobstore calls the callback specified in the `spdk_bs_iter_*` call, which is `load_next_lvol()`. Before `load_next_lvol()` continues the iteration to load the next lvol, it allocates an `spdk_lvol` and inserts it at the tail of `lvs->lvols`. This depth-first algorithm causes a snapshot to appear on `lvs->lvols` before its clones.
In the second pass, `_vbdev_lvs_examine_cb()` iterates through `lvs->lvols` and opens the lvols in the order they appear. Because of the order imposed by the first pass, we know that an esnap clone that is itself a snapshot will appear in `lvs->lvols` ahead of any of its clones. Thus, an esnap clone that fails to open its external snapshot can mark itself as degraded. If a regular clone can find its parent snapshot, it can then determine whether the parent is degraded. Blobstore already has a means for recursively getting data from parent devices, and that can be extended to support this function.
`spdk_bs_dev` gets a new callback:

```c
struct spdk_bs_dev {
	/* ... */
	bool (*is_healthy)(struct spdk_bs_dev *dev);
};
```
Existing bs_dev modules get an `is_healthy()` implementation that returns `true`. When an external snapshot is missing, the blob's `back_bs_dev` will be a minimal blobstore device whose `is_healthy()` callback returns `false`.
Blobstore gets a new function:

```c
bool
spdk_blob_is_healthy(struct spdk_blob *blob)
{
	if (blob->back_bs_dev == NULL) {
		return true;
	}
	return blob->back_bs_dev->is_healthy(blob->back_bs_dev);
}
```
Missing devices are then largely handled as they are in the code currently under review. The following changes are needed:
- Rather than loading an EIO bs_dev, an unhealthy bs_dev is loaded.
- Do not register a bdev when `spdk_blob_is_healthy()` returns false. This catches esnap clones and their dependents.
- When a missing bdev appears, the `examine_config()` callback will trigger recursive iteration of the lvol's clones to cause bdevs to be created for the now-healthy lvols.
Currently, lvolstore prevents new lvols from being created when the new name collides with a name on the `lvs->lvols` list, so no additional work is needed. In theory, someone could create a non-lvol bdev with a name that collides with `<lvs>/<vol>`, but that seems quite unlikely.
The UUID will not be reserved, but that is of minimal risk.
A new API will be added: `bdev_lvol_get_lvols`. Despite the `bdev` prefix, it will iterate `lvs->lvols` and dump the `driver_specific` data for each lvol. This will include the result of `spdk_blob_is_healthy()` in a new boolean `healthy` key. The other changes added so far for external snapshots will remain in this JSON output. See `vbdev_lvol_dump_info_json()`.
This can likely proceed as it currently does, but will skip the `spdk_bdev_*()` calls.
An lvol that is not healthy (per `spdk_blob_is_healthy(lvol->blob)`) will not allow new snapshots or clones, and cannot be resized, inflated, decoupled, or made read-only.