@alexclear
Last active August 29, 2015 14:16
LeoFS node recovery issue(?)
When I run "recover node <storage_node_name>", a message of type QUEUE_ID_RECOVERY_NODE seems to
block processing of other messages (mainly QUEUE_ID_PER_OBJECT ones).
I tried to visualize message processing using StatsD:
http://ns2.1888.spb.ru/2015-02-24-024731_1366x768_scrot.png
The graph in the lower-right corner shows calls to
handle_call({consume, ?QUEUE_ID_RECOVERY_NODE, MessageBin}).
It is my understanding that recover_node_callback(Node) should synchronously generate a lot
of QUEUE_TYPE_PER_OBJECT messages for a single QUEUE_ID_RECOVERY_NODE message,
but the graph shows that QUEUE_ID_RECOVERY_NODE messages are being processed constantly.
My guess is that all 8 MQ workers are processing the same QUEUE_ID_RECOVERY_NODE message, and
that this will continue until the very first worker has synchronously generated ALL
QUEUE_TYPE_PER_OBJECT messages for each and every object
on the node. This does not seem right.
I propose the following fix: the QUEUE_ID_RECOVERY_NODE message should be removed from
the MQ queue before the erlang:apply(Mod, handle_call, [{consume, Id, MsgBin}]) call, so that
other MQ workers stop processing the same message.
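
To make the guess and the proposed fix concrete, here is a small deterministic model in Python. It is not LeoFS code: the tick-based workers, the function name simulate, and the handler duration are all assumptions made for illustration. It only compares "remove the message after the handler returns" with "remove the message before calling the handler".

```python
from collections import deque


def simulate(num_workers, handler_ticks, remove_before_handle):
    """Count how many times the handler starts for one long-running message.

    Hypothetical model, not LeoFS code: each idle worker looks at the head
    of the queue once per tick, and the handler needs `handler_ticks` ticks.
    """
    queue = deque(["recovery_node_msg"])
    busy_until = [0] * num_workers  # tick at which each worker becomes idle
    in_flight = {}                  # worker index -> message being handled
    handler_starts = 0
    tick = 0
    while queue or in_flight:
        # Finish handlers whose time is up.
        for w in list(in_flight):
            if busy_until[w] <= tick:
                msg = in_flight.pop(w)
                if not remove_before_handle and msg in queue:
                    # Removal happens only after the handler has returned.
                    queue.remove(msg)
        # Idle workers pick up the head of the queue.
        for w in range(num_workers):
            if w not in in_flight and queue:
                if remove_before_handle:
                    msg = queue.popleft()  # proposed fix: dequeue first
                else:
                    msg = queue[0]         # peek only; message stays queued
                in_flight[w] = msg
                busy_until[w] = tick + handler_ticks
                handler_starts += 1
        tick += 1
    return handler_starts
```

In this model, simulate(8, 5, False) starts the handler 8 times for the single message, while simulate(8, 5, True) starts it once. Whether the real MQ workers behave like this depends on how the queue hands out messages, which the model does not try to reproduce.
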
Upd:
I tried the proposed fix and it did not help. The circular processing of QUEUE_ID_RECOVERY_NODE
stopped, but QUEUE_TYPE_PER_OBJECT messages overloaded the queue anyway:
mq-stats [email protected]
id                              | state       | number of msgs | batch of msgs  | interval       | description
--------------------------------+-------------+----------------|----------------|----------------|-----------------------------------
leo_delete_dir_queue            | idling      | 0              | 5000           | 100            | delete directories
leo_comp_meta_with_dc_queue     | idling      | 0              | 5000           | 100            | compare metadata w/remote-node
leo_sync_obj_with_dc_queue      | idling      | 0              | 5000           | 100            | sync objs w/remote-node
leo_recovery_node_queue         | idling      | 0              | 5000           | 100            | recovery objs of node
leo_async_deletion_queue        | idling      | 0              | 5000           | 100            | async deletion of objs
leo_rebalance_queue             | idling      | 0              | 5000           | 100            | rebalance objs
leo_sync_by_vnode_id_queue      | idling      | 0              | 5000           | 100            | sync objs by vnode-id
leo_per_object_queue            | running     | 3027           | 5000           | 100            | recover inconsistent objs
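
For watching the backlog over time, output like the above can be reduced to the queues that still hold messages. The Python sketch below is only an illustration: parse_mq_stats and backlogged are hypothetical helper names, and the parser assumes the six "|"-separated columns shown here, nothing more.

```python
SAMPLE = """\
id | state | number of msgs | batch of msgs | interval | description
--------------------------------+-------------+----------------|----------------|----------------|---
leo_recovery_node_queue | idling | 0 | 5000 | 100 | recovery objs of node
leo_per_object_queue | running | 3027 | 5000 | 100 | recover inconsistent objs
"""


def parse_mq_stats(text):
    """Turn mq-stats style rows into dicts (assumes the six columns above)."""
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.split("|")]
        # Data rows have six cells and a numeric message count; this skips
        # the header line and the dashed separator line.
        if len(cells) != 6 or not cells[2].isdigit():
            continue
        rows.append({
            "id": cells[0],
            "state": cells[1],
            "msgs": int(cells[2]),
            "batch": int(cells[3]),
            "interval": int(cells[4]),
            "description": cells[5],
        })
    return rows


def backlogged(rows):
    """Return the ids of queues that still have pending messages."""
    return [r["id"] for r in rows if r["msgs"] > 0]
```

Applied to SAMPLE, backlogged(parse_mq_stats(SAMPLE)) returns only leo_per_object_queue, which matches the row with 3027 pending messages.
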