Last active
August 29, 2015 14:16
LeoFS node recovery issue(?)
When I run "recover node <storage_node_name>", the message of type QUEUE_ID_RECOVERY_NODE seems to
block processing of other messages (mainly QUEUE_ID_PER_OBJECT ones).
I tried to visualize message processing using StatsD:
http://ns2.1888.spb.ru/2015-02-24-024731_1366x768_scrot.png
The graph in the lower right corner shows calls to
handle_call({consume, ?QUEUE_ID_RECOVERY_NODE, MessageBin}).
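
For reference, a counter like the one plotted can be produced by sending one StatsD counter datagram per consume call. The following is a minimal Python sketch of that idea, not the instrumentation actually used here: the metric name `mq.consume.<queue_id>` and the helper names are hypothetical, and the StatsD daemon is assumed to listen on its default UDP port 8125.

```python
import socket


def statsd_counter(metric, value=1):
    """Format a StatsD counter datagram: <metric>:<value>|c"""
    return f"{metric}:{value}|c".encode("ascii")


def count_consume(queue_id, host="127.0.0.1", port=8125):
    """Send one counter increment for a consumed message (fire-and-forget UDP).

    The metric name is an illustrative assumption, not a LeoFS name.
    """
    payload = statsd_counter(f"mq.consume.{queue_id}")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload
```

Calling `count_consume("recovery_node")` at the top of the consume handler would then give one data point per handler invocation, which is what makes repeated processing of the same message visible on a graph.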
As I understand it, recover_node_callback(Node) should synchronously generate many
QUEUE_TYPE_PER_OBJECT messages for a single QUEUE_ID_RECOVERY_NODE message,
but the graph shows that QUEUE_ID_RECOVERY_NODE messages are processed constantly.
My guess is that all 8 MQ workers are processing the same QUEUE_ID_RECOVERY_NODE message, and
that this continues until the first worker has synchronously generated
QUEUE_TYPE_PER_OBJECT messages for every object
on the node. This does not seem right.
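
The guess above can be illustrated with a toy model. This is a Python sketch under stated assumptions, not LeoFS code: it assumes a worker reads a message without removing it, and that the message is deleted only after a handler returns. The function name and the tick-based scheduling are hypothetical.

```python
def count_consume_calls(workers, handler_ticks):
    """Count handler invocations for ONE queued message.

    Assumed model: every idle worker polls the queue once per tick, a poll
    does not remove the message, and the message is deleted only when a
    handler returns after `handler_ticks` ticks.
    """
    if handler_ticks < 1:
        raise ValueError("handler_ticks must be >= 1")
    queue = ["recovery_node"]   # the single pending message
    finish_at = {}              # worker index -> tick at which its handler returns
    calls = 0
    tick = 0
    while queue or finish_at:
        # Handlers returning on this tick delete the message (if still present).
        for w in [w for w, t in finish_at.items() if t == tick]:
            del finish_at[w]
            if queue:
                queue.pop()
        # Idle workers pick up whatever is still visible in the queue.
        for w in range(workers):
            if w not in finish_at and queue:
                calls += 1
                finish_at[w] = tick + handler_ticks
        tick += 1
    return calls


print(count_consume_calls(workers=8, handler_ticks=5))  # 8
```

In this model a single message is handled once per worker, because it stays visible for as long as the first handler is running.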
I propose the following fix: remove the QUEUE_ID_RECOVERY_NODE message from
the MQ queue before the erlang:apply(Mod, handle_call, [{consume, Id, MsgBin}]) call, so that
other MQ workers do not process the same message.
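
The idea of removing a message before handling it can be sketched as follows. This is a generic Python illustration using the standard `queue` and `threading` modules, not the LeoFS MQ implementation; `run_workers` and `record` are hypothetical names.

```python
import queue
import threading


def run_workers(messages, handler, workers=8):
    """Each worker removes a message from the queue BEFORE handling it."""
    mq = queue.Queue()
    for msg in messages:
        mq.put(msg)

    def worker():
        while True:
            try:
                # Atomic removal: once taken, no other worker can see the message.
                msg = mq.get_nowait()
            except queue.Empty:
                return
            handler(msg)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


handled = []
lock = threading.Lock()


def record(msg):
    # Record every handler invocation so duplicates would be visible.
    with lock:
        handled.append(msg)


run_workers(["recovery_node"], record)
print(handled)  # ['recovery_node']
```

The trade-off of this ordering is that a message is handled at most once: if the handler fails after the removal, the message is lost unless it is put back explicitly.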
Upd:
I tried the proposed fix and it did not help. The repeated processing of QUEUE_ID_RECOVERY_NODE
stopped, but QUEUE_TYPE_PER_OBJECT messages overloaded the queue anyway:
mq-stats <storage_node_name>
 id                             | state       | number of msgs | batch of msgs  | interval       | description
--------------------------------+-------------+----------------|----------------|----------------|-----------------------------------
 leo_delete_dir_queue           | idling      | 0              | 5000           | 100            | delete directories
 leo_comp_meta_with_dc_queue    | idling      | 0              | 5000           | 100            | compare metadata w/remote-node
 leo_sync_obj_with_dc_queue     | idling      | 0              | 5000           | 100            | sync objs w/remote-node
 leo_recovery_node_queue        | idling      | 0              | 5000           | 100            | recovery objs of node
 leo_async_deletion_queue       | idling      | 0              | 5000           | 100            | async deletion of objs
 leo_rebalance_queue            | idling      | 0              | 5000           | 100            | rebalance objs
 leo_sync_by_vnode_id_queue     | idling      | 0              | 5000           | 100            | sync objs by vnode-id
 leo_per_object_queue           | running     | 3027           | 5000           | 100            | recover inconsistent objs