How to revive any orphaned frameworkID using Marathon.
Deploy the Marathon-Resurrection JSON shown below. Once it has started successfully, suspend it. It will create the required ZooKeeper directories in /universe/marathon-resurrection.
Important NOTE: For this procedure to work, you need a backup of a framework:id whose ID has the same format as the one you want to replace. For example, to tear down any framework (not just Marathon) whose framework ID has the format {8}-{4}-{4}-{4}-{12}-{4}, a framework ID with exactly that format must exist in a ZooKeeper backup, taken either from Marathon previously or from a different, unrelated Marathon instance. For example, 0cd52b83-52af-4c05-a6e7-45c8d9b8c649-0001 or 12345678-1234-4444-5555-1234567890AB-0002 will work, but 123-34-51-633-0001 will not, because its format is {3}-{2}-{2}-{3}-{4}. See https://issues.apache.org/jira/browse/MESOS-6419 if you can upgrade Mesos; this is a quick, temporary workaround for those running an older version.
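The format requirement above can be checked before touching ZooKeeper. The following is a minimal sketch, not part of the original procedure: `is_valid_framework_id` is a hypothetical helper name, and it assumes the first five groups are hexadecimal and the trailing group is four decimal digits.

```shell
# Hypothetical helper: succeeds only if the argument has the
# {8}-{4}-{4}-{4}-{12}-{4} shape described above.
# Assumption: hex digits in the first five groups, decimal digits in the last.
is_valid_framework_id() {
  printf '%s\n' "$1" | grep -Eq \
    '^[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12}-[0-9]{4}$'
}
```

For instance, `is_valid_framework_id 0cd52b83-52af-4c05-a6e7-45c8d9b8c649-0001` succeeds, while `is_valid_framework_id 123-34-51-633-0001` fails.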
{
"volumes": null,
"id": "/marathon-resurrection",
"cmd": "LIBPROCESS_PORT=$PORT1 && ./bin/start --checkpoint --decline_offer_duration \"120000\" --default_accepted_resource_roles \"*\" --enable_features \"vips,task_killing\" --event_stream_max_outstanding_messages \"50\" --executor \"//cmd\" --failover_timeout \"604800\" --framework_name marathon-resurrection --ha --hostname $LIBPROCESS_IP --http_compression --http_event_callback_slow_consumer_timeout \"10000\" --http_event_request_timeout \"10000\" --http_port $PORT0 --http_realm \"Mesosphere\" --launch_token_refresh_interval \"30000\" --launch_tokens \"100\" --leader_proxy_connection_timeout \"5000\" --leader_proxy_read_timeout \"10000\" --local_port_max \"20000\" --local_port_min \"10000\" --master \"zk://master.mesos:2181/mesos\" --max_tasks_per_offer \"1\" --disable_mesos_authentication --mesos_authentication_principal marathon-resurrection --mesos_leader_ui_url \"/mesos\" --mesos_role marathon-resurrection --metrics --min_revive_offers_interval \"5000\" --offer_matching_timeout \"1000\" --on_elected_prepare_timeout \"180000\" --reconciliation_initial_delay \"15000\" --reconciliation_interval \"600000\" --revive_offers_repetitions \"3\" --save_tasks_to_launch_timeout \"3000\" --scale_apps_initial_delay \"15000\" --scale_apps_interval \"300000\" --store_cache --task_launch_confirm_timeout \"300000\" --task_launch_timeout \"300000\" --task_lost_expunge_gc \"75000\" --task_lost_expunge_initial_delay \"300000\" --task_lost_expunge_interval \"30000\" --task_reservation_timeout \"20000\" --disable_tracing --zk zk://master.mesos:2181/universe/marathon-resurrection --zk_compression --zk_compression_threshold \"65536\" --zk_max_node_size \"1024000\" --zk_max_versions \"25\" --zk_session_timeout \"10000\" --zk_timeout \"10000\"",
"args": null,
"user": null,
"env": {
"JVM_OPTS": "-Xms256m -Xmx768m"
},
"instances": 1,
"cpus": 1,
"mem": 1536,
"disk": 0,
"gpus": 0,
"executor": null,
"constraints": [
[
"hostname",
"UNIQUE"
]
],
"fetch": null,
"storeUrls": null,
"backoffSeconds": 1,
"backoffFactor": 1.15,
"maxLaunchDelaySeconds": 3600,
"container": {
"docker": {
"image": "mesosphere/marathon:v1.3.3",
"forcePullImage": false,
"privileged": false,
"network": "HOST"
}
},
"healthChecks": [
{
"protocol": "HTTP",
"path": "/ping",
"gracePeriodSeconds": 120,
"intervalSeconds": 10,
"timeoutSeconds": 5,
"maxConsecutiveFailures": 3,
"ignoreHttp1xx": false
}
],
"readinessChecks": null,
"dependencies": null,
"upgradeStrategy": {
"minimumHealthCapacity": 1,
"maximumOverCapacity": 1
},
"labels": {
"DCOS_PACKAGE_RELEASE": "4",
"DCOS_SERVICE_SCHEME": "http",
"DCOS_PACKAGE_SOURCE": "https://universe.mesosphere.com/repo",
"DCOS_PACKAGE_METADATA": "eyJwYWNrYWdpbmdWZXJzaW9uIjoiMy4wIiwibmFtZSI6Im1hcmF0aG9uIiwidmVyc2lvbiI6IjEuMy4zIiwibWFpbnRhaW5lciI6InN1cHBvcnRAbWVzb3NwaGVyZS5pbyIsImRlc2NyaXB0aW9uIjoiQSBjb250YWluZXIgb3JjaGVzdHJhdGlvbiBwbGF0Zm9ybSBmb3IgTWVzb3MgYW5kIERDT1MuIiwidGFncyI6WyJpbml0IiwibG9uZy1ydW5uaW5nIl0sInNlbGVjdGVkIjp0cnVlLCJzY20iOiJodHRwczovL2dpdGh1Yi5jb20vbWVzb3NwaGVyZS9tYXJhdGhvbi5naXQiLCJmcmFtZXdvcmsiOnRydWUsInByZUluc3RhbGxOb3RlcyI6IldlIHJlY29tbWVuZCBhIG1pbmltdW0gb2Ygb25lIG5vZGUgd2l0aCBhdCBsZWFzdCAyIENQVSBzaGFyZXMgYW5kIDFHQiBvZiBSQU0gYXZhaWxhYmxlIGZvciB0aGUgTWFyYXRob24gRENPUyBTZXJ2aWNlLiIsInBvc3RJbnN0YWxsTm90ZXMiOiJNYXJhdGhvbiBEQ09TIFNlcnZpY2UgaGFzIGJlZW4gc3VjY2Vzc2Z1bGx5IGluc3RhbGxlZCFcblxuXHREb2N1bWVudGF0aW9uOiBodHRwczovL21lc29zcGhlcmUuZ2l0aHViLmlvL21hcmF0aG9uXG5cdElzc3VlczogaHR0cHM6Ly9naXRodWIuY29tL21lc29zcGhlcmUvbWFyYXRob24vaXNzdWVzXG4iLCJwb3N0VW5pbnN0YWxsTm90ZXMiOiJUaGUgTWFyYXRob24gRENPUyBTZXJ2aWNlIGhhcyBiZWVuIHVuaW5zdGFsbGVkIGFuZCB3aWxsIG5vIGxvbmdlciBydW4uXG5QbGVhc2UgZm9sbG93IHRoZSBpbnN0cnVjdGlvbnMgYXQgaHR0cDovL2RvY3MubWVzb3NwaGVyZS5jb20vc2VydmljZXMvbWFyYXRob24vI3VuaW5zdGFsbCB0byBjbGVhbiB1cCBhbnkgcGVyc2lzdGVkIHN0YXRlIiwibGljZW5zZXMiOlt7Im5hbWUiOiJBcGFjaGUgTGljZW5zZSBWZXJzaW9uIDIuMCIsInVybCI6Imh0dHBzOi8vZ2l0aHViLmNvbS9tZXNvc3BoZXJlL21hcmF0aG9uL2Jsb2IvbWFzdGVyL0xJQ0VOU0UifV0sImltYWdlcyI6eyJpY29uLXNtYWxsIjoiaHR0cHM6Ly9kb3dubG9hZHMubWVzb3NwaGVyZS5jb20vbWFyYXRob24vYXNzZXRzL2ljb24tc2VydmljZS1tYXJhdGhvbi1zbWFsbC5wbmciLCJpY29uLW1lZGl1bSI6Imh0dHBzOi8vZG93bmxvYWRzLm1lc29zcGhlcmUuY29tL21hcmF0aG9uL2Fzc2V0cy9pY29uLXNlcnZpY2UtbWFyYXRob24tbWVkaXVtLnBuZyIsImljb24tbGFyZ2UiOiJodHRwczovL2Rvd25sb2Fkcy5tZXNvc3BoZXJlLmNvbS9tYXJhdGhvbi9hc3NldHMvaWNvbi1zZXJ2aWNlLW1hcmF0aG9uLWxhcmdlLnBuZyJ9fQ==",
"DCOS_PACKAGE_REGISTRY_VERSION": "3.0",
"DCOS_SERVICE_NAME": "marathon-resurrection",
"DCOS_PACKAGE_FRAMEWORK_NAME": "marathon-resurrection",
"DCOS_SERVICE_PORT_INDEX": "0",
"DCOS_PACKAGE_VERSION": "1.3.3",
"DCOS_PACKAGE_NAME": "marathon",
"DCOS_PACKAGE_IS_FRAMEWORK": "true"
}
}
Pull down our guano backup and restore utility
sudo curl -O https://raw.githubusercontent.com/mesosphere/docker-containers/master/dcos-debug/toolbox && sudo chmod +x toolbox && sudo ./toolbox
Perform a backup of the marathon resurrection app
guano -d /universe/marathon-resurrection/state -o backup -s localhost:2181
Now that the backup is performed, let's modify it with the framework ID we want to resurrect. The sed expression has the form s/{marathon_resurrect_id}/{orphaned_framework_id}/, where {marathon_resurrect_id} is the framework ID that marathon-resurrection registered with (the one stored in the backup) and {orphaned_framework_id} is the ID of the orphaned framework.
sed -i.bak 's/{marathon_resurrect_id}/{orphaned_framework_id}/' backup/universe/marathon-resurrection/state/framework\:id
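The substitution can also be scripted so that the old ID is read from the backup rather than typed by hand. This is a sketch under assumptions: it assumes the framework ID appears as plain text inside the backed-up framework:id file, and the helper names `extract_framework_id` and `replace_framework_id` are hypothetical.

```shell
# Hypothetical helper: print the first framework-ID-shaped string in a file.
# -a treats binary data as text, -o prints only the matching part.
extract_framework_id() {
  grep -aoE \
    '[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12}-[0-9]{4}' \
    "$1" | head -n 1
}

# Hypothetical helper: replace the ID found in the file with the orphaned one,
# keep a .bak copy (as the sed command above does), and verify the result.
replace_framework_id() {
  file=$1
  new_id=$2
  old_id=$(extract_framework_id "$file")
  [ -n "$old_id" ] || { echo "no framework ID found in $file" >&2; return 1; }
  sed -i.bak "s/$old_id/$new_id/" "$file"
  [ "$(extract_framework_id "$file")" = "$new_id" ]
}
```

It would be called as `replace_framework_id backup/universe/marathon-resurrection/state/framework\:id {orphaned_framework_id}`, with the placeholder filled in by hand.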
Let's restore this backup into ZooKeeper
guano -i backup/universe/marathon-resurrection/state/framework\:id -r /universe/marathon-resurrection/state/framework\:id -s localhost:2181
ZK Operation Complete!
NEXT:
- Now we need to redeploy the Marathon-Resurrection JSON. Go ahead and suspend the application. Once suspended, scale the instances back to 1. It will register with the orphaned framework ID.
Once it has registered, turn off Marathon-Resurrection: the orphaned framework ID has now been resurrected, so the app is no longer required.
If you look in the Mesos UI Frameworks tab, you will see two Marathon-Resurrection framework IDs. You will now perform the teardown on both.
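The suspend and scale steps above can also be done through Marathon's REST API instead of the UI. This is a sketch only: the address in `MARATHON_URL` is an assumption about where the parent Marathon is reachable in your cluster, and `scale_payload` and `scale_resurrection` are hypothetical helper names.

```shell
# Assumption: the parent Marathon's REST API is reachable here; adjust as needed.
MARATHON_URL=${MARATHON_URL:-http://marathon.mesos:8080}

# Build the JSON body for a scale request.
scale_payload() {
  printf '{"instances": %d}' "$1"
}

# Scale the app to the requested instance count (0 suspends it).
scale_resurrection() {
  curl -X PUT -H 'Content-Type: application/json' \
    -d "$(scale_payload "$1")" \
    "$MARATHON_URL/v2/apps/marathon-resurrection"
}
```

`scale_resurrection 0` suspends the app and `scale_resurrection 1` brings it back with a single instance.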
Perform the /teardown operation with
curl leader.mesos:5050/teardown -d frameworkId={framework_id_marathon_resurrect}
curl leader.mesos:5050/teardown -d frameworkId={framework_id_orphaned_framework}
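Because a teardown cannot be undone, it may help to print the exact command before running it. The wrapper below is a hypothetical convenience around the same curl call; the `DRY_RUN` variable and the function name are illustrative and not part of Mesos.

```shell
# Hypothetical wrapper around the teardown call shown above.
# With DRY_RUN=1 it only prints the command it would run.
teardown_framework() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "curl leader.mesos:5050/teardown -d frameworkId=$1"
  else
    curl leader.mesos:5050/teardown -d "frameworkId=$1"
  fi
}
```

Running it once with `DRY_RUN=1` for each of the two framework IDs, and then again without it, keeps the destructive step explicit.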
Now you can go into ZK and clean up the ZooKeeper directories in /universe/marathon-resurrection.
Complete.
@bernadinm I am trying to use your resurrection debug utility on DC/OS 1.8 /Mesos 1.0.1.
I don't understand what "marathon_resurrection_id" should be. Is it something that I come up with using the known pre-defined format as described earlier in your gist?
Update:
After pondering a bit more on the "ZK Operation" section, I understand what you mean. To summarize:
Once you get the backup zNode value for state/framework:id, note down the retrieved framework ID (it's the last piece of text after the parentheses delimiter). You have to replace {marathon_resurrect_id} with that framework ID, and {orphaned_framework_id} with the actual orphaned framework ID. Execute the sed command with these values explicitly defined.
Make sure the local framework:id file has the correct contents. I encourage people not to modify this file directly via GUI (incl Exhibitor). After the file is uploaded using the last step, you can restart/un-suspend resurrection app.
If all works as expected, the orphaned tasks are resurrected, allowing you to conduct the /teardown. I've noticed in my particular case that I still had task references to "unregistered_frameworks", but at least they were not claiming resources :)