@bernadinm
Last active September 29, 2017 20:20

Guano Restore of Marathon framework ID Procedure

How to revive any orphaned frameworkID using Marathon.

Deploy Marathon-Resurrection JSON

Deploy the Marathon-Resurrection JSON below. Once the app has started successfully, suspend it. Starting it creates the required ZooKeeper directories under /universe/marathon-resurrection.

Important NOTE: This procedure requires a backup of a framework:id node whose ID has the same format as the ID you need to replace. For example, to tear down any framework (not just Marathon) whose framework ID has the format {8}-{4}-{4}-{4}-{12}-{4}, an ID of that exact format must exist in a ZooKeeper backup, taken either previously from Marathon or from a different, unrelated Marathon instance. (For example, 0cd52b83-52af-4c05-a6e7-45c8d9b8c649-0001 or 12345678-1234-4444-5555-1234567890AB-0002 would work, but 123-34-51-633-0001 would not, as it has the format {3}-{2}-{2}-{3}-{4}.) See https://issues.apache.org/jira/browse/MESOS-6419 if you can upgrade Mesos; this procedure is a quick, temporary workaround for those running an older version.
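The format requirement in the note can be checked before touching ZooKeeper. The following is a minimal sketch, not part of the original procedure: `has_framework_id_format` is a hypothetical helper name, and it assumes the groups consist of letters and digits, with a numeric final group.

```shell
# Hypothetical helper: succeeds only if the ID has the
# {8}-{4}-{4}-{4}-{12}-{4} layout described in the note.
# Assumption: groups are letters/digits; the last group is digits only.
has_framework_id_format() {
  printf '%s\n' "$1" | grep -Eq '^[0-9A-Za-z]{8}(-[0-9A-Za-z]{4}){3}-[0-9A-Za-z]{12}-[0-9]{4}$'
}

# Example IDs taken from the note.
has_framework_id_format "0cd52b83-52af-4c05-a6e7-45c8d9b8c649-0001" && echo "format ok"
has_framework_id_format "123-34-51-633-0001" || echo "format mismatch"
```

Running the check against both the ID in the backup and the orphaned ID shows early whether the two are interchangeable for this procedure.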

Marathon-Resurrection.json

{
  "volumes": null,
  "id": "/marathon-resurrection",
  "cmd": "LIBPROCESS_PORT=$PORT1 && ./bin/start    --checkpoint --decline_offer_duration \"120000\" --default_accepted_resource_roles \"*\"   --enable_features \"vips,task_killing\"  --event_stream_max_outstanding_messages \"50\"  --executor \"//cmd\" --failover_timeout \"604800\" --framework_name marathon-resurrection  --ha --hostname $LIBPROCESS_IP   --http_compression   --http_event_callback_slow_consumer_timeout \"10000\" --http_event_request_timeout \"10000\"  --http_port  $PORT0  --http_realm \"Mesosphere\"   --launch_token_refresh_interval \"30000\" --launch_tokens \"100\" --leader_proxy_connection_timeout \"5000\" --leader_proxy_read_timeout \"10000\"  --local_port_max \"20000\" --local_port_min \"10000\"    --master \"zk://master.mesos:2181/mesos\"  --max_tasks_per_offer \"1\" --disable_mesos_authentication --mesos_authentication_principal marathon-resurrection   --mesos_leader_ui_url \"/mesos\" --mesos_role marathon-resurrection   --metrics --min_revive_offers_interval \"5000\" --offer_matching_timeout \"1000\" --on_elected_prepare_timeout \"180000\"   --reconciliation_initial_delay \"15000\" --reconciliation_interval \"600000\"   --revive_offers_repetitions \"3\" --save_tasks_to_launch_timeout \"3000\" --scale_apps_initial_delay \"15000\" --scale_apps_interval \"300000\"      --store_cache --task_launch_confirm_timeout \"300000\" --task_launch_timeout \"300000\" --task_lost_expunge_gc \"75000\" --task_lost_expunge_initial_delay \"300000\" --task_lost_expunge_interval \"30000\" --task_reservation_timeout \"20000\" --disable_tracing  --zk zk://master.mesos:2181/universe/marathon-resurrection  --zk_compression --zk_compression_threshold \"65536\" --zk_max_node_size \"1024000\" --zk_max_versions \"25\" --zk_session_timeout \"10000\" --zk_timeout \"10000\"",
  "args": null,
  "user": null,
  "env": {
    "JVM_OPTS": "-Xms256m -Xmx768m"
  },
  "instances": 1,
  "cpus": 1,
  "mem": 1536,
  "disk": 0,
  "gpus": 0,
  "executor": null,
  "constraints": [
    [
      "hostname",
      "UNIQUE"
    ]
  ],
  "fetch": null,
  "storeUrls": null,
  "backoffSeconds": 1,
  "backoffFactor": 1.15,
  "maxLaunchDelaySeconds": 3600,
  "container": {
    "docker": {
      "image": "mesosphere/marathon:v1.3.3",
      "forcePullImage": false,
      "privileged": false,
      "network": "HOST"
    }
  },
  "healthChecks": [
    {
      "protocol": "HTTP",
      "path": "/ping",
      "gracePeriodSeconds": 120,
      "intervalSeconds": 10,
      "timeoutSeconds": 5,
      "maxConsecutiveFailures": 3,
      "ignoreHttp1xx": false
    }
  ],
  "readinessChecks": null,
  "dependencies": null,
  "upgradeStrategy": {
    "minimumHealthCapacity": 1,
    "maximumOverCapacity": 1
  },
  "labels": {
    "DCOS_PACKAGE_RELEASE": "4",
    "DCOS_SERVICE_SCHEME": "http",
    "DCOS_PACKAGE_SOURCE": "https://universe.mesosphere.com/repo",
    "DCOS_PACKAGE_METADATA": "eyJwYWNrYWdpbmdWZXJzaW9uIjoiMy4wIiwibmFtZSI6Im1hcmF0aG9uIiwidmVyc2lvbiI6IjEuMy4zIiwibWFpbnRhaW5lciI6InN1cHBvcnRAbWVzb3NwaGVyZS5pbyIsImRlc2NyaXB0aW9uIjoiQSBjb250YWluZXIgb3JjaGVzdHJhdGlvbiBwbGF0Zm9ybSBmb3IgTWVzb3MgYW5kIERDT1MuIiwidGFncyI6WyJpbml0IiwibG9uZy1ydW5uaW5nIl0sInNlbGVjdGVkIjp0cnVlLCJzY20iOiJodHRwczovL2dpdGh1Yi5jb20vbWVzb3NwaGVyZS9tYXJhdGhvbi5naXQiLCJmcmFtZXdvcmsiOnRydWUsInByZUluc3RhbGxOb3RlcyI6IldlIHJlY29tbWVuZCBhIG1pbmltdW0gb2Ygb25lIG5vZGUgd2l0aCBhdCBsZWFzdCAyIENQVSBzaGFyZXMgYW5kIDFHQiBvZiBSQU0gYXZhaWxhYmxlIGZvciB0aGUgTWFyYXRob24gRENPUyBTZXJ2aWNlLiIsInBvc3RJbnN0YWxsTm90ZXMiOiJNYXJhdGhvbiBEQ09TIFNlcnZpY2UgaGFzIGJlZW4gc3VjY2Vzc2Z1bGx5IGluc3RhbGxlZCFcblxuXHREb2N1bWVudGF0aW9uOiBodHRwczovL21lc29zcGhlcmUuZ2l0aHViLmlvL21hcmF0aG9uXG5cdElzc3VlczogaHR0cHM6Ly9naXRodWIuY29tL21lc29zcGhlcmUvbWFyYXRob24vaXNzdWVzXG4iLCJwb3N0VW5pbnN0YWxsTm90ZXMiOiJUaGUgTWFyYXRob24gRENPUyBTZXJ2aWNlIGhhcyBiZWVuIHVuaW5zdGFsbGVkIGFuZCB3aWxsIG5vIGxvbmdlciBydW4uXG5QbGVhc2UgZm9sbG93IHRoZSBpbnN0cnVjdGlvbnMgYXQgaHR0cDovL2RvY3MubWVzb3NwaGVyZS5jb20vc2VydmljZXMvbWFyYXRob24vI3VuaW5zdGFsbCB0byBjbGVhbiB1cCBhbnkgcGVyc2lzdGVkIHN0YXRlIiwibGljZW5zZXMiOlt7Im5hbWUiOiJBcGFjaGUgTGljZW5zZSBWZXJzaW9uIDIuMCIsInVybCI6Imh0dHBzOi8vZ2l0aHViLmNvbS9tZXNvc3BoZXJlL21hcmF0aG9uL2Jsb2IvbWFzdGVyL0xJQ0VOU0UifV0sImltYWdlcyI6eyJpY29uLXNtYWxsIjoiaHR0cHM6Ly9kb3dubG9hZHMubWVzb3NwaGVyZS5jb20vbWFyYXRob24vYXNzZXRzL2ljb24tc2VydmljZS1tYXJhdGhvbi1zbWFsbC5wbmciLCJpY29uLW1lZGl1bSI6Imh0dHBzOi8vZG93bmxvYWRzLm1lc29zcGhlcmUuY29tL21hcmF0aG9uL2Fzc2V0cy9pY29uLXNlcnZpY2UtbWFyYXRob24tbWVkaXVtLnBuZyIsImljb24tbGFyZ2UiOiJodHRwczovL2Rvd25sb2Fkcy5tZXNvc3BoZXJlLmNvbS9tYXJhdGhvbi9hc3NldHMvaWNvbi1zZXJ2aWNlLW1hcmF0aG9uLWxhcmdlLnBuZyJ9fQ==",
    "DCOS_PACKAGE_REGISTRY_VERSION": "3.0",
    "DCOS_SERVICE_NAME": "marathon-resurrection",
    "DCOS_PACKAGE_FRAMEWORK_NAME": "marathon-resurrection",
    "DCOS_SERVICE_PORT_INDEX": "0",
    "DCOS_PACKAGE_VERSION": "1.3.3",
    "DCOS_PACKAGE_NAME": "marathon",
    "DCOS_PACKAGE_IS_FRAMEWORK": "true"
  }
}

ZK Operation

Pull down our guano backup and restore utility

sudo curl -O https://raw.githubusercontent.com/mesosphere/docker-containers/master/dcos-debug/toolbox && sudo chmod +x toolbox && sudo ./toolbox

Perform a backup of the marathon resurrection app

guano -d /universe/marathon-resurrection/state -o backup -s localhost:2181

Now that the backup is done, let's modify it with the framework ID we want to resurrect. Note that the structure of the sed expression is: s/{framework_id_marathon_resurrect}/{framework_id_orphaned_framework}/

sed -i.bak 's/{marathon_resurrect_id}/{orphaned_framework_id}/' backup/universe/marathon-resurrection/state/framework\:id
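The substitution can also be scripted so that the Marathon-Resurrection ID is read from the backup file rather than typed by hand. This is a sketch under stated assumptions: the helper names `framework_id_in` and `replace_framework_id` are hypothetical, the backup is assumed to contain exactly one ID in the {8}-{4}-{4}-{4}-{12}-{4} format, and IDs are assumed to contain only letters, digits, and dashes.

```shell
# Shape of a framework ID as described in the note: {8}-{4}-{4}-{4}-{12}-{4}.
# Assumption: IDs consist only of letters, digits, and dashes.
ID_RE='[0-9A-Za-z]{8}(-[0-9A-Za-z]{4}){3}-[0-9A-Za-z]{12}-[0-9]{4}'

# Hypothetical helper: print the first framework ID found in a backup file.
# -a reads the node content as text even if it holds binary bytes;
# -o prints only the matching part.
framework_id_in() {
  LC_ALL=C grep -aoE "$ID_RE" "$1" | head -n 1
}

# Hypothetical helper: swap the ID stored in the file for new_id,
# keeping a .bak copy, and refuse if the new ID has a different format.
replace_framework_id() {
  file="$1"
  new_id="$2"
  old_id="$(framework_id_in "$file")"
  if [ -z "$old_id" ]; then
    echo "no framework ID found in $file" >&2
    return 1
  fi
  if ! printf '%s\n' "$new_id" | grep -Eq "^${ID_RE}\$"; then
    echo "new ID does not match the format of $old_id" >&2
    return 1
  fi
  sed -i.bak "s/$old_id/$new_id/" "$file"
}
```

A call would look like `replace_framework_id backup/universe/marathon-resurrection/state/framework\:id <orphaned framework ID>`, with the orphaned ID supplied by you.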

Let's restore this backup into ZooKeeper

guano -i backup/universe/marathon-resurrection/state/framework\:id -r /universe/marathon-resurrection/state/framework\:id -s localhost:2181

ZK Operation Complete!

NEXT:

  • Now we need to redeploy the Marathon-Resurrection JSON. Go ahead and suspend the application. Once it is suspended, scale the instances back to 1. It will register with the orphaned framework ID.
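The suspend and scale-up step can also be done from the command line instead of the UI. The sketch below only prints the curl commands so they can be reviewed first. It assumes the parent Marathon's REST API is reachable at http://marathon.mesos:8080 (adjust `MARATHON_URL` for your cluster); `scale_command` is a hypothetical helper name.

```shell
# Assumption: address of the Marathon that runs the marathon-resurrection app.
MARATHON_URL="${MARATHON_URL:-http://marathon.mesos:8080}"
APP_ID="marathon-resurrection"

# Print the curl command that sets the app's instance count
# via PUT /v2/apps/<app id>. Pipe the output to sh to run it.
scale_command() {
  printf "curl -X PUT -H 'Content-Type: application/json' -d '{\"instances\": %s}' %s/v2/apps/%s\n" \
    "$1" "$MARATHON_URL" "$APP_ID"
}

scale_command 0   # suspend
scale_command 1   # scale back to 1 once the app has stopped
```

Printing rather than executing keeps the sketch safe to try; wait for the suspend deployment to finish before issuing the second command.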

Teardown Resurrected Active Frameworks

Suspend the Marathon-Resurrection app. It has now resurrected the orphaned framework ID and is no longer required.

In the Frameworks tab of the Mesos UI, you will see two Marathon-Resurrection framework IDs. Perform the teardown on both.

Perform the /teardown operation with

curl leader.mesos:5050/teardown -d frameworkId={framework_id_marathon_resurrect}
curl leader.mesos:5050/teardown -d frameworkId={framework_id_orphaned_framework}
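The two teardown calls can be wrapped in a small loop with a dry-run guard, so the commands can be inspected before anything is torn down. This is a sketch: `teardown` and the `DRY_RUN` and `MESOS_MASTER` variables are hypothetical names, and the master address is the one used in the commands above.

```shell
# Assumption: same master address as in the curl commands above.
MESOS_MASTER="${MESOS_MASTER:-leader.mesos:5050}"

# Hypothetical helper: tear down every framework ID passed as an argument.
# With DRY_RUN=1 (the default) it only prints the commands it would run.
teardown() {
  for id in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "curl $MESOS_MASTER/teardown -d frameworkId=$id"
    else
      curl "$MESOS_MASTER/teardown" -d "frameworkId=$id"
    fi
  done
}
```

Call it with both framework IDs as arguments, first as-is to review the output, then with `DRY_RUN=0` set to perform the teardown.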

Now you can go into ZK and clean up the ZooKeeper directories under /universe/marathon-resurrection.

Complete.

@arana3
arana3 commented Mar 13, 2017

@bernadinm I am trying to use your resurrection debug utility on DC/OS 1.8 / Mesos 1.0.1.

I don't understand what "marathon_resurrection_id" should be. Is it something that I come up with using the known pre-defined format as described earlier in your gist?

Update:
After pondering a bit more on the "ZK Operation" section, I understand what you mean. To summarize:

  • Once you get the backup znode value for state/framework:id, note down the framework ID it contains (it's the last piece of text after the parenthesis delimiter). Replace {marathon_resurrect_id} with that framework ID and {orphaned_framework_id} with the actual orphaned framework ID, then execute the sed command with these values explicitly defined.

  • Make sure the local framework:id file has the correct contents. I encourage people not to modify this file directly via a GUI (incl. Exhibitor). After the file is uploaded using the last step, you can restart/un-suspend the resurrection app.

If all works as expected, the orphaned tasks are resurrected, allowing you to conduct /teardown. I've noticed in my particular case that I still had task references to "unregistered_frameworks", but at least they were not claiming resources :)