@xasima
Last active August 29, 2015 14:15
aurora configuration issue

Mesos Master and Slave status

> lsb_release -a | grep Description
Description:	Ubuntu 12.04.4 LTS

> sudo lsof -i :2181 | grep '(LISTEN)'
java      29720 zookeeper   27u  IPv4 236483534      0t0  TCP *:2181 (LISTEN)

> mesos-master --version
mesos 0.21.1

> ps aux | grep mesos-master
/usr/sbin/mesos-master --zk=zk://localhost:2181/mesos --port=5050 --log_dir=/var/log/mesos --cluster=pangramia --quorum=1 --work_dir=/var/lib/mesos

> ps aux | grep mesos-slave
/usr/sbin/mesos-slave --master=zk://localhost:2181/mesos --log_dir=/var/log/mesos --containerizers=docker,mesos --executor_registration_timeout=5mins --attributes=containerizer_path:/usr/local/bin/deimos;isolation:external;rack:a

> curl -s http://localhost:5050/stats.json | python2.7 -mjson.tool | grep activated_slaves
    "activated_slaves": 1, 
    "deactivated_slaves": 0, 
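The same health check can be scripted; a minimal sketch, assuming only the payload shape shown above (in practice the JSON would be fetched from http://localhost:5050/stats.json):

```python
import json

# Hypothetical sample of the relevant keys in the master's /stats.json
# response; in practice the JSON would be fetched over HTTP.
sample = '{"activated_slaves": 1, "deactivated_slaves": 0}'
stats = json.loads(sample)

def slaves_ok(stats, expected=1):
    """True when the expected slave count is active and none are down."""
    return (stats.get("activated_slaves") == expected
            and stats.get("deactivated_slaves", 0) == 0)
```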

Zookeeper paths


> sudo /usr/share/zookeeper/bin/zkCli.sh 
[zk: localhost:2181(CONNECTED) 0] ls /mesos
[info_0000000000, log_replicas]

[zk: localhost:2181(CONNECTED) 3] ls /
[local, cassandraMesos, mesos, zookeeper]

[zk: localhost:2181(CONNECTED) 3] ls /local/service/mesos-native-log
[0000000010]

UPDATED: after removing the zk_in_proc flag, the Aurora scheduler created its path in ZooKeeper:
[zk: localhost:2181(CONNECTED) 1] ls /aurora/scheduler
[member_0000000011, singleton_candidate_0000000010]

AURORA Configuration

> env | grep AURORA
AURORA_HOME=/usr/local/aurora-scheduler
AURORA_DIST=/opt/apache-aurora-0.7.0-incubating

> cd $AURORA_DIST/third_party && ls -a
.   mesos-0.20.1-py2.7-linux-x86_64.egg  mesos.interface-0.20.1-py2.7.egg  mesos.native-0.20.1-py2.7-linux-x86_64.egg
..  mesos-0.21.0-py2.7-linux-x86_64.egg  mesos.interface-0.21.1-py2.7.egg  mesos.native-0.21.1-py2.7-linux-x86_64.egg

> cd $AURORA_DIST/dist && ls -a
.  ..  aurora.pex  gc_executor.pex  thermos_executor.pex  thermos_observer.pex  thermos_runner.pex

> cat /etc/profile | grep AURORA | grep PATH
PATH=$PATH:$AURORA_HOME/bin:$AURORA_DIST/dist

vi $AURORA_HOME/bin/aurora-scheduler-devcluster

#!/bin/bash
AURORA_HOME=/usr/local/aurora-scheduler
AURORA_DIST=/opt/apache-aurora-0.7.0-incubating/dist
 # Flags controlling the JVM.
 JAVA_OPTS=(
         -Xmx2g
         -Xms2g
         # GC tuning, etc.
         -Djava.library.path=/usr/local/lib
         -Dlog4j.configuration="file:///etc/zookeeper/conf/log4j.properties"
 )

 aurora_flags=(
  -thermos_executor_path=$AURORA_DIST/thermos_executor.pex
  -gc_executor_path=$AURORA_DIST/gc_executor.pex 
  -http_port=8091
## UPDATED: FIXED by commenting out the following zk_in_proc line,
## since it forces an in-process (testing) ZooKeeper and prevents use of the real one
##  -zk_in_proc=true
  -zk_endpoints=localhost:2181
  -zk_session_timeout=2secs
#  -zk_digest_credentials=xxx:xxxx
  -serverset_path=/aurora/scheduler
  -mesos_master_address=zk://localhost:2181/mesos
#  -mesos_ssl_keyfile=src/test/resources/com/twitter/aurora/scheduler/app/AuroraTestKeyStore
  -cluster_name=devcluster
  -native_log_quorum_size=1
  -native_log_file_path=$AURORA_HOME/db
  -native_log_zk_group_path=/local/service/mesos-native-log
  -backup_dir=$AURORA_HOME/backups
  -logtostderr
  -vlog=INFO
)
export GLOG_v=0
export LIBPROCESS_PORT=8093
set -x

JAVA_OPTS="${JAVA_OPTS[*]}" exec "$AURORA_HOME/bin/aurora-scheduler" "${aurora_flags[@]}"
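The wrapper's last line relies on two different bash array expansions; a minimal standalone illustration of the difference:

```shell
#!/bin/bash
# "${ARR[*]}" joins all elements into one space-separated word, which is
# how JAVA_OPTS is passed as a single environment variable; "${ARR[@]}"
# expands to one word per element, which is how aurora_flags reaches the
# scheduler as separate command-line arguments.
JAVA_OPTS=(-Xmx2g -Xms2g)
joined="${JAVA_OPTS[*]}"

# Count the words produced by the "@" expansion: one per element.
count=0
for flag in "${JAVA_OPTS[@]}"; do
  count=$((count + 1))
done
echo "$joined ($count flags)"
```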

Log of aurora scheduler start

cd $AURORA_HOME && sudo ./aurora-scheduler-devcluster

Feb 17, 2015 8:37:04 PM com.twitter.common.args.ArgScanner process
INFO: hostname (org.apache.aurora.scheduler.app.SchedulerMain.hostname): null
Feb 17, 2015 8:37:04 PM com.twitter.common.args.ArgScanner process
INFO: http_port (org.apache.aurora.scheduler.http.JettyServerModule.http_port): 8091
Feb 17, 2015 8:37:04 PM com.twitter.common.args.ArgScanner process

INFO: serverset_path (org.apache.aurora.scheduler.app.SchedulerMain.serverset_path): /aurora/scheduler
Feb 17, 2015 8:37:04 PM com.twitter.common.args.ArgScanner process  
I0217 20:37:08.957415  1559 group.cpp:385] Trying to create path '/local/service/mesos-native-log' in ZooKeeper
2015-02-17 20:37:08,957:1403(0x7fd7eb240700):ZOO_INFO@check_events@1750: session establishment complete on server [127.0.0.1:2181], sessionId=0x14b974b1b1d0018, negotiated timeout=4000
I0217 20:37:08.958426  1563 group.cpp:313] Group process (group(1)@10.149.254.37:8093) connected to ZooKeeper
I0217 20:37:08.958518  1563 group.cpp:790] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0217 20:37:08.958554  1563 group.cpp:385] Trying to create path '/local/service/mesos-native-log' in ZooKeeper
I0217 17:37:09.154 THREAD1 com.twitter.common.util.StateMachine$Builder$1.execute: storage state machine transition PREPARED -> READY
I0217 17:37:09.160 THREAD137 org.apache.aurora.scheduler.mesos.SchedulerDriverService.startUp: Connecting to mesos master: zk://localhost:2181/mesos
I0217 20:37:09.182327  1559 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
I0217 17:37:09.184 THREAD1 com.twitter.common.application.Lifecycle.awaitShutdown: Awaiting shutdown
I0217 20:37:09.184404  1559 detector.cpp:138] Detected a new leader: (id='0')
I0217 20:37:09.184581  1560 group.cpp:659] Trying to get '/mesos/info_0000000000' in ZooKeeper
I0217 20:37:09.185600  1562 detector.cpp:433] A new leading master (UPID=master@10.149.254.37:5050) is detected
I0217 20:37:09.185735  1562 sched.cpp:234] New master detected at master@10.149.254.37:5050
I0217 17:37:09.276 THREAD139 org.apache.aurora.scheduler.mesos.MesosSchedulerImpl.registered: Registered with ID value: "20150216-110011-637441290-5050-17760-0000"
, master: id: "20150217-142747-637441290-5050-29888"
ip: 637441290
port: 5050
pid: "master@10.149.254.37:5050"
hostname: "bymsq-bsu-hmetrics002"
W0217 17:37:09.362 THREAD140 com.twitter.common.zookeeper.ServerSetImpl.join: Joining a ServerSet without a shard ID is deprecated and will soon break.
I0217 17:37:09.398 THREAD140 com.twitter.common.zookeeper.Group$ActiveMembership.join: Set group member ID to member_0000000001
I0217 17:37:09.417 THREAD132 com.twitter.common.zookeeper.ServerSetImpl$ServerSetWatcher.logChange: server set /aurora/scheduler change: from 0 members to 1
        joined:
                ServiceInstance(serviceEndpoint:Endpoint(host:bymsq-bsu-hmetrics002, port:8091), additionalEndpoints:{http=Endpoint(host:bymsq-bsu-hmetrics002, port:8091)}, status:ALIVE)
I0217 17:37:09.417 THREAD132 org.apache.aurora.scheduler.http.LeaderRedirect$SchedulerMonitor.onChange: Found leader scheduler at [ServiceInstance(serviceEndpoint:Endpoint(host:bymsq-bsu-hmetrics002, port:8091), additionalEndpoints:{http=Endpoint(host:bymsq-bsu-hmetrics002, port:8091)}, status:ALIVE)]



Check the registered status of the framework

curl -s http://localhost:8091/vars | grep framework_registered
framework_registered 1
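The /vars endpoint returns plain text, one `name value` pair per line; a small parser sketch (the parse_vars helper is hypothetical, and the sample adds one illustrative extra line):

```python
# Hypothetical parser for the scheduler's plain-text /vars endpoint,
# whose lines have the form "name value" as shown above.
def parse_vars(text):
    out = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on first whitespace run
        if len(parts) == 2:
            out[parts[0]] = parts[1]
    return out

# Sample based on the output above, plus the http_port flag value.
vars_ = parse_vars("framework_registered 1\nhttp_port 8091")
```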

Aurora /etc/aurora/clusters.json

[{
  "name": "devcluster",
  "zk": "localhost",
  "zk_port":"2181",
  "slave_root": "/var/lib/mesos",
  "slave_run_directory": "latest",
  "scheduler_zk_path": "/aurora/scheduler",
  "auth_mechanism": "UNAUTHENTICATED"
}]
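A quick sanity check of this file can be scripted; a minimal sketch, where the REQUIRED key set is illustrative, based on the fields this setup relies on:

```python
import json

# Hypothetical sanity check of /etc/aurora/clusters.json; the REQUIRED
# set below is illustrative, not the client's authoritative schema.
REQUIRED = {"name", "zk", "zk_port", "scheduler_zk_path"}

clusters = json.loads("""
[{
  "name": "devcluster",
  "zk": "localhost",
  "zk_port": "2181",
  "slave_root": "/var/lib/mesos",
  "slave_run_directory": "latest",
  "scheduler_zk_path": "/aurora/scheduler",
  "auth_mechanism": "UNAUTHENTICATED"
}]
""")

# Collect any required keys missing from any cluster entry.
missing = [k for c in clusters for k in REQUIRED - set(c)]
```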

Issue with aurora.pex job create

UPDATED: the following warnings disappeared after the zk_in_proc line was removed

aurora.pex job create devcluster/weblab/devel/hello_server hello_server.aurora 
 WARN] Could not connect to scheduler: No schedulers detected in devcluster!
 WARN] Could not connect to scheduler: Failed to connect to Zookeeper within 10 seconds.

@xasima

xasima commented Feb 17, 2015

The log of the aurora.pex client with the verbose flag.
UPDATED: the warnings disappeared; see the resolution steps at the bottom of this comment.

$AURORA_DIST/dist/aurora.pex job create devcluster/weblab/devel/hello_server /opt/aurora-pangramia/hello_server.aurora --verbose
DEBUG] Command=(['job', 'create', 'devcluster/weblab/devel/hello_server', '/opt/aurora-pangramia/hello_server.aurora', '--verbose'])
DEBUG] Config: ['import getpass\n', '\n', 'run_hello_server = Process(\n', "  name = 'run_server',\n", '  cmdline = """\n', '    while true; do\n', '      echo hello world\n', '      sleep 10\n', '    done\n', '  """)\n', ' \n', 'server_task = SequentialTask(\n', '  processes = [run_hello_server],\n', '  resources = Resources(cpu = 1.0, ram = 128*MB, disk = 64*MB)\n', ')\n', '\n', 'job = Service (\n', "  name = 'hello_server', \n", "  cluster = 'devcluster',\n", '  instances = 1, \n', '  task = server_task,\n', '  role = getpass.getuser(),\n', "  environment = 'devel',\n", "  contact = '{{role}}@localhost',\n", '  announce=Announcer(),\n', ')\n', '\n', 'jobs = [ job ] \n']
 INFO] Creating job hello_server
DEBUG] Full configuration: JobConfiguration(instanceCount=1, cronSchedule=None, cronCollisionPolicy=0, key=JobKey(environment=u'devel', role=u'weblab', name=u'hello_server'), taskConfig=TaskConfig(isService=True, priority=0, taskLinks={}, container=Container(docker=None, mesos=MesosContainer()), executorConfig=ExecutorConfig(data='{"environment": "devel", "task": {"processes": [{"daemon": false, "name": "run_server", "ephemeral": false, "max_failures": 1, "min_duration": 5, "cmdline": "\\n    while true; do\\n      echo hello world\\n      sleep 10\\n    done\\n  ", "final": false}], "name": "run_server", "finalization_wait": 30, "max_failures": 1, "max_concurrency": 0, "resources": {"disk": 67108864, "ram": 134217728, "cpu": 1.0}, "constraints": [{"order": ["run_server"]}]}, "name": "hello_server", "service": true, "max_task_failures": 1, "contact": "weblab@localhost", "priority": 0, "cluster": "devcluster", "health_check_config": {"initial_interval_secs": 15.0, "interval_secs": 10.0, "timeout_secs": 1.0, "max_consecutive_failures": 0}, "role": "weblab", "enable_hooks": false, "announce": {"primary_port": "http", "portmap": {"aurora": "http"}}, "cron_collision_policy": "KILL_EXISTING", "production": false}', name='AuroraExecutor'), metadata=frozenset([]), requestedPorts=set([u'http']), jobName=u'hello_server', environment=u'devel', ramMb=128, job=JobKey(environment=u'devel', role=u'weblab', name=u'hello_server'), production=False, diskMb=64, owner=Identity(role=u'weblab', user='weblab'), maxTaskFailures=1, contactEmail=u'weblab@localhost', numCpus=1.0, constraints=set([])), owner=Identity(role=u'weblab', user='weblab'))
DEBUG] Lock None
DEBUG] Using auth module: <apache.aurora.common.auth.auth_module.InsecureAuthModule object at 0x2750bd0>
 WARN] Could not connect to scheduler: No schedulers detected in devcluster!
 WARN] Could not connect to scheduler: Failed to connect to Zookeeper within 10 seconds.

UPDATE: this was fixed by the following steps

  • Deleting zk_in_proc='true' from the Aurora scheduler configuration, since it forces an in-process (testing) ZooKeeper and prevents use of the real one
  • Increasing the maximum number of client connections (maxClientCnxns=50) in /etc/zookeeper/zoo.cfg
  • Stopping mesos-slave and wiping the contents of /var/run/mesos/meta
  • Restarting the slave with explicit ip, hostname, and rack attributes: sudo /usr/sbin/mesos-slave --master=zk://localhost:2181/mesos --ip=10.149.6.18 --hostname=10.149.6.18 --attributes="host:10.149.6.18;rack:a" --work_dir="/var/lib/mesos" --containerizers=mesos
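The last step's invocation can be assembled programmatically; a hypothetical sketch (the slave_command helper is illustrative, not part of Mesos) that also shows why the --attributes value must be quoted, since the shell would otherwise split it at ';':

```python
# Hypothetical helper that builds the mesos-slave invocation used above;
# attribute pairs use Mesos's "key:value;key:value" syntax, which must be
# quoted so the shell does not treat ';' as a command separator.
def slave_command(master, ip, attributes, work_dir="/var/lib/mesos"):
    attr = ";".join("%s:%s" % (k, v) for k, v in attributes)
    return ('sudo /usr/sbin/mesos-slave --master=%s --ip=%s --hostname=%s '
            '--attributes="%s" --work_dir="%s" --containerizers=mesos'
            % (master, ip, ip, attr, work_dir))

cmd = slave_command("zk://localhost:2181/mesos", "10.149.6.18",
                    [("host", "10.149.6.18"), ("rack", "a")])
```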

@xasima

xasima commented Feb 18, 2015

The following command creates the job successfully, but the UI then reports a 'THROTTLED : Rescheduled, penalized for 60000 ms for flapping' error:

aurora.pex job create devcluster/weblab/devel/hello_server hello_server.aurora

UPDATED: see the comment below for how to investigate an error in a specific task on the slave from the command line, since the Aurora scheduler UI does not provide that level of detail. The root cause of the THROTTLED state was a missing library dependency of the mesos egg.

Here is the hello_server.aurora content

cat hello_server.aurora

import getpass

run_hello_server = Process(
  name = 'run_server',
  cmdline = """
    while true; do
      echo hello world
      sleep 10
    done
  """)

server_task = SequentialTask(
  processes = [run_hello_server],
  resources = Resources(cpu = 1.0, ram = 128*MB, disk = 64*MB)
)

job = Service (
  name = 'hello_server', 
  cluster = 'devcluster',
  instances = 1, 
  task = server_task,
  role = getpass.getuser(),
  environment = 'devel',
  contact = '{{role}}@localhost',
  announce=Announcer(),
)

jobs = [ job ]

Here is the log of scheduler

I0218 14:26:54.722 THREAD120 org.apache.aurora.scheduler.http.RequestLogger$1.log: bymsq-bsu-hmetrics002 10.149.254.37 [18/Feb/2015:14:26:54 +0000] "POST /api HTTP/1.1" 200 0 "-" "Aurora V2" 2
I0218 14:26:54.756 THREAD120 org.apache.aurora.scheduler.thrift.aop.LoggingInterceptor.invoke: createJob(JobConfiguration(key:JobKey(role:weblab, environment:devel, name:hello_server), owner:Identity(role:weblab, user:weblab), cronSchedule:null, cronCollisionPolicy:KILL_EXISTING, taskConfig:TaskConfig(job:JobKey(role:weblab, environment:devel, name:hello_server), owner:Identity(role:weblab, user:weblab), environment:devel, jobName:hello_server, isService:true, numCpus:1.0, ramMb:128, diskMb:64, priority:0, maxTaskFailures:1, production:false, constraints:[], requestedPorts:[http], container:<Container mesos:MesosContainer()>, taskLinks:{}, contactEmail:weblab@localhost, executorConfig:ExecutorConfig(name:BLANKED, data:BLANKED), metadata:[]), instanceCount:1), null, SessionKey(mechanism:UNAUTHENTICATED, data:50 D0 14 4C 71 0D 4C 80 80 4C 40))
I0218 14:26:54.783 THREAD120 org.apache.aurora.scheduler.thrift.SchedulerThriftInterface$2.apply: Launching 1 tasks.
I0218 14:26:54.793 THREAD120 com.twitter.common.util.StateMachine$Builder$1.execute: 1424269614784-weblab-devel-hello_server-0-ec81a5cd-472d-41ed-a12d-7d998b816896 state machine transition INIT -> PENDING
I0218 14:26:54.794 THREAD120 org.apache.aurora.scheduler.state.TaskStateMachine.addFollowup: Adding work command SAVE_STATE for 1424269614784-weblab-devel-hello_server-0-ec81a5cd-472d-41ed-a12d-7d998b816896
I0218 17:26:54.802815 18374 log.cpp:680] Attempting to append 1423 bytes to the log
I0218 17:26:54.802994 18378 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 183
I0218 17:26:54.803382 18378 replica.cpp:508] Replica received write request for position 183
I0218 17:26:54.810410 18378 leveldb.cpp:343] Persisting action (1443 bytes) to leveldb took 6.951663ms
I0218 17:26:54.810544 18378 replica.cpp:676] Persisted action at 183
I0218 17:26:54.810905 18375 replica.cpp:655] Replica received learned notice for position 183
I0218 17:26:54.811462 18375 leveldb.cpp:343] Persisting action (1445 bytes) to leveldb took 417579ns
I0218 17:26:54.811560 18375 replica.cpp:676] Persisted action at 183
I0218 17:26:54.811604 18375 replica.cpp:661] Replica learned APPEND action at position 183

I0218 14:26:54.837 THREAD122 org.apache.aurora.scheduler.http.RequestLogger$1.log: bymsq-bsu-hmetrics002 10.149.254.37 [18/Feb/2015:14:26:54 +0000] "POST /api HTTP/1.1" 200 0 "-" "Aurora V2" 0
I0218 14:26:54.839 THREAD153 com.twitter.common.util.StateMachine$Builder$1.execute: 1424269614784-weblab-devel-hello_server-0-ec81a5cd-472d-41ed-a12d-7d998b816896 state machine transition PENDING -> ASSIGNED
I0218 14:26:54.840 THREAD153 org.apache.aurora.scheduler.state.TaskStateMachine.addFollowup: Adding work command SAVE_STATE for 1424269614784-weblab-devel-hello_server-0-ec81a5cd-472d-41ed-a12d-7d998b816896
I0218 14:26:54.841 THREAD153 org.apache.aurora.scheduler.state.TaskAssigner$TaskAssignerImpl.assign: Offer on slave 10.149.6.18 (id 20150218-142336-637441290-5050-14692-S3) is being assigned task for 1424269614784-weblab-devel-hello_server-0-ec81a5cd-472d-41ed-a12d-7d998b816896.
I0218 14:26:54.845 THREAD122 org.apache.aurora.scheduler.thrift.aop.LoggingInterceptor.invoke: getTasksWithoutConfigs(TaskQuery(owner:null, role:null, environment:null, jobName:null, taskIds:null, statuses:null, instanceIds:null, slaveHosts:null, jobKeys:[JobKey(role:weblab, environment:devel, name:hello_server)], offset:0, limit:0))
I0218 17:26:54.866507 18373 log.cpp:680] Attempting to append 1555 bytes to the log
I0218 17:26:54.866672 18380 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 184
I0218 17:26:54.867007 18376 replica.cpp:508] Replica received write request for position 184
I0218 17:26:54.867882 18376 leveldb.cpp:343] Persisting action (1575 bytes) to leveldb took 821507ns
I0218 17:26:54.867974 18376 replica.cpp:676] Persisted action at 184
I0218 17:26:54.868127 18376 replica.cpp:655] Replica received learned notice for position 184
I0218 17:26:54.868386 18376 leveldb.cpp:343] Persisting action (1577 bytes) to leveldb took 183964ns
I0218 17:26:54.868444 18376 replica.cpp:676] Persisted action at 184
I0218 17:26:54.868466 18376 replica.cpp:661] Replica learned APPEND action at position 184
I0218 14:26:55.745 THREAD154 org.apache.aurora.scheduler.mesos.MesosSchedulerImpl.statusUpdate: Received status update for task 1424269614784-weblab-devel-hello_server-0-ec81a5cd-472d-41ed-a12d-7d998b816896 in state TASK_LOST with core message Executor terminated
I0218 14:26:55.747 THREAD154 com.twitter.common.util.StateMachine$Builder$1.execute: 1424269614784-weblab-devel-hello_server-0-ec81a5cd-472d-41ed-a12d-7d998b816896 state machine transition ASSIGNED -> LOST
I0218 14:26:55.748 THREAD154 org.apache.aurora.scheduler.state.TaskStateMachine.addFollowup: Adding work command RESCHEDULE for 1424269614784-weblab-devel-hello_server-0-ec81a5cd-472d-41ed-a12d-7d998b816896
I0218 14:26:55.748 THREAD154 org.apache.aurora.scheduler.state.TaskStateMachine.addFollowup: Adding work command KILL for 1424269614784-weblab-devel-hello_server-0-ec81a5cd-472d-41ed-a12d-7d998b816896
I0218 14:26:55.749 THREAD154 org.apache.aurora.scheduler.state.TaskStateMachine.addFollowup: Adding work command SAVE_STATE for 1424269614784-weblab-devel-hello_server-0-ec81a5cd-472d-41ed-a12d-7d998b816896
I0218 14:26:55.750 THREAD154 org.apache.aurora.scheduler.state.StateManagerImpl.updateTaskAndExternalState: Task being rescheduled: 1424269614784-weblab-devel-hello_server-0-ec81a5cd-472d-41ed-a12d-7d998b816896
I0218 14:26:55.754 THREAD154 com.twitter.common.util.StateMachine$Builder$1.execute: 1424269615751-weblab-devel-hello_server-0-d3181d51-669d-4ed3-aea2-8a7a82711b92 state machine transition INIT -> PENDING

I0218 14:26:55.755 THREAD154 org.apache.aurora.scheduler.state.TaskStateMachine.addFollowup: Adding work command SAVE_STATE for 1424269615751-weblab-devel-hello_server-0-d3181d51-669d-4ed3-aea2-8a7a82711b92
I0218 17:26:55.759238 18377 log.cpp:680] Attempting to append 3121 bytes to the log
I0218 17:26:55.759443 18378 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 185
I0218 17:26:55.759907 18379 replica.cpp:508] Replica received write request for position 185
I0218 17:26:55.761018 18379 leveldb.cpp:343] Persisting action (3141 bytes) to leveldb took 1.055476ms
I0218 17:26:55.761073 18379 replica.cpp:676] Persisted action at 185
I0218 17:26:55.761339 18373 replica.cpp:655] Replica received learned notice for position 185
I0218 17:26:55.762336 18373 leveldb.cpp:343] Persisting action (3143 bytes) to leveldb took 764461ns
I0218 17:26:55.762425 18373 replica.cpp:676] Persisted action at 185
I0218 17:26:55.762467 18373 replica.cpp:661] Replica learned APPEND action at position 185
I0218 14:26:56.805 THREAD153 com.twitter.common.util.StateMachine$Builder$1.execute: 1424269615751-weblab-devel-hello_server-0-d3181d51-669d-4ed3-aea2-8a7a82711b92 state machine transition PENDING -> ASSIGNED
I0218 14:26:56.805 THREAD153 org.apache.aurora.scheduler.state.TaskStateMachine.addFollowup: Adding work command SAVE_STATE for 1424269615751-weblab-devel-hello_server-0-d3181d51-669d-4ed3-aea2-8a7a82711b92
I0218 14:26:56.807 THREAD153 org.apache.aurora.scheduler.state.TaskAssigner$TaskAssignerImpl.assign: Offer on slave 10.149.6.18 (id 20150218-142336-637441290-5050-14692-S3) is being assigned task for 1424269615751-weblab-devel-hello_server-0-d3181d51-669d-4ed3-aea2-8a7a82711b92.
I0218 17:26:56.809345 18375 log.cpp:680] Attempting to append 1658 bytes to the log
I0218 17:26:56.809607 18375 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 186
I0218 17:26:56.810067 18377 replica.cpp:508] Replica received write request for position 186
I0218 17:26:56.810922 18377 leveldb.cpp:343] Persisting action (1678 bytes) to leveldb took 786073ns
I0218 17:26:56.810971 18377 replica.cpp:676] Persisted action at 186
I0218 17:26:56.811234 18377 replica.cpp:655] Replica received learned notice for position 186
I0218 17:26:56.811946 18377 leveldb.cpp:343] Persisting action (1680 bytes) to leveldb took 556765ns
I0218 17:26:56.812022 18377 replica.cpp:676] Persisted action at 186
I0218 17:26:56.812042 18377 replica.cpp:661] Replica learned APPEND action at position 186
I0218 14:26:57.759 THREAD156 org.apache.aurora.scheduler.mesos.MesosSchedulerImpl.statusUpdate: Received status update for task 1424269615751-weblab-devel-hello_server-0-d3181d51-669d-4ed3-aea2-8a7a82711b92 in state TASK_LOST with core message Executor terminated
I0218 14:26:57.761 THREAD156 com.twitter.common.util.StateMachine$Builder$1.execute: 1424269615751-weblab-devel-hello_server-0-d3181d51-669d-4ed3-aea2-8a7a82711b92 state machine transition ASSIGNED -> LOST
I0218 14:26:57.761 THREAD156 org.apache.aurora.scheduler.state.TaskStateMachine.addFollowup: Adding work command RESCHEDULE for 1424269615751-weblab-devel-hello_server-0-d3181d51-669d-4ed3-aea2-8a7a82711b92
I0218 14:26:57.761 THREAD156 org.apache.aurora.scheduler.state.TaskStateMachine.addFollowup: Adding work command KILL for 1424269615751-weblab-devel-hello_server-0-d3181d51-669d-4ed3-aea2-8a7a82711b92
I0218 14:26:57.762 THREAD156 org.apache.aurora.scheduler.state.TaskStateMachine.addFollowup: Adding work command SAVE_STATE for 1424269615751-weblab-devel-hello_server-0-d3181d51-669d-4ed3-aea2-8a7a82711b92
I0218 14:26:57.764 THREAD156 org.apache.aurora.scheduler.state.StateManagerImpl.updateTaskAndExternalState: Task being rescheduled: 1424269615751-weblab-devel-hello_server-0-d3181d51-669d-4ed3-aea2-8a7a82711b92
I0218 14:26:57.765 THREAD156 org.apache.aurora.scheduler.async.RescheduleCalculator$RescheduleCalculatorImpl.getFlappingPenaltyMs: Ancestor of 1424269615751-weblab-devel-hello_server-0-d3181d51-669d-4ed3-aea2-8a7a82711b92 flapped: 1424269614784-weblab-devel-hello_server-0-ec81a5cd-472d-41ed-a12d-7d998b816896
I0218 14:26:57.767 THREAD156 com.twitter.common.util.StateMachine$Builder$1.execute: 1424269617765-weblab-devel-hello_server-0-af677862-4121-4046-8341-33f8152e2038 state machine transition INIT -> THROTTLED
I0218 14:26:57.767 THREAD156 org.apache.aurora.scheduler.state.TaskStateMachine.addFollowup: Adding work command SAVE_STATE for 1424269617765-weblab-devel-hello_server-0-af677862-4121-4046-8341-33f8152e2038
I0218 17:26:57.769611 18373 log.cpp:680] Attempting to append 3261 bytes to the log
I0218 17:26:57.769760 18373 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 187
I0218 17:26:57.770073 18373 replica.cpp:508] Replica received write request for position 187
I0218 14:26:57.770 THREAD135 org.apache.aurora.scheduler.async.RescheduleCalculator$RescheduleCalculatorImpl.getFlappingPenaltyMs: Ancestor of 1424269617765-weblab-devel-hello_server-0-af677862-4121-4046-8341-33f8152e2038 flapped: 1424269615751-weblab-devel-hello_server-0-d3181d51-669d-4ed3-aea2-8a7a82711b92
I0218 17:26:57.770854 18373 leveldb.cpp:343] Persisting action (3281 bytes) to leveldb took 669197ns
I0218 17:26:57.770892 18373 replica.cpp:676] Persisted action at 187
I0218 17:26:57.771126 18378 replica.cpp:655] Replica received learned notice for position 187
I0218 17:26:57.771597 18378 leveldb.cpp:343] Persisting action (3283 bytes) to leveldb took 449424ns
I0218 14:26:57.771 THREAD135 org.apache.aurora.scheduler.async.RescheduleCalculator$RescheduleCalculatorImpl.getFlappingPenaltyMs: Ancestor of 1424269617765-weblab-devel-hello_server-0-af677862-4121-4046-8341-33f8152e2038 flapped: 1424269614784-weblab-devel-hello_server-0-ec81a5cd-472d-41ed-a12d-7d998b816896
I0218 17:26:57.771631 18378 replica.cpp:676] Persisted action at 187
I0218 17:26:57.771643 18378 replica.cpp:661] Replica learned APPEND action at position 187

@xasima

xasima commented Feb 18, 2015

Per valuable help from the Aurora dev mailing list (Bill Farner, Steve Niemitz, et al.), the following lines report that the executor is dying, so the scheduler is backing off the task:


I0218 14:26:57.759 THREAD156 org.apache.aurora.scheduler.mesos.MesosSchedulerImpl.statusUpdate: Received status update for task 1424269615751-weblab-devel-hello_server-0-d3181d51-669d-4ed3-aea2-8a7a82711b92 in state TASK_LOST with core message Executor terminated

To investigate, we need to find the stderr of the corresponding task on the slave:

cd /var/lib/mesos/slaves/
find . -name thermos-1424269615751-weblab-devel-hello_server-0-d3181d51-669d-4ed3-aea2-8a7a82711b92

vi thermos-1424269615751-weblab-devel-hello_server-0-d3181d51-669d-4ed3-aea2-8a7a82711b92/runs/latest/stderr

The task complains about a missing dependency of one of the mesos eggs, which I had downloaded instead of building myself:

ImportError: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.18' not found (required by /home/weblab/.pex/install/mesos.native-0.20.1-py2.7-linux-x86_64.egg.3ad559b2b9ba2a363146049c27abff70d4860891/mesos.native-0.20.1-py2.7-linux-x86_64.egg/mesos/native/_mesos.so)

Here are the instructions to overcome it:

strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX
GLIBCXX_3.4
...
GLIBCXX_3.4.16
GLIBCXX_DEBUG_MESSAGE_LENGTH

sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install libstdc++6-4.7-dev

strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX
GLIBCXX_3.4
...
GLIBCXX_3.4.18
GLIBCXX_3.4.19
GLIBCXX_3.4.20
GLIBCXX_DEBUG_MESSAGE_LENGTH
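The before/after check can be automated; a sketch that parses the GLIBCXX tags from the `strings` output and tests whether a required symbol version is available (the has_glibcxx helper is hypothetical):

```python
# Hypothetical check: given the tags emitted by
# `strings libstdc++.so.6 | grep GLIBCXX`, decide whether a required
# version (3.4.18 in the ImportError above) is available.
def has_glibcxx(tags, required):
    def key(v):
        # Compare versions numerically, e.g. "3.4.18" -> [3, 4, 18].
        return [int(p) for p in v.split(".")]
    versions = [t.split("_", 1)[1] for t in tags
                if t.startswith("GLIBCXX_") and t[8:9].isdigit()]
    return any(key(v) >= key(required) for v in versions)

# Samples mirroring the strings output before and after the upgrade.
before = ["GLIBCXX_3.4", "GLIBCXX_3.4.16", "GLIBCXX_DEBUG_MESSAGE_LENGTH"]
after = ["GLIBCXX_3.4", "GLIBCXX_3.4.18", "GLIBCXX_3.4.20",
         "GLIBCXX_DEBUG_MESSAGE_LENGTH"]
```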

@xasima

xasima commented Feb 18, 2015

To see the status line of job execution in the Aurora scheduler UI, we need to start the Thermos observer:

/opt/apache-aurora-0.7.0-incubating/dist$ ./thermos_observer.pex --root=/var/run/thermos
Writing log files to disk in /var/tmp
Bottle v0.11.6 server starting up (using CherryPyServer())...
Listening on http://0.0.0.0:1338/

That finishes the hello world tutorial, run from a normal (non-Vagrant-based) install.
