Dec 9 14:27:03 acc-db-01 [248740.420900] divide error: 0000 [#1] SMP | |
Dec 9 14:27:03 acc-db-01 [248740.428791] last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map | |
Dec 9 14:27:03 acc-db-01 [248740.444194] CPU 6 | |
Dec 9 14:27:03 acc-db-01 [248740.450660] Modules linked in: btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs reiserfs xfs exportfs nfs lockd nfs_acl auth_rpcgss sunrpc ipmi_devintf ipmi_si ipmi_msghandler autofs4 bonding fbcon tileblit font bitblit softcursor vga16fb vgastate bnx2 psmouse dell_wmi serio_raw joydev power_meter dcdbas lp parport ses enclosure usbhid hid megaraid_sas | |
Dec 9 14:27:03 acc-db-01 [248740.516499] Pid: 17864, comm: dsm_sa_snmp32d Not tainted 2.6.32-22-generic #33-Ubuntu PowerEdge R710 | |
Dec 9 14:27:03 acc-db-01 [248740.538698] RIP: 0010:[<ffffffff8105621c>] [<ffffffff8105621c>] find_busiest_group+0x63c/0x900 | |
Dec 9 14:27:03 acc-db-01 [248740.561223] RSP: 0018:ffff880604711b88 EFLAGS: 00010046 | |
Dec 9 14:27:03 acc-db-01 [248740.574067] RAX: 0000000000000000 RBX: ffff880604711d54 RCX: 0000000000000001 | |
Dec 9 14:27:03 acc-db-01 [248740.597125] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 | |
Dec 9 14:27:03 acc-db-01 [248740.621991] RBP: ffff880604711cf8 R08: ffff88034ac6fd88 R09: 0000000000000040 | |
Dec 9 14:27:03 acc-db-01 [248740.648293] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffff | |
Dec 9 14:27:03 acc-db-01 [248740.676802] R13: 0000000000015bc0 R14: ffffffffffffffff R15: 0000000000000000 | |
Dec 9 14:27:03 acc-db-01 [248740.706983] FS: 0000000000000000(0000) GS:ffff88034ac60000(0063) knlGS:00000000f56ffb70 | |
Dec 9 14:27:03 acc-db-01 [248740.739611] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b | |
Dec 9 14:27:03 acc-db-01 [248740.757771] CR2: 00007f593183a000 CR3: 00000002abc38000 CR4: 00000000000006e0 | |
Dec 9 14:27:03 acc-db-01 [248740.790812] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 | |
Dec 9 14:27:03 acc-db-01 [248740.824413] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 | |
Dec 9 14:27:03 acc-db-01 [248740.858726] Process dsm_sa_snmp32d (pid: 17864, threadinfo ffff880604710000, task ffff88062d4a44d0) | |
Dec 9 14:27:03 acc-db-01 [248740.895283] Stack: | |
Dec 9 14:27:03 acc-db-01 [248740.910801] ffff880604711c98 ffff880604711c08 ffff880604711d40 0000000000000cce | |
Dec 9 14:27:03 acc-db-01 [248740.932047] <0> ffff88034ac6fc60 00000006810fae02 000000010000000e 0000000000000008 | |
Dec 9 14:27:03 acc-db-01 [248740.967302] <0> 0000000000015bc0 0000000000015bc0 ffff88034ac6fd70 0000000000015bc0 | |
Dec 9 14:27:03 acc-db-01 [248741.016468] Call Trace: | |
Dec 9 14:27:03 acc-db-01 [248741.032629] [<ffffffff8105c928>] load_balance_newidle+0xa8/0x310 | |
Dec 9 14:27:03 acc-db-01 [248741.052237] [<ffffffff8153ea7a>] thread_return+0x35a/0x420 | |
Dec 9 14:27:03 acc-db-01 [248741.071095] [<ffffffff8153fd4d>] do_nanosleep+0x8d/0xc0 | |
Dec 9 14:27:03 acc-db-01 [248741.089458] [<ffffffff81089834>] hrtimer_nanosleep+0xc4/0x180 | |
Dec 9 14:27:03 acc-db-01 [248741.108107] [<ffffffff81088550>] ? hrtimer_wakeup+0x0/0x30 | |
Dec 9 14:27:03 acc-db-01 [248741.126252] [<ffffffff81089664>] ? hrtimer_start_range_ns+0x14/0x20 | |
Dec 9 14:27:03 acc-db-01 [248741.144902] [<ffffffff810acee4>] compat_sys_nanosleep+0xb4/0x120 | |
Dec 9 14:27:03 acc-db-01 [248741.163137] [<ffffffff8104870f>] sysenter_dispatch+0x7/0x2e | |
Dec 9 14:27:03 acc-db-01 [248741.180700] Code: ff c7 85 c4 fe ff ff 01 00 00 00 e9 95 fb ff ff 0f 1f 80 00 00 00 00 48 8b 95 e0 fe ff ff 48 8b 45 a8 8b 72 08 48 c1 e0 0a 31 d2 <48> f7 f6 48 8b 75 b0 48 89 45 a0 31 c0 48 85 f6 74 0c 48 8b 45 | |
Dec 9 14:27:03 acc-db-01 [248741.242654] RIP [<ffffffff8105621c>] find_busiest_group+0x63c/0x900 | |
Dec 9 14:27:03 acc-db-01 [248741.261816] RSP <ffff880604711b88> | |
Dec 9 14:27:03 acc-db-01 [248747.675681] ------------[ cut here ]------------ | |
Dec 9 14:27:03 acc-db-01 [248747.692559] WARNING: at /build/buildd/linux-2.6.32/net/sched/sch_generic.c:261 dev_watchdog+0x262/0x270() | |
Dec 9 14:27:03 acc-db-01 [248747.725803] Hardware name: PowerEdge R710 | |
Dec 9 14:27:03 acc-db-01 [248747.741343] NETDEV WATCHDOG: eth0 (bnx2): transmit queue 3 timed out | |
Dec 9 14:27:03 acc-db-01 [248747.759239] Modules linked in: btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs reiserfs xfs exportfs nfs lockd nfs_acl auth_rpcgss sunrpc ipmi_devintf ipmi_si ipmi_msghandler autofs4 bonding fbcon tileblit font bitblit softcursor vga16fb vgastate bnx2 psmouse dell_wmi serio_raw joydev power_meter dcdbas lp parport ses enclosure usbhid hid megaraid_sas | |
Dec 9 14:27:03 acc-db-01 [248747.856877] Pid: 0, comm: swapper Not tainted 2.6.32-22-generic #33-Ubuntu | |
Dec 9 14:27:03 acc-db-01 [248747.875145] Call Trace: | |
Dec 9 14:27:03 acc-db-01 [248747.888978] <IRQ> [<ffffffff81066d0b>] warn_slowpath_common+0x7b/0xc0 | |
Dec 9 14:27:03 acc-db-01 [248747.907068] [<ffffffff81066db1>] warn_slowpath_fmt+0x41/0x50 | |
Dec 9 14:27:03 acc-db-01 [248747.923951] [<ffffffff814765e2>] dev_watchdog+0x262/0x270 | |
Dec 9 14:27:03 acc-db-01 [248747.940388] [<ffffffff8108b37d>] ? sched_clock_cpu+0xcd/0x110 | |
Dec 9 14:27:03 acc-db-01 [248747.957298] [<ffffffff8101a103>] ? native_sched_clock+0x13/0x60 | |
Dec 9 14:27:03 acc-db-01 [248747.974338] [<ffffffff81019e59>] ? sched_clock+0x9/0x10 | |
Dec 9 14:27:03 acc-db-01 [248747.990651] [<ffffffff81476380>] ? dev_watchdog+0x0/0x270 | |
Dec 9 14:27:03 acc-db-01 [248748.007098] [<ffffffff81077697>] run_timer_softirq+0x197/0x340 | |
Dec 9 14:27:03 acc-db-01 [248748.024087] [<ffffffff81094870>] ? tick_sched_timer+0x0/0xc0 | |
Dec 9 14:27:03 acc-db-01 [248748.040982] [<ffffffff8108f523>] ? ktime_get+0x63/0xe0 | |
Dec 9 14:27:03 acc-db-01 [248748.057391] [<ffffffff8106e3a7>] __do_softirq+0xb7/0x1e0 | |
Dec 9 14:27:03 acc-db-01 [248748.073871] [<ffffffff8109445a>] ? tick_program_event+0x2a/0x30 | |
Dec 9 14:27:03 acc-db-01 [248748.090929] [<ffffffff810142ec>] call_softirq+0x1c/0x30 | |
Dec 9 14:27:03 acc-db-01 [248748.107042] [<ffffffff81015cb5>] do_softirq+0x65/0xa0 | |
Dec 9 14:27:03 acc-db-01 [248748.122742] [<ffffffff8106e245>] irq_exit+0x85/0x90 | |
Dec 9 14:27:03 acc-db-01 [248748.137184] [<ffffffff81545f91>] smp_apic_timer_interrupt+0x71/0x9c | |
Dec 9 14:27:03 acc-db-01 [248748.153075] [<ffffffff81013cb3>] apic_timer_interrupt+0x13/0x20 | |
Dec 9 14:27:03 acc-db-01 [248748.168536] <EOI> [<ffffffff8130d337>] ? acpi_idle_enter_bm+0x28a/0x2be | |
Dec 9 14:27:03 acc-db-01 [248748.185007] [<ffffffff8130d330>] ? acpi_idle_enter_bm+0x283/0x2be | |
Dec 9 14:27:03 acc-db-01 [248748.200691] [<ffffffff81437507>] ? cpuidle_idle_call+0xa7/0x140 | |
Dec 9 14:27:03 acc-db-01 [248748.216235] [<ffffffff81011e73>] ? cpu_idle+0xb3/0x110 | |
Dec 9 14:27:03 acc-db-01 [248748.231001] [<ffffffff8153ad4b>] ? start_secondary+0xa8/0xaa | |
Dec 9 14:27:03 acc-db-01 [248748.246119] ---[ end trace d893f09a380f2ae2 ]--- |
Ubuntu backported btrfs in lucid.
Not sure it's a btrfs issue; it looks like it could be related to networking or the CPU (APIC or SMP).
Leading theory is this bug: http://bit.ly/eku2pj
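For anyone reading the oops by hand, here is a rough sketch (not from the bug report) that pulls the registers and the faulting opcode out of the paste. The `OOPS` string is trimmed from the log above, and the helper names `parse_registers` and `faulting_bytes` are made up for illustration. The bytes marked `<48> f7 f6` are the x86-64 encoding of `div %rsi`, and RSI is zero in the register dump, which is consistent with the "divide error" in find_busiest_group.

```python
import re

# Lines copied from the oops above, with the syslog prefix removed and the
# Code: line trimmed to the bytes around the fault.
OOPS = """\
RAX: 0000000000000000 RBX: ffff880604711d54 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Code: 8b 72 08 48 c1 e0 0a 31 d2 <48> f7 f6 48 8b 75 b0
"""


def parse_registers(text):
    """Return {register name: integer value} for each 'REG: <16 hex digits>' pair."""
    pairs = re.findall(r"\b(R[A-Z0-9]{2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


def faulting_bytes(text, count=3):
    """Return `count` opcode bytes starting at the one the kernel marked with <..>."""
    tokens = re.search(r"Code: (.+)", text).group(1).split()
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    return [tok.strip("<>") for tok in tokens[start:start + count]]


regs = parse_registers(OOPS)
insn = faulting_bytes(OOPS)

# 48 f7 f6 is REX.W + F7 /6 with rm=rsi, i.e. `div %rsi`.
if insn == ["48", "f7", "f6"]:
    print("div %rsi with RSI =", hex(regs["RSI"]))
```

This only confirms what the divisor was at the time of the fault; it does not say why the scheduler ended up with a zero there.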
Yeah, I was looking further down. I kind of cringed when I saw the bnx2 driver. I've had random problems with those NICs in the past. It might be worth checking with Dell for some firmware updates related to capacity. It could also just be a backlog effect from bringing the box back online with a shitload of traffic to it.
I would take one of the boxes out of the "cluster". Remove whatever is sending them traffic and see if it's stable or not. Then bring it back in.
You need this: http://launchpadlibrarian.net/58956370/lp614853.patch
In lieu of patching the kernel for now, wouldn't it be possible to switch schedulers at boot to avoid the bug?
That's the exact patch we have started using.
@lusis Apparently switching schedulers does not help. Haven't tested this independently.
@tweibley Yes, I worked with the original bug reporter on this a while back. Yes, bnx2 is a piece of shit, and no, switching schedulers doesn't help.
@ice799 Know anything about C states (see http://support.citrix.com/article/CTX127395) and Ubuntu? http://lists.us.dell.com/pipermail/linux-poweredge/2010-May/042280.html also.
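If it helps with the C states question, a quick way to see which idle states the kernel is exposing is to read the cpuidle entries in sysfs. This is a minimal sketch, assuming the running kernel has cpuidle enabled and exposes `/sys/devices/system/cpu/cpu0/cpuidle/stateN/name`; the function name `list_cstates` is made up here, and the function returns an empty list if those files are missing.

```python
import glob
import os


def list_cstates(cpu_dir="/sys/devices/system/cpu/cpu0"):
    """Return [(state directory, state name)] read from the cpuidle sysfs tree.

    Assumes the cpuidle layout <cpu_dir>/cpuidle/stateN/name. Returns an
    empty list when the directory or the files do not exist.
    """
    found = []
    for state_dir in glob.glob(os.path.join(cpu_dir, "cpuidle", "state*")):
        index = os.path.basename(state_dir)[len("state"):]
        if not index.isdigit():
            continue
        try:
            with open(os.path.join(state_dir, "name")) as fh:
                found.append((int(index), fh.read().strip()))
        except OSError:
            continue
    # Sort numerically so state10 comes after state2.
    return [("state%d" % idx, name) for idx, name in sorted(found)]


for state, name in list_cstates():
    print(state, name)
```

On a box without cpuidle support this prints nothing, which is itself a useful data point.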
I've asked around rigled2 @ #rhel on freenode. Line 4: Dec 9 14:27:03 acc-db-01 [248740.450660] Modules linked in: btrfs, on kernel 2.6.32-22-generic. Support for btrfs was added in 2.6.35, so is this an incompatible module?
Also lines #38, 39, and 40, which seem to back this up.