@dhh
Created December 9, 2010 14:49
Dec 9 14:27:03 acc-db-01 [248740.420900] divide error: 0000 [#1] SMP
Dec 9 14:27:03 acc-db-01 [248740.428791] last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
Dec 9 14:27:03 acc-db-01 [248740.444194] CPU 6
Dec 9 14:27:03 acc-db-01 [248740.450660] Modules linked in: btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs reiserfs xfs exportfs nfs lockd nfs_acl auth_rpcgss sunrpc ipmi_devintf ipmi_si ipmi_msghandler autofs4 bonding fbcon tileblit font bitblit softcursor vga16fb vgastate bnx2 psmouse dell_wmi serio_raw joydev power_meter dcdbas lp parport ses enclosure usbhid hid megaraid_sas
Dec 9 14:27:03 acc-db-01 [248740.516499] Pid: 17864, comm: dsm_sa_snmp32d Not tainted 2.6.32-22-generic #33-Ubuntu PowerEdge R710
Dec 9 14:27:03 acc-db-01 [248740.538698] RIP: 0010:[<ffffffff8105621c>] [<ffffffff8105621c>] find_busiest_group+0x63c/0x900
Dec 9 14:27:03 acc-db-01 [248740.561223] RSP: 0018:ffff880604711b88 EFLAGS: 00010046
Dec 9 14:27:03 acc-db-01 [248740.574067] RAX: 0000000000000000 RBX: ffff880604711d54 RCX: 0000000000000001
Dec 9 14:27:03 acc-db-01 [248740.597125] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Dec 9 14:27:03 acc-db-01 [248740.621991] RBP: ffff880604711cf8 R08: ffff88034ac6fd88 R09: 0000000000000040
Dec 9 14:27:03 acc-db-01 [248740.648293] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffff
Dec 9 14:27:03 acc-db-01 [248740.676802] R13: 0000000000015bc0 R14: ffffffffffffffff R15: 0000000000000000
Dec 9 14:27:03 acc-db-01 [248740.706983] FS: 0000000000000000(0000) GS:ffff88034ac60000(0063) knlGS:00000000f56ffb70
Dec 9 14:27:03 acc-db-01 [248740.739611] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
Dec 9 14:27:03 acc-db-01 [248740.757771] CR2: 00007f593183a000 CR3: 00000002abc38000 CR4: 00000000000006e0
Dec 9 14:27:03 acc-db-01 [248740.790812] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 9 14:27:03 acc-db-01 [248740.824413] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec 9 14:27:03 acc-db-01 [248740.858726] Process dsm_sa_snmp32d (pid: 17864, threadinfo ffff880604710000, task ffff88062d4a44d0)
Dec 9 14:27:03 acc-db-01 [248740.895283] Stack:
Dec 9 14:27:03 acc-db-01 [248740.910801] ffff880604711c98 ffff880604711c08 ffff880604711d40 0000000000000cce
Dec 9 14:27:03 acc-db-01 [248740.932047] <0> ffff88034ac6fc60 00000006810fae02 000000010000000e 0000000000000008
Dec 9 14:27:03 acc-db-01 [248740.967302] <0> 0000000000015bc0 0000000000015bc0 ffff88034ac6fd70 0000000000015bc0
Dec 9 14:27:03 acc-db-01 [248741.016468] Call Trace:
Dec 9 14:27:03 acc-db-01 [248741.032629] [<ffffffff8105c928>] load_balance_newidle+0xa8/0x310
Dec 9 14:27:03 acc-db-01 [248741.052237] [<ffffffff8153ea7a>] thread_return+0x35a/0x420
Dec 9 14:27:03 acc-db-01 [248741.071095] [<ffffffff8153fd4d>] do_nanosleep+0x8d/0xc0
Dec 9 14:27:03 acc-db-01 [248741.089458] [<ffffffff81089834>] hrtimer_nanosleep+0xc4/0x180
Dec 9 14:27:03 acc-db-01 [248741.108107] [<ffffffff81088550>] ? hrtimer_wakeup+0x0/0x30
Dec 9 14:27:03 acc-db-01 [248741.126252] [<ffffffff81089664>] ? hrtimer_start_range_ns+0x14/0x20
Dec 9 14:27:03 acc-db-01 [248741.144902] [<ffffffff810acee4>] compat_sys_nanosleep+0xb4/0x120
Dec 9 14:27:03 acc-db-01 [248741.163137] [<ffffffff8104870f>] sysenter_dispatch+0x7/0x2e
Dec 9 14:27:03 acc-db-01 [248741.180700] Code: ff c7 85 c4 fe ff ff 01 00 00 00 e9 95 fb ff ff 0f 1f 80 00 00 00 00 48 8b 95 e0 fe ff ff 48 8b 45 a8 8b 72 08 48 c1 e0 0a 31 d2 <48> f7 f6 48 8b 75 b0 48 89 45 a0 31 c0 48 85 f6 74 0c 48 8b 45
Dec 9 14:27:03 acc-db-01 [248741.242654] RIP [<ffffffff8105621c>] find_busiest_group+0x63c/0x900
Dec 9 14:27:03 acc-db-01 [248741.261816] RSP <ffff880604711b88>
Dec 9 14:27:03 acc-db-01 [248747.675681] ------------[ cut here ]------------
Dec 9 14:27:03 acc-db-01 [248747.692559] WARNING: at /build/buildd/linux-2.6.32/net/sched/sch_generic.c:261 dev_watchdog+0x262/0x270()
Dec 9 14:27:03 acc-db-01 [248747.725803] Hardware name: PowerEdge R710
Dec 9 14:27:03 acc-db-01 [248747.741343] NETDEV WATCHDOG: eth0 (bnx2): transmit queue 3 timed out
Dec 9 14:27:03 acc-db-01 [248747.759239] Modules linked in: btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs reiserfs xfs exportfs nfs lockd nfs_acl auth_rpcgss sunrpc ipmi_devintf ipmi_si ipmi_msghandler autofs4 bonding fbcon tileblit font bitblit softcursor vga16fb vgastate bnx2 psmouse dell_wmi serio_raw joydev power_meter dcdbas lp parport ses enclosure usbhid hid megaraid_sas
Dec 9 14:27:03 acc-db-01 [248747.856877] Pid: 0, comm: swapper Not tainted 2.6.32-22-generic #33-Ubuntu
Dec 9 14:27:03 acc-db-01 [248747.875145] Call Trace:
Dec 9 14:27:03 acc-db-01 [248747.888978] <IRQ> [<ffffffff81066d0b>] warn_slowpath_common+0x7b/0xc0
Dec 9 14:27:03 acc-db-01 [248747.907068] [<ffffffff81066db1>] warn_slowpath_fmt+0x41/0x50
Dec 9 14:27:03 acc-db-01 [248747.923951] [<ffffffff814765e2>] dev_watchdog+0x262/0x270
Dec 9 14:27:03 acc-db-01 [248747.940388] [<ffffffff8108b37d>] ? sched_clock_cpu+0xcd/0x110
Dec 9 14:27:03 acc-db-01 [248747.957298] [<ffffffff8101a103>] ? native_sched_clock+0x13/0x60
Dec 9 14:27:03 acc-db-01 [248747.974338] [<ffffffff81019e59>] ? sched_clock+0x9/0x10
Dec 9 14:27:03 acc-db-01 [248747.990651] [<ffffffff81476380>] ? dev_watchdog+0x0/0x270
Dec 9 14:27:03 acc-db-01 [248748.007098] [<ffffffff81077697>] run_timer_softirq+0x197/0x340
Dec 9 14:27:03 acc-db-01 [248748.024087] [<ffffffff81094870>] ? tick_sched_timer+0x0/0xc0
Dec 9 14:27:03 acc-db-01 [248748.040982] [<ffffffff8108f523>] ? ktime_get+0x63/0xe0
Dec 9 14:27:03 acc-db-01 [248748.057391] [<ffffffff8106e3a7>] __do_softirq+0xb7/0x1e0
Dec 9 14:27:03 acc-db-01 [248748.073871] [<ffffffff8109445a>] ? tick_program_event+0x2a/0x30
Dec 9 14:27:03 acc-db-01 [248748.090929] [<ffffffff810142ec>] call_softirq+0x1c/0x30
Dec 9 14:27:03 acc-db-01 [248748.107042] [<ffffffff81015cb5>] do_softirq+0x65/0xa0
Dec 9 14:27:03 acc-db-01 [248748.122742] [<ffffffff8106e245>] irq_exit+0x85/0x90
Dec 9 14:27:03 acc-db-01 [248748.137184] [<ffffffff81545f91>] smp_apic_timer_interrupt+0x71/0x9c
Dec 9 14:27:03 acc-db-01 [248748.153075] [<ffffffff81013cb3>] apic_timer_interrupt+0x13/0x20
Dec 9 14:27:03 acc-db-01 [248748.168536] <EOI> [<ffffffff8130d337>] ? acpi_idle_enter_bm+0x28a/0x2be
Dec 9 14:27:03 acc-db-01 [248748.185007] [<ffffffff8130d330>] ? acpi_idle_enter_bm+0x283/0x2be
Dec 9 14:27:03 acc-db-01 [248748.200691] [<ffffffff81437507>] ? cpuidle_idle_call+0xa7/0x140
Dec 9 14:27:03 acc-db-01 [248748.216235] [<ffffffff81011e73>] ? cpu_idle+0xb3/0x110
Dec 9 14:27:03 acc-db-01 [248748.231001] [<ffffffff8153ad4b>] ? start_secondary+0xa8/0xaa
Dec 9 14:27:03 acc-db-01 [248748.246119] ---[ end trace d893f09a380f2ae2 ]---
@Oneiroi

Oneiroi commented Dec 9, 2010

I've asked around (rigled2 @ #rhel on freenode). Line 4: "Dec 9 14:27:03 acc-db-01 [248740.450660] Modules linked in: btrfs". The kernel is 2.6.32-22-generic, but support for btrfs was added in 2.6.35. Incompatible module?

Also lines #38, 39, and 40, which seem to back this up.

@lusis

lusis commented Dec 9, 2010

Ubuntu backported btrfs in lucid.

@jefftrudeau

Not sure it's a btrfs issue; it looks like it could be related to networking or the CPU (APIC or SMP).
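
One way to narrow this down from the oops itself is to decode the faulting instruction. In the "Code:" line the byte in angle brackets marks where RIP was pointing, and the register dump gives the operand values. The following is a minimal sketch, assuming the two relevant log lines are pasted in with the syslog prefix stripped; the function names are hypothetical and the decoder only handles the single opcode form seen here (0xF7 with a register operand), not general x86.

```python
import re

# Copied from the oops above, syslog prefix stripped.
CODE_LINE = (
    "Code: ff c7 85 c4 fe ff ff 01 00 00 00 e9 95 fb ff ff 0f 1f 80 00 00 00 00 "
    "48 8b 95 e0 fe ff ff 48 8b 45 a8 8b 72 08 48 c1 e0 0a 31 d2 <48> f7 f6 "
    "48 8b 75 b0 48 89 45 a0 31 c0 48 85 f6 74 0c 48 8b 45"
)
REG_LINE = "RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000"

# 64-bit register names indexed by the 3-bit ModRM r/m field (no REX.B).
REGS64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]
# Opcode 0xF7 selects its operation through the ModRM reg field.
F7_GROUP = {4: "mul", 5: "imul", 6: "div", 7: "idiv"}


def faulting_bytes(code_line, count=3):
    """Return `count` bytes starting at the <xx> marker in a Code: line."""
    tokens = code_line.split()[1:]  # drop the leading "Code:"
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    return [int(tok.strip("<>"), 16) for tok in tokens[start:start + count]]


def parse_registers(reg_line):
    """Map register names to integer values from one register dump line."""
    pairs = re.findall(r"(\w+): ([0-9a-f]{16})", reg_line)
    return {name: int(value, 16) for name, value in pairs}


def decode_f7(insn):
    """Decode 'REX.W F7 /r' with a register operand; return (mnemonic, reg)."""
    rex, opcode, modrm = insn
    if rex != 0x48 or opcode != 0xF7 or (modrm >> 6) != 0b11:
        raise ValueError("not a REX.W 0xF7 register-operand instruction")
    return F7_GROUP[(modrm >> 3) & 7], REGS64[modrm & 7]


if __name__ == "__main__":
    mnemonic, reg = decode_f7(faulting_bytes(CODE_LINE))
    value = parse_registers(REG_LINE)[reg.upper()]
    print(f"faulting instruction: {mnemonic} %{reg}, {reg.upper()} = {value:#x}")
```

Run against this oops, the bytes 48 f7 f6 decode to a 64-bit divide by RSI, and the dump shows RSI as zero. That points at a division by zero inside find_busiest_group (scheduler load balancing) rather than at any of the filesystem modules that merely appear in "Modules linked in".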

@dhh
Author

dhh commented Dec 9, 2010

Leading theory is this bug: http://bit.ly/eku2pj

@lusis

lusis commented Dec 9, 2010

Yeah, I was looking down. I kind of cringed when I saw the bnx2 driver; I've had random problems with those NICs in the past. It might be worth checking with Dell for firmware updates related to capacity. It could also just be a backlog effect from bringing the box back online with a shitload of traffic headed to it.

I would take one of the boxes out of the "cluster". Remove whatever is sending them traffic and see if it's stable or not. Then bring it back in.
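
Before pulling a box, it may help to check the order of the two events in the log. Every line carries the same syslog time (14:27:03), so only the kernel's seconds-since-boot stamps in square brackets show ordering. Here is a minimal sketch, assuming the two lines below are copied from the log with the syslog prefix stripped; the function names are hypothetical.

```python
import re

# Copied from the log above, syslog prefix stripped.
OOPS = "[248740.420900] divide error: 0000 [#1] SMP"
WATCHDOG = "[248747.741343] NETDEV WATCHDOG: eth0 (bnx2): transmit queue 3 timed out"


def kernel_seconds(line):
    """Return the seconds-since-boot value from a '[sec.usec]' printk stamp."""
    match = re.search(r"\[\s*(\d+\.\d+)\]", line)
    if match is None:
        raise ValueError("no kernel timestamp in line")
    return float(match.group(1))


def gap_seconds(first, second):
    """Seconds elapsed between two kernel log lines (positive if second is later)."""
    return kernel_seconds(second) - kernel_seconds(first)


if __name__ == "__main__":
    gap = gap_seconds(OOPS, WATCHDOG)
    print(f"bnx2 watchdog fired {gap:.2f} s after the divide error")
```

For this log the bnx2 transmit timeout comes roughly seven seconds after the divide error, not before it. Ordering alone does not prove cause, but it fits the NIC timeout being a symptom of the earlier crash as well as it fits an independent NIC problem.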

@ice799

ice799 commented Dec 9, 2010

@lusis

lusis commented Dec 9, 2010

In lieu of patching the kernel for now, wouldn't it be possible to switch schedulers at boot to avoid the bug?

@tweibley

tweibley commented Dec 9, 2010

That's the exact patch we have started using.

@tweibley

tweibley commented Dec 9, 2010

@lusis Apparently switching schedulers does not help. Haven't tested this independently.

@ice799

ice799 commented Dec 9, 2010

@tweibley Yes, I worked with the original bug reporter on this a while back. Yes, bnx2 is a piece of shit, and no, switching schedulers doesn't help.

@tweibley

tweibley commented Dec 9, 2010
