BUG: unable to handle kernel NULL pointer dereference at pick_next_task_fair

Ran_Tavory1 · October 16, 2012, 12:21pm

I’m running Chef on Ubuntu instances on AWS, and a few of them have crashed.
Looking at the logs, I found that the chef-client process, which runs as an
init.d script, hits what seems to be a kernel bug.

Full details are here: https://gist.github.com/3885828

And here’s the gist of the gist:

[119832.732086] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
[119832.732111] IP: [] pick_next_task_fair+0xa7/0x1a0
[119832.732124] PGD 1cee6f067 PUD 1cd5e7067 PMD 0
[119832.732132] Oops: 0000 [#1] SMP
[119832.732137] last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map
[119832.732145] CPU 1
[119832.732147] Modules linked in: acpiphp raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid10 raid6_pq async_tx raid1 raid0 multipath linear
[119832.732172]
[119832.732177] Pid: 7896, comm: chef-client Not tainted 2.6.38-8-virtual #42-Ubuntu
…

$ uname -a
Linux hostname 2.6.38-8-virtual #42-Ubuntu SMP Mon Apr 11 04:06:34 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

Has anyone seen this?
I’ve been googling it, but nothing of note came up…

I see a few ways out…

1. Upgrade Ubuntu and hope I don’t see this again. (It doesn’t happen every
day, but in the month I’ve been using Chef it has happened four times on three
different hosts, out of 20 hosts.)
2. Instead of using the init script, use cron or some other method to run chef-client
on the hour.
3. Move to CentOS or do something else desperate…

I’d be happy to get to the bottom of it and see where the bug really comes from
and whether it was fixed in newer versions of the kernel, but I shamefully admit
that so far I’ve failed to do so.
What would you do?

kallistec · October 16, 2012, 3:21pm

I don't know anything about this particular bug, but it seems likely that ohai is triggering it while reading some entry in /proc (we've seen other bugs triggered by ohai reading /proc entries before).

To have confidence in #1, you'll probably have to track down the specific kernel bug and, if there's a fix, find out which version it was fixed in. Then see if Ubuntu has backported the patch to your release. Additionally, I'd recommend adding some checks to your monitoring system, lest you accidentally spin up a new box with an old ISO/AMI/whatever.
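
A monitoring check along those lines could look roughly like the sketch below. It is only an illustration: the names `KNOWN_BAD_RELEASES`, `is_known_bad` and `check_running_kernel` are made up for this example, and the only release listed is the one from the oops in this thread, since nobody here has confirmed which kernels are actually affected or fixed. The return codes follow the common Nagios-style convention (0 = OK, 2 = critical), which is also an assumption about your monitoring system.

```python
import platform

# Assumption: the only release known to be affected is the one from the
# oops reported in this thread. Extend the set once the bug is tracked down.
KNOWN_BAD_RELEASES = {"2.6.38-8-virtual"}


def is_known_bad(release, known_bad=KNOWN_BAD_RELEASES):
    """Return True if the kernel release string is on the known-bad list."""
    return release.strip() in known_bad


def check_running_kernel():
    """Return (exit_code, message) for the kernel this host is running."""
    release = platform.release()  # same string that `uname -r` prints
    if is_known_bad(release):
        return 2, "CRITICAL: running known-bad kernel %s" % release
    return 0, "OK: kernel %s is not on the known-bad list" % release
```

A wrapper script would print the message and exit with the returned code, so that a freshly launched box with an old AMI shows up as critical right away.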

Unless something about this bug means that it is only triggered during system startup, #2 probably won't work.

As for option #3, you could try to pinpoint which ohai plugin is triggering the bug and then disable it until a kernel patch is available (or a workaround is implemented in ohai, if possible).

--
Daniel DeLeo

Ran_Tavory1 · October 18, 2012, 12:21pm

Daniel, thanks. I'm running an experiment: I disabled all the ohai plugins I
could find, and I'll wait a few days/weeks to see if something crashes again.
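
To tell whether a later crash is the same bug, something like the following sketch could pull the faulting symbol and the process name out of a saved kernel log. It is not an official tool: the function name `summarize_oops` and the two regular expressions are my own, written against the oops lines quoted earlier in this thread, and other kernel versions may format these lines differently.

```python
import re

# Matches e.g. "IP: [] pick_next_task_fair+0xa7/0x1a0". The bracketed
# address may be empty (as in the paste above) or hold a hex address.
IP_RE = re.compile(
    r"IP: \[[^\]]*\] (?P<symbol>\w+)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
)
# Matches e.g. "Pid: 7896, comm: chef-client Not tainted ..."
COMM_RE = re.compile(r"Pid: (?P<pid>\d+), comm: (?P<comm>\S+)")


def summarize_oops(log_text):
    """Return a dict describing the first oops found in log_text (may be empty)."""
    summary = {}
    ip = IP_RE.search(log_text)
    if ip:
        summary["symbol"] = ip.group("symbol")
        summary["offset"] = int(ip.group("offset"), 16)
        summary["size"] = int(ip.group("size"), 16)
    comm = COMM_RE.search(log_text)
    if comm:
        summary["pid"] = int(comm.group("pid"))
        summary["comm"] = comm.group("comm")
    return summary


sample = (
    "[119832.732111] IP: [] pick_next_task_fair+0xa7/0x1a0\n"
    "[119832.732177] Pid: 7896, comm: chef-client Not tainted 2.6.38-8-virtual\n"
)
print(summarize_oops(sample))
```

If a new crash reports a different symbol, or a `comm` other than chef-client, that would suggest the ohai plugins were not the (only) trigger.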

