AWS auto-scaling and Chef cleanup

If using AWS auto-scaling + Chef, the final step of instance cleanup seems to
be slightly unclear.

One solution is to run a script in /etc/rc0.d, which is called on shutdown, that
runs "knife node delete". This requires knife to be configured and working on the
instance, which is a (minor) pain. This method will also fail after an abrupt
machine crash.
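
For what it's worth, a recipe along these lines could drop such a hook in place. Treat this as a sketch, not a tested implementation: the script name, runlevel priority, and the use of the node's own /etc/chef/client.rb for knife credentials are all assumptions.

```ruby
# Sketch only: install a K-prefixed script in /etc/rc0.d so init runs it
# at shutdown (runlevel 0). Assumes knife works on the node with its own
# client key via /etc/chef/client.rb; name and priority are arbitrary.
file '/etc/rc0.d/K15chef-deregister' do
  mode '0755'
  content <<-SCRIPT
#!/bin/sh
knife node delete #{node.name} -y -c /etc/chef/client.rb
knife client delete #{node.name} -y -c /etc/chef/client.rb
  SCRIPT
end
```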

Another solution is to have a script which queries the Chef server for instances
that haven't checked in for a while and removes those. That would require
having chef-client running very often or as a daemon.
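
A sketch of such a reaper, runnable from a workstation with knife exec; the two-hour cutoff is an arbitrary choice here. It relies on ohai_time, which records the epoch time of each node's last check-in.

```ruby
# prune_stale_nodes.rb -- sketch only; run as: knife exec prune_stale_nodes.rb
# Deletes nodes (and their clients) that haven't checked in for two hours.
cutoff = Time.now.to_i - (2 * 60 * 60)
nodes.find("ohai_time:[* TO #{cutoff}]") do |n|
  puts "pruning stale node #{n.name}"
  api.delete("/nodes/#{n.name}")
  api.delete("/clients/#{n.name}")
end
```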

I wonder what the security implications would be of adding functionality into
chef-client:

chef-client --remove-self-from-server
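
Roughly, such a flag would have to do something like the following with the node's own credentials (a sketch against the Chef 11-era Chef::REST API, assuming node_name is set in client.rb). The catch is that deleting the client object normally requires more than a plain node's permissions, which is presumably where the security question bites:

```ruby
# Sketch only: what a hypothetical --remove-self-from-server might do.
require 'chef'

Chef::Config.from_file('/etc/chef/client.rb')
rest = Chef::REST.new(Chef::Config[:chef_server_url])
rest.delete_rest("nodes/#{Chef::Config[:node_name]}")   # allowed for the node itself
rest.delete_rest("clients/#{Chef::Config[:node_name]}") # usually needs admin rights
```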

Some people have posted about a script which checks for terminated instances
and removes them. This sounds like the best way. Perhaps they mean to
query AWS first, and then make changes to the Chef server. Now to figure out
how…

This is what I use: https://gist.github.com/stormerider/5600427 -- bear in
mind you're still using a node with knife access, but it doesn't have to be
the Chef server. Any workstation can suffice.

This was based on a script that floated around the mailing list a while
back. I modified it so that a) it checks more than one AWS account (at one
point we had three; now we're down to two and soon just one), and b) it
checks more than one AWS region. For Chef 11 I also had to modify it so
that it uses the embedded Chef Ruby (and I also had to install the aws gem
inside Chef's embedded Ruby as well).

This doesn't necessarily catch everything. I do have some tool scripts to
find nodes that exist in EC2 that don't exist in Chef (and vice versa...).
If for some reason the ec2.instance_id field gets nulled out, the node can
get stuck in limbo. This also applies to nodes that are in a stopped (vs.
terminated) state, because in that case the instance_id field is still
valid, but everything else is bogus.

--
~~ StormeRider ~~

"Every world needs its heroes [...] They inspire us to be better than we
are. And they protect from the darkness that's just around the corner."

(from Smallville Season 6x1: "Zod")

On why I hate the phrase "that's so lame"... http://bit.ly/Ps3uSS


I have one machine profile that runs on auto-scale. It's an ephemeral worker that listens on a Sidekiq queue and performs tasks.

I took three steps to ensure that those don't keep polluting the Chef Server.

1 - During the first Chef run, it adds to its own run list a recipe that actually deletes itself (node and client) from the server. This way, on the second run it gets deleted (see the sketch below). -- Note that this machine gets configured once and stays that way until its life ends, which is perfectly fine for my use case, but might not be for yours;
2 - Another recipe adds an rc0 script to delete it when it shuts down (this is in case the first Chef run never completes; ideally it would be executed at compile time). Alternatively, this script could be baked into the AMI or created in the user-data script;
3 - I have a cron job that searches for stray nodes and deletes them -- how to do that will depend on your setup, but you seem to have a pretty good grasp of what you'll need.

As for setting up knife, this is a non-issue. Just point it to your client.rb:

```
knife node delete <%= node.name %> -y -c /etc/chef/client.rb
knife client delete <%= node.name %> -y -c /etc/chef/client.rb
```
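
A minimal sketch of how step 1 might hang together (cookbook and recipe names are hypothetical, not Cassiano's actual code):

```ruby
# First chef-client run: append the self-destruct recipe to this node's
# run list. The node object is saved at the end of the run, so the new
# recipe only fires on the *second* run.
ruby_block 'schedule self-deregistration' do
  block do
    node.run_list << 'recipe[worker::deregister]'
  end
  not_if { node.run_list.include?('recipe[worker::deregister]') }
end
```

The worker::deregister recipe itself would then be little more than the two knife commands above wrapped in execute resources.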

It's been working quite well for me for about four or five months already, but my setup isn't anything very fancy. :)

Hope this helps a bit!

  • cassiano


Hi,
Auto Scaling supports notifications. You can have Auto Scaling actions
generate events in an SQS queue, which you can then process at your
leisure. I'd just run a script that pops notifications and, when it sees a
termination notification, uses Ridley/Spice to remove the client/node from Chef.
Thanks,
-Thom

Docs on AutoScaling notifications:

commentary on hooking SNS to SQS:

Ridley: https://github.com/berkshelf/ridley (a reliable Chef API client with a clean syntax)
Spice: https://github.com/danryan/spice (a zesty Chef server API wrapper)
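
For the consumer side, a rough sketch, assuming an SQS queue already subscribed to the Auto Scaling group's SNS topic. The queue name, server URL, and credentials are made up, and the APIs are the aws-sdk v1 and Ridley of that era:

```ruby
# Sketch only: pop Auto Scaling notifications off SQS and deregister
# terminated instances from Chef via Ridley.
require 'aws-sdk' # v1
require 'ridley'
require 'json'

chef = Ridley.new(
  server_url:  'https://chef.example.com',
  client_name: 'reaper',
  client_key:  '/etc/chef/reaper.pem'
)

AWS::SQS.new.queues.named('chef-deregistration').poll do |msg|
  # SNS wraps the Auto Scaling payload in its own JSON envelope.
  event = JSON.parse(JSON.parse(msg.body)['Message'])
  next unless event['Event'] == 'autoscaling:EC2_INSTANCE_TERMINATE'

  chef.search(:node, "ec2_instance_id:#{event['EC2InstanceId']}").each do |node|
    chef.node.delete(node.name)
    chef.client.delete(node.name)
  end
end
```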


http://www.nuvolecomputing.com/2012/07/02/chef-node-de-registration-for-autoscaling-groups/

(Not my article)

This is not self-cleanup and requires a backend workflow/process, but it
can be expanded in any number of ways.

  • alex


--
Alex Corley | Software as a Service Engineer
Zenoss, Inc. | Transforming IT Operations
acorley@zenoss.com | Skype: acorley_zenoss | Github: anthroprose

Based upon that Nuvole Computing article, I wrote this: https://github.com/bmhatfield/chef-deregistration-manager

Might be useful.


Thanks for all the great replies!!

Morgan: that checks for terminated instances in AWS, but they may
completely vanish from AWS (no longer even showing as 'terminated')
and so might never get processed unless the script is run often?
Cassiano: very cool about knife not having to be set up.
Thom: SNS, SQS looks like the way to go.
Alex: Nuvole has an implementation for that.
Brian: has an implementation based on the Nuvole article.
A step further each time, it seems... will look into these suggestions.


Sam, that's not exactly what it does. It generates a list of all the EC2
instance_id values, and then does a knife search for all nodes with the
attribute ec2_instance_id. (You can duplicate this on the command line with
knife search node "ec2_instance_id:*". Note that the actual attribute for
-a parameters to other commands would be ec2:instance_id; this bit of
syntax is somewhat confusing IMO, but I'm guessing there's a good reason
for "flattening" the Ohai trees for search.) It then compares these two
lists and prunes any Chef nodes that have the attribute but don't have a
corresponding EC2 listing.
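
In other words, the core of it is something like this (sketch only: aws-sdk v1, a hardcoded region list standing in for the gist's multi-account/multi-region loop, and a knife.rb supplying the Chef credentials):

```ruby
# Collect live EC2 instance ids, then prune any Chef node whose
# ec2_instance_id no longer matches one of them.
require 'aws-sdk' # v1
require 'chef'
require 'chef/search/query'

regions  = %w[us-east-1 us-west-2]
live_ids = regions.flat_map do |region|
  AWS::EC2.new(region: region).instances.map(&:id)
end

Chef::Config.from_file(File.expand_path('~/.chef/knife.rb'))
nodes, = Chef::Search::Query.new.search(:node, 'ec2_instance_id:*')

nodes.each do |node|
  next if live_ids.include?(node['ec2']['instance_id'])
  node.destroy                            # prune the node object...
  Chef::ApiClient.load(node.name).destroy # ...and its API client
end
```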

I haven't run into any problems with this personally, but it just occurs to
me that if there's a non-fatal error with the EC2 calls (say a region is
having issues with the API service, but EC2 is running fine), then it could
potentially end up pruning a node that's still valid. However, assuming
you're not running the chef-client::delete_validation recipe (and I don't see
much of an issue leaving the validation key on a running system vs. configuring
that in an AMI...), then the next chef-client run will re-register and everything
should likely be fine. (Although I'm not sure if that would honor the -j
flag passed to chef-client on startup if it's running via init... which
could lead to a node with an empty run list. If you're running it via cron,
or manually, and always specifying your JSON file or configuring the
run list otherwise, it shouldn't be an issue either.)

YMMV; this is certainly an easy way to get up and running while you look
into other solutions. If you happen to find that another one resolves
issues you see with this, I'd certainly be interested in hearing about it.
