Dealing with unavailable resources

I am a real newbie.
We use gitlab/chef/kitchen (AWS) to develop server configs, then deploy to OVH to run a custom webservice written mostly in PHP.
Then chef-client runs hourly to keep everything up-to-date.

We are using php-pear and installing Archive_Tar through it.

At this moment the pear site is down. Our recipe includes

package 'php-pear' do
  action :install
end

php_pear 'Archive_Tar' do
  action :install
end

as a result of which our chef-client process is aborting.

Now on previously-running servers the chance of either of those making a functional difference is close to zero. So I could happily add

ignore_failure true

to both of them.
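That is, something like this (just the same two resources with the property added):

package 'php-pear' do
  action :install
  ignore_failure true
end

php_pear 'Archive_Tar' do
  action :install
  ignore_failure true
end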

Similarly, if I am simply testing my chef scripting in kitchen, I don't really care whether Archive_Tar is successfully installed or not.

However, if

  • I am intending to try my application code on the VM that kitchen builds for me, or
  • I am building a new OVH server from scratch to run the app,

then I probably SHOULD abort, because the pear site is down (Lord knows what we do if it stays down).

So my question is - how would more experienced heads deal with this? Do I need to make the 'ignore_failure' conditional on the environment?

Only you can know what makes sense for your organization given time, resources, culture, legacy systems being supported, etc.

There are several considerations.

  • how critical is the requested operation
  • does it block any critical functions and what is the impact
  • how often does this happen
  • what could be done to mitigate it

Only you can answer if they are critical for your use case and company. For example, if it was removing a user from a server during a revoke, you probably want to make sure chef blows up so your monitoring will catch it and alert someone. Conversely, if it was installing some non-required dev tool, ignoring the failure might make a lot of sense.
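To make the contrast concrete, here is a rough sketch in Chef terms (the resource names are just examples, not anything from your recipes):

# Security-sensitive: let it fail loudly so monitoring catches it.
user 'departed_employee' do
  action :remove
end

# Non-required dev convenience tool: tolerating failure is reasonable.
package 'htop' do
  action :install
  ignore_failure true
end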

If you need to do it short term to get through an outage that you can't otherwise recover from (more on that later), sure, that makes sense. If it's happening every other day, you might want to consider installing from a more reliable source. You can mirror repos (host the packages yourself) on a blob storage system (S3), an artifact management solution (Artifactory), etc., if, say, specific versions keep getting yanked constantly (Oracle Java) or the upstream is down regularly.
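As a rough sketch of the mirroring idea, you could host a pinned Archive_Tar tarball somewhere you control and install from the local file, so a pear.php.net outage stops mattering (the mirror URL and version below are made up):

tarball = "#{Chef::Config[:file_cache_path]}/Archive_Tar-1.4.14.tgz"

# Pull the tarball from your own mirror (S3 bucket, Artifactory, etc.).
remote_file tarball do
  source 'https://internal-mirror.example.com/pear/Archive_Tar-1.4.14.tgz'
end

# Install from the local file only if the package is not already present.
execute 'install Archive_Tar from mirrored tarball' do
  command "pear install #{tarball}"
  not_if 'pear list | grep -q Archive_Tar'
end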

In general I highly recommend pinning a version. This has several benefits: it reduces the likelihood of software suddenly breaking due to an update, the package manager can be smart and do nothing when the pinned version is already installed even though chef requests it (which results in faster subsequent convergence times), and you can test newer versions locally before ever pushing them to a real environment such as dev, CI, production, etc.
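For example, something like this (the version numbers are invented, and I believe the php cookbook's php_pear resource accepts a version property, but check the cookbook version you are on):

php_pear 'Archive_Tar' do
  version '1.4.14'   # example only -- pin whatever you have actually tested
  action :install
end

# The core package resource also takes a version, using whatever
# version string your platform's package manager expects.
package 'php-pear' do
  version '1:1.10.12+submodules+notgz-1ubuntu1'   # example apt-style version
  action :install
end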

If it is, say, a slow server you are pulling from, or you are on an unreliable network, then you can try configuring retries and retry_delay to provide resiliency. You could argue this just prolongs the chef-client run, so you might want to make it an attribute so it can be changed more easily and set at the environment level to accommodate the desired behavior.
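A minimal sketch of the attribute-driven approach (the attribute names are made up):

# attributes/default.rb
default['myapp']['pear_retries'] = 2
default['myapp']['pear_retry_delay'] = 30

# recipes/default.rb
php_pear 'Archive_Tar' do
  action :install
  retries node['myapp']['pear_retries']
  retry_delay node['myapp']['pear_retry_delay']
end

An environment or role can then override those attributes, rather than the recipe branching on the environment name.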

I would say leave them in for testing unless there is a compelling reason not to, as that gives you feedback about potential errors before you see them in production, perhaps while touching something unrelated. I help maintain the java cookbook and we regularly suffer from issues where Oracle yanks old Java versions, resulting in a 404, and we regularly need pull requests to update the links; if we ignored all of those failures, in many cases the people reporting them would be first-time users, which makes the first experience a poor one.

Also, I would suggest avoiding coding anything around environments; it's better to let the things that need to change be passed in as inputs. Otherwise it can be hard to test code without pushing to production if that is the only place the code is exercised, which can erode confidence in your test suite. You can always leave ignore_failure commented out and then uncomment it to quickly respond to issues.