/_status endpoint intermittently failing after upgrading to chef-server-core 13.1.13-1

We have a health check monitor that hits our Chef server's /_status endpoint every minute. Since I upgraded from chef-server-core 13.0.17 to 13.1.13 last week, this endpoint has been failing intermittently, about once or twice a day, with a 500 response code and the following body:
{"status":"fail","upstreams":{"chef_solr":"fail","chef_sql":"pong","chef_index":"pong","oc_chef_action":"pong","oc_chef_authz":"pong"},"keygen":{"keys":10,"max":10,"max_workers":1,"cur_max_workers":1,"inflight":0,"avail_workers":1,"start_size":0},"indexing":{"mode":"rabbitmq","indexer_message_queue_length":0},"analytics_queue":{"queue_at_capacity":false,"dropped_since_last_check":0,"max_length":10000,"last_recorded_length":0,"total_dropped":0,"check_count":2959,"mailbox_length":0}}

This is a single Chef server running on a t3.medium in EC2 (averaging about 10% CPU and 45% memory, so I don't believe it's overloaded). It uses all default settings except for a custom SSL certificate for nginx. It has been running for a little over two years without any issues, and I haven't made any changes to our environment other than upgrading the chef-server-core package. Even though the /_status endpoint is intermittently failing, the rest of the server appears to be working fine, and I'm not seeing any issues when running chef-client.

The response seems to indicate some issue with solr; however, I'm struggling to find any actual problem with it. I've scoured all of the logs in /var/log/opscode and can't find any indication of a problem in the solr logs. The erchef crash log contains the following text, but nothing else:
2020-01-26 13:09:32 =ERROR REPORT====
{<<"method=GET; path=/_status; status=500; ">>,"Internal Server Error"}

and the erchef current log contains the following, which also isn't particularly helpful:
2020-01-26_13:09:32.55911 [error] /_status
2020-01-26_13:09:32.55912 {{status,fail},{upstreams,{[{<<"chef_solr">>,<<"fail">>},{<<"chef_sql">>,<<"pong">>},{<<"chef_index">>,<<"pong">>},{<<"oc_chef_action">>,<<"pong">>},{<<"oc_chef_authz">>,<<"pong">>}]}},{<<"analytics_queue">>,{[{queue_at_capacity,false},{dropped_since_last_check,0},{max_length,10000},{last_recorded_length,0},{total_dropped,0},{check_count,2959},{mailbox_length,0}]}}}
2020-01-26_13:09:32.56091 [error] {<<"method=GET; path=/_status; status=500; ">>,"Internal Server Error"}

I also went through every other log in the /var/log/opscode folder around the time of these failures and haven't been able to find anything interesting.

I started looking through the chef-server code to see what the /_status endpoint is actually doing. It appears that, to check chef_solr's health, erchef makes an HTTP request to http://127.0.0.1:8983/solr/admin/ping?wt=json
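
My (possibly incorrect) reading of the code is that the check amounts to something like this, where a timeout, a connection error, or a non-200 response all get reported as "fail" (the 400 ms timeout is the one I mention in the first bullet below):

    require 'net/http'
    require 'uri'

    # Rough sketch of the chef_solr health check as I understand it.
    uri = URI('http://127.0.0.1:8983/solr/admin/ping?wt=json')
    status =
      begin
        res = Net::HTTP.start(uri.host, uri.port,
                              open_timeout: 0.4, read_timeout: 0.4) do |http|
          http.get(uri.request_uri)
        end
        res.code == '200' ? 'pong' : 'fail'
      rescue StandardError
        'fail' # timeouts and connection errors also count as failures
      end
    puts "chef_solr: #{status}"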

So I set up a tcpdump of all traffic to 127.0.0.1:8983 to see what response solr was returning to the /_status endpoint. I could see every ping request coming from the /_status check and the response solr sent back. I then waited until the /_status endpoint returned a 500 again and checked the capture: during the failure, the /_status endpoint did not send any request to the solr ping endpoint at all. So it appears that, some small fraction of the time, the /_status endpoint marks chef_solr as failed without ever sending a ping request to solr. Additionally, I set up a script that runs from cron every minute and records the result of http://127.0.0.1:8983/solr/admin/ping?wt=json, to confirm that nothing is wrong with solr itself. I ran it every minute for days and the solr ping endpoint never failed once. It seems like the issue is in the erchef /_status endpoint itself, but I'm not sure what it is.
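
The cron script itself is trivial; it's essentially the following (the log path is arbitrary):

    #!/usr/bin/env ruby
    # Runs from cron every minute: record the result of solr's ping endpoint
    # so I can show that solr itself never fails. Log path is arbitrary.
    require 'net/http'
    require 'uri'
    require 'time'

    uri = URI('http://127.0.0.1:8983/solr/admin/ping?wt=json')
    result =
      begin
        res = Net::HTTP.get_response(uri)
        "HTTP #{res.code} #{res.body}"
      rescue StandardError => e
        "ERROR #{e.class}: #{e.message}"
      end

    File.open('/var/log/solr-ping-check.log', 'a') do |f|
      f.puts "#{Time.now.utc.iso8601} #{result}"
    end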

- I also saw in the code for the /_status endpoint that it has a 400 ms timeout, but when our /_status endpoint fails, it often fails within 40-50 ms, so I don't think that is the issue. At any rate, I tried increasing opscode_erchef['health_ping_timeout'] to 1000 (see the config snippet after this list), but it hasn't helped.

- I also looked through the release notes for 13.1.13, but didn't see anything obvious that could be the culprit. However, I'm fairly convinced it has something to do with a change in this release: I reverted our server to 13.0.17 from a previous snapshot and the error disappeared, and I've now done the 13.0.17 -> 13.1.13 upgrade twice, with the error appearing shortly after the upgrade both times.

- I've also tried the obvious things, like restarting the Chef server services and rebooting the instance, but that hasn't helped.
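
For reference, the timeout change in the first bullet above was just the following line in /etc/opscode/chef-server.rb, applied with chef-server-ctl reconfigure:

    # /etc/opscode/chef-server.rb
    # Bump erchef's health ping timeout from the 400 ms I saw in the code to 1 s.
    opscode_erchef['health_ping_timeout'] = 1000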

If anyone has ideas on how to troubleshoot this further, or on what might be causing it, I'd appreciate it. For now, I'm just making our health check less sensitive.

Thanks,
Chris