Problems running standalone Chef server with external database

Hey all,

I’m having some trouble deploying a standalone Chef server with an external database server (I’m using AWS RDS for the DB server).

OS: Ubuntu 14.04
Server version: chef-server-core 12.8.0

/etc/opscode/chef-server.rb:

api_fqdn 'default-ubuntu-1404.test.internal'
ip_version 'ipv4'
topology 'standalone'
bookshelf['access_key_id'] = 'SOMEACCESSKEY'
bookshelf['secret_access_key'] = 'SOMESECRETKEY'
opscode_erchef['s3_bucket'] = 'mybucket'
bookshelf['external_url'] = 'https://s3-eu-west-1.amazonaws.com'
bookshelf['vip'] = 's3-eu-west-1.amazonaws.com'
bookshelf['enable'] = false
postgresql['external'] = true
postgresql['db_superuser'] = 'chefrdsuser'
postgresql['db_superuser_password'] = 'chefrdspassword'
postgresql['vip'] = 'my-rds-endpoint.eu-west-1.rds.amazonaws.com'
postgresql['port'] = 5432

Error:

Recipe: private-chef::bootstrap
  * execute[/opt/opscode/bin/chef-server-ctl start postgresql] action run
    - execute /opt/opscode/bin/chef-server-ctl start postgresql
  * execute[/opt/opscode/bin/chef-server-ctl start oc_bifrost] action run
    - execute /opt/opscode/bin/chef-server-ctl start oc_bifrost
  * ruby_block[bootstrap-chef-server-data] action run

    ================================================================================
    Error executing action `run` on resource 'ruby_block[bootstrap-chef-server-data]'
    ================================================================================

    Errno::ECONNREFUSED
    -------------------
    Connection refused - connect(2) for "127.0.0.1" port 9463

    Cookbook Trace:
    ---------------
    /var/opt/opscode/local-mode-cache/cookbooks/private-chef/libraries/chef_server_data_bootstrap.rb:187:in `bifrost_request'
    /var/opt/opscode/local-mode-cache/cookbooks/private-chef/libraries/chef_server_data_bootstrap.rb:154:in `create_object_in_authz'
    /var/opt/opscode/local-mode-cache/cookbooks/private-chef/libraries/chef_server_data_bootstrap.rb:146:in `create_actor_in_authz'
    /var/opt/opscode/local-mode-cache/cookbooks/private-chef/libraries/chef_server_data_bootstrap.rb:36:in `bootstrap'
    /var/opt/opscode/local-mode-cache/cookbooks/private-chef/recipes/bootstrap.rb:46:in `block (2 levels) in from_file'

    Resource Declaration:
    ---------------------
    # In /var/opt/opscode/local-mode-cache/cookbooks/private-chef/recipes/bootstrap.rb

     44: ruby_block "bootstrap-chef-server-data" do
     45:   block do
     46:     ChefServerDataBootstrap.new(node).bootstrap
     47:   end
     48:   not_if { OmnibusHelper.has_been_bootstrapped? }
     49:   notifies :restart, 'service[opscode-erchef]'
     50: end
     51:

    Compiled Resource:
    ------------------
    # Declared in /var/opt/opscode/local-mode-cache/cookbooks/private-chef/recipes/bootstrap.rb:44:in `from_file'

    ruby_block("bootstrap-chef-server-data") do
      action [:run]
      retries 0
      retry_delay 2
      default_guard_interpreter :default
      block_name "bootstrap-chef-server-data"
      declared_type :ruby_block
      cookbook_name "private-chef"
      recipe_name "bootstrap"
      block #<Proc:0x00000002b20a00@/var/opt/opscode/local-mode-cache/cookbooks/private-chef/recipes/bootstrap.rb:45>
      not_if { #code block }
    end

    Platform:
    ---------
    x86_64-linux

The configuration works if I don’t use an external Postgres database (i.e. if I set postgresql['external'] = false and remove the other postgresql attributes).

The (basic) process to get to this point is:

1. Launch fresh RDS and EC2 instances
2. Update /etc/hosts with the hostname
3. Install Chef server from https://packages.chef.io
4. Update /etc/opscode/chef-server.rb
5. Run chef-server-ctl reconfigure

After the reconfigure fails, the bifrost service restarts (because of a pending service restart) and listens on 127.0.0.1:9463 (I can curl it and get a response back). However, because the setup fails before /etc/opscode/pivotal.pem is created, the next chef-server-ctl reconfigure fails too. I’ve tried removing /etc/opscode/private-chef-secrets.json and creating a new RDS instance, but that just leads to further errors.

Am I doing something wrong with my config, or is there anything else I can look into?

Regards
Kieran

After some more investigation, it looks like sqerl / pooler is crashing.

I added a section to private-chef/recipes/bootstrap.rb so that the bootstrap waits for bifrost to come online (roughly the sketch below), and with that change the bootstrap finishes. Afterwards, though, opscode-erchef never starts listening on tcp:8000.
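
For reference, the wait I added looks roughly like this (a minimal sketch, not the exact code; it goes just before the bootstrap-chef-server-data resource and simply polls bifrost's default port, 127.0.0.1:9463, until it accepts TCP connections):

# Rough sketch of the wait added to private-chef/recipes/bootstrap.rb,
# placed just before the bootstrap-chef-server-data ruby_block.
ruby_block 'wait-for-oc_bifrost' do
  block do
    require 'socket'
    connected = false
    60.times do
      begin
        # oc_bifrost listens on 127.0.0.1:9463 by default
        TCPSocket.new('127.0.0.1', 9463).close
        connected = true
        break
      rescue Errno::ECONNREFUSED, Errno::ETIMEDOUT
        sleep 5
      end
    end
    raise 'oc_bifrost did not come up on 127.0.0.1:9463' unless connected
  end
  not_if { OmnibusHelper.has_been_bootstrapped? }
end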

In the oc_bifrost and opscode-erchef logs, there are a lot of messages that mention pool 'sqerl': starting member timeout, and there is this big error dump in the opscode-erchef logs:

2016-08-04_14:44:16.57315 =ERROR REPORT==== 4-Aug-2016::14:44:16 ===
2016-08-04_14:44:16.57315 ** Generic server <0.1016.0> terminating
2016-08-04_14:44:16.57315 ** Last message in was timeout
2016-08-04_14:44:16.57315 ** When Server state == {starter,
2016-08-04_14:44:16.57315                             {pool,sqerl,undefined,20,20,
2016-08-04_14:44:16.57315                                 {sqerl_client,start_link,[]},
2016-08-04_14:44:16.57316                                 [],0,0,1,
2016-08-04_14:44:16.57316                                 {1,min},
2016-08-04_14:44:16.57316                                 {30,sec},
2016-08-04_14:44:16.57316                                 pooler_sqerl_member_sup,undefined,
2016-08-04_14:44:16.57316                                 {dict,0,16,16,8,80,48,
2016-08-04_14:44:16.57316                                     {[],[],[],[],[],[],[],[],[],[],[],[],[],
2016-08-04_14:44:16.57317                                      [],[],[]},
2016-08-04_14:44:16.57317                                     {{[],[],[],[],[],[],[],[],[],[],[],[],[],
2016-08-04_14:44:16.57317                                       [],[],[]}}},
2016-08-04_14:44:16.57317                                 {dict,0,16,16,8,80,48,
2016-08-04_14:44:16.57317                                     {[],[],[],[],[],[],[],[],[],[],[],[],[],
2016-08-04_14:44:16.57318                                      [],[],[]},
2016-08-04_14:44:16.57318                                     {{[],[],[],[],[],[],[],[],[],[],[],[],[],
2016-08-04_14:44:16.57318                                       [],[],[]}}},
2016-08-04_14:44:16.57318                                 [],
2016-08-04_14:44:16.57319                                 {1,min},
2016-08-04_14:44:16.57319                                 folsom_metrics,folsom,
2016-08-04_14:44:16.57319                                 {[],[]},
2016-08-04_14:44:16.57319                                 40},
2016-08-04_14:44:16.57319                             <0.1003.0>,undefined}
2016-08-04_14:44:16.57319 ** Reason for termination ==
2016-08-04_14:44:16.57320 ** {{killed,{gen_server,call,
2016-08-04_14:44:16.57320                         [pooler_sqerl_member_sup,{start_child,[]},infinity]}},
2016-08-04_14:44:16.57320     [{gen_server,call,3,[{file,"gen_server.erl"},{line,190}]},
2016-08-04_14:44:16.57320      {pooler_starter,do_start_member,1,
2016-08-04_14:44:16.57320                      [{file,"/var/cache/omnibus/src/oc_erchef/_build/default/lib/pooler/src/pooler_starter.erl"},
2016-08-04_14:44:16.57320                       {line,138}]},
2016-08-04_14:44:16.57321      {pooler_starter,handle_info,2,
2016-08-04_14:44:16.57321                      [{file,"/var/cache/omnibus/src/oc_erchef/_build/default/lib/pooler/src/pooler_starter.erl"},
2016-08-04_14:44:16.57321                       {line,123}]},
2016-08-04_14:44:16.57321      {gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,593}]},
2016-08-04_14:44:16.57321      {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,659}]},
2016-08-04_14:44:16.57322      {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,237}]}]}
2016-08-04_14:44:16.59409
2016-08-04_14:44:16.59410 =INFO REPORT==== 4-Aug-2016::14:44:16 ===
2016-08-04_14:44:16.59410     application: pooler
2016-08-04_14:44:16.59410     exited: {{error,
2016-08-04_14:44:16.59410               {shutdown,
2016-08-04_14:44:16.59410                {failed_to_start_child,pooler_sqerl_pool_sup,
2016-08-04_14:44:16.59410                 {shutdown,
2016-08-04_14:44:16.59411                  {failed_to_start_child,pooler,
2016-08-04_14:44:16.59411                   {badarg,
2016-08-04_14:44:16.59411                    [{ets,member,
2016-08-04_14:44:16.59411                      [folsom,<<"pooler.sqerl.starting_member_timeout">>],
2016-08-04_14:44:16.59412                      []},
2016-08-04_14:44:16.59412                     {folsom_ets,handler_exists,1,
2016-08-04_14:44:16.59412                      [{file,
2016-08-04_14:44:16.59413                        "/var/cache/omnibus/src/oc_erchef/_build/default/lib/folsom/src/folsom_ets.erl"},
2016-08-04_14:44:16.59413                       {line,96}]},
2016-08-04_14:44:16.59413                     {folsom_ets,notify,3,
2016-08-04_14:44:16.59413                      [{file,
2016-08-04_14:44:16.59413                        "/var/cache/omnibus/src/oc_erchef/_build/default/lib/folsom/src/folsom_ets.erl"},
2016-08-04_14:44:16.59413                       {line,116}]},
2016-08-04_14:44:16.59414                     {pooler,send_metric,4,
2016-08-04_14:44:16.59414                      [{file,
2016-08-04_14:44:16.59414                        "/var/cache/omnibus/src/oc_erchef/_build/default/lib/pooler/src/pooler.erl"},
2016-08-04_14:44:16.59414                       {line,818}]},
2016-08-04_14:44:16.59415                     {pooler,accumulate_starting_member_not_stale,5,
2016-08-04_14:44:16.59415                      [{file,
2016-08-04_14:44:16.59415                        "/var/cache/omnibus/src/oc_erchef/_build/default/lib/pooler/src/pooler.erl"},
2016-08-04_14:44:16.59415                       {line,501}]},
2016-08-04_14:44:16.59415                     {lists,foldl,3,[{file,"lists.erl"},{line,1261}]},
2016-08-04_14:44:16.59416                     {pooler,remove_stale_starting_members,3,
2016-08-04_14:44:16.59416                      [{file,
2016-08-04_14:44:16.59416                        "/var/cache/omnibus/src/oc_erchef/_build/default/lib/pooler/src/pooler.erl"},
2016-08-04_14:44:16.59416                       {line,490}]},
2016-08-04_14:44:16.59416                     {pooler,do_accept_member,2,
2016-08-04_14:44:16.59417                      [{file,
2016-08-04_14:44:16.59417                        "/var/cache/omnibus/src/oc_erchef/_build/default/lib/pooler/src/pooler.erl"},
2016-08-04_14:44:16.59417                       {line,396}]}]}}}}}},
2016-08-04_14:44:16.59417              {pooler_app,start,[normal,[]]}}
2016-08-04_14:44:16.59417     type: permanent

I tried looking at pooler.erl and folsom_ets.erl for clues about what’s causing the crash, but I don’t know much Erlang, so I couldn’t see anything obvious.

Is there a log for pooler that would give me more info?

Cheers
Kieran

For anyone who hits the same issue in the future: it turned out that Postgres had the ‘log_hostname’ parameter set to 1, which made it try to reverse-resolve the hostname of each connecting instance. That added a ~5-10 second delay to every initial connection, which is what caused the timeouts.
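
If you want to flip the parameter back via the API rather than the console, something along these lines should work (a rough sketch using the Ruby AWS SDK; 'chef-server-postgres' is a placeholder for whatever custom parameter group is attached to the RDS instance, since the default groups can't be modified):

require 'aws-sdk' # v2 SDK

rds = Aws::RDS::Client.new(region: 'eu-west-1')
rds.modify_db_parameter_group(
  db_parameter_group_name: 'chef-server-postgres', # placeholder name
  parameters: [{
    parameter_name:  'log_hostname',
    parameter_value: '0',
    # log_hostname is a dynamic parameter, so no instance reboot is needed
    apply_method:    'immediate'
  }]
)

You can confirm the value afterwards with SHOW log_hostname; from a psql session against the RDS endpoint.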