Problems running standalone Chef server with external database


#1

Hey all,

I’m having some trouble deploying a standalone Chef server with an external database server (I’m using AWS RDS for the DB server).

OS: Ubuntu 14.04
Server version: chef-server-core 12.8.0

/etc/opscode/chef-server.rb:

api_fqdn 'default-ubuntu-1404.test.internal'
ip_version 'ipv4'
topology 'standalone'
bookshelf['access_key_id'] = 'SOMEACCESSKEY'
bookshelf['secret_access_key'] = 'SOMESECRETKEY'
opscode_erchef['s3_bucket'] = 'mybucket'
bookshelf['external_url'] = 'https://s3-eu-west-1.amazonaws.com'
bookshelf['vip'] = 's3-eu-west-1.amazonaws.com'
bookshelf['enable'] = false
postgresql['external'] = true
postgresql['db_superuser'] = 'chefrdsuser'
postgresql['db_superuser_password'] = 'chefrdspassword'
postgresql['vip'] = 'my-rds-endpoint.eu-west-1.rds.amazonaws.com'
postgresql['port'] = 5432

Error:

Recipe: private-chef::bootstrap
  * execute[/opt/opscode/bin/chef-server-ctl start postgresql] action run
    - execute /opt/opscode/bin/chef-server-ctl start postgresql
  * execute[/opt/opscode/bin/chef-server-ctl start oc_bifrost] action run
    - execute /opt/opscode/bin/chef-server-ctl start oc_bifrost
  * ruby_block[bootstrap-chef-server-data] action run

    ================================================================================
    Error executing action `run` on resource 'ruby_block[bootstrap-chef-server-data]'
    ================================================================================

    Errno::ECONNREFUSED
    -------------------
    Connection refused - connect(2) for "127.0.0.1" port 9463

    Cookbook Trace:
    ---------------
    /var/opt/opscode/local-mode-cache/cookbooks/private-chef/libraries/chef_server_data_bootstrap.rb:187:in `bifrost_request'
    /var/opt/opscode/local-mode-cache/cookbooks/private-chef/libraries/chef_server_data_bootstrap.rb:154:in `create_object_in_authz'
    /var/opt/opscode/local-mode-cache/cookbooks/private-chef/libraries/chef_server_data_bootstrap.rb:146:in `create_actor_in_authz'
    /var/opt/opscode/local-mode-cache/cookbooks/private-chef/libraries/chef_server_data_bootstrap.rb:36:in `bootstrap'
    /var/opt/opscode/local-mode-cache/cookbooks/private-chef/recipes/bootstrap.rb:46:in `block (2 levels) in from_file'

    Resource Declaration:
    ---------------------
    # In /var/opt/opscode/local-mode-cache/cookbooks/private-chef/recipes/bootstrap.rb

     44: ruby_block "bootstrap-chef-server-data" do
     45:   block do
     46:     ChefServerDataBootstrap.new(node).bootstrap
     47:   end
     48:   not_if { OmnibusHelper.has_been_bootstrapped? }
     49:   notifies :restart, 'service[opscode-erchef]'
     50: end
     51:

    Compiled Resource:
    ------------------
    # Declared in /var/opt/opscode/local-mode-cache/cookbooks/private-chef/recipes/bootstrap.rb:44:in `from_file'

    ruby_block("bootstrap-chef-server-data") do
      action [:run]
      retries 0
      retry_delay 2
      default_guard_interpreter :default
      block_name "bootstrap-chef-server-data"
      declared_type :ruby_block
      cookbook_name "private-chef"
      recipe_name "bootstrap"
      block #<Proc:0x00000002b20a00@/var/opt/opscode/local-mode-cache/cookbooks/private-chef/recipes/bootstrap.rb:45>
      not_if { #code block }
    end

    Platform:
    ---------
    x86_64-linux

The configuration works if I don’t use an external Postgres database (i.e. if I set postgresql['external'] = false and remove the other postgresql attributes).

The (basic) process to get to this point is:
Launch fresh RDS and EC2 instances -> update /etc/hosts with the hostname -> install Chef server from https://packages.chef.io -> update /etc/opscode/chef-server.rb -> chef-server-ctl reconfigure

After the reconfigure fails, the bifrost service restarts (because of a pending service restart) and ends up listening on 127.0.0.1:9463 (I can curl it and get a response back). However, because the run fails before /etc/opscode/pivotal.pem is created, the next chef-server-ctl reconfigure fails too. I’ve tried removing /etc/opscode/private-chef-secrets.json and creating a new RDS instance, but that just leads to further errors.

Am I doing something wrong with my config, or is there anything else I can look into?

Regards
Kieran


#2

After some more investigation, it looks like sqerl / pooler is crashing.

I added a section to private-chef/recipes/bootstrap.rb so that the bootstrap waits for bifrost to come online (rough sketch below), and with that in place the bootstrap finishes. Afterwards, though, opscode-erchef never starts listening on tcp:8000.
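
The wait I added was roughly along these lines (a simplified sketch, not the exact code; it just polls the oc_bifrost port until it accepts TCP connections):

require 'socket'

ruby_block "wait-for-bifrost" do
  block do
    # Poll the bifrost listener until it accepts connections, for up to ~60 seconds
    30.times do
      begin
        TCPSocket.new('127.0.0.1', 9463).close
        break
      rescue Errno::ECONNREFUSED, Errno::ETIMEDOUT
        sleep 2
      end
    end
  end
end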

The oc_bifrost and opscode-erchef logs have a lot of messages mentioning pool 'sqerl': starting member timeout, and the opscode-erchef log also has this big error dump:

2016-08-04_14:44:16.57315 =ERROR REPORT==== 4-Aug-2016::14:44:16 ===
2016-08-04_14:44:16.57315 ** Generic server <0.1016.0> terminating
2016-08-04_14:44:16.57315 ** Last message in was timeout
2016-08-04_14:44:16.57315 ** When Server state == {starter,
2016-08-04_14:44:16.57315                             {pool,sqerl,undefined,20,20,
2016-08-04_14:44:16.57315                                 {sqerl_client,start_link,[]},
2016-08-04_14:44:16.57316                                 [],0,0,1,
2016-08-04_14:44:16.57316                                 {1,min},
2016-08-04_14:44:16.57316                                 {30,sec},
2016-08-04_14:44:16.57316                                 pooler_sqerl_member_sup,undefined,
2016-08-04_14:44:16.57316                                 {dict,0,16,16,8,80,48,
2016-08-04_14:44:16.57316                                     {[],[],[],[],[],[],[],[],[],[],[],[],[],
2016-08-04_14:44:16.57317                                      [],[],[]},
2016-08-04_14:44:16.57317                                     {{[],[],[],[],[],[],[],[],[],[],[],[],[],
2016-08-04_14:44:16.57317                                       [],[],[]}}},
2016-08-04_14:44:16.57317                                 {dict,0,16,16,8,80,48,
2016-08-04_14:44:16.57317                                     {[],[],[],[],[],[],[],[],[],[],[],[],[],
2016-08-04_14:44:16.57318                                      [],[],[]},
2016-08-04_14:44:16.57318                                     {{[],[],[],[],[],[],[],[],[],[],[],[],[],
2016-08-04_14:44:16.57318                                       [],[],[]}}},
2016-08-04_14:44:16.57318                                 [],
2016-08-04_14:44:16.57319                                 {1,min},
2016-08-04_14:44:16.57319                                 folsom_metrics,folsom,
2016-08-04_14:44:16.57319                                 {[],[]},
2016-08-04_14:44:16.57319                                 40},
2016-08-04_14:44:16.57319                             <0.1003.0>,undefined}
2016-08-04_14:44:16.57319 ** Reason for termination ==
2016-08-04_14:44:16.57320 ** {{killed,{gen_server,call,
2016-08-04_14:44:16.57320                         [pooler_sqerl_member_sup,{start_child,[]},infinity]}},
2016-08-04_14:44:16.57320     [{gen_server,call,3,[{file,"gen_server.erl"},{line,190}]},
2016-08-04_14:44:16.57320      {pooler_starter,do_start_member,1,
2016-08-04_14:44:16.57320                      [{file,"/var/cache/omnibus/src/oc_erchef/_build/default/lib/pooler/src/pooler_starter.erl"},
2016-08-04_14:44:16.57320                       {line,138}]},
2016-08-04_14:44:16.57321      {pooler_starter,handle_info,2,
2016-08-04_14:44:16.57321                      [{file,"/var/cache/omnibus/src/oc_erchef/_build/default/lib/pooler/src/pooler_starter.erl"},
2016-08-04_14:44:16.57321                       {line,123}]},
2016-08-04_14:44:16.57321      {gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,593}]},
2016-08-04_14:44:16.57321      {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,659}]},
2016-08-04_14:44:16.57322      {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,237}]}]}
2016-08-04_14:44:16.59409
2016-08-04_14:44:16.59410 =INFO REPORT==== 4-Aug-2016::14:44:16 ===
2016-08-04_14:44:16.59410     application: pooler
2016-08-04_14:44:16.59410     exited: {{error,
2016-08-04_14:44:16.59410               {shutdown,
2016-08-04_14:44:16.59410                {failed_to_start_child,pooler_sqerl_pool_sup,
2016-08-04_14:44:16.59410                 {shutdown,
2016-08-04_14:44:16.59411                  {failed_to_start_child,pooler,
2016-08-04_14:44:16.59411                   {badarg,
2016-08-04_14:44:16.59411                    [{ets,member,
2016-08-04_14:44:16.59411                      [folsom,<<"pooler.sqerl.starting_member_timeout">>],
2016-08-04_14:44:16.59412                      []},
2016-08-04_14:44:16.59412                     {folsom_ets,handler_exists,1,
2016-08-04_14:44:16.59412                      [{file,
2016-08-04_14:44:16.59413                        "/var/cache/omnibus/src/oc_erchef/_build/default/lib/folsom/src/folsom_ets.erl"},
2016-08-04_14:44:16.59413                       {line,96}]},
2016-08-04_14:44:16.59413                     {folsom_ets,notify,3,
2016-08-04_14:44:16.59413                      [{file,
2016-08-04_14:44:16.59413                        "/var/cache/omnibus/src/oc_erchef/_build/default/lib/folsom/src/folsom_ets.erl"},
2016-08-04_14:44:16.59413                       {line,116}]},
2016-08-04_14:44:16.59414                     {pooler,send_metric,4,
2016-08-04_14:44:16.59414                      [{file,
2016-08-04_14:44:16.59414                        "/var/cache/omnibus/src/oc_erchef/_build/default/lib/pooler/src/pooler.erl"},
2016-08-04_14:44:16.59414                       {line,818}]},
2016-08-04_14:44:16.59415                     {pooler,accumulate_starting_member_not_stale,5,
2016-08-04_14:44:16.59415                      [{file,
2016-08-04_14:44:16.59415                        "/var/cache/omnibus/src/oc_erchef/_build/default/lib/pooler/src/pooler.erl"},
2016-08-04_14:44:16.59415                       {line,501}]},
2016-08-04_14:44:16.59415                     {lists,foldl,3,[{file,"lists.erl"},{line,1261}]},
2016-08-04_14:44:16.59416                     {pooler,remove_stale_starting_members,3,
2016-08-04_14:44:16.59416                      [{file,
2016-08-04_14:44:16.59416                        "/var/cache/omnibus/src/oc_erchef/_build/default/lib/pooler/src/pooler.erl"},
2016-08-04_14:44:16.59416                       {line,490}]},
2016-08-04_14:44:16.59416                     {pooler,do_accept_member,2,
2016-08-04_14:44:16.59417                      [{file,
2016-08-04_14:44:16.59417                        "/var/cache/omnibus/src/oc_erchef/_build/default/lib/pooler/src/pooler.erl"},
2016-08-04_14:44:16.59417                       {line,396}]}]}}}}}},
2016-08-04_14:44:16.59417              {pooler_app,start,[normal,[]]}}
2016-08-04_14:44:16.59417     type: permanent

I tried looking at pooler.erl and folsom_ets.erl for clues as to what’s causing the crash, but I don’t know much about Erlang, so I couldn’t spot anything.

Is there a log for pooler that would give me more info?

Cheers
Kieran


#3

For anyone who hits the same issue in the future: it turned out that Postgres had the log_hostname parameter set to 1, which makes it try to reverse-resolve the hostname of each connecting client. That added a ~5-10 second delay to every initial connection, which is what caused the sqerl/pooler timeouts. Disabling log_hostname (on RDS this is done through the DB parameter group, since postgresql.conf isn’t directly editable) got rid of the delay and the timeouts.
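
If you want to confirm the setting (and see the delay) from the Chef server, something like this works (a rough sketch using the Ruby pg gem, assuming it’s installed; swap in your own endpoint and credentials):

require 'pg'          # assumes the pg gem is available (gem install pg)
require 'benchmark'

conn = nil
elapsed = Benchmark.realtime do
  conn = PG.connect(
    host:     'my-rds-endpoint.eu-west-1.rds.amazonaws.com',
    port:     5432,
    user:     'chefrdsuser',
    password: 'chefrdspassword',
    dbname:   'postgres'
  )
end

puts "initial connection took #{elapsed.round(1)}s"                     # ~5-10s with log_hostname on
puts "log_hostname = #{conn.exec('SHOW log_hostname').getvalue(0, 0)}"  # should be 'off'
conn.close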