When `hab svc load` doesn't actually start a process

Can you share your spec files? I have a suspicion I’d like to check out.

@elliott-davis / @christophermaier, the `smartb-private/api` spec looks like this:

```toml
ident = "smartb-private/api"
group = "staging"
bldr_url = "https://bldr.habitat.sh"
channel = "unstable"
topology = "standalone"
update_strategy = "at-once"
binds = ["redis:redis.staging"]
binding_mode = "strict"
desired_state = "up"
```

I also see there is no spec file for smartb/redis for some reason - digging into why that might be.

Hrmm… what I was thinking isn’t happening (which is good).

Everything looks normal. :thinking:

Can you share the `BINDS` and `BINDS_OPTIONAL` files from your `/hab/pkgs/smartb-private/api/$VERSION/$RELEASE/` directory? You mentioned earlier that the `BINDS_OPTIONAL` file had incorrect information in it, which would be consistent with the error you showed (`Invalid bind(s), redis`). I’d like to know more about how that package was built in order for it to have incorrect information. What version of `hab` was used to build it? Docker studio or chroot studio? Any “special” dynamic things happening in that plan? Also, were you using your cacher here? If so, I’m wondering if it’s somehow caching too much (you say you’re seeing an “old” bind from a previous iteration).

Comments inline. I grabbed the data after re-provisioning all 3 'fullstack' VMs:

```
# cat $(hab pkg path smartb-private/api)/BINDS
cat: /hab/pkgs/smartb-private/api/0.1.0/20180911145435/BINDS: No such file or directory

# cat $(hab pkg path smartb-private/api)/BINDS_OPTIONAL
irondb=port
```

Okay, that's odd that we still have the wrong package version. And I don't have a good explanation for that.

Is there some file caching happening somewhere (like for hab artifacts)? I know we do that to speed up Builder builds, but it sometimes leads to unexpected results.
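If a cacher (or the Supervisor's own artifact cache) is serving a stale `.hart`, clearing it would rule that out. A minimal sketch, assuming the default cache path `/hab/cache/artifacts` (the `HAB_CACHE_DIR` override is illustrative; a local cacher may keep its copies elsewhere):

```shell
# Sketch, assuming the default artifact cache location; a local
# cacher/proxy may store its own copies elsewhere.
cache_dir="${HAB_CACHE_DIR:-/hab/cache/artifacts}"

# List any cached api artifacts that could shadow a newer build
ls "$cache_dir" 2>/dev/null | grep 'smartb-private-api' || echo "no cached api artifacts"

# Remove them so the next install re-fetches from Builder
rm -f "$cache_dir"/smartb-private-api-*.hart
```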

journalctl shows me that the service has been ‘loaded’, but I don’t see the service actually running:

```
$ journalctl | grep 'api.staging' | grep -v 'no active members'
Sep 13 13:32:44 staging-fullstack-0 sudo[15310]:   ubuntu : TTY=pts/0 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/bin/hab svc load smartb/nginx --strategy at-once --channel unstable --group staging --bind api:api.staging
Sep 13 13:33:07 staging-fullstack-0 sudo[17265]:   ubuntu : TTY=pts/0 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/bin/hab config apply api.staging 1536845587 /hab/svc/api/user.toml
Sep 13 13:33:14 staging-fullstack-0 hab[13737]: hab-sup(MR): Setting new configuration version 1536845587 for api.staging
```

```
$ hab svc status | column -t
package                                          type        state  elapsed  (s)     pid                          group
smartb/redis/3.2.4/20180911151411                standalone  up     1685     14153   redis.staging
smartb/api_celerybeat/0.0.1/20180911151413       standalone  up     6        3932    api_celerybeat.staging
smartb/pgbouncer/1.7.2/20180911151412            standalone  up     1532     14788   pgbouncer.staging
smartb/journalbeat/5.6.0/20180502145432          standalone  up     1532     14795   journalbeat.staging
smartb/api_celeryd_default/0.0.1/20180911145435  standalone  up     6        3935    api_celeryd_default.staging
smartb/circonus-agent/0.13.0/20180810181314      standalone  up     5        3938    circonus-agent.staging
smartb/nginx/1.13.10/20180911151414              standalone  down   1507     <none>  nginx.staging
smartb/api_celeryd_slow/0.0.1/20180911151413     standalone  up     5        3992    api_celeryd_slow.staging
smartb/api_celeryd_events/0.0.1/20180911151413   standalone  up     5        4035    api_celeryd_events.staging
core/postgresql/9.6.9/20180716151728             standalone  down   1485     <none>  postgresql.staging
```

```
$ hab pkg path smartb-private/api
/hab/pkgs/smartb-private/api/0.1.0/20180911145435
root@staging-fullstack-0:~# cat /hab/pkgs/smartb-private/api/0.1.0/20180911145435/BINDS_OPTIONAL
irondb=port
```

Still the wrong optional bind though…

Looking at Builder, I see we actually have a bunch of failed builds for the API, so we may just have an old version of the package here :frowning:
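If a stale release is the culprit, then once Builder has a good build, something like the following should pull it in on each affected VM. This is a sketch only: the flags mirror the ones visible in the journal above, and the `redis` bind comes from the spec; adjust to taste.

```shell
# Sketch: assumes Builder now has a fixed build of smartb-private/api
# in the unstable channel. Run on each affected VM.
refresh_api() {
  hab pkg install smartb-private/api --channel unstable &&
  hab svc unload smartb-private/api &&
  hab svc load smartb-private/api \
    --group staging --channel unstable --strategy at-once \
    --bind redis:redis.staging
}
```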

This means the work I did to notify us when a build fails isn’t working correctly, so I need to fix that.

Let me try to get a working build and then see if the issue is resolved. In the meantime, I think it would still be good to get better messaging to the console when a service doesn’t actually load (either that, or have `hab svc load` return success only when the service is actually up; that would be far easier to work with in Terraform).
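Until `hab svc load` behaves that way, a small polling wrapper can approximate it in a provisioner. A minimal sketch, assuming the `state` column is the third field of `hab svc status` output (as in the table above); the function name and timeout are illustrative:

```shell
# Hypothetical helper: block until a loaded service actually reports
# "up", or give up after a timeout. Assumes `state` is the third column
# of `hab svc status` output.
wait_for_svc_up() {
  svc="$1"
  timeout="${2:-60}"   # seconds to wait before giving up
  waited=0
  while [ "$waited" -lt "$timeout" ]; do
    # NR>1 skips the header row; $3 is the state column (up/down)
    if hab svc status "$svc" 2>/dev/null | awk 'NR>1 { print $3 }' | grep -qx up; then
      return 0
    fi
    sleep 2
    waited=$((waited + 2))
  done
  echo "timed out waiting for $svc to come up" >&2
  return 1
}
```

A Terraform provisioner could then run `hab svc load … && wait_for_svc_up smartb/nginx 120` and get a meaningful exit code.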

@elliott-davis & @christophermaier, looks like we need some help with Builder here:

```
api: 'do_after_success' callback function is defined; executing...
BUILDER_SLACK_CHANNEL=info
{"start":1536762689,"stop":1536762689,"title":"api built","_last_modified":1536762690,"_last_modified_by":"/user/4598","rel_metrics":[],"category":"builds","_cid":"/annotation/353568","description":"Builder completed build of https://bldr.habitat.sh/#/pkgs/smartb-private/api/0.1.0/20180912142511","_created":1536762690}ok   api: 
   api: Source Path: /hab/cache/src/api-0.1.0
   api: Installed Path: /hab/pkgs/smartb-private/api/0.1.0/20180912142511
   api: Artifact: /src/results/smartb-private-api-0.1.0-20180912142511-x86_64-linux.hart
   api: Build Report: /src/results/last_build.env
   api: SHA256 Checksum: 0558406d6d28c5067c9843533f4f81d60152fedc1beba56489cdfc14d02ef1e6
   api: Blake2b Checksum: c08f9d3bc027f53e3d7a3a438d34d8d5dd88172f368e99f9c4b0427d5890ba9f
   api: 
   api: I love it when a plan.sh comes together.
   api: 
   api: Build time: 6m19s
Script done, file is /src/results/logs/api.2018-09-12-142511.log
umount: can't unmount /hab/studios/src/dev/pts: Resource busy
An error occurred when unmounting filesystem '/hab/studios/src/dev/pts'
ERROR: Unmount of /hab/studios/src/dev/pts failed
Failed studio build for smartb-private/api, err=BuildEnvFile("/hab/svc/builder-worker/data/1069514772922220544/out/pre_build.env", Os { code: 2, kind: NotFound, message: "No such file or directory" })
builder_log_section::end::build_package
builder_log::end::1069514772922220544
```

Do you know what’s going on with these failures? Or can you see logs for them? My ability to debug is limited.