Request for Input - Handling breaking changes to core plans when software is not being updated

Hello all!

This is not quite at the point of an RFC yet, so I’m starting the discussion here.

I’m wondering how we should handle making breaking changes to core plans when we are not doing a major update of the software that the plan packages.

A good example is in this pull request, which greatly simplifies the configuration for core/mongodb. This is a potentially breaking change for anyone using core/mongodb. Normally I would convey that it is a breaking change through semantic versioning, but in this case (as is the case with most of our core plans) the plan version tracks the version of the software we are packaging.

We have a few options as I see them, but would love to hear ideas for more:

  1. Wait until a major version change of the software before we do a breaking change to the plan (not ideal at all)
  2. Somehow communicate that this is a breaking change - maybe by appending something to the version number? This also seems less than ideal.

What are your thoughts on this, oh great Habitat community?

Hey,

I’ve been thinking about a way to version the configuration of a package since I first used the great TOML-based templates.

Today I decided to spend some time to try to figure out what we could do, and put it in a blog post: https://romain.sertelon.fr/tech/habitat-service-versioning-proposal.html

TL;DR I propose that we include an optional pkg_svc_version in packages to version the whole service API, represented mainly by the default.toml file. We could use this information in many places to help ops manage configuration changes more easily with Habitat.
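
To make the idea concrete, here is a minimal sketch of a plan.sh carrying the proposed field (pkg_svc_version does not exist today, and the version numbers are made up):

pkg_origin=core
pkg_name=mongodb
pkg_version=3.6.9        # tracks the upstream software, as today
pkg_svc_version=2.0.0    # proposed: versions the service API (default.toml, hooks)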

That is a great post, ty @rsertelon!

I think what this comes down to is a need for additional versions that track aspects of the plan, and those should be metadata of the plan, but have no impact on the build or versioning of packages.

In core plans, we want to use the pkg_version as the SAME version as the upstream package because it makes intuitive sense for other plans that depend on those packages. If you pin your dependencies and depend on core/openssl/2.0.16, then it’s VERY CLEAR, right there in the plan, not only which version of the package you’re depending on, but precisely which version of openssl. Of course, this bubbles up in the http gateway and any programmatic auditing you might be doing of dependencies in your infrastructure, so this is an important, simple, and clear mechanism for communicating which version of the software is running.

In order to solve this particular problem, we could include an optional metadata version called “pkg_build_version”. This is a version that represents the state of the build and configuration in the plan at build time, so that we can communicate to users major changes to the compile or configuration without changing the underlying source software.

So if this isn’t affecting the dependency system in Habitat, what does it do?

The problem here is that a human needs to know what’s going on, not a robot. So we should give the human an update via email, RSS feed, or notification.

The idea is for Builder to have a new feature where users can subscribe to updates for changes to plans. This feed could monitor not only pkg_version but the pkg_*_version fields as well.

The unique opportunity here is to leverage the dependency management system so that you could opt to receive updates for transitive dependencies – that way you get a clear understanding about what might have happened when the sands shift out from under you.

For example, for OpenSSL, you would have a build version like so:

pkg_version=2.0.16 (matches the upstream source version)
pkg_build_version=15.0.1

This gives you flexibility because you may want to compile OpenSSL in a particular way, and make breaking changes in the way you compile OpenSSL, even though you did not change the upstream source code in use.

To quote @nellshamrell: “Strong opinions, loosely held.”
I’ve been writing this across a few hours and multiple interruptions, so apologies in advance if I’m incoherent.

My first reaction to this question is “We need a way to version Plans”. (Note: I use the capital Plan here to denote the set of inputs that go into making an artifact.) This would allow us to communicate breaking changes in how we build the software, configuration values, or how the service runs.

However, as I’m noodling on it, I’m not entirely sure it’s the right answer, or at least not a complete answer. Part of my reasoning is that in the case of user software (i.e. the user owns both the software and the plan), the versions of the two would be the same, so changes to the Plan would be communicated through major/minor version bumps of the software it packages, and a Plan version would be redundant. (I could absolutely be wrong here.) That leaves “core-plans-like” software, where the plan author doesn’t control the version.

With core-plans-like software, providing a Plan version, while it communicates to users that a change may break them, still leaves them with a choice: upgrade and eat the pain, or stop updating. That, to me, is an anti-pattern with hab. There are reasons and cases for stopping/pinning, but by and large I believe this holds.

One question I have is: should people be running core-plans directly? I wibble back and forth on this. By maintaining their own plan that is just a thin layer on top of a core-plan (or by running an on-prem depot), a user can manage channel promotion themselves and control what gets deployed; a sketch of that wrapper pattern follows below. This does start to lean away from always consuming updates and leads us back to batching changes. It also only solves the service aspect and doesn’t handle the case of build/runtime deps changing, build options changing, etc.
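
Something like this, with the origin, version, and pin all made up:

# plan.sh for a thin wrapper over a core plan:
pkg_origin=myorg
pkg_name=mongodb
pkg_version=3.6.9
pkg_deps=(core/mongodb/3.6.9)   # pin the core plan; bump this deliberately

# Nothing to compile; the wrapper just ships its own config/ and hooks/
# directories, so config changes in core/mongodb only land when we choose
# to rebuild against a newer pin.
do_build() {
  return 0
}

do_install() {
  return 0
}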

My next thought is rather than versioning plans, we version Core-plans as a whole and limit breaking changes to a regular cadence. We do this to some extent with base plans refreshes already. If we were to limit breaking changes to the refreshes, it would allow our users to plan accordingly. This also isn’t a complete solution in my mind, but I think it gets us closer than plan versions. It does start to lead us away from the rolling release model, but there will always be a set of Plans that need to move together and have regular release cadences (glibc, etc) that could be used as guideposts.

I think a lot of the breaking changes issues (for services) could be mitigated by having upgrade testing as part of the pre-merge. Lifecycle hooks are also there to help us move from one version to the next, though maintaining those could become cumbersome. I also suspect that as Plans mature, breaking changes will tend to occur only when we’re updating the underlying software due to its configuration/dependencies changing.

I’ve considered that we may be able to use channels to signify updates, but I always get back to the same place: we end up batching updates and still break the user, or they stop updating, both of which are bad in my opinion.

What I think I’m landing on is a regular cadence of windows to introduce potentially breaking changes. That seems like a good first step as we iterate toward a better solution. We’ve seen a lot of activity around testing recently which is AMAZING, but I think there are some standard tests that we could provide (like service upgrades). I still don’t know that that would get us where we need to be and it’s possible there are more primitives/metadata needed, but my inclination is to work with what we have first and add features as we find them absolutely necessary.

I ran into this same problem when I was working to make the stock nginx plan more useful. Maintaining existing support and behavior for the redirector config (which got merged in to implement one specific use case within the primitive capabilities of Handlebars) was making my config template way nastier than it needed to be. I would have loved to propose deprecating that config instead (since IMHO it has no business being first-class config).

The ability to write a lifecycle hook that can transform config before it is applied, which I’ve seen discussed in a few contexts, could potentially be a good way to help here. A config-transform hook could help make up for the limitations of Handlebars, for example by transforming deprecated config automatically where possible, or by throwing a detailed error about it.
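
Something like this purely hypothetical hook is what I’m imagining (no such lifecycle hook exists in Habitat today, and the path and key names are made up):

#!/bin/sh
# hooks/transform-config -- hypothetical: runs before config is rendered,
# rewriting deprecated keys where possible or failing fast with a clear error.
CONFIG="/hab/svc/nginx/user.toml"   # assumed location for this sketch
if grep -q '^\[redirector\]' "$CONFIG" 2>/dev/null; then
  echo "ERROR: [redirector] config is deprecated; see the plan docs for the replacement" >&2
  exit 1
fi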

I've been thinking about this subject for a while now and I wanted to put together some of the ideas in my head. These contradict my earlier thoughts, and I rescind my previous comment above.

First and foremost: software versions are built to communicate intent, but they fail at doing so. If you pulled someone off the street who knew nothing about technology and showed them the version numbers of Red Hat 5 and Red Hat 7, what would that tell them? It would be useless and meaningless. Even a software engineer could perhaps tell you the differences in base kernel versions, or the choice of systemd over SysV init, and so on, but then what are the minor and patch versions for? You'd have to go reference the CHANGELOG, or better, actually read the code.

I think the best way to understand that using software versions to communicate intent is a failure is to go read semver.org. For something so simple, it's detailed and complicated.

SemVer.org says:

Given a version number MAJOR.MINOR.PATCH, increment the:

MAJOR version when you make incompatible API changes,
MINOR version when you add functionality in a backwards-compatible manner, and
PATCH version when you make backwards-compatible bug fixes.

Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.

Under this scheme, version numbers and the way they change convey meaning about the underlying code and what has been modified from one version to the next.

But there's no actual system for verifying any of this! So in reality, SemVer is just hopes and dreams. You HOPE that the developer has done rigorous testing to ensure they maintained compatibility, you DREAM that their patch-level bug fixes really are backwards-compatible, but none of this can be verified. And even IF the developer has done some super-rigorous testing, path-dependency can lead to runtime failures. So at least to me, putting faith in a version number, and trusting a version number to communicate intent, is negligent.

The answer to this problem of communication, at least in my mind, is two things:

  1. Read the code, all of it.
  2. Use the version number as a way to establish your understanding of the code.

So once I have read all of the code, the point of the version number is that it's a human-readable mapping to a git commit hash. Then, in the future, I can do a diff on the code, which makes the task of reading all the new code that much easier.
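
For example, with OpenSSL's upstream tag naming, that future diff is just:

# From a clone of the upstream repository; the tag names follow OpenSSL's
# own convention for these releases:
git diff OpenSSL_1_0_2p..OpenSSL_1_0_2q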

I already have a git commit hash, which marks a snapshot of a period of time of the code. The git commit hash, and a copy of the code at that git commit is what I actually care about, because that's where the intent is at. It's total nonsense to expect to understand the intent of the new version of the software by trusting a version number!

So, to my mind, the whole point of a version number is just to make it easy for developers to write dependency management software using a human-understandable integer.

Now that I've talked your ear off about software versions, we can talk about how dependency management software should consume those versions.

As I said above, I only care about the version number because it gives me a baseline for understanding which git commit hash is in play. If build systems didn't ever interact with humans, we could just base all of the upstream software versions on git commit hash.

Version numbers exist because they are easy for humans to understand and use in our dependency management systems. A single incrementing integer is much better than trying to squint your eyes between two 40 character hex-valued strings.

We humans also care about the time at which the build process produced the artifact, because that makes it easy to reason about when the thing came to be.

Of course we also care about what the thing is, and who produced it.

So what should a "version number" be, really? I think it should be a composite of all of these things:

who made the thing / what the thing is / the version of the thing (git commit mapped to the version) / the single incrementing integer that represents time the thing was produced

or, if you haven't already guessed:

origin/package/version/release

example:

core/openssl/1.0.2q/20181221225447
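
Pulling that ident apart (a quick bash illustration):

# Split a fully-qualified ident into its four components:
IFS=/ read -r origin package version release <<< "core/openssl/1.0.2q/20181221225447"
echo "$origin | $package | $version | $release"
# => core | openssl | 1.0.2q | 20181221225447 (release is a UTC timestamp)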

Now I want to step back for a second and look at this a little more. The thing that we humans really care about the most in the build system is that very last number: 20181221225447. This signifies the point in time when the build system produced an artifact, and where we have incremented that human-readable integer that we care about. It's nice because just by using the date we can accomplish both of these goals. A new value of that number is the signal for you to go do a diff on the build manifest (the plan.sh / hooks / config files) and see what's different.

But doing this all the time is really hard for humans to keep up with. In other words, I have to audit every piece of code and every dependency and every transitive dependency, and I have to do that ALL the time because the world is always moving forward? But I just wanted to do a new build of MY code, and just use a good set of dependencies that will work. Keeping up CONTINUOUSLY is exhausting, and distracting from the value I'm trying to build for my business.

So I think @smacfarlane's idea above nails it on the head. What we really need is a regular cadence for moving the world forward. We need a not_exhausting method of publishing these artifacts, so that people can reasonably audit the code, or pin to a version of the dependencies that they care about.

While it's helpful to know the time that the artifact was produced, what I really want in this case is just to say something like:

origin/package#tag-base-plans-refresh-20190129

because then I can pin ALL my dependencies to the point in time where the world moved forward instead of worrying about the upstream version.

What this really amounts to is a bundling of all of those release numbers for every package, because they will all differ even though they were built with the intent of working together:

#base-plans-refresh-20190129
core/LuaJIT/2.0.5        /20190115225447
core/bzip2/1.0.6         /20190115011950
core/libpthread-stubs/0.4/20190115155413
core/crate/1.1.2         /20190117194021

So what I think is that there should also be a package manifest that goes along with these kinds of tags, that is just a list of all of the packages built for that tag and all of their pkg_ident numbers in one document. I think producing a sort of manifest like this, and doing rebuilds at a regular cadence, is a good, non-exhausting way of moving the world forward.

The idea then, is that you could always move forward faster than this, if you were willing to incur the cost of auditing your code more frequently.

So then, as @chris mentioned, this gives us a place, at a tag, where we could introduce things like an upgrade/transform/deprecation hook that takes care of the path-dependency problem. Such a hook could modify existing configuration, or prevent the service from starting and throw an error. This way you don't beat yourself over the head when the world moves forward, and we can create a safe way to deprecate pieces of existing plans for consumers of those plans.

So I want to make sure that I'm understanding where you're coming from on this: the quote above is the root of the problem you're thinking through, is that correct?

Distilling your solution down, then: the idea is to have arbitrary universe rebuilds that can be tagged, similar to a release, plus a new bit of metadata that is effectively a release tag, which at build time would let users draw from a fixed set of packages rather than following a rolling release? I think this could be doable with builder/depot; outside of that context I'd have to think on it a bit more, and with a bit more coffee in my blood!

The only thoughts I have on this are around two things: shipping around a build manifest that contains the entire universe, and shipping an update that is a conglomerate of N+1 changes. Effectively, right now packages are aware of the software they interact with (at the versions/releases they were built against, in the form of DEPS and TDEPS) because those are the bits of software the package actually needs to know about. Shipping around a manifest that includes every piece of available software in the universe at the point in time the individual package was built seems like it would add very little value for a runtime artifact, past having a manifest that exists. I guess my big question here would be: why, at runtime, would you need every piece of software running in your business to contain the same list of all other software that exists in your business? It's very possible I'm overlooking something here, so please correct me if I'm way off base!

Now, the shipping-change part. As @echohack mentioned above, a release number in Habitat is tied to the moment in time that the build system produced an artifact. So when you're looking at the example above, core/openssl/1.0.2q/20181221225447, the last item is the release number that was generated when the package was built. I want to break down a few details for folks reading the thread who maybe don't have as deep an understanding of the internals. Let's trace the full process in Builder today:

Plans (inclusive of lifecycle hooks and configuration) are themselves unversioned source code that we suggest live alongside the application source code. Let's assume for the sake of discussion that in most cases, when a change occurs in a repository, a build gets triggered (this is configurable with the .bldr.toml; see the sketch below). In Builder, what happens when that change hits master is that a "build group" is generated. The build group is effectively all of the reverse dependencies that transitively depend on the piece of software that changed. We create that build group, and each version of software that gets built (due to the inbound change) gets added to it. Because of the way we deterministically manage/version packages, the build group is how packages get promoted. E.g. your single change to openssl cascades out to all software that depends on openssl, deterministically graphed from the ins and outs of the dependency tree.
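
(Since I mentioned .bldr.toml, here's an illustrative one; the package name and paths are made up, and the keys are as I remember them:)

# Write an example .bldr.toml at the repo root; a change matching "paths"
# triggers a build of the named package:
cat > .bldr.toml <<'EOF'
[myapp]
plan_path = "habitat"
paths = [
  "src/*",
  "habitat/*",
]
EOF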

In core-plans, and in Builder specifically, the build group is the unit of change being processed. The reason is that if we don't promote the package tied to the inbound change AND all of the packages that get rebuilt from that inbound change, users end up with dependency mismatches in their own packages (which is effectively a broken graph, making us unable to safely guarantee the behavior of those packages).

Phew... hopefully you're still with me here. The build group is always tied back to a single change, mostly because that was the original design. The build group consists of many packages, each with a release number that was generated when the package was built as part of that build group. All of this is to say: in Builder today, each package that has a release is tied back to a single change via the build group. As a user, that means if you change your FOO package and BAR, BAZ, and FIZZ are rebuilt because of it, you have the metadata to make a manual determination about what change caused a release of those packages to be generated. Unfortunately, we don't make this data easily consumable in the UI today, though we've discussed at length how to do so.

OK, finally to the point. If we batch changes together into some kind of tag, then while we get a static list of libraries at specific versions, we lose the ability to easily determine what caused a release to fire, and we can't make guarantees about the viability of the builds of all the software in the group, which means a couple of things.

  1. Perpetual manual auditing and remediation of builds in a tagged release set.
  2. Higher cognitive load around differing patterns at build-time for users.

The first issue is largely that if we aren't building continuously and deterministically as we have been, then each time we are ready to cut a core-plans release we have even less knowledge about the state of the interdependent builds, which means we have to let stuff die and go in and remediate, a very tedious and time-consuming process even as it stands today. I think that means we effectively need some way to continue building the way we build today, and then batch up a distribution of packages that can be tagged, which means figuring out some programmatic way of grabbing a bunch of builds at a point in time and tagging those, entirely separately from the channel system and entirely separately from the deps/tdeps system. It would effectively need to be a universal snapshot from a point in time.

Which brings us to the second point. Two separate systems for pinning/releasing versions of software in the same tool mean increased cognitive load and a steeper learning curve. It would 100% add to the flexibility of the tools and the ways people can build and release with Habitat, but I have no doubt it would make it even more difficult for some users to grasp how to do so effectively.

:+1:

I don't think you need this at runtime per se, but this is a document a build engineer or security engineer would want to analyze after the build to do spelunking should something go wrong. You might not even need it as a "document" itself; you could just click the tag and have it show you all of the packages under that tag.

Yeah I don't think we are ready to do this with the current system. You would want to "cut a tag" when you feel like the packages are ready to go, and have Builder just grab the latest of all the packages you're interested in tagging at that time.
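
If we did get there, "grab the latest" could be a pretty small script against Builder's depot API (the endpoint shape below is my assumption based on the public API, so treat it as a sketch):

# Snapshot the latest stable ident of each package into a manifest file:
for pkg in LuaJIT bzip2 libpthread-stubs crate; do
  curl -s "https://bldr.habitat.sh/v1/depot/channels/core/stable/pkgs/${pkg}/latest" \
    | jq -r '.ident | "\(.origin)/\(.name)/\(.version)/\(.release)"'
done > base-plans-refresh-manifest.txt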

What if it wasn't two separate systems, but just a lookup table? I can effectively pin my packages today to a base plan refresh, if I'm willing to look up all of the pkg_idents myself and update every single one of my plan.sh files with them. This way, when the world moves forward, I don't have to look up a whole bunch of pkg_idents again, I can just update my tag across all my packages.
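
For concreteness, that manual pinning looks like this in a plan.sh today (the glibc release number is illustrative; the openssl one is from the example above):

pkg_deps=(
  core/glibc/2.27/20190115002733
  core/openssl/1.0.2q/20181221225447
)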

I'm not advocating for anything in particular here, just taking this line of thinking to its logical conclusion. It's interesting to think about!

Totally agree here. Having it in the UI, or a way to pull it as JSON from Builder via the hab CLI, seems like it could be super useful.

I was wondering about something like this myself. We could figure out an implementation similar to this. Do you think this would be a better or worse UX than having a separate system similar to a channel tag? If this is the pattern our users want for managing their software, I think we should try to make it a grade-A experience. I worry that if we were to obligate users to manually pin all deps in their packages based off the pin, they might get annoyed.

Absolutely! IMO we're better for it. I wanted to jump on this and get some responses in to keep the ol' braintrain chuggin' forward!

Core plans versioning, via channels.

A base refresh during 2019-04, using CalVer (modified SemVer), would be promoted to channel 2019.04 as well as to stable.

This allows people to “pin” via channels.
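
Mechanically, the channel tooling we already have could support this (a sketch; the 2019.04 channel is hypothetical):

# Create the CalVer channel and promote the refreshed packages into it:
hab bldr channel create --origin core 2019.04
hab pkg promote core/openssl/1.0.2q/20181221225447 2019.04

# Consumers then install from the pinned channel:
hab pkg install core/openssl --channel 2019.04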

I’ve been thinking about this recently and I think it’s possible that we’re missing a concept in Builder. I sort of like the idea of channel pinning, but I don’t think it’s the right primitive. If we had a new primitive that allowed users to pin everything they ever build or pull from Builder to an LTS kind of release, I wonder if we could resolve some of these things. E.g. you could pin your whole environment to the 2019.06 refresh, which includes a bunch of the major testing we’re doing on the core packages, and then not accept any new packages into your build environment until you’re ready to move the whole distro forward.

Well, we could do that, but it still doesn’t allow our plans to build against that channel. It would be nice if we could specify pkg_channel, or better yet, if the hab build command supported specifying a channel other than stable, e.g.:

build -c 2019.04 /path/to/plan
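
(If I remember right, the closest workaround today is pointing the build at a channel via the HAB_BLDR_CHANNEL environment variable; treat the exact variable as an assumption:)

# Assumed workaround: have the studio pull dependencies from a channel during build.
HAB_BLDR_CHANNEL=2019.04 hab pkg build /path/to/plan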