Re: Updating the OS data that ostree doesn't manage



On Wed, Feb 19, 2020 at 1:26 AM Will Manley <will williammanley net> wrote:
Thanks for sharing.  It sounds like a sensible design given the constraints you've described.  I've made a 
few comments below about how we handle our deployment process.  Our use-case is a lot simpler than yours as 
we have a lot of control over the devices, and the devices contain little state, but in the spirit of 
sharing I thought I'd write it up:

Your comments are really insightful, thanks for sharing!

Our use-case is embedded.  We want each of our devices to be as similar to each other as possible, to allow 
them to be interchangeable.  They receive configuration from the network on boot.  Our use-case is a lot 
more limited than yours.  We don't need to be too careful about state, because we can always recreate it 
later.  A reboot is almost the same as a factory reset in our case.

Regarding managing /etc: We do have device-specific files in /etc (hostname, keys, certificates, 
machine-id), but they are always new files, not modifications to files that are in our ostree images.  This 
means that for managing /etc the default ostree deploy 3-way merge is fine for us.

So in some ways your network system already takes on several aspects
of the extra configuration system I'm designing here. For example, if
you need to change the set of keys placed in /etc, or the format in
which they are stored, you can easily handle that on the server and
reboot the clients.

Regarding managing /sysroot/ostree: We have a systemd unit called post-upgrade-cleanup.service which is run 
at boot after a deploy.  This allows us to perform housekeeping including running `ostree admin cleanup`.

We implement this with a marker file: /sysroot/.ostree-cleaned.  The marker is created by 
post-upgrade-cleanup.service after it runs successfully.  Its presence prevents 
post-upgrade-cleanup.service from running again before a deploy because the unit file includes:

     ConditionPathExists=!/sysroot/.ostree-cleaned

We delete /sysroot/.ostree-cleaned as part of our deploy process.
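
For concreteness, the unit ends up shaped roughly like this (a sketch: apart from the ConditionPathExists 
line quoted above, the exact contents are illustrative rather than copied from our real unit):

     [Unit]
     Description=Housekeeping after an ostree deploy
     ConditionPathExists=!/sysroot/.ostree-cleaned
     RequiresMountsFor=/sysroot

     [Service]
     Type=oneshot
     ExecStart=/usr/bin/ostree admin cleanup
     ExecStartPost=/usr/bin/touch /sysroot/.ostree-cleaned

     [Install]
     WantedBy=multi-user.target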

In order to handle some of the cases where we have needed to ship
post-install updates that aren't managed by ostree itself, we have
also added systemd services to our ostree that operate as you
describe. These spawn shell or Python scripts. But the issues we have
faced there are:
 1. Over the years we have made far too many mistakes in the scripts
that perform these updates. This has been an area where our human
imperfection has shone through quite strongly, and we haven't achieved
decent enough automated testing to help us out. I'm hoping that
adopting Ansible in the envisioned extra configuration solution will
help us here, because:
  a) Ansible appears to be a well-refined tool for handling data
manipulations and updates, so hopefully it will help us be more
careful and correct
  b) Ansible's playbooks will be used for applying these details at
both installation *and* update time. Less duplication and fewer
codepaths hopefully means fewer problems creeping in.
  c) Ansible ends up being a more practical system to build automated
testing around (see the sketch after this list)
 2. We only ship a single "generic" ostree but have multiple
products; sometimes we need to update product-specific details, but
shipping product-specific material in the generic ostree is awkward
at best.
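
To make point 1 a bit more concrete, the kind of task I have in mind
looks roughly like this (ini_file and command are real Ansible
modules; the specific paths and values here are hypothetical):

     # hypothetical playbook fragment - idempotent by construction
     - name: Add a collection ID to the existing ostree repo
       ini_file:
         path: /ostree/repo/config
         section: 'remote "eos"'
         option: collection-id
         value: com.endlessm.Os

     - name: Add the Flathub remote if not already present
       command: >
         flatpak remote-add --if-not-exists flathub
         https://flathub.org/repo/flathub.flatpakrepo

Tasks like these can be re-run safely, which is exactly the property
our hand-written scripts kept failing to deliver.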

I think it's worth splitting changes to /var and to /etc conceptually rather than considering them 
together. It seems to me that they are quite different in the way they're handled by ostree.  In particular 
there is a separate /etc per deploy, which makes changes atomic there, while /var is shared, meaning that 
you have to take a lot more care making modifications there.

That's a really good point at a conceptual level. In our case we do
not consider multiple deployments, and have conveniently ignored that
possibility.

I wonder if there are users who really rely on this. I like the
concept but I can imagine it being hit by some awkward realities, like
the fact that the uids and gids are defined in the deployments and
would need to be carefully synchronized if any shared files in /var
need to have some kind of access control or non-root writes allowed.

The points you raise about atomicity and writing to /var which may be
in use at runtime are also valid. I'll have to ponder it a bit more.
The obvious approaches are indeed to defer this to late during
shutdown or early during startup when the runtime environment is
greatly reduced.
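
For instance, one could imagine a oneshot unit ordered before most of
the system comes up (purely a sketch; the unit description and script
name are invented):

     [Unit]
     Description=Apply pending /var migrations before services start
     DefaultDependencies=no
     After=var.mount
     Before=sysinit.target

     [Service]
     Type=oneshot
     ExecStart=/usr/bin/eos-apply-var-migrations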

I'd be interested to hear more about why you want to have a single ostree image for all your products?  It 
seems to me that it might be less complex to do the merging of the configuration as the last step in your 
build process, rather than on the client?

Using the word "product" to denote a specific set of customisations,
we have a lot of different products, so we have to keep scalability in
mind. We wouldn't be able to build an entire ostree (i.e. from
packages) for each product, something more lightweight is needed.
Taking a base ostree and merging in some configuration could indeed be
cheap enough, but as far as I can see, the only part of the problem
space that would be solved here is for shipping updated files in /etc
in the cases where ostree's 3-way merge can successfully merge user
changes with upstream changes (i.e. currently only in the case where
the user didn't make changes to the files in question). The challenge
I am working on here is wider than that - see subject line "updating
the data that ostree doesn't manage" - including /var, bootloader,
etc.

You mentioned "site-specific networking config" earlier.  When you talk about different products, would a 
different site-specific configuration constitute a different product by your definition?

I don't have a solid definition for the word "product" but yes that is
one way of putting it. If site-specific configuration is needed then
we would need to produce (and maintain) a specific set of extra
configuration data for that site. That comes with challenges too,
which I have some ideas around, but haven't gone into detail in this
thread.

And the inability to update extra configuration after installation
time has become a growing pain. Through maintaining a fairly broad
product over the years, we've accumulated many details to tweak on
existing installs, big and small, such as:
 - Adding collection ID to existing ostree/flatpak repos
 - Adding flathub remotes
 - Moving stuff from the core OS into flatpaks, which requires the
flatpak to be auto-installed on OS update to avoid loss of
functionality

For things like this we like to follow the systemd convention of vendor configuration under /usr which is 
overridden by system-specific config under /etc.  I don't know anything about flatpak, but I imagine you 
could have `/usr/lib/flatpak/collections.d` containing a file per collection, which would be 
overridden/invalidated by a file under `/etc/flatpak/collections.d`.  This way the user can still delete 
pre-installed collections, but new collections will show up naturally.
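
For example (paths invented for illustration, following the usual drop-in masking convention):

     /usr/lib/flatpak/collections.d/flathub.conf         <- vendor default, ships in the ostree
     /etc/flatpak/collections.d/flathub.conf             <- admin override wins over /usr
     /etc/flatpak/collections.d/extra.conf -> /dev/null  <- symlink masks a vendor collection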

There is a nice theoretical option where each and every aspect we
try to manage here is covered by an upstream project that already
allows us to update it later, in ways compatible with the product
requirements placed upon us. And indeed, having shipped our product
for a few years now without the extra configuration system currently
being designed, we have naturally tried to use existing facilities to
handle the things we need. But in reality we have found that they
fall short.

When we started using Flatpak, the capabilities around configuring
remotes were more limited than they are now. So even if flatpak were
to offer something as complete as your description (it doesn't quite
do this - although recent versions are more flexible than they used to
be), we would still have the problem that we have to tend to existing
users who had their systems configured before sufficient flexibility
was implemented.

Additionally the aspects that we intend to control through extra
configuration sometimes have conditions attached, e.g.
- change should only be made at installation time (should not affect
existing installations), or
- change should be performed exactly once (could be at installation
time, or at the time when the config is updated) but never again, or
- change should be applied every time
So if we are to rely on existing facilities, that places even greater
requirements on the flexibility they must offer.
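
(In Ansible terms, the "exactly once" flavour could be approximated
with a marker file; a hypothetical sketch, where the script name is
invented and the script is responsible for touching the marker:

     - name: Migrate old flatpak remote configuration, once
       command: /usr/bin/eos-migrate-remotes
       args:
         creates: /var/lib/eos/.remotes-migrated

The install-time-only and apply-every-time flavours need different
handling again.)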

Another example quoted above is the movement of software from OS to
flatpak. We used to ship libreoffice in the ostree, because that was
effectively the only option. Later it became available as a flatpak,
and several benefits emerged of using that version instead. The
obvious thing to do was to remove it from the ostree and preinstall
the flatpak on new installations. However that would result in our
existing users doing an ostree update and having their office suite
disappear. So our product remained "stuck" on the ostree-shipped
version for a long while, because there was no existing facility that
would let us perform this migration in a sane way. Eventually we
designed and implemented our own migration system and pulled it off.
It covered nuances such as letting customers have Endless product
variants without libreoffice, and letting users who chose to
uninstall libreoffice keep it uninstalled rather than having the
system automatically reinstall it. It had teething problems, but it
worked for that case.

Years later we had another similar case of needing to migrate
software from the ostree to flatpaks, but this one had a few more
details attached. The system we had earlier put in place for
libreoffice had been carefully designed to be generic, configurable
and flexible, but when it came down to it, we found ourselves unable
to use it for the second case; it wasn't quite flexible enough.

So, the reality there: first, the upstream project itself didn't
provide facilities for us to handle the update of this aspect on
existing installations, even after waiting for a long time. Second,
even though we put plenty of brains in a room to figure out a custom
and generic solution to the problem, ultimately we did not do a
brilliant job of it.

I am hopeful that my proposal here would be a step forward, because:
 1. Ansible appears to have a design focused on the problem space of
effectively managing upgrade-path complications. It also offers a
great deal of flexibility, and that flexibility has stood the test of
time (and presumably many design changes and iterations) and of
exposure to many users, which hopefully means that in terms of
maturity and completeness it is way ahead of anything we could build
as a custom solution.
 2. The playbook would be run in the runtime environment of the newly
downloaded ostree, not the current one. (That overcomes the source
of some limitations of our existing custom solution.)
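
Roughly speaking, something like this (not our actual tooling; the
deployment path and playbook name are placeholders):

     # after the new deployment has been staged:
     chroot /ostree/deploy/eos/deploy/$NEW_CHECKSUM.0 \
         ansible-playbook -i localhost, -c local \
         /usr/share/eos-extra-config/update.yml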

 - Tweaking swap setup based on newer learnings

We would manage this with systemd .swap units stored on /usr, so it's applied at boot.  As it is we don't 
use swap :).
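
(If we did, it would be something like this shipped in the image - illustrative only:

     # /usr/lib/systemd/system/dev-sda3.swap
     [Swap]
     What=/dev/sda3

     [Install]
     WantedBy=swap.target

The unit name has to match the escaped device path, which is part of why this only suits fixed, known 
layouts.)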

systemd is great, but this is another example of where the existing
facilities don't go far enough.
Earlier versions of our product used MBR partition tables and BIOS
boot. Current versions use GPT and UEFI, but of course we maintain
support for the old users. Also, we only create swap when there's a
decent amount of disk space, and more recently we added zram into the
mix. Trying to write a single set of systemd units, shipped in the
ostree, that automatically supports all these configurations is very
difficult, if not impractical.

Instead, we have managed things like this by making relevant changes
to /etc, but then we end up with the problem later of not having a
mechanism to update such changes...

 - Fixing permissions of stuff in /var

We use systemd-tmpfiles for this with the tmpfiles.d configuration stored under /usr - thus included in the 
ostree image.
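
For example, a single line like this is enough to fix up ownership and mode at every boot (illustrative 
path and permissions):

     # type  path              mode  uid   gid   age  argument
     z       /var/lib/example  0750  root  root  -    -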

That existing permission-management facility is not ideal though,
since the permission changes would be applied on every boot, which
may not be desirable; at best it is a slowdown during boot.

Ideally the changes would only be applied on upgrade, or only be
applied exactly once and only for previous users who had software that
created files with the incorrect permissions, and totally skipped for
new users.


With these kinds of considerations in mind, even though ideal-world
solutions might be different, hopefully you can see why I'm drawn
towards something like Ansible as a mature approach to handling the
many complications with upgrade paths.

I'd be interested to hear what you think the advantages/disadvantages are of using chroot rather than 
rebooting and making the changes at boot time. It seems to me that it's only "safe" to make changes to /etc 
as that forms part of the deploy, and not safe to make changes to /var as this is part of the current 
running system.

One advantage to chroot I can see is that you can download additional data, like new flatpaks before 
rebooting, and use this data in some way to control the networking configuration for the next boot.  This 
might not be possible otherwise if the lack of these changes would cause the device to fail to connect to 
the network.

Good point about potentially causing running-system complications
with a live update of /var; I'll have to give that some careful
consideration.

Indeed, the reason for doing it before reboot is to allow for any
network operations to happen, and if anything takes a long time (e.g.
downloading and installing libreoffice) it wouldn't be great to delay
the next boot while that happens.


Thanks for raising many good points! Super helpful.

Daniel

