Re: How do we store/install apps?



On Sat, 2014-10-11 at 23:35 +0200, Lennart Poettering wrote:
On Fri, 10.10.14 13:52, Alexander Larsson (alexl redhat com) wrote:

So, I've got some kind of initial runtime going, and it's now time to
look at how we want to package these runtimes/apps. There are a few
requirements, and a bunch of nice-to-haves.

This is what we absolutely require:

* Some kind of format for an application that is delivered over the
  network. This will contain metadata + content (a set of files).

* A format for the application when installed on a system. This has to
  be done in such a way that we can access content via the normal
  kernel fs syscalls.

I am pretty sure these two formats need to be very close to each
other; otherwise all the stuff like signatures that are checked on
access is really hard to do.

I agree, the network transport format pretty much follows from the
decision on how the installed form looks.

* Don't pass untrusted data to the kernel. For instance, it is risky
  to download raw filesystem data and then mount that, or mount a
  loopback file that the user can modify. The raw filesystem data is
  directly parsed by the kernel and weird data there can cause kernel
  panics.

Well, this is unavoidable if we ever want to allow fully signed
systems. I mean, again, I would not isolate the problem of app images
so much from the problem of OS images. I want to solve this at the
same time, as the problems with verification, distribution and so on
are pretty much the same. 

In some sense it is unavoidable. We have to tie the exact file data to
the signature. However, does this mean we have to shove random bits at
the kernel rather than going through the syscall interface?

btrfs-receive is a userspace tool that uses the regular userspace I/O
syscalls to do its modifications. How is it supposed to handle the
signatures? If it can do it, why would it not be possible for us to do
the same ourselves?

I also really don't believe that the kernel would be any worse with
verifying structural integrity of images than userspace code...

I don't think that is a proper comparison. The "verifying" that the
userspace install code does is run in an unprivileged mode that then
feeds the resulting data via the well-tested syscall interface to the
kernel. However, the parsing of the on-disk filesystem structures is
done in a very highly privileged mode in the kernel.

That said, btrfs-receive is a userspace tool, so it doesn't quite fit
what I talked about above with mounting pre-created filesystem images.

* Regular directory

  We require an install phase that explodes the app bundle into
  separate files.

  For multi-version storage we can use hardlinks which results in
  sharing both disk and page cache between versions at a file-granular
  level.

  Installing and mounting are doable as non-root; this doesn't pass
  untrusted data to the kernel and, once done, allows easy access to
  exported files.

  However, installation is not atomic, and there is no lazy checking
  of checksums or signatures.

Also, the hardlink farms are certainly not pretty.

They are not pretty, sure. However, they are very widely available,
and the *only* solution that allows page-cache sharing between images
and "trivial" deduplication between unrelated images. I don't think we
should too easily dismiss it.
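To make the hardlink-farm idea concrete, here is a toy shell sketch (paths and file names are made up for illustration): two versions of an app share unchanged files as hardlinks, so identical content lives in a single inode, shared both on disk and in the page cache.

```shell
# Toy hardlink farm (made-up paths): version 1.1 is "installed" by
# hardlinking the 1.0 tree, then atomically replacing changed files.
store=$(mktemp -d)
mkdir "$store/app-1.0"
echo "unchanged library" > "$store/app-1.0/lib.so"
echo "old main" > "$store/app-1.0/main"

# cp -al copies the directory tree as hardlinks, not file copies.
cp -al "$store/app-1.0" "$store/app-1.1"

# Replace the one file that actually changed; mv breaks its link
# atomically while lib.so stays shared.
echo "new main" > "$store/app-1.1/main.tmp"
mv "$store/app-1.1/main.tmp" "$store/app-1.1/main"

# lib.so is the same inode in both versions; main no longer is.
shared=$(stat -c %h "$store/app-1.0/lib.so")   # link count 2
unshared=$(stat -c %h "$store/app-1.0/main")   # link count 1
rm -rf "$store"
```

Because both version directories point at the same inode for unchanged files, the kernel caches each such file's pages only once no matter how many installed versions reference it.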

* btrfs volumes

  If the filesystem where we're installing the app is btrfs (either natively
  or via a loopback mounted file) we can install the apps in subvolumes.
  If the root is btrfs this is easy, but the loopback mounted case is pretty
  tricky, as it requires resizing the loopback when needed, etc.

  This is similar to exploding the files, but we can use the subvolume
  to share data between different versions of an app. This will share
  disk space, but not page cache.

  Removal of apps is atomic, although you can't remove a btrfs
  subvolume until it's no longer mounted (i.e. the app is no longer in
  use).

  Also, btrfs subvolume removal requires root rights, as does mounting
  a loopback btrfs image, so some kind of setuid helper is needed.
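As a hypothetical sketch of what the subvolume-per-version layout could look like (paths and version numbers are made up, and the commands require root plus a btrfs filesystem mounted at /apps, so this is illustration only, not something to run as-is):

```shell
# Hypothetical subvolume-per-version layout (made-up paths).
btrfs subvolume create /apps/org.example.App/1.0
# ...unpack version 1.0 into the subvolume...

# A new version starts as a snapshot of the old one, sharing disk
# extents (but not page cache) with 1.0:
btrfs subvolume snapshot /apps/org.example.App/1.0 /apps/org.example.App/1.1
# ...apply the 1.1 changes on top...

# Removal is atomic, but fails while the subvolume is still in use:
btrfs subvolume delete /apps/org.example.App/1.0
```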

  btrfs also has an interesting feature where you can btrfs-send a
  subvolume, which creates a file describing the diff between the
  parent volume and the subvolume. This can then be applied with
  btrfs-receive, which is a userspace app that applies a set of file
  ops to convert the parent to the new child state. This is, imho, not
  super interesting for our use case. btrfs-send is rarely what you
  want anyway, as a newly built version of an app is built from
  scratch and not based on the previous version. One can use rsync to
  create a new subvolume based on the old one, but then you're using
  rsync, not btrfs-send, to generate the diffs.
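For reference, the send/receive mechanics being discussed look roughly like this (a hypothetical flow with made-up paths and versions; it needs root and a btrfs filesystem, so it is purely illustrative):

```shell
# Hypothetical incremental update via btrfs send/receive.
# send requires read-only snapshots:
btrfs subvolume snapshot -r /apps/org.example.App/1.1 \
    /apps/org.example.App/1.1-ro

# Serialize only the difference against the parent 1.0 snapshot:
btrfs send -p /apps/org.example.App/1.0-ro \
    /apps/org.example.App/1.1-ro > app-1.0-to-1.1.send

# The receiving side replays the stream as ordinary file operations
# to reconstruct 1.1 next to its local copy of 1.0:
btrfs receive /apps/org.example.App < app-1.0-to-1.1.send
```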

I absolutely disagree. Kay and I have been discussing this stuff with
the btrfs folks. The thing is that we want the signatures for the
files to be transferred in-line. While the signature stuff doesn't
exist right now for btrfs, the guys working on it are ensuring that
the signatures can be serialized from btrfs as part of the btrfs
send/recv image, and then deserialized again on the destination, while
staying fully valid.

The signature thing is the one real advantage that the btrfs solution
has, and it is something nothing else gives us. 

Harald has been playing around with some build logic that makes sure
that rebuilt app updates are efficiently shipped as btrfs send/recv,
with stable inode numbers and stuff.

How exactly do you envision this would work in practice for updates?
Say you have an application that receives regular updates (major and
minor). At any time the user comes in and does a fetch-from-scratch,
or an update between two essentially "random" versions. What does the
server store? A copy of each full image? Only for major versions? A
delta between each consecutive image? A delta between each possible
image pair?

It seems to me like a git-like format such as ostree would allow a
much easier, more efficient distribution model for updates on highly
mirrored dumb servers than this.

You know, this is explicitly something where we shouldn't reinvent the
wheel. It's quite frankly crazy to come up with a new serialization
format, that contains per-file verification data, that then somehow
can be deserialized on some destination system again back into the fs
layer...

The hard part is obviously having the kernel verify the signatures;
that requires deep kernel FS work, which doesn't exist yet, and which
only the btrfs people are working on. However, when they come up with
something it could very well be usable for other things than
btrfs-receive (as btrfs-receive is essentially just a stream of
syscalls). Are the design discussions on this happening in the open
somewhere?

I know that the Red Hat fs crew hates btrfs like it was the devil, and
loves LVM/DM like it was a healthy project. But yuck, just yuck!

I'm not particularly fond of a device-mapper approach either, but I was
listing all options, so it needed to be in there. That said, I'm also a
btrfs user on all my development machines, and I can't say my experience
with it has been exactly stellar...


