Re: ostree.sizes metadata issues?



On Mon, Aug 24, 2020 at 7:48 AM Alexander Larsson <alexl@redhat.com> wrote:

On Mon, 2020-08-24 at 06:58 -0600, Dan Nicholson wrote:
On Mon, Aug 24, 2020, 3:56 AM Alexander Larsson via ostree-list
<ostree-list@gnome.org> wrote:
I noticed that the Endless ostree.sizes metadata support landed, and
I'm interested in using it to improve the flatpak download size
estimations.

However, I seem to remember that there was some kind of issue with
these the last time they were tried. Wasn't there a size limit for the
commit metadata or something that got overrun by large commits and
made this not work? Was something done to fix this?

It definitely works and I doubt any flatpaks will be larger than the
OS we ship. I fixed several smaller issues when I was upstreaming it.
The issue is that it makes the commit metadata really large for large
commits. IIRC, our OS commit objects are like 4MB or something like
that. So, it can make things a little slow since ostree will always
fetch the commit metadata when pulling. For smaller commits I don't
think it would be much of an issue, but for runtimes you'd probably
see it.
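
For reference, the sizes data lives in the commit's metadata dictionary
under the "ostree.sizes" key, so the overhead is easy to measure
directly. A minimal sketch with libostree, assuming the upstream "aay"
encoding (one byte array per object):

#include <ostree.h>

static gboolean
report_sizes_overhead (OstreeRepo *repo, const char *checksum,
                       GError **error)
{
  g_autoptr(GVariant) commit = NULL;
  if (!ostree_repo_load_commit (repo, checksum, &commit, NULL, error))
    return FALSE;

  /* Child 0 of the commit variant is the a{sv} metadata dict. */
  g_autoptr(GVariant) metadata = g_variant_get_child_value (commit, 0);
  g_autoptr(GVariant) sizes =
    g_variant_lookup_value (metadata, "ostree.sizes",
                            G_VARIANT_TYPE ("aay"));

  if (sizes == NULL)
    g_print ("%s: no ostree.sizes metadata\n", checksum);
  else
    g_print ("%s: %" G_GSIZE_FORMAT " entries, %" G_GSIZE_FORMAT " bytes\n",
             checksum, g_variant_n_children (sizes),
             g_variant_get_size (sizes));

  return TRUE;
}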

I don't remember the exact details, but OSTree has this:

/**
 * OSTREE_MAX_METADATA_SIZE:
 *
 * Default limit for maximum permitted size in bytes of metadata objects fetched
 * over HTTP (including repo/config files, refs, and commit/dirtree/dirmeta
 * objects). This is an arbitrary number intended to mitigate disk space
 * exhaustion attacks.
 */
#define OSTREE_MAX_METADATA_SIZE (10 * 1024 * 1024)

And I remember Endless running into this at some point, back when it was originally using ostree.sizes.

As far as I know, it's never happened in the 5 or so years that I've
been around this code. However, very early on I believe the object
checksums were stored as hex strings, which would certainly bloat the
size. In upstream it's simply a byte array. I think that the hex
string format was only in our fork because I don't see any history of
it upstream.
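
For scale, a hex checksum is 64 bytes against 32 raw, so on a commit
with ~100000 objects the hex form alone would have added roughly
100000 × 32 bytes ≈ 3 MB of metadata.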

The other thing that's a little funky is that you need to do a
metadata only pull to get it, which leaves the repo in a state where
it thinks it has a partial commit and skips deltas as was found
before. I think there are ways to handle it and flatpak already does
something like this for a different reason I can't recall, but it's
something I thought needed a better solution in ostree.
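
For the record, the metadata-only pull is
OSTREE_REPO_PULL_FLAGS_COMMIT_ONLY in the API (--commit-metadata-only
on the command line). A minimal sketch, with made-up remote and ref
names:

static gboolean
pull_commit_only (OstreeRepo *repo, GCancellable *cancellable,
                  GError **error)
{
  /* Fetch only the commit object (and with it the sizes metadata),
   * leaving content objects unpulled.  This is what marks the commit
   * partial in the repo. */
  char *refs[] = { "app/org.example.App/x86_64/stable", NULL };

  return ostree_repo_pull (repo, "flathub", refs,
                           OSTREE_REPO_PULL_FLAGS_COMMIT_ONLY,
                           NULL /* progress */, cancellable, error);
}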

I've always wanted to have flatpak use it, but I recall that you were
opposed to it.

We did initially support it, but we stopped using it because it wasn't
very effective. For example, given your 4 MB OS commit size above, if
we assume that runtimes are around half that, then for an update
operation we need to download 2 MB extra per updated runtime before we
can even display the list of things to download (which has the
estimated size).

Our current approach of assuming nothing is shared is not great for
size estimation, but it's a lot slimmer than that.

For sure. Just to get an idea, I wrote a script (attached) to pull
some refs from flathub, create new commits with the sizes data and
compare the size of the commit object. Here's some relevant output:

Ref                                              Orig Size    New Size
app/org.gnome.Builder/x86_64/stable              1.8 kB       2.0 MB
app/org.mozilla.firefox/x86_64/stable            1.6 kB       6.6 kB
runtime/org.freedesktop.Platform/x86_64/19.08    2.7 kB       467.3 kB
runtime/org.freedesktop.Sdk/x86_64/19.08         3.8 kB       1.3 MB
runtime/org.gnome.Platform/x86_64/3.36           2.5 kB       1.7 MB
runtime/org.gnome.Sdk/x86_64/3.36                3.5 kB       1.7 MB

It definitely adds a lot of overhead. Interestingly, Builder expands
the most, but that's apparently because it has 47425 objects.
org.gnome.Sdk has 41528. Our OS commit has 97320 objects and the
commit object is 4.1 MB. Yikes.
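
That works out to roughly 42 bytes per object (2.0 MB / 47425, and
similarly 4.1 MB / 97320), which is about what you'd expect from a
32-byte checksum plus a few bytes of size data per entry.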

Maybe there is some in-between option? For example, we could store a
mapping of the size difference between a commit and a number of its
ancestors in a simple commit-id -> download size mapping. It will not
be perfect, because you might have additional objects already
available, but it will be a lot better than the current worst-case
scenario. And it would be a lot cheaper to both download and compute
the download size than with the full object list.
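
To make the shape concrete, here's one way such a mapping could look
as commit metadata; the key name, checksums and sizes below are all
made up:

static GVariant *
build_download_size_map (void)
{
  /* Hypothetical "xa.download-sizes" entry: estimated bytes needed to
   * update from each recent ancestor to this commit. */
  g_autoptr(GVariantBuilder) builder =
    g_variant_builder_new (G_VARIANT_TYPE ("a{st}"));

  g_variant_builder_add (builder, "{st}",
                         "4ac1...88f0" /* parent */,
                         (guint64) (12 * 1024 * 1024));
  g_variant_builder_add (builder, "{st}",
                         "9b3e...02c7" /* grandparent */,
                         (guint64) (19 * 1024 * 1024));

  return g_variant_ref_sink (g_variant_builder_end (builder));
}

The client would just look up its current commit id and fall back to
the worst case if it isn't in the map.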

That could work. On the other hand, if you're already iterating
through all the ancestors to figure out how many common objects they
have, you could just create a delta.

Having the full object list has other benefits on the client side. For
instance, I have a hacky branch that looks at the sizes data to try to
figure out whether it would be better to pull objects or a scratch
delta. I.e., if you're missing 90% of the objects, it might be better
to pull the delta even though it duplicates some of the objects you
have.
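
Roughly, I'd imagine the decision looks something like this sketch
(the threshold is made up, not what the branch actually uses):

static gboolean
prefer_scratch_delta (gsize n_total_objects, gsize n_missing_objects)
{
  /* Prefer the scratch delta when most objects are missing, trading
   * some duplicated bytes for far fewer HTTP requests. */
  if (n_total_objects == 0)
    return FALSE;

  return (double) n_missing_objects / n_total_objects >= 0.9;
}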

Attachment: fill-sizes
Description: Binary data


