On Mon, Aug 24, 2020 at 7:48 AM Alexander Larsson <alexl redhat com> wrote:
> On Mon, 2020-08-24 at 06:58 -0600, Dan Nicholson wrote:
> > On Mon, Aug 24, 2020, 3:56 AM Alexander Larsson via ostree-list
> > <ostree-list gnome org> wrote:
> > > I noticed that the endless ostree.sizes metadata support landed,
> > > and I'm interested in using these to improve the flatpak download
> > > size estimations. However, I seem to remember that there was some
> > > kind of issue with these the last time they were tried. Wasn't
> > > there a size limit for the commit metadata or something that got
> > > overrun by large commits and made this not work? Was something
> > > done to fix this?
> >
> > It definitely works, and I doubt any flatpaks will be larger than
> > the OS we ship. I fixed several smaller issues when I was
> > upstreaming it. The issue is that it makes the commit metadata
> > really large for large commits. IIRC, our OS commit objects are
> > like 4 MB or something like that. So, it can make things a little
> > slow since ostree will always fetch the commit metadata when
> > pulling. For smaller commits I don't think it would be much of an
> > issue, but for runtimes you'd probably see it.
>
> I don't remember the exact details, but OSTree has this:
>
>     /**
>      * OSTREE_MAX_METADATA_SIZE:
>      *
>      * Default limit for maximum permitted size in bytes of metadata
>      * objects fetched over HTTP (including repo/config files, refs,
>      * and commit/dirtree/dirmeta objects). This is an arbitrary
>      * number intended to mitigate disk space exhaustion attacks.
>      */
>     #define OSTREE_MAX_METADATA_SIZE (10 * 1024 * 1024)
>
> And I remember endless running into this at some point back when it
> was originally using ostree.sizes.
As far as I know, it's never happened in the 5 or so years that I've been around this code. However, very early on I believe the object checksums were stored as hex strings, which would certainly bloat the size. In upstream it's simply a byte array. I think that the hex string format was only in our fork because I don't see any history of it upstream.
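To illustrate why the hex-string format bloats the metadata (my own illustration, not OSTree code): a SHA-256 checksum stored as a raw byte array is half the size of the same checksum stored as an ASCII hex string, and that difference adds up over tens of thousands of objects.

```python
# Illustration: size of one object checksum as a raw byte array
# (what upstream ostree.sizes stores) vs. as a hex string.
import hashlib

digest = hashlib.sha256(b"example object").digest()
hex_digest = digest.hex()

print(len(digest))      # 32 bytes as a raw byte array
print(len(hex_digest))  # 64 bytes as an ASCII hex string
```

Over a commit with ~47000 objects, the hex form alone would add roughly 1.5 MB of extra checksum data.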
> > The other thing that's a little funky is that you need to do a
> > metadata-only pull to get it, which leaves the repo in a state
> > where it thinks it has a partial commit and skips deltas, as was
> > found before. I think there are ways to handle it, and flatpak
> > already does something like this for a different reason I can't
> > recall, but it's something I thought needed a better solution in
> > ostree. I've always wanted to have flatpak use it, but I recall
> > that you were opposed to it.
>
> We did initially support it, but we stopped using it because it
> wasn't very effective. For example, per your above OS commit size of
> 4 MB, if we assume that runtimes are around half that, then for an
> update operation we need to download 2 MB extra per updated runtime
> before we can even display the list of things to download (which has
> the estimated size). Our current approach of assuming nothing is
> shared is not great for size estimation, but it's a lot slimmer than
> that.
For sure. Just to get an idea, I wrote a script (attached) to pull
some refs from flathub, create new commits with the sizes data, and
compare the size of the commit object. Here's some relevant output:

    Ref                                            Orig Size  New Size
    app/org.gnome.Builder/x86_64/stable            1.8 kB     2.0 MB
    app/org.mozilla.firefox/x86_64/stable          1.6 kB     6.6 kB
    runtime/org.freedesktop.Platform/x86_64/19.08  2.7 kB     467.3 kB
    runtime/org.freedesktop.Sdk/x86_64/19.08       3.8 kB     1.3 MB
    runtime/org.gnome.Platform/x86_64/3.36         2.5 kB     1.7 MB
    runtime/org.gnome.Sdk/x86_64/3.36              3.5 kB     1.7 MB

It definitely adds a lot of overhead. Interestingly, Builder expands
the most, but that's apparently because it has 47425 objects.
org.gnome.Sdk has 41528. Our OS commit has 97320 objects and the
commit object is 4.1 MB. Yikes.
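A rough back-of-envelope check (my own arithmetic, not part of the attached script): dividing the measured commit object sizes by the reported object counts suggests an overhead on the order of 40 bytes per object, which is plausible for a 32-byte binary checksum plus variable-length size fields and serialization overhead.

```python
# Per-object overhead implied by the numbers reported above:
# (commit object size in bytes, number of objects in the commit)
measurements = {
    "app/org.gnome.Builder/x86_64/stable": (2.0e6, 47425),
    "endless OS commit": (4.1e6, 97320),
}

for ref, (commit_bytes, n_objects) in measurements.items():
    print(f"{ref}: ~{commit_bytes / n_objects:.0f} bytes/object")
```

Both work out to roughly 42 bytes per object, so the metadata size scales essentially linearly with the object count.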
> Maybe there is some in-between option? For example, we could store a
> mapping of the size difference between a commit and a number of its
> ancestors in a simple commit-id -> download size mapping. It will
> not be perfect, because you might have additional objects already
> available, but it will be a lot better than the current worst-case
> scenario. And it would be a lot cheaper to both download and compute
> the download size than with the full object list.
That could work. On the other hand, if you're already iterating through all the ancestors to figure out how many common objects they have, you could just create a delta. Having the full object list has other benefits on the client side. For instance, I have a hacky branch that looks at the sizes data to try to figure out whether it would be better to pull objects or a scratch delta. I.e., if you're missing 90% of the objects, it might be better to pull the delta even though it duplicates some of the objects you have.
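The objects-vs-scratch-delta heuristic mentioned above could look something like this (my own sketch of the idea, not the actual branch; the helper name and sizes are made up): with per-object sizes available, compare the bytes needed to fetch only the missing objects against the size of a scratch (from-empty) delta, which compresses better but duplicates objects the client already has.

```python
# Sketch of the decision heuristic (hypothetical).
def prefer_scratch_delta(object_sizes, have, scratch_delta_size):
    """object_sizes: dict of object checksum -> download size in bytes.
    have: set of checksums already present in the local repo.
    Returns True when fetching the scratch delta is cheaper than
    fetching the missing objects individually."""
    missing_bytes = sum(
        size for checksum, size in object_sizes.items()
        if checksum not in have
    )
    return scratch_delta_size < missing_bytes

object_sizes = {"obj1": 80_000_000, "obj2": 15_000_000, "obj3": 5_000_000}

# Missing 85 MB of objects: a 60 MB scratch delta wins.
print(prefer_scratch_delta(object_sizes, {"obj2"}, 60_000_000))  # True
# Missing only 5 MB: pulling individual objects wins.
print(prefer_scratch_delta(object_sizes, {"obj1", "obj2"}, 60_000_000))  # False
```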
Attachment:
fill-sizes
Description: Binary data