Re: Static object deltas plan



On Sun, 2013-07-14 at 19:40 -0400, Jasper St. Pierre wrote:
Have you looked at git's packfile format?

Yeah.

 It's a flexible packfile format that supports deltafied and
undeltafied storage, and might be good enough to use on our own
without making an entirely new format.

It's not possible to use unmodified, because OSTree's objects are
different from git (for example, OSTree's checksum is sha256, the object
types are not the same).

http://git.kernel.org/cgit/git/git.git/tree/Documentation/technical/pack-format.txt

OSTree used to have a packfile format modeled after git's, but I deleted
it for a few reasons:
https://git.gnome.org/browse/ostree/commit/?id=2a0601efc790a0c8f783043f2db682eec9ceffaa

Git's packfiles are used mostly for *local* access, plus they're
generated dynamically by the git daemon.

But for OSTree, I really want to optimize first for the static webserver
case.  I'll eventually add a packfile type thing to OSTree (since it'd
likely be a lot faster for local reading of metadata), but that's really
a different problem.

The git tooling also uses a set of simple heuristics for what to put
where and when to use deltas when constructing packfiles, some of them
we could probably use ourselves.

http://git.kernel.org/cgit/git/git.git/tree/Documentation/technical/pack-heuristics.txt

rsync was here first honestly...though it doesn't attempt to search for
"renames" across directories I think.  Both those and the git heuristics
are interesting to use as a basis though.


Anyways so what makes OSTree static deltas very different from git
packfiles is that if a large binary file changes (for example, an
executable with a bunch of embedded GResource), git has two options:
1) xdiff versus an old version
2) Ship the entire new copy of the binary

In contrast, the OSTree static deltas is *much* more flexible because
it's basically an executable diff with high level operations, influenced
by the Chromium autoupdates.  For example, it could contain "apply
xdiff" as a possible operation, just like Chromium has "apply bsdiff".

Now looking at xdiff specifically: the xdiff algorithm is focused on two
things:
1) Speed of diff computation
2) Shortest edit script/smallest diff

Whereas for the static deltas use case we care mainly about a "useful"
level delta compression, but not necessarily finding *minimal* deltas.
This is why the delta format is designed to be used with a bup/librsync
style rolling checksum algorithm for finding common block ranges.

If you haven't seen bup before:
https://github.com/bup/bup
scroll down to "How it works".

Here's a specific example of where the potential flexibility of the
OSTree model is potentially a large win over git packfiles:  A program
that used to have separate data files in a previous build merging into
one executable using GResource in the new build.

In this case, a static delta could read unchanged blocks from the old
executable and data objects, and write them to the new object.  Whereas
all git could do is choose *one* of those old objects to use as an xdiff
target (probably the executable, and thus ship all the data files
again), or just ship an entirely new object.




[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]