Re: Idea: daily packs



Hey Owen, thanks for looking at this!

On Fri, 2012-08-17 at 20:52 -0400, Owen Taylor wrote:

> The one part of this idea that I haven't figured out is how you would keep
> the next day's 'ostree --pull' from downloading the next huge tarball instead
> of a few thousand smaller files.

I wonder if a simple heuristic like "only use a packfile if over 50%
of the desired objects are in it" would work.  The pack indexes
already tell us which objects each pack contains.  We could be
smarter than that if we also added object sizes to the index.
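
To make that concrete, here's a rough sketch of the decision in
Python - purely illustrative, not OSTree code, and it assumes we can
enumerate a pack's members from its index:

    def should_fetch_pack(wanted, pack_members, threshold=0.5):
        """Decide whether to fetch a whole packfile instead of
        requesting the remaining objects individually.

        wanted       -- set of object checksums we still need
        pack_members -- set of checksums listed in the pack index
        """
        if not wanted:
            return False
        hits = wanted & pack_members
        # "only use a packfile if over 50% of the desired
        # objects are in it"
        return len(hits) > threshold * len(wanted)

If we added object sizes to the index, the same test could compare
bytes instead of object counts.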

Now, there are a few reasons that the .tar.gz is going to be a lot more
efficient for initial download:

* Compression - right now archive mode objects aren't compressed.  We
  could gzip individual objects on the build server, but then we'd
  have to uncompress them when constructing buildroots.  Right now on
  decent hardware, the build system can make buildroots in 2-5
  *seconds*, and keeping that fast is really important to the
  developer experience.

  We could just accept double the disk usage and keep both compressed
  and uncompressed copies of each object - basically equivalent to
  having a gigantic packfile.  Or we could investigate some sort of
  dynamic cache of uncompressed objects (see the sketch after this
  list).

* HTTP overhead - this is pretty substantial.  I think a typical
  request/response pair is on the order of 100 bytes, roughly five
  times the size of a file metadata object (".file") at 18 bytes.
  For example, pulling 100,000 metadata objects individually would
  mean on the order of 10MB of pure HTTP overhead on top of under
  2MB of actual payload.

  The only way to mitigate this is smarter packfile clustering.

* Request/response latency - right now, content fetches
  (.file, .filecontent) are asynchronous, but metadata fetches
  are synchronous.  I'm sure that's a pretty substantial speed
  hit.

  It shouldn't be too hard to make metadata fetches asynchronous
  too (a rough sketch follows this list).
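
On the compression point, the "dynamic cache" idea could look
something like this - a minimal sketch, assuming objects are stored
gzip-compressed and keyed by checksum (all names here are
hypothetical, not existing OSTree API):

    import gzip
    import os
    import shutil

    def uncompressed_object_path(checksum, compressed_path, cache_dir):
        """Return a path to an uncompressed copy of an object,
        gunzipping into a cache directory on first use so that
        buildroot construction can stay fast."""
        cached = os.path.join(cache_dir, checksum)
        if not os.path.exists(cached):
            tmp = cached + '.tmp'
            with gzip.open(compressed_path, 'rb') as src:
                with open(tmp, 'wb') as dst:
                    shutil.copyfileobj(src, dst)
            # atomic publish, assuming tmp and cache_dir share
            # a filesystem
            os.rename(tmp, cached)
        return cached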
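
And on the latency point, making metadata fetches asynchronous
basically means keeping a window of outstanding requests and queuing
newly discovered children instead of blocking on each one.  A rough
sketch of the traversal pattern, using Python's asyncio purely for
illustration (fetch_object and parse_children are hypothetical
callbacks):

    import asyncio

    async def pull_metadata(root, fetch_object, parse_children,
                            max_in_flight=16):
        """Walk the metadata graph with up to max_in_flight
        requests outstanding instead of one blocking fetch at
        a time."""
        queue = asyncio.Queue()
        seen = {root}
        await queue.put(root)

        async def worker():
            while True:
                checksum = await queue.get()
                try:
                    data = await fetch_object(checksum)
                    for child in parse_children(data):
                        if child not in seen:
                            seen.add(child)
                            await queue.put(child)
                finally:
                    queue.task_done()

        workers = [asyncio.create_task(worker())
                   for _ in range(max_in_flight)]
        await queue.join()  # every queued object has been fetched
        for w in workers:
            w.cancel()

Here fetch_object would wrap the HTTP request and parse_children
would extract child checksums from a metadata object; the point is
just that discovery and fetching overlap.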

>  Without a smart server it seems hard to
> optimize

Hard*er*, clearly - but if we make more static information available
(object sizes, etc.), I think it's definitely possible to push
significant amounts of computation to the client.

> Note: The initial pull time isn't awful and even on slower network is probably only
> a couple of hours - so not a first priority. But factor-of-6 is a pretty big speedup
> to try to get at some point.

Right.  I figure that for now, interested parties can just run the
download in the background, but longer term an improved story here is
going to be important.

> Note: I subsequently tried using 'ostree pack' to create a gigantic single file pack
> and it did give the expected speedup to something similar to the .tar.gz download.

Makes sense.  OSTree's metadata overhead is higher than tar's, but
the three factors above almost certainly dominate.

I think we should at least investigate the "50% or more" packfile
heuristic.

Oh, one other thing - to really optimize the initial download, we
could have ostree support downloading packfiles via BitTorrent.



