Re: rsyncing repositories



Back from vacation now, so I'm starting to think about this again.

On Mon, Oct 17, 2016 at 2:00 PM, Colin Walters <walters@verbum.org> wrote:
> [Splitting out this rsync topic]

> On Fri, Oct 7, 2016, at 05:12 PM, Dan Nicholson wrote:

>> Oh, I thought the issues here were well known.

> We should have a bug/issue to reference, otherwise they may
> get forgotten.  I'm sure we've talked about this in the past,
> but let's try to be sure issues get filed.

> Now, two things. First, I've created:
>
> https://github.com/ostreedev/ostree-releng-scripts
>
> And submitted a PR:
>
> https://github.com/ostreedev/ostree-releng-scripts/pull/2
>
> I've lightly tested this script, and I'd like to support rsyncing
> repositories, even though I think we can get sshfs to perform
> better ( https://bugzilla.gnome.org/show_bug.cgi?id=756540 )

I looked at the current version of the rsync-repos script, and it does
seem to address some of the shortcomings (particularly with --delete
ordering).

>> 1. Works entirely by chance, because objects sorts before refs, which
>> sorts before summary, and rsync publishes in alphabetical sort order.

> Yes, but that also isn't a bug =)

Unless you're using --delete, in which case you'd remove an object
before a ref was updated or removed. But you did address that in your
script.
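
For reference, the ordering that makes this safe looks roughly like
the following sketch (made-up paths, and not the actual contents of
rsync-repos): new objects go first, then refs and the summary, and a
deleting pass over objects only runs once no ref can point at them:

    # 1. Push new objects before anything references them.
    rsync -a src/repo/objects/ dest:/srv/repo/objects/
    # 2. Update refs (removing stale ones), then the summary.
    rsync -a --delete src/repo/refs/ dest:/srv/repo/refs/
    rsync -a src/repo/summary dest:/srv/repo/summary
    # 3. Only now is it safe to delete unreferenced objects.
    rsync -a --delete src/repo/objects/ dest:/srv/repo/objects/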

>> 2. Objects are pushed directly into place. If there's a crash or
>> network interruption in the middle of the sync, you have a possibly
>> corrupt repo. If you smartly turn on the --delay-updates option so
>> that files are uploaded to a temporary name and renamed into place,
>> you now might leave behind a bunch of hidden file cruft if there's an
>> interruption.

> This doesn't seem true to me; it requires --inplace to be specified.
> Were you doing that?

Sorry, I didn't say that quite right. The files aren't truly updated
in place; each one is written to a temporary file first. With
--delay-updates, a temporary directory is used instead. Either way,
there are issues.

Rsync operates one directory at a time. Since the objects are spread
across separate subdirectories, some objects will likely be moved into
place before all of their children have been created. --delay-updates
helps a bit by using a temporary directory and moving all the
temporary files into place in one go. I.e., it narrows the window of
inconsistency a little, but not much, since rsync still has to iterate
through the other directories. This is less of an issue if the refs
update comes later, though.

However, if there's an interruption during the transfer, the repo will
be left in an inconsistent state. There will also be temporary files
or directories in the repo that may not be cleaned up until the next
run.
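
To be concrete about the rsync behavior (the flags here are for
illustration only):

    # Default: each file is written under a temporary name in its
    # destination directory and renamed as soon as it finishes.
    rsync -a src/repo/ dest:/srv/repo/
    # --delay-updates stages finished files in per-directory .~tmp~/
    # directories and renames them at the end of the transfer,
    # narrowing (but not closing) the inconsistency window.
    rsync -a --delay-updates src/repo/ dest:/srv/repo/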

>> 3. If you want to support removing commits, then you have to use
>> --delete. Due to the above ordering, you'll now remove objects before
>> the refs have been removed, and you have an invalid repository if
>> anyone pulls during that window.

> This indeed is a serious issue, and my rsync PR above doesn't address
> it, but I want to lay the groundwork there first, and then handle
> --delete afterwards.

>> 4. Since there's no locking of the ostree repo on the source end, you
>> can publish broken commits. This would happen if a source ref got
>> updated after rsync had completed part of the objects sync (this has
>> definitely happened to us before).

> Yes, although I think of this as a pipeline... you have multiple
> internal workers which are generating content into an internal repo,
> and then that repo is locked when publishing.
>
> We definitely need locking to do pruning of the repo.  One way to
> implement this would simply be to create a temporary snapshot of the
> repo via pull-local and do the prune + rsync on that.  That means the
> "base repo" will accumulate space, but now it's possible to replace
> it at any point with the "public repo".

I really think that ostree needs a repo locking mechanism for pruning
regardless of whether you want to use rsync or not. I'm pretty sure
we've run into repo corruption at Endless when 2 processes are
operating on the repo at the same time, but I haven't had time to look
closer.
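
Until something native exists, one stopgap is external serialization
with flock(1), sketched here (the lock path is hypothetical):

    # Hold an exclusive lock across any mutating operation so that,
    # e.g., a prune can't race a concurrent commit or pull-local.
    flock /srv/repo/.publish-lock \
        ostree --repo=/srv/repo prune --refs-only --depth=2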

>> You can probably think of lots more. It's really only safe if you know
>> both sides won't be accessed during the sync.

> Let's leave aside concurrency on the write+sync side for the
> moment, and focus on these steps:
>
> 1) Implement an rsync script without delete support
> 2) Enhance the script to do deletes
> 3) Create a higher-level script which does snapshotting for
>    clones or something, as sketched out above (I think it'll work,
>    but baby steps first)

That seems fine. I'm interested to know how fast the "pull-local,
rsync, pull-local" method can go. That was one of the ideas I had
before that I didn't prototype.
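
Concretely, I read your snapshot sketch as something like the
following (the repo paths, mode, and ref are invented, and the
rsync-repos options may differ from the actual script):

    # Snapshot the internal repo so pruning can't race the builders.
    ostree --repo=/srv/snapshot init --mode=archive-z2
    ostree --repo=/srv/snapshot pull-local /srv/base exampleos/x86_64/master
    # Prune the snapshot, then publish it.
    ostree --repo=/srv/snapshot prune --refs-only --depth=2
    ./rsync-repos --src /srv/snapshot --dest user@host:/srv/public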

That said, the concurrency is the source of the majority of these
issues. It feels like that can't be solved without native support in
ostree. I.e., there's no way for the update to appear atomic to
clients without using ostree to manage it coherently.

The other idea I've been thinking about is another round of the
ostree-push script, where you use ssh to tunnel a local HTTP port to
the remote and then use pull. I haven't had time to play with that,
though.

> What do you think about merging your push script into
> ostree-releng-scripts?  Actually, I'm uncertain - are you using
> ostree-push right now?  How do you see it handling deletions?
> Or, to flip this around, is it worth doing the push script if we
> can enhance the rsync wrapper and/or sshfs enough?

We aren't using it right now. The version with the custom protocol
over SSH is a toy that I haven't played with more. I still think it's
a good approach, but you'd have to duplicate a lot of ostree pull
internals. The version with sshfs is unusable. I'm really doubtful
that it can become usable since a real update will always require
traversal, and that's just way too slow over sshfs. Maybe you can find
a way to make that work.

As mentioned before (and as you suggested in the bug), what I intend
to work on is triggering a pull from the source to the destination
over HTTP, since that's the only way to get reliable repo handling
semantics. So you'd ssh to the remote host and pull with --url
pointing to the worker. If the worker doesn't have an HTTP server
running or is firewalled, spin up trivial-httpd and tunnel the port to
the remote host, using a localhost --url. The important part is that
each repo ends up being operated on by the local host, and you're not
trying to traverse the repo over the network.
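
In sketch form (the hostnames, port, remote name, and ref are all
invented, and this assumes the destination repo already has an
"origin" remote whose configured URL the --url option overrides):

    # On the worker: serve the build repo on a local port.
    ostree trivial-httpd --port=8080 /srv/build-repo &
    # Tunnel the port to the publishing host and run the pull there,
    # so the public repo is only ever touched by its local ostree.
    ssh -R 8080:127.0.0.1:8080 publish.example.com \
        "ostree --repo=/srv/public pull --url=http://127.0.0.1:8080 \
         origin exampleos/x86_64/master"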

--
Dan

