Re: [RFC] use reflinks to dedup files with the same content



Giuseppe Scrivano <gscrivano gnu org> writes:

I'd like to take advantage of reflinks where possible so that
deduplication also works for files that differ only in their
xattrs.

The use case I have in mind is deduplicating files coming from a
container image that are already present in the ostree repository
but with a different SELinux label.

I think this could be done either at pull time or at checkout time:

at pull time: the object with the different inode is created in the
repository as a reflink to the object already present.  Nothing
changes for the checkout phase.  If reflinks are not supported, a
full copy is made instead.
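
A minimal sketch of the pull-time fallback described above, assuming
Linux and the FICLONE ioctl.  The helper name clone_or_copy is
hypothetical and is not taken from ostree or from the PR; it only
illustrates "reflink if the filesystem allows it, otherwise copy":

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#ifdef __linux__
#include <linux/fs.h> /* FICLONE */
#endif

/* Hypothetical helper (not an ostree API): create dst_path so that it
 * shares src_path's data blocks via a reflink; if that fails, fall back
 * to a plain byte copy.
 * Returns 1 if reflinked, 0 if copied, -1 on error. */
int
clone_or_copy (const char *src_path, const char *dst_path)
{
  char buf[65536];
  int ret = -1;

  int src = open (src_path, O_RDONLY);
  if (src < 0)
    return -1;

  /* O_EXCL: never overwrite an object that is already there. */
  int dst = open (dst_path, O_WRONLY | O_CREAT | O_EXCL, 0644);
  if (dst < 0)
    {
      close (src);
      return -1;
    }

#ifdef FICLONE
  /* Fails on filesystems without reflink support or across
   * filesystems; in that case dst is still empty and we copy. */
  if (ioctl (dst, FICLONE, src) == 0)
    {
      ret = 1;
      goto out;
    }
#endif

  for (;;)
    {
      ssize_t n = read (src, buf, sizeof buf);
      if (n == 0)
        break;
      if (n < 0)
        {
          if (errno == EINTR)
            continue;
          goto out;
        }
      char *p = buf;
      while (n > 0)
        {
          ssize_t w = write (dst, p, (size_t) n);
          if (w < 0)
            {
              if (errno == EINTR)
                continue;
              goto out;
            }
          p += w;
          n -= w;
        }
    }
  ret = 0;

out:
  close (src);
  if (close (dst) < 0)
    ret = -1;
  if (ret < 0)
    unlink (dst_path); /* do not leave a partial object behind */
  return ret;
}
```

After a successful return the new file has its own inode, so a
different set of xattrs (for example a different SELinux label) can be
applied to it without touching the original object.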

at checkout time: for a file that has the same "content checksum" as a
file in the storage, we create a reflink to the file in the ostree
repository and then the xattrs are set afterwards.

In both cases we need a way to store the "content checksum" of a file so
that we can look it up when we add a new object.

I think the first way is cleaner as it won't change how files are checked
out and we will still be able to validate them (i.e. check if they were
modified).  Also, if reflinks are not possible, we will still keep only
one copy instead of a copy for each checkout (shared via hard links as
usual).

What do you think?  How should we store the "content checksum"?

I went ahead and started working on the first approach.  I've opened a
PR here:

  https://github.com/ostreedev/ostree/pull/1443

Regards,
Giuseppe

