This post is not really commenting on the approach we're taking to implement remote execution, but rather on what it enables us to do in terms of source distribution.
On Wed, Apr 11, 2018 at 10:37 PM Jürg Billeter <
j bitron ch> wrote:
[...]
Virtual File System API
~~~~~~~~~~~~~~~~~~~~~~~
As a second phase I'm proposing to introduce a virtual file system API that
both BuildStream core and element plugins can use in place of path-based
file system operations provided by the OS. The goal is to allow transparent
and efficient manipulation of Merkle trees.
The API consists of an abstract Directory class that supports creating
directories, copying/renaming/moving files and whole directory trees, and
importing files from a local path. See also the existing `utils` functions
`link_files` and `copy_files`; equivalent operations need to be supported.
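As a rough illustration of the proposal, the abstract Directory class might look something like the sketch below. The method names (`descend`, `import_files`, `export_files`) are hypothetical placeholders, not the final API:

```python
# Sketch of the abstract Directory API described above.
# All method names are illustrative, not the final BuildStream API.
from abc import ABC, abstractmethod


class Directory(ABC):
    """A virtual directory node in a Merkle tree."""

    @abstractmethod
    def descend(self, subdirectory, create=False):
        """Return the Directory object for a subdirectory,
        optionally creating it if it does not exist."""

    @abstractmethod
    def import_files(self, external_path):
        """Import files from a path on the local file system
        into this virtual directory."""

    @abstractmethod
    def export_files(self, external_path):
        """Copy the contents of this virtual directory out to
        a path on the local file system."""
```

Backends (plain OS file system now, CAS later) would then subclass this, and plugins would only ever see the abstract interface.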
Proposed steps with additional details:
* Create abstract Directory class.
* Implement regular OS file system backend.
* Add get_virtual_directory() to Sandbox class returning a Directory object,
still using regular OS file system backend for now.
* Add boolean class variable BST_VIRTUAL_DIRECTORY for Element plugins to
indicate that they only use get_virtual_directory() instead of
get_directory(). Add error to Sandbox class if get_directory() is used
even though BST_VIRTUAL_DIRECTORY is set.
* Port users of Sandbox.get_directory() to get_virtual_directory():
- Element class
- ScriptElement class
- Compose plugin
- Import plugin
- Stack plugin
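The BST_VIRTUAL_DIRECTORY guard described in the steps above could be enforced in the Sandbox roughly as follows. This is only a sketch under assumed names: the error type and the OS-backed Directory placeholder are hypothetical.

```python
class SandboxError(Exception):
    """Hypothetical error type for misuse of the Sandbox API."""


class OSDirectory:
    """Placeholder for the regular OS file system Directory backend."""
    def __init__(self, path):
        self.path = path


class Sandbox:
    def __init__(self, root, virtual_directory_enabled=False):
        self._root = root
        # Set when the owning plugin declares BST_VIRTUAL_DIRECTORY = True
        self._virtual_directory_enabled = virtual_directory_enabled

    def get_directory(self):
        # Path-based access is an error for plugins that have
        # opted into the virtual directory API.
        if self._virtual_directory_enabled:
            raise SandboxError(
                "get_directory() called by a plugin that set "
                "BST_VIRTUAL_DIRECTORY")
        return self._root

    def get_virtual_directory(self):
        # For now, still backed by the regular OS file system;
        # a CAS backend can be swapped in later.
        return OSDirectory(self._root)
```

The point of the error is to catch plugins that claim virtual-directory support but still reach for raw paths, so the CAS backend can be enabled safely later.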
The steps above do not depend on CAS. However, the following steps do:
* Implement CAS backend for Directory class
* Stage sources into CAS as part of the fetch job
* Use the sources from CAS
Staging sources into CAS is problematic if e.g. the whole .git directory is
included. We should avoid this where possible. This is already a concern
for caching of build trees (#21) as well and can be improved on
independently of any other steps.
If I understand this correctly, we have an opportunity for simplified source "mirroring" here.
If we introduce a SourceCache, which maps a source key to a CAS directory node, the fetch operation becomes:
1) look up the source key in SourceCache
1a.1) when an entry is present, fetch the Directory nodes from CAS and store them in the local CAS
1b.1) when no entry is found, fetch the source in the traditional sense
1b.2) stage the source in a temporary location (assumes #376 is resolved)
1b.3) put the staged source into the local CAS
and, iff you have write permissions to SourceCache:
1b.4) upload the staged source to CAS
1b.5) put an entry into SourceCache
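The flow above can be sketched in a few lines of Python. Everything here is hypothetical: SourceCache is modelled as a plain mapping from source key to CAS directory digest, and the `pull`/`push`/`import_directory`/`fetch_and_stage` names are placeholders, not real BuildStream API:

```python
def fetch(source, source_cache, remote_cas, local_cas, can_write=False):
    """Fetch a source, preferring the SourceCache/CAS path (sketch)."""
    # 1) look up the source key in SourceCache
    digest = source_cache.get(source.key)
    if digest is not None:
        # 1a.1) entry present: pull the Directory nodes into local CAS
        local_cas.pull(digest, remote_cas)
        return digest

    # 1b.1/1b.2) no entry: traditional fetch, staged to a
    # temporary location (assumes #376 is resolved)
    staging_dir = source.fetch_and_stage()
    # 1b.3) put the staged source into the local CAS
    digest = local_cas.import_directory(staging_dir)

    if can_write:
        # 1b.4/1b.5) publish for other clients
        remote_cas.push(digest, local_cas)
        source_cache[source.key] = digest
    return digest
```

The second and later fetches of the same source key never touch the upstream at all; they only pull Directory nodes from CAS.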
This optimizes for remote execution, in the sense that both the actual fetch of the source and the fetch of the files that make up the source are avoided.
I could envision a configuration option, allowing the user to say:
- I am always building locally, make sure that my local CAS has everything I need to build locally.
- I want the actual sources locally, make sure to always fetch in the traditional sense
In terms of source mirroring: this could now be a central instance of bst that is just running bst fetch. It would have all of the fetched sources locally, in case they need to be inspected. All other instances would pull from SourceCache/CAS.
The above would address fetching sources reliably. It would also address:
#261: Investigate the use of git shallow clones (to build instead of tarballs)
#330: Support for mirroring of upstream sources
To an extent, as the original format of the sources is not propagated beyond the host running the "mirror". However, this works without the need to set up anything in terms of serving the sources in their original format.
#328: Support for downloading sources from mirrors
It covers the case of getting sources from a local ecosystem, at least for anything recent. Geographical awareness will need to be a higher-level concept that applies to endpoints like ArtifactCache, SourceCache, CAS, etc.
The lifespan of SourceCache/CAS entries might be limited. This can be mitigated by keeping an archive of the original .bst files and the SourceCache/CAS entries, such that there is always a way to go back [years] in time, without even having to worry too much about the host tools (git, bzr, etc.).
Opening a workspace will still require a traditional source fetch. This should happen on demand if no source is present locally. Alternatively, the user could force a fetch; this is an action a user would perform e.g. when preparing to be offline.
Thoughts?
Cheers,
Sander