[Notes] [Git][BuildStream/buildstream][tpollard/566] 12 commits: Fix sta

Tom Pollard pushed to branch tpollard/566 at BuildStream / buildstream

Commits:

891fcb0e

by Tristan Van Berkom at 2019-01-07T16:47:01Z

Fix stack traces discovered with ^C forceful termination.

  * utils.py:_kill_process_tree(): Ignore NoSuchProcess errors

    These are caused because we issue SIGTERM, and if the process
    has not exited after a timeout, we kill it.

  * _scheduler/jobs/job.py: Stop handling NoSuchProcess errors here
    redundantly, they are already ignored.

It seems that we were ignoring it after sleeping when terminating
tasks from the scheduler... but we were not ignoring it when performing
the same pattern in the `Plugin.call()` -> `utils._call()` path, so
we were still getting these exceptions at termination time from host
tool processes launched by source plugins.
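
The terminate-then-kill pattern described above can be sketched as follows. This is a minimal illustration and not the BuildStream code: it assumes a POSIX host, uses only the standard library (`ProcessLookupError` stands in for `psutil.NoSuchProcess`), and the helper names `kill_quietly` and `terminate_then_kill` are hypothetical.

```python
import os
import signal
import subprocess
import sys


def kill_quietly(pid, sig=signal.SIGKILL):
    """Send `sig` to `pid`; return False if the process is already gone."""
    try:
        os.kill(pid, sig)
    except ProcessLookupError:
        # The process exited on its own before we signalled it.
        # This is the case the commit chooses to ignore.
        return False
    return True


def terminate_then_kill(proc, timeout=1.0):
    """Send SIGTERM, wait up to `timeout` seconds, then fall back to SIGKILL."""
    proc.terminate()
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Still alive after the grace period: kill it, tolerating a race
        # where it exits between the timeout and the kill.
        kill_quietly(proc.pid)
        proc.wait()
    return proc.returncode
```

Keeping the "already gone" handling inside the kill helper means every caller gets the same behaviour, which seems to be the point of moving it into `utils._kill_process_tree()`.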

5de42d43

by Tristan Van Berkom at 2019-01-07T18:00:37Z

Merge branch 'tristan/keyboard-interrupt-stack-trace' into 'master'

Fix stack traces discovered with ^C forceful termination.

See merge request BuildStream/buildstream!1043

059035b9

by Tristan Van Berkom at 2019-01-07T18:02:00Z

_scheduler/scheduler.py: Make _schedule_jobs() private

This is not used anywhere outside of the Scheduler, currently
only the Scheduler itself is allowed to queue a job at this level.

If the high-level business logic for automatic queueing of auxiliary
jobs moves to another location, we can make this public again.

b83d1b1f

by Tristan Van Berkom at 2019-01-07T18:02:00Z

_scheduler/scheduler.py: Only run one cache size job at a time

When queuing the special cache management related cleanup and
cache size jobs, we now treat these jobs as special and do the
following:

  * Avoid queueing a cleanup/cache_size job if one is already queued

    We just drop redundantly queued jobs here.

  * Ensure that jobs of this type only run one at a time

    This could have been done with the Resources mechanics,
    however as these special jobs have the same properties and
    are basically owned by the Scheduler, it seemed more
    straightforward to handle the behaviors of these special jobs together.

This fixes issue #753
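
The two rules above (drop a redundant request, run one job of a kind at a time) can be sketched with a small bookkeeping class. The class and method names here are hypothetical and do not appear in the commit. Allowing one job to wait while another of the same kind runs is an assumption of this sketch.

```python
class ExclusiveJobSlots:
    """Track special job kinds that are queued once and run one at a time."""

    def __init__(self):
        self._waiting = []     # job kinds queued but not yet started
        self._running = set()  # job kinds with an instance currently running

    def queue(self, kind):
        # Drop the request if a job of this kind is already waiting.
        if kind in self._waiting:
            return False
        self._waiting.append(kind)
        return True

    def start_ready(self):
        # Start each waiting kind that has no running instance.
        started = [k for k in self._waiting if k not in self._running]
        for kind in started:
            self._waiting.remove(kind)
            self._running.add(kind)
        return started

    def complete(self, kind):
        # Free the slot so that a waiting job of this kind may start.
        self._running.discard(kind)
```

In this sketch a second `queue("cache_size")` call is refused while the first is still waiting, and a job queued while one is running only starts after `complete("cache_size")`.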

16a8816f

by Tristan Van Berkom at 2019-01-07T18:02:00Z

Scheduler: Introduced JobStatus instead of simple success boolean

This changes the deepest callback from when a Job completes to
propagate a JobStatus value instead of a simple boolean, and updates
all of the affected code paths which used to receive a boolean
to now handle the JobStatus values.

This further improves the situation for issue #753, as now we avoid
queueing cache size jobs for pull jobs which are skipped.
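
A minimal sketch of why a three-valued status helps here, loosely following the diff further down. The `JobStatus` values, `RC_PERM_FAIL` and `RC_SKIPPED` are taken from the diff. The values of `RC_OK` and `RC_FAIL` are assumptions, and `resolve_status()` and `pull_done()` are hypothetical helper names.

```python
# Child process return codes. RC_PERM_FAIL and RC_SKIPPED match the diff;
# the values of RC_OK and RC_FAIL are assumed for this sketch.
RC_OK, RC_FAIL, RC_PERM_FAIL, RC_SKIPPED = 0, 1, 2, 3


class JobStatus:
    OK = 0       # Job succeeded
    FAIL = 1     # An error was raised
    SKIPPED = 3  # A SkipJob was raised


def resolve_status(returncode):
    """Map a child return code to the outward facing JobStatus."""
    if returncode == RC_OK:
        return JobStatus.OK
    if returncode == RC_SKIPPED:
        return JobStatus.SKIPPED
    # Anything else, including unknown codes, is treated as a failure.
    return JobStatus.FAIL


def pull_done(status, check_cache_size):
    """Only a pull that actually fetched something can grow the cache."""
    if status == JobStatus.FAIL:
        return
    if status == JobStatus.OK:
        check_cache_size()
```

With a boolean, a skipped pull was reported as a success and so triggered a cache size job. With `JobStatus.SKIPPED` it no longer does.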

c2fc2a5e

by Tristan Van Berkom at 2019-01-07T18:02:00Z

_scheduler/jobs/job.py: Removed 'skipped' property

This is redundant now that we report it through the JobStatus.

3e3984ad

by Tristan Van Berkom at 2019-01-07T18:50:23Z

Merge branch 'tristan/one-cache-size-job' into 'master'

Only queue one cache size job

Closes #753

See merge request BuildStream/buildstream!1040

512c726e

by Tristan Van Berkom at 2019-01-08T03:38:11Z

sandbox/sandbox.py: Fix regression of command logging

Since we added batch commands, the batch commands print the
text of the commands directly in the message text, but this is wrong.

The detail string is the appropriate place for text of unknown length
(the user can configure the maximum number of command lines they
want to see in their log). The message text itself should be controlled
and brief enough to avoid text wrapping.
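
The split between a brief message and a detail string of unbounded length can be illustrated as below. This is a hypothetical helper and not the sandbox code: the function name, the `max_lines` parameter and the `...` truncation marker are all assumptions.

```python
def command_log(label, commands, max_lines=3):
    """Return a (message, detail) pair for a batch of commands.

    The message stays short and fixed. The command text goes into the
    detail string, truncated to `max_lines` lines.
    """
    message = "Running {}".format(label)
    lines = "\n".join(commands).splitlines()
    if len(lines) > max_lines:
        # Keep the first max_lines lines and mark the truncation.
        lines = lines[:max_lines] + ["..."]
    return message, "\n".join(lines)
```

The message never depends on the length of the commands, so it cannot cause text wrapping in the log.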

01171988

by Tristan Van Berkom at 2019-01-08T04:20:14Z

Merge branch 'tristan/fix-command-status-messages' into 'master'

sandbox/sandbox.py: Fix regression of command logging

See merge request BuildStream/buildstream!1044

6c1d06d6

by Phil Dawson at 2019-01-08T10:24:32Z

element.py: remove reference to source bundle command

This command has been replaced by the bst source checkout command.

914ecb72

by Jürg Billeter at 2019-01-08T10:54:02Z

Merge branch 'phil/remove-source-bundle-reference' into 'master'

element.py: remove documentation reference to source bundle command

See merge request BuildStream/buildstream!1041

75b0c186

by Tom Pollard at 2019-01-08T12:12:37Z

WIP: Make uploading of build trees configurable

20 changed files:

buildstream/_artifactcache/artifactcache.py
buildstream/_artifactcache/cascache.py
buildstream/_frontend/app.py
buildstream/_scheduler/__init__.py
buildstream/_scheduler/jobs/__init__.py
buildstream/_scheduler/jobs/cachesizejob.py
buildstream/_scheduler/jobs/cleanupjob.py
buildstream/_scheduler/jobs/elementjob.py
buildstream/_scheduler/jobs/job.py
buildstream/_scheduler/queues/buildqueue.py
buildstream/_scheduler/queues/fetchqueue.py
buildstream/_scheduler/queues/pullqueue.py
buildstream/_scheduler/queues/queue.py
buildstream/_scheduler/queues/trackqueue.py
buildstream/_scheduler/scheduler.py
buildstream/element.py
buildstream/sandbox/sandbox.py
buildstream/utils.py
+ tests/integration/pushbuildtrees.py
tests/testutils/runcli.py

Changes:

buildstream/_artifactcache/artifactcache.py

@@ -74,6 +74,7 @@ class ArtifactCache():
          self._has_fetch_remotes = False
          self._has_push_remotes = False
 +        self._has_partial_push_remotes = False
          os.makedirs(self.extractdir, exist_ok=True)
@@ -398,6 +399,8 @@ class ArtifactCache():
                  self._has_fetch_remotes = True
                  if remote_spec.push:
                      self._has_push_remotes = True
 +                    if remote_spec.partial_push:
 +                        self._has_partial_push_remotes = True
                  remotes[remote_spec.url] = CASRemote(remote_spec)
@@ -596,6 +599,31 @@ class ArtifactCache():
              remotes_for_project = self._remotes[element._get_project()]
              return any(remote.spec.push for remote in remotes_for_project)
 +    # has_partial_push_remotes():
 +    #
 +    # Check whether any remote repositories are available for pushing
 +    # non-complete artifacts
 +    #
 +    # Args:
 +    #     element (Element): The Element to check
 +    #
 +    # Returns:
 +    #   (bool): True if any remote repository is configured for optional
 +    #            partial pushes, False otherwise
 +    #
 +    def has_partial_push_remotes(self, *, element=None):
 +        # If there's no partial push remotes available, we can't partial push at all
 +        if not self._has_partial_push_remotes:
 +            return False
 +        elif element is None:
 +            # At least one remote is set to allow partial pushes
 +            return True
 +        else:
 +            # Check whether the specified element's project has push remotes configured
 +            # to accept partial artifact pushes
 +            remotes_for_project = self._remotes[element._get_project()]
 +            return any(remote.spec.partial_push for remote in remotes_for_project)
++
      # push():
+     #
      # Push committed artifact to remote repository.
@@ -603,6 +631,8 @@ class ArtifactCache():
      # Args:
      #     element (Element): The Element whose artifact is to be pushed
      #     keys (list): The cache keys to use
 +    #     partial(bool): If the artifact is cached in a partial state
 +    #     subdir(string): Optional subdir to not push
+     #
      # Returns:
      #   (bool): True if any remote was updated, False if no pushes were required
@@ -610,12 +640,25 @@ class ArtifactCache():
      # Raises:
      #   (ArtifactError): if there was an error
+     #
 -    def push(self, element, keys):
 +    def push(self, element, keys, partial=False, subdir=None):
          refs = [self.get_artifact_fullname(element, key) for key in list(keys)]
          project = element._get_project()
 -        push_remotes = [r for r in self._remotes[project] if r.spec.push]
 +        push_remotes = []
 +        partial_remotes = []
++
 +        # Create list of remotes to push to, given current element and partial push config
 +        if not partial:
 +            push_remotes = [r for r in self._remotes[project] if (r.spec.push and not r.spec.partial_push)]
++
 +        if self._has_partial_push_remotes:
 +            # Create a specific list of the remotes expecting the artifact to be pushed in a partial
 +            # state. This list needs to be pushed in a partial state, without the optional subdir if
 +            # it exists locally. No need to attempt pushing a partial artifact to a remote that is
 +            # queued to also receive a full artifact
 +            partial_remotes = [r for r in self._remotes[project] if (r.spec.partial_push and r.spec.push) and
 +                               r not in push_remotes]
          pushed = False
@@ -624,7 +667,7 @@ class ArtifactCache():
              display_key = element._get_brief_display_key()
              element.status("Pushing artifact {} -> {}".format(display_key, remote.spec.url))
 -            if self.cas.push(refs, remote):
 +            if self.cas.push(refs, remote, subdir=subdir):
                  element.info("Pushed artifact {} -> {}".format(display_key, remote.spec.url))
                  pushed = True
              else:
@@ -632,6 +675,19 @@ class ArtifactCache():
                      remote.spec.url, element._get_brief_display_key()
                  ))
 +        for remote in partial_remotes:
 +            remote.init()
 +            display_key = element._get_brief_display_key()
 +            element.status("Pushing partial artifact {} -> {}".format(display_key, remote.spec.url))
++
 +            if self.cas.push(refs, remote, excluded_subdirs=subdir):
 +                element.info("Pushed partial artifact {} -> {}".format(display_key, remote.spec.url))
 +                pushed = True
 +            else:
 +                element.info("Remote ({}) already has {} partial cached".format(
 +                    remote.spec.url, element._get_brief_display_key()
 +                ))
++
          return pushed
      # pull():
@@ -659,14 +715,23 @@ class ArtifactCache():
                  element.status("Pulling artifact {} <- {}".format(display_key, remote.spec.url))
                  if self.cas.pull(ref, remote, progress=progress, subdir=subdir, excluded_subdirs=excluded_subdirs):
 -                    element.info("Pulled artifact {} <- {}".format(display_key, remote.spec.url))
                      if subdir:
 -                        # Attempt to extract subdir into artifact extract dir if it already exists
 -                        # without containing the subdir. If the respective artifact extract dir does not
 -                        # exist a complete extraction will complete.
 -                        self.extract(element, key, subdir)
 -                    # no need to pull from additional remotes
 -                    return True
 +                        if not self.contains_subdir_artifact(element, key, subdir):
 +                            # The pull was expecting the specific subdir to be present, attempt
 +                            # to find it in other available remotes
 +                            element.info("Pulled partial artifact {} <- {}. Attempting to retrieve {} from remotes"
 +                                         .format(display_key, remote.spec.url, subdir))
 +                        else:
 +                            element.info("Pulled artifact {} <- {}".format(display_key, remote.spec.url))
 +                            # Attempt to extract subdir into artifact extract dir if it already exists
 +                            # without containing the subdir. If the respective artifact extract dir does not
 +                            # exist a complete extraction will complete.
 +                            self.extract(element, key, subdir)
 +                            # no need to pull from additional remotes
 +                            return True
 +                    else:
 +                        element.info("Pulled artifact {} <- {}".format(display_key, remote.spec.url))
 +                        return True
                  else:
                      element.info("Remote ({}) does not have {} cached".format(
                          remote.spec.url, element._get_brief_display_key()
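
The remote selection that `push()` performs in the diff above can be sketched in isolation. `RemoteSpec` and `split_push_remotes()` are hypothetical names used only for this illustration. The `_has_partial_push_remotes` guard is left out because it does not change the result.

```python
from collections import namedtuple

# Hypothetical stand-in for CASRemoteSpec, reduced to the fields used here.
RemoteSpec = namedtuple("RemoteSpec", "url push partial_push")


def split_push_remotes(specs, partial):
    """Return (full_remotes, partial_remotes) for one artifact push.

    A complete artifact goes to push remotes that do not allow partial
    pushes. Remotes that do allow partial pushes always receive the
    artifact in partial form.
    """
    full = []
    if not partial:
        full = [s for s in specs if s.push and not s.partial_push]
    partial_remotes = [s for s in specs
                       if s.push and s.partial_push and s not in full]
    return full, partial_remotes
```

When the local artifact is itself partial (`partial=True`), the first list is empty and only remotes that allow partial pushes are contacted.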

buildstream/_artifactcache/cascache.py

@@ -45,7 +45,8 @@ from .. import _yaml
  _MAX_PAYLOAD_BYTES = 1024 * 1024
 -class CASRemoteSpec(namedtuple('CASRemoteSpec', 'url push server_cert client_key client_cert instance_name')):
 +class CASRemoteSpec(namedtuple('CASRemoteSpec',
 +                               'url push partial_push server_cert client_key client_cert instance_name')):
      # _new_from_config_node
+     #
@@ -53,9 +54,13 @@ class CASRemoteSpec(namedtuple('CASRemoteSpec', 'url push server_cert client_key
+     #
      @staticmethod
      def _new_from_config_node(spec_node, basedir=None):
 -        _yaml.node_validate(spec_node, ['url', 'push', 'server-cert', 'client-key', 'client-cert', 'instance_name'])
 +        _yaml.node_validate(spec_node,
 +                            ['url', 'push', 'allow-partial-push', 'server-cert', 'client-key',
 +                             'client-cert', 'instance_name'])
          url = _yaml.node_get(spec_node, str, 'url')
          push = _yaml.node_get(spec_node, bool, 'push', default_value=False)
 +        partial_push = _yaml.node_get(spec_node, bool, 'allow-partial-push', default_value=False)
++
          if not url:
              provenance = _yaml.node_get_provenance(spec_node, 'url')
              raise LoadError(LoadErrorReason.INVALID_DATA,
@@ -85,10 +90,10 @@ class CASRemoteSpec(namedtuple('CASRemoteSpec', 'url push server_cert client_key
              raise LoadError(LoadErrorReason.INVALID_DATA,
                              "{}: 'client-cert' was specified without 'client-key'".format(provenance))
 -        return CASRemoteSpec(url, push, server_cert, client_key, client_cert, instance_name)
 +        return CASRemoteSpec(url, push, partial_push, server_cert, client_key, client_cert, instance_name)
 -CASRemoteSpec.__new__.__defaults__ = (None, None, None, None)
 +CASRemoteSpec.__new__.__defaults__ = (False, None, None, None, None)
  class BlobNotFound(CASError):
@@ -283,34 +288,47 @@ class CASCache():
      #   (bool): True if pull was successful, False if ref was not available
+     #
      def pull(self, ref, remote, *, progress=None, subdir=None, excluded_subdirs=None):
 -        try:
 -            remote.init()
 -            request = buildstream_pb2.GetReferenceRequest(instance_name=remote.spec.instance_name)
 -            request.key = ref
 -            response = remote.ref_storage.GetReference(request)
 +        tree_found = False
 -            tree = remote_execution_pb2.Digest()
 -            tree.hash = response.digest.hash
 -            tree.size_bytes = response.digest.size_bytes
 +        while True:
 +            try:
 +                if not tree_found:
 +                    remote.init()
 -            # Check if the element artifact is present, if so just fetch the subdir.
 -            if subdir and os.path.exists(self.objpath(tree)):
 -                self._fetch_subdir(remote, tree, subdir)
 -            else:
 -                # Fetch artifact, excluded_subdirs determined in pullqueue
 -                self._fetch_directory(remote, tree, excluded_subdirs=excluded_subdirs)
 +                    request = buildstream_pb2.GetReferenceRequest(instance_name=remote.spec.instance_name)
 +                    request.key = ref
 +                    response = remote.ref_storage.GetReference(request)
 -            self.set_ref(ref, tree)
 +                    tree = remote_execution_pb2.Digest()
 +                    tree.hash = response.digest.hash
 +                    tree.size_bytes = response.digest.size_bytes
 -            return True
 -        except grpc.RpcError as e:
 -            if e.code() != grpc.StatusCode.NOT_FOUND:
 -                raise CASError("Failed to pull ref {}: {}".format(ref, e)) from e
 -            else:
 -                return False
 -        except BlobNotFound as e:
 -            return False
 +                # Check if the element artifact is present, if so just fetch the subdir.
 +                if subdir and os.path.exists(self.objpath(tree)):
 +                    self._fetch_subdir(remote, tree, subdir)
 +                else:
 +                    # Fetch artifact, excluded_subdirs determined in pullqueue
 +                    self._fetch_directory(remote, tree, excluded_subdirs=excluded_subdirs)
++
 +                self.set_ref(ref, tree)
++
 +                return True
 +            except grpc.RpcError as e:
 +                if e.code() != grpc.StatusCode.NOT_FOUND:
 +                    raise CASError("Failed to pull ref {}: {}".format(ref, e)) from e
 +                else:
 +                    return False
 +            except BlobNotFound as e:
 +                if not excluded_subdirs and subdir:
 +                    # The remote has the top level digest but could not complete a full pull,
 +                    # attempt partial without the need to initialise and check for the artifact
 +                    # digest. This default behaviour of dropping back to partial pulls could
 +                    # be made a configurable warning given at artifactcache level.
 +                    tree_found = True
 +                    excluded_subdirs, subdir = subdir, excluded_subdirs
 +                else:
 +                    return False
      # pull_tree():
+     #
@@ -355,6 +373,8 @@ class CASCache():
      # Args:
      #     refs (list): The refs to push
      #     remote (CASRemote): The remote to push to
 +    #     subdir (string): Optional specific subdir to include in the push
 +    #     excluded_subdirs (list): The optional list of subdirs to not push
+     #
      # Returns:
      #   (bool): True if any remote was updated, False if no pushes were required
@@ -362,7 +382,7 @@ class CASCache():
      # Raises:
      #   (CASError): if there was an error
+     #
 -    def push(self, refs, remote):
 +    def push(self, refs, remote, *, subdir=None, excluded_subdirs=None):
          skipped_remote = True
          try:
              for ref in refs:
@@ -376,15 +396,18 @@ class CASCache():
                      response = remote.ref_storage.GetReference(request)
                      if response.digest.hash == tree.hash and response.digest.size_bytes == tree.size_bytes:
 -                        # ref is already on the server with the same tree
 -                        continue
 +                        # ref is already on the server with the same tree, however it might be partially cached.
 +                        # If artifact is not set to be pushed partially attempt to 'complete' the remote artifact if
 +                        # needed, else continue.
 +                        if excluded_subdirs or self.verify_digest_on_remote(remote, self._get_subdir(tree, subdir)):
 +                            continue
                  except grpc.RpcError as e:
                      if e.code() != grpc.StatusCode.NOT_FOUND:
                          # Intentionally re-raise RpcError for outer except block.
                          raise
 -                self._send_directory(remote, tree)
 +                self._send_directory(remote, tree, excluded_dir=excluded_subdirs)
                  request = buildstream_pb2.UpdateReferenceRequest(instance_name=remote.spec.instance_name)
                  request.keys.append(ref)
@@ -866,10 +889,17 @@ class CASCache():
                  a += 1
                  b += 1
 -    def _reachable_refs_dir(self, reachable, tree, update_mtime=False):
 +    def _reachable_refs_dir(self, reachable, tree, update_mtime=False, subdir=False):
          if tree.hash in reachable:
              return
 +        # If looping through subdir digests, skip processing if
 +        # ref path does not exist, allowing for partial objects
 +        if subdir and not os.path.exists(self.objpath(tree)):
 +            return
++
 +        # Raises FileNotFound exception if path does not exist,
 +        # which should only be entered on the top level digest
          if update_mtime:
              os.utime(self.objpath(tree))
@@ -886,9 +916,9 @@ class CASCache():
              reachable.add(filenode.digest.hash)
          for dirnode in directory.directories:
 -            self._reachable_refs_dir(reachable, dirnode.digest, update_mtime=update_mtime)
 +            self._reachable_refs_dir(reachable, dirnode.digest, update_mtime=update_mtime, subdir=True)
 -    def _required_blobs(self, directory_digest):
 +    def _required_blobs(self, directory_digest, excluded_dir=None):
          # parse directory, and recursively add blobs
          d = remote_execution_pb2.Digest()
          d.hash = directory_digest.hash
@@ -907,7 +937,8 @@ class CASCache():
              yield d
          for dirnode in directory.directories:
 -            yield from self._required_blobs(dirnode.digest)
 +            if dirnode.name != excluded_dir:
 +                yield from self._required_blobs(dirnode.digest)
      def _fetch_blob(self, remote, digest, stream):
          resource_name_components = ['blobs', digest.hash, str(digest.size_bytes)]
@@ -1029,6 +1060,7 @@ class CASCache():
              objpath = self._ensure_blob(remote, dir_digest)
              directory = remote_execution_pb2.Directory()
++
              with open(objpath, 'rb') as f:
                  directory.ParseFromString(f.read())
@@ -1104,9 +1136,8 @@ class CASCache():
          assert response.committed_size == digest.size_bytes
 -    def _send_directory(self, remote, digest, u_uid=uuid.uuid4()):
 -        required_blobs = self._required_blobs(digest)
+-
 +    def _send_directory(self, remote, digest, u_uid=uuid.uuid4(), excluded_dir=None):
 +        required_blobs = self._required_blobs(digest, excluded_dir=excluded_dir)
          missing_blobs = dict()
          # Limit size of FindMissingBlobs request
          for required_blobs_group in _grouper(required_blobs, 512):
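
The `__defaults__` change to `CASRemoteSpec` in the diff above relies on defaults applying to the rightmost fields of a namedtuple. A small standalone sketch, where `Spec` is a hypothetical stand-in and the URL is a placeholder:

```python
from collections import namedtuple

Spec = namedtuple(
    "Spec",
    "url push partial_push server_cert client_key client_cert instance_name")

# Five defaults cover the last five fields, so url and push stay
# required while partial_push now defaults to False.
Spec.__new__.__defaults__ = (False, None, None, None, None)

spec = Spec("https://cache.example.com", True)
```

Because `partial_push` was inserted before the certificate fields, the defaults tuple had to grow by one leading entry. Otherwise `server_cert` would have lost its default.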

buildstream/_frontend/app.py

@@ -38,7 +38,7 @@ from .._message import Message, MessageType, unconditional_messages
  from .._stream import Stream
  from .._versions import BST_FORMAT_VERSION
  from .. import _yaml
 -from .._scheduler import ElementJob
 +from .._scheduler import ElementJob, JobStatus
  # Import frontend assets
  from . import Profile, LogLine, Status
@@ -515,13 +515,13 @@ class App():
          self._status.add_job(job)
          self._maybe_render_status()
 -    def _job_completed(self, job, success):
 +    def _job_completed(self, job, status):
          self._status.remove_job(job)
          self._maybe_render_status()
          # Dont attempt to handle a failure if the user has already opted to
          # terminate
 -        if not success and not self.stream.terminated:
 +        if status == JobStatus.FAIL and not self.stream.terminated:
              if isinstance(job, ElementJob):
                  element = job.element

buildstream/_scheduler/__init__.py

@@ -26,4 +26,4 @@ from .queues.pushqueue import PushQueue
  from .queues.pullqueue import PullQueue
  from .scheduler import Scheduler, SchedStatus
 -from .jobs import ElementJob
 +from .jobs import ElementJob, JobStatus

buildstream/_scheduler/jobs/__init__.py

@@ -20,3 +20,4 @@
  from .elementjob import ElementJob
  from .cachesizejob import CacheSizeJob
  from .cleanupjob import CleanupJob
 +from .job import JobStatus

buildstream/_scheduler/jobs/cachesizejob.py

@@ -16,7 +16,7 @@
  #  Author:
  #        Tristan Daniël Maat <tristan maat codethink co uk>
+ #
 -from .job import Job
 +from .job import Job, JobStatus
  class CacheSizeJob(Job):
@@ -30,8 +30,8 @@ class CacheSizeJob(Job):
      def child_process(self):
          return self._artifacts.compute_cache_size()
 -    def parent_complete(self, success, result):
 -        if success:
 +    def parent_complete(self, status, result):
 +        if status == JobStatus.OK:
              self._artifacts.set_cache_size(result)
              if self._complete_cb:

buildstream/_scheduler/jobs/cleanupjob.py

@@ -16,7 +16,7 @@
  #  Author:
  #        Tristan Daniël Maat <tristan maat codethink co uk>
+ #
 -from .job import Job
 +from .job import Job, JobStatus
  class CleanupJob(Job):
@@ -29,6 +29,6 @@ class CleanupJob(Job):
      def child_process(self):
          return self._artifacts.clean()
 -    def parent_complete(self, success, result):
 -        if success:
 +    def parent_complete(self, status, result):
 +        if status == JobStatus.OK:
              self._artifacts.set_cache_size(result)

buildstream/_scheduler/jobs/elementjob.py

@@ -60,7 +60,7 @@ from .job import Job
  #     Args:
  #        job (Job): The job object which completed
  #        element (Element): The element passed to the Job() constructor
 -#        success (bool): True if the action_cb did not raise an exception
 +#        status (JobStatus): The status of whether the workload raised an exception
  #        result (object): The deserialized object returned by the `action_cb`, or None
  #                         if `success` is False
+ #
@@ -93,8 +93,8 @@ class ElementJob(Job):
          # Run the action
          return self._action_cb(self._element)
 -    def parent_complete(self, success, result):
 -        self._complete_cb(self, self._element, success, self._result)
 +    def parent_complete(self, status, result):
 +        self._complete_cb(self, self._element, status, self._result)
      def message(self, message_type, message, **kwargs):
          args = dict(kwargs)

buildstream/_scheduler/jobs/job.py

@@ -28,8 +28,6 @@ import traceback
  import asyncio
  import multiprocessing
 -import psutil
+-
  # BuildStream toplevel imports
  from ..._exceptions import ImplError, BstError, set_last_task_error, SkipJob
  from ..._message import Message, MessageType, unconditional_messages
@@ -43,6 +41,22 @@ RC_PERM_FAIL = 2
  RC_SKIPPED = 3
 +# JobStatus:
 +#
 +# The job completion status, passed back through the
 +# complete callbacks.
 +#
 +class JobStatus():
 +    # Job succeeded
 +    OK = 0
++
 +    # A temporary BstError was raised
 +    FAIL = 1
++
 +    # A SkipJob was raised
 +    SKIPPED = 3
++
++
  # Used to distinguish between status messages and return values
  class Envelope():
      def __init__(self, message_type, message):
@@ -118,7 +132,6 @@ class Job():
          self._max_retries = max_retries        # Maximum number of automatic retries
          self._result = None                    # Return value of child action in the parent
          self._tries = 0                        # Try count, for retryable jobs
 -        self._skipped_flag = False             # Indicate whether the job was skipped.
          self._terminated = False               # Whether this job has been explicitly terminated
          # If False, a retry will not be attempted regardless of whether _tries is less than _max_retries.
@@ -215,17 +228,10 @@ class Job():
      # Forcefully kill the process, and any children it might have.
+     #
      def kill(self):
+-
          # Force kill
          self.message(MessageType.WARN,
                       "{} did not terminate gracefully, killing".format(self.action_name))
+-
 -        try:
 -            utils._kill_process_tree(self._process.pid)
 -        # This can happen if the process died of its own accord before
 -        # we try to kill it
 -        except psutil.NoSuchProcess:
 -            return
 +        utils._kill_process_tree(self._process.pid)
      # suspend()
+     #
@@ -282,18 +288,6 @@ class Job():
      def set_task_id(self, task_id):
          self._task_id = task_id
 -    # skipped
 -    #
 -    # This will evaluate to True if the job was skipped
 -    # during processing, or if it was forcefully terminated.
 -    #
 -    # Returns:
 -    #    (bool): Whether the job should appear as skipped
 -    #
 -    @property
 -    def skipped(self):
 -        return self._skipped_flag or self._terminated
+-
      #######################################################
      #                  Abstract Methods                   #
      #######################################################
@@ -304,10 +298,10 @@ class Job():
      # pass the result to the main thread.
+     #
      # Args:
 -    #    success (bool): Whether the job was successful.
 +    #    status (JobStatus): The job exit status
      #    result (any): The result returned by child_process().
+     #
 -    def parent_complete(self, success, result):
 +    def parent_complete(self, status, result):
          raise ImplError("Job '{kind}' does not implement parent_complete()"
                          .format(kind=type(self).__name__))
@@ -571,16 +565,23 @@ class Job():
+         #
          self._retry_flag = returncode == RC_FAIL
 -        # Set the flag to alert Queue that this job skipped.
 -        self._skipped_flag = returncode == RC_SKIPPED
+-
          if self._retry_flag and (self._tries <= self._max_retries) and not self._scheduler.terminated:
              self.spawn()
              return
 -        success = returncode in (RC_OK, RC_SKIPPED)
 -        self.parent_complete(success, self._result)
 -        self._scheduler.job_completed(self, success)
 +        # Resolve the outward facing overall job completion status
 +        #
 +        if returncode == RC_OK:
 +            status = JobStatus.OK
 +        elif returncode == RC_SKIPPED:
 +            status = JobStatus.SKIPPED
 +        elif returncode in (RC_FAIL, RC_PERM_FAIL):
 +            status = JobStatus.FAIL
 +        else:
 +            status = JobStatus.FAIL
++
 +        self.parent_complete(status, self._result)
 +        self._scheduler.job_completed(self, status)
          # Force the deletion of the queue and process objects to try and clean up FDs
          self._queue = self._process = None

buildstream/_scheduler/queues/buildqueue.py

@@ -21,7 +21,7 @@
  from datetime import timedelta
  from . import Queue, QueueStatus
 -from ..jobs import ElementJob
 +from ..jobs import ElementJob, JobStatus
  from ..resources import ResourceType
  from ..._message import MessageType
@@ -104,7 +104,7 @@ class BuildQueue(Queue):
          if artifacts.has_quota_exceeded():
              self._scheduler.check_cache_size()
 -    def done(self, job, element, result, success):
 +    def done(self, job, element, result, status):
          # Inform element in main process that assembly is done
          element._assemble_done()
@@ -117,5 +117,5 @@ class BuildQueue(Queue):
          #        artifact cache size for a successful build even though we know a
          #        failed build also grows the artifact cache size.
+         #
 -        if success:
 +        if status == JobStatus.OK:
              self._check_cache_size(job, element, result)

buildstream/_scheduler/queues/fetchqueue.py

@@ -24,6 +24,7 @@ from ... import Consistency
  # Local imports
  from . import Queue, QueueStatus
  from ..resources import ResourceType
 +from ..jobs import JobStatus
  # A queue which fetches element sources
@@ -66,9 +67,9 @@ class FetchQueue(Queue):
          return QueueStatus.READY
 -    def done(self, _, element, result, success):
 +    def done(self, _, element, result, status):
 -        if not success:
 +        if status == JobStatus.FAIL:
              return
          element._update_state()

buildstream/_scheduler/queues/pullqueue.py

@@ -21,6 +21,7 @@
  # Local imports
  from . import Queue, QueueStatus
  from ..resources import ResourceType
 +from ..jobs import JobStatus
  from ..._exceptions import SkipJob
@@ -54,9 +55,9 @@ class PullQueue(Queue):
          else:
              return QueueStatus.SKIP
 -    def done(self, _, element, result, success):
 +    def done(self, _, element, result, status):
 -        if not success:
 +        if status == JobStatus.FAIL:
              return
          element._pull_done()
@@ -64,4 +65,5 @@ class PullQueue(Queue):
          # Build jobs will check the "approximate" size first. Since we
          # do not get an artifact size from pull jobs, we have to
          # actually check the cache size.
 -        self._scheduler.check_cache_size()
 +        if status == JobStatus.OK:
 +            self._scheduler.check_cache_size()

buildstream/_scheduler/queues/queue.py

@@ -25,7 +25,7 @@ from enum import Enum
  import traceback
  # Local imports
 -from ..jobs import ElementJob
 +from ..jobs import ElementJob, JobStatus
  from ..resources import ResourceType
  # BuildStream toplevel imports
@@ -133,10 +133,9 @@ class Queue():
      #    job (Job): The job which completed processing
      #    element (Element): The element which completed processing
      #    result (any): The return value of the process() implementation
 -    #    success (bool): True if the process() implementation did not
 -    #                    raise any exception
 +    #    status (JobStatus): The return status of the Job
+     #
 -    def done(self, job, element, result, success):
 +    def done(self, job, element, result, status):
          pass
      #####################################################
@@ -291,7 +290,7 @@ class Queue():
+     #
      # See the Job object for an explanation of the call signature
+     #
 -    def _job_done(self, job, element, success, result):
 +    def _job_done(self, job, element, status, result):
          # Update values that need to be synchronized in the main task
          # before calling any queue implementation
@@ -301,7 +300,7 @@ class Queue():
          # and determine if it should be considered as processed
          # or skipped.
          try:
 -            self.done(job, element, result, success)
 +            self.done(job, element, result, status)
          except BstError as e:
              # Report error and mark as failed
@@ -332,12 +331,10 @@ class Queue():
              # All jobs get placed on the done queue for later processing.
              self._done_queue.append(job)
 -            # A Job can be skipped whether or not it has failed,
 -            # we want to only bookkeep them as processed or failed
 -            # if they are not skipped.
 -            if job.skipped:
 +            # These lists are for bookkeeping purposes for the UI and logging.
 +            if status == JobStatus.SKIPPED:
                  self.skipped_elements.append(element)
 -            elif success:
 +            elif status == JobStatus.OK:
                  self.processed_elements.append(element)
              else:
                  self.failed_elements.append(element)
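
The queue changes above replace the boolean `success` argument with a single `JobStatus` value. A side effect is that `if status == JobStatus.FAIL: return` lets skipped jobs continue, where `if not success: return` would not have distinguished them. The following is a minimal sketch of the bookkeeping branch only. The `JobStatus` definition and the `QueueBookkeeping` class are hypothetical stand-ins, and the enum values are assumptions, since only the member names OK, FAIL and SKIPPED appear in this diff.

```python
from enum import Enum


# Hypothetical stand-in for the JobStatus imported from ..jobs in the diff.
# The numeric values are assumptions; only the names appear in the changes.
class JobStatus(Enum):
    OK = 0
    FAIL = 1
    SKIPPED = 2


class QueueBookkeeping:
    """Toy version of the bookkeeping lists updated in Queue._job_done()."""

    def __init__(self):
        self.skipped_elements = []
        self.processed_elements = []
        self.failed_elements = []

    def record(self, element, status):
        # One status value replaces the old (job.skipped, success) pair
        if status == JobStatus.SKIPPED:
            self.skipped_elements.append(element)
        elif status == JobStatus.OK:
            self.processed_elements.append(element)
        else:
            self.failed_elements.append(element)


book = QueueBookkeeping()
book.record("a.bst", JobStatus.OK)
book.record("b.bst", JobStatus.SKIPPED)
book.record("c.bst", JobStatus.FAIL)
```

Each element lands in exactly one list, which is what the UI and logging counters rely on.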

buildstream/_scheduler/queues/trackqueue.py

@@ -24,6 +24,7 @@ from ...plugin import _plugin_lookup
  # Local imports
  from . import Queue, QueueStatus
  from ..resources import ResourceType
 +from ..jobs import JobStatus
  # A queue which tracks sources
@@ -47,9 +48,9 @@ class TrackQueue(Queue):
          return QueueStatus.READY
 -    def done(self, _, element, result, success):
 +    def done(self, _, element, result, status):
 -        if not success:
 +        if status == JobStatus.FAIL:
              return
          # Set the new refs in the main process one by one as they complete

buildstream/_scheduler/scheduler.py

@@ -38,6 +38,16 @@ class SchedStatus():
      TERMINATED = 1
 +# Our _REDUNDANT_EXCLUSIVE_ACTIONS jobs are special ones
 +# which we launch dynamically, they have the property of being
 +# meaningless to queue if one is already queued, and it also
 +# doesn't make sense to run them in parallel
 +#
 +_ACTION_NAME_CLEANUP = 'cleanup'
 +_ACTION_NAME_CACHE_SIZE = 'cache_size'
 +_REDUNDANT_EXCLUSIVE_ACTIONS = [_ACTION_NAME_CLEANUP, _ACTION_NAME_CACHE_SIZE]
++
++
  # Scheduler()
+ #
  # The scheduler operates on a list of queues, each of which is meant to accomplish
@@ -94,6 +104,15 @@ class Scheduler():
          self._suspendtime = None
          self._queue_jobs = True      # Whether we should continue to queue jobs
 +        # Whether our exclusive jobs, like 'cleanup' are currently already
 +        # waiting or active.
 +        #
 +        # This is just a bit quicker than scanning the wait queue and active
 +        # queue and comparing job action names.
 +        #
 +        self._exclusive_waiting = set()
 +        self._exclusive_active = set()
++
          self._resources = Resources(context.sched_builders,
                                      context.sched_fetchers,
                                      context.sched_pushers)
@@ -211,19 +230,6 @@ class Scheduler():
              starttime = timenow
          return timenow - starttime
 -    # schedule_jobs()
 -    #
 -    # Args:
 -    #     jobs ([Job]): A list of jobs to schedule
 -    #
 -    # Schedule 'Job's for the scheduler to run. Jobs scheduled will be
 -    # run as soon any other queueing jobs finish, provided sufficient
 -    # resources are available for them to run
 -    #
 -    def schedule_jobs(self, jobs):
 -        for job in jobs:
 -            self.waiting_jobs.append(job)
+-
      # job_completed():
+     #
      # Called when a Job completes
@@ -231,12 +237,14 @@ class Scheduler():
      # Args:
      #    queue (Queue): The Queue holding a complete job
      #    job (Job): The completed Job
 -    #    success (bool): Whether the Job completed with a success status
 +    #    status (JobStatus): The status of the completed job
+     #
 -    def job_completed(self, job, success):
 +    def job_completed(self, job, status):
          self._resources.clear_job_resources(job)
          self.active_jobs.remove(job)
 -        self._job_complete_callback(job, success)
 +        if job.action_name in _REDUNDANT_EXCLUSIVE_ACTIONS:
 +            self._exclusive_active.remove(job.action_name)
 +        self._job_complete_callback(job, status)
          self._schedule_queue_jobs()
          self._sched()
@@ -246,18 +254,13 @@ class Scheduler():
      # size is calculated, a cleanup job will be run automatically
      # if needed.
+     #
 -    # FIXME: This should ensure that only one cache size job
 -    #        is ever pending at a given time. If a cache size
 -    #        job is already running, it is correct to queue
 -    #        a new one, it is incorrect to have more than one
 -    #        of these jobs pending at a given time, though.
 -    #
      def check_cache_size(self):
 -        job = CacheSizeJob(self, 'cache_size', 'cache_size/cache_size',
 +        job = CacheSizeJob(self, _ACTION_NAME_CACHE_SIZE,
 +                           'cache_size/cache_size',
                             resources=[ResourceType.CACHE,
                                        ResourceType.PROCESS],
                             complete_cb=self._run_cleanup)
 -        self.schedule_jobs([job])
 +        self._schedule_jobs([job])
      #######################################################
      #                  Local Private Methods              #
@@ -276,10 +279,19 @@ class Scheduler():
              if not self._resources.reserve_job_resources(job):
                  continue
 +            # Postpone these jobs if one is already running
 +            if job.action_name in _REDUNDANT_EXCLUSIVE_ACTIONS and \
 +               job.action_name in self._exclusive_active:
 +                continue
++
              job.spawn()
              self.waiting_jobs.remove(job)
              self.active_jobs.append(job)
 +            if job.action_name in _REDUNDANT_EXCLUSIVE_ACTIONS:
 +                self._exclusive_waiting.remove(job.action_name)
 +                self._exclusive_active.add(job.action_name)
++
              if self._job_start_callback:
                  self._job_start_callback(job)
@@ -287,6 +299,33 @@ class Scheduler():
          if not self.active_jobs and not self.waiting_jobs:
              self.loop.stop()
 +    # _schedule_jobs()
 +    #
 +    # The main entry point for jobs to be scheduled.
 +    #
 +    # This is called either as a result of scanning the queues
 +    # in _schedule_queue_jobs(), or directly by the Scheduler
 +    # to insert special jobs like cleanups.
 +    #
 +    # Args:
 +    #     jobs ([Job]): A list of jobs to schedule
 +    #
 +    def _schedule_jobs(self, jobs):
 +        for job in jobs:
++
 +            # Special treatment of our redundant exclusive jobs
 +            #
 +            if job.action_name in _REDUNDANT_EXCLUSIVE_ACTIONS:
++
 +                # Drop the job if one is already queued
 +                if job.action_name in self._exclusive_waiting:
 +                    continue
++
 +                # Mark this action type as queued
 +                self._exclusive_waiting.add(job.action_name)
++
 +            self.waiting_jobs.append(job)
++
      # _schedule_queue_jobs()
+     #
      # Ask the queues what jobs they want to schedule and schedule
@@ -331,7 +370,7 @@ class Scheduler():
              # the next queue and process them.
              process_queues = any(q.dequeue_ready() for q in self.queues)
 -        self.schedule_jobs(ready)
 +        self._schedule_jobs(ready)
          self._sched()
      # _run_cleanup()
@@ -353,11 +392,11 @@ class Scheduler():
          if not artifacts.has_quota_exceeded():
              return
 -        job = CleanupJob(self, 'cleanup', 'cleanup/cleanup',
 +        job = CleanupJob(self, _ACTION_NAME_CLEANUP, 'cleanup/cleanup',
                           resources=[ResourceType.CACHE,
                                      ResourceType.PROCESS],
                           exclusive_resources=[ResourceType.CACHE])
 -        self.schedule_jobs([job])
 +        self._schedule_jobs([job])
      # _suspend_jobs()
+     #
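
The scheduler hunks above combine two rules for the 'cleanup' and 'cache_size' jobs: drop a job if one of the same action is already waiting, and postpone a waiting job while one of the same action is active. The sketch below reproduces only that logic. `MiniScheduler` and `FakeJob` are hypothetical names, and resource reservation, spawning and callbacks are left out.

```python
# Action names mirror the diff; everything else here is illustrative.
_REDUNDANT_EXCLUSIVE_ACTIONS = ['cleanup', 'cache_size']


class FakeJob:
    def __init__(self, action_name):
        self.action_name = action_name


class MiniScheduler:
    def __init__(self):
        self.waiting_jobs = []
        self.active_jobs = []
        self._exclusive_waiting = set()
        self._exclusive_active = set()

    def schedule_jobs(self, jobs):
        for job in jobs:
            if job.action_name in _REDUNDANT_EXCLUSIVE_ACTIONS:
                # Drop the job if one is already queued
                if job.action_name in self._exclusive_waiting:
                    continue
                self._exclusive_waiting.add(job.action_name)
            self.waiting_jobs.append(job)

    def sched(self):
        # Iterate over a copy because jobs are removed while looping
        for job in list(self.waiting_jobs):
            # Postpone these jobs if one is already running
            if job.action_name in _REDUNDANT_EXCLUSIVE_ACTIONS and \
               job.action_name in self._exclusive_active:
                continue
            self.waiting_jobs.remove(job)
            self.active_jobs.append(job)
            if job.action_name in _REDUNDANT_EXCLUSIVE_ACTIONS:
                self._exclusive_waiting.remove(job.action_name)
                self._exclusive_active.add(job.action_name)

    def job_completed(self, job):
        self.active_jobs.remove(job)
        if job.action_name in _REDUNDANT_EXCLUSIVE_ACTIONS:
            self._exclusive_active.remove(job.action_name)
        self.sched()


sched = MiniScheduler()
first, duplicate = FakeJob('cleanup'), FakeJob('cleanup')
sched.schedule_jobs([first, duplicate])    # duplicate is dropped
queued_after_duplicate = list(sched.waiting_jobs)
sched.sched()                              # first becomes active
later = FakeJob('cleanup')
sched.schedule_jobs([later])               # accepted: none is waiting now
sched.sched()                              # postponed: one is still active
postponed = later in sched.waiting_jobs
sched.job_completed(first)                 # frees the slot, later starts
```

Keeping two sets of action names avoids scanning both job lists on every scheduling pass, which is the rationale given in the comment added to `Scheduler.__init__()`.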

buildstream/element.py

@@ -65,7 +65,7 @@ Miscellaneous abstract methods also exist:
  * :func:`Element.generate_script() <buildstream.element.Element.generate_script>`
 -  For the purpose of ``bst source bundle``, an Element may optionally implement this.
 +  For the purpose of ``bst source checkout --include-build-scripts``, an Element may optionally implement this.
  Class Reference
@@ -1800,13 +1800,19 @@ class Element(Plugin):
      #   (bool): True if this element does not need a push job to be created
+     #
      def _skip_push(self):
++
          if not self.__artifacts.has_push_remotes(element=self):
              # No push remotes for this element's project
              return True
          # Do not push elements that aren't cached, or that are cached with a dangling buildtree
 -        # artifact unless element type is expected to have an an empty buildtree directory
 -        if not self._cached_buildtree():
 +        # artifact unless element type is expected to have an empty buildtree directory. Check
 +        # that this default behaviour is not overridden via a remote configured to allow pushing
 +        # artifacts without their corresponding buildtree.
 +        if not self._cached():
 +            return True
++
 +        if not self._cached_buildtree() and not self.__artifacts.has_partial_push_remotes(element=self):
              return True
          # Do not push tainted artifact
@@ -1817,7 +1823,8 @@ class Element(Plugin):
      # _push():
+     #
 -    # Push locally cached artifact to remote artifact repository.
 +    # Push locally cached artifact to remote artifact repository. An attempt
 +    # will be made to push partial artifacts given current config
+     #
      # Returns:
      #   (bool): True if the remote was updated, False if it already existed
@@ -1830,8 +1837,19 @@ class Element(Plugin):
              self.warn("Not pushing tainted artifact.")
              return False
 -        # Push all keys used for local commit
 -        pushed = self.__artifacts.push(self, self.__get_cache_keys_for_commit())
 +        # Push all keys used for local commit, this could be full or partial,
 +        # given previous _skip_push() logic. If buildtree isn't cached, then
 +        # set partial push
++
 +        partial = False
 +        subdir = 'buildtree'
 +        if not self._cached_buildtree():
 +            partial = True
++
 +        pushed = self.__artifacts.push(self, self.__get_cache_keys_for_commit(), partial=partial, subdir=subdir)
++
 +        # Artifact might be cached in the server partially with the top level ref existing.
 +        # Check if we need to attempt a push of a locally cached buildtree given current config
          if not pushed:
              return False
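
The decision made by the new `_skip_push()` and `_push()` code can be read as two small predicates. The sketch below restates them as free functions over plain booleans. The function and parameter names are hypothetical, and the tainted-artifact check is omitted because the hunk does not show its body.

```python
def should_skip_push(has_push_remotes, cached, cached_buildtree,
                     has_partial_push_remotes):
    """Return True when no push job is needed (tainted check not modelled)."""
    if not has_push_remotes:
        # No push remotes for this element's project
        return True
    if not cached:
        # Nothing is cached locally, so there is nothing to push
        return True
    if not cached_buildtree and not has_partial_push_remotes:
        # Buildtree is missing and no remote accepts partial artifacts
        return True
    return False


def is_partial_push(cached_buildtree):
    """A push is partial exactly when the buildtree is not cached."""
    return not cached_buildtree


# Artifact pulled without its buildtree, one remote allows partial pushes
skip = should_skip_push(True, True, False, True)
partial = is_partial_push(False)
```

With a remote that allows partial pushes, an artifact cached without its buildtree is no longer skipped and is pushed as a partial artifact.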

buildstream/sandbox/sandbox.py

@@ -592,7 +592,7 @@ class _SandboxBatch():
          if command.label:
              context = self.sandbox._get_context()
              message = Message(self.sandbox._get_plugin_id(), MessageType.STATUS,
 -                              'Running {}'.format(command.label))
 +                              'Running command', detail=command.label)
              context.message(message)
          exitcode = self.sandbox._run(command.command, self.flags, cwd=command.cwd, env=command.env)

buildstream/utils.py

@@ -1050,6 +1050,11 @@ def _kill_process_tree(pid):
              # Ignore this error, it can happen with
              # some setuid bwrap processes.
              pass
 +        except psutil.NoSuchProcess:
 +            # It is certain that this has already been sent
 +            # SIGTERM, so there is a window where the process
 +            # could have exited already.
 +            pass
      # Bloody Murder
      for child in children:
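
The race handled by the new `except psutil.NoSuchProcess` clause can be shown without psutil. The sketch below is a standard-library analogue, not the code from `utils.py`: it uses `os.kill()` and `ProcessLookupError` in place of psutil, assumes a POSIX system, and the helper name `send_signal_quietly` is hypothetical.

```python
import os
import subprocess
import sys


def send_signal_quietly(pid, sig):
    """Send sig to pid; return False if the process has already exited."""
    try:
        os.kill(pid, sig)
    except ProcessLookupError:
        # Stdlib counterpart of psutil.NoSuchProcess: the process exited
        # between the earlier SIGTERM and this call, which is harmless.
        return False
    return True


# Start a child that exits immediately, then reap it so its pid is gone
child = subprocess.Popen([sys.executable, "-c", "pass"])
child.wait()

# Signal 0 only probes for existence and delivers nothing, which keeps
# the demonstration safe.
delivered = send_signal_quietly(child.pid, 0)
```

The point is the same as in the diff: once a termination signal has been sent, a missing process is an expected outcome rather than an error.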

tests/integration/pushbuildtrees.py

 +import os
 +import shutil
 +import pytest
++
 +from tests.testutils import cli_integration as cli, create_artifact_share
 +from tests.testutils.integration import assert_contains
 +from tests.testutils.site import HAVE_BWRAP, IS_LINUX
 +from buildstream._exceptions import ErrorDomain, LoadErrorReason
++
++
 +DATA_DIR = os.path.join(
 +    os.path.dirname(os.path.realpath(__file__)),
 +    "project"
 +)
++
++
 +# Remove artifact cache & set cli.config value of pull-buildtrees
 +# to false, which is the default user context. The cache has to be
 +# cleared as just forcefully removing the refpath leaves dangling objects.
 +def default_state(cli, tmpdir, share):
 +    shutil.rmtree(os.path.join(str(tmpdir), 'artifacts'))
 +    cli.configure({
 +        'artifacts': {'url': share.repo, 'push': False},
 +        'artifactdir': os.path.join(str(tmpdir), 'artifacts'),
 +        'cache': {'pull-buildtrees': False},
 +    })
++
++
 +# Tests to capture the integration of the optional push of buildtrees.
 +# The behaviour should encompass pushing artifacts that are already cached
 +# without a buildtree as well as artifacts that are cached with their buildtree.
 +# This option is handled via 'allow-partial-push' on a per artifact remote config
 +# node basis. Multiple remote config nodes can point to the same url and as such can
 +# have different 'allow-partial-push' options, tests need to cover this using project
 +# confs.
 +@pytest.mark.integration
 +@pytest.mark.datafiles(DATA_DIR)
 +@pytest.mark.skipif(IS_LINUX and not HAVE_BWRAP, reason='Only available with bubblewrap on Linux')
 +def test_pushbuildtrees(cli, tmpdir, datafiles, integration_cache):
 +    project = os.path.join(datafiles.dirname, datafiles.basename)
 +    element_name = 'autotools/amhello.bst'
++
 +    # Create artifact shares for pull & push testing
 +    with create_artifact_share(os.path.join(str(tmpdir), 'share1')) as share1,\
 +        create_artifact_share(os.path.join(str(tmpdir), 'share2')) as share2,\
 +        create_artifact_share(os.path.join(str(tmpdir), 'share3')) as share3,\
 +        create_artifact_share(os.path.join(str(tmpdir), 'share4')) as share4:
++
 +        cli.configure({
 +            'artifacts': {'url': share1.repo, 'push': True},
 +            'artifactdir': os.path.join(str(tmpdir), 'artifacts')
 +        })
++
 +        cli.configure({'artifacts': [{'url': share1.repo, 'push': True},
 +                                     {'url': share2.repo, 'push': True, 'allow-partial-push': True}]})
++
 +        # Build autotools element, check it was pushed, delete local.
 +        # As share 2 has push & allow-partial-push set to true, it
 +        # should have pushed the artifacts, without the cached buildtrees,
 +        # to it.
 +        result = cli.run(project=project, args=['build', element_name])
 +        assert result.exit_code == 0
 +        assert cli.get_element_state(project, element_name) == 'cached'
 +        elementdigest = share1.has_artifact('test', element_name, cli.get_element_key(project, element_name))
 +        buildtreedir = os.path.join(str(tmpdir), 'artifacts', 'extract', 'test', 'autotools-amhello',
 +                                    elementdigest.hash, 'buildtree')
 +        assert os.path.isdir(buildtreedir)
 +        assert element_name in result.get_partial_pushed_elements()
 +        assert element_name in result.get_pushed_elements()
 +        assert share1.has_artifact('test', element_name, cli.get_element_key(project, element_name))
 +        assert share2.has_artifact('test', element_name, cli.get_element_key(project, element_name))
 +        default_state(cli, tmpdir, share1)
++
 +        # Check that after explicitly pulling an artifact without its buildtree,
 +        # we can push it to another remote that is configured to accept the partial
 +        # artifact
 +        result = cli.run(project=project, args=['pull', element_name])
 +        assert element_name in result.get_pulled_elements()
 +        cli.configure({'artifacts': {'url': share3.repo, 'push': True, 'allow-partial-push': True}})
 +        assert cli.get_element_state(project, element_name) == 'cached'
 +        assert not os.path.isdir(buildtreedir)
 +        result = cli.run(project=project, args=['push', element_name])
 +        assert result.exit_code == 0
 +        assert element_name in result.get_partial_pushed_elements()
 +        assert element_name not in result.get_pushed_elements()
 +        assert share3.has_artifact('test', element_name, cli.get_element_key(project, element_name))
 +        default_state(cli, tmpdir, share3)
++
 +        # Delete the local cache and pull the partial artifact from share 3,
 +        # this should not include the buildtree when extracted locally, even when
 +        # pull-buildtrees is given as a cli parameter as no available remotes will
 +        # contain the buildtree
 +        assert not os.path.isdir(buildtreedir)
 +        assert cli.get_element_state(project, element_name) != 'cached'
 +        result = cli.run(project=project, args=['--pull-buildtrees', 'pull', element_name])
 +        assert element_name in result.get_partial_pulled_elements()
 +        assert not os.path.isdir(buildtreedir)
 +        default_state(cli, tmpdir, share3)
++
 +        # Delete the local cache and attempt to pull a 'full' artifact, including its
 +        # buildtree. As with before share3 being the first listed remote will not have
 +        # the buildtree available and should spawn a partial pull. Having share1 as the
 +        # second available remote should allow the buildtree to be pulled thus 'completing'
 +        # the artifact
 +        cli.configure({'artifacts': [{'url': share3.repo, 'push': True, 'allow-partial-push': True},
 +                                     {'url': share1.repo, 'push': True}]})
 +        assert cli.get_element_state(project, element_name) != 'cached'
 +        result = cli.run(project=project, args=['--pull-buildtrees', 'pull', element_name])
 +        assert element_name in result.get_partial_pulled_elements()
 +        assert element_name in result.get_pulled_elements()
 +        assert "Attempting to retrieve buildtree from remotes" in result.stderr
 +        assert os.path.isdir(buildtreedir)
 +        assert cli.get_element_state(project, element_name) == 'cached'
++
 +        # Test that we are able to 'complete' an artifact on a server which is cached partially,
 +        # but has now been configured for full artifact pushing. This should require only pushing
 +        # the missing blobs, which should be those of just the buildtree. In this case changing
 +        # share3 to full pushes should exercise this
 +        cli.configure({'artifacts': {'url': share3.repo, 'push': True}})
 +        result = cli.run(project=project, args=['push', element_name])
 +        assert element_name in result.get_pushed_elements()

tests/testutils/runcli.py

@@ -191,6 +191,13 @@ class Result():
          return list(pushed)
 +    def get_partial_pushed_elements(self):
 +        pushed = re.findall(r'\[\s*push:(\S+)\s*\]\s*INFO\s*Pushed partial artifact', self.stderr)
 +        if pushed is None:
 +            return []
++
 +        return list(pushed)
++
      def get_pulled_elements(self):
          pulled = re.findall(r'\[\s*pull:(\S+)\s*\]\s*INFO\s*Pulled artifact', self.stderr)
          if pulled is None:
@@ -198,6 +205,13 @@ class Result():
          return list(pulled)
 +    def get_partial_pulled_elements(self):
 +        pulled = re.findall(r'\[\s*pull:(\S+)\s*\]\s*INFO\s*Pulled partial artifact', self.stderr)
 +        if pulled is None:
 +            return []
++
 +        return list(pulled)
++
  class Cli():
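
The two helpers added to `Result` scrape element names out of the captured stderr with `re.findall()`. The sketch below applies the same pattern to a made-up log excerpt. The sample lines and the trailing key text are assumptions for illustration, not real BuildStream output. Note that `re.findall()` returns an empty list when nothing matches, so the `is None` branch in the helpers is never taken.

```python
import re

# Pattern copied from get_partial_pushed_elements() in the diff
_PARTIAL_PUSH = r'\[\s*push:(\S+)\s*\]\s*INFO\s*Pushed partial artifact'


def get_partial_pushed_elements(stderr):
    # findall() yields the captured element names, or [] when none match
    return re.findall(_PARTIAL_PUSH, stderr)


# Hypothetical log excerpt: one partial push and one full push
sample_stderr = (
    "[ push:autotools/amhello.bst ] INFO    Pushed partial artifact abc123\n"
    "[ push:base/alpine.bst ] INFO    Pushed artifact def456\n"
)

partial = get_partial_pushed_elements(sample_stderr)
nothing = get_partial_pushed_elements("")
```

Only the line containing "Pushed partial artifact" contributes a name, which is why the test can assert that an element appears in one list and not in the other.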
