[Notes] [Git][BuildStream/buildstream][danielsilverstone-ct/further-optimisations] 12 commits: contributing: WIP again to be kind to reviewers

Daniel Silverstone pushed to branch danielsilverstone-ct/further-optimisations at BuildStream / buildstream

Commits:

7 changed files:

Changes:

  • CONTRIBUTING.rst
@@ -97,7 +97,13 @@ a new merge request. You can also `create a merge request for an existing branch
 You may open merge requests for the branches you create before you are ready
 to have them reviewed and considered for inclusion if you like. Until your merge
 request is ready for review, the merge request title must be prefixed with the
-``WIP:`` identifier.
+``WIP:`` identifier. GitLab `treats this specially
+<https://docs.gitlab.com/ee/user/project/merge_requests/work_in_progress_merge_requests.html>`_,
+which helps reviewers.
+
+Consider marking a merge request as WIP again if you are taking a while to
+address a review point. This signals that the next action is on you, and it
+won't appear in a reviewer's search for non-WIP merge requests to review.
 
 
 Organized commits
@@ -122,6 +128,12 @@ If a commit in your branch modifies behavior such that a test must also
 be changed to match the new behavior, then the tests should be updated
 with the same commit, so that every commit passes its own tests.
 
+These principles apply whenever a branch is non-WIP. So for example, don't push
+'fixup!' commits when addressing review comments, instead amend the commits
+directly before pushing. GitLab has `good support
+<https://docs.gitlab.com/ee/user/project/merge_requests/versions.html>`_ for
+diffing between pushes, so 'fixup!' commits are not necessary for reviewers.
+
 
 Commit messages
 ~~~~~~~~~~~~~~~
@@ -144,6 +156,16 @@ number must be referenced in the commit message.
 
   Fixes #123
 
+Note that the 'why' of a change is as important as the 'what'.
+
+When reviewing this, folks can suggest better alternatives when they know the
+'why'. Perhaps there are other ways to avoid an error when things are not
+frobnicated.
+
+When folks modify this code, there may be uncertainty around whether the foos
+should always be frobnicated. The comments, the commit message, and issue #123
+should shed some light on that.
+
 In the case that you have a commit which necessarily modifies multiple
 components, then the summary line should still mention generally what
 changed (if possible), followed by a colon and a brief summary.

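To make the 'why' guidance above concrete, here is a hypothetical commit
message in that spirit (the component, description, and issue number are
purely illustrative):

    element.py: Frobnicate foos before assembly

    Unfrobnicated foos can make assembly raise an unnecessary error.
    Frobnicating them first avoids the error; if there is a better way to
    handle unfrobnicated foos, the alternatives are discussed in the issue.

    Fixes #123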
  • buildstream/_artifactcache/artifactcache.py
@@ -937,15 +937,22 @@ class ArtifactCache():
                             "Invalid cache quota ({}): ".format(utils._pretty_size(cache_quota)) +
                             "BuildStream requires a minimum cache quota of 2G.")
        elif cache_quota > cache_size + available_space:  # Check maximum
+            if '%' in self.context.config_cache_quota:
+                available = (available_space / (stat.f_blocks * stat.f_bsize)) * 100
+                available = '{}% of total disk space'.format(round(available, 1))
+            else:
+                available = utils._pretty_size(available_space)
+
            raise LoadError(LoadErrorReason.INVALID_DATA,
                            ("Your system does not have enough available " +
                             "space to support the cache quota specified.\n" +
-                             "You currently have:\n" +
-                             "- {used} of cache in use at {local_cache_path}\n" +
-                             "- {available} of available system storage").format(
-                                 used=utils._pretty_size(cache_size),
-                                 local_cache_path=self.context.artifactdir,
-                                 available=utils._pretty_size(available_space)))
+                             "\nYou have specified a quota of {quota} total disk space.\n" +
+                             "- The filesystem containing {local_cache_path} only " +
+                             "has: {available_size} available.")
+                            .format(
+                                quota=self.context.config_cache_quota,
+                                local_cache_path=self.context.artifactdir,
+                                available_size=available))
 
        # Place a slight headroom (2e9 (2GB) on the cache_quota) into
        # cache_quota to try and avoid exceptions.

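The percentage branch above reports availability as a share of the whole
filesystem rather than as an absolute size. A minimal standalone sketch of
that calculation, assuming POSIX os.statvfs() semantics (the home directory
stands in for the cache path here, which is an assumption for illustration):

    import os

    def available_as_percentage(path=os.path.expanduser('~')):
        # f_bavail * f_bsize: bytes still available to unprivileged users;
        # f_blocks * f_bsize: total size of the filesystem containing 'path'.
        stat = os.statvfs(path)
        available_space = stat.f_bavail * stat.f_bsize
        total_size = stat.f_blocks * stat.f_bsize
        return '{}% of total disk space'.format(round(available_space / total_size * 100, 1))

    print(available_as_percentage())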
  • buildstream/_cachekey.py
@@ -40,3 +40,20 @@ def generate_key(value):
    ordered = _yaml.node_sanitize(value)
    string = pickle.dumps(ordered)
    return hashlib.sha256(string).hexdigest()
+
+
+# generate_key_pre_sanitized()
+#
+# Generate an sha256 hex digest from the given value. The value
+# must be (a) compatible with generate_key() and (b) already have
+# been passed through _yaml.node_sanitize()
+#
+# Args:
+#    value: A sanitized value to get a key for
+#
+# Returns:
+#    (str): An sha256 hex digest of the given value
+#
+def generate_key_pre_sanitized(value):
+    string = pickle.dumps(value)
+    return hashlib.sha256(string).hexdigest()

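The point of generate_key_pre_sanitized() is that callers who have already
paid for _yaml.node_sanitize() need not pay for it again. A self-contained
sketch of that contract, with a toy sanitize() standing in for
_yaml.node_sanitize():

    import hashlib
    import pickle
    from collections import OrderedDict

    def sanitize(value):
        # Toy stand-in for _yaml.node_sanitize(): order mapping keys so
        # that pickling gives a stable byte stream.
        if isinstance(value, dict):
            return OrderedDict((key, sanitize(value[key])) for key in sorted(value))
        if isinstance(value, list):
            return [sanitize(element) for element in value]
        return value

    def generate_key(value):
        return hashlib.sha256(pickle.dumps(sanitize(value))).hexdigest()

    def generate_key_pre_sanitized(value):
        return hashlib.sha256(pickle.dumps(value)).hexdigest()

    node = {'kind': 'autotools', 'config': {'b': 1, 'a': 2}}
    pre_sanitized = sanitize(node)

    # Sanitizing once up front yields the same digest as sanitizing inside.
    assert generate_key(node) == generate_key_pre_sanitized(pre_sanitized)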
  • buildstream/_yaml.py
@@ -363,8 +363,8 @@ _sentinel = object()
 #
 def node_get(node, expected_type, key, indices=None, default_value=_sentinel):
     value = node.get(key, default_value)
-    provenance = node_get_provenance(node)
     if value is _sentinel:
+        provenance = node_get_provenance(node)
         raise LoadError(LoadErrorReason.INVALID_DATA,
                         "{}: Dictionary did not contain expected key '{}'".format(provenance, key))
 
@@ -914,9 +914,20 @@ RoundTripRepresenter.add_representer(SanitizedDict,
 # Only dicts are ordered, list elements are left in order.
 #
 def node_sanitize(node):
+    # Short-circuit None which occurs ca. twice per element
+    if node is None:
+        return node
+
+    node_type = type(node)
+    # Next short-circuit integers, floats, strings, booleans, and tuples
+    if node_type in (int, float, str, bool, tuple):
+        return node
+    # Now short-circuit lists
+    elif node_type is list:
+        return [node_sanitize(elt) for elt in node]
 
-    if isinstance(node, collections.Mapping):
-
+    # Finally ChainMap and dict, and other Mappings need special handling
+    if node_type in (dict, ChainMap) or isinstance(node, collections.Mapping):
         result = SanitizedDict()
 
         key_list = [key for key, _ in node_items(node)]
@@ -924,10 +935,10 @@ def node_sanitize(node):
             result[key] = node_sanitize(node[key])
 
         return result
-
     elif isinstance(node, list):
         return [node_sanitize(elt) for elt in node]
 
+    # Everything else (such as commented scalars) just gets returned as-is.
    return node
 
 
@@ -1055,16 +1066,52 @@ class ChainMap(collections.ChainMap):
         except KeyError:
             return default
 
+# Node copying
+#
+# Unfortunately we copy nodes a *lot* and `isinstance()` is super-slow when
+# things from collections.abc get involved.  The result is the following
+# intricate but substantially faster group of tuples and the use of `in`.
+#
+# If any of the {node,list}_{chain_,}_copy routines raise a ValueError
+# then it's likely additional types need adding to these tuples.
+
+# String types are directly copied in a lot of places
+__string_types = (str,
+                  yaml.scalarstring.PreservedScalarString,
+                  yaml.scalarstring.SingleQuotedScalarString,
+                  yaml.scalarstring.DoubleQuotedScalarString)
+
+# When chaining a copy, these types are skipped since the ChainMap will
+# retrieve them from the source node when needed.
+__chain_skipped_types = (str, bool,
+                         yaml.scalarstring.PreservedScalarString,
+                         yaml.scalarstring.SingleQuotedScalarString,
+                         yaml.scalarstring.DoubleQuotedScalarString)
+
+# These types have to be iterated like a dictionary
+__dict_types = (dict, ChainMap, yaml.comments.CommentedMap)
+
+# These types have to be iterated like a list
+__list_types = (list, yaml.comments.CommentedSeq)
+
+# These are the provenance types, which have to be cloned rather than any other
+# copying tactic.
+__provenance_types = (Provenance, DictProvenance, MemberProvenance, ElementProvenance)
 
 def node_chain_copy(source):
     copy = ChainMap({}, source)
     for key, value in source.items():
-        if isinstance(value, collections.Mapping):
+        value_type = type(value)
+        if value_type in __dict_types:
             copy[key] = node_chain_copy(value)
-        elif isinstance(value, list):
+        elif value_type in __list_types:
             copy[key] = list_chain_copy(value)
-        elif isinstance(value, Provenance):
+        elif value_type in __provenance_types:
             copy[key] = value.clone()
+        elif value_type in __chain_skipped_types:
+            pass  # No need to copy these, the chainmap deals with it
+        else:
+            raise ValueError("Unable to be quick about node_chain_copy of {}".format(value_type))
 
     return copy
 
@@ -1072,14 +1119,17 @@ def node_chain_copy(source):
 def list_chain_copy(source):
     copy = []
     for item in source:
-        if isinstance(item, collections.Mapping):
+        item_type = type(item)
+        if item_type in __dict_types:
             copy.append(node_chain_copy(item))
-        elif isinstance(item, list):
+        elif item_type in __list_types:
             copy.append(list_chain_copy(item))
-        elif isinstance(item, Provenance):
+        elif item_type in __provenance_types:
             copy.append(item.clone())
-        else:
+        elif item_type in __string_types:
             copy.append(item)
+        else:  # Fallback
+            raise ValueError("Unable to be quick about list_chain_copy of {}".format(item_type))
 
     return copy
 
@@ -1087,14 +1137,17 @@ def list_chain_copy(source):
 def node_copy(source):
     copy = {}
     for key, value in source.items():
-        if isinstance(value, collections.Mapping):
+        value_type = type(value)
+        if value_type in __dict_types:
             copy[key] = node_copy(value)
-        elif isinstance(value, list):
+        elif value_type in __list_types:
             copy[key] = list_copy(value)
-        elif isinstance(value, Provenance):
+        elif value_type in __provenance_types:
             copy[key] = value.clone()
-        else:
+        elif value_type in __string_types:
             copy[key] = value
+        else:
+            raise ValueError("Unable to be quick about node_copy of {}".format(value_type))
 
     ensure_provenance(copy)
 
@@ -1104,14 +1157,17 @@ def node_copy(source):
 def list_copy(source):
     copy = []
     for item in source:
-        if isinstance(item, collections.Mapping):
+        item_type = type(item)
+        if item_type in __dict_types:
             copy.append(node_copy(item))
-        elif isinstance(item, list):
+        elif item_type in __list_types:
             copy.append(list_copy(item))
-        elif isinstance(item, Provenance):
+        elif item_type in __provenance_types:
             copy.append(item.clone())
-        else:
+        elif item_type in __string_types:
             copy.append(item)
+        else:
+            raise ValueError("Unable to be quick about list_copy of {}".format(item_type))
 
     return copy
 

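The comment block above claims that isinstance() against collections.abc
ABCs is slow compared with an exact type() membership test. A quick,
standalone way to check that claim (timings vary by machine and Python
version; collections.abc.Mapping is the modern spelling of the
collections.Mapping used in the old code):

    import timeit
    from collections.abc import Mapping

    node = {'key': 'value'}

    # One million checks each way, against a plain dict.
    abc_check = timeit.timeit(lambda: isinstance(node, Mapping), number=1000000)
    exact_check = timeit.timeit(lambda: type(node) in (dict,), number=1000000)

    print('isinstance() against Mapping ABC: {:.3f}s'.format(abc_check))
    print('type() in tuple of exact types:   {:.3f}s'.format(exact_check))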
  • buildstream/element.py
@@ -2053,11 +2053,14 @@ class Element(Plugin):
             }
 
             self.__cache_key_dict['fatal-warnings'] = sorted(project._fatal_warnings)
+            self.__cache_key_dict['dependencies'] = []
+            self.__cache_key_dict = _yaml.node_sanitize(self.__cache_key_dict)
 
-        cache_key_dict = self.__cache_key_dict.copy()
-        cache_key_dict['dependencies'] = dependencies
+        # This replacement is safe since OrderedDict replaces the value,
+        # leaving its location in the dictionary alone.
+        self.__cache_key_dict['dependencies'] = _yaml.node_sanitize(dependencies)
 
-        return _cachekey.generate_key(cache_key_dict)
+        return _cachekey.generate_key_pre_sanitized(self.__cache_key_dict)
 
     # __can_build_incrementally()
     #

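The element.py change ties the earlier pieces together: the stable part of
the cache-key dictionary is sanitized once and cached, and only the per-call
'dependencies' value is refreshed before hashing with
generate_key_pre_sanitized(). A self-contained sketch of that pattern (the
names are illustrative, not BuildStream's internals; a plain dict, ordered
since Python 3.7, stands in for the sanitized OrderedDict):

    import hashlib
    import pickle

    class Keyed:
        def __init__(self, static_data):
            self._static_data = static_data
            self._key_dict = None

        def key_for(self, dependencies):
            if self._key_dict is None:
                # One-time setup: in the real code this is where
                # _yaml.node_sanitize() is paid exactly once.
                self._key_dict = dict(self._static_data, dependencies=[])
            # Replacing a value keeps its slot in the ordered dict, so the
            # pickled bytes (and hence the digest) depend only on the
            # values, not on when 'dependencies' was last assigned.
            self._key_dict['dependencies'] = list(dependencies)
            return hashlib.sha256(pickle.dumps(self._key_dict)).hexdigest()

    element = Keyed({'kind': 'autotools', 'fatal-warnings': []})
    print(element.key_for(['sha-of-dep-1', 'sha-of-dep-2']))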
  • doc/source/using_config.rst
@@ -147,6 +147,44 @@ The default mirror is defined by its name, e.g.
    ``--default-mirror`` command-line option.
 
 
+Local cache expiry
+~~~~~~~~~~~~~~~~~~
+BuildStream locally caches artifacts, build trees, log files and sources within a
+cache located at ``~/.cache/buildstream`` (unless a $XDG_CACHE_HOME environment
+variable exists). When building large projects, this cache can get very large,
+thus BuildStream will attempt to clean up the cache automatically by expiring the least
+recently *used* artifacts.
+
+By default, cache expiry will begin once the file system which contains the cache
+approaches maximum usage. However, it is also possible to impose a quota on the local
+cache in the user configuration. This can be done in two ways:
+
+1. By restricting the maximum size of the cache directory itself.
+
+For example, to ensure that BuildStream's cache does not grow beyond 100 GB,
+simply declare the following in your user configuration (``~/.config/buildstream.conf``):
+
+.. code:: yaml
+
+  cache:
+    quota: 100G
+
+This quota defines the maximum size of the artifact cache in bytes.
+Other accepted values are: K, M, G or T (or you can simply declare the value in bytes, without the suffix).
+This uses the same format as systemd's
+`resource-control <https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html>`_.
+
+2. By expiring artifacts once the file system which contains the cache exceeds a specified usage.
+
+To ensure that we start cleaning the cache once we've used 80% of local disk space (on the file system
+which mounts the cache):
+
+.. code:: yaml
+
+  cache:
+    quota: 80%
+
+
 Default configuration
 ---------------------
 The default BuildStream configuration is specified here for reference:

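For readers who want to reason about what '100G' or '80%' means in bytes,
here is an illustrative normalisation of the two quota forms documented
above. This is a sketch, not BuildStream's actual parser, and the home
directory stands in for the cache path:

    import os

    _UNITS = {'K': 2 ** 10, 'M': 2 ** 20, 'G': 2 ** 30, 'T': 2 ** 40}

    def quota_in_bytes(quota, path=os.path.expanduser('~')):
        quota = quota.strip()
        if quota.endswith('%'):
            # Percentage of the filesystem that holds the cache.
            stat = os.statvfs(path)
            total_size = stat.f_blocks * stat.f_bsize
            return int(total_size * float(quota[:-1]) / 100)
        if quota[-1] in _UNITS:
            return int(quota[:-1]) * _UNITS[quota[-1]]
        return int(quota)  # plain byte count, no suffix

    print(quota_in_bytes('100G'))  # 107374182400
    print(quota_in_bytes('80%'))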
  • tests/utils/misc.py
@@ -27,4 +27,5 @@ def test_parse_size_over_1024T(cli, tmpdir):
     patched_statvfs = mock_os.mock_statvfs(f_bavail=bavail, f_bsize=BLOCK_SIZE)
     with mock_os.monkey_patch("statvfs", patched_statvfs):
         result = cli.run(project, args=["build", "file.bst"])
-        assert "1025T of available system storage" in result.stderr
+        failure_msg = 'Your system does not have enough available space to support the cache quota specified.'
+        assert failure_msg in result.stderr


