Re: [BuildStream] buildbox-casd timeout issues during BuildStream startup



Hi,

It seems we are generally in agreement, so I've created
https://gitlab.com/BuildGrid/buildbox/buildbox-casd/-/issues/59 to track the
issue and discuss implementation details.

To repeat what I wrote in the issue, this is how I think it would work:

* If the metadata file is there, casd trusts it and assumes the disk usage is
what's written in the file.
* If the file is not there, casd calculates disk usage, similar to how it's done
today.
* To protect against corruption or ungraceful shutdowns, casd can delete the
metadata file on the first write.
* When casd is shutting down (gracefully), it atomically writes the metadata
file.
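
As a rough illustration of the four steps above, here is a minimal sketch in
Python. It is only meant to show the intended life cycle of the metadata file,
not how casd would actually implement it. The function names, the JSON layout
and the "disk_usage" key are all made up for this example, and the metadata
file is assumed to live outside the directory being measured.

```python
import json
import os
import tempfile


def calculate_disk_usage(root):
    """Slow path: walk the whole tree and sum up the file sizes."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except FileNotFoundError:
                # The file disappeared while we were walking; skip it.
                pass
    return total


def load_disk_usage(root, metadata_path):
    """Trust the metadata file if it exists, otherwise recalculate."""
    try:
        with open(metadata_path) as f:
            # Hypothetical format: {"disk_usage": <bytes>}
            return int(json.load(f)["disk_usage"])
    except FileNotFoundError:
        return calculate_disk_usage(root)


def invalidate_metadata(metadata_path):
    """Call on the first write, so a crash never leaves a stale file."""
    try:
        os.unlink(metadata_path)
    except FileNotFoundError:
        pass


def write_metadata_atomically(metadata_path, disk_usage):
    """Call on graceful shutdown: write a temp file, then rename it."""
    directory = os.path.dirname(metadata_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump({"disk_usage": disk_usage}, f)
            f.flush()
            os.fsync(f.fileno())
        # Readers see either the old state or the complete new file.
        os.replace(tmp_path, metadata_path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

The temporary file is created in the same directory as the final file so that
the rename stays on one filesystem, which is what makes the replace atomic.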

Separately, the mechanism to calculate disk usage and write the metadata file
can be made into a standalone script as well.

---

Moving over to the BuildStream side of things, I think we will need to implement
the first solution I suggested originally, i.e. to relax/remove the timeout,
since it's possible for casd startup to take longer if the metadata file is
absent for whatever reason.

So, I think we should make the wait more visible by making it a timed activity,
and extend the timeout to something reasonably big, like a few minutes. While
waiting for casd to get ready, BuildStream can make sure that the process is
still alive.
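
To make that concrete, here is a minimal sketch of such a wait loop. It assumes
that casd readiness can be detected by its socket file appearing on disk, and
that BuildStream holds the casd process as a subprocess.Popen object. The
function name, the exception and the default values are made up for this
example, and the timed activity wrapper is left out.

```python
import os
import subprocess
import time


class CasdStartupError(Exception):
    """Hypothetical error raised when casd fails to become ready."""


def wait_for_casd(process: subprocess.Popen, socket_path,
                  timeout=300.0, poll_interval=0.1):
    """Wait until socket_path exists, up to timeout seconds.

    Gives up early if the casd process has already exited, so a crash at
    startup is reported immediately instead of after the full timeout.
    """
    deadline = time.monotonic() + timeout
    while not os.path.exists(socket_path):
        returncode = process.poll()
        if returncode is not None:
            raise CasdStartupError(
                "casd exited during startup with code {}".format(returncode))
        if time.monotonic() >= deadline:
            raise CasdStartupError(
                "casd was not ready after {} seconds".format(timeout))
        time.sleep(poll_interval)
```

With the liveness check in place, a generous timeout costs nothing in the crash
case, since we only wait the full duration if casd is alive but still busy.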

It seems to me that we can do this even today, independently of
improving casd startup performance, which would at least alleviate the
pain of people experiencing this issue due to slow filesystems or cold
kernel caches.

---

Does this plan seem good to people?

Cheers,
Chandan

