Re: Beast Feature Thresholds

On 10.09.2015 16:08, Stefan Westerfeld wrote:

On Wed, Sep 09, 2015 at 10:02:12AM +0200, Tim Janik wrote:
I'd like to make another Beast release, but some tests are currently failing.
In particular some of the audio feature tests are breaking and need threshold
adjustments to pass.

I'd like to get your opinion on the threshold adjustments, so for your convenience
I've appended:
a) the threshold diff required for a successful release;
b) a build log from the feature test dir in case you want a peek at the feature values.

Some tests vary by several percent (syndrum), while e.g. partymonster constantly
reaches a solid 100% similarity.

FYI, the bse2wav.scm script has been replaced by "render2wav" for porting
reasons, but that's not related to audio processing.
Did you keep the deterministic random numbers that --bse-disable-randomization usually
provided? This is just a guess, but removing it could cause such problems.

Yes, --bse-disable-randomization was provided, as can be seen from the log I had attached.

diff --git tests/audio/ tests/audio/
index 7805187..3c6dbde 100644
--- tests/audio/
+++ tests/audio/
@@ -58,7 +58,7 @@ minisong-test:
     $(BSE2WAV) $(srcdir)/minisong.bse $(@F).wav
     $(BSEFEXTRACT) $(@F).wav --cut-zeros --channel 0 --avg-spectrum --spectrum --avg-energy  > $(@F).tmp
     $(BSEFEXTRACT) $(@F).wav --cut-zeros --channel 1 --avg-spectrum --spectrum --avg-energy >> $(@F).tmp
-    $(BSEFCOMPARE) $(srcdir)/minisong.ref $(@F).tmp --threshold 99.99
+    $(BSEFCOMPARE) $(srcdir)/minisong.ref $(@F).tmp --threshold 98.00
     rm -f $(@F).tmp $(@F).wav

The tests should ensure that we don't accidentally change how things sound.
So ideally our goal is that things sound exactly the same (100.00). This cannot
always be achieved, so 99.99 is generally used.

Yes, however AFAIK the synthesis bits and timing bits haven't been touched
since the tests last passed, and now I see random failures.

However the thresholds you used are nowhere near 99.99, so most likely things
don't sound the same, and we should investigate why, to ensure that we didn't
break things.

To understand why I say things may be broken, and why we should check this, it is
important to know that although scores range between 0.00 and 100.00, only
scores very close to 100.00 ensure that things really sound the same. So a
score of 98.00 already tolerates a lot of difference from the original. I'm not
sure whether the difference in any of the files is audible, but it is significant.
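To illustrate why only near-100 scores are meaningful, here is a minimal sketch of a 0..100 similarity score between two averaged spectra. This is purely hypothetical: the thread does not show which metric BSEFCOMPARE actually uses, and the cosine-similarity formula and the function name below are my assumptions, not the tool's implementation.

```python
# Hypothetical sketch of a 0..100 spectral similarity score.
# NOTE: cosine similarity is an assumption; the real BSEFCOMPARE
# metric is not documented in this thread.
import math

def similarity_percent(ref, test):
    """Return a score in [0, 100]; only values very close to 100
    indicate the two spectra are practically identical."""
    dot = sum(r * t for r, t in zip(ref, test))
    norm = (math.sqrt(sum(r * r for r in ref)) *
            math.sqrt(sum(t * t for t in test)))
    if norm == 0.0:
        # Two all-zero spectra are trivially identical.
        return 100.0
    return 100.0 * dot / norm

ref = [0.5, 0.25, 0.125, 0.0625]
# Identical spectra score exactly 100.00.
print(similarity_percent(ref, ref))
# A small deviation in one bin already drops the score noticeably,
# which is why a 98.00 threshold tolerates quite a lot of change.
print(similarity_percent(ref, [0.5, 0.25, 0.125, 0.10]))
```

The point of the sketch is only that the score scale is nonlinear in perceptual terms: dropping a threshold from 99.99 to 98.00 admits far more signal difference than the two numbers suggest.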

I'm aware of all this, but AFAICS there's no clear way to trace such errors.
I.e. I can't currently tell why *some* feature tests sometimes fail and
sometimes pass.
As a reminder, partymonster always passes at 100%, so there's no general
(timing) brokenness at play here...

   Cu... Stefan

Yours sincerely,
Tim Janik
Free software author and speaker.
