Re: Beast Feature Thresholds


On Wed, Sep 09, 2015 at 10:02:12AM +0200, Tim Janik wrote:
I'd like to make another Beast release, but some tests are currently failing.
In particular some of the audio feature tests are breaking and need threshold
adjustments to pass.

I'd like to get your opinion on the threshold adjustments, so for your convenience
I've appended:
a) the threshold diff required for a successful release;
b) a build log from the feature test dir in case you want a peek at the feature values.

Some tests vary by several percent (syndrum) while e.g. partymonster constantly
reaches a solid 100% similarity.

FYI, the bse2wav.scm script has been replaced by "render2wav" for porting
reasons, but that's not related to audio processing.

Did you keep the deterministic random that --bse-disable-randomization usually
provided? This is just a guess, but removing it could cause such problems.
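
To illustrate the guess, here is a minimal sketch in Python, not BSE code. The
function render_noise and its seed parameter are hypothetical stand-ins for any
synthesis module that draws random numbers. With a fixed seed, two renderings
are identical; without one, each run differs, so features extracted from the
output can drift between runs.

```python
import random

def render_noise(n, seed=None):
    """Hypothetical stand-in for a synth module that uses random numbers."""
    # seed=None makes Random() seed itself from OS entropy (non-deterministic).
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

# Two renderings with the same fixed seed produce identical samples.
fixed_a = render_noise(1000, seed=42)
fixed_b = render_noise(1000, seed=42)
same_fixed = fixed_a == fixed_b

# Two renderings without a seed differ from run to run.
free_a = render_noise(1000)
free_b = render_noise(1000)
same_free = free_a == free_b

print("fixed seed reproducible:", same_fixed)
print("unseeded reproducible:  ", same_free)
```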

diff --git tests/audio/ tests/audio/
index 7805187..3c6dbde 100644
--- tests/audio/
+++ tests/audio/
@@ -58,7 +58,7 @@ minisong-test:
     $(BSE2WAV) $(srcdir)/minisong.bse $(@F).wav
     $(BSEFEXTRACT) $(@F).wav --cut-zeros --channel 0 --avg-spectrum --spectrum --avg-energy  > $(@F).tmp
     $(BSEFEXTRACT) $(@F).wav --cut-zeros --channel 1 --avg-spectrum --spectrum --avg-energy >> $(@F).tmp
-    $(BSEFCOMPARE) $(srcdir)/minisong.ref $(@F).tmp --threshold 99.99
+    $(BSEFCOMPARE) $(srcdir)/minisong.ref $(@F).tmp --threshold 98.00
     rm -f $(@F).tmp $(@F).wav
 FEATURE_TESTS += syndrum-test
@@ -67,7 +67,7 @@ syndrum-test:
     $(BSE2WAV) $(srcdir)/syndrum.bse $(@F).wav
     $(BSEFEXTRACT) $(@F).wav --cut-zeros --channel 0 --avg-spectrum --spectrum --avg-energy  > $(@F).tmp
     $(BSEFEXTRACT) $(@F).wav --cut-zeros --channel 1 --avg-spectrum --spectrum --avg-energy >> $(@F).tmp
-    $(BSEFCOMPARE) $(srcdir)/syndrum.ref $(@F).tmp --threshold 99.99
+    $(BSEFCOMPARE) $(srcdir)/syndrum.ref $(@F).tmp --threshold 91.00
     rm -f $(@F).tmp $(@F).wav
 FEATURE_TESTS += velocity-test
@@ -85,7 +85,7 @@ EXTRA_DIST += organsong.bse organsong.ref
     $(BSE2WAV) $(srcdir)/organsong.bse $(@F).wav
     $(BSEFEXTRACT) $(@F).wav --cut-zeros --channel 0 --avg-spectrum --spectrum --avg-energy  > $(@F).tmp
-    $(BSEFCOMPARE) $(srcdir)/organsong.ref $(@F).tmp --threshold 99.99
+    $(BSEFCOMPARE) $(srcdir)/organsong.ref $(@F).tmp --threshold 98.00
     rm -f $(@F).tmp $(@F).wav
 # ADSR Test checks the mono channel envelope rendering
@@ -120,7 +120,7 @@ EXTRA_DIST += xtalstringssong.bse xtalstringssong.ref
     $(BSE2WAV) $(srcdir)/xtalstringssong.bse $(@F).wav
     $(BSEFEXTRACT) $(@F).wav --cut-zeros --channel 0 --avg-spectrum --spectrum --avg-energy  > $(@F).tmp
-    $(BSEFCOMPARE) $(srcdir)/xtalstringssong.ref $(@F).tmp --threshold 99.99
+    $(BSEFCOMPARE) $(srcdir)/xtalstringssong.ref $(@F).tmp --threshold 99.90
     rm -f $(@F).tmp $(@F).wav

The tests should ensure that we don't accidentally break how things sound.
So ideally our goal is that things sound exactly the same (100.00). This cannot
always be done, so 99.99 is generally used.

However the thresholds you used are nowhere near 99.99, so most likely things
don't sound the same, and we should investigate why, to ensure that we didn't
break things.

To see why I think things may be broken and should be checked, it is important
to know that although scores range from 0.00 to 100.00, only scores very close
to 100.00 ensure that things really sound the same. A score of 98.00 already
tolerates a lot of difference from the original. I'm not sure whether the
difference in any of the files is audible, but it is significant.
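
As a rough illustration of how much room such a score can leave, here is a
minimal sketch in Python. It assumes a cosine-style similarity between feature
vectors; this is only an assumption for the example, since the metric that
bsefcompare actually uses is not described in this thread. The helper name
similarity_percent and the toy vectors are hypothetical.

```python
import math

def similarity_percent(a, b):
    """Cosine similarity of two feature vectors, scaled to 0..100.

    Assumption for illustration only; not necessarily what bsefcompare does.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 100.0 * dot / (norm_a * norm_b)

reference = [1.0, 0.0]

# Deviation of 20% of the reference magnitude, orthogonal to the reference.
large_deviation = [1.0, 0.2]
# Deviation of 1.4% of the reference magnitude.
small_deviation = [1.0, 0.014]

score_large = similarity_percent(reference, large_deviation)
score_small = similarity_percent(reference, small_deviation)

print("20%% deviation:  %.2f" % score_large)
print("1.4%% deviation: %.2f" % score_small)
```

Under this assumed metric, a deviation of about 20% still scores roughly 98,
while a score of 99.99 requires the deviation to stay near 1.4%.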

   Cu... Stefan
Stefan Westerfeld,
