Re: Gtk+ unit tests (brainstorming)

On Thu, 26 Oct 2006, Iago Toral Quiroga wrote:

- in the common case, test results should be reduced to a single boolean:
     "all tests passed" vs. "at least one test failed"
   many test frameworks provide means to count and report failing tests
   (even automake's standard check:-rule), there's little to no merit to
   this functionality though.
   letting more than one test fail while continuing to work in an
   unrelated area rapidly leads to confusion about which tests are
   supposed to work and which aren't, especially in multi-contributor setups.
   figuring out whether the right tests passed suddenly requires scanning
   the test logs and remembering the last count of tests that may validly
   fail. this defeats the purpose of using a single quick make check run to
   be confident that one's changes didn't introduce breakage.
   as a result, the whole test harness should always either succeed or
   be immediately fixed.

I understand your point; however, I still think that being able to get a
wider report of all the tests failing at a given moment is also
interesting (for example in a buildbot continuous integration loop, like
the one being prepared by the build-brigade). Besides, if there is a
group of people that want to work on fixing bugs at the same time, they
would need to get a list of all failing tests, not only the first one.

well, you can get that to some extent automatically, if you invoke
  $ make -k check
going beyond that would be a bad trade off i think because:
a) tests are primarily in place to ensure certain functionality is
   implemented correctly and continues to work;
b) if things break, tests need to be easy to debug. basically, most of the
   time a test fails, you have to engage the debugger, read code/docs,
   analyse and fix. tests that need forking or are hard to understand
   get in the way of this process, so should be avoided.
c) implementation and test code often have dependencies that won't allow
   testing beyond the occurrence of an error. a simple example is:
     o = object_new();
     ASSERT (o != NULL); /* no point to continue beyond this on error */
     test_function (o);
   a more intimidating case is:
      main() {
        test_gtk_1();
        test_gtk_2();
        test_gtk_3();
        test_gtk_4();
        test_gtk_5();
        // ...
      }
   if any of those test functions (say test_gtk_3) produces a gtk/glib
   error/assertion/warning/critical, the remaining test functions (4, 5, ...)
   are likely to fail for bogus reasons because the libraries entered
   undefined state.
   reporting those subsequent errors (which are likely to be very
   misleading) is useless at best and confusing (in terms of which error
   really matters) at worst.
   yes, forking for each of the test functions works around that (provided
   they are as independent of one another as in the example above), but again,
   this complicates the test implementation (it's no longer an easy to
   understand test program) and debuggability, i.e. it hurts the 2 main
   properties of a good test program.

to sum this up, reporting multiple fine grained test failures may have some
benefits, mainly those you outlined. but it comes at a certain cost, i.e.
increased test code complexity and hindered debuggability, which compromise
two important properties of good test programs.
also, consider that "make -k check" can still get you reports on multiple
test failures, just at a somewhat lower granularity. in fact, it's just low
enough to avoid bogus reports.
so, weighing the options, adding fork mode when you don't have to (i.e. other
than for checking the g_error implementation) provides questionable benefits
at significant costs.
that's not an optimal trade-off for gtk test programs i'd say, and i'd expect
the same to hold for most other projects.
that's not an optimal trade off for gtk test programs i'd say, and i'd expect
the same to hold for most other projects.

- GLib based test programs should never produce a "CRITICAL **:" or
   "WARNING **:" message and succeed. the reasoning here is that CRITICALs
   and WARNINGs are indicators for an invalid program or library state,
   anything can follow from this.
   since tests are in place to verify correct implementation/operation, an
   invalid program state should never be reached. as a consequence, all tests
   should upon initialization make CRITICALs and WARNINGs fatal (as if
   --g-fatal-warnings was given).

Maybe you would like to test how the library handles invalid input. For
example, let's say we have a function that accepts a pointer as a
parameter; I think it is worth knowing whether that function safely
handles the case when that pointer is NULL (if that is not an allowed
value for that parameter) or whether it produces a segmentation fault in
that case.

no, it really doesn't make sense to test functions outside the defined
value ranges. that's because when implementing, the only thing you need
to actually care about from an API perspective is: the defined value ranges.
besides that, value ranges may compatibly be *extended* in future versions,
which would make value range restriction tests break unnecessarily.
if a function is not defined for say (char*)0, adding a test that asserts
certain behaviour for (char*)0 is effectively *extending* the current
value range to include (char*)0 and then testing the proper implementation
of this extended case. the outcome of which would be a CRITICAL or a segfault
though, and with the very exception of g_critical(), *no* glib/gtk function
implements this behaviour purposefully, compatibly, or in a documented way.
so such a test would at best be bogus and unnecessary.

I'll add here some points supporting Check ;):

ok, addressing them one by one, since i see multiple reasons for not
using Check ;)

As I said in another email, it wouldn't be a dependency to build GTK+,
it would be only a dependency to run the tests inside GTK+.

i think having the test suite available and built at all levels other than
--enable-debug=no would be good, because Gtk+ has many outside contributors
for whom testing (read: make check) should be as easy as possible.
it's not clear that Check (besides being an additional dependency in
itself) fulfils all the portability requirements of glib/gtk+ for these
cases though.

Check is widely used and having a standard tool for testing, instead of
doing something ad-hoc, has its advantages too.

i agree that using something "standard" instead of ad-hoc solutions can in
general have benefits. but IMHO this doesn't apply to Check, because:
- test "frameworks" really don't have much to do in general. it's the tests
  themselves that matter and consume the developer time.
- tests are very project specific, we've had a good range of introductory
  posts in this thread already (cairo, dbus, beast, etc.).
  so if not too much code is required, writing a few auxiliary routines for
  project specific testing can be a plus and achieved with reasonable effort.
- Check may be widely used, but is presented as "[...] at the moment only
  sporadically maintained" (...).
  that alone causes me to veto any Check dependency for glib/gtk already ;)

You never know when you would need another feature for your testing. If
you use a tool maybe it provides it and if does not, you can always
request it. But if you don't use a testing tool you will always be in
the need to implement it yourself.

as i said, this can be a plus. and testing needs can be foreseen with
considerable confidence not to require specialized rocket science anytime
soon.
that being said, we can still opt to depend on or integrate with any test
"framework" out there at any future point if rocket science requirements
indeed do emerge ;)

Anyway, I agree that the most important thing here is the tests, not the
test framework. Whether you finally decide to go with Check or without
it, count on me to collaborate! :)

great encouragement, thanks. ;)

Here are some bits I would like to add to this brainstorming, most of
them come from my work on the tests I've already done:

- As Federico said, I think the tests should be split into
independent programs that test independent components. This way,
developers making changes to one widget would be able to run only the
tests that deal with the component they are modifying.

yup, i agree. i'd expect this to occur pretty naturally however, e.g.
if Kris works on some treeview test suite program, Federico on a file
chooser test program and Mitch on a key navigation test program, you get
this kind of split up pretty automatically.

- I think unit tests for an interface should consider 3 cases:
  1. Test of regular values: how does the interface behave when I try
it normally.
  2. Test of limit values: how does the interface behave when I try it
with a limit value? (for example the last and the first elements of an
array, etc)

yup, with the param specs in glib, we have a good chance of covering most
interesting limits, and by random selection also many intermediate values
for properties with ordered value ranges.

  3. Test of invalid values: how does the interface handle invalid
input? does it handle it safely or does it break completely?

as i described earlier, i don't think this is a good idea.
testing undefined inputs/states can and often will lead to undefined
behaviour, and that means: in such scenarios there exists no defined
behaviour a test could reliably check.

- The tests should be homogeneous; I mean, it would be nice if they all
looked the same. That would make them really easy to read and understand.
It would also make it really easy to provide new tests; there could
also be a template with the main parts of a widget test file, which is
what I did while developing my tests (I attach that template in case you
would like to have a look; the <> means something that needs to be
inserted there, usually something like the name of the component you are
testing, or its type, etc.)

about having homogeneous-looking tests, enforcing this is really no
different from sticking to the usual gtk coding conventions when integrating
regular core code patches.

- Related to the above point, it would be nice, on top of unit tests, to
provide use case tests. I mean, not only to test interfaces, but
complete use cases for widgets. For example, one could test the complete
process of opening a file: open the open file dialog, select directory,
select a file in that directory, click Open button., etc. Dogtail would
work nice here.

yes, that's certainly nice to have, but requires significant work.
(for Qt, whole companies have been formed around just providing/adding
this functionality.)
so it's great that dogtail started development work in this field already.

- Should we test signal emission? I think signals are part of the API
contract, so we should test whether a function emits a signal when it
is supposed to emit it.

yes, that is definitely a good point that my list missed out. thanks for
bringing it up.


