Re: [Tracker] Use of fork() in tracker-extract-pdf



Hi Martyn

Am 12.08.2013 um 11:32 schrieb Martyn Russell <martyn lanedo com>:
On 11/08/13 09:24, Ralph Böhme wrote:
As I'm only really at the beginning of an in depth analysis, I can't
say for sure that the hangs I see are the cause of this, but knowing
there seems to exist a fundamental design flaw in tracker-extra-pdf,
I'm asking for thoughts on this.

So we use the parent/child set up because some PDFs take a REALLY long time to process and we have a 10 
second window for them to be indexed.

Yes, I know. My test data contains many, many very large PDFs (up 1000 pages).

After that we kill the child process and return.

Understood. Just for references, this happens in

  src/tracker-extract/tracker-extract-pdf.c:extract_content().

We did this because we didn't want to kill the tracker-extract process all the time.

I just stumbled across this one, didn't notice before. A 20 seconds watchdog timeout setup in

  src/tracker-extract/tracker-controller.c:metadata_data_new().

In reality, this is actually what tracker-extract was built to do, so ...

It seems in reality the 20 s watchdog timer is setup too early! Afair it is not setup before the actual 
extraction is started, but already at the moment the extraction event is received.

Now if several extraction requests a received and all of these are files (e.g. PDFs) that take long to 
extract, the watchdog timers for requests still on the pending queue are already expiring, I see this 
sequence of events in the debug log:



  12 Aug 2013, 11:55:49: Tracker: <--- [12|0] handle_method_call_get_metadata_fast
    (uri:'file:///tank/test/PDF/Security/Network%20Security%20With%20Openssl.pdf',
    mime:'application/pdf', index_fd:0)

Extraction requests 1-11 are still being processes, so it takes some time before this one gets it shot here:

  12 Aug 2013, 11:56:06: Tracker: Dispatching
    'file:///tank/test/PDF/Security/Network%20Security%20With%20Openssl.pdf'
    in main thread

Unfortunately this one takes more then 4 seconds, so the watchdog killer steps in:

  12 Aug 2013, 11:56:10: Tracker-Critical **: Extraction task for
    'file:///tank/test/PDF/Security/Network%20Security%20With%20Openssl.pdf'
    went rogue and took more than 20 seconds. Forcing exit.

...and tracker-extract gets restarted.

Increasing WATCHDOG_TIMEOUT from 20 to a large value will workaround this, but a proper fix would be to setup 
the timer in src/tracker-extract/tracker-extract.c:dispatch_task_cb() or similar.

Afaict, the right design would involve an exec() in the child and
using some other IPC channel. I'll happily volunteer.

Yea, so we are actually calling exit() in the child. See:

 extract_content_child_process()

What has the exit() call to do with this? Afaict that's completely unrelated to the issue of using fork() in 
programs using glib.

Thoughts?

Are you sure it isn't a difficult PDF taking too long?

Yes, heres a SBT from a stuck process:

root beast:~# pstack 1794
1794:   /opt/csw/libexec/tracker-extract
 fe6c4749 lwp_park (0, 0, 0)
 fe6bc93b mutex_lock_impl (816a288, 0, 966948c, fe447474, 80, 893fb58) + 166
 fe6bca7e mutex_lock (816a288, 893fc78, feffe638, 800, fe3df319, fdff0a10) + 16
 fe3df332 g_mutex_lock (fdffab2c, 1, 0, fe447474, feffe6ac, 81cdc48) + 22
 fdfd0f37 tracker_log_handler (fcff4fa9, 80, 893fb58, 0, 8116888, fcfd95c4) + a7
 fe39ee6b g_logv   (fcff4fa9, 80, fcff4eb0, feffe77c, 100000) + 1fb
 fe39f022 g_log    (fcff4fa9, 80, fcff4eb0, 658, 1, ff9a8) + 32
 fcff47fe tracker_extract_get_metadata (81dd9a0, 80, 805da80, 8197cd8, fe5a2a40, fe762000) + 17be
 0805899b get_file_metadata (8153500, 0, feffe9a8, fe447474, 81418e0, 81d7910) + db
 08059092 get_metadata (81879b0, 20, 805e910, 87c1908, 80de700, 0) + 62
 08059405 dispatch_task_cb (81879b0, fe447474, feffea78, fe447474, fe3df359) + 2b5
 fe393c71 g_idle_dispatch (8380688, 8059150, 81879b0, feffea50, 81e4f10, ffffffff) + 21
 fe397285 g_main_context_dispatch (8144f48, c8, 81e4f10, 2, 2, 0) + 135
 fe397648 g_main_context_iterate.isra.21 (1, 81c6074, 0, fe397aaa, 8153c58, 1e) + 218
 fe397b23 g_main_loop_run (81e4ee8, 0, feffeb08, 100000, fe790a20, 28) + 83
 0805d3fe main     (1, feffeb78, feffeb80) + 52e
 080565c0 _start   (1, feffec7c, 0, feffec9d, feffecb6, feffecc2) + 80

root beast:~# ps -efl | grep tracker
 0 S     root  1863     1   0  97 39        ?  17304        ? 12:04:19 ?           0:01 
/opt/csw/libexec/tracker-miner-fs
 0 S     root  1865     1   0  40 20        ?  32177        ? 12:04:19 ?           0:05 
/opt/csw/libexec/tracker-store
 0 S     root  1891     1   0  97 39        ?  20136        ? 12:06:59 ?           0:01 
/opt/csw/libexec/tracker-extract
 0 S     root  1794     1   0  97 39        ?  19318        ? 11:56:38 ?           0:00 
/opt/csw/libexec/tracker-extract
 0 S     root  1967  1723   0  50 20        ?   2149        ? 13:12:44 pts/2       0:00 grep tracker

I had restarted tracker at ~ 12:04:19, so process 1794 is a older stuck tracker-extract process. Fortunately 
I could locate it in the debug log:

  12 Aug 2013, 11:56:38: Tracker: Parent: Content extraction now starting in child process (pid = 1794)

Shortly thereafter:

  12 Aug 2013, 11:56:44: Tracker-Critical **: Extraction task for 
    'file:///tank/test/PDF/Security:Kryptographie/Thawte/PaxHeader/Pa%C3%9Fwort%20Fragen.pdf'
    went rogue and took more than 20 seconds. Forcing exit.
  12 Aug 2013, 11:56:49: Tracker: Starting tracker-extract 0.16.2

I guess this demonstrates the effect of both issues I outlined:
1) pid 1794 is deadlocked in a mutex that is probably still held by the (now non existent) tracker-extract 
parent
2) the parent of 1794 can't kill it off because it was itself killed by the 20 seconds watchdog timer

:)

-r

-- 
Ralph Böhme <rb netafp com>
Netatalk Developer | Support | Services
Curslacker Deich 254, 21039 Hamburg, Germany
http://www.netafp.com/

Attachment: smime.p7s
Description: S/MIME cryptographic signature



[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]