[Tracker] [PATCH] Improve oasis extractor to handle embedded tabs and line breaks



Hi all

The following patch improves the OASIS extractor for ODT documents so
that it keeps extracting plain-text content even when the text contains
embedded space (<text:s/>), tab (<text:tab/>) or line-break
(<text:line-break/>) XML tags. Without this patch the extractor stops
when it encounters such a tag and typically resumes only at the next
paragraph or style/format change, so the text in between is missed.

Karl

--- tracker-1.0.1.orig/src/tracker-extract/tracker-extract-oasis.c      2014-07-09 19:15:21.798461185 +0100
+++ tracker-1.0.1/src/tracker-extract/tracker-extract-oasis.c   2014-07-09 19:12:15.774452182 +0100
@@ -395,7 +395,10 @@ xml_start_element_handler_content (GMark
                    (g_ascii_strcasecmp (element_name, "text:h") == 0) ||
                    (g_ascii_strcasecmp (element_name, "text:a") == 0) ||
                    (g_ascii_strcasecmp (element_name, "text:span") == 0) ||
-                   (g_ascii_strcasecmp (element_name, "table:table-cell")) == 0) {
+                   (g_ascii_strcasecmp (element_name, "table:table-cell") == 0) ||
+                   (g_ascii_strcasecmp (element_name, "text:s") == 0) ||
+                   (g_ascii_strcasecmp (element_name, "text:tab") == 0) ||
+                   (g_ascii_strcasecmp (element_name, "text:line-break") == 0)) {
                        data->current = ODT_TAG_TYPE_WORD_TEXT;
                } else {
                        data->current = -1;
@@ -436,7 +439,13 @@ xml_end_element_handler_content (GMarkup
 {
        ODTContentParseInfo *data = user_data;
 
-       data->current = -1;
+       /* Don't stop processing if it was a so-called 'empty' tag (e.g. <text:tab/>) */
+       if (!((g_ascii_strcasecmp (element_name, "text:s") == 0)   ||
+             (g_ascii_strcasecmp (element_name, "text:tab") == 0) ||
+             (g_ascii_strcasecmp (element_name, "text:line-break") == 0))) {
+               data->current = -1;
+       }
+
 }
 
 static void
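
For anyone who wants to see the effect without building Tracker, here is a
minimal stand-alone sketch of the state handling the patch changes. It is
not Tracker code: ParseState, on_start, on_end, on_text and extract are
hypothetical names, plain strcmp stands in for g_ascii_strcasecmp, and the
parser events for <text:p>Hello<text:tab/>World</text:p> are fed by hand
instead of coming from GMarkup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the extractor's state; not Tracker's real types. */
enum { TAG_NONE = -1, TAG_WORD_TEXT = 0 };

typedef struct {
    int  current;   /* plays the role of data->current in the patch */
    bool patched;   /* true: behave as after the patch */
    char out[128];  /* collected plain text */
} ParseState;

/* The three 'empty' elements the patch adds. */
static bool is_empty_tag(const char *name)
{
    return strcmp(name, "text:s") == 0 ||
           strcmp(name, "text:tab") == 0 ||
           strcmp(name, "text:line-break") == 0;
}

/* Start handler: only a subset of the real text elements is modelled. */
static void on_start(ParseState *s, const char *name)
{
    bool text_tag = strcmp(name, "text:p") == 0 ||
                    strcmp(name, "text:span") == 0;

    if (text_tag || (s->patched && is_empty_tag(name)))
        s->current = TAG_WORD_TEXT;
    else
        s->current = TAG_NONE;
}

/* End handler: after the patch, closing an empty tag leaves the state alone. */
static void on_end(ParseState *s, const char *name)
{
    if (s->patched && is_empty_tag(name))
        return;
    s->current = TAG_NONE;
}

/* Text handler: character data is kept only while inside a text element. */
static void on_text(ParseState *s, const char *text)
{
    if (s->current != TAG_WORD_TEXT)
        return;
    strncat(s->out, text, sizeof s->out - strlen(s->out) - 1);
}

/* Feed the events for <text:p>Hello<text:tab/>World</text:p> by hand. */
static void extract(bool patched, char *dst, size_t dst_len)
{
    ParseState s = { .current = TAG_NONE, .patched = patched, .out = "" };

    assert(dst != NULL && dst_len > 0);

    on_start(&s, "text:p");
    on_text(&s, "Hello");
    on_start(&s, "text:tab");
    on_end(&s, "text:tab");
    on_text(&s, "World");
    on_end(&s, "text:p");

    snprintf(dst, dst_len, "%s", s.out);
}
```

In this model, extract(false, ...) collects only "Hello" because the tab
element resets the state, while extract(true, ...) collects "HelloWorld".
Like the patch, the sketch only keeps the state alive across the empty tag;
it makes no attempt to emit whitespace for the tab itself.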



