[Tracker] [PATCH] Improve oasis extractor to handle embedded tabs and line breaks
- From: Karl Relton <karllinuxtest relton ntlworld com>
- To: tracker-list gnome org
- Date: Wed, 09 Jul 2014 19:26:00 +0100
Hi all,
The following patch improves the OASIS extractor for ODT documents so
that it keeps extracting plain-text content when it encounters embedded
space, tab, and line-break XML tags (<text:s/>, <text:tab/>,
<text:line-break/>). Without this patch the extractor stops at such a
tag and typically resumes only at the next paragraph or style/format
change, so extractable text is missed.
Karl
--- tracker-1.0.1.orig/src/tracker-extract/tracker-extract-oasis.c 2014-07-09 19:15:21.798461185 +0100
+++ tracker-1.0.1/src/tracker-extract/tracker-extract-oasis.c 2014-07-09 19:12:15.774452182 +0100
@@ -395,7 +395,10 @@ xml_start_element_handler_content (GMark
(g_ascii_strcasecmp (element_name, "text:h") == 0) ||
(g_ascii_strcasecmp (element_name, "text:a") == 0) ||
(g_ascii_strcasecmp (element_name, "text:span") == 0) ||
- (g_ascii_strcasecmp (element_name, "table:table-cell")) == 0) {
+ (g_ascii_strcasecmp (element_name, "table:table-cell") == 0) ||
+ (g_ascii_strcasecmp (element_name, "text:s") == 0) ||
+ (g_ascii_strcasecmp (element_name, "text:tab") == 0) ||
+ (g_ascii_strcasecmp (element_name, "text:line-break") == 0)) {
data->current = ODT_TAG_TYPE_WORD_TEXT;
} else {
data->current = -1;
@@ -436,7 +439,13 @@ xml_end_element_handler_content (GMarkup
{
ODTContentParseInfo *data = user_data;
- data->current = -1;
+ /* Don't stop processing if it was a so-called 'empty' tag (e.g. <text:tab/>) */
+ if (!((g_ascii_strcasecmp (element_name, "text:s") == 0) ||
+ (g_ascii_strcasecmp (element_name, "text:tab") == 0) ||
+ (g_ascii_strcasecmp (element_name, "text:line-break") == 0))) {
+ data->current = -1;
+ }
+
}
static void
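To illustrate the effect of the change, here is a minimal standalone sketch.
It is not the Tracker code: the Event type, the extract_text() function and
the sample_events array are hypothetical names made up for this example, plain
strcmp() stands in for g_ascii_strcasecmp(), and "text:p" is assumed to be one
of the text-bearing elements (it is not visible in the patch context). The
"patched" flag switches between the old and new handling of the empty tags.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the start/text/end callbacks of a markup parser. */
typedef enum { EV_START, EV_TEXT, EV_END } EventKind;

typedef struct {
    EventKind kind;
    const char *value; /* element name for EV_START/EV_END, character data for EV_TEXT */
} Event;

/* Elements that open a run of extractable text ("text:p" is an assumption). */
static int is_text_container(const char *name)
{
    return strcmp(name, "text:p") == 0 ||
           strcmp(name, "text:h") == 0 ||
           strcmp(name, "text:a") == 0 ||
           strcmp(name, "text:span") == 0 ||
           strcmp(name, "table:table-cell") == 0;
}

/* The 'empty' inline tags the patch adds handling for. */
static int is_empty_inline_tag(const char *name)
{
    return strcmp(name, "text:s") == 0 ||
           strcmp(name, "text:tab") == 0 ||
           strcmp(name, "text:line-break") == 0;
}

/* Concatenate the character data seen while 'collecting' is set.
 * patched == 0 mimics the old behaviour, patched != 0 the new one. */
static void extract_text(const Event *events, size_t count, int patched,
                         char *out, size_t out_size)
{
    int collecting = 0;
    size_t len = 0;
    size_t i;

    assert(out_size > 0);
    out[0] = '\0';

    for (i = 0; i < count; i++) {
        const char *value = events[i].value;

        switch (events[i].kind) {
        case EV_START:
            /* Old code: an empty tag falls into the 'else' branch and stops collection. */
            collecting = is_text_container(value) ||
                         (patched && is_empty_inline_tag(value));
            break;
        case EV_TEXT:
            if (collecting) {
                size_t avail = out_size - len - 1; /* keep room for the terminator */
                size_t n = strlen(value);
                if (n > avail)
                    n = avail;
                memcpy(out + len, value, n);
                len += n;
                out[len] = '\0';
            }
            break;
        case EV_END:
            /* New code: closing an empty tag must not stop collection. */
            if (!(patched && is_empty_inline_tag(value)))
                collecting = 0;
            break;
        }
    }
}

/* <text:p>Hello<text:tab/>world</text:p> as a sequence of events. */
static const Event sample_events[] = {
    { EV_START, "text:p" },
    { EV_TEXT,  "Hello" },
    { EV_START, "text:tab" },
    { EV_END,   "text:tab" },
    { EV_TEXT,  "world" },
    { EV_END,   "text:p" },
};
```

Run over sample_events, the unpatched mode collects only "Hello", while the
patched mode collects "Helloworld". The sketch does not insert any whitespace
for the tab itself; it only shows that the text following the tag is no longer
dropped.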