Re: searching for parts of words
- From: Sean Carlos <sean carlos gmail com>
- To: dashboard-hackers gnome org
- Subject: Re: searching for parts of words
- Date: Tue, 15 Nov 2005 13:39:12 -0500
Joe Shaw wrote:
[cut]
> We need a strategy for how to determine the language of a document. I
> don't think that just going with LANG is the right thing, because the
> number of English documents a user will encounter on the Internet is
> significant, and we'd really like those to be indexed correctly.
This is a tough nut to crack.
A first stab could be made at documents which carry language clues in
their meta information. This will of course fail when the meta
information is incorrect, and mixed-language documents bring their own
challenges.
For what it's worth, here is my understanding of what is available as
meta information for HTML documents.
HTML documents may carry a lang attribute:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US" lang="en-US">
indicates the US dialect of English.
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="it" lang="it">
indicates Italian. There can be a document level declaration as well as
element level declarations.
Ref: http://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.1
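As a rough illustration of reading these declarations, here is a minimal Python sketch using only the standard library's html.parser. The class and function names (LangAttributeParser, declared_languages) are made up for this example and are not part of any indexer; preferring xml:lang over lang when both are present is also just an assumption.

```python
from html.parser import HTMLParser


class LangAttributeParser(HTMLParser):
    """Hypothetical helper: collects lang / xml:lang attribute values."""

    def __init__(self):
        super().__init__()
        self.document_lang = None  # value declared on the <html> element
        self.element_langs = []    # (tag, lang) pairs for other elements

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attr_map = dict(attrs)
        # Assumption: xml:lang wins over lang if both are present.
        lang = attr_map.get("xml:lang") or attr_map.get("lang")
        if not lang:
            return
        if tag == "html" and self.document_lang is None:
            self.document_lang = lang
        else:
            self.element_langs.append((tag, lang))


def declared_languages(html_text):
    """Return (document-level language, list of element-level declarations)."""
    parser = LangAttributeParser()
    parser.feed(html_text)
    parser.close()
    return parser.document_lang, parser.element_langs
```

The element-level list is what would let an indexer notice a mixed-language document, though what to do with that information is a separate question.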
Sometimes the language is specified in the HTTP response headers:
Content-Language: de
when specified in the server configuration (Apache example):
AddLanguage de .html
The same information is sometimes specified as
<META HTTP-EQUIV="Content-Language" CONTENT="de">
in the HTML itself.
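A possible way to combine the two Content-Language sources, sketched in Python with the standard library only. The names (MetaContentLanguageParser, content_languages) are hypothetical, the headers argument is assumed to be a plain dict of response headers, and letting the HTTP header take precedence over the META element is an assumption of this sketch rather than a settled rule. Content-Language may list several languages separated by commas, so the result is a list.

```python
from html.parser import HTMLParser


class MetaContentLanguageParser(HTMLParser):
    """Hypothetical helper: finds <META HTTP-EQUIV="Content-Language" ...>."""

    def __init__(self):
        super().__init__()
        self.content_language = None

    def handle_starttag(self, tag, attrs):
        # Only the first matching META element is kept.
        if tag != "meta" or self.content_language is not None:
            return
        attr_map = dict(attrs)  # attribute names arrive lowercased
        http_equiv = (attr_map.get("http-equiv") or "").lower()
        if http_equiv == "content-language":
            self.content_language = attr_map.get("content")


def split_language_list(value):
    """Split a value such as "de, en" into ["de", "en"]."""
    if not value:
        return []
    return [part.strip() for part in value.split(",") if part.strip()]


def content_languages(headers, html_text):
    """Return declared languages; assumption: the HTTP header wins over META."""
    for name, value in headers.items():
        # Header names are case-insensitive.
        if name.lower() == "content-language":
            langs = split_language_list(value)
            if langs:
                return langs
    parser = MetaContentLanguageParser()
    parser.feed(html_text)
    parser.close()
    return split_language_list(parser.content_language)
```

An empty list means neither source declared anything, in which case some other strategy would have to take over.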
- Sean Carlos
http://www.antezeta.com/