Re: searching for parts of words

From: Sean Carlos <sean carlos gmail com>
To: dashboard-hackers gnome org
Subject: Re: searching for parts of words
Date: Tue, 15 Nov 2005 13:39:12 -0500

Joe Shaw wrote:

[cut]


We need a strategy for how to determine the language of a document.  I
don't think that just going with LANG is the right thing, because the
number of English documents a user will encounter on the Internet is
significant, and we'd really like those to be indexed correctly.


This is a tough nut to crack.

A stab might be made for documents which carry language clues in their
meta information.  This of course will fail when meta information is
incorrect; mixed language documents will carry their own challenges.

For what it's worth, here is my understanding of what is available as meta information for HTML documents.


HTML documents may carry a lang attribute:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US" lang="en-US">

indicates the US dialect of English.

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="it" lang="it">

indicates Italian. There can be a document level declaration as well as element level declarations.


Ref: http://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.1
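
As a rough illustration of reading these declarations, here is a minimal Python sketch built on the standard library's html.parser module. The names LangCollector and collect_langs are hypothetical and exist only for this example. It assumes that lang is preferred over xml:lang when both are present, and it does nothing to check that the declared value is correct.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Collect lang / xml:lang declarations from an HTML document.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.document_lang = None  # value declared on the <html> element
        self.element_langs = []    # (tag, lang) pairs for all other elements

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attr_map = dict(attrs)
        # Assumption: prefer lang, fall back to xml:lang.
        lang = attr_map.get("lang") or attr_map.get("xml:lang")
        if not lang:
            return
        if tag == "html":
            self.document_lang = lang
        else:
            self.element_langs.append((tag, lang))


def collect_langs(html_text):
    """Return (document level language, list of element level declarations)."""
    parser = LangCollector()
    parser.feed(html_text)
    parser.close()
    return parser.document_lang, parser.element_langs
```

A mixed language document would show up as a document level value plus one or more element level entries that differ from it.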

Sometimes the language is specified in the HTTP response headers:

Content-Language: de

when specified in the server configuration (Apache example):

AddLanguage de .html

This is sometimes specified as

<META HTTP-EQUIV="Content-Language" CONTENT="de">

in the HTML.
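
A minimal Python sketch of looking up the Content-Language declaration from either place might look like the following. The names MetaLanguageParser and declared_content_language are hypothetical, and checking the response header before the META element is an assumption made for this example, not an established rule. The value is returned as declared, so it may be a comma separated list of languages.

```python
from html.parser import HTMLParser


class MetaLanguageParser(HTMLParser):
    """Find the first <meta http-equiv="Content-Language"> declaration.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.content_language = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.content_language is not None:
            return
        # Attribute names arrive lowercased; values may be None.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("http-equiv", "").lower() == "content-language":
            value = attr_map.get("content", "").strip()
            if value:
                self.content_language = value


def declared_content_language(headers, html_text):
    """Return the declared Content-Language, or None if there is none.

    headers is a plain dict of HTTP response headers.
    Assumption: the response header is consulted before the META element.
    """
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "content-language" and value.strip():
            return value.strip()
    parser = MetaLanguageParser()
    parser.feed(html_text)
    parser.close()
    return parser.content_language
```

As noted above, the result is only as good as the declaration itself; an indexer would probably still want a fallback for documents that declare nothing or declare the wrong language.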


- Sean Carlos

http://www.antezeta.com/

Follow-Ups:
- Re: searching for parts of words
  - From: D Bera

References:
- Re: searching for parts of words
  - From: Bernhard Kleine
- Re: searching for parts of words
  - From: Joe Shaw
