Re: searching for parts of words



Joe Shaw wrote:

[cut]

We need a strategy for how to determine the language of a document.  I
don't think that just going with LANG is the right thing, because the
number of English documents a user will encounter on the Internet is
significant, and we'd really like those to be indexed correctly.


This is a tough nut to crack.

One approach is to start with documents which carry language clues in
their meta information.  This will of course fail when the meta information
is incorrect, and mixed-language documents pose their own challenges.

For what it's worth, here is my understanding of what is available as meta information for HTML documents.

HTML documents may carry a lang attribute on the html element:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US" lang="en-US">

indicates the US dialect of English.

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="it" lang="it">

indicates Italian. There can be a document-level declaration as well as element-level declarations.

Ref: http://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.1
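As a rough illustration of how an indexer might read these declarations, here is a minimal Python sketch using the standard library's html.parser. The names LangAttrParser and declared_languages are made up for this example, and preferring lang over xml:lang is only an assumption about what would be sensible:

```python
from html.parser import HTMLParser


class LangAttrParser(HTMLParser):
    """Collect lang / xml:lang declarations (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.document_lang = None   # language declared on <html>
        self.element_langs = []     # (tag, language) for other elements

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Assumption: prefer lang, fall back to xml:lang.
        # HTMLParser lowercases tag and attribute names for us.
        lang = attrs.get("lang") or attrs.get("xml:lang")
        if not lang:
            return
        if tag == "html" and self.document_lang is None:
            self.document_lang = lang
        else:
            self.element_langs.append((tag, lang))


def declared_languages(markup):
    """Return (document-level language, list of element-level ones)."""
    parser = LangAttrParser()
    parser.feed(markup)
    parser.close()
    return parser.document_lang, parser.element_langs
```

A document with no declaration at all yields (None, []), in which case some other strategy would still be needed.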

Sometimes the language is specified in the HTTP response headers:

Content-Language: de

when specified in the server configuration (Apache example):

AddLanguage de .html

The language is sometimes also declared as

<META HTTP-EQUIV="Content-Language" CONTENT="de">

in the HTML itself.
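To make the header and META cases concrete, here is a small Python sketch that gathers both kinds of hint. It is only an illustration: content_language_hints, MetaLanguageParser and split_language_tags are names invented here, the response headers are assumed to arrive as a plain dict, and listing the HTTP header before the META value is an assumption rather than anything decided in this thread. Since a Content-Language value may list several languages separated by commas, the value is split accordingly.

```python
from html.parser import HTMLParser


def split_language_tags(value):
    """Split a Content-Language value such as "de, en-US" into a list."""
    if not value:
        return []
    return [tag.strip() for tag in value.split(",") if tag.strip()]


class MetaLanguageParser(HTMLParser):
    """Find <meta http-equiv="Content-Language" content="..."> values."""

    def __init__(self):
        super().__init__()
        self.languages = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # Attribute names are lowercased by HTMLParser; values are not.
        http_equiv = (attrs.get("http-equiv") or "").lower()
        if http_equiv == "content-language":
            self.languages.extend(split_language_tags(attrs.get("content")))


def content_language_hints(headers, markup):
    """Return language tags from the HTTP header first, then from META.

    Assumption: `headers` is a plain dict of response headers.
    """
    header_value = None
    for name, value in headers.items():
        if name.lower() == "content-language":
            header_value = value
            break
    hints = split_language_tags(header_value)
    parser = MetaLanguageParser()
    parser.feed(markup)
    parser.close()
    for lang in parser.languages:
        if lang not in hints:
            hints.append(lang)
    return hints
```

An empty result just means neither hint was present; as noted above, a non-empty one can still be wrong.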


- Sean Carlos

http://www.antezeta.com/





