Re: searching for parts of words

Joe Shaw wrote:


We need a strategy for how to determine the language of a document.  I
don't think that just going with LANG is the right thing, because the
number of English documents a user will encounter on the Internet is
significant, and we'd really like those to be indexed correctly.

This is a tough nut to crack.

A stab might be made for documents which carry language clues in their
meta information.  This of course will fail when the meta information is
incorrect, and mixed-language documents will carry their own challenges.

For what it's worth, here is my understanding of what is available as meta information for HTML documents.

HTML documents may carry a lang attribute on the html element:

<html xmlns=""; xml:lang="en-US" lang="en-US">

indicates the US dialect of English.

<html xmlns=""; xml:lang="it" lang="it">

indicates Italian. There can be a document-level declaration as well as element-level declarations.


Sometimes the language is specified in the HTTP response headers:

Content-Language: de

when specified in the server configuration (Apache example):

AddLanguage de .html

The same information is sometimes given inside the document itself, as

<META HTTP-EQUIV="Content-Language" CONTENT="de">

in the HTML.
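
As a rough illustration of how these three clues might be collected, here is a minimal sketch in Python using only the standard library. The names LangHintParser and guess_language are made up for this example, the headers argument is assumed to be a plain dict of response headers, and the order in which the clues are tried is an arbitrary choice, not something any spec or indexer prescribes.

```python
from html.parser import HTMLParser


class LangHintParser(HTMLParser):
    """Collects language hints from the <html> tag and from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.html_lang = None  # from <html lang="..."> or xml:lang="..."
        self.meta_lang = None  # from <meta http-equiv="Content-Language">

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling us.
        attrs = dict(attrs)
        if tag == "html" and self.html_lang is None:
            self.html_lang = attrs.get("lang") or attrs.get("xml:lang")
        elif tag == "meta" and self.meta_lang is None:
            if (attrs.get("http-equiv") or "").lower() == "content-language":
                self.meta_lang = attrs.get("content")


def guess_language(html_text, headers=None):
    """Return the first language hint found, or None if there is none.

    `headers` is an assumed dict of HTTP response headers (hypothetical input).
    """
    parser = LangHintParser()
    parser.feed(html_text)

    header_lang = None
    for name, value in (headers or {}).items():
        if name.lower() == "content-language":
            header_lang = value

    # Arbitrary precedence for this sketch: markup first, then the HTTP header.
    for hint in (parser.html_lang, parser.meta_lang, header_lang):
        if hint:
            return hint.strip()
    return None
```

Calling guess_language(page_text, response_headers) returns whatever tag the document declares (for example "en-US" or "de") without validating it, so the caveat above still applies: a wrong declaration gives a wrong answer, and a None result means some other strategy is needed.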

- Sean Carlos
