[xml] HTML-parser: encoding?
- From: Elizabeth Mattijsen <liz dijkmat nl>
- To: xml gnome org
- Subject: [xml] HTML-parser: encoding?
- Date: Thu, 29 Nov 2001 15:52:04 +0100
The HTML parser in libxml2 is very nice, but I wonder what its real goal
is. (Has there been a discussion about that? If so, I seem to have
missed it.)
If it is there to let you take _any_ (dirty) HTML file and turn it
into a valid XML DOM, then its functionality is not yet complete.
Currently, if no encoding specification is found in an HTML file,
ISO-Latin-1 is assumed. However, no check is performed on whether all
text characters actually fall within ISO-Latin-1!
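To illustrate why the missing check matters, here is a minimal Python sketch (not libxml2 code). It assumes, purely for illustration, that the input was written in EUC-KR; the helper name `decode_as_latin1` is hypothetical. Decoding as ISO-Latin-1 never fails, because every byte value maps to some code point, so a wrong guess goes unnoticed and is carried through to the output.

```python
def decode_as_latin1(raw: bytes) -> str:
    """Decode bytes as ISO-8859-1; this succeeds for any byte sequence."""
    return raw.decode("iso-8859-1")


# Assumption for illustration: the page was authored in EUC-KR.
raw = "한".encode("euc_kr")            # two bytes: 0xC7 0xD1
misread = decode_as_latin1(raw)        # no error, but two wrong characters
reserialised = misread.encode("utf-8") # the wrong characters, now as UTF-8

print(raw.hex(), repr(misread), reserialised.hex())
```

The original character is lost without any error being raised at parse time.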
As a result, an HTML file that has no encoding specification but
contains characters that are _not_ part of ISO-Latin-1 is serialised
incorrectly. See the attached file (61.html):
# xmllint --html 61.html >1 2>/dev/null
# xmllint 1
1:20: error: Input is not proper UTF-8, indicate encoding !
[*snip*]
1:20: error: Bytes: 0xC3 0xC3 0xC2 0xC2
[*snip*]
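The error above says the serialised file is not valid UTF-8. As a rough way to locate such a problem without xmllint, the following Python sketch reports the offset of the first byte that breaks UTF-8 decoding. The function name `first_invalid_utf8_offset` is hypothetical, and the sample bytes are simply the four reported in the error message.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that is not valid UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start  # index of the first offending byte
    return None


# The byte run reported by xmllint: 0xC3 starts a two-byte sequence,
# but the next byte (0xC3) is not a continuation byte.
sample = b"abc" + bytes([0xC3, 0xC3, 0xC2, 0xC2])
print(first_invalid_utf8_offset(sample))
```

Running a check like this on the output of the first command would show where the serialisation went wrong, before the XML parser is involved.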
I would propose that _if_ the HTML parser is used _and_ _no_ encoding
specification is found, libxml _then_ checks all of the text in the
tree for characters that are illegal in the ISO-Latin-1 encoding and
replaces them with spaces (so that the size of the buffer used does not change).
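A minimal sketch of the proposed clean-up, in Python rather than inside libxml2. It assumes that "illegal for ISO-Latin-1" means the control ranges 0x00-0x1F (except tab, LF and CR) and 0x7F-0x9F; that definition and the names `is_printable_latin1` and `blank_illegal_latin1` are assumptions made for illustration, not anything libxml2 provides.

```python
# Assumption: tab, line feed and carriage return are kept as-is.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def is_printable_latin1(byte: int) -> bool:
    """True if the byte is an allowed control or a printable ISO-8859-1 character."""
    if byte in ALLOWED_CONTROLS:
        return True
    return 0x20 <= byte <= 0x7E or 0xA0 <= byte <= 0xFF


def blank_illegal_latin1(buf: bytearray) -> int:
    """Overwrite disallowed bytes with spaces in place; return how many were replaced."""
    replaced = 0
    for i, byte in enumerate(buf):
        if not is_printable_latin1(byte):
            buf[i] = 0x20  # one byte in, one byte out: buffer size is unchanged
            replaced += 1
    return replaced


text = bytearray(b"caf\xe9 \x85ok\x00")
count = blank_illegal_latin1(text)
print(count, bytes(text))
```

Because each offending byte is replaced by exactly one space, the buffer keeps its length, which is the property the proposal asks for.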
Comments, anyone?
Elizabeth Mattijsen
[Attachment 61.html: a Korean-language HTML page with no encoding declaration. The archive displays its bytes as ISO-Latin-1, so the text is unreadable here.]