[xml] HTML-parser: encoding?



The HTML parser of libxml2 is very nice, but I wonder what its real goal is. (Has there been a discussion about that? If so, I seem to have missed it.)

If the goal is to let you take _any_ (dirty) HTML file and turn it into a valid XML DOM, then its functionality is still incomplete.

Currently, if no encoding specification is found in an HTML file, ISO-Latin-1 is assumed. However, no check is performed on whether all text characters actually fall within ISO-Latin-1!

As a result, an HTML file that has no encoding specification but contains characters that are _not_ part of ISO-Latin-1 is serialised incorrectly. See the attached file:

# xmllint --html 61.html >1 2>/dev/null
# xmllint 1
1:20: error: Input is not proper UTF-8, indicate encoding !
[*snip*]
1:20: error: Bytes: 0xC3 0xC3 0xC2 0xC2
[*snip*]
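
To illustrate why the problem goes unnoticed at parse time, here is a minimal Python sketch (it does not use libxml2). It assumes that the attachment is an EUC-KR page and that the bytes below correspond to the recurring "Ç÷£" sequence in it; both are guesses based on how the file displays. The helper name is_valid_utf8 is made up for this example.

```python
# Bytes assumed to lie behind the "Ç÷£" sequence seen in the attachment.
raw = b"\xc7\xc3\xb7\xa3"

# ISO-Latin-1 assigns a code point to every byte value, so this decode
# can never fail and gives no hint that the encoding guess was wrong.
as_latin1 = raw.decode("latin-1")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Assumption: the page was really written in EUC-KR.
as_euc_kr = raw.decode("euc-kr")

print(repr(as_latin1))     # four Latin-1 characters, one per byte
print(is_valid_utf8(raw))  # the same bytes are rejected as UTF-8
print(len(as_euc_kr))      # two Korean syllables under the EUC-KR assumption
```

The point of the sketch is only that a Latin-1 default accepts any input, whereas a later UTF-8 reader of the same bytes complains, which is the kind of error the second xmllint run reports.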

I would propose that _if_ the HTML parser is used _and_ _no_ encoding specification is found, libxml _then_ checks all of the text in the tree for characters that are illegal in ISO-Latin-1 and replaces them with spaces (so that the size of the buffer in use does not change).
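
A rough sketch of what such a same-size replacement could look like, in Python rather than in libxml2's own C. The proposal does not define "illegal", so this sketch assumes it means bytes with no printable ISO-Latin-1 meaning: control bytes other than tab, LF and CR, plus the 0x7F-0x9F range. The function name blank_non_latin1 is hypothetical.

```python
# Control bytes that are normal in text and should be kept (tab, LF, CR).
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def blank_non_latin1(buf: bytes) -> bytes:
    """Replace bytes assumed illegal in ISO-Latin-1 text with spaces.

    The result has the same length as the input, so no buffer needs to
    be resized.
    """
    out = bytearray(buf)
    for i, b in enumerate(out):
        is_c0_control = b < 0x20 and b not in ALLOWED_CONTROLS
        is_del_or_c1 = 0x7F <= b <= 0x9F
        if is_c0_control or is_del_or_c1:
            out[i] = 0x20  # ASCII space
    return bytes(out)


cleaned = blank_non_latin1(b"a\x85b\tc")
print(cleaned)
```

Under this reading, bytes in the 0xA0-0xFF range are printable ISO-Latin-1 and pass through unchanged, so text in a multi-byte encoding that uses that range would need a different check.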


Comments, anyone?


Elizabeth Mattijsen
[Attachment 61.html, shown by the archive as mis-decoded text. It appears to be a Korean-language web page, probably EUC-KR encoded, for Plan International / Plan Korea. Title, approximately: "Plan International, working for underprivileged children in developing countries".]
