13415 – Add UTF-32 support for html/xml documents

RESOLVED FIXED 13415

Add UTF-32 support for html/xml documents

https://bugs.webkit.org/show_bug.cgi?id=13415

Summary Add UTF-32 support for html/xml documents

Jungshik Shin

Reported 2007-04-19 21:17:01 PDT

bug 10709 is about parsing CSS stylesheets in UTF-16/32. For html/xml documents, UTF-16 is detected and supported, but UTF-32 (BE and LE) is not.

Attachments
a preliminary patch (6.32 KB, patch) 2007-04-19 21:38 PDT, Jungshik Shin	ddkilzer: review-	Details Formatted Diff Diff
patch with test cases and changelog entries (23.10 KB, patch) 2007-04-26 17:37 PDT, Jungshik Shin	mjs: review-	Details Formatted Diff Diff
a new patch with updated test cases (30.33 KB, patch) 2007-04-27 00:03 PDT, Jungshik Shin	ap: review-	Details Formatted Diff Diff
patch updated (31.52 KB, patch) 2007-06-28 20:05 PDT, Jungshik Shin	ap: review-	Details Formatted Diff Diff
patch updated once more (31.23 KB, patch) 2007-06-29 15:54 PDT, Jungshik Shin	mrowe: review-	Details Formatted Diff Diff
patch without tabs (31.00 KB, patch) 2007-07-06 01:20 PDT, Jungshik Shin	mjs: review+	Details Formatted Diff Diff
Show Obsolete (5) View All Add attachment proposed patch, testcase, etc.

Jungshik Shin

Comment 1 2007-04-19 21:38:05 PDT

Created attachment 14103 [details] a preliminary patch

Dave Hyatt

Comment 2 2007-04-19 22:01:59 PDT

Patch looks good to me.

Jungshik Shin

Comment 3 2007-04-20 05:40:27 PDT

test cases. Not all test cases work because some of them rely on a 'psuedo-BOM for HTML', which is not supported by WebKit, perhaps intentionally (HTML standard does not have such a provision while XML has that in the appendix). Test cases for 'pseudo-BOM for XML' (supported by WebKit) need to be added.

David Kilzer (:ddkilzer)

Comment 4 2007-04-20 12:26:33 PDT

Comment on attachment 14103 [details] a preliminary patch Per Comment #2, the basic patch is good, but it needs a ChangeLog entry and test cases. Please include layout test cases (with the patch) when you resumit the patch. More information on creating a ChangeLog entry and a patch is here: http://webkit.org/coding/contributing.html Scripts are in the WebKitTools/Scripts directory. Thanks!

Jungshik Shin

Comment 5 2007-04-20 19:48:30 PDT

David, according to the document you mentioned, I'm not supposed to use fonts that don't come with Mac OS X. Is there any Mac OS X standard font supporting non-BMP characters? Because WebKit renders 'U+0000' with zero-width glyph, it's impossible to tell whether UTF-32LE is interpreted as UTF-32LE or as UTF-16LE unless I include some non-BMP characters. A Japanese font (to support JIS X 213 characters) may cover some CJK ideographs in plane 2. I have to check that out. Alternatively, I might be able to use a Javascript to tell which is the case.

Alexey Proskuryakov

Comment 6 2007-04-21 00:57:58 PDT

(In reply to comment #5) > Alternatively, I might be able to use a Javascript to tell which is the case. That looks like the easiest alternative - document.inputEncoding will contain the detected encoding name.

Jungshik Shin

Comment 7 2007-04-26 17:37:43 PDT

Created attachment 14218 [details] patch with test cases and changelog entries Added test cases for UTF-32BE (html with BOM and xml without BOM) and UTF-32LE (html with BOM and xml without BOM). I can't make a xml test case for UTF-32LE that prints out document.inputEncoding. I tried document.write() and a few DOM-based methods to add a node, "<p>{the value of document.inputEncoding}</p>", but I couldn't make it work in Safari. For UTF-32BE, just comparing the output (without document.inputEncoding) is sufficient because it cannot be mistaken for UTF-16BE or other encodings. For UTF-32LE, we'd better print out the value of document.inputEncoding.

Alexey Proskuryakov

Comment 8 2007-04-26 21:40:05 PDT

To dynamically create a <p> element, be sure to use createElementNS with XHTML namespace (unlike Firefox, WebKit does not try to guess what namespace to use with createElement; we have a bug tracking this). Alternatively, the following should work: --------------------- <div id="result">FAILURE: script did't run</div> <script>document.getElementById("result").firstChild.nodeValue = document.inputEncoding;</script> ---------------------

Maciej Stachowiak

Comment 9 2007-04-26 22:31:40 PDT

Comment on attachment 14218 [details] patch with test cases and changelog entries r- for now to consider Alexey's comment, please flag for review with new patch.

Jungshik Shin

Comment 10 2007-04-27 00:03:34 PDT

Created attachment 14220 [details] a new patch with updated test cases Modified test cases per ap's suggestion. Both XML and HTML print out the encoding determined. > --------------------- > <div id="result">FAILURE: script did't run</div> > <script>document.getElementById("result").firstChild.nodeValue = > document.inputEncoding;</script> > --------------------- Actually, I tried the above as well. However, I forgot that I had to make it run after the loading is complete (the script was put in the <head> without an onload triggering.). As a result, it didn't run even in Firefox. Thanks to your suggestion, I realized that :-)

Jungshik Shin

Comment 11 2007-04-30 17:08:40 PDT

Alexey (or anyone else), can you review the latest patch?

Alexey Proskuryakov

Comment 12 2007-05-01 01:32:17 PDT

Comment on attachment 14220 [details] a new patch with updated test cases r- for now, but the patch looks really good to me! + unsigned char c3 = buf1Len ? (--buf1Len, *buf1++) : buf2Len ? (--buf2Len, *buf2++) : 0; + unsigned char c4 = buf2Len ? (--buf2Len, *buf2++) : 0; ... + if (c1 == 0xFF && c2 == 0xFE) { + if (c3 != 0 || c4 != 0) + encodingConsideringBOM = &UTF16LittleEndianEncoding(); + else + encodingConsideringBOM = &UTF32LittleEndianEncoding(); + } You need to check why c3 and c4 are zero - if there are only two bytes available yet, it may be too early to assume UTF-32. else if (numBufferedBytes + length <= sizeof(m_bufferedBytes) && !flush) { // Continue to look for the BOM. I believe m_bufferedBytes size should be increased to three bytes, now that we deal with 4-byte BOMs.

Jungshik Shin

Comment 13 2007-06-28 20:05:11 PDT

Created attachment 15303 [details] patch updated patch updated per comments.

Alexey Proskuryakov

Comment 14 2007-06-29 02:12:59 PDT

Comment on attachment 15303 [details] patch updated It seems that the change to html4.css does not belong to this patch. + setEncoding(((ptr - m_buffer.data()) % 4) ? "UTF-32LE" : "UTF-32BE", AutoDetectedEncoding); Since you've replaced encoding names with function calls in several places, I suggest using those here, as well. + //else if (numBufferedBytes >= 3 || length >= 3) We really do not like commented-out code in sources, please remove this line. Coming to more important issues: I don't think I understand the algorithm in TextDecoder::checkForBOM(). Why did you change m_bufferedBytes size to 4 bytes? If we have already seen 4 bytes of the input, then we are done with BOM detection, so only 3 bytes need be buffered. And the "encodingConsideringBOM != &m_encoding" check looks wrong, comparing TextEncoding by address is not a good way to detect that we have successfully found a BOM. This is extremely close to an r+, but I think that another iteration is needed to iron out these issues.

Jungshik Shin

Comment 15 2007-06-29 15:54:12 PDT

Created attachment 15320 [details] patch updated once more thanks for the review. pls, take another look

Alexey Proskuryakov

Comment 16 2007-07-02 03:51:19 PDT

Comment on attachment 15320 [details] patch updated once more Looks great, r=me! Attention committer: there is a tab character in the patch.

Mark Rowe (bdash)

Comment 17 2007-07-05 07:27:03 PDT

Comment on attachment 15320 [details] patch updated once more I went to land this patch but the tabs in the various UTF-32 files made it rather tricky. After futzing around for a few minutes trying to find an editor that'd handle all four UTF-32 variations, I gave up. Can you please fix the four UTF-32 files, and the tab that's in the source file while you're at it :) It's also unclear to me why the some of the tests are .xml and others are .html. Is there some reason for this?

Jungshik Shin

Comment 18 2007-07-05 23:58:09 PDT

What do you mean by tabs in 4 UTF-32 files? They're NOT ASCII-compatible and they should be treated as binary. As for xml vs html, WebKit currently use different encoding sniffing rules for html and xml, which is why I have separate test cases for html and xml. The same is true of UTF-16 test cases. I'll get rid of tabs in the source code and upload a new patch.

Jungshik Shin

Comment 19 2007-07-06 00:10:45 PDT

Ahah. If it's required that even html/xml files for test are tab-free (in whatever encoding), I'll get rid of them in 4 UTF-32 files.

Jungshik Shin

Comment 20 2007-07-06 01:20:12 PDT

Created attachment 15413 [details] patch without tabs tabs are removed in the source code as well as in UTF-32 files.

Maciej Stachowiak

Comment 21 2007-07-06 02:44:14 PDT

Comment on attachment 15413 [details] patch without tabs Thanks for removing the tabs.

Mark Rowe (bdash)

Comment 22 2007-07-06 02:59:54 PDT

My question about HTML vs XML was that you have only four test cases -- UTF-32BE with and without BOM, UTF-32LE with and without BOM. The BOM-less cases are both XML, the ones with BOM are both HTML. What about BOM-less HTML? Or XML with a BOM?

Mark Rowe (bdash)

Comment 23 2007-07-06 03:02:18 PDT

Landed in r24052.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P3

Severity Minor

Classification Unclassified

Version 523.x (Safari 3)

Hardware All

OS All

Product WebKit

Component Page Loading

Assignee

Nobody

Reported

2007-04-19 21:17 PDT

Modified

2007-07-06 03:02 PDT History

CC List

2 users Show

URL

http://jshin.net/i18n/utftest

Keywords

Depends on

Blocks