Bug 13415

Summary:

Add UTF-32 support for html/xml documents

Product:

WebKit

Reporter:

Jungshik Shin <jshin>

Component:

Page Loading

Assignee:

Nobody <webkit-unassigned>

Status:

RESOLVED FIXED

Severity:

Minor

CC:

ap, mrowe

Priority:

Version:

523.x (Safari 3)

Hardware:

All

OS:

All

URL:

http://jshin.net/i18n/utftest

Attachments:

Description	Flags
a preliminary patch	ddkilzer: review-
patch with test cases and changelog entries	mjs: review-
a new patch with updated test cases	ap: review-
patch updated	ap: review-
patch updated once more	mrowe: review-
patch without tabs	mjs: review+

Jungshik Shin

Reported 2007-04-19 21:17:01 PDT

bug 10709 is about parsing CSS stylesheets in UTF-16/32. For html/xml documents, UTF-16 is detected and supported, but UTF-32 (BE and LE) is not.

Attachments
a preliminary patch (6.32 KB, patch) 2007-04-19 21:38 PDT, Jungshik Shin	ddkilzer: review-	Details Formatted Diff Diff
patch with test cases and changelog entries (23.10 KB, patch) 2007-04-26 17:37 PDT, Jungshik Shin	mjs: review-	Details Formatted Diff Diff
a new patch with updated test cases (30.33 KB, patch) 2007-04-27 00:03 PDT, Jungshik Shin	ap: review-	Details Formatted Diff Diff
patch updated (31.52 KB, patch) 2007-06-28 20:05 PDT, Jungshik Shin	ap: review-	Details Formatted Diff Diff
patch updated once more (31.23 KB, patch) 2007-06-29 15:54 PDT, Jungshik Shin	mrowe: review-	Details Formatted Diff Diff
patch without tabs (31.00 KB, patch) 2007-07-06 01:20 PDT, Jungshik Shin	mjs: review+	Details Formatted Diff Diff
Show Obsolete (5) View All Add attachment proposed patch, testcase, etc.

Jungshik Shin

Comment 1 2007-04-19 21:38:05 PDT

Created attachment 14103 [details] a preliminary patch

Dave Hyatt

Comment 2 2007-04-19 22:01:59 PDT

Patch looks good to me.

Jungshik Shin

Comment 3 2007-04-20 05:40:27 PDT

test cases. Not all test cases work because some of them rely on a 'psuedo-BOM for HTML', which is not supported by WebKit, perhaps intentionally (HTML standard does not have such a provision while XML has that in the appendix). Test cases for 'pseudo-BOM for XML' (supported by WebKit) need to be added.

David Kilzer (:ddkilzer)

Comment 4 2007-04-20 12:26:33 PDT

Comment on attachment 14103 [details] a preliminary patch Per Comment #2, the basic patch is good, but it needs a ChangeLog entry and test cases. Please include layout test cases (with the patch) when you resumit the patch. More information on creating a ChangeLog entry and a patch is here: http://webkit.org/coding/contributing.html Scripts are in the WebKitTools/Scripts directory. Thanks!

Jungshik Shin

Comment 5 2007-04-20 19:48:30 PDT

David, according to the document you mentioned, I'm not supposed to use fonts that don't come with Mac OS X. Is there any Mac OS X standard font supporting non-BMP characters? Because WebKit renders 'U+0000' with zero-width glyph, it's impossible to tell whether UTF-32LE is interpreted as UTF-32LE or as UTF-16LE unless I include some non-BMP characters. A Japanese font (to support JIS X 213 characters) may cover some CJK ideographs in plane 2. I have to check that out. Alternatively, I might be able to use a Javascript to tell which is the case.

Alexey Proskuryakov

Comment 6 2007-04-21 00:57:58 PDT

(In reply to comment #5) > Alternatively, I might be able to use a Javascript to tell which is the case. That looks like the easiest alternative - document.inputEncoding will contain the detected encoding name.

Jungshik Shin

Comment 7 2007-04-26 17:37:43 PDT

Created attachment 14218 [details] patch with test cases and changelog entries Added test cases for UTF-32BE (html with BOM and xml without BOM) and UTF-32LE (html with BOM and xml without BOM). I can't make a xml test case for UTF-32LE that prints out document.inputEncoding. I tried document.write() and a few DOM-based methods to add a node, "<p>{the value of document.inputEncoding}</p>", but I couldn't make it work in Safari. For UTF-32BE, just comparing the output (without document.inputEncoding) is sufficient because it cannot be mistaken for UTF-16BE or other encodings. For UTF-32LE, we'd better print out the value of document.inputEncoding.

Alexey Proskuryakov

Comment 8 2007-04-26 21:40:05 PDT

To dynamically create a <p> element, be sure to use createElementNS with XHTML namespace (unlike Firefox, WebKit does not try to guess what namespace to use with createElement; we have a bug tracking this). Alternatively, the following should work: --------------------- <div id="result">FAILURE: script did't run</div> <script>document.getElementById("result").firstChild.nodeValue = document.inputEncoding;</script> ---------------------

Maciej Stachowiak

Comment 9 2007-04-26 22:31:40 PDT

Comment on attachment 14218 [details] patch with test cases and changelog entries r- for now to consider Alexey's comment, please flag for review with new patch.

Jungshik Shin

Comment 10 2007-04-27 00:03:34 PDT

Created attachment 14220 [details] a new patch with updated test cases Modified test cases per ap's suggestion. Both XML and HTML print out the encoding determined. > --------------------- > <div id="result">FAILURE: script did't run</div> > <script>document.getElementById("result").firstChild.nodeValue = > document.inputEncoding;</script> > --------------------- Actually, I tried the above as well. However, I forgot that I had to make it run after the loading is complete (the script was put in the <head> without an onload triggering.). As a result, it didn't run even in Firefox. Thanks to your suggestion, I realized that :-)

Jungshik Shin

Comment 11 2007-04-30 17:08:40 PDT

Alexey (or anyone else), can you review the latest patch?

Alexey Proskuryakov

Comment 12 2007-05-01 01:32:17 PDT

Comment on attachment 14220 [details] a new patch with updated test cases r- for now, but the patch looks really good to me! + unsigned char c3 = buf1Len ? (--buf1Len, *buf1++) : buf2Len ? (--buf2Len, *buf2++) : 0; + unsigned char c4 = buf2Len ? (--buf2Len, *buf2++) : 0; ... + if (c1 == 0xFF && c2 == 0xFE) { + if (c3 != 0 || c4 != 0) + encodingConsideringBOM = &UTF16LittleEndianEncoding(); + else + encodingConsideringBOM = &UTF32LittleEndianEncoding(); + } You need to check why c3 and c4 are zero - if there are only two bytes available yet, it may be too early to assume UTF-32. else if (numBufferedBytes + length <= sizeof(m_bufferedBytes) && !flush) { // Continue to look for the BOM. I believe m_bufferedBytes size should be increased to three bytes, now that we deal with 4-byte BOMs.

Jungshik Shin

Comment 13 2007-06-28 20:05:11 PDT

Created attachment 15303 [details] patch updated patch updated per comments.

Alexey Proskuryakov

Comment 14 2007-06-29 02:12:59 PDT

Comment on attachment 15303 [details] patch updated It seems that the change to html4.css does not belong to this patch. + setEncoding(((ptr - m_buffer.data()) % 4) ? "UTF-32LE" : "UTF-32BE", AutoDetectedEncoding); Since you've replaced encoding names with function calls in several places, I suggest using those here, as well. + //else if (numBufferedBytes >= 3 || length >= 3) We really do not like commented-out code in sources, please remove this line. Coming to more important issues: I don't think I understand the algorithm in TextDecoder::checkForBOM(). Why did you change m_bufferedBytes size to 4 bytes? If we have already seen 4 bytes of the input, then we are done with BOM detection, so only 3 bytes need be buffered. And the "encodingConsideringBOM != &m_encoding" check looks wrong, comparing TextEncoding by address is not a good way to detect that we have successfully found a BOM. This is extremely close to an r+, but I think that another iteration is needed to iron out these issues.

Jungshik Shin

Comment 15 2007-06-29 15:54:12 PDT

Created attachment 15320 [details] patch updated once more thanks for the review. pls, take another look

Alexey Proskuryakov

Comment 16 2007-07-02 03:51:19 PDT

Comment on attachment 15320 [details] patch updated once more Looks great, r=me! Attention committer: there is a tab character in the patch.

Mark Rowe (bdash)

Comment 17 2007-07-05 07:27:03 PDT

Comment on attachment 15320 [details] patch updated once more I went to land this patch but the tabs in the various UTF-32 files made it rather tricky. After futzing around for a few minutes trying to find an editor that'd handle all four UTF-32 variations, I gave up. Can you please fix the four UTF-32 files, and the tab that's in the source file while you're at it :) It's also unclear to me why the some of the tests are .xml and others are .html. Is there some reason for this?

Jungshik Shin

Comment 18 2007-07-05 23:58:09 PDT

What do you mean by tabs in 4 UTF-32 files? They're NOT ASCII-compatible and they should be treated as binary. As for xml vs html, WebKit currently use different encoding sniffing rules for html and xml, which is why I have separate test cases for html and xml. The same is true of UTF-16 test cases. I'll get rid of tabs in the source code and upload a new patch.

Jungshik Shin

Comment 19 2007-07-06 00:10:45 PDT

Ahah. If it's required that even html/xml files for test are tab-free (in whatever encoding), I'll get rid of them in 4 UTF-32 files.

Jungshik Shin

Comment 20 2007-07-06 01:20:12 PDT

Created attachment 15413 [details] patch without tabs tabs are removed in the source code as well as in UTF-32 files.

Maciej Stachowiak

Comment 21 2007-07-06 02:44:14 PDT

Comment on attachment 15413 [details] patch without tabs Thanks for removing the tabs.

Mark Rowe (bdash)

Comment 22 2007-07-06 02:59:54 PDT

My question about HTML vs XML was that you have only four test cases -- UTF-32BE with and without BOM, UTF-32LE with and without BOM. The BOM-less cases are both XML, the ones with BOM are both HTML. What about BOM-less HTML? Or XML with a BOM?

Mark Rowe (bdash)

Comment 23 2007-07-06 03:02:18 PDT

Landed in r24052.

Note You need to log in before you can comment on or make changes to this bug.