Bug 13415 - Add UTF-32 support for html/xml documents
Summary: Add UTF-32 support for html/xml documents
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Page Loading (show other bugs)
Version: 523.x (Safari 3)
Hardware: All All
: P3 Minor
Assignee: Nobody
URL: http://jshin.net/i18n/utftest
Keywords:
Depends on:
Blocks:
 
Reported: 2007-04-19 21:17 PDT by Jungshik Shin
Modified: 2007-07-06 03:02 PDT (History)
2 users (show)

See Also:


Attachments
a preliminary patch (6.32 KB, patch)
2007-04-19 21:38 PDT, Jungshik Shin
ddkilzer: review-
Details | Formatted Diff | Diff
patch with test cases and changelog entries (23.10 KB, patch)
2007-04-26 17:37 PDT, Jungshik Shin
mjs: review-
Details | Formatted Diff | Diff
a new patch with updated test cases (30.33 KB, patch)
2007-04-27 00:03 PDT, Jungshik Shin
ap: review-
Details | Formatted Diff | Diff
patch updated (31.52 KB, patch)
2007-06-28 20:05 PDT, Jungshik Shin
ap: review-
Details | Formatted Diff | Diff
patch updated once more (31.23 KB, patch)
2007-06-29 15:54 PDT, Jungshik Shin
mrowe: review-
Details | Formatted Diff | Diff
patch without tabs (31.00 KB, patch)
2007-07-06 01:20 PDT, Jungshik Shin
mjs: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Jungshik Shin 2007-04-19 21:17:01 PDT
bug 10709 is about parsing CSS stylesheets in UTF-16/32. For html/xml documents, UTF-16 is detected and supported, but UTF-32 (BE and LE) is not.
Comment 1 Jungshik Shin 2007-04-19 21:38:05 PDT
Created attachment 14103 [details]
a preliminary patch
Comment 2 Dave Hyatt 2007-04-19 22:01:59 PDT
Patch looks good to me.
Comment 3 Jungshik Shin 2007-04-20 05:40:27 PDT
test cases. Not all test cases work because some of them rely on a 'psuedo-BOM for HTML', which is not supported by WebKit, perhaps intentionally (HTML standard does not have such a provision while XML has that in the appendix). Test cases for 'pseudo-BOM for XML' (supported by WebKit) need to be added. 
Comment 4 David Kilzer (:ddkilzer) 2007-04-20 12:26:33 PDT
Comment on attachment 14103 [details]
a preliminary patch

Per Comment #2, the basic patch is good, but it needs a ChangeLog entry and test cases.

Please include layout test cases (with the patch) when you resumit the patch.

More information on creating a ChangeLog entry and a patch is here:

http://webkit.org/coding/contributing.html

Scripts are in the WebKitTools/Scripts directory.  Thanks!
Comment 5 Jungshik Shin 2007-04-20 19:48:30 PDT
David, according to the document you mentioned, I'm not supposed to use fonts that don't  come with Mac OS X. Is there any Mac OS X standard font supporting non-BMP characters? Because WebKit renders 'U+0000' with zero-width glyph, it's impossible to tell whether UTF-32LE is interpreted as UTF-32LE or as UTF-16LE unless I include some non-BMP characters.  

A Japanese font (to support JIS X 213 characters) may cover some CJK ideographs in plane 2. I have to check that out. 

Alternatively, I might be able to use a Javascript to tell which is the case. 
Comment 6 Alexey Proskuryakov 2007-04-21 00:57:58 PDT
(In reply to comment #5)
> Alternatively, I might be able to use a Javascript to tell which is the case. 

That looks like the easiest alternative - document.inputEncoding will contain the detected encoding name.
Comment 7 Jungshik Shin 2007-04-26 17:37:43 PDT
Created attachment 14218 [details]
patch with test cases and changelog entries

Added test cases for UTF-32BE (html  with BOM and xml without BOM) and UTF-32LE (html with BOM and xml without BOM).  I can't make a xml test case for UTF-32LE that prints out document.inputEncoding.  I tried document.write() and a few DOM-based methods to add a node, "<p>{the value of document.inputEncoding}</p>", but I couldn't make it work in Safari. 

For UTF-32BE, just comparing the output (without document.inputEncoding) is sufficient because it cannot be mistaken for UTF-16BE or other encodings. For UTF-32LE, we'd better print out the value of document.inputEncoding.
Comment 8 Alexey Proskuryakov 2007-04-26 21:40:05 PDT
To dynamically create a <p> element, be sure to use createElementNS with XHTML namespace (unlike Firefox, WebKit does not try to guess what namespace to use with createElement; we have a bug tracking this).

Alternatively, the following should work:

---------------------
<div id="result">FAILURE: script did't run</div>
<script>document.getElementById("result").firstChild.nodeValue = document.inputEncoding;</script>
---------------------
Comment 9 Maciej Stachowiak 2007-04-26 22:31:40 PDT
Comment on attachment 14218 [details]
patch with test cases and changelog entries

r- for now to consider Alexey's comment, please flag for review with new patch.
Comment 10 Jungshik Shin 2007-04-27 00:03:34 PDT
Created attachment 14220 [details]
a new patch with updated test cases

Modified test cases per ap's suggestion. Both XML and HTML print out the encoding determined. 

> ---------------------
> <div id="result">FAILURE: script did't run</div>
> <script>document.getElementById("result").firstChild.nodeValue =
> document.inputEncoding;</script>
> ---------------------

Actually, I tried the above as well. However, I forgot that I had to make it run after the loading is complete (the script was put in the <head> without an onload triggering.). As a result, it didn't run even in Firefox. 
 
Thanks to your suggestion, I realized that :-)
Comment 11 Jungshik Shin 2007-04-30 17:08:40 PDT
Alexey (or anyone else), can you review the latest patch? 
Comment 12 Alexey Proskuryakov 2007-05-01 01:32:17 PDT
Comment on attachment 14220 [details]
a new patch with updated test cases

r- for now, but the patch looks really good to me!

+    unsigned char c3 = buf1Len ? (--buf1Len, *buf1++) : buf2Len ? (--buf2Len, *buf2++) : 0;
+    unsigned char c4 = buf2Len ? (--buf2Len, *buf2++) : 0;
...
+    if (c1 == 0xFF && c2 == 0xFE) {
+        if (c3 != 0 || c4 != 0) 
+            encodingConsideringBOM = &UTF16LittleEndianEncoding();
+        else 
+            encodingConsideringBOM = &UTF32LittleEndianEncoding();
+    }

You need to check why c3 and c4 are zero - if there are only two bytes available yet, it may be too early to assume UTF-32.

     else if (numBufferedBytes + length <= sizeof(m_bufferedBytes) && !flush) {
         // Continue to look for the BOM.

I believe m_bufferedBytes size should be increased to three bytes, now that we deal with 4-byte BOMs.
Comment 13 Jungshik Shin 2007-06-28 20:05:11 PDT
Created attachment 15303 [details]
patch updated

patch updated per comments.
Comment 14 Alexey Proskuryakov 2007-06-29 02:12:59 PDT
Comment on attachment 15303 [details]
patch updated

It seems that the change to html4.css does not belong to this patch.

+ setEncoding(((ptr - m_buffer.data()) % 4) ? "UTF-32LE" : "UTF-32BE", AutoDetectedEncoding);

Since you've replaced encoding names with function calls in several places, I suggest using those here, as well.

+ //else if (numBufferedBytes >= 3 || length >= 3) 

We really do not like commented-out code in sources, please remove this line.

Coming to more important issues: I don't think I understand the algorithm in TextDecoder::checkForBOM(). Why did you change m_bufferedBytes size to 4 bytes? If we have already seen 4 bytes of the input, then we are done with BOM detection, so only 3 bytes need be buffered. And the "encodingConsideringBOM != &m_encoding" check looks wrong, comparing TextEncoding by address is not a good way to detect that we have successfully found a BOM.

This is extremely close to an r+, but I think that another iteration is needed to iron out these issues.
Comment 15 Jungshik Shin 2007-06-29 15:54:12 PDT
Created attachment 15320 [details]
patch updated once more

thanks for the review. pls, take another look
Comment 16 Alexey Proskuryakov 2007-07-02 03:51:19 PDT
Comment on attachment 15320 [details]
patch updated once more

Looks great, r=me!

Attention committer: there is a tab character in the patch.
Comment 17 Mark Rowe (bdash) 2007-07-05 07:27:03 PDT
Comment on attachment 15320 [details]
patch updated once more

I went to land this patch but the tabs in the various UTF-32 files made it rather tricky.  After futzing around for a few minutes trying to find an editor that'd handle all four UTF-32 variations, I gave up.  Can you please fix the four UTF-32 files, and the tab that's in the source file while you're at it :)

It's also unclear to me why the some of the tests are .xml and others are .html.  Is there some reason for this?
Comment 18 Jungshik Shin 2007-07-05 23:58:09 PDT
What do you mean by tabs in 4 UTF-32 files? They're NOT ASCII-compatible and they should be treated as binary. 

As for xml vs html, WebKit currently use different encoding sniffing rules for html and xml, which is why I have separate test cases for html and xml. The same is true of UTF-16 test cases. 

I'll get rid of tabs in the source code and upload a new patch. 
Comment 19 Jungshik Shin 2007-07-06 00:10:45 PDT
Ahah. If it's required that even html/xml files for test are tab-free (in whatever encoding), I'll get rid of them in 4 UTF-32 files. 


 
Comment 20 Jungshik Shin 2007-07-06 01:20:12 PDT
Created attachment 15413 [details]
patch without tabs

tabs are removed in the source code as well as in UTF-32 files.
Comment 21 Maciej Stachowiak 2007-07-06 02:44:14 PDT
Comment on attachment 15413 [details]
patch without tabs

Thanks for removing the tabs.
Comment 22 Mark Rowe (bdash) 2007-07-06 02:59:54 PDT
My question about HTML vs XML was that you have only four test cases -- UTF-32BE with and without BOM, UTF-32LE with and without BOM.  The BOM-less cases are both XML, the ones with BOM are both HTML.  What about BOM-less HTML?  Or XML with a BOM?
Comment 23 Mark Rowe (bdash) 2007-07-06 03:02:18 PDT
Landed in r24052.