Bug 13415 - Add UTF-32 support for html/xml documents
: Add UTF-32 support for html/xml documents
Status: RESOLVED FIXED
: WebKit
Page Loading
: 523.x (Safari 3)
: All All
: P3 Minor
Assigned To:
: http://jshin.net/i18n/utftest
:
:
:
  Show dependency treegraph
 
Reported: 2007-04-19 21:17 PST by
Modified: 2007-07-06 03:02 PST (History)


Attachments
a preliminary patch (6.32 KB, patch)
2007-04-19 21:38 PST, Jungshik Shin
ddkilzer: review-
Review Patch | Details | Formatted Diff | Diff
patch with test cases and changelog entries (23.10 KB, patch)
2007-04-26 17:37 PST, Jungshik Shin
mjs: review-
Review Patch | Details | Formatted Diff | Diff
a new patch with updated test cases (30.33 KB, patch)
2007-04-27 00:03 PST, Jungshik Shin
ap: review-
Review Patch | Details | Formatted Diff | Diff
patch updated (31.52 KB, patch)
2007-06-28 20:05 PST, Jungshik Shin
ap: review-
Review Patch | Details | Formatted Diff | Diff
patch updated once more (31.23 KB, patch)
2007-06-29 15:54 PST, Jungshik Shin
mrowe: review-
Review Patch | Details | Formatted Diff | Diff
patch without tabs (31.00 KB, patch)
2007-07-06 01:20 PST, Jungshik Shin
mjs: review+
Review Patch | Details | Formatted Diff | Diff


Note

You need to log in before you can comment on or make changes to this bug.


Description From 2007-04-19 21:17:01 PST
bug 10709 is about parsing CSS stylesheets in UTF-16/32. For html/xml documents, UTF-16 is detected and supported, but UTF-32 (BE and LE) is not.
------- Comment #1 From 2007-04-19 21:38:05 PST -------
Created an attachment (id=14103) [details]
a preliminary patch
------- Comment #2 From 2007-04-19 22:01:59 PST -------
Patch looks good to me.
------- Comment #3 From 2007-04-20 05:40:27 PST -------
test cases. Not all test cases work because some of them rely on a 'psuedo-BOM for HTML', which is not supported by WebKit, perhaps intentionally (HTML standard does not have such a provision while XML has that in the appendix). Test cases for 'pseudo-BOM for XML' (supported by WebKit) need to be added. 
------- Comment #4 From 2007-04-20 12:26:33 PST -------
(From update of attachment 14103 [details])
Per Comment #2, the basic patch is good, but it needs a ChangeLog entry and test cases.

Please include layout test cases (with the patch) when you resumit the patch.

More information on creating a ChangeLog entry and a patch is here:

http://webkit.org/coding/contributing.html

Scripts are in the WebKitTools/Scripts directory.  Thanks!
------- Comment #5 From 2007-04-20 19:48:30 PST -------
David, according to the document you mentioned, I'm not supposed to use fonts that don't  come with Mac OS X. Is there any Mac OS X standard font supporting non-BMP characters? Because WebKit renders 'U+0000' with zero-width glyph, it's impossible to tell whether UTF-32LE is interpreted as UTF-32LE or as UTF-16LE unless I include some non-BMP characters.  

A Japanese font (to support JIS X 213 characters) may cover some CJK ideographs in plane 2. I have to check that out. 

Alternatively, I might be able to use a Javascript to tell which is the case. 
------- Comment #6 From 2007-04-21 00:57:58 PST -------
(In reply to comment #5)
> Alternatively, I might be able to use a Javascript to tell which is the case. 

That looks like the easiest alternative - document.inputEncoding will contain the detected encoding name.
------- Comment #7 From 2007-04-26 17:37:43 PST -------
Created an attachment (id=14218) [details]
patch with test cases and changelog entries

Added test cases for UTF-32BE (html  with BOM and xml without BOM) and UTF-32LE (html with BOM and xml without BOM).  I can't make a xml test case for UTF-32LE that prints out document.inputEncoding.  I tried document.write() and a few DOM-based methods to add a node, "<p>{the value of document.inputEncoding}</p>", but I couldn't make it work in Safari. 

For UTF-32BE, just comparing the output (without document.inputEncoding) is sufficient because it cannot be mistaken for UTF-16BE or other encodings. For UTF-32LE, we'd better print out the value of document.inputEncoding. 
------- Comment #8 From 2007-04-26 21:40:05 PST -------
To dynamically create a <p> element, be sure to use createElementNS with XHTML namespace (unlike Firefox, WebKit does not try to guess what namespace to use with createElement; we have a bug tracking this).

Alternatively, the following should work:

---------------------
<div id="result">FAILURE: script did't run</div>
<script>document.getElementById("result").firstChild.nodeValue = document.inputEncoding;</script>
---------------------
------- Comment #9 From 2007-04-26 22:31:40 PST -------
(From update of attachment 14218 [details])
r- for now to consider Alexey's comment, please flag for review with new patch.
------- Comment #10 From 2007-04-27 00:03:34 PST -------
Created an attachment (id=14220) [details]
a new patch with updated test cases

Modified test cases per ap's suggestion. Both XML and HTML print out the encoding determined. 

> ---------------------
> <div id="result">FAILURE: script did't run</div>
> <script>document.getElementById("result").firstChild.nodeValue =
> document.inputEncoding;</script>
> ---------------------

Actually, I tried the above as well. However, I forgot that I had to make it run after the loading is complete (the script was put in the <head> without an onload triggering.). As a result, it didn't run even in Firefox. 

Thanks to your suggestion, I realized that :-) 
------- Comment #11 From 2007-04-30 17:08:40 PST -------
Alexey (or anyone else), can you review the latest patch? 
------- Comment #12 From 2007-05-01 01:32:17 PST -------
(From update of attachment 14220 [details])
r- for now, but the patch looks really good to me!

+    unsigned char c3 = buf1Len ? (--buf1Len, *buf1++) : buf2Len ? (--buf2Len, *buf2++) : 0;
+    unsigned char c4 = buf2Len ? (--buf2Len, *buf2++) : 0;
...
+    if (c1 == 0xFF && c2 == 0xFE) {
+        if (c3 != 0 || c4 != 0) 
+            encodingConsideringBOM = &UTF16LittleEndianEncoding();
+        else 
+            encodingConsideringBOM = &UTF32LittleEndianEncoding();
+    }

You need to check why c3 and c4 are zero - if there are only two bytes available yet, it may be too early to assume UTF-32.

     else if (numBufferedBytes + length <= sizeof(m_bufferedBytes) && !flush) {
         // Continue to look for the BOM.

I believe m_bufferedBytes size should be increased to three bytes, now that we deal with 4-byte BOMs.
------- Comment #13 From 2007-06-28 20:05:11 PST -------
Created an attachment (id=15303) [details]
patch updated

patch updated per comments.
------- Comment #14 From 2007-06-29 02:12:59 PST -------
(From update of attachment 15303 [details])
It seems that the change to html4.css does not belong to this patch.

+ setEncoding(((ptr - m_buffer.data()) % 4) ? "UTF-32LE" : "UTF-32BE", AutoDetectedEncoding);

Since you've replaced encoding names with function calls in several places, I suggest using those here, as well.

+ //else if (numBufferedBytes >= 3 || length >= 3) 

We really do not like commented-out code in sources, please remove this line.

Coming to more important issues: I don't think I understand the algorithm in TextDecoder::checkForBOM(). Why did you change m_bufferedBytes size to 4 bytes? If we have already seen 4 bytes of the input, then we are done with BOM detection, so only 3 bytes need be buffered. And the "encodingConsideringBOM != &m_encoding" check looks wrong, comparing TextEncoding by address is not a good way to detect that we have successfully found a BOM.

This is extremely close to an r+, but I think that another iteration is needed to iron out these issues.
------- Comment #15 From 2007-06-29 15:54:12 PST -------
Created an attachment (id=15320) [details]
patch updated once more

thanks for the review. pls, take another look
------- Comment #16 From 2007-07-02 03:51:19 PST -------
(From update of attachment 15320 [details])
Looks great, r=me!

Attention committer: there is a tab character in the patch.
------- Comment #17 From 2007-07-05 07:27:03 PST -------
(From update of attachment 15320 [details])
I went to land this patch but the tabs in the various UTF-32 files made it rather tricky.  After futzing around for a few minutes trying to find an editor that'd handle all four UTF-32 variations, I gave up.  Can you please fix the four UTF-32 files, and the tab that's in the source file while you're at it :)

It's also unclear to me why the some of the tests are .xml and others are .html.  Is there some reason for this?
------- Comment #18 From 2007-07-05 23:58:09 PST -------
What do you mean by tabs in 4 UTF-32 files? They're NOT ASCII-compatible and they should be treated as binary. 

As for xml vs html, WebKit currently use different encoding sniffing rules for html and xml, which is why I have separate test cases for html and xml. The same is true of UTF-16 test cases. 

I'll get rid of tabs in the source code and upload a new patch. 
------- Comment #19 From 2007-07-06 00:10:45 PST -------
Ahah. If it's required that even html/xml files for test are tab-free (in whatever encoding), I'll get rid of them in 4 UTF-32 files. 


------- Comment #20 From 2007-07-06 01:20:12 PST -------
Created an attachment (id=15413) [details]
patch without tabs

tabs are removed in the source code as well as in UTF-32 files. 
------- Comment #21 From 2007-07-06 02:44:14 PST -------
(From update of attachment 15413 [details])
Thanks for removing the tabs.
------- Comment #22 From 2007-07-06 02:59:54 PST -------
My question about HTML vs XML was that you have only four test cases -- UTF-32BE with and without BOM, UTF-32LE with and without BOM.  The BOM-less cases are both XML, the ones with BOM are both HTML.  What about BOM-less HTML?  Or XML with a BOM?
------- Comment #23 From 2007-07-06 03:02:18 PST -------
Landed in r24052.