13415 – Add UTF-32 support for html/xml documents

Bug 13415 - Add UTF-32 support for html/xml documents

Summary: Add UTF-32 support for html/xml documents

Status:	RESOLVED FIXED

Alias:	None

Product:	WebKit
Classification:	Unclassified
Component:	Page Loading (show other bugs)
Version:	523.x (Safari 3)
Hardware:	All All

Importance:	P3 Minor
Assignee:	Nobody

URL:	http://jshin.net/i18n/utftest
Keywords:

Depends on:
Blocks:

Reported:	2007-04-19 21:17 PDT by Jungshik Shin
Modified:	2007-07-06 03:02 PDT (History)
CC List:	2 users (show)

See Also:

Attachments
a preliminary patch (6.32 KB, patch) 2007-04-19 21:38 PDT, Jungshik Shin	ddkilzer: review-	Details \| Formatted Diff \| Diff
patch with test cases and changelog entries (23.10 KB, patch) 2007-04-26 17:37 PDT, Jungshik Shin	mjs: review-	Details \| Formatted Diff \| Diff
a new patch with updated test cases (30.33 KB, patch) 2007-04-27 00:03 PDT, Jungshik Shin	ap: review-	Details \| Formatted Diff \| Diff
patch updated (31.52 KB, patch) 2007-06-28 20:05 PDT, Jungshik Shin	ap: review-	Details \| Formatted Diff \| Diff
patch updated once more (31.23 KB, patch) 2007-06-29 15:54 PDT, Jungshik Shin	mrowe: review-	Details \| Formatted Diff \| Diff
patch without tabs (31.00 KB, patch) 2007-07-06 01:20 PDT, Jungshik Shin	mjs: review+	Details \| Formatted Diff \| Diff
Show Obsolete (5) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Jungshik Shin 2007-04-19 21:17:01 PDT

bug 10709 is about parsing CSS stylesheets in UTF-16/32. For html/xml documents, UTF-16 is detected and supported, but UTF-32 (BE and LE) is not.

Comment 1 Jungshik Shin 2007-04-19 21:38:05 PDT

Created attachment 14103 [details]
a preliminary patch

Comment 2 Dave Hyatt 2007-04-19 22:01:59 PDT

Patch looks good to me.

Comment 3 Jungshik Shin 2007-04-20 05:40:27 PDT

test cases. Not all test cases work because some of them rely on a 'psuedo-BOM for HTML', which is not supported by WebKit, perhaps intentionally (HTML standard does not have such a provision while XML has that in the appendix). Test cases for 'pseudo-BOM for XML' (supported by WebKit) need to be added.

Comment 4 David Kilzer (:ddkilzer) 2007-04-20 12:26:33 PDT

Comment on attachment 14103 [details]
a preliminary patch

Per Comment #2, the basic patch is good, but it needs a ChangeLog entry and test cases.

Please include layout test cases (with the patch) when you resumit the patch.

More information on creating a ChangeLog entry and a patch is here:

http://webkit.org/coding/contributing.html

Scripts are in the WebKitTools/Scripts directory.  Thanks!

Comment 5 Jungshik Shin 2007-04-20 19:48:30 PDT

David, according to the document you mentioned, I'm not supposed to use fonts that don't  come with Mac OS X. Is there any Mac OS X standard font supporting non-BMP characters? Because WebKit renders 'U+0000' with zero-width glyph, it's impossible to tell whether UTF-32LE is interpreted as UTF-32LE or as UTF-16LE unless I include some non-BMP characters.  

A Japanese font (to support JIS X 213 characters) may cover some CJK ideographs in plane 2. I have to check that out. 

Alternatively, I might be able to use a Javascript to tell which is the case.

Comment 6 Alexey Proskuryakov 2007-04-21 00:57:58 PDT

(In reply to comment #5)
> Alternatively, I might be able to use a Javascript to tell which is the case. 

That looks like the easiest alternative - document.inputEncoding will contain the detected encoding name.

Comment 7 Jungshik Shin 2007-04-26 17:37:43 PDT

Created attachment 14218 [details]
patch with test cases and changelog entries

Added test cases for UTF-32BE (html  with BOM and xml without BOM) and UTF-32LE (html with BOM and xml without BOM).  I can't make a xml test case for UTF-32LE that prints out document.inputEncoding.  I tried document.write() and a few DOM-based methods to add a node, "<p>{the value of document.inputEncoding}</p>", but I couldn't make it work in Safari. 

For UTF-32BE, just comparing the output (without document.inputEncoding) is sufficient because it cannot be mistaken for UTF-16BE or other encodings. For UTF-32LE, we'd better print out the value of document.inputEncoding.

Comment 8 Alexey Proskuryakov 2007-04-26 21:40:05 PDT

To dynamically create a <p> element, be sure to use createElementNS with XHTML namespace (unlike Firefox, WebKit does not try to guess what namespace to use with createElement; we have a bug tracking this).

Alternatively, the following should work:

---------------------
<div id="result">FAILURE: script did't run</div>
<script>document.getElementById("result").firstChild.nodeValue = document.inputEncoding;</script>
---------------------

Comment 9 Maciej Stachowiak 2007-04-26 22:31:40 PDT

Comment on attachment 14218 [details]
patch with test cases and changelog entries

r- for now to consider Alexey's comment, please flag for review with new patch.

Comment 10 Jungshik Shin 2007-04-27 00:03:34 PDT

Created attachment 14220 [details]
a new patch with updated test cases

Modified test cases per ap's suggestion. Both XML and HTML print out the encoding determined. 

> ---------------------
> <div id="result">FAILURE: script did't run</div>
> <script>document.getElementById("result").firstChild.nodeValue =
> document.inputEncoding;</script>
> ---------------------

Actually, I tried the above as well. However, I forgot that I had to make it run after the loading is complete (the script was put in the <head> without an onload triggering.). As a result, it didn't run even in Firefox. 
 
Thanks to your suggestion, I realized that :-)

Comment 11 Jungshik Shin 2007-04-30 17:08:40 PDT

Alexey (or anyone else), can you review the latest patch?

Comment 12 Alexey Proskuryakov 2007-05-01 01:32:17 PDT

Comment on attachment 14220 [details]
a new patch with updated test cases

r- for now, but the patch looks really good to me!

+    unsigned char c3 = buf1Len ? (--buf1Len, *buf1++) : buf2Len ? (--buf2Len, *buf2++) : 0;
+    unsigned char c4 = buf2Len ? (--buf2Len, *buf2++) : 0;
...
+    if (c1 == 0xFF && c2 == 0xFE) {
+        if (c3 != 0 || c4 != 0) 
+            encodingConsideringBOM = &UTF16LittleEndianEncoding();
+        else 
+            encodingConsideringBOM = &UTF32LittleEndianEncoding();
+    }

You need to check why c3 and c4 are zero - if there are only two bytes available yet, it may be too early to assume UTF-32.

     else if (numBufferedBytes + length <= sizeof(m_bufferedBytes) && !flush) {
         // Continue to look for the BOM.

I believe m_bufferedBytes size should be increased to three bytes, now that we deal with 4-byte BOMs.

Comment 13 Jungshik Shin 2007-06-28 20:05:11 PDT

Created attachment 15303 [details]
patch updated

patch updated per comments.

Comment 14 Alexey Proskuryakov 2007-06-29 02:12:59 PDT

Comment on attachment 15303 [details]
patch updated

It seems that the change to html4.css does not belong to this patch.

+ setEncoding(((ptr - m_buffer.data()) % 4) ? "UTF-32LE" : "UTF-32BE", AutoDetectedEncoding);

Since you've replaced encoding names with function calls in several places, I suggest using those here, as well.

+ //else if (numBufferedBytes >= 3 || length >= 3) 

We really do not like commented-out code in sources, please remove this line.

Coming to more important issues: I don't think I understand the algorithm in TextDecoder::checkForBOM(). Why did you change m_bufferedBytes size to 4 bytes? If we have already seen 4 bytes of the input, then we are done with BOM detection, so only 3 bytes need be buffered. And the "encodingConsideringBOM != &m_encoding" check looks wrong, comparing TextEncoding by address is not a good way to detect that we have successfully found a BOM.

This is extremely close to an r+, but I think that another iteration is needed to iron out these issues.

Comment 15 Jungshik Shin 2007-06-29 15:54:12 PDT

Created attachment 15320 [details]
patch updated once more

thanks for the review. pls, take another look

Comment 16 Alexey Proskuryakov 2007-07-02 03:51:19 PDT

Comment on attachment 15320 [details]
patch updated once more

Looks great, r=me!

Attention committer: there is a tab character in the patch.

Comment 17 Mark Rowe (bdash) 2007-07-05 07:27:03 PDT

Comment on attachment 15320 [details]
patch updated once more

I went to land this patch but the tabs in the various UTF-32 files made it rather tricky.  After futzing around for a few minutes trying to find an editor that'd handle all four UTF-32 variations, I gave up.  Can you please fix the four UTF-32 files, and the tab that's in the source file while you're at it :)

It's also unclear to me why the some of the tests are .xml and others are .html.  Is there some reason for this?

Comment 18 Jungshik Shin 2007-07-05 23:58:09 PDT

What do you mean by tabs in 4 UTF-32 files? They're NOT ASCII-compatible and they should be treated as binary. 

As for xml vs html, WebKit currently use different encoding sniffing rules for html and xml, which is why I have separate test cases for html and xml. The same is true of UTF-16 test cases. 

I'll get rid of tabs in the source code and upload a new patch.

Comment 19 Jungshik Shin 2007-07-06 00:10:45 PDT

Ahah. If it's required that even html/xml files for test are tab-free (in whatever encoding), I'll get rid of them in 4 UTF-32 files.

Comment 20 Jungshik Shin 2007-07-06 01:20:12 PDT

Created attachment 15413 [details]
patch without tabs

tabs are removed in the source code as well as in UTF-32 files.

Comment 21 Maciej Stachowiak 2007-07-06 02:44:14 PDT

Comment on attachment 15413 [details]
patch without tabs

Thanks for removing the tabs.

Comment 22 Mark Rowe (bdash) 2007-07-06 02:59:54 PDT

My question about HTML vs XML was that you have only four test cases -- UTF-32BE with and without BOM, UTF-32LE with and without BOM.  The BOM-less cases are both XML, the ones with BOM are both HTML.  What about BOM-less HTML?  Or XML with a BOM?

Comment 23 Mark Rowe (bdash) 2007-07-06 03:02:18 PDT

Landed in r24052.