WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
13415
Add UTF-32 support for html/xml documents
https://bugs.webkit.org/show_bug.cgi?id=13415
Summary
Add UTF-32 support for html/xml documents
Jungshik Shin
Reported
2007-04-19 21:17:01 PDT
bug 10709
is about parsing CSS stylesheets in UTF-16/32. For html/xml documents, UTF-16 is detected and supported, but UTF-32 (BE and LE) is not.
Attachments
a preliminary patch
(6.32 KB, patch)
2007-04-19 21:38 PDT
,
Jungshik Shin
ddkilzer
: review-
Details
Formatted Diff
Diff
patch with test cases and changelog entries
(23.10 KB, patch)
2007-04-26 17:37 PDT
,
Jungshik Shin
mjs
: review-
Details
Formatted Diff
Diff
a new patch with updated test cases
(30.33 KB, patch)
2007-04-27 00:03 PDT
,
Jungshik Shin
ap
: review-
Details
Formatted Diff
Diff
patch updated
(31.52 KB, patch)
2007-06-28 20:05 PDT
,
Jungshik Shin
ap
: review-
Details
Formatted Diff
Diff
patch updated once more
(31.23 KB, patch)
2007-06-29 15:54 PDT
,
Jungshik Shin
mrowe
: review-
Details
Formatted Diff
Diff
patch without tabs
(31.00 KB, patch)
2007-07-06 01:20 PDT
,
Jungshik Shin
mjs
: review+
Details
Formatted Diff
Diff
Show Obsolete
(5)
View All
Add attachment
proposed patch, testcase, etc.
Jungshik Shin
Comment 1
2007-04-19 21:38:05 PDT
Created
attachment 14103
[details]
a preliminary patch
Dave Hyatt
Comment 2
2007-04-19 22:01:59 PDT
Patch looks good to me.
Jungshik Shin
Comment 3
2007-04-20 05:40:27 PDT
test cases. Not all test cases work because some of them rely on a 'psuedo-BOM for HTML', which is not supported by WebKit, perhaps intentionally (HTML standard does not have such a provision while XML has that in the appendix). Test cases for 'pseudo-BOM for XML' (supported by WebKit) need to be added.
David Kilzer (:ddkilzer)
Comment 4
2007-04-20 12:26:33 PDT
Comment on
attachment 14103
[details]
a preliminary patch Per
Comment #2
, the basic patch is good, but it needs a ChangeLog entry and test cases. Please include layout test cases (with the patch) when you resumit the patch. More information on creating a ChangeLog entry and a patch is here:
http://webkit.org/coding/contributing.html
Scripts are in the WebKitTools/Scripts directory. Thanks!
Jungshik Shin
Comment 5
2007-04-20 19:48:30 PDT
David, according to the document you mentioned, I'm not supposed to use fonts that don't come with Mac OS X. Is there any Mac OS X standard font supporting non-BMP characters? Because WebKit renders 'U+0000' with zero-width glyph, it's impossible to tell whether UTF-32LE is interpreted as UTF-32LE or as UTF-16LE unless I include some non-BMP characters. A Japanese font (to support JIS X 213 characters) may cover some CJK ideographs in plane 2. I have to check that out. Alternatively, I might be able to use a Javascript to tell which is the case.
Alexey Proskuryakov
Comment 6
2007-04-21 00:57:58 PDT
(In reply to
comment #5
)
> Alternatively, I might be able to use a Javascript to tell which is the case.
That looks like the easiest alternative - document.inputEncoding will contain the detected encoding name.
Jungshik Shin
Comment 7
2007-04-26 17:37:43 PDT
Created
attachment 14218
[details]
patch with test cases and changelog entries Added test cases for UTF-32BE (html with BOM and xml without BOM) and UTF-32LE (html with BOM and xml without BOM). I can't make a xml test case for UTF-32LE that prints out document.inputEncoding. I tried document.write() and a few DOM-based methods to add a node, "<p>{the value of document.inputEncoding}</p>", but I couldn't make it work in Safari. For UTF-32BE, just comparing the output (without document.inputEncoding) is sufficient because it cannot be mistaken for UTF-16BE or other encodings. For UTF-32LE, we'd better print out the value of document.inputEncoding.
Alexey Proskuryakov
Comment 8
2007-04-26 21:40:05 PDT
To dynamically create a <p> element, be sure to use createElementNS with XHTML namespace (unlike Firefox, WebKit does not try to guess what namespace to use with createElement; we have a bug tracking this). Alternatively, the following should work: --------------------- <div id="result">FAILURE: script did't run</div> <script>document.getElementById("result").firstChild.nodeValue = document.inputEncoding;</script> ---------------------
Maciej Stachowiak
Comment 9
2007-04-26 22:31:40 PDT
Comment on
attachment 14218
[details]
patch with test cases and changelog entries r- for now to consider Alexey's comment, please flag for review with new patch.
Jungshik Shin
Comment 10
2007-04-27 00:03:34 PDT
Created
attachment 14220
[details]
a new patch with updated test cases Modified test cases per ap's suggestion. Both XML and HTML print out the encoding determined.
> --------------------- > <div id="result">FAILURE: script did't run</div> > <script>document.getElementById("result").firstChild.nodeValue = > document.inputEncoding;</script> > ---------------------
Actually, I tried the above as well. However, I forgot that I had to make it run after the loading is complete (the script was put in the <head> without an onload triggering.). As a result, it didn't run even in Firefox. Thanks to your suggestion, I realized that :-)
Jungshik Shin
Comment 11
2007-04-30 17:08:40 PDT
Alexey (or anyone else), can you review the latest patch?
Alexey Proskuryakov
Comment 12
2007-05-01 01:32:17 PDT
Comment on
attachment 14220
[details]
a new patch with updated test cases r- for now, but the patch looks really good to me! + unsigned char c3 = buf1Len ? (--buf1Len, *buf1++) : buf2Len ? (--buf2Len, *buf2++) : 0; + unsigned char c4 = buf2Len ? (--buf2Len, *buf2++) : 0; ... + if (c1 == 0xFF && c2 == 0xFE) { + if (c3 != 0 || c4 != 0) + encodingConsideringBOM = &UTF16LittleEndianEncoding(); + else + encodingConsideringBOM = &UTF32LittleEndianEncoding(); + } You need to check why c3 and c4 are zero - if there are only two bytes available yet, it may be too early to assume UTF-32. else if (numBufferedBytes + length <= sizeof(m_bufferedBytes) && !flush) { // Continue to look for the BOM. I believe m_bufferedBytes size should be increased to three bytes, now that we deal with 4-byte BOMs.
Jungshik Shin
Comment 13
2007-06-28 20:05:11 PDT
Created
attachment 15303
[details]
patch updated patch updated per comments.
Alexey Proskuryakov
Comment 14
2007-06-29 02:12:59 PDT
Comment on
attachment 15303
[details]
patch updated It seems that the change to html4.css does not belong to this patch. + setEncoding(((ptr - m_buffer.data()) % 4) ? "UTF-32LE" : "UTF-32BE", AutoDetectedEncoding); Since you've replaced encoding names with function calls in several places, I suggest using those here, as well. + //else if (numBufferedBytes >= 3 || length >= 3) We really do not like commented-out code in sources, please remove this line. Coming to more important issues: I don't think I understand the algorithm in TextDecoder::checkForBOM(). Why did you change m_bufferedBytes size to 4 bytes? If we have already seen 4 bytes of the input, then we are done with BOM detection, so only 3 bytes need be buffered. And the "encodingConsideringBOM != &m_encoding" check looks wrong, comparing TextEncoding by address is not a good way to detect that we have successfully found a BOM. This is extremely close to an r+, but I think that another iteration is needed to iron out these issues.
Jungshik Shin
Comment 15
2007-06-29 15:54:12 PDT
Created
attachment 15320
[details]
patch updated once more thanks for the review. pls, take another look
Alexey Proskuryakov
Comment 16
2007-07-02 03:51:19 PDT
Comment on
attachment 15320
[details]
patch updated once more Looks great, r=me! Attention committer: there is a tab character in the patch.
Mark Rowe (bdash)
Comment 17
2007-07-05 07:27:03 PDT
Comment on
attachment 15320
[details]
patch updated once more I went to land this patch but the tabs in the various UTF-32 files made it rather tricky. After futzing around for a few minutes trying to find an editor that'd handle all four UTF-32 variations, I gave up. Can you please fix the four UTF-32 files, and the tab that's in the source file while you're at it :) It's also unclear to me why the some of the tests are .xml and others are .html. Is there some reason for this?
Jungshik Shin
Comment 18
2007-07-05 23:58:09 PDT
What do you mean by tabs in 4 UTF-32 files? They're NOT ASCII-compatible and they should be treated as binary. As for xml vs html, WebKit currently use different encoding sniffing rules for html and xml, which is why I have separate test cases for html and xml. The same is true of UTF-16 test cases. I'll get rid of tabs in the source code and upload a new patch.
Jungshik Shin
Comment 19
2007-07-06 00:10:45 PDT
Ahah. If it's required that even html/xml files for test are tab-free (in whatever encoding), I'll get rid of them in 4 UTF-32 files.
Jungshik Shin
Comment 20
2007-07-06 01:20:12 PDT
Created
attachment 15413
[details]
patch without tabs tabs are removed in the source code as well as in UTF-32 files.
Maciej Stachowiak
Comment 21
2007-07-06 02:44:14 PDT
Comment on
attachment 15413
[details]
patch without tabs Thanks for removing the tabs.
Mark Rowe (bdash)
Comment 22
2007-07-06 02:59:54 PDT
My question about HTML vs XML was that you have only four test cases -- UTF-32BE with and without BOM, UTF-32LE with and without BOM. The BOM-less cases are both XML, the ones with BOM are both HTML. What about BOM-less HTML? Or XML with a BOM?
Mark Rowe (bdash)
Comment 23
2007-07-06 03:02:18 PDT
Landed in
r24052
.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug