Summary: | Add UTF-32 support for html/xml documents | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | WebKit | Reporter: | Jungshik Shin <jshin> | ||||||||||||||
Component: | Page Loading | Assignee: | Nobody <webkit-unassigned> | ||||||||||||||
Status: | RESOLVED FIXED | ||||||||||||||||
Severity: | Minor | CC: | ap, mrowe | ||||||||||||||
Priority: | P3 | ||||||||||||||||
Version: | 523.x (Safari 3) | ||||||||||||||||
Hardware: | All | ||||||||||||||||
OS: | All | ||||||||||||||||
URL: | http://jshin.net/i18n/utftest | ||||||||||||||||
Attachments: |
|
Description
Jungshik Shin
2007-04-19 21:17:01 PDT
Created attachment 14103 [details]
a preliminary patch
Patch looks good to me. test cases. Not all test cases work because some of them rely on a 'psuedo-BOM for HTML', which is not supported by WebKit, perhaps intentionally (HTML standard does not have such a provision while XML has that in the appendix). Test cases for 'pseudo-BOM for XML' (supported by WebKit) need to be added. Comment on attachment 14103 [details] a preliminary patch Per Comment #2, the basic patch is good, but it needs a ChangeLog entry and test cases. Please include layout test cases (with the patch) when you resumit the patch. More information on creating a ChangeLog entry and a patch is here: http://webkit.org/coding/contributing.html Scripts are in the WebKitTools/Scripts directory. Thanks! David, according to the document you mentioned, I'm not supposed to use fonts that don't come with Mac OS X. Is there any Mac OS X standard font supporting non-BMP characters? Because WebKit renders 'U+0000' with zero-width glyph, it's impossible to tell whether UTF-32LE is interpreted as UTF-32LE or as UTF-16LE unless I include some non-BMP characters. A Japanese font (to support JIS X 213 characters) may cover some CJK ideographs in plane 2. I have to check that out. Alternatively, I might be able to use a Javascript to tell which is the case. (In reply to comment #5) > Alternatively, I might be able to use a Javascript to tell which is the case. That looks like the easiest alternative - document.inputEncoding will contain the detected encoding name. Created attachment 14218 [details]
patch with test cases and changelog entries
Added test cases for UTF-32BE (html with BOM and xml without BOM) and UTF-32LE (html with BOM and xml without BOM). I can't make a xml test case for UTF-32LE that prints out document.inputEncoding. I tried document.write() and a few DOM-based methods to add a node, "<p>{the value of document.inputEncoding}</p>", but I couldn't make it work in Safari.
For UTF-32BE, just comparing the output (without document.inputEncoding) is sufficient because it cannot be mistaken for UTF-16BE or other encodings. For UTF-32LE, we'd better print out the value of document.inputEncoding.
To dynamically create a <p> element, be sure to use createElementNS with XHTML namespace (unlike Firefox, WebKit does not try to guess what namespace to use with createElement; we have a bug tracking this). Alternatively, the following should work: --------------------- <div id="result">FAILURE: script did't run</div> <script>document.getElementById("result").firstChild.nodeValue = document.inputEncoding;</script> --------------------- Comment on attachment 14218 [details]
patch with test cases and changelog entries
r- for now to consider Alexey's comment, please flag for review with new patch.
Created attachment 14220 [details] a new patch with updated test cases Modified test cases per ap's suggestion. Both XML and HTML print out the encoding determined. > --------------------- > <div id="result">FAILURE: script did't run</div> > <script>document.getElementById("result").firstChild.nodeValue = > document.inputEncoding;</script> > --------------------- Actually, I tried the above as well. However, I forgot that I had to make it run after the loading is complete (the script was put in the <head> without an onload triggering.). As a result, it didn't run even in Firefox. Thanks to your suggestion, I realized that :-) Alexey (or anyone else), can you review the latest patch? Comment on attachment 14220 [details]
a new patch with updated test cases
r- for now, but the patch looks really good to me!
+ unsigned char c3 = buf1Len ? (--buf1Len, *buf1++) : buf2Len ? (--buf2Len, *buf2++) : 0;
+ unsigned char c4 = buf2Len ? (--buf2Len, *buf2++) : 0;
...
+ if (c1 == 0xFF && c2 == 0xFE) {
+ if (c3 != 0 || c4 != 0)
+ encodingConsideringBOM = &UTF16LittleEndianEncoding();
+ else
+ encodingConsideringBOM = &UTF32LittleEndianEncoding();
+ }
You need to check why c3 and c4 are zero - if there are only two bytes available yet, it may be too early to assume UTF-32.
else if (numBufferedBytes + length <= sizeof(m_bufferedBytes) && !flush) {
// Continue to look for the BOM.
I believe m_bufferedBytes size should be increased to three bytes, now that we deal with 4-byte BOMs.
Created attachment 15303 [details]
patch updated
patch updated per comments.
Comment on attachment 15303 [details]
patch updated
It seems that the change to html4.css does not belong to this patch.
+ setEncoding(((ptr - m_buffer.data()) % 4) ? "UTF-32LE" : "UTF-32BE", AutoDetectedEncoding);
Since you've replaced encoding names with function calls in several places, I suggest using those here, as well.
+ //else if (numBufferedBytes >= 3 || length >= 3)
We really do not like commented-out code in sources, please remove this line.
Coming to more important issues: I don't think I understand the algorithm in TextDecoder::checkForBOM(). Why did you change m_bufferedBytes size to 4 bytes? If we have already seen 4 bytes of the input, then we are done with BOM detection, so only 3 bytes need be buffered. And the "encodingConsideringBOM != &m_encoding" check looks wrong, comparing TextEncoding by address is not a good way to detect that we have successfully found a BOM.
This is extremely close to an r+, but I think that another iteration is needed to iron out these issues.
Created attachment 15320 [details]
patch updated once more
thanks for the review. pls, take another look
Comment on attachment 15320 [details]
patch updated once more
Looks great, r=me!
Attention committer: there is a tab character in the patch.
Comment on attachment 15320 [details]
patch updated once more
I went to land this patch but the tabs in the various UTF-32 files made it rather tricky. After futzing around for a few minutes trying to find an editor that'd handle all four UTF-32 variations, I gave up. Can you please fix the four UTF-32 files, and the tab that's in the source file while you're at it :)
It's also unclear to me why the some of the tests are .xml and others are .html. Is there some reason for this?
What do you mean by tabs in 4 UTF-32 files? They're NOT ASCII-compatible and they should be treated as binary. As for xml vs html, WebKit currently use different encoding sniffing rules for html and xml, which is why I have separate test cases for html and xml. The same is true of UTF-16 test cases. I'll get rid of tabs in the source code and upload a new patch. Ahah. If it's required that even html/xml files for test are tab-free (in whatever encoding), I'll get rid of them in 4 UTF-32 files. Created attachment 15413 [details]
patch without tabs
tabs are removed in the source code as well as in UTF-32 files.
Comment on attachment 15413 [details]
patch without tabs
Thanks for removing the tabs.
My question about HTML vs XML was that you have only four test cases -- UTF-32BE with and without BOM, UTF-32LE with and without BOM. The BOM-less cases are both XML, the ones with BOM are both HTML. What about BOM-less HTML? Or XML with a BOM? Landed in r24052. |