10728 – text encodings should work without a numeric ID

RESOLVED FIXED 10728

text encodings should work without a numeric ID

https://bugs.webkit.org/show_bug.cgi?id=10728

Summary text encodings should work without a numeric ID

Darin Adler

Reported 2006-09-04 09:25:45 PDT

The text encoding machinery needs to be changed so it doesn't use a numeric ID. That was fine when text encoding was done with the Mac OS X Carbon Text Encoding Converter, but other encoding systems, like the one in ICU, don't involve any numeric IDs. One of the main benefits of this change is that it will fix platforms other than Mac OS X to use ICU properly for decoding.

Attachments
patch, with all the trimmings (315.34 KB, patch) 2006-09-04 16:09 PDT, Darin Adler	ap: review-	Details Formatted Diff Diff
new patch, improved and addresses Alexey's comments (319.83 KB, patch) 2006-09-05 09:29 PDT, Darin Adler	ap: review+	Details Formatted Diff Diff
Show Obsolete (1) View All Add attachment proposed patch, testcase, etc.

Darin Adler

Comment 1 2006-09-04 16:09:03 PDT

Created attachment 10396 [details] patch, with all the trimmings

Alexey Proskuryakov

Comment 2 2006-09-05 07:33:48 PDT

Comment on attachment 10396 [details] patch, with all the trimmings I have a few comments of a general kind, and I also think I spotted a couple of bugs. + (WebCore::Decoder::Decoder): Changed to use the functions above. Also renamed + m_reachedBody to m_checkedForHeadCharset. I think that this name is slightly misleading, because it also accounts for XML declaration, and non-default encoding being already set. The old name is no longer correct either, since the logic has changed. + Copyright (C) 2006 Alexey Proskuryakov (ap@nypop.com) Thank you! + static void registerEncodingNames(EncodingNameRegistrar); + static void registerCodecs(TextCodecRegistrar); It looks like the registry is mostly useful to non-ICU platforms (with ICU, it just makes codec creation slower, as it has to be looked up twice). But can other codecs even provide such kind of information? Looking at Qt documentation, QTextCodec provides customizeable heuristical approach, which doesn't really map into a registry. - DeprecatedString result = decoder->decode(static_cast<const char *>([data bytes]), [data length]); + String result = decoder->decode(static_cast<const char *>([data bytes]), [data length]); A pre-existing problem with spacing in "const char *"; there are more instances ofworng "const char *" spacing the patch. - , m_decoder(new Decoder("text/css")) + , m_decoder(new Decoder("text/css", charset)) Oops, my bad! This bugfix needs a ChangeLog entry and a test case. + // FIXME: TextDecoder already checks for the BOM; do we really need this?. Yes, we need to report a correct charset in document.characterSet and related properties, and to use the main resource charset as a default for subresources. I am not sure if this is already covered by test cases, it obviously needs to be. - if (d->m_bFirstData) { - // determine the parse mode - d->m_doc->determineParseMode(decoded); - d->m_bFirstData = false; + d->m_bFirstData = false; + d->m_doc->determineParseMode(decoded); Is this just cleanup? I'm wondering because you haven't changed the same pattern in the next function. - static const TextEncoding utf8Encoding(UTF8Encoding); + static const TextEncoding utf8Encoding(UTF8Encoding()); Is this local static const even needed anymore? + registrar("ISO-8859-8-I", "ISO-8859-8-I"); + registrar("csISO88598I", "ISO_8859-8-I"); The second registration looks like it has a typo - the alias is bound to an unknown encoding. - virtual DeprecatedCString fromUnicode(const DeprecatedString&, bool allowEntities = false); + virtual CString encode(const UChar*, size_t length, bool allowEntities = false) const; encode() relies on, and potentially changes the state of UConverter - seeing it const is a bit strange to me. + String result = String::newUninitialized(len / 2, buffer); If len is odd, and there's a byte buffered from the previous chunk, the result may not fit. + if (m_littleEndian) + for (size_t i = 0; i < len; i += 2) { + UChar c = p[0] | (p[1] << 8); + p += 2; + if (c != BOM) + *q++ = c; + } p[1] can be out of bounds if len is odd (same in big endian case). + // FIXME: CString is not a reasonable data structure for encoded UTF-16, which will have + // null characters inside it. Perhaps the result of encode should not be a CString? Formally, it should be Vector<unsigned char>, as a char isn't even guaranteed to be able to hold all bit patterns AFAIK. Seriously, a Vector seems more appropriate to me, indeed. + OwnPtr<TextCodec> m_decoder; I stumbled reading code that used this variable for a moment, perhaps if should be named m_codec? + // FIXME: Should normalization really be an automatic part of encoding? + // Normalization should probably be an explicit step separate from encoding. As far as I understand <http://www.w3.org/TR/charmod-norm/>, normalization should be an automatic part of every string operation, even append(). That's hardly practical, but at least all text that is sent out from a browser needs to be normalized. UChar TextEncoding::backslashAsCurrencySymbol() const { - if (m_flags & BackslashIsYen) - return 0x00A5; // yen sign - - return '\\'; + static const char* const a = atomicCanonicalTextEncodingName("Shift_JIS_X0213-2000"); + static const char* const b = atomicCanonicalTextEncodingName("EUC-JP"); + return (m_name == a || m_name == b) ? 0x00A5 : '\\'; +} It's quite weird that this function doesn't include the normal Shift_JIS encoding - Shift_JIS_X0213-2000 isn't even recognized by other browsers. Nothing to do with this patch, of course, but I find it hard to believe that the whole backslashAsCurrencySymbol concept works as expected. Korean encodings have similar issues with backslash AFAIK. Index: WebCore/platform/UChar.h Shouldn't the use of ICU here be ifdef'ed? +#include <Carbon/Carbon.h> I think it's enough to include CoreServices for TEC. - m_response = DeprecatedString(); + m_response = String(); This will need to be merged with a fix that I landed after you submitted this patch, sorry (m_response is now initialized to an empty string). I think I should r- this patch because of TextCodecUTF16 buffer overruns, although it brings an huge amount of improvements otherwise.

Darin Adler

Comment 3 2006-09-05 09:10:43 PDT

(In reply to comment #2) > + renamed m_reachedBody to m_checkedForHeadCharset. > > I think that this name is slightly misleading, because it also accounts for XML > declaration, and non-default encoding being already set. The old name is no > longer correct either, since the logic has changed. I believe I did not change the logic, but I agree and will seek another name. Any suggestions? I'll land with this name and rename if we come up with a better one. > + static void registerEncodingNames(EncodingNameRegistrar); > + static void registerCodecs(TextCodecRegistrar); > > It looks like the registry is mostly useful to non-ICU platforms (with ICU, it > just makes codec creation slower, as it has to be looked up twice). But can > other codecs even provide such kind of information? Looking at Qt > documentation, QTextCodec provides customizeable heuristical approach, which > doesn't really map into a registry. I won't try to defend the registry design in this note. I'm not absolutely certain it is best. We can keep refining and redo this if needed. What the registry assumes is that we can get a complete list of all the encodings we want to handle and their aliases. If Qt doesn't provide that, we'll need a table like the one we have on the Mac. I believe the Mac OS X TEC system also doesn't provide that list of encoding names and aliases (or if it does, we've never used it). Maybe we can discuss other ways to handle this. I hope that the API design can stand and this won't affect clients of the encoding classes too much. Removing the registry and replacing it with something else should not be too complicated if we decide to change it. > A pre-existing problem with spacing in "const char *"; there are more instances > of wrong "const char *" spacing the patch. OK, I think I got all of those now. > - , m_decoder(new Decoder("text/css")) > + , m_decoder(new Decoder("text/css", charset)) > > Oops, my bad! This bugfix needs a ChangeLog entry and a test case. Missed that when writing the ChangeLog. ChangeLog is fixed now. I'll see if I can quickly come up with a test case. > + // FIXME: TextDecoder already checks for the BOM; do we really need this?. > > Yes, we need to report a correct charset in document.characterSet and related > properties, and to use the main resource charset as a default for subresources. > I am not sure if this is already covered by test cases, it obviously needs to > be. OK, I'll leave that FIXME out. > - if (d->m_bFirstData) { > - // determine the parse mode > - d->m_doc->determineParseMode(decoded); > - d->m_bFirstData = false; > + d->m_bFirstData = false; > + d->m_doc->determineParseMode(decoded); > > Is this just cleanup? I'm wondering because you haven't changed the same > pattern in the next function. Yes, I changed that because I thought it was strange where the boolean was cleared, since more work was done after determineParseMode. It didn't make sense to me to do that one bit of work, clear the boolean, and then do other work. For clarity I changed the same pattern in the next function. > - static const TextEncoding utf8Encoding(UTF8Encoding); > + static const TextEncoding utf8Encoding(UTF8Encoding()); > > Is this local static const even needed anymore? You're right. I got rid of it. I was a bit "hands off" on KURL for now, but this was fine to fix. > + registrar("ISO-8859-8-I", "ISO-8859-8-I"); > + registrar("csISO88598I", "ISO_8859-8-I"); > > The second registration looks like it has a typo - the alias is bound to an > unknown encoding. It is a typo as you say, but a harmless one. The alias system makes each name canonical, and so this works fine. > - virtual DeprecatedCString fromUnicode(const DeprecatedString&, bool > allowEntities = false); > + virtual CString encode(const UChar*, size_t length, bool allowEntities > = false) const; > > encode() relies on, and potentially changes the state of UConverter - seeing it > const is a bit strange to me. Sure, that makes sense. I'll remove the const. We don't support stateful encoding, and I was trying to reflect that with the const. But it's not entirely correct and probably not helpful. > + String result = String::newUninitialized(len / 2, buffer); > > If len is odd, and there's a byte buffered from the previous chunk, the result > may not fit. Oh, that's right, good catch! Fixed. > + if (m_littleEndian) > + for (size_t i = 0; i < len; i += 2) { > + UChar c = p[0] | (p[1] << 8); > + p += 2; > + if (c != BOM) > + *q++ = c; > + } > > p[1] can be out of bounds if len is odd (same in big endian case). Oh, right, I meant to stop one character earlier! Wow, those are major mistakes. Fixed. > + // FIXME: CString is not a reasonable data structure for encoded UTF-16, > which will have > + // null characters inside it. Perhaps the result of encode should not be a > CString? > > Formally, it should be Vector<unsigned char>, as a char isn't even guaranteed > to be able to hold all bit patterns AFAIK. > > Seriously, a Vector seems more appropriate to me, indeed. We can change this, but I won't change it before checking in. I believe currently use encode only when creating form data for submission. Here are some observations: 1) CString can be used for runs of bytes that have 0 bytes in them, as long as we avoid operations that assume the existence of a null byte. 2) Vector is not suitable as a return type, since copying a vector is a deep copy, so if we want to use it as a return value we might need something more like PassRefPtr< RefCountedVector< Vector<char> > >. We could take a Vector as a parameter and modify it. 3) What you say about char not being guaranteed to hold all the bit patterns is not something I've heard before. It may be technically true, but it's not a practical issue. In practice, char and unsigned char can hold any 8-bit value. We generally use char for compatibility with things like the <string.h> functions and I don't think we need to change that, but I could be convinced otherwise. Using a mix of Vector<char> and Vector<unsigned char> could cause unwanted template bloat. 4) Since the one client of encode wants to use it more-or-less as a CString, it's probably OK to leave it alone for now. I'm not sure that form submission code works well for UTF-16 pages and this should probably be revisited. > + OwnPtr<TextCodec> m_decoder; > > I stumbled reading code that used this variable for a moment, perhaps if should > be named m_codec? Good idea. That name comes from before I renamed StreamingTextDecoder to TextCodec. Fixed. > + // FIXME: Should normalization really be an automatic part of encoding? > + // Normalization should probably be an explicit step separate from > encoding. > > As far as I understand <http://www.w3.org/TR/charmod-norm/>, normalization > should be an automatic part of every string operation, even append(). That's > hardly practical, but at least all text that is sent out from a browser needs > to be normalized. I agree that the text must be normalized. This is simply an architecture question; how and where to do it, not what to do. I'm going to tweak the comment slightly. > UChar TextEncoding::backslashAsCurrencySymbol() const > { > - if (m_flags & BackslashIsYen) > - return 0x00A5; // yen sign > - > - return '\\'; > + static const char* const a = > atomicCanonicalTextEncodingName("Shift_JIS_X0213-2000"); > + static const char* const b = atomicCanonicalTextEncodingName("EUC-JP"); > + return (m_name == a || m_name == b) ? 0x00A5 : '\\'; > +} > > It's quite weird that this function doesn't include the normal Shift_JIS > encoding - Shift_JIS_X0213-2000 isn't even recognized by other browsers. > Nothing to do with this patch, of course, but I find it hard to believe that > the whole backslashAsCurrencySymbol concept works as expected. I agree we should test that this works properly. I believe that Shift_JIS and Shift_JIS_X0213-2000 are both aliases for the same encoding. > Korean encodings have similar issues with backslash AFAIK. I've heard that too. At some point we may want to address that, which is why I used the term "currency symbol" instead of something like "yen". > Index: WebCore/platform/UChar.h > > Shouldn't the use of ICU here be ifdef'ed? Yes, eventually this header should be independent of ICU. I think that a future version of this header should define UChar without using ICU at all. I'm thinking it might include a <wtf/UChar.h> which might have a WTF::UChar definition. > +#include <Carbon/Carbon.h> > > I think it's enough to include CoreServices for TEC. OK. Fixed. > - m_response = DeprecatedString(); > + m_response = String(); > > This will need to be merged with a fix that I landed after you submitted this > patch, sorry (m_response is now initialized to an empty string). Will do. I am sad that we have both null strings and empty strings. I like to use null strings as little as possible, except that they are nearly unavoidable since they are the default value for a new uninitialized String variable!

Darin Adler

Comment 4 2006-09-05 09:29:07 PDT

Created attachment 10400 [details] new patch, improved and addresses Alexey's comments

Alexey Proskuryakov

Comment 5 2006-09-05 09:50:16 PDT

Comment on attachment 10400 [details] new patch, improved and addresses Alexey's comments r=me + , m_decoder(new Decoder("text/css" , charset)) This line has now got a space before comma. + if (!length) + return String(); Oops, I overlooked that TextCodecUTF16::decode() ignores its flush parameter. Could it perhaps be more consistent to output U+FFFD if there's an odd byte? Though it's hard to imagine any practical use for one (spotting errors?)

Darin Adler

Comment 6 2006-09-05 10:10:50 PDT

(In reply to comment #5) > (From update of attachment 10400 [details] [edit]) > + , m_decoder(new Decoder("text/css" , charset)) > > This line has now got a space before comma. Heh, that's left over from when I commented it out to make sure the new layout test failed without the change! I'll fix before landing. > Oops, I overlooked that TextCodecUTF16::decode() ignores its flush parameter. > Could it perhaps be more consistent to output U+FFFD if there's an odd byte? > Though it's hard to imagine any practical use for one (spotting errors?) Good point -- I'll leave that as-is for now.

Darin Adler

Comment 7 2006-09-05 21:35:18 PDT

Committed revision 16245.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 420+

Hardware Mac

OS OS X 10.4

Product WebKit

Component WebCore Misc.

Assignee

Darin Adler

Reported

2006-09-04 09:25 PDT

Modified

2006-09-05 21:35 PDT History

CC List

1 user Show

URL

Keywords

Depends on

Blocks