With a full ICU data library (all checked in http://apps.icu-project.org/datacustom/, 22MB) there isn't the issue. With a light ICU data library (all unchecked in http://apps.icu-project.org/datacustom/, 327KB) if you browse an HTML page with a specific encoding (e.g Japanese web site) there is a segmentation fault: #0 0x0000000000000000 in ?? () #1 0x00007ffff5c78f7e in WebCore::newTextCodec(WebCore::TextEncoding const&) () from libQt5WebKit.so.5 #2 0x00007ffff5b3f75e in WebCore::TextResourceDecoder::decode(char const*, unsigned long) () from libQt5WebKit.so.5 #3 0x00007ffff5ad5239 in WebCore::CachedCSSStyleSheet::data(WTF::PassRefPtr<WebCore::ResourceBuffer>, bool) () from libQt5WebKit.so.5 #4 0x00007ffff5b3cfef in WebCore::SubresourceLoader::didFinishLoading(double) () from libQt5WebKit.so.5 #5 0x00007ffff5e9e6b2 in WebCore::QNetworkReplyHandler::finish() () from libQt5WebKit.so.5 #6 0x00007ffff5e9cd6f in WebCore::QNetworkReplyHandlerCallQueue::flush() () from libQt5WebKit.so.5 The problem is into the newTextCodec function from Source/WebCore/platform/text/TextEncodingRegistry.cpp. The textCodecMap hash map doesn't contain a factory for the Shift_JIS encoding thus factory.function is null.
Created attachment 206139 [details] Segmentation fault occurred when ICU data library doesn't embed the expected encoding. The problem is into the newTextCodec function from Source/WebCore/platform/text/TextEncodingRegistry.cpp. The textCodecMap hash map doesn't contain a factory for the Shift_JIS encoding thus factory.function is null. In this case, the first factory in the hash map is used. It's not a perfect solution, it's just to avoid the segmentation fault.
Created attachment 206141 [details] Use latin1 by default
Comment on attachment 206141 [details] Use latin1 by default LGTM
Comment on attachment 206141 [details] Use latin1 by default Clearing flags on attachment: 206141 Committed r152416: <http://trac.webkit.org/changeset/152416>
All reviewed patches have been landed. Closing bug.
I think that this was the wrong fix. If your underlying library doesn't support a certain encoding, it simply shouldn't be in the map. What you did was make the code less stringent and more confusing for everyone.
(In reply to comment #6) > I think that this was the wrong fix. If your underlying library doesn't support a certain encoding, it simply shouldn't be in the map. > What map? The encodings come from the encoding detection code. The problem here is that ICU has been build without some encodings that WebKit assumes are there, and crashes on if they are not.
> What map? textCodecMap, obviously. Are there any other maps involved in this patch? > The encodings come from the encoding detection code. The problem here is that ICU has been build without some encodings that WebKit assumes are there, and crashes on if they are not. This is the root cause of the crash. WebKit shouldn't assume that, and it certainly shouldn't silently use latin-1 when thinking that it uses a different encoding. Doing this is extremely sloppy, and could even cause security vulnerabilities.
Re-opened since this is blocked by bug 118476
One thing that may be non-obvious here is that TextEncoding::name() is not just some arbitrary string, it's always an encoding name previously registered by text encoding machinery. Refusing to provide a TextCodecFactory function while registering a codec is where the bad behavior is.
(In reply to comment #10) > One thing that may be non-obvious here is that TextEncoding::name() is not just some arbitrary string, it's always an encoding name previously registered by text encoding machinery. Refusing to provide a TextCodecFactory function while registering a codec is where the bad behavior is. In this case the problem is due to http://trac.webkit.org/changeset/64820. In WebCore/platform/text/TextCodecICU.cpp into TextCodecICU::registerEncodingNames function, you add an alias for shift-jis (and others) even if ICU don't provide this encoding. How can we fix that? I see some dirty solution with 'if' but I'm wondering if the previous "fix" was not better for performance purpose. What do you think about that?
Something like #if USE(REDUCED_ICU) would be fine. Or alternatively, we can check for codec existence when adding to the table, as opposed to when getting from it. However, I'm surprised that it's useful for anyone to ship without Shift_JIS, this is a very common encoding.
(In reply to comment #12) > However, I'm surprised that it's useful for anyone to ship without Shift_JIS, this is a very common encoding. Embedded systems with very limited flash: we can't afford a 20MB ICU library (which is larger that the webkit library itself).
Created attachment 206390 [details] Add aliases only for known encodings. If you don't want use latin1 by default thus we must add aliases only for known encodings. In WebCore/platform/text/TextCodecICU.cpp into registerEncodingNames function I added this check.
Comment on attachment 206390 [details] Add aliases only for known encodings. View in context: https://bugs.webkit.org/attachment.cgi?id=206390&action=review > Source/WebCore/ChangeLog:8 > + No new tests (OOPS!). The patch cannot be landed with this line, a pre-commit hook will block it. > Source/WebCore/platform/text/TextCodecICU.cpp:-72 > - registrar("ISO-8859-8-I", "ISO-8859-8-I"); I don't see how moving this line down can work. addToTextEncodingNameMap() gets an atomic name for encoding name, so if ISO-8859-8-I is registered after enumerating ICU aliases, it will just map to ISO-8859-8. That is exactly what we are trying to avoid. > Source/WebCore/platform/text/TextCodecICU.cpp:69 > + WTF::HashMap<const char*, char>knownEncodings; There shouldn't be a WTF prefix when using WTF names. Also, a missing space before knownEncodings. Didn't you mean to use HashSet? > Source/WebCore/platform/text/TextCodecICU.cpp:138 > + if (knownEncodings.contains("macintosh")) { This is not the right check. Since registrar uses an atomic name for standard name, it was fine to call it with any alias as second argument. But this check will fail.
(In reply to comment #15) > (From update of attachment 206390 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=206390&action=review > > > Source/WebCore/ChangeLog:8 > > + No new tests (OOPS!). > > The patch cannot be landed with this line, a pre-commit hook will block it. Yes, I saw I forgot this line just after I have uploaded the patch... > > Source/WebCore/platform/text/TextCodecICU.cpp:-72 > > - registrar("ISO-8859-8-I", "ISO-8859-8-I"); > > I don't see how moving this line down can work. addToTextEncodingNameMap() gets an atomic name for encoding name, so if ISO-8859-8-I is registered after enumerating ICU aliases, it will just map to ISO-8859-8. That is exactly what we are trying to avoid. I didn't understand the existing comment and I can't match your explanation with the current implementation. e.g If we add first 1) ISO-8859-8: registrar("ISO-8859-8", "ISO-8859-8") -> addToTextEncodingNameMap a) isUndesiredAlias -> no b) textEncodingNameMap->get("ISO-8859-8") -> 0 -> atomicName = "ISO-8859-8" c) textEncodingNameMap->add("ISO-8859-8", "ISO-8859-8") 2) ISO-8859-8-I: registrar("ISO-8859-8-I", "ISO-8859-8-I") -> addToTextEncodingNameMap a) isUndesiredAlias -> no b) textEncodingNameMap->get("ISO-8859-8-I") -> 0 -> atomicName = "ISO-8859-8-I" c) textEncodingNameMap->add("ISO-8859-8-I", "ISO-8859-8-I") Thus after these 2 iterations my understanding is that the hash map should contain both encodings. Thanks in advance for your explanations. > > > Source/WebCore/platform/text/TextCodecICU.cpp:69 > > + WTF::HashMap<const char*, char>knownEncodings; > > There shouldn't be a WTF prefix when using WTF names. Also, a missing space before knownEncodings. > > Didn't you mean to use HashSet? > Oops, yes I used HashSet but It seems I had a problem during my merge. :) > > Source/WebCore/platform/text/TextCodecICU.cpp:138 > > + if (knownEncodings.contains("macintosh")) { > > This is not the right check. Since registrar uses an atomic name for standard name, it was fine to call it with any alias as second argument. But this check will fail. I think with the previous point explanation it will be clearer for this one also.
> (In reply to comment #15) > 1) ISO-8859-8: > registrar("ISO-8859-8", "ISO-8859-8") -> addToTextEncodingNameMap > a) isUndesiredAlias -> no > b) textEncodingNameMap->get("ISO-8859-8") -> 0 -> atomicName = "ISO-8859-8" > c) textEncodingNameMap->add("ISO-8859-8", "ISO-8859-8") After this step, the code iterates and registers all ICU aliases for ISO-8859-8, which includes ISO-8859-8-I (see <http://demo.icu-project.org/icu-bin/convexp?s=IANA&s=MIME>). > 2) ISO-8859-8-I: > registrar("ISO-8859-8-I", "ISO-8859-8-I") -> addToTextEncodingNameMap > a) isUndesiredAlias -> no > b) textEncodingNameMap->get("ISO-8859-8-I") -> 0 -> atomicName = "ISO-8859-8-I" … so textEncodingNameMap->get("ISO-8859-8-I") will return "ISO-8859-8". > c) textEncodingNameMap->add("ISO-8859-8-I", "ISO-8859-8-I")
Ultimately, we shouldn't fight with ICU encoding tables, but hardcode our own (see also bug 118505 comment 4).
(In reply to comment #18) > Ultimately, we shouldn't fight with ICU encoding tables, but hardcode our own (see also bug 118505 comment 4). Thanks for explanation. Yes I also think it would be better. How was the table built before ICU? Nevertheless, isn't possible to keep the first patch to avoid the segmentation fault issue during the new table development?
I now think that an ifdef would be the best short-term solution (possibly even on a branch if you have one). Making cross-platform code less robust would be unfortunate.
(In reply to comment #20) > I now think that an ifdef would be the best short-term solution (possibly even on a branch if you have one). Making cross-platform code less robust would be unfortunate. An ifdef around this workaround (default latin1 if there isn't any function attached to the factory)?
I would much prefer an ifdef around adding unsupported encodings.
(In reply to comment #22) > I would much prefer an ifdef around adding unsupported encodings. In my opinion, I think the "default latin1" workaround (around with a #ifdef ICU_LIGHT) seems to be the most flexible way to use a custom ICU data library rather than to have #ifdef JAPANESE_SUPPORTED #ifdef CHINESE_SUPPORTED,... Furthermore, the final text rendering is the same. :)