UNCONFIRMED Bug 118410
Segmentation fault occurred when ICU data library doesn't embed the expected encoding
https://bugs.webkit.org/show_bug.cgi?id=118410
Summary Segmentation fault occurred when ICU data library doesn't embed the expected ...
Benjamin Dupont
Reported 2013-07-05 04:45:22 PDT
With a full ICU data library (all checked in http://apps.icu-project.org/datacustom/, 22MB) there isn't the issue. With a light ICU data library (all unchecked in http://apps.icu-project.org/datacustom/, 327KB) if you browse an HTML page with a specific encoding (e.g Japanese web site) there is a segmentation fault: #0 0x0000000000000000 in ?? () #1 0x00007ffff5c78f7e in WebCore::newTextCodec(WebCore::TextEncoding const&) () from libQt5WebKit.so.5 #2 0x00007ffff5b3f75e in WebCore::TextResourceDecoder::decode(char const*, unsigned long) () from libQt5WebKit.so.5 #3 0x00007ffff5ad5239 in WebCore::CachedCSSStyleSheet::data(WTF::PassRefPtr<WebCore::ResourceBuffer>, bool) () from libQt5WebKit.so.5 #4 0x00007ffff5b3cfef in WebCore::SubresourceLoader::didFinishLoading(double) () from libQt5WebKit.so.5 #5 0x00007ffff5e9e6b2 in WebCore::QNetworkReplyHandler::finish() () from libQt5WebKit.so.5 #6 0x00007ffff5e9cd6f in WebCore::QNetworkReplyHandlerCallQueue::flush() () from libQt5WebKit.so.5 The problem is into the newTextCodec function from Source/WebCore/platform/text/TextEncodingRegistry.cpp. The textCodecMap hash map doesn't contain a factory for the Shift_JIS encoding thus factory.function is null.
Attachments
Segmentation fault occurred when ICU data library doesn't embed the expected encoding. (1.33 KB, patch)
2013-07-05 04:57 PDT, Benjamin Dupont
no flags
Use latin1 by default (1.46 KB, patch)
2013-07-05 05:49 PDT, Benjamin Dupont
no flags
Add aliases only for known encodings. (9.76 KB, patch)
2013-07-10 07:43 PDT, Benjamin Dupont
ap: review-
ap: commit-queue-
Benjamin Dupont
Comment 1 2013-07-05 04:57:03 PDT
Created attachment 206139 [details] Segmentation fault occurred when ICU data library doesn't embed the expected encoding. The problem is into the newTextCodec function from Source/WebCore/platform/text/TextEncodingRegistry.cpp. The textCodecMap hash map doesn't contain a factory for the Shift_JIS encoding thus factory.function is null. In this case, the first factory in the hash map is used. It's not a perfect solution, it's just to avoid the segmentation fault.
Benjamin Dupont
Comment 2 2013-07-05 05:49:24 PDT
Created attachment 206141 [details] Use latin1 by default
Allan Sandfeld Jensen
Comment 3 2013-07-05 07:41:04 PDT
Comment on attachment 206141 [details] Use latin1 by default LGTM
WebKit Commit Bot
Comment 4 2013-07-05 08:03:10 PDT
Comment on attachment 206141 [details] Use latin1 by default Clearing flags on attachment: 206141 Committed r152416: <http://trac.webkit.org/changeset/152416>
WebKit Commit Bot
Comment 5 2013-07-05 08:03:13 PDT
All reviewed patches have been landed. Closing bug.
Alexey Proskuryakov
Comment 6 2013-07-05 20:02:37 PDT
I think that this was the wrong fix. If your underlying library doesn't support a certain encoding, it simply shouldn't be in the map. What you did was make the code less stringent and more confusing for everyone.
Allan Sandfeld Jensen
Comment 7 2013-07-06 03:36:38 PDT
(In reply to comment #6) > I think that this was the wrong fix. If your underlying library doesn't support a certain encoding, it simply shouldn't be in the map. > What map? The encodings come from the encoding detection code. The problem here is that ICU has been build without some encodings that WebKit assumes are there, and crashes on if they are not.
Alexey Proskuryakov
Comment 8 2013-07-08 10:03:53 PDT
> What map? textCodecMap, obviously. Are there any other maps involved in this patch? > The encodings come from the encoding detection code. The problem here is that ICU has been build without some encodings that WebKit assumes are there, and crashes on if they are not. This is the root cause of the crash. WebKit shouldn't assume that, and it certainly shouldn't silently use latin-1 when thinking that it uses a different encoding. Doing this is extremely sloppy, and could even cause security vulnerabilities.
WebKit Commit Bot
Comment 9 2013-07-08 10:05:15 PDT
Re-opened since this is blocked by bug 118476
Alexey Proskuryakov
Comment 10 2013-07-08 10:10:34 PDT
One thing that may be non-obvious here is that TextEncoding::name() is not just some arbitrary string, it's always an encoding name previously registered by text encoding machinery. Refusing to provide a TextCodecFactory function while registering a codec is where the bad behavior is.
Benjamin Dupont
Comment 11 2013-07-09 07:24:57 PDT
(In reply to comment #10) > One thing that may be non-obvious here is that TextEncoding::name() is not just some arbitrary string, it's always an encoding name previously registered by text encoding machinery. Refusing to provide a TextCodecFactory function while registering a codec is where the bad behavior is. In this case the problem is due to http://trac.webkit.org/changeset/64820. In WebCore/platform/text/TextCodecICU.cpp into TextCodecICU::registerEncodingNames function, you add an alias for shift-jis (and others) even if ICU don't provide this encoding. How can we fix that? I see some dirty solution with 'if' but I'm wondering if the previous "fix" was not better for performance purpose. What do you think about that?
Alexey Proskuryakov
Comment 12 2013-07-09 09:23:08 PDT
Something like #if USE(REDUCED_ICU) would be fine. Or alternatively, we can check for codec existence when adding to the table, as opposed to when getting from it. However, I'm surprised that it's useful for anyone to ship without Shift_JIS, this is a very common encoding.
Julien Brianceau
Comment 13 2013-07-09 10:57:02 PDT
(In reply to comment #12) > However, I'm surprised that it's useful for anyone to ship without Shift_JIS, this is a very common encoding. Embedded systems with very limited flash: we can't afford a 20MB ICU library (which is larger that the webkit library itself).
Benjamin Dupont
Comment 14 2013-07-10 07:43:49 PDT
Created attachment 206390 [details] Add aliases only for known encodings. If you don't want use latin1 by default thus we must add aliases only for known encodings. In WebCore/platform/text/TextCodecICU.cpp into registerEncodingNames function I added this check.
Alexey Proskuryakov
Comment 15 2013-07-10 08:43:16 PDT
Comment on attachment 206390 [details] Add aliases only for known encodings. View in context: https://bugs.webkit.org/attachment.cgi?id=206390&action=review > Source/WebCore/ChangeLog:8 > + No new tests (OOPS!). The patch cannot be landed with this line, a pre-commit hook will block it. > Source/WebCore/platform/text/TextCodecICU.cpp:-72 > - registrar("ISO-8859-8-I", "ISO-8859-8-I"); I don't see how moving this line down can work. addToTextEncodingNameMap() gets an atomic name for encoding name, so if ISO-8859-8-I is registered after enumerating ICU aliases, it will just map to ISO-8859-8. That is exactly what we are trying to avoid. > Source/WebCore/platform/text/TextCodecICU.cpp:69 > + WTF::HashMap<const char*, char>knownEncodings; There shouldn't be a WTF prefix when using WTF names. Also, a missing space before knownEncodings. Didn't you mean to use HashSet? > Source/WebCore/platform/text/TextCodecICU.cpp:138 > + if (knownEncodings.contains("macintosh")) { This is not the right check. Since registrar uses an atomic name for standard name, it was fine to call it with any alias as second argument. But this check will fail.
Benjamin Dupont
Comment 16 2013-07-10 09:53:28 PDT
(In reply to comment #15) > (From update of attachment 206390 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=206390&action=review > > > Source/WebCore/ChangeLog:8 > > + No new tests (OOPS!). > > The patch cannot be landed with this line, a pre-commit hook will block it. Yes, I saw I forgot this line just after I have uploaded the patch... > > Source/WebCore/platform/text/TextCodecICU.cpp:-72 > > - registrar("ISO-8859-8-I", "ISO-8859-8-I"); > > I don't see how moving this line down can work. addToTextEncodingNameMap() gets an atomic name for encoding name, so if ISO-8859-8-I is registered after enumerating ICU aliases, it will just map to ISO-8859-8. That is exactly what we are trying to avoid. I didn't understand the existing comment and I can't match your explanation with the current implementation. e.g If we add first 1) ISO-8859-8: registrar("ISO-8859-8", "ISO-8859-8") -> addToTextEncodingNameMap a) isUndesiredAlias -> no b) textEncodingNameMap->get("ISO-8859-8") -> 0 -> atomicName = "ISO-8859-8" c) textEncodingNameMap->add("ISO-8859-8", "ISO-8859-8") 2) ISO-8859-8-I: registrar("ISO-8859-8-I", "ISO-8859-8-I") -> addToTextEncodingNameMap a) isUndesiredAlias -> no b) textEncodingNameMap->get("ISO-8859-8-I") -> 0 -> atomicName = "ISO-8859-8-I" c) textEncodingNameMap->add("ISO-8859-8-I", "ISO-8859-8-I") Thus after these 2 iterations my understanding is that the hash map should contain both encodings. Thanks in advance for your explanations. > > > Source/WebCore/platform/text/TextCodecICU.cpp:69 > > + WTF::HashMap<const char*, char>knownEncodings; > > There shouldn't be a WTF prefix when using WTF names. Also, a missing space before knownEncodings. > > Didn't you mean to use HashSet? > Oops, yes I used HashSet but It seems I had a problem during my merge. :) > > Source/WebCore/platform/text/TextCodecICU.cpp:138 > > + if (knownEncodings.contains("macintosh")) { > > This is not the right check. Since registrar uses an atomic name for standard name, it was fine to call it with any alias as second argument. But this check will fail. I think with the previous point explanation it will be clearer for this one also.
Alexey Proskuryakov
Comment 17 2013-07-10 10:31:51 PDT
> (In reply to comment #15) > 1) ISO-8859-8: > registrar("ISO-8859-8", "ISO-8859-8") -> addToTextEncodingNameMap > a) isUndesiredAlias -> no > b) textEncodingNameMap->get("ISO-8859-8") -> 0 -> atomicName = "ISO-8859-8" > c) textEncodingNameMap->add("ISO-8859-8", "ISO-8859-8") After this step, the code iterates and registers all ICU aliases for ISO-8859-8, which includes ISO-8859-8-I (see <http://demo.icu-project.org/icu-bin/convexp?s=IANA&s=MIME>). > 2) ISO-8859-8-I: > registrar("ISO-8859-8-I", "ISO-8859-8-I") -> addToTextEncodingNameMap > a) isUndesiredAlias -> no > b) textEncodingNameMap->get("ISO-8859-8-I") -> 0 -> atomicName = "ISO-8859-8-I" … so textEncodingNameMap->get("ISO-8859-8-I") will return "ISO-8859-8". > c) textEncodingNameMap->add("ISO-8859-8-I", "ISO-8859-8-I")
Alexey Proskuryakov
Comment 18 2013-07-10 10:33:26 PDT
Ultimately, we shouldn't fight with ICU encoding tables, but hardcode our own (see also bug 118505 comment 4).
Benjamin Dupont
Comment 19 2013-07-10 11:23:30 PDT
(In reply to comment #18) > Ultimately, we shouldn't fight with ICU encoding tables, but hardcode our own (see also bug 118505 comment 4). Thanks for explanation. Yes I also think it would be better. How was the table built before ICU? Nevertheless, isn't possible to keep the first patch to avoid the segmentation fault issue during the new table development?
Alexey Proskuryakov
Comment 20 2013-07-10 11:40:03 PDT
I now think that an ifdef would be the best short-term solution (possibly even on a branch if you have one). Making cross-platform code less robust would be unfortunate.
Benjamin Dupont
Comment 21 2013-07-10 13:25:55 PDT
(In reply to comment #20) > I now think that an ifdef would be the best short-term solution (possibly even on a branch if you have one). Making cross-platform code less robust would be unfortunate. An ifdef around this workaround (default latin1 if there isn't any function attached to the factory)?
Alexey Proskuryakov
Comment 22 2013-07-10 13:28:29 PDT
I would much prefer an ifdef around adding unsupported encodings.
Benjamin Dupont
Comment 23 2013-07-10 14:16:04 PDT
(In reply to comment #22) > I would much prefer an ifdef around adding unsupported encodings. In my opinion, I think the "default latin1" workaround (around with a #ifdef ICU_LIGHT) seems to be the most flexible way to use a custom ICU data library rather than to have #ifdef JAPANESE_SUPPORTED #ifdef CHINESE_SUPPORTED,... Furthermore, the final text rendering is the same. :)
Note You need to log in before you can comment on or make changes to this bug.