118410 – Segmentation fault occurred when ICU data library doesn't embed the expected encoding

UNCONFIRMED 118410

Segmentation fault occurred when ICU data library doesn't embed the expected encoding

https://bugs.webkit.org/show_bug.cgi?id=118410

Summary Segmentation fault occurred when ICU data library doesn't embed the expected ...

Benjamin Dupont

Reported 2013-07-05 04:45:22 PDT

With a full ICU data library (all checked in http://apps.icu-project.org/datacustom/, 22MB) there isn't the issue. With a light ICU data library (all unchecked in http://apps.icu-project.org/datacustom/, 327KB) if you browse an HTML page with a specific encoding (e.g Japanese web site) there is a segmentation fault: #0 0x0000000000000000 in ?? () #1 0x00007ffff5c78f7e in WebCore::newTextCodec(WebCore::TextEncoding const&) () from libQt5WebKit.so.5 #2 0x00007ffff5b3f75e in WebCore::TextResourceDecoder::decode(char const*, unsigned long) () from libQt5WebKit.so.5 #3 0x00007ffff5ad5239 in WebCore::CachedCSSStyleSheet::data(WTF::PassRefPtr<WebCore::ResourceBuffer>, bool) () from libQt5WebKit.so.5 #4 0x00007ffff5b3cfef in WebCore::SubresourceLoader::didFinishLoading(double) () from libQt5WebKit.so.5 #5 0x00007ffff5e9e6b2 in WebCore::QNetworkReplyHandler::finish() () from libQt5WebKit.so.5 #6 0x00007ffff5e9cd6f in WebCore::QNetworkReplyHandlerCallQueue::flush() () from libQt5WebKit.so.5 The problem is into the newTextCodec function from Source/WebCore/platform/text/TextEncodingRegistry.cpp. The textCodecMap hash map doesn't contain a factory for the Shift_JIS encoding thus factory.function is null.

Attachments
Segmentation fault occurred when ICU data library doesn't embed the expected encoding. (1.33 KB, patch) 2013-07-05 04:57 PDT, Benjamin Dupont	no flags	Details Formatted Diff Diff
Use latin1 by default (1.46 KB, patch) 2013-07-05 05:49 PDT, Benjamin Dupont	no flags	Details Formatted Diff Diff
Add aliases only for known encodings. (9.76 KB, patch) 2013-07-10 07:43 PDT, Benjamin Dupont	ap: review- ap: commit-queue-	Details Formatted Diff Diff
Show Obsolete (2) View All Add attachment proposed patch, testcase, etc.

Benjamin Dupont

Comment 1 2013-07-05 04:57:03 PDT

Created attachment 206139 [details] Segmentation fault occurred when ICU data library doesn't embed the expected encoding. The problem is into the newTextCodec function from Source/WebCore/platform/text/TextEncodingRegistry.cpp. The textCodecMap hash map doesn't contain a factory for the Shift_JIS encoding thus factory.function is null. In this case, the first factory in the hash map is used. It's not a perfect solution, it's just to avoid the segmentation fault.

Benjamin Dupont

Comment 2 2013-07-05 05:49:24 PDT

Created attachment 206141 [details] Use latin1 by default

Allan Sandfeld Jensen

Comment 3 2013-07-05 07:41:04 PDT

Comment on attachment 206141 [details] Use latin1 by default LGTM

WebKit Commit Bot

Comment 4 2013-07-05 08:03:10 PDT

Comment on attachment 206141 [details] Use latin1 by default Clearing flags on attachment: 206141 Committed r152416: <http://trac.webkit.org/changeset/152416>

WebKit Commit Bot

Comment 5 2013-07-05 08:03:13 PDT

All reviewed patches have been landed. Closing bug.

Alexey Proskuryakov

Comment 6 2013-07-05 20:02:37 PDT

I think that this was the wrong fix. If your underlying library doesn't support a certain encoding, it simply shouldn't be in the map. What you did was make the code less stringent and more confusing for everyone.

Allan Sandfeld Jensen

Comment 7 2013-07-06 03:36:38 PDT

(In reply to comment #6) > I think that this was the wrong fix. If your underlying library doesn't support a certain encoding, it simply shouldn't be in the map. > What map? The encodings come from the encoding detection code. The problem here is that ICU has been build without some encodings that WebKit assumes are there, and crashes on if they are not.

Alexey Proskuryakov

Comment 8 2013-07-08 10:03:53 PDT

> What map? textCodecMap, obviously. Are there any other maps involved in this patch? > The encodings come from the encoding detection code. The problem here is that ICU has been build without some encodings that WebKit assumes are there, and crashes on if they are not. This is the root cause of the crash. WebKit shouldn't assume that, and it certainly shouldn't silently use latin-1 when thinking that it uses a different encoding. Doing this is extremely sloppy, and could even cause security vulnerabilities.

WebKit Commit Bot

Comment 9 2013-07-08 10:05:15 PDT

Re-opened since this is blocked by bug 118476

Alexey Proskuryakov

Comment 10 2013-07-08 10:10:34 PDT

One thing that may be non-obvious here is that TextEncoding::name() is not just some arbitrary string, it's always an encoding name previously registered by text encoding machinery. Refusing to provide a TextCodecFactory function while registering a codec is where the bad behavior is.

Benjamin Dupont

Comment 11 2013-07-09 07:24:57 PDT

(In reply to comment #10) > One thing that may be non-obvious here is that TextEncoding::name() is not just some arbitrary string, it's always an encoding name previously registered by text encoding machinery. Refusing to provide a TextCodecFactory function while registering a codec is where the bad behavior is. In this case the problem is due to http://trac.webkit.org/changeset/64820. In WebCore/platform/text/TextCodecICU.cpp into TextCodecICU::registerEncodingNames function, you add an alias for shift-jis (and others) even if ICU don't provide this encoding. How can we fix that? I see some dirty solution with 'if' but I'm wondering if the previous "fix" was not better for performance purpose. What do you think about that?

Alexey Proskuryakov

Comment 12 2013-07-09 09:23:08 PDT

Something like #if USE(REDUCED_ICU) would be fine. Or alternatively, we can check for codec existence when adding to the table, as opposed to when getting from it. However, I'm surprised that it's useful for anyone to ship without Shift_JIS, this is a very common encoding.

Julien Brianceau

Comment 13 2013-07-09 10:57:02 PDT

(In reply to comment #12) > However, I'm surprised that it's useful for anyone to ship without Shift_JIS, this is a very common encoding. Embedded systems with very limited flash: we can't afford a 20MB ICU library (which is larger that the webkit library itself).

Benjamin Dupont

Comment 14 2013-07-10 07:43:49 PDT

Created attachment 206390 [details] Add aliases only for known encodings. If you don't want use latin1 by default thus we must add aliases only for known encodings. In WebCore/platform/text/TextCodecICU.cpp into registerEncodingNames function I added this check.

Alexey Proskuryakov

Comment 15 2013-07-10 08:43:16 PDT

Comment on attachment 206390 [details] Add aliases only for known encodings. View in context: https://bugs.webkit.org/attachment.cgi?id=206390&action=review > Source/WebCore/ChangeLog:8 > + No new tests (OOPS!). The patch cannot be landed with this line, a pre-commit hook will block it. > Source/WebCore/platform/text/TextCodecICU.cpp:-72 > - registrar("ISO-8859-8-I", "ISO-8859-8-I"); I don't see how moving this line down can work. addToTextEncodingNameMap() gets an atomic name for encoding name, so if ISO-8859-8-I is registered after enumerating ICU aliases, it will just map to ISO-8859-8. That is exactly what we are trying to avoid. > Source/WebCore/platform/text/TextCodecICU.cpp:69 > + WTF::HashMap<const char*, char>knownEncodings; There shouldn't be a WTF prefix when using WTF names. Also, a missing space before knownEncodings. Didn't you mean to use HashSet? > Source/WebCore/platform/text/TextCodecICU.cpp:138 > + if (knownEncodings.contains("macintosh")) { This is not the right check. Since registrar uses an atomic name for standard name, it was fine to call it with any alias as second argument. But this check will fail.

Benjamin Dupont

Comment 16 2013-07-10 09:53:28 PDT

(In reply to comment #15) > (From update of attachment 206390 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=206390&action=review > > > Source/WebCore/ChangeLog:8 > > + No new tests (OOPS!). > > The patch cannot be landed with this line, a pre-commit hook will block it. Yes, I saw I forgot this line just after I have uploaded the patch... > > Source/WebCore/platform/text/TextCodecICU.cpp:-72 > > - registrar("ISO-8859-8-I", "ISO-8859-8-I"); > > I don't see how moving this line down can work. addToTextEncodingNameMap() gets an atomic name for encoding name, so if ISO-8859-8-I is registered after enumerating ICU aliases, it will just map to ISO-8859-8. That is exactly what we are trying to avoid. I didn't understand the existing comment and I can't match your explanation with the current implementation. e.g If we add first 1) ISO-8859-8: registrar("ISO-8859-8", "ISO-8859-8") -> addToTextEncodingNameMap a) isUndesiredAlias -> no b) textEncodingNameMap->get("ISO-8859-8") -> 0 -> atomicName = "ISO-8859-8" c) textEncodingNameMap->add("ISO-8859-8", "ISO-8859-8") 2) ISO-8859-8-I: registrar("ISO-8859-8-I", "ISO-8859-8-I") -> addToTextEncodingNameMap a) isUndesiredAlias -> no b) textEncodingNameMap->get("ISO-8859-8-I") -> 0 -> atomicName = "ISO-8859-8-I" c) textEncodingNameMap->add("ISO-8859-8-I", "ISO-8859-8-I") Thus after these 2 iterations my understanding is that the hash map should contain both encodings. Thanks in advance for your explanations. > > > Source/WebCore/platform/text/TextCodecICU.cpp:69 > > + WTF::HashMap<const char*, char>knownEncodings; > > There shouldn't be a WTF prefix when using WTF names. Also, a missing space before knownEncodings. > > Didn't you mean to use HashSet? > Oops, yes I used HashSet but It seems I had a problem during my merge. :) > > Source/WebCore/platform/text/TextCodecICU.cpp:138 > > + if (knownEncodings.contains("macintosh")) { > > This is not the right check. Since registrar uses an atomic name for standard name, it was fine to call it with any alias as second argument. But this check will fail. I think with the previous point explanation it will be clearer for this one also.

Alexey Proskuryakov

Comment 17 2013-07-10 10:31:51 PDT

> (In reply to comment #15) > 1) ISO-8859-8: > registrar("ISO-8859-8", "ISO-8859-8") -> addToTextEncodingNameMap > a) isUndesiredAlias -> no > b) textEncodingNameMap->get("ISO-8859-8") -> 0 -> atomicName = "ISO-8859-8" > c) textEncodingNameMap->add("ISO-8859-8", "ISO-8859-8") After this step, the code iterates and registers all ICU aliases for ISO-8859-8, which includes ISO-8859-8-I (see <http://demo.icu-project.org/icu-bin/convexp?s=IANA&s=MIME>). > 2) ISO-8859-8-I: > registrar("ISO-8859-8-I", "ISO-8859-8-I") -> addToTextEncodingNameMap > a) isUndesiredAlias -> no > b) textEncodingNameMap->get("ISO-8859-8-I") -> 0 -> atomicName = "ISO-8859-8-I" … so textEncodingNameMap->get("ISO-8859-8-I") will return "ISO-8859-8". > c) textEncodingNameMap->add("ISO-8859-8-I", "ISO-8859-8-I")

Alexey Proskuryakov

Comment 18 2013-07-10 10:33:26 PDT

Ultimately, we shouldn't fight with ICU encoding tables, but hardcode our own (see also bug 118505 comment 4).

Benjamin Dupont

Comment 19 2013-07-10 11:23:30 PDT

(In reply to comment #18) > Ultimately, we shouldn't fight with ICU encoding tables, but hardcode our own (see also bug 118505 comment 4). Thanks for explanation. Yes I also think it would be better. How was the table built before ICU? Nevertheless, isn't possible to keep the first patch to avoid the segmentation fault issue during the new table development?

Alexey Proskuryakov

Comment 20 2013-07-10 11:40:03 PDT

I now think that an ifdef would be the best short-term solution (possibly even on a branch if you have one). Making cross-platform code less robust would be unfortunate.

Benjamin Dupont

Comment 21 2013-07-10 13:25:55 PDT

(In reply to comment #20) > I now think that an ifdef would be the best short-term solution (possibly even on a branch if you have one). Making cross-platform code less robust would be unfortunate. An ifdef around this workaround (default latin1 if there isn't any function attached to the factory)?

Alexey Proskuryakov

Comment 22 2013-07-10 13:28:29 PDT

I would much prefer an ifdef around adding unsupported encodings.

Benjamin Dupont

Comment 23 2013-07-10 14:16:04 PDT

(In reply to comment #22) > I would much prefer an ifdef around adding unsupported encodings. In my opinion, I think the "default latin1" workaround (around with a #ifdef ICU_LIGHT) seems to be the most flexible way to use a custom ICU data library rather than to have #ifdef JAPANESE_SUPPORTED #ifdef CHINESE_SUPPORTED,... Furthermore, the final text rendering is the same. :)

Note You need to log in before you can comment on or make changes to this bug.

Status UNCONFIRMED

Resolution

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware All

OS Unspecified

Product WebKit

Component WebCore Misc.

Assignee

Nobody

Reported

2013-07-05 04:45 PDT

Modified

2013-07-10 14:16 PDT History

CC List

4 users Show

URL

Keywords

Depends on

118476

Blocks

110211

Dependencies

tree graph