Bug 118410 - Segmentation fault occurred when ICU data library doesn't embed the expected encoding
Summary: Segmentation fault occurred when ICU data library doesn't embed the expected ...
Status: UNCONFIRMED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebCore Misc. (show other bugs)
Version: 528+ (Nightly build)
Hardware: All Unspecified
: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on: 118476
Blocks: 110211
  Show dependency treegraph
 
Reported: 2013-07-05 04:45 PDT by Benjamin Dupont
Modified: 2013-07-10 14:16 PDT (History)
4 users (show)

See Also:


Attachments
Segmentation fault occurred when ICU data library doesn't embed the expected encoding. (1.33 KB, patch)
2013-07-05 04:57 PDT, Benjamin Dupont
no flags Details | Formatted Diff | Diff
Use latin1 by default (1.46 KB, patch)
2013-07-05 05:49 PDT, Benjamin Dupont
no flags Details | Formatted Diff | Diff
Add aliases only for known encodings. (9.76 KB, patch)
2013-07-10 07:43 PDT, Benjamin Dupont
ap: review-
ap: commit-queue-
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Benjamin Dupont 2013-07-05 04:45:22 PDT
With a full ICU data library (all checked in http://apps.icu-project.org/datacustom/, 22MB) there isn't the issue.

With a light ICU data library (all unchecked in http://apps.icu-project.org/datacustom/, 327KB) if you browse an HTML page with a specific encoding (e.g Japanese web site) there is a segmentation fault:
#0  0x0000000000000000 in ?? ()
#1  0x00007ffff5c78f7e in WebCore::newTextCodec(WebCore::TextEncoding const&) () from libQt5WebKit.so.5
#2  0x00007ffff5b3f75e in WebCore::TextResourceDecoder::decode(char const*, unsigned long) () from libQt5WebKit.so.5
#3  0x00007ffff5ad5239 in WebCore::CachedCSSStyleSheet::data(WTF::PassRefPtr<WebCore::ResourceBuffer>, bool) () from libQt5WebKit.so.5
#4  0x00007ffff5b3cfef in WebCore::SubresourceLoader::didFinishLoading(double) () from libQt5WebKit.so.5
#5  0x00007ffff5e9e6b2 in WebCore::QNetworkReplyHandler::finish() () from libQt5WebKit.so.5
#6  0x00007ffff5e9cd6f in WebCore::QNetworkReplyHandlerCallQueue::flush() () from libQt5WebKit.so.5

The problem is into the newTextCodec function from Source/WebCore/platform/text/TextEncodingRegistry.cpp. The textCodecMap hash map doesn't contain a factory for the Shift_JIS encoding thus factory.function is null.
Comment 1 Benjamin Dupont 2013-07-05 04:57:03 PDT
Created attachment 206139 [details]
Segmentation fault occurred when ICU data library doesn't embed the expected encoding.

The problem is into the newTextCodec function from Source/WebCore/platform/text/TextEncodingRegistry.cpp. The textCodecMap hash map doesn't contain a factory for the Shift_JIS encoding thus factory.function is null. 
In this case, the first factory in the hash map is used. It's not a perfect solution, it's just to avoid the segmentation fault.
Comment 2 Benjamin Dupont 2013-07-05 05:49:24 PDT
Created attachment 206141 [details]
Use latin1 by default
Comment 3 Allan Sandfeld Jensen 2013-07-05 07:41:04 PDT
Comment on attachment 206141 [details]
Use latin1 by default

LGTM
Comment 4 WebKit Commit Bot 2013-07-05 08:03:10 PDT
Comment on attachment 206141 [details]
Use latin1 by default

Clearing flags on attachment: 206141

Committed r152416: <http://trac.webkit.org/changeset/152416>
Comment 5 WebKit Commit Bot 2013-07-05 08:03:13 PDT
All reviewed patches have been landed.  Closing bug.
Comment 6 Alexey Proskuryakov 2013-07-05 20:02:37 PDT
I think that this was the wrong fix. If your underlying library doesn't support a certain encoding, it simply shouldn't be in the map. 

What you did was make the code less stringent and more confusing for everyone.
Comment 7 Allan Sandfeld Jensen 2013-07-06 03:36:38 PDT
(In reply to comment #6)
> I think that this was the wrong fix. If your underlying library doesn't support a certain encoding, it simply shouldn't be in the map. 
> 
What map? The encodings come from the encoding detection code. The problem here is that ICU has been build without some encodings that WebKit assumes are there, and crashes on if they are not.
Comment 8 Alexey Proskuryakov 2013-07-08 10:03:53 PDT
> What map?

textCodecMap, obviously. Are there any other maps involved in this patch?

> The encodings come from the encoding detection code. The problem here is that ICU has been build without some encodings that WebKit assumes are there, and crashes on if they are not.

This is the root cause of the crash. WebKit shouldn't assume that, and it certainly shouldn't silently use latin-1 when thinking that it uses a different encoding. Doing this is extremely sloppy, and could even cause security vulnerabilities.
Comment 9 WebKit Commit Bot 2013-07-08 10:05:15 PDT
Re-opened since this is blocked by bug 118476
Comment 10 Alexey Proskuryakov 2013-07-08 10:10:34 PDT
One thing that may be non-obvious here is that TextEncoding::name() is not just some arbitrary string, it's always an encoding name previously registered by text encoding machinery. Refusing to provide a TextCodecFactory function while registering a codec is where the bad behavior is.
Comment 11 Benjamin Dupont 2013-07-09 07:24:57 PDT
(In reply to comment #10)
> One thing that may be non-obvious here is that TextEncoding::name() is not just some arbitrary string, it's always an encoding name previously registered by text encoding machinery. Refusing to provide a TextCodecFactory function while registering a codec is where the bad behavior is.

In this case the problem is due to http://trac.webkit.org/changeset/64820.
In WebCore/platform/text/TextCodecICU.cpp into TextCodecICU::registerEncodingNames function, you add an alias for shift-jis (and others) even if ICU don't provide this encoding.

How can we fix that? I see some dirty solution with 'if' but I'm wondering if the previous "fix" was not better for performance purpose.

What do you think about that?
Comment 12 Alexey Proskuryakov 2013-07-09 09:23:08 PDT
Something like #if USE(REDUCED_ICU) would be fine. Or alternatively, we can check for codec existence when adding to the table, as opposed to when getting from it.

However, I'm surprised that it's useful for anyone to ship without Shift_JIS, this is a very common encoding.
Comment 13 Julien Brianceau 2013-07-09 10:57:02 PDT
(In reply to comment #12)
> However, I'm surprised that it's useful for anyone to ship without Shift_JIS, this is a very common encoding.
Embedded systems with very limited flash: we can't afford a 20MB ICU library (which is larger that the webkit library itself).
Comment 14 Benjamin Dupont 2013-07-10 07:43:49 PDT
Created attachment 206390 [details]
Add aliases only for known encodings.

If you don't want use latin1 by default thus we must add aliases only for known encodings.
In WebCore/platform/text/TextCodecICU.cpp into registerEncodingNames function I added this check.
Comment 15 Alexey Proskuryakov 2013-07-10 08:43:16 PDT
Comment on attachment 206390 [details]
Add aliases only for known encodings.

View in context: https://bugs.webkit.org/attachment.cgi?id=206390&action=review

> Source/WebCore/ChangeLog:8
> +        No new tests (OOPS!).

The patch cannot be landed with this line, a pre-commit hook will block it.

> Source/WebCore/platform/text/TextCodecICU.cpp:-72
> -    registrar("ISO-8859-8-I", "ISO-8859-8-I");

I don't see how moving this line down can work. addToTextEncodingNameMap() gets an atomic name for encoding name, so if ISO-8859-8-I is registered after enumerating ICU aliases, it will just map to ISO-8859-8. That is exactly what we are trying to avoid.

> Source/WebCore/platform/text/TextCodecICU.cpp:69
> +    WTF::HashMap<const char*, char>knownEncodings;

There shouldn't be a WTF prefix when using WTF names. Also, a missing space before knownEncodings.

Didn't you mean to use HashSet?

> Source/WebCore/platform/text/TextCodecICU.cpp:138
> +    if (knownEncodings.contains("macintosh")) {

This is not the right check. Since registrar uses an atomic name for standard name, it was fine to call it with any alias as second argument. But this check will fail.
Comment 16 Benjamin Dupont 2013-07-10 09:53:28 PDT
(In reply to comment #15)
> (From update of attachment 206390 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=206390&action=review
> 
> > Source/WebCore/ChangeLog:8
> > +        No new tests (OOPS!).
> 
> The patch cannot be landed with this line, a pre-commit hook will block it.
Yes, I saw I forgot this line just after I have uploaded the patch...

> > Source/WebCore/platform/text/TextCodecICU.cpp:-72
> > -    registrar("ISO-8859-8-I", "ISO-8859-8-I");
> 
> I don't see how moving this line down can work. addToTextEncodingNameMap() gets an atomic name for encoding name, so if ISO-8859-8-I is registered after enumerating ICU aliases, it will just map to ISO-8859-8. That is exactly what we are trying to avoid.
I didn't understand the existing comment and I can't match your explanation with the current implementation.
e.g If we add first 
1) ISO-8859-8:
registrar("ISO-8859-8", "ISO-8859-8") -> addToTextEncodingNameMap
a) isUndesiredAlias -> no
b) textEncodingNameMap->get("ISO-8859-8") -> 0 -> atomicName = "ISO-8859-8"
c) textEncodingNameMap->add("ISO-8859-8", "ISO-8859-8")

2) ISO-8859-8-I:
registrar("ISO-8859-8-I", "ISO-8859-8-I") -> addToTextEncodingNameMap
a) isUndesiredAlias -> no
b) textEncodingNameMap->get("ISO-8859-8-I") -> 0 -> atomicName = "ISO-8859-8-I"
c) textEncodingNameMap->add("ISO-8859-8-I", "ISO-8859-8-I")

Thus after these 2 iterations my understanding is that the hash map should contain both encodings.

Thanks in advance for your explanations.

> 
> > Source/WebCore/platform/text/TextCodecICU.cpp:69
> > +    WTF::HashMap<const char*, char>knownEncodings;
> 
> There shouldn't be a WTF prefix when using WTF names. Also, a missing space before knownEncodings.
> 
> Didn't you mean to use HashSet?
> 
Oops, yes I used HashSet but It seems I had a problem during my merge. :)

> > Source/WebCore/platform/text/TextCodecICU.cpp:138
> > +    if (knownEncodings.contains("macintosh")) {
> 
> This is not the right check. Since registrar uses an atomic name for standard name, it was fine to call it with any alias as second argument. But this check will fail.
I think with the previous point explanation it will be clearer for this one also.
Comment 17 Alexey Proskuryakov 2013-07-10 10:31:51 PDT
> (In reply to comment #15)
> 1) ISO-8859-8:
> registrar("ISO-8859-8", "ISO-8859-8") -> addToTextEncodingNameMap
> a) isUndesiredAlias -> no
> b) textEncodingNameMap->get("ISO-8859-8") -> 0 -> atomicName = "ISO-8859-8"
> c) textEncodingNameMap->add("ISO-8859-8", "ISO-8859-8")

After this step, the code iterates and registers all ICU aliases for ISO-8859-8, which includes ISO-8859-8-I (see <http://demo.icu-project.org/icu-bin/convexp?s=IANA&s=MIME>).

> 2) ISO-8859-8-I:
> registrar("ISO-8859-8-I", "ISO-8859-8-I") -> addToTextEncodingNameMap
> a) isUndesiredAlias -> no
> b) textEncodingNameMap->get("ISO-8859-8-I") -> 0 -> atomicName = "ISO-8859-8-I"

… so textEncodingNameMap->get("ISO-8859-8-I") will return "ISO-8859-8".

> c) textEncodingNameMap->add("ISO-8859-8-I", "ISO-8859-8-I")
Comment 18 Alexey Proskuryakov 2013-07-10 10:33:26 PDT
Ultimately, we shouldn't fight with ICU encoding tables, but hardcode our own (see also bug 118505 comment 4).
Comment 19 Benjamin Dupont 2013-07-10 11:23:30 PDT
(In reply to comment #18)
> Ultimately, we shouldn't fight with ICU encoding tables, but hardcode our own (see also bug 118505 comment 4).
Thanks for explanation.

Yes I also think it would be better. How was the table built before ICU?

Nevertheless, isn't possible to keep the first patch to avoid the segmentation fault issue during the new table development?
Comment 20 Alexey Proskuryakov 2013-07-10 11:40:03 PDT
I now think that an ifdef would be the best short-term solution (possibly even on a branch if you have one). Making cross-platform code less robust would be unfortunate.
Comment 21 Benjamin Dupont 2013-07-10 13:25:55 PDT
(In reply to comment #20)
> I now think that an ifdef would be the best short-term solution (possibly even on a branch if you have one). Making cross-platform code less robust would be unfortunate.
An ifdef around this workaround (default latin1 if there isn't any function attached to the factory)?
Comment 22 Alexey Proskuryakov 2013-07-10 13:28:29 PDT
I would much prefer an ifdef around adding unsupported encodings.
Comment 23 Benjamin Dupont 2013-07-10 14:16:04 PDT
(In reply to comment #22)
> I would much prefer an ifdef around adding unsupported encodings.
In my opinion, I think the "default latin1" workaround (around with a #ifdef ICU_LIGHT) seems to be the most flexible way to use a custom ICU data library rather than to have #ifdef JAPANESE_SUPPORTED #ifdef CHINESE_SUPPORTED,... Furthermore, the final text rendering is the same. :)