159581 – Stop supporting compressed character sets BOCU-1 and SCSU

RESOLVED FIXED 159581

Stop supporting compressed character sets BOCU-1 and SCSU

https://bugs.webkit.org/show_bug.cgi?id=159581

Summary Stop supporting compressed character sets BOCU-1 and SCSU

John Wilander

Reported 2016-07-08 13:53:50 PDT

WebKit still supports the compressed character sets BOCU-1 and SCSU whereas Chrome and Firefox don't. These old formats may pass server-side character filters while still rendering in WebKit.

Attachments
Patch (6.94 KB, patch) 2016-07-08 16:36 PDT, John Wilander	no flags	Details Formatted Diff Diff
Patch (7.76 KB, patch) 2016-07-08 16:42 PDT, John Wilander	no flags	Details Formatted Diff Diff
Patch (8.36 KB, patch) 2016-07-11 14:12 PDT, John Wilander	no flags	Details Formatted Diff Diff
Show Obsolete (2) View All Add attachment proposed patch, testcase, etc.

Myles C. Maxfield

Comment 1 2016-07-08 14:55:34 PDT

We shouldn't remove support for this without knowing the impact of which/how many sites will break.

John Wilander

Comment 2 2016-07-08 16:36:59 PDT

Created attachment 283218 [details] Patch

John Wilander

Comment 3 2016-07-08 16:40:37 PDT

Comment on attachment 283218 [details] Patch webkit-patch complained about the characters in one of the test file's names and so I changed it but didn't add it back in. New patch coming up.

John Wilander

Comment 4 2016-07-08 16:42:12 PDT

Created attachment 283221 [details] Patch

John Wilander

Comment 5 2016-07-08 16:44:49 PDT

Subversion decided to fiddle with the test file. I might have to pull it down in whatever format it is and see that it tests the right thing.

Alexey Proskuryakov

Comment 6 2016-07-08 17:15:35 PDT

Comment on attachment 283221 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=283221&action=review > LayoutTests/ChangeLog:17 > + * http/tests/misc/resources/scsu-moscow.html: Added. A binary file wouldn't be scanned for a <meta>. Do you need the charset to be defined by HTTP Content-Type for this test? > LayoutTests/http/tests/misc/resources/bocu-1.html:7 > +%8C%C3%B3%C2%B9%C0%C4%8E%B1%BC%B5%C2%C4x%81y%8C%7F%C3%B3%C2%B9%C0%C4%8E When I open this file in shipping Safari, I get some garbage text. What is it expected to be?

John Wilander

Comment 7 2016-07-11 08:51:23 PDT

(In reply to comment #6) > Comment on attachment 283221 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=283221&action=review > > > LayoutTests/ChangeLog:17 > > + * http/tests/misc/resources/scsu-moscow.html: Added. > > A binary file wouldn't be scanned for a <meta>. Do you need the charset to > be defined by HTTP Content-Type for this test? > > > LayoutTests/http/tests/misc/resources/bocu-1.html:7 > > +%8C%C3%B3%C2%B9%C0%C4%8E%B1%BC%B5%C2%C4x%81y%8C%7F%C3%B3%C2%B9%C0%C4%8E > > When I open this file in shipping Safari, I get some garbage text. What is > it expected to be? It was an attempt to use a PoC string for BOCU-1 but I decided to not spend more time on it. Without support for BOCU-1 you will see the actual characters (%8C…). But you are right, either both test cases should show something readable if the charsets *are* supported or if they are *not* supported. Not a mix as it is now.

John Wilander

Comment 8 2016-07-11 08:56:32 PDT

(In reply to comment #6) > Comment on attachment 283221 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=283221&action=review > > > LayoutTests/ChangeLog:17 > > + * http/tests/misc/resources/scsu-moscow.html: Added. > > A binary file wouldn't be scanned for a <meta>. Do you need the charset to > be defined by HTTP Content-Type for this test? Sorry, should reply to this too. It might be that I have to set the charset to something to get it to pick up the meta charset. That is the behavior I saw – a HTTP Content-Type "text/html; charset=bogus" plus a <meta charset=SCSU>. I will switch to a PHP test case and do this.

John Wilander

Comment 9 2016-07-11 09:00:50 PDT

I will add a reference to the HTML spec in the new patch. It explicitly says we should not support BOCU-1 nor SCSU: "The above prohibits supporting, for example, CESU-8, UTF-7, BOCU-1, SCSU, EBCDIC, and UTF-32." https://html.spec.whatwg.org/#character-encodings We support a bunch of EBCDIC charsets, CESU-8, and UTF-32 too. Either we open a new bug for those or bundle into this patch.

Alexey Proskuryakov

Comment 10 2016-07-11 09:55:11 PDT

Better to do multiple patches - it's easier to find site compatibility regression culprits when changes aren't bundled.

John Wilander

Comment 11 2016-07-11 14:12:41 PDT

Created attachment 283341 [details] Patch

John Wilander

Comment 12 2016-07-11 16:00:31 PDT

(In reply to comment #10) > Better to do multiple patches - it's easier to find site compatibility > regression culprits when changes aren't bundled. Filed https://bugs.webkit.org/show_bug.cgi?id=159651.

Brent Fulgham

Comment 13 2016-07-25 16:26:49 PDT

(In reply to comment #1) > We shouldn't remove support for this without knowing the impact of which/how > many sites will break. Given that: (1) Chrome and Firefox already block these characters sets (2) The HTML spec explicitly prohibits these character sets ... why are we concerned there is a compatibility issue here?

Brent Fulgham

Comment 14 2016-07-25 16:29:08 PDT

Comment on attachment 283341 [details] Patch r=me. Do you plan on adding the other "forbidden" character sets in a future patch?

John Wilander

Comment 15 2016-07-25 16:58:42 PDT

(In reply to comment #14) > Comment on attachment 283341 [details] > Patch > > r=me. Do you plan on adding the other "forbidden" character sets in a future > patch? Yes, https://bugs.webkit.org/show_bug.cgi?id=159651.

Myles C. Maxfield

Comment 16 2016-07-25 17:54:04 PDT

(In reply to comment #13) > (In reply to comment #1) > > We shouldn't remove support for this without knowing the impact of which/how > > many sites will break. > > Given that: > (1) Chrome and Firefox already block these characters sets Historically, this has been an imperfect signal for these kinds of decisions. I know at least one case where we refuse to break sites which all the other browsers break. > (2) The HTML spec explicitly prohibits these character sets The spec is a poor signal for compatibility. > > ... why are we concerned there is a compatibility issue here?

Brent Fulgham

Comment 17 2016-07-25 19:48:16 PDT

(In reply to comment #16) > (In reply to comment #13) > > (In reply to comment #1) > > > We shouldn't remove support for this without knowing the impact of which/how > > > many sites will break. > > > > Given that: > > (1) Chrome and Firefox already block these characters sets > > Historically, this has been an imperfect signal for these kinds of > decisions. I know at least one case where we refuse to break sites which all > the other Can you identify any websites that require these character sets? Who would we break with this change?

Myles C. Maxfield

Comment 18 2016-07-25 21:59:13 PDT

(In reply to comment #17) > (In reply to comment #16) > > (In reply to comment #13) > > > (In reply to comment #1) > > > > We shouldn't remove support for this without knowing the impact of which/how > > > > many sites will break. > > > > > > Given that: > > > (1) Chrome and Firefox already block these characters sets > > > > Historically, this has been an imperfect signal for these kinds of > > decisions. I know at least one case where we refuse to break sites which all > > the other > > Can you identify any websites that require these character sets? Who would > we break with this change? This is exactly the question I asked way above (in the second comment). The burden of proof should be on the one removing functionality, not the other way around. It seems to me we shouldn't remove functionality if we have no idea what will break.

Myles C. Maxfield

Comment 19 2016-07-25 22:03:39 PDT

(In reply to comment #18) > (In reply to comment #17) > > (In reply to comment #16) > > > (In reply to comment #13) > > > > (In reply to comment #1) > > > > > We shouldn't remove support for this without knowing the impact of which/how > > > > > many sites will break. > > > > > > > > Given that: > > > > (1) Chrome and Firefox already block these characters sets > > > > > > Historically, this has been an imperfect signal for these kinds of > > > decisions. I know at least one case where we refuse to break sites which all > > > the other > > > > Can you identify any websites that require these character sets? Who would > > we break with this change? > > This is exactly the question I asked way above (in the second comment). > > The burden of proof should be on the one removing functionality, not the > other way around. > > It seems to me we shouldn't remove functionality if we have no idea what > will break. I should probably clarify this: Removing functionality might have a very serious impact, or maybe none at all. Improving security will have a valuable benefit. Without knowing more information, it seems difficult to perform the calculus of whether or not this change should be committed. I'm not against the change - I am definitely pro-security! I just hesitate when we are considering willingly breaking an unknown number of websites.

Brent Fulgham

Comment 20 2016-07-25 23:28:37 PDT

(In reply to comment #19) > > I'm not against the change - I am definitely pro-security! I just hesitate > when we are considering willingly breaking an unknown number of websites. The problem here is that we could imagine any number of changes that could possibly break a website. We need to weigh the security benefits against the risk of causing harm. As I see it, we have a change that: 1. Closes a known security hole. 2. Brings us into compatibility with how the other major browsers work. 3. Brings us into agreement with the specification. Those seem like three significant reasons to make this change. On the other side, we have the hypothetical risk of breaking websites that rely on these disallowed character sets. As far as I know, we are unaware of any such sites existing, nor is there compelling reason that someone would choose to implement a site relying on them. I also do not see a way to know whether websites will break with this change, since there is no catalog of websites using particular character sets. I think the right choice here is to land the change, and let users of Safari Technology Preview or the Gtk-based browsers report any problems they encounter.

Myles C. Maxfield

Comment 21 2016-07-26 00:32:16 PDT

> I also do not see a way to know whether websites will break with this > change We need a way to know the ramifications of our changes without breaking the product for our users. If such a mechanism does not exist right now, then all we can do is commit and cross our fingers.

WebKit Commit Bot

Comment 22 2016-07-26 09:58:35 PDT

Comment on attachment 283341 [details] Patch Clearing flags on attachment: 283341 Committed r203725: <http://trac.webkit.org/changeset/203725>

WebKit Commit Bot

Comment 23 2016-07-26 09:58:40 PDT

All reviewed patches have been landed. Closing bug.

Darin Adler

Comment 24 2016-07-26 10:38:52 PDT

Comment on attachment 283341 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=283341&action=review > Source/WebCore/ChangeLog:13 > + The HTML specification says "The above prohibits supporting, for example, > + CESU-8, UTF-7, BOCU-1, SCSU, EBCDIC, and UTF-32." Could we please add tests that cover these other examples of prohibited character encodings, even the ones that are not explicitly included in the list? Maybe we could find a way to write a single test that iterates a list of encoding names and checks that each is not supported; then we can add names for all sorts of character sets that we should not support.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware Unspecified

OS Unspecified

Product WebKit

Component Text

Assignee

John Wilander

Reported

2016-07-08 13:53 PDT

Modified

2016-07-26 10:38 PDT History

CC List

6 users Show

URL

Keywords

Depends on

Blocks