Bug 159581 - Stop supporting compressed character sets BOCU-1 and SCSU
Summary: Stop supporting compressed character sets BOCU-1 and SCSU
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Text (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: John Wilander
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-07-08 13:53 PDT by John Wilander
Modified: 2016-07-26 10:38 PDT (History)
6 users (show)

See Also:


Attachments
Patch (6.94 KB, patch)
2016-07-08 16:36 PDT, John Wilander
no flags Details | Formatted Diff | Diff
Patch (7.76 KB, patch)
2016-07-08 16:42 PDT, John Wilander
no flags Details | Formatted Diff | Diff
Patch (8.36 KB, patch)
2016-07-11 14:12 PDT, John Wilander
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description John Wilander 2016-07-08 13:53:50 PDT
WebKit still supports the compressed character sets BOCU-1 and SCSU whereas Chrome and Firefox don't. These old formats may pass server-side character filters while still rendering in WebKit.
Comment 1 Myles C. Maxfield 2016-07-08 14:55:34 PDT
We shouldn't remove support for this without knowing the impact of which/how many sites will break.
Comment 2 John Wilander 2016-07-08 16:36:59 PDT
Created attachment 283218 [details]
Patch
Comment 3 John Wilander 2016-07-08 16:40:37 PDT
Comment on attachment 283218 [details]
Patch

webkit-patch complained about the characters in one of the test file's names and so I changed it but didn't add it back in. New patch coming up.
Comment 4 John Wilander 2016-07-08 16:42:12 PDT
Created attachment 283221 [details]
Patch
Comment 5 John Wilander 2016-07-08 16:44:49 PDT
Subversion decided to fiddle with the test file. I might have to pull it down in whatever format it is and see that it tests the right thing.
Comment 6 Alexey Proskuryakov 2016-07-08 17:15:35 PDT
Comment on attachment 283221 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=283221&action=review

> LayoutTests/ChangeLog:17
> +        * http/tests/misc/resources/scsu-moscow.html: Added.

A binary file wouldn't be scanned for a <meta>. Do you need the charset to be defined by HTTP Content-Type for this test?

> LayoutTests/http/tests/misc/resources/bocu-1.html:7
> +%8C%C3%B3%C2%B9%C0%C4%8E%B1%BC%B5%C2%C4x%81y%8C%7F%C3%B3%C2%B9%C0%C4%8E

When I open this file in shipping Safari, I get some garbage text. What is it expected to be?
Comment 7 John Wilander 2016-07-11 08:51:23 PDT
(In reply to comment #6)
> Comment on attachment 283221 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=283221&action=review
> 
> > LayoutTests/ChangeLog:17
> > +        * http/tests/misc/resources/scsu-moscow.html: Added.
> 
> A binary file wouldn't be scanned for a <meta>. Do you need the charset to
> be defined by HTTP Content-Type for this test?
> 
> > LayoutTests/http/tests/misc/resources/bocu-1.html:7
> > +%8C%C3%B3%C2%B9%C0%C4%8E%B1%BC%B5%C2%C4x%81y%8C%7F%C3%B3%C2%B9%C0%C4%8E
> 
> When I open this file in shipping Safari, I get some garbage text. What is
> it expected to be?

It was an attempt to use a PoC string for BOCU-1 but I decided to not spend more time on it. Without support for BOCU-1 you will see the actual characters (%8C…).

But you are right, either both test cases should show something readable if the charsets *are* supported or if they are *not* supported. Not a mix as it is now.
Comment 8 John Wilander 2016-07-11 08:56:32 PDT
(In reply to comment #6)
> Comment on attachment 283221 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=283221&action=review
> 
> > LayoutTests/ChangeLog:17
> > +        * http/tests/misc/resources/scsu-moscow.html: Added.
> 
> A binary file wouldn't be scanned for a <meta>. Do you need the charset to
> be defined by HTTP Content-Type for this test?

Sorry, should reply to this too. It might be that I have to set the charset to something to get it to pick up the meta charset. That is the behavior I saw – a HTTP Content-Type "text/html; charset=bogus" plus a <meta charset=SCSU>. I will switch to a PHP test case and do this.
Comment 9 John Wilander 2016-07-11 09:00:50 PDT
I will add a reference to the HTML spec in the new patch. It explicitly says we should not support BOCU-1 nor SCSU:

"The above prohibits supporting, for example, CESU-8, UTF-7, BOCU-1, SCSU, EBCDIC, and UTF-32."
https://html.spec.whatwg.org/#character-encodings

We support a bunch of EBCDIC charsets, CESU-8, and UTF-32 too. Either we open a new bug for those or bundle into this patch.
Comment 10 Alexey Proskuryakov 2016-07-11 09:55:11 PDT
Better to do multiple patches - it's easier to find site compatibility regression culprits when changes aren't bundled.
Comment 11 John Wilander 2016-07-11 14:12:41 PDT
Created attachment 283341 [details]
Patch
Comment 12 John Wilander 2016-07-11 16:00:31 PDT
(In reply to comment #10)
> Better to do multiple patches - it's easier to find site compatibility
> regression culprits when changes aren't bundled.

Filed https://bugs.webkit.org/show_bug.cgi?id=159651.
Comment 13 Brent Fulgham 2016-07-25 16:26:49 PDT
(In reply to comment #1)
> We shouldn't remove support for this without knowing the impact of which/how
> many sites will break.

Given that:
(1) Chrome and Firefox already block these characters sets
(2) The HTML spec explicitly prohibits these character sets

... why are we concerned there is a compatibility issue here?
Comment 14 Brent Fulgham 2016-07-25 16:29:08 PDT
Comment on attachment 283341 [details]
Patch

r=me. Do you plan on adding the other "forbidden" character sets in a future patch?
Comment 15 John Wilander 2016-07-25 16:58:42 PDT
(In reply to comment #14)
> Comment on attachment 283341 [details]
> Patch
> 
> r=me. Do you plan on adding the other "forbidden" character sets in a future
> patch?

Yes, https://bugs.webkit.org/show_bug.cgi?id=159651.
Comment 16 Myles C. Maxfield 2016-07-25 17:54:04 PDT
(In reply to comment #13)
> (In reply to comment #1)
> > We shouldn't remove support for this without knowing the impact of which/how
> > many sites will break.
> 
> Given that:
> (1) Chrome and Firefox already block these characters sets

Historically, this has been an imperfect signal for these kinds of decisions. I know at least one case where we refuse to break sites which all the other browsers break.

> (2) The HTML spec explicitly prohibits these character sets

The spec is a poor signal for compatibility.

> 
> ... why are we concerned there is a compatibility issue here?
Comment 17 Brent Fulgham 2016-07-25 19:48:16 PDT
(In reply to comment #16)
> (In reply to comment #13)
> > (In reply to comment #1)
> > > We shouldn't remove support for this without knowing the impact of which/how
> > > many sites will break.
> > 
> > Given that:
> > (1) Chrome and Firefox already block these characters sets
> 
> Historically, this has been an imperfect signal for these kinds of
> decisions. I know at least one case where we refuse to break sites which all
> the other

Can you identify any websites that require these character sets? Who would we break with this change?
Comment 18 Myles C. Maxfield 2016-07-25 21:59:13 PDT
(In reply to comment #17)
> (In reply to comment #16)
> > (In reply to comment #13)
> > > (In reply to comment #1)
> > > > We shouldn't remove support for this without knowing the impact of which/how
> > > > many sites will break.
> > > 
> > > Given that:
> > > (1) Chrome and Firefox already block these characters sets
> > 
> > Historically, this has been an imperfect signal for these kinds of
> > decisions. I know at least one case where we refuse to break sites which all
> > the other
> 
> Can you identify any websites that require these character sets? Who would
> we break with this change?

This is exactly the question I asked way above (in the second comment).

The burden of proof should be on the one removing functionality, not the other way around.

It seems to me we shouldn't remove functionality if we have no idea what will break.
Comment 19 Myles C. Maxfield 2016-07-25 22:03:39 PDT
(In reply to comment #18)
> (In reply to comment #17)
> > (In reply to comment #16)
> > > (In reply to comment #13)
> > > > (In reply to comment #1)
> > > > > We shouldn't remove support for this without knowing the impact of which/how
> > > > > many sites will break.
> > > > 
> > > > Given that:
> > > > (1) Chrome and Firefox already block these characters sets
> > > 
> > > Historically, this has been an imperfect signal for these kinds of
> > > decisions. I know at least one case where we refuse to break sites which all
> > > the other
> > 
> > Can you identify any websites that require these character sets? Who would
> > we break with this change?
> 
> This is exactly the question I asked way above (in the second comment).
> 
> The burden of proof should be on the one removing functionality, not the
> other way around.
> 
> It seems to me we shouldn't remove functionality if we have no idea what
> will break.

I should probably clarify this:
Removing functionality might have a very serious impact, or maybe none at all. Improving security will have a valuable benefit. Without knowing more information, it seems difficult to perform the calculus of whether or not this change should be committed.

I'm not against the change - I am definitely pro-security! I just hesitate when we are considering willingly breaking an unknown number of websites.
Comment 20 Brent Fulgham 2016-07-25 23:28:37 PDT
(In reply to comment #19)
> 
> I'm not against the change - I am definitely pro-security! I just hesitate
> when we are considering willingly breaking an unknown number of websites.

The problem here is that we could imagine any number of changes that could possibly break a website. We need to weigh the security benefits against the risk of causing harm.

As I see it, we have a change that:
1. Closes a known security hole.
2. Brings us into compatibility with how the other major browsers work.
3. Brings us into agreement with the specification.

Those seem like three significant reasons to make this change.

On the other side, we have the hypothetical risk of breaking websites that rely on these disallowed character sets. As far as I know, we are unaware of any such sites existing, nor is there compelling reason that someone would choose to implement a site relying on them.

I also do not see a way to know whether websites will break with this change, since there is no catalog of websites using particular character sets. I think the right choice here is to land the change, and let users of Safari Technology Preview or the Gtk-based browsers report any problems they encounter.
Comment 21 Myles C. Maxfield 2016-07-26 00:32:16 PDT
> I also do not see a way to know whether websites will break with this
> change

We need a way to know the ramifications of our changes without breaking the product for our users.

If such a mechanism does not exist right now, then all we can do is commit and cross our fingers.
Comment 22 WebKit Commit Bot 2016-07-26 09:58:35 PDT
Comment on attachment 283341 [details]
Patch

Clearing flags on attachment: 283341

Committed r203725: <http://trac.webkit.org/changeset/203725>
Comment 23 WebKit Commit Bot 2016-07-26 09:58:40 PDT
All reviewed patches have been landed.  Closing bug.
Comment 24 Darin Adler 2016-07-26 10:38:52 PDT
Comment on attachment 283341 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=283341&action=review

> Source/WebCore/ChangeLog:13
> +        The HTML specification says "The above prohibits supporting, for example,
> +        CESU-8, UTF-7, BOCU-1, SCSU, EBCDIC, and UTF-32."

Could we please add tests that cover these other examples of prohibited character encodings, even the ones that are not explicitly included in the list? Maybe we could find a way to write a single test that iterates a list of encoding names and checks that each is not supported; then we can add names for all sorts of character sets that we should not support.