Bug 148810 - Document.characterSet should return "UTF-8" instead of null by default
Summary: Document.characterSet should return "UTF-8" instead of null by default
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: dewei_zhu
URL: https://dom.spec.whatwg.org/#concept-...
Keywords:
Depends on:
Blocks:
 
Reported: 2015-09-04 12:15 PDT by dewei_zhu
Modified: 2015-09-14 18:26 PDT (History)
7 users (show)

See Also:


Attachments
Patch (39.90 KB, patch)
2015-09-04 12:20 PDT, dewei_zhu
no flags Details | Formatted Diff | Diff
Patch (40.63 KB, patch)
2015-09-04 13:47 PDT, dewei_zhu
no flags Details | Formatted Diff | Diff
Patch (41.91 KB, patch)
2015-09-04 14:29 PDT, dewei_zhu
no flags Details | Formatted Diff | Diff
Patch (46.87 KB, patch)
2015-09-04 16:10 PDT, dewei_zhu
no flags Details | Formatted Diff | Diff
Patch (46.15 KB, patch)
2015-09-04 17:11 PDT, dewei_zhu
no flags Details | Formatted Diff | Diff
Patch (47.30 KB, patch)
2015-09-09 16:52 PDT, dewei_zhu
no flags Details | Formatted Diff | Diff
Patch (47.48 KB, patch)
2015-09-09 17:34 PDT, dewei_zhu
no flags Details | Formatted Diff | Diff
Patch (47.36 KB, patch)
2015-09-09 18:19 PDT, dewei_zhu
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description dewei_zhu 2015-09-04 12:15:13 PDT
Set default encoding of document to be 'UTF-8'.
Comment 1 dewei_zhu 2015-09-04 12:20:31 PDT
Created attachment 260607 [details]
Patch
Comment 2 dewei_zhu 2015-09-04 12:22:12 PDT
Comment on attachment 260607 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=260607&action=review

> Source/WebCore/ChangeLog:14
> +

Is it ok to have 2 separate change logs?
Comment 3 Chris Dumez 2015-09-04 12:33:49 PDT
Comment on attachment 260607 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=260607&action=review

> Source/WebCore/ChangeLog:11
> +

Please indicate in the Change Log that this behavior is consistent with Firefox.

> Source/WebCore/dom/Document.cpp:-1255
> -    return String();

We should also update Document.idl to drop the [TreatReturnedNullStringAs=Null] IDL extended attribute from the following attributes: inputEncoding, charset, characterSet. The spec says that these attributes are not nullable and we no longer return a null string.

> Source/WebCore/dom/Document.cpp:1255
> +    return String(UTF8Encoding().domName());

I don't believe we need the String()

> LayoutTests/ChangeLog:3
> +        Update the tests which test the default encoding of document.

Please use bug title here: "Set 'UTF-8' to be the default of document encoding." and move "Update the tests which test the default encoding of document." to the description under the Reviewed By line.

> LayoutTests/fast/dom/document-attribute-js-null-expected.txt:5
> +TEST FAILED: The value should have been undefined but was the string 'UTF-8'. [tested Document.charset]

You need to update this test instead of simply rebaselining it. This is not an imported test.
Comment 4 Chris Dumez 2015-09-04 12:34:49 PDT
Comment on attachment 260607 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=260607&action=review

> Source/WebCore/ChangeLog:5
> +        <rdar://problem/22549736>

Wrong radar?

> Source/WebCore/ChangeLog:10
> +        https://dom.spec.whatwg.org/#concept-document-url.

Wrong URL?
Comment 5 Chris Dumez 2015-09-04 12:37:04 PDT
Comment on attachment 260607 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=260607&action=review

> Source/WebCore/ChangeLog:3
> +        Set 'UTF-8' to be the default of document encoding.

I would prefer "document.characterSet should return "UTF-8" instead of null by default"
Comment 6 Chris Dumez 2015-09-04 12:37:18 PDT
Comment on attachment 260607 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=260607&action=review

>> Source/WebCore/ChangeLog:14
>> +
> 
> Is it ok to have 2 separate change logs?

Yes.
Comment 7 dewei_zhu 2015-09-04 13:47:11 PDT
Created attachment 260611 [details]
Patch
Comment 8 Chris Dumez 2015-09-04 13:55:12 PDT
Comment on attachment 260611 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=260611&action=review

> Source/WebCore/dom/Document.cpp:1255
> +    return UTF8Encoding().domName();

What about the IDL changes I suggested?
Comment 9 dewei_zhu 2015-09-04 14:29:01 PDT
Created attachment 260620 [details]
Patch
Comment 10 dewei_zhu 2015-09-04 14:30:06 PDT
Comment on attachment 260620 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=260620&action=review

> Source/WebCore/dom/Document.idl:178
> +             [TreatNullAs=NullString] attribute DOMString charset;

Can we also drop [TreatNullAs=NullString] as well?
Comment 11 Chris Dumez 2015-09-04 14:39:28 PDT
Comment on attachment 260620 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=260620&action=review

> Source/WebCore/ChangeLog:11
> +        is consistent with Firefox.

Chrome also follows the spec.

>> Source/WebCore/dom/Document.idl:178
>> +             [TreatNullAs=NullString] attribute DOMString charset;
> 
> Can we also drop [TreatNullAs=NullString] as well?

It is non-standard so I wouldn't touch it.
Comment 12 Chris Dumez 2015-09-04 14:40:58 PDT
Comment on attachment 260620 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=260620&action=review

> Source/WebCore/dom/Document.cpp:1255
> +    return UTF8Encoding().domName();

As per Alexey's feedback, let's check that this doesn't impact Document::encoding() call sites in native code.
Comment 13 Chris Dumez 2015-09-04 14:43:48 PDT
Comment on attachment 260620 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=260620&action=review

>> Source/WebCore/dom/Document.cpp:1255
>> +    return UTF8Encoding().domName();
> 
> As per Alexey's feedback, let's check that this doesn't impact Document::encoding() call sites in native code.

I see one call site in FrameLoader::addExtraFieldsToRequest():
        // Always try UTF-8. If that fails, try frame encoding (if any) and then the default.
        request.setResponseContentDispositionEncodingFallbackArray("UTF-8", m_frame.document()->encoding(), m_frame.settings().defaultTextEncodingName());

This already tries UTF-8 first and the implementation ignores null-string encodings so I think this is fine.
Comment 14 Chris Dumez 2015-09-04 14:45:38 PDT
Comment on attachment 260620 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=260620&action=review

>>> Source/WebCore/dom/Document.cpp:1255
>>> +    return UTF8Encoding().domName();
>> 
>> As per Alexey's feedback, let's check that this doesn't impact Document::encoding() call sites in native code.
> 
> I see one call site in FrameLoader::addExtraFieldsToRequest():
>         // Always try UTF-8. If that fails, try frame encoding (if any) and then the default.
>         request.setResponseContentDispositionEncodingFallbackArray("UTF-8", m_frame.document()->encoding(), m_frame.settings().defaultTextEncodingName());
> 
> This already tries UTF-8 first and the implementation ignores null-string encodings so I think this is fine.

charset() / inputEncoding() are used a lot more internally and they all call Document::encoding() so we need to double check those.
Comment 15 dewei_zhu 2015-09-04 16:10:04 PDT
Created attachment 260639 [details]
Patch
Comment 16 Chris Dumez 2015-09-04 16:16:05 PDT
Comment on attachment 260639 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=260639&action=review

> Source/WebCore/dom/Document.cpp:1258
> +String Document::characterSet() const

Please do not duplicate code. Have this call document::encoding().

> Source/WebCore/dom/Document.h:403
>      String charset() const { return Document::encoding(); }

Please add a comment explaining this may return a null String.

> Source/WebCore/dom/Document.h:404
> +    String characterSet() const;

Please add a comment explaining this returns "UTF-8" by default.

> LayoutTests/ChangeLog:9
> +        Update the tests which test the default encoding of document.

Please add a blank line after this.

> LayoutTests/ChangeLog:11
> +        * fast/dom/document-attribute-js-null-expected.txt: Obsolete test.

Not obsolete.

> LayoutTests/fast/dom/document-attribute-js-null-expected.txt:-5
> -TEST SUCCEEDED: The value was undefined. [tested Document.charset]

Do not remove.

> LayoutTests/fast/dom/document-attribute-js-null.html:-69
> -                        {name: 'charset', expectedNull: undefined},

Do not remove
Comment 17 dewei_zhu 2015-09-04 17:11:20 PDT
Created attachment 260653 [details]
Patch
Comment 18 WebKit Commit Bot 2015-09-04 17:14:14 PDT
Attachment 260653 [details] did not pass style-queue:


ERROR: Source/WebCore/dom/Document.cpp:1261:  Missing space before ( in if(  [whitespace/parens] [5]
Total errors found: 1 in 15 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 19 Chris Dumez 2015-09-04 17:30:13 PDT
Comment on attachment 260653 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=260653&action=review

r=me with changes.

> Source/WebCore/dom/Document.cpp:1255
>      return String();

nullAtom

> Source/WebCore/dom/Document.cpp:1260
> +    String name = encoding();

AtomicString

> Source/WebCore/dom/Document.cpp:1261
> +    if(!name.isEmpty())

missing space between if and (

isNull() not isEmpty()

> Source/WebCore/dom/Document.h:403
> +    String charset() const { return Document::encoding(); } // may return a null String

Comments need to end with a dot and start with a capital letter.

> Source/WebCore/dom/Document.h:404
> +    String characterSet() const; // returns "UTF-8" by default

Ditto

> Source/WebCore/dom/InlineStyleSheetOwner.cpp:144
> +    m_sheet = CSSStyleSheet::createInline(element, URL(), m_startTextPosition, document.charset());

Let's use encoding() for these as they wanted an encoding. Ditto below.
Comment 20 dewei_zhu 2015-09-09 16:52:28 PDT
Created attachment 260889 [details]
Patch
Comment 21 Chris Dumez 2015-09-09 16:57:00 PDT
Comment on attachment 260889 [details]
Patch

r=me
Comment 22 Ryosuke Niwa 2015-09-09 16:57:07 PDT
Comment on attachment 260889 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=260889&action=review

> Source/WebCore/ChangeLog:15
> +        * dom/Document.cpp:
> +        (WebCore::Document::encoding):
> +        (WebCore::Document::characterSet):

You should add comments as to what you're modifying in each function.

> Source/WebCore/dom/Document.h:404
> +    String charset() const { return Document::encoding(); } // May return a null String.
> +    String characterSet() const; // Returns "UTF-8" by default.

We don't normally add comments like this. Instead, we rename functions to reflect these semantics.
Can we call the latter bindingCharacterSet instead?
Comment 23 Chris Dumez 2015-09-09 17:01:18 PDT
Comment on attachment 260889 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=260889&action=review

>> Source/WebCore/dom/Document.h:404
>> +    String characterSet() const; // Returns "UTF-8" by default.
> 
> We don't normally add comments like this. Instead, we rename functions to reflect these semantics.
> Can we call the latter bindingCharacterSet instead?

Hmm, OK, how about characterSetForBindings() then. I think we usually use such naming.

> Source/WebCore/dom/Document.idl:67
> +    [ImplementedAs=characterSet] readonly attribute DOMString inputEncoding;

characterSetForBindings

> Source/WebCore/dom/Document.idl:191
> +    readonly attribute DOMString characterSet;

[ImplementedAs=characterSetForBindings]
Comment 24 dewei_zhu 2015-09-09 17:34:30 PDT
Created attachment 260893 [details]
Patch
Comment 25 Ryosuke Niwa 2015-09-09 17:45:08 PDT
Comment on attachment 260893 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=260893&action=review

> Source/WebCore/dom/Document.h:404
> +    String charset() const { return Document::encoding(); } // May return a null String.
> +    String characterSetForBindings() const; // Returns "UTF-8" by default.

I don't think we want these comments.
Comment 26 Ryosuke Niwa 2015-09-09 17:45:51 PDT
These are "what" comments that repeat the code.  We only want "why" comments.
Comment 27 dewei_zhu 2015-09-09 18:19:19 PDT
Created attachment 260896 [details]
Patch
Comment 28 WebKit Commit Bot 2015-09-09 19:04:13 PDT
Comment on attachment 260896 [details]
Patch

Clearing flags on attachment: 260896

Committed r189564: <http://trac.webkit.org/changeset/189564>
Comment 29 WebKit Commit Bot 2015-09-09 19:04:20 PDT
All reviewed patches have been landed.  Closing bug.
Comment 30 Alexey Proskuryakov 2015-09-10 09:54:51 PDT
This broke imported/w3c/web-platform-tests/html/dom/interfaces.html on Windows. Could you please update the results?
Comment 31 dewei_zhu 2015-09-10 09:56:42 PDT
(In reply to comment #30)
> This broke imported/w3c/web-platform-tests/html/dom/interfaces.html on
> Windows. Could you please update the results?
Of course. I'll do it right now.
Comment 32 dewei_zhu 2015-09-10 10:13:36 PDT
(In reply to comment #30)
> This broke imported/w3c/web-platform-tests/html/dom/interfaces.html on
> Windows. Could you please update the results?

Updated. https://bugs.webkit.org/show_bug.cgi?id=149038
Comment 33 Darin Adler 2015-09-11 10:25:32 PDT
Comment on attachment 260896 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=260896&action=review

I think it’s worthwhile to return at some point and do additional cleanup.

> Source/WebCore/dom/Document.h:404
>      String charset() const { return Document::encoding(); }
> -    String characterSet() const { return Document::encoding(); }
> +    String characterSetForBindings() const;

I know we made an effort at clarity here, but I still find the naming confusing. There’s charset, used for the DOM binding attribute "charset", and also characterSetForBindings, used for two other DOM binding attributes "inputEncoding" and "characterSet".

Now how do those names make it clear why we wouldn’t want to use characterSetForBindings for "charset"? There’s a reason, presumably charset is supposed to return null rather than "UTF-8", but I think that reason needs to be expressed in these function names.

> Source/WebCore/loader/DocumentWriter.cpp:192
> +                m_decoder->setEncoding(TextEncoding(parentFrame->document()->encoding()), TextResourceDecoder::EncodingFromParentFrame);

This seems inefficient. Why convert from a TextEncoding to a string and then back to a TextEncoding?

> Source/WebCore/loader/FormSubmission.cpp:154
> +    return TextEncoding(document.encoding());

This seems inefficient. Why convert from a TextEncoding to a string and then back to a TextEncoding?
Comment 34 dewei_zhu 2015-09-14 18:26:50 PDT
Polished in bug 149131.
(In reply to comment #33)
> Comment on attachment 260896 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=260896&action=review
> 
> I think it’s worthwhile to return at some point and do additional cleanup.
> 
> > Source/WebCore/dom/Document.h:404
> >      String charset() const { return Document::encoding(); }
> > -    String characterSet() const { return Document::encoding(); }
> > +    String characterSetForBindings() const;
> 
> I know we made an effort at clarity here, but I still find the naming
> confusing. There’s charset, used for the DOM binding attribute "charset",
> and also characterSetForBindings, used for two other DOM binding attributes
> "inputEncoding" and "characterSet".
> 
> Now how do those names make it clear why we wouldn’t want to use
> characterSetForBindings for "charset"? There’s a reason, presumably charset
> is supposed to return null rather than "UTF-8", but I think that reason
> needs to be expressed in these function names.
> 
> > Source/WebCore/loader/DocumentWriter.cpp:192
> > +                m_decoder->setEncoding(TextEncoding(parentFrame->document()->encoding()), TextResourceDecoder::EncodingFromParentFrame);
> 
> This seems inefficient. Why convert from a TextEncoding to a string and then
> back to a TextEncoding?
> 
> > Source/WebCore/loader/FormSubmission.cpp:154
> > +    return TextEncoding(document.encoding());
> 
> This seems inefficient. Why convert from a TextEncoding to a string and then
> back to a TextEncoding?