Bug 24150

Summary: Add virtual ScriptExecutionContext::encoding()
Product: WebKit Reporter: Dmitry Titov <dimich>
Component: WebCore Misc.Assignee: Dmitry Titov <dimich>
Status: RESOLVED FIXED    
Severity: Normal CC: ap, levin
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
Bug Depends on:    
Bug Blocks: 24016    
Attachments:
Description Flags
Proposed patch
none
Updated patch.
ap: review-
updated patch.
none
Updated the test so it runs in FF too. ap: review+

Description Dmitry Titov 2009-02-24 18:31:31 PST
This also moved inputEncoding() to the SEC. The idea is to reflect actual encoding of loaded JavaScript (initially initialized from the parent context and then optionally changed by http response headers). This will allow to unvirtualize completeURL() and also properly "inherit" encoding in case of nested workers.
Comment 1 Dmitry Titov 2009-02-24 18:34:07 PST
Created attachment 27950 [details]
Proposed patch
Comment 2 David Levin 2009-02-24 18:54:16 PST
Minor comment:
> String displayStringModifiedByEncoding(const String& str) const {

I know you only moved the methods, but it would be nice to fix the { placement on the methods to follow WebKit style while you are doing this.
Comment 3 Alexey Proskuryakov 2009-02-25 04:03:52 PST
Comment on attachment 27950 [details]
Proposed patch

Why move displayStringModifiedByEncoding() to ScriptExecutionContext? It isn't needed there.

Keeping a full-blown TextResourceDecoder in SEC while it only needs TextEncoding doesn't look very good either. Document::m_decoder was the decoder used for document source - but worker script source is decoded elsewhere.
Comment 4 Dmitry Titov 2009-02-25 17:22:10 PST
Agreed. Changing the title of the bug and attaching new patch.

Instead of moving the decoder, have a virtual method that returns the encoding as a String.

Document does what it did with decoder, while WorkerContext initially inherits the encoding from the parent context and then updates it after script loading.

Also update WorkerContext::completeURL to use the encoding.
Comment 5 Dmitry Titov 2009-02-25 17:55:45 PST
Created attachment 27999 [details]
Updated patch.

No more moving decoder :-)

It feels that amount of wiring the encoding string needs to propagate from Worker::notifyFinished to the WorkerContext is a bit excessive but I'm not sure how to make it cleaner.
Comment 6 Alexey Proskuryakov 2009-02-26 04:52:10 PST
Comment on attachment 27999 [details]
Updated patch.

> +    String inputEncoding() const { return encoding(); }
> +    String charset() const { return encoding(); }
> +    String characterSet() const { return encoding(); }

You need to use "return Document::encoding();" syntax to avoid virtual dispatch. Please also consider making encoding() private (that would improve safety, but decrease clarity somewhat, as the functions won't be near each other any more).

> +    // FIXME: nested workers need loading support. Consider adopting ThreadableLoader here.

Capital "N" is needed in "nested".

> -    // FIXME: does this need to provide a charset, like Document::completeURL does?
> -    return KURL(m_location->url(), url);
> +
> +    if (m_encoding.isNull())
> +        return KURL(m_location->url(), url);
> +
> +    return KURL(m_location->url(), url, TextEncoding(m_encoding));
>  }

Have you tested that this is what Firefox does? I wouldn't be surprised if it always used UTF-8, and didn't inherit the encoding from the document at all.

This change needs an automated test.

> +        String m_encoding;

Why is it better to store it as String, not as TextEncoding?

> -        WorkerThread(const KURL&, const String& userAgent, const String& sourceCode, WorkerObjectProxy*);
> +        WorkerThread(const KURL&, const String& userAgent,  const String& encoding, const String& sourceCode, WorkerObjectProxy*);

Two spaces here.

r-, because this needs an automated test (XMLHttpRequest URL completion for now, but it's worth testing how Firefox decodes nested worker scripts and importScripts() content).
Comment 7 Dmitry Titov 2009-03-02 18:22:32 PST
Tested FF 3.1b2
Interestingly enough, they inherit encoding from the document - with a new worker on any nesting level starting from the encoding of the document rather then the parent script.
For XHR it's different - it always starts from utf8.
Will match the behavior and add a test.
Comment 8 Dmitry Titov 2009-03-04 17:51:32 PST
Created attachment 28295 [details]
updated patch.

Updated patch, now with test.
Comment 9 Dmitry Titov 2009-03-04 18:01:57 PST
(In reply to comment #6)
> (From update of attachment 27999 [details] [review])
> > +    String inputEncoding() const { return encoding(); }
> > +    String charset() const { return encoding(); }
> > +    String characterSet() const { return encoding(); }
> 
> You need to use "return Document::encoding();" syntax to avoid virtual
> dispatch. 

Done.

> Please also consider making encoding() private (that would improve
> safety, but decrease clarity somewhat, as the functions won't be near each
> other any more).

Moved the function in .h file (to another similar one, userAgent()) but not sure if making it private (and virtual) will help  - because on WorkerContext we'll need it public...

> > +    // FIXME: nested workers need loading support. Consider adopting ThreadableLoader here.
> Capital "N" is needed in "nested".

Done.

> > -    // FIXME: does this need to provide a charset, like Document::completeURL does?
> > -    return KURL(m_location->url(), url);
> > +
> > +    if (m_encoding.isNull())
> > +        return KURL(m_location->url(), url);
> > +
> > +    return KURL(m_location->url(), url, TextEncoding(m_encoding));
> >  }
> 
> Have you tested that this is what Firefox does? I wouldn't be surprised if it
> always used UTF-8, and didn't inherit the encoding from the document at all.
> 
> This change needs an automated test.

That happened to be a very good suggestion! I've tested FF and realized I'm adding wrong behavior :-). So I've changed the code and I'm adding the test that verifies the behavior is same as in FF (and also pretty logical).
 
> > +        String m_encoding;
> Why is it better to store it as String, not as TextEncoding?

It is only used as a string - to pass into script loader as initial value (now that we don't need to create TextEncoding in completeURL). So storing String is now simpler and perhaps faster.

> > +        WorkerThread(const KURL&, const String& userAgent,  const String& encoding, const String& sourceCode, WorkerObjectProxy*);
> Two spaces here.

Done.
Comment 10 Alexey Proskuryakov 2009-03-05 00:41:55 PST
Comment on attachment 28295 [details]
updated patch.

That's very interesting, thanks! So, it turns out that the charset is inherited as a default for nested worker script decoding, correct?

> +    // Note Document::completeURL provides a charset here (to encode query portion of url when
> +    // submitting charset-encoded forms data). Workers always use UTF-8 which is default for KURL.
>      return KURL(m_location->url(), url);

How does URL completion work in Firefox for Worker source and importScript() URLs? If I remember correctly, Firefox doesn't use document encoding for completing XHR URLs even on main thread, and we don't currently match this quirk.

If 'new Worker(relativeUrl)' syntax uses UTF-8 even on main thread in Firefox, then we should consider moving XMLHttpRequest and Worker off completeURL entirely (but we'll need to test IE behavior for XHR URL completion then).

Any changes to XHR and Worker behavior on main thread will need a separate bug of course, but I'd like to have a full understanding of correct expected behavior before going forward with this change.
Comment 11 Dmitry Titov 2009-03-05 20:09:41 PST
Created attachment 28345 [details]
Updated the test so it runs in FF too.
Comment 12 Dmitry Titov 2009-03-05 20:27:23 PST
Here are results of more testing. I think the nested case, which we don't implement yet, may be still 'in the works', especially wrt base url...

- tested version is FF 3.1b2
- 'XYZ' below is a non-latin string
- 'document' is the main html page
- 'first worker' is a worker created by document
- 'nested worker' is a worker created by the first worker
- (WK+) means behavior matched WebKit

In the document:

  new Worker('foo.php?query=XYZ');
    - query part of URL will be encoded using document's encoding (WK+)
    - relative URL will be completed using base URL of the document (WK+)
    - loaded script will be decoded using document's encoding or http header if any (WK+)

  XHR('foo.php?query=XYZ');
    - query part of URL will be encoded using UTF8 (in WK, document encoding)
    - relative URL will be completed using base URL of the document (WK+)
    - loaded content will be decoded using UTF8 or http header if any (WK+)

  <iframe src='foo.php?query=XYZ'>
    - query part of URL will be encoded using document's encoding (WK+)
    - relative URL will be completed using base URL of the document (WK+)
    - loaded content will be decoded using document's encoding or http header if any (WK+)

inside first worker:

  new Worker('foo.php?query=XYZ');
  importScripts('foo.php?query=XYZ');
    - query part of URL will be encoded using document's encoding, even if first worker had different http header (WK not impl)
    - relative URL will be completed using base URL of the document, not the one of the first worker (WK not impl)
    - loaded script will be decoded using document's encoding or http header if any, even if first worker had different http header  (WK not impl)

XHR("foo.php?XXX")
    - query part of URL will be encoded using UTF8 (in WK, document encoding)
    - relative URL will be completed using base URL of the document (in WK, base url of the first worker)
    - loaded content will be decoded using UTF8 or http header if any (WK+)
Comment 13 Alexey Proskuryakov 2009-03-06 01:58:46 PST
Comment on attachment 28345 [details]
Updated the test so it runs in FF too.

Thanks, that's very useful information! Much of the Gecko behavior looks like it may be unintentional indeed - would you be willing to file bugs with Mozilla? It could also be worth discussing this with WHATWG.

-    // FIXME: does this need to provide a charset, like Document::completeURL does?
+    // Note Document::completeURL provides a charset here (to encode query portion of url when
+    // submitting charset-encoded forms data). Workers always use UTF-8 which is default for KURL.
     return KURL(m_location->url(), url);

I think that the word "Note" is extraneous here - it's a comment, after all. Per your research, it is not clear that we'll want this behavior forever, given that Firefox uses document encoding for Worker and importScripts() URL completion. So, the FIXME still seems appropriate, perhaps amended with results of your research.

r=me
Comment 14 Dmitry Titov 2009-03-06 13:37:33 PST
Fixed the comment to be FIXME and landed in http://trac.webkit.org/changeset/41499

I'll follow up separately with FF on what is the right encodings and base URLs to be used there.
Comment 15 Dmitry Titov 2009-03-06 18:20:41 PST
Retested with latest FF nightly (Minefield, 03/04/09)

There is progress - nested workers and importScripts now resolves relative URLs using the parent worker script' base url (previously was using top-level document base URL) :-)

inside workers, non-latin portions of query in XHR URLs now are encoded using top document encoding (was UTF-8 before) - strange, in document they still use UTF-8
Comment 16 Dmitry Titov 2009-03-19 16:54:03 PDT
More info:
Discussion on WHATWG IRC and mailing list show that the spec will very soon capture the follwoing;
1. All URLs in Workers are encoded using UTF-8
2. 'Default' encoding for scripts themselves is always UTF-8 unless overridden by Con tent-Type http header.

I've filed bugs for FF and will file a new one for WebKit, to remove encoding() from ScriptExecutionContext().