Bug 80170

Summary: Contents of noscript elements turned into strings in WebArchives
Product: WebKit Reporter: Jessie Berlin <jberlin>
Component: Page LoadingAssignee: Brady Eidson <beidson>
Status: RESOLVED FIXED    
Severity: Normal CC: adachan, beidson, japhet, jberlin, sfalken, webkit-bug-importer, webkit.review.bot
Priority: P2 Keywords: InRadar
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
Reduced case with a noscript element
none
Patch v1 aestes: review+

Description Jessie Berlin 2012-03-02 10:27:25 PST
Created attachment 129920 [details]
Reduced case with a noscript element

When a page with the following markup:

<!DOCTYPE html>
<html>
  <head>
    <title>Page With No Script Element To Be WebArchived</title>
  </head>
  <body>
    <div>If JS is turned off, you should see red text below</div>
    <div>
      <noscript>
        <div height="200" width="200" style="color: red">Some red text</div>
      </noscript>
    </div>
  </body>
</html>

is saved as a Web Archive, the contents of the noscript element inside the Web Archive are put into a string:

<!DOCTYPE html>
<html><head>
    <title>Page With No Script Element</title>
  </head>
  <body>
    <div>If JS is turned off, you should see red text below</div>
    <div>
      <noscript>
        &amp;lt;div height="200" width="200" style="color: red"&amp;gt;Some red text&amp;lt;/div&amp;gt;
      </noscript>
    </div>
</body></html>

I first noticed this on r109561, but the comment from https://bugs.webkit.org/show_bug.cgi?id=13522 indicates that this was happening as far back as August 2010.
Comment 1 Jessie Berlin 2012-03-02 10:29:54 PST
<rdar://problem/10972534>
Comment 2 Brady Eidson 2012-03-02 10:30:04 PST
When we create a WebArchive, we serialize the DOM.

Since the contents of the <noscript> tag is never parsed into the DOM tree when scripting is enabled, they remain the innertext of the <noscript> tag.

I agree this is weird, but is there a real world symptom?
Comment 3 Radar WebKit Bug Importer 2012-03-02 10:33:38 PST
<rdar://problem/10972577>
Comment 4 Jessie Berlin 2012-03-02 10:40:23 PST
(In reply to comment #2)
> When we create a WebArchive, we serialize the DOM.
> 
> Since the contents of the <noscript> tag is never parsed into the DOM tree when scripting is enabled, they remain the innertext of the <noscript> tag.
> 
> I agree this is weird, but is there a real world symptom?

The real word symptom is odd strings appearing in webarchives that were created with JavaScript enabled, but the user has decided to look at with JavaScript disabled.
Comment 5 Brady Eidson 2012-05-10 12:21:52 PDT
There's a much deeper question about how innerHTML of <noscript> is expected to work in both a scripting and non-scripting environment that I believe Alexey is planning to pursue with the spec.

But for webarchives, we can solve this by filtering out the <noscript> elements completely if scripting is enabled.

Patch coming soon.
Comment 6 Brady Eidson 2012-05-10 12:41:31 PDT
Created attachment 141229 [details]
Patch v1
Comment 7 Andy Estes 2012-05-10 13:29:48 PDT
Comment on attachment 141229 [details]
Patch v1

View in context: https://bugs.webkit.org/attachment.cgi?id=141229&action=review

> Source/WebCore/editing/MarkupAccumulator.h:73
> +    String serializeNodes(Node* targetNode, Node* nodeToSkip, EChildrenOnly, Vector<QualifiedName>* tagNamesToSkip);

You should make this new argument optional.

> Source/WebCore/page/PageSerializer.cpp:217
> +    String text = accumulator.serializeNodes(document->documentElement(), 0, IncludeNode, 0);

You wouldn't have to change this line if the new argument to serializeNodes() were optional.
Comment 8 Brady Eidson 2012-05-10 14:08:45 PDT
http://trac.webkit.org/changeset/116685