Bug 56650 - [Chromium] The WebPageSerializer::retrieveAllResources() should retrieve CSS resources
Summary: [Chromium] The WebPageSerializer::retrieveAllResources() should retrieve CSS ...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebKit Misc. (show other bugs)
Version: 528+ (Nightly build)
Hardware: PC OS X 10.5
: P2 Normal
Assignee: Jay Civelli
URL:
Keywords:
Depends on: 57389 57390
Blocks:
  Show dependency treegraph
 
Reported: 2011-03-18 09:47 PDT by Jay Civelli
Modified: 2011-03-29 14:27 PDT (History)
5 users (show)

See Also:


Attachments
Initial patch (32.61 KB, patch)
2011-03-18 10:20 PDT, Jay Civelli
no flags Details | Formatted Diff | Diff
Fixed CSS test (24.93 KB, patch)
2011-03-18 15:21 PDT, Jay Civelli
no flags Details | Formatted Diff | Diff
Fixed tests. (25.41 KB, patch)
2011-03-18 16:23 PDT, Jay Civelli
no flags Details | Formatted Diff | Diff
Addressed Adam's comments (25.60 KB, patch)
2011-03-18 18:21 PDT, Jay Civelli
levin: review-
levin: commit-queue-
Details | Formatted Diff | Diff
Addressed David's comments (26.50 KB, patch)
2011-03-29 00:38 PDT, Jay Civelli
no flags Details | Formatted Diff | Diff
Fixed build (26.36 KB, patch)
2011-03-29 00:49 PDT, Jay Civelli
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Jay Civelli 2011-03-18 09:47:10 PDT
The WebPageSerializer::retrieveAllResources() does not retrieve CSS resources from the page, such as:
@import
background urls
@font-face
Comment 1 Jay Civelli 2011-03-18 10:20:31 PDT
Created attachment 86175 [details]
Initial patch
Comment 2 Jay Civelli 2011-03-18 15:21:52 PDT
Created attachment 86223 [details]
Fixed CSS test
Comment 3 WebKit Review Bot 2011-03-18 15:26:45 PDT
Attachment 86223 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebKit/chromium/ChangeLog', u'Sourc..." exit_code: 1

Source/WebKit/chromium/tests/data/pageserialization/css_test_page.html:15:  Line contains tab character.  [whitespace/tab] [5]
Source/WebKit/chromium/tests/data/pageserialization/css_test_page.html:16:  Line contains tab character.  [whitespace/tab] [5]
Total errors found: 2 in 8 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 4 Adam Barth 2011-03-18 15:59:26 PDT
Comment on attachment 86223 [details]
Fixed CSS test

View in context: https://bugs.webkit.org/attachment.cgi?id=86223&action=review

I realize you removed the review request, but here are some random nits.

> Source/WebKit/chromium/src/WebPageSerializer.cpp:132
> +    // Process the attribute if one was found found.

"found found" => "found"

> Source/WebKit/chromium/src/WebPageSerializer.cpp:137
> +            KURL url(ParsedURLString, element->document()->completeURL(value));

No need to use ParsedURLString.  You can just write:

KURL url = element->document()->completeURL(value);

> Source/WebKit/chromium/src/WebPageSerializer.cpp:143
> +            if (!url.isEmpty() && !url.isNull() && url.isValid()
> +                && (url.protocolIs("http") || url.protocolIs("https") || url.protocolIs("file"))) {
> +                resourceURLs->add(url);
> +            }

You should do the JavaScript URL check together with these.  Also "!url.isEmpty() && !url.isNull() && url.isValid()" are all redundant with the later checks, right?

> Source/WebKit/chromium/src/WebPageSerializer.cpp:149
> +    CSSStyleDeclaration* styleDeclaration = element->style();
> +    if (styleDeclaration)

We usually combine these lines:

if (CSSStyleDeclaration* styleDeclaration = element->style())
  ...

> Source/WebKit/chromium/src/WebPageSerializer.cpp:161
> +    // If we have already seen that frame, ignore it.

This comment is a "what" not a "why" and should be removed.

> Source/WebKit/chromium/src/WebPageSerializer.cpp:189
> +        HTMLElement* element = toHTMLElement(allNodes->item(i));
> +        if (element) {

These should be combined.

> Source/WebKit/chromium/src/WebPageSerializer.cpp:209
> +    // Retrieving the background property would probably be good enough.
> +    // Iterating for any other image properties to be safe (I am not sure there
> +    // are actually others).

Maybe list bullets can have images?

> Source/WebKit/chromium/src/WebPageSerializer.cpp:213
> +        // prop ID, but CSSStyleDeclaration only gives access to property names,

"prop" <-- please don't use abbreviations.

> Source/WebKit/chromium/src/WebPageSerializer.cpp:216
> +        RefPtr<CSSValue> value =
> +            styleDeclaration->getPropertyCSSValue(styleDeclaration->item(i));

There's not 80 col limit.  Please don't add unneeded line breaks.

> Source/WebKit/chromium/src/WebPageSerializer.cpp:225
> +                    KURL(ParsedURLString,
> +                         static_cast<CSSStyleSheet*>(styleDeclaration->stylesheet())->document()->completeURL(url)));

No need to use ParsedURLString.  Doesn't completeURL return a KURL?

> Source/WebKit/chromium/src/WebPageSerializer.cpp:240
> +    if (!visitedStyleSheets->add(styleSheet).second)
> +        return; // We have already seen that styleSheet.

That's a very strange way to test if we've got the stylesheet.

> Source/WebKit/chromium/src/WebPageSerializer.cpp:253
> +            resourceURLs->add(
> +                KURL(ParsedURLString,
> +                    styleSheet->document()->completeURL(importRule->href())));

No ParsedURLString, all on one line.

> Source/WebKit/chromium/src/WebPageSerializer.cpp:257
> +            retrieveResourcesForCSSStyleSheet(
> +                static_cast<CSSImportRule*>(item)->styleSheet(),
> +                visitedStyleSheets, supportedSchemes, resourceURLs);

All on one line.

> Source/WebKit/chromium/src/WebPageSerializer.cpp:265
> +                    CSSFontFaceSrcValue* fontFaceSrc =
> +                        static_cast<CSSFontFaceSrcValue*>(valueList->item(j));

one line.  :)

> Source/WebKit/chromium/src/WebPageSerializer.cpp:315
> +    // Now retrieve CSS resources from style-sheets.

Again, a what, not a why.
Comment 5 Jay Civelli 2011-03-18 16:23:09 PDT
Created attachment 86238 [details]
Fixed tests.
Comment 6 Jay Civelli 2011-03-18 18:19:13 PDT
(In reply to comment #4)

Thanks for the comments!

> (From update of attachment 86223 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=86223&action=review
> 
> I realize you removed the review request, but here are some random nits.
> 
> > Source/WebKit/chromium/src/WebPageSerializer.cpp:132
> > +    // Process the attribute if one was found found.
> 
> "found found" => "found"
Done.

> > Source/WebKit/chromium/src/WebPageSerializer.cpp:137
> > +            KURL url(ParsedURLString, element->document()->completeURL(value));
> 
> No need to use ParsedURLString.  You can just write:
> 
> KURL url = element->document()->completeURL(value);
Done everywhere.

> > Source/WebKit/chromium/src/WebPageSerializer.cpp:143
> > +            if (!url.isEmpty() && !url.isNull() && url.isValid()
> > +                && (url.protocolIs("http") || url.protocolIs("https") || url.protocolIs("file"))) {
> > +                resourceURLs->add(url);
> > +            }
> 
> You should do the JavaScript URL check together with these.  Also "!url.isEmpty() && !url.isNull() && url.isValid()" are all redundant with the later checks, right?
Looking at KURL.cpp, it seems we still need isValid(), but you are right we don't need the rest or even to check for the JS protocol.

> > Source/WebKit/chromium/src/WebPageSerializer.cpp:149
> > +    CSSStyleDeclaration* styleDeclaration = element->style();
> > +    if (styleDeclaration)
> 
> We usually combine these lines:
> 
> if (CSSStyleDeclaration* styleDeclaration = element->style())
>   ...
OK

> 
> > Source/WebKit/chromium/src/WebPageSerializer.cpp:161
> > +    // If we have already seen that frame, ignore it.
> 
> This comment is a "what" not a "why" and should be removed.
I thought the comment useful as it is not really obvious that ListHashSet::add() returns a std::pair with second being a bool that indicates whether the element was inserted.

> > Source/WebKit/chromium/src/WebPageSerializer.cpp:189
> > +        HTMLElement* element = toHTMLElement(allNodes->item(i));
> > +        if (element) {
> 
> These should be combined.
Done.

> > Source/WebKit/chromium/src/WebPageSerializer.cpp:209
> > +    // Retrieving the background property would probably be good enough.
> > +    // Iterating for any other image properties to be safe (I am not sure there
> > +    // are actually others).
> 
> Maybe list bullets can have images?
You are right! Updated the command and the test.

> > Source/WebKit/chromium/src/WebPageSerializer.cpp:213
> > +        // prop ID, but CSSStyleDeclaration only gives access to property names,
> 
> "prop" <-- please don't use abbreviations.
OK

> > Source/WebKit/chromium/src/WebPageSerializer.cpp:216
> > +        RefPtr<CSSValue> value =
> > +            styleDeclaration->getPropertyCSSValue(styleDeclaration->item(i));
> 
> There's not 80 col limit.  Please don't add unneeded line breaks.
Done.

> > Source/WebKit/chromium/src/WebPageSerializer.cpp:225
> > +                    KURL(ParsedURLString,
> > +                         static_cast<CSSStyleSheet*>(styleDeclaration->stylesheet())->document()->completeURL(url)));
> 
> No need to use ParsedURLString.  Doesn't completeURL return a KURL?
Done.

> > Source/WebKit/chromium/src/WebPageSerializer.cpp:240
> > +    if (!visitedStyleSheets->add(styleSheet).second)
> > +        return; // We have already seen that styleSheet.
> 
> That's a very strange way to test if we've got the stylesheet.
Yeah, LinkedHashSet:add() inserts and returns a pair, second is a boolean that indicates if the value has been inserted.

> > Source/WebKit/chromium/src/WebPageSerializer.cpp:253
> > +            resourceURLs->add(
> > +                KURL(ParsedURLString,
> > +                    styleSheet->document()->completeURL(importRule->href())));
> 
> No ParsedURLString, all on one line.
Done.

> > Source/WebKit/chromium/src/WebPageSerializer.cpp:257
> > +            retrieveResourcesForCSSStyleSheet(
> > +                static_cast<CSSImportRule*>(item)->styleSheet(),
> > +                visitedStyleSheets, supportedSchemes, resourceURLs);
> 
> All on one line.
OK.

> > Source/WebKit/chromium/src/WebPageSerializer.cpp:265
> > +                    CSSFontFaceSrcValue* fontFaceSrc =
> > +                        static_cast<CSSFontFaceSrcValue*>(valueList->item(j));
> 
> one line.  :)
Done.

> > Source/WebKit/chromium/src/WebPageSerializer.cpp:315
> > +    // Now retrieve CSS resources from style-sheets.
> 
> Again, a what, not a why.
Changed.
Comment 7 Jay Civelli 2011-03-18 18:21:50 PDT
Created attachment 86250 [details]
Addressed Adam's comments
Comment 8 Adam Barth 2011-03-18 18:22:34 PDT
Comment on attachment 86223 [details]
Fixed CSS test

View in context: https://bugs.webkit.org/attachment.cgi?id=86223&action=review

>>> Source/WebKit/chromium/src/WebPageSerializer.cpp:143
>>> +            if (!url.isEmpty() && !url.isNull() && url.isValid()
>>> +                && (url.protocolIs("http") || url.protocolIs("https") || url.protocolIs("file"))) {
>>> +                resourceURLs->add(url);
>>> +            }
>> 
>> You should do the JavaScript URL check together with these.  Also "!url.isEmpty() && !url.isNull() && url.isValid()" are all redundant with the later checks, right?
> 
> Looking at KURL.cpp, it seems we still need isValid(), but you are right we don't need the rest or even to check for the JS protocol.

Ok.

>>> Source/WebKit/chromium/src/WebPageSerializer.cpp:161
>>> +    // If we have already seen that frame, ignore it.
>> 
>> This comment is a "what" not a "why" and should be removed.
> 
> I thought the comment useful as it is not really obvious that ListHashSet::add() returns a std::pair with second being a bool that indicates whether the element was inserted.

That's a pretty nutty API...  It's probably not worth fixing the API in this patch.  Feel free to leave the comment if think it's helpful.
Comment 9 David Levin 2011-03-25 10:53:12 PDT
Comment on attachment 86250 [details]
Addressed Adam's comments

View in context: https://bugs.webkit.org/attachment.cgi?id=86250&action=review

Does this do an url rewriting?

If a page comes from http://www.example.com/ and refers to "http://www.example.com/other.jpg", how does "other.jpg" get loaded from the serialized version?

Many places where I ask questions are things that are not obvious to me from looking at the code and may be are indicative of comments that should be added to the code.  (Also since I don't know this code well of course some questions have obvious answers to you but not to me :).)

> Source/WebKit/chromium/src/WebPageSerializer.cpp:90
> +        || element->hasTagName(HTMLNames::objectTag) || element->hasTagName(HTMLNames::embedTag))

The indentation on these two continuation lines seems odd. Typically it should align just inside the parenthesis that is continued and that happens on neither line.

> Source/WebKit/chromium/src/WebPageSerializer.cpp:92
> +        Frame* frame = static_cast<const HTMLFrameOwnerElement*>(element)->contentFrame();

Why do we continue through the code if the element has that tag name but !element->isFrameOwnerElement?

Is this code important to that case:
if (CSSStyleDeclaration* styleDeclaration = element->style())
        retrieveResourcesForCSSStyleDeclaration(styleDeclaration, resourceURLs);

and if so, why isn't it important when element->isFrameOwnerElement ?

> Source/WebKit/chromium/src/WebPageSerializer.cpp:101
>      if (element->hasTagName(HTMLNames::imgTag) || element->hasTagName(HTMLNames::scriptTag))

I would like this better as a separate function. 

Then the code would change as well ideally.

It would look something like this (and have a name that tells me what it is doing).

if (element->hasTagName(HTMLNames::imgTag) || element->hasTagName(HTMLNames::scriptTag))
    return &HTMLNames::srcAttr;

if (element->hasTagName(HTMLNames::inputTag)) {
    if (input->isImageButton())
        return &HTMLNames::srcAttr;
    return 0;
}

> Source/WebKit/chromium/src/WebPageSerializer.cpp:140
> +        // Ignore javascript content.

How is this ignoring javascript content?

> Source/WebKit/chromium/src/WebPageSerializer.cpp:144
> +            // does no have a cache mechanism, we skip it as well.

s/no/not/

> Source/WebKit/chromium/src/WebPageSerializer.cpp:219
> +            // Non cached-image are just place-holders and do not contain data.

images (since you used "are", etc.)

> Source/WebKit/chromium/src/WebPageSerializer.cpp:222
> +                resourceURLs->add(static_cast<CSSStyleSheet*>(styleDeclaration->stylesheet())->document()->completeURL(url));

How do you know that it is a CSSStyleSheet?

> Source/WebKit/chromium/src/WebPageSerializer.cpp:250
> +            retrieveResourcesForCSSStyleSheet(static_cast<CSSImportRule*>(item)->styleSheet(), visitedStyleSheets, supportedSchemes, resourceURLs);

Use importRule and avoid another cast.

> Source/WebKit/chromium/src/WebPageSerializer.cpp:257
> +                    CSSFontFaceSrcValue* fontFaceSrc = static_cast<CSSFontFaceSrcValue*>(valueList->item(j));

How do you know the items are CSSFontFaceSrcValue?

> Source/WebKit/chromium/src/WebPageSerializer.cpp:310
> +        iter != styleSheets.end(); ++iter) {

Alignment (as I mentioned below).

> Source/WebKit/chromium/src/WebPageSerializer.cpp:319
> +        iter != resourceKURLs.end(); ++iter, i++) {

Might as well do ++i for consistency with other places.

> Source/WebKit/chromium/src/WebPageSerializer.cpp:330
> +        iter != frameKURLs.end(); ++iter, ++i) {

Align with interior of ( 
or just don't wrap the line. Up to you.

> Source/WebKit/chromium/tests/WebPageSerializerTest.cpp:53
> +    TestWebFrameClient() : m_scriptEnabled(false) {}

space in the {}

> Source/WebKit/chromium/tests/WebPageSerializerTest.cpp:54
> +    virtual bool allowScript(WebFrame*, bool enabledPerSettings) { return m_scriptEnabled; }

I would comment out "enabledPerSettings" like /* enabledPerSettings*/ since it is unused that may cause errors in some builds.

> Source/WebKit/chromium/tests/WebPageSerializerTest.cpp:181
> +                          

Pls don't do unnecessary space changes.

> Source/WebKit/chromium/tests/WebPageSerializerTest.cpp:226
> +        m_webView, m_supportedSchemes, &resources, &frames));    

You don't need to wrap the line but if it makes you happy...
Comment 10 Jay Civelli 2011-03-29 00:35:26 PDT
(In reply to comment #9)
> (From update of attachment 86250 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=86250&action=review
> 
> Does this do an url rewriting?
No it does not. It simply returns the URLs for the resources found in the page. 
 
> If a page comes from http://www.example.com/ and refers to "http://www.example.com/other.jpg", how does "other.jpg" get loaded from the serialized version?
This CL does not deal with that. The "save-as" page feature stores the resources from the page into a directory and rewrite the URLs in an extra step.

> Many places where I ask questions are things that are not obvious to me from looking at the code and may be are indicative of comments that should be added to the code.  (Also since I don't know this code well of course some questions have obvious answers to you but not to me :).)
> 
> > Source/WebKit/chromium/src/WebPageSerializer.cpp:90
> > +        || element->hasTagName(HTMLNames::objectTag) || element->hasTagName(HTMLNames::embedTag))
> 
> The indentation on these two continuation lines seems odd. Typically it should align just inside the parenthesis that is continued and that happens on neither line.
OK, I tried to do that right.


> > Source/WebKit/chromium/src/WebPageSerializer.cpp:92
> > +        Frame* frame = static_cast<const HTMLFrameOwnerElement*>(element)->contentFrame();
> 
> Why do we continue through the code if the element has that tag name but !element->isFrameOwnerElement?
object and embed tags can be used as iframe if they source is HTML, in which case isFrameOwnerElement returns true. Otherwise they should simply be considered as a resource.
Note that an iframe can also be used to display an image I believe, and that case does not seem to be supported.
I filed a bug https://bugs.webkit.org/show_bug.cgi?id=57300 and will address that separately.

> Is this code important to that case:
> if (CSSStyleDeclaration* styleDeclaration = element->style())
>         retrieveResourcesForCSSStyleDeclaration(styleDeclaration, resourceURLs);
> 
> and if so, why isn't it important when element->isFrameOwnerElement ?
I don't think style applies to the iframe tag itself, so we would not get anything interesting from that.

> > Source/WebKit/chromium/src/WebPageSerializer.cpp:101
> >      if (element->hasTagName(HTMLNames::imgTag) || element->hasTagName(HTMLNames::scriptTag))
> 
> I would like this better as a separate function. 
> 
> Then the code would change as well ideally.
> 
> It would look something like this (and have a name that tells me what it is doing).
> 
> if (element->hasTagName(HTMLNames::imgTag) || element->hasTagName(HTMLNames::scriptTag))
>     return &HTMLNames::srcAttr;
> 
> if (element->hasTagName(HTMLNames::inputTag)) {
>     if (input->isImageButton())
>         return &HTMLNames::srcAttr;
>     return 0;
> }
Good idea, done.

> > Source/WebKit/chromium/src/WebPageSerializer.cpp:140
> > +        // Ignore javascript content.
> 
> How is this ignoring javascript content?
Out-of-date comment, removed.

> > Source/WebKit/chromium/src/WebPageSerializer.cpp:144
> > +            // does no have a cache mechanism, we skip it as well.
> 
> s/no/not/
Done.

> > Source/WebKit/chromium/src/WebPageSerializer.cpp:219
> > +            // Non cached-image are just place-holders and do not contain data.
> 
> images (since you used "are", etc.)
Done.

> > Source/WebKit/chromium/src/WebPageSerializer.cpp:222
> > +                resourceURLs->add(static_cast<CSSStyleSheet*>(styleDeclaration->stylesheet())->document()->completeURL(url));
> 
> How do you know that it is a CSSStyleSheet?
Intuition... What? Not good enough?
OK, now explicitly testing for CSS type before casting.

> > Source/WebKit/chromium/src/WebPageSerializer.cpp:250
> > +            retrieveResourcesForCSSStyleSheet(static_cast<CSSImportRule*>(item)->styleSheet(), visitedStyleSheets, supportedSchemes, resourceURLs);
> 
> Use importRule and avoid another cast.
Done.

> > Source/WebKit/chromium/src/WebPageSerializer.cpp:257
> > +                    CSSFontFaceSrcValue* fontFaceSrc = static_cast<CSSFontFaceSrcValue*>(valueList->item(j));
> 
> How do you know the items are CSSFontFaceSrcValue?
Sadly there does not seem to be a reliable way to know.
I am doing like WebCore/css/CSSFontSelector.cpp does, which is assume the CSSPropertySrc value only contains CSSFontFaceSrcValue.
Added a comment to reflect this. 
 
> > Source/WebKit/chromium/src/WebPageSerializer.cpp:310
> > +        iter != styleSheets.end(); ++iter) {
> 
> Alignment (as I mentioned below).
Done.
 
> > Source/WebKit/chromium/src/WebPageSerializer.cpp:319
> > +        iter != resourceKURLs.end(); ++iter, i++) {
> 
> Might as well do ++i for consistency with other places.
Done.
 
> > Source/WebKit/chromium/src/WebPageSerializer.cpp:330
> > +        iter != frameKURLs.end(); ++iter, ++i) {
> 
> Align with interior of ( 
> or just don't wrap the line. Up to you.
Done.

> > Source/WebKit/chromium/tests/WebPageSerializerTest.cpp:53
> > +    TestWebFrameClient() : m_scriptEnabled(false) {}
> 
> space in the {}
Done.

> > Source/WebKit/chromium/tests/WebPageSerializerTest.cpp:54
> > +    virtual bool allowScript(WebFrame*, bool enabledPerSettings) { return m_scriptEnabled; }
> 
> I would comment out "enabledPerSettings" like /* enabledPerSettings*/ since it is unused that may cause errors in some builds.
Done.
 
> > Source/WebKit/chromium/tests/WebPageSerializerTest.cpp:181
> > +                          
> 
> Pls don't do unnecessary space changes.
Removed.

> > Source/WebKit/chromium/tests/WebPageSerializerTest.cpp:226
> > +        m_webView, m_supportedSchemes, &resources, &frames));    
> 
> You don't need to wrap the line but if it makes you happy...
Unwrapped.
Comment 11 Jay Civelli 2011-03-29 00:38:47 PDT
Created attachment 87275 [details]
Addressed David's comments
Comment 12 WebKit Review Bot 2011-03-29 00:44:53 PDT
Attachment 87275 [details] did not build on chromium:
Build output: http://queues.webkit.org/results/8280139
Comment 13 Jay Civelli 2011-03-29 00:49:30 PDT
Created attachment 87277 [details]
Fixed build
Comment 14 WebKit Commit Bot 2011-03-29 11:40:38 PDT
Comment on attachment 87277 [details]
Fixed build

Clearing flags on attachment: 87277

Committed r82293: <http://trac.webkit.org/changeset/82293>
Comment 15 WebKit Commit Bot 2011-03-29 11:40:45 PDT
All reviewed patches have been landed.  Closing bug.
Comment 16 Adam Barth 2011-03-29 13:49:38 PDT
Looks like this might have broken RetrieveCSSResources:

http://build.chromium.org/p/chromium.webkit/builders/Webkit%20Win/builds/3137/steps/webkit_unit_tests/logs/RetrieveCSSResources

WebPageSerializerTest.RetrieveCSSResources: 
Did not complete.

I'll see if the issue continues.
Comment 17 Adam Barth 2011-03-29 14:25:09 PDT
Happened twice.  Rolling out.