Bug 11839 - Webarchive fails to save CSS files in @import statements
Summary: Webarchive fails to save CSS files in @import statements
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebKit Misc. (show other bugs)
Version: 420+
Hardware: Mac OS X 10.4
: P2 Normal
Assignee: David Kilzer (:ddkilzer)
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2006-12-15 04:20 PST by David Kilzer (:ddkilzer)
Modified: 2008-11-25 14:04 PST (History)
3 users (show)

See Also:


Attachments
Test files in zip archive (668 bytes, application/zip)
2006-12-15 04:39 PST, David Kilzer (:ddkilzer)
no flags Details
Example webarchive missing @import stylesheet (1.08 KB, application/x-webarchive)
2006-12-15 04:40 PST, David Kilzer (:ddkilzer)
no flags Details
Test files in zip archive v2 (947 bytes, application/zip)
2006-12-16 05:49 PST, David Kilzer (:ddkilzer)
no flags Details
Fixed webarchive for test v2 (2.24 KB, application/x-webarchive)
2006-12-16 05:53 PST, David Kilzer (:ddkilzer)
no flags Details
Test files in zip archive v3 (1.19 KB, application/zip)
2006-12-16 06:54 PST, David Kilzer (:ddkilzer)
no flags Details
Fixed webarchive for test v3 (2.95 KB, application/x-webarchive)
2006-12-16 06:55 PST, David Kilzer (:ddkilzer)
no flags Details
Patch v1 (4.64 KB, patch)
2006-12-16 07:03 PST, David Kilzer (:ddkilzer)
no flags Details | Formatted Diff | Diff
Patch v2 (5.29 KB, patch)
2006-12-17 19:24 PST, David Kilzer (:ddkilzer)
no flags Details | Formatted Diff | Diff
Patch v3 (18.85 KB, patch)
2007-02-04 18:11 PST, David Kilzer (:ddkilzer)
beidson: review-
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description David Kilzer (:ddkilzer) 2006-12-15 04:20:21 PST
Summary:

When saving a page in the webarchive format, CSS files appearing in @import statements are not saved, while CSS files in <link rel="stylesheet> tags are saved.  CSS files appearing in @import statements should also be saved.

Steps to reproduce:

1. Open a test page with an @import statement in Safari.  (Will post one to this bug shortly.)
2. Save the page as a webarchive named:  test.webarchive.
3. Open test.webarchive in the Property List Editor application.
4. Review the list of WebSubresources saved in the file.

Expected results:

The CSS file linked in the @import statement should be included in the webarchive as a WebSubresource.

Actual results:

The CSS file linked in the @import statement is not included in the webarchive as a WebSubresource.

Regression:

Same bug occurs in Safari 2.0.4 (419.3) on Mac OS X 10.4.8 (8L127).

Notes:

The "Activity Window" in Safari lists CSS files from both <link rel="stylesheet"> tags and @import statements.  If this list is kept in the WebKit framework (as opposed to Safari), perhaps it should be used when outputting a webarchive instead.
Comment 1 David Kilzer (:ddkilzer) 2006-12-15 04:39:46 PST
Created attachment 11858 [details]
Test files in zip archive

1. Download zip file.
2. Unzip into directory.
3. Open bug-11839-test.html.
4. Save as webarchive.
5. View in Property List Editor.
Comment 2 David Kilzer (:ddkilzer) 2006-12-15 04:40:41 PST
Created attachment 11859 [details]
Example webarchive missing @import stylesheet
Comment 3 David Kilzer (:ddkilzer) 2006-12-15 08:00:15 PST
I have a patch to fix this, but I need to clean it up and get recursion working (since a stylesheet referenced by @import may include other stylesheets, or--worse--link back to its parent).

Comment 4 David Kilzer (:ddkilzer) 2006-12-16 05:49:42 PST
Created attachment 11878 [details]
Test files in zip archive v2

These test files include recursive @import statements.
Comment 5 David Kilzer (:ddkilzer) 2006-12-16 05:53:04 PST
Created attachment 11879 [details]
Fixed webarchive for test v2
Comment 6 David Kilzer (:ddkilzer) 2006-12-16 06:54:13 PST
Created attachment 11881 [details]
Test files in zip archive v3

Also tests reading @import statements from stylesheets defined in <link> tags.
Comment 7 David Kilzer (:ddkilzer) 2006-12-16 06:55:09 PST
Created attachment 11882 [details]
Fixed webarchive for test v3
Comment 8 David Kilzer (:ddkilzer) 2006-12-16 07:03:01 PST
Created attachment 11883 [details]
Patch v1

Reviews list of CSS rules for each stylesheet defined in a web page (via <link> or <style>) and adds @import urls to the list of urls to save when creating a .webarchive file.
Comment 9 David Kilzer (:ddkilzer) 2006-12-16 07:12:25 PST
Comment on attachment 11883 [details]
Patch v1

>++ (void)_readCSSRulesRecursively:(DOMCSSStyleSheet *)sheet withURLs:(NSMutableArray *)URLs withDOMDocument:(DOMDocument *)document
>+{
>+    DOMCSSRuleList *rules = [sheet cssRules];
>+    for (int i = 0, max = [rules length]; i < max; i++) {
>+        DOMCSSRule *rule = [rules item:i];
>+        if ([rule type] == DOM_IMPORT_RULE) {
>+            DOMCSSImportRule *importRule = (DOMCSSImportRule *) rule; 
>+            NSString *string = [importRule href];
>+            if ([string length] > 0) {
>+                NSURL *newURL = [document URLWithAttributeString:string];
>+                if (![URLs containsObject:newURL]) {
>+                    [URLs addObject:newURL];
>+                    [DOMHTMLStyleElement _readCSSRulesRecursively:[importRule styleSheet] withURLs:URLs withDOMDocument:document];
>+                }
>+            }
>+        }
>+    }
>+}

Strictly speaking, the check for "if (![URLs containsObject:newURL])" here is not necessary to prevent recursion, only to prevent duplicate copies of the same resource being added twice.  (The DOMCSSImportRule that refers to the duplicate stylesheet URL returns nil for the styleSheet call, which ends up falling through in the next recursion.)

If this code is removed, the first recursed stylesheet will appear twice in the .webarchive file.  This is not a big deal, since it already happens with images (see Bug 7266), and will be fixed when that bug is fixed.

I wasn't sure whether I should rely on the CSS parser not creating recursive references when building DOMCSSImportRule objects in the future, though.
Comment 10 mitz 2006-12-16 07:58:13 PST
Comment on attachment 11883 [details]
Patch v1

I can't review this, but here are a few observations and a question :-)

I don't think it's very elegant to do the following also for rel="icon" (although it doesn't do any harm):

+        DOMCSSStyleSheet *sheet = (DOMCSSStyleSheet *)[self sheet];
+        [DOMHTMLStyleElement _readCSSRulesRecursively:sheet withURLs:URLs withDOMDocument:[self ownerDocument]];

Extra space before 'rule':

+            DOMCSSImportRule *importRule = (DOMCSSImportRule *) rule; 

Since length is unsigned, this should be just 'if ([string length])':

+            if ([string length] > 0) {

Regarding this method:

++ (void)_readCSSRulesRecursively:(DOMCSSStyleSheet *)sheet withURLs:(NSMutableArray *)URLs withDOMDocument:(DOMDocument *)document;

I don't think its name describes what it does or the arguments' roles very well. Have you considered, instead of a DOMHTMLStyleElement class method, a DOMCSSStyleSheet (WebPrivate) instance method? Say,
-(void)_recursivelyAddSubresourceURLsToArray:(NSMutableArray *)URLs withDOMDocument:(DOMDocument *)document;
Comment 11 David Kilzer (:ddkilzer) 2006-12-17 19:24:07 PST
Created attachment 11898 [details]
Patch v2

Addresses feedback in Comment #10.  Thanks for the pre-review Mitz!
Comment 12 Darin Adler 2006-12-19 07:48:01 PST
Comment on attachment 11898 [details]
Patch v2

The future direction for this code is:

    1) to be converted to C++ and moved to WebCore
    2) to eschew recursive algorithms and use iterative algorithms with data for context rather than using the processor stack
    3) to never do an operation like containsObject: on an array, since that's O(n) -- we need to use sets instead in cases like that

Regarding the word "Private" in WebKit, that's used to mean "for possible use internally in OS X". The word "Internal" means "to be used only inside WebKit". Thus these categories should be using the word Internal, not Private.

We need a way to do automated tests of this code.

Anyway, r=me
Comment 13 David Kilzer (:ddkilzer) 2006-12-19 11:08:58 PST
Comment on attachment 11898 [details]
Patch v2

Clearing review flag.  I want to fix some of the issues raised in Comment #12, specifically item #2 (recursion vs. iteration), item #3 (containsObject: on NSArray) and the Internal vs. Private issue.

The choice to "reuse" the NSMutableArray and call containsObject: on it was a conscious one.  I felt that most web pages would have a relatively small number of total stylesheets included in a hierarchical fashion (on the order of 10), and that the performance benefit from creating another data structure just for bookkeeping of duplicates was "worse" than performing a containsObject: on an existing data structure with such a small number of objects.  (I considered this "premature optimization".)

However, I don't have any basis for the "average" number of hierarchically included stylesheets per web page on the Internet (other than personal experience), and I didn't test the performance difference between running containsObject: on an NSArray with 10+ objects and an NSSet with 10+ objects to measure the performance drop-off.
Comment 14 David Kilzer (:ddkilzer) 2006-12-19 11:15:07 PST
Will rework Patch v2 to address comments in Comment #12.
Comment 15 Darin Adler 2006-12-19 11:20:23 PST
(In reply to comment #13)
> The choice to "reuse" the NSMutableArray and call containsObject: on it was a
> conscious one.  I felt that most web pages would have a relatively small number
> of total stylesheets included in a hierarchical fashion (on the order of 10),
> and that the performance benefit from creating another data structure just for
> bookkeeping of duplicates was "worse" than performing a containsObject: on an
> existing data structure with such a small number of objects.  (I considered
> this "premature optimization".)

Here's my thinking on this: Generally, when dealing with things decided by page authors, we should be prepared for unusually large data sets, even if most pages wouldn't have them.

Thus we need to cultivate habits of using associative data structures rather than vectors or arrays, even if the vector or array is faster for the normal case.

It's different if the size of the data structure is somehow controlled by WebKit -- when it's part of the web page we have to assume it could be pathologically large in some cases.

I'd really like to see this code in C++ rather than Objective-C and using HashSet instead of NSSet. Adding more Objective-C just means more code to translate later.
Comment 16 David Kilzer (:ddkilzer) 2006-12-19 14:09:14 PST
(In reply to comment #15)
> I'd really like to see this code in C++ rather than Objective-C and using
> HashSet instead of NSSet. Adding more Objective-C just means more code to
> translate later.

I'm going to ask a really dumb question or two.

Could you define the scope of "this code"?  Is it everything in the WebKit/DOM directory, or are you talking about the code the creates the .webarchive files?

Also, since some of this code is "Private", will the current Objective-C interface need to be preserved?
Comment 17 Darin Adler 2006-12-19 15:43:05 PST
(In reply to comment #16)
> Could you define the scope of "this code"?  Is it everything in the WebKit/DOM
> directory, or are you talking about the code the creates the .webarchive files?

I was specifically talking about the guts of the archiving code. As we did with the frame loader, the logic about what goes into an archive should be cross-platform C++ code. In the context of this patch, I was thinking that at least the code that gathers up all the URLs should be in C++.

I now see that's expressed in the form of "DOM operations". I think those operations should probably be done in C++ and wrapped in Objective-C as needed. Two reasons we'd need to wrap would be: 1) legacy uses outside WebKit, 2) uses inside WebKit, some of which we could eliminate over time.

> Also, since some of this code is "Private", will the current Objective-C
> interface need to be preserved?

The answer there is "I don't know". We need to preserve SPI/Private stuff if it's being used, and if it's not we can usually get away with making it internal. I don't know exactly what's being used at the moment; it's possible Mail uses some of this.

This is a powerful argument against adding any more SPI/Private stuff unless it's actually needed.
Comment 18 David Kilzer (:ddkilzer) 2007-02-04 18:11:39 PST
Created attachment 12925 [details]
Patch v3

(In reply to comment #13)
> Clearing review flag.  I want to fix some of the issues raised in Comment #12,
> specifically item #2 (recursion vs. iteration), item #3 (containsObject: on
> NSArray) and the Internal vs. Private issue.

Changes since Patch v2:

- Added layout test thanks to Bug 11882.
- Switched algorithm from recursion to iteration.
- Changed to call containsObject: on an NSMutableSet instead of an NSMutableArray (at the cost of allocating another data structure).
- No longer modify the DOMNode _URLsFromSelectors: SPI.
- Switched Private API added (DOMCSSStyleSheet _recursivelyAddSubresourceURLsToArray:withDOMDocument:) to Internal API (DOMCSSStyleSheet _addSubresourceURLsToArray:withBaseURL:).
- Fixed bug when referencing CSS stylesheets outside the current URL "directory" of the page that was being archived.
Comment 19 Darin Adler 2007-02-04 18:40:40 PST
Comment on attachment 12925 [details]
Patch v3

+    NSMutableSet *URLList = [NSMutableSet setWithCapacity:[URLs count]];
+    [URLList addObjectsFromArray:URLs];

+[NSMutableSet setWithArray:] will do the same thing in one line.

+    NSMutableArray *sheetList = [NSMutableArray arrayWithCapacity:1];
+    [sheetList addObject:self];

+[NSMutableArray arrayWithObject:] will do the same thing in one line.

+    NSMutableArray *baseURLList = [NSMutableArray arrayWithCapacity:1];
+    [baseURLList addObject:URL];

Ditto.

+        DOMCSSStyleSheet *sheet = [sheetList objectAtIndex:0];
+        [sheetList removeObjectAtIndex:0];

It's unnecessarily inefficient to keep removing the first object from an array. This will be slow if the array is large. I think it might scale better if these style sheets were in a set rather than an array since the order doesn't matter. We could store the base URLs in a dictionary. For common cases this would be slower, but it would have better worst-case behavior.

+                    NSURL *importedStylesheetURL = [[NSURL URLWithString:importedStylesheetURLString relativeToURL:baseURL] absoluteURL];

We can get slightly better efficiency here by using initWithString: instead of URLWithString: and manually releasing the URL. Our rule of thumb is that we try not to rely on autorelease except when it's required because of a return value.

+    [URLs setArray:[URLList allObjects]];

Shouldn't this be addObjectsFromArray: instead? Should we be sorting the list?

Another idea is to write the logic of this entire function in C++, with conversion on entry and exit. Something like this:

    void addSubresourceURLs(CSSStyleSheet*, const KURL& baseURL, Vector<KURL>&);

That would be more forward looking, since we hope to some day have this web archiving code be cross-platform.

+        NSURL *stylesheetURL = [[self _URLsFromSelectors:@selector(href), nil] objectAtIndex:0];

Why is it OK to look only at the first URL?

+        NSMutableArray *URLs = [NSMutableArray arrayWithCapacity:1];
+        [URLs addObject:stylesheetURL];

Again.

r=me
Comment 20 David Kilzer (:ddkilzer) 2007-02-05 11:32:48 PST
(In reply to comment #19)
> (From update of attachment 12925 [details] [edit])
> +    NSMutableSet *URLList = [NSMutableSet setWithCapacity:[URLs count]];
> +    [URLList addObjectsFromArray:URLs];
> [NSMutableSet setWithArray:] will do the same thing in one line.

Obviously I need to better-familiarize myself with the Foundation collection classes.

> +        DOMCSSStyleSheet *sheet = [sheetList objectAtIndex:0];
> +        [sheetList removeObjectAtIndex:0];
> 
> It's unnecessarily inefficient to keep removing the first object from an array.
> This will be slow if the array is large. I think it might scale better if these
> style sheets were in a set rather than an array since the order doesn't matter.
> We could store the base URLs in a dictionary. For common cases this would be
> slower, but it would have better worst-case behavior.

Foundation classes are missing an ordered set class (like Java's HashTreeSet).

I need to test this change, but my initial concern about using an unordered set is that it would produce unpredictable results in the test output.  I'll give it a try, though.

> +                    NSURL *importedStylesheetURL = [[NSURL
> URLWithString:importedStylesheetURLString relativeToURL:baseURL] absoluteURL];
> 
> We can get slightly better efficiency here by using initWithString: instead of
> URLWithString: and manually releasing the URL. Our rule of thumb is that we try
> not to rely on autorelease except when it's required because of a return value.

Okay.

> +    [URLs setArray:[URLList allObjects]];
> 
> Shouldn't this be addObjectsFromArray: instead? Should we be sorting the list?

The URLList gets everything in URLs at the top of the method, so I thought it would be more efficiently to simply blast everything in URLList back into URLs when finished.  Is this not the case?  I'm missing insight into the Foundation classes.

> Another idea is to write the logic of this entire function in C++, with
> conversion on entry and exit. Something like this:
> 
>     void addSubresourceURLs(CSSStyleSheet*, const KURL& baseURL,
> Vector<KURL>&);
> 
> That would be more forward looking, since we hope to some day have this web
> archiving code be cross-platform.

I'll see what I can do.  I'd rather get the bug fixed before the Wednesday cut-off, though.

> +        NSURL *stylesheetURL = [[self _URLsFromSelectors:@selector(href), nil]
> objectAtIndex:0];
> 
> Why is it OK to look only at the first URL?

Because a <style> tag has only one 'href' attribute, thus that method call will only ever return one URL.
Comment 21 David Kilzer (:ddkilzer) 2007-02-05 11:33:37 PST
Comment on attachment 12925 [details]
Patch v3

Clearing darin's r+ flag per Comment #20.
Comment 22 Darin Adler 2007-02-05 11:49:45 PST
(In reply to comment #20)
> Foundation classes are missing an ordered set class (like Java's HashTreeSet).

For what it's worth, WTF now has this. It's called ListHashSet. We also use Vector/HashSet pairs when we don't need removal.

> I need to test this change, but my initial concern about using an unordered set
> is that it would produce unpredictable results in the test output.  I'll give
> it a try, though.

If you're concerned about unpredictable results in the test output, then I think the use of a set for URLList is also a problem. I suggest sorting the final list to make the results predictable.

> > +    [URLs setArray:[URLList allObjects]];
> > 
> > Shouldn't this be addObjectsFromArray: instead? Should we be sorting the list?
> 
> The URLList gets everything in URLs at the top of the method, so I thought it
> would be more efficiently to simply blast everything in URLList back into URLs
> when finished.  Is this not the case?  I'm missing insight into the Foundation
> classes.

I didn't notice that URLList starts out as URLs at the top of the method.

I think we need sorting for the reason you mention above. Perhaps the caller could handle this.

> > +        NSURL *stylesheetURL = [[self _URLsFromSelectors:@selector(href), nil]
> > objectAtIndex:0];
> > 
> > Why is it OK to look only at the first URL?
> 
> Because a <style> tag has only one 'href' attribute, thus that method call will
> only ever return one URL.

Perhaps we need to assert that. Can _URLsFromSelectors possibly return an empty array? If so, then objectAtIndex:0 is going to raise an Objective-C exception.
Comment 23 Brady Eidson 2008-04-02 15:19:31 PDT
Patch is a great start - but this code is all in WebCore now.  I'm working on converting it to "The New Math™"
Comment 24 Brady Eidson 2008-04-02 15:19:57 PDT
Comment on attachment 12925 [details]
Patch v3

r-, simply because it's obsolete.  I'm working on changing it over
Comment 25 Alexey Proskuryakov 2008-04-02 22:28:39 PDT
<rdar://problem/5838347>
Comment 26 Brady Eidson 2008-04-02 22:58:28 PDT
Landed a WebCore-ified version of the patch along with the layout test in r31578
Comment 27 David Kilzer (:ddkilzer) 2008-04-03 13:32:54 PDT
(In reply to comment #12)
> (From update of attachment 11898 [details] [edit])
> The future direction for this code is:
> 
>     1) to be converted to C++ and moved to WebCore
>     2) to eschew recursive algorithms and use iterative algorithms with data
> for context rather than using the processor stack

[...]

I'm disappointed that r31578 switched back to a recursive algorithm after I spent time making an iterative one in Attachment #12925 [details].  :(

http://trac.webkit.org/projects/webkit/changeset/31578

Comment 28 Brady Eidson 2008-04-03 13:51:18 PDT
Why?  :)
Comment 29 David Kilzer (:ddkilzer) 2008-04-03 14:15:30 PDT
(In reply to comment #28)
> Why?  :)

Universal entropy increased.  :)

Comment 30 David Kilzer (:ddkilzer) 2008-11-25 14:04:09 PST
(In reply to comment #27)
> (In reply to comment #12)
> > (From update of attachment 11898 [details] [review] [edit])
> > The future direction for this code is:
> > 
> >     1) to be converted to C++ and moved to WebCore
> >     2) to eschew recursive algorithms and use iterative algorithms with data
> > for context rather than using the processor stack
> 
> [...]
> 
> I'm disappointed that r31578 switched back to a recursive algorithm after I
> spent time making an iterative one in Attachment #12925 [details] [review].  :(
> 
> http://trac.webkit.org/projects/webkit/changeset/31578

See Bug 11850 Comment #1 for a fix.