Bug 7266 - Webarchive format saves duplicate WebSubresources to .webarchive file
Summary: Webarchive format saves duplicate WebSubresources to .webarchive file
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebKit Misc. (show other bugs)
Version: 420+
Hardware: Mac OS X 10.4
: P2 Normal
Assignee: David Kilzer (:ddkilzer)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-02-14 22:07 PST by David Kilzer (:ddkilzer)
Modified: 2007-02-05 19:53 PST (History)
3 users (show)

See Also:


Attachments
Test files in zip archive (838 bytes, application/zip)
2006-12-17 20:44 PST, David Kilzer (:ddkilzer)
no flags Details
Patch v1 (2.11 KB, patch)
2006-12-17 20:45 PST, David Kilzer (:ddkilzer)
no flags Details | Formatted Diff | Diff
Patch v2 (14.60 KB, patch)
2007-02-03 16:50 PST, David Kilzer (:ddkilzer)
darin: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description David Kilzer (:ddkilzer) 2006-02-14 22:07:08 PST
Version 1 of the .webarchive format has some issues.  This may turn into a tracking bug later, but for now I'll just list them here.  Add more to comments as needed.  These issues occur in the latest WebKit (r12809 on anonsvn).  I have not tested them yet on released Safari.

1. Neither JavaScript nor CSS files are stored in the .webarchive.  Apparently these resources are reloaded from the web server once the .webarchive is reconstituted in the browser since the base URL for the main web page is preserved.

2. Duplicate copies of images are stored in the .webarchive: one copy for each time it appears on the web page (even though they reference the exact same URL).
Comment 1 David Kilzer (:ddkilzer) 2006-02-14 23:20:48 PST
I can't reproduce issue #1 listed in the description (needs further testing), so changing this bug to document issue #2: saving multiple copies of the same image in a .webarchive file.

Steps to reproduce:

1. Open Safari (ToT WebKit or production release from 10.4.4/10.4.5).

2. Go to a page that reuses the same image multiple times on one page.  I use:  http://tv.yahoo.com/grid/  (enter your zip and choose your cable/satellite provider as needed)

3. Save the web page as a "Web Archive".

4. Open the "Web Archive" in the Property List Editor.

5. Note under Root->WebSubresources that buttonleft.gif and buttonright.gif are stored as many times in the file as they appear on the web page.

Expected behavior:

The buttonleft.gif and buttonright.gif images should only be stored once per .webarchive.

Actual behavior:

The buttonleft.gif and buttonright.gif images are stored once each time they  .webarchive.
Comment 2 David Kilzer (:ddkilzer) 2006-02-14 23:42:01 PST
Filed Bug 7267 for issue #1 in the description (Comment #0) of this bug.
Comment 3 David Kilzer (:ddkilzer) 2006-12-01 03:25:29 PST
Another good test page is the BugReporter login page.  Count the number of spacer.gif images that appear in .webarchive file:  https://bugreport.apple.com/

Comment 4 David Kilzer (:ddkilzer) 2006-12-16 07:12:42 PST
WebSubresources may include images, CSS stylesheets and JavaScript sources, and all may be duplicated when saving a .webarchive file.

Comment 5 David Kilzer (:ddkilzer) 2006-12-17 20:44:28 PST
Created attachment 11899 [details]
Test files in zip archive

After unzipping the zip file, open bug-7266-test.html and save in webarchive.  There should only be three WebSubresources, but shipping Safari and ToT will save six (three duplicates).  After applying this patch, only three WebSubresources are saved.
Comment 6 David Kilzer (:ddkilzer) 2006-12-17 20:45:32 PST
Created attachment 11900 [details]
Patch v1

First pass at a patch to prevent duplicate WebSubresources from being saved to webarchive files.
Comment 7 David Kilzer (:ddkilzer) 2006-12-18 10:04:03 PST
(In reply to comment #6)
> Created an attachment (id=11900) [edit]
> Patch v1
> 
> First pass at a patch to prevent duplicate WebSubresources from being saved to
> webarchive files.

Another approach I could take with this is to pull the code that finds subresources by NSURL out into a separate loop, and use either an NSMutableDictionary (keys are NSURLs, values are integers, to keep order of URLs the same as they would have been with an NSArray) or an NSMutableSet (if the order doesn't really matter) to store the unique list of NSURL objects.
Comment 8 David Kilzer (:ddkilzer) 2006-12-18 20:14:20 PST
BTW, the reason this works is that (void)addArchive:(WebArchive *)archive in WebUnarchivingState puts the subresource values into an NSMutableDictionary named archivedResources, which causes duplicates originally saved to the .webarchive file to be removed.

Comment 9 Darin Adler 2006-12-19 07:43:09 PST
Comment on attachment 11900 [details]
Patch v1

r=me
Comment 10 David Kilzer (:ddkilzer) 2006-12-20 03:42:23 PST
Comment on attachment 11900 [details]
Patch v1

Clearing review flag pending Bug 11882 and conversion to C++ of the archiving code.
Comment 11 David Kilzer (:ddkilzer) 2007-02-03 16:50:31 PST
Created attachment 12905 [details]
Patch v2

Proposed fix.
Comment 12 Darin Adler 2007-02-04 18:41:44 PST
Comment on attachment 12905 [details]
Patch v2

As we make more and more substantial changes it really makes me wish more of this was translated to C++.

r=me
Comment 13 David Kilzer (:ddkilzer) 2007-02-05 11:08:58 PST
(In reply to comment #12)
> As we make more and more substantial changes it really makes me wish more of
> this was translated to C++.

Given my time constraints and the apparent release schedule (e.g., the impending tree lock-down in two days), I don't think it's realistic to do this and still get the fixes into the next release (which is my primary goal).
Comment 14 David Kilzer (:ddkilzer) 2007-02-05 19:53:02 PST
Committed revision 19422.