Bug 58947 - Add a way to serialize a WebView back to HTML
Summary: Add a way to serialize a WebView back to HTML
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebKit Misc. (show other bugs)
Version: 528+ (Nightly build)
Hardware: PC OS X 10.5
: P2 Normal
Assignee: Jay Civelli
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-04-19 17:23 PDT by Jay Civelli
Modified: 2011-04-28 15:52 PDT (History)
9 users (show)

See Also:


Attachments
Initial patch (53.96 KB, patch)
2011-04-19 17:34 PDT, Jay Civelli
no flags Details | Formatted Diff | Diff
Fixing style (53.89 KB, patch)
2011-04-19 18:28 PDT, Jay Civelli
abarth: review-
Details | Formatted Diff | Diff
Moved code to WebCore. (53.33 KB, patch)
2011-04-21 11:27 PDT, Jay Civelli
abarth: review-
Details | Formatted Diff | Diff
Addressed Adam's and Alexey's comments (55.72 KB, patch)
2011-04-21 14:59 PDT, Jay Civelli
abarth: review-
Details | Formatted Diff | Diff
Now keeping original files encoding (59.23 KB, patch)
2011-04-22 13:49 PDT, Jay Civelli
no flags Details | Formatted Diff | Diff
Only specifying the charset if the page does not do so. (61.06 KB, patch)
2011-04-22 17:43 PDT, Jay Civelli
no flags Details | Formatted Diff | Diff
Addressing Alexey's comments (66.05 KB, patch)
2011-04-25 12:29 PDT, Jay Civelli
no flags Details | Formatted Diff | Diff
Patch (65.61 KB, patch)
2011-04-26 01:09 PDT, Jay Civelli
no flags Details | Formatted Diff | Diff
Patch (64.49 KB, patch)
2011-04-26 11:03 PDT, Jay Civelli
no flags Details | Formatted Diff | Diff
Patch (64.50 KB, patch)
2011-04-26 11:36 PDT, Jay Civelli
no flags Details | Formatted Diff | Diff
Patch (64.50 KB, patch)
2011-04-26 12:09 PDT, Jay Civelli
no flags Details | Formatted Diff | Diff
Patch (64.50 KB, patch)
2011-04-26 17:33 PDT, Jay Civelli
no flags Details | Formatted Diff | Diff
Patch (64.44 KB, patch)
2011-04-26 17:45 PDT, Jay Civelli
no flags Details | Formatted Diff | Diff
Patch for landing (64.44 KB, patch)
2011-04-26 23:00 PDT, Jay Civelli
no flags Details | Formatted Diff | Diff
Patch for landing (64.43 KB, patch)
2011-04-26 23:04 PDT, Jay Civelli
no flags Details | Formatted Diff | Diff
Patch for landing (64.43 KB, patch)
2011-04-27 08:51 PDT, Jay Civelli
commit-queue: commit-queue-
Details | Formatted Diff | Diff
patch (updated to avoid Android.mk) (63.96 KB, patch)
2011-04-28 02:13 PDT, Adam Barth
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Jay Civelli 2011-04-19 17:23:37 PDT
We need an API that serializes the content of a WebView (all frames) back to HTML and also retrieves the resources for that page.
That would allow to save a page to a self-contained file (ex: MHTML)
Comment 1 Jay Civelli 2011-04-19 17:34:29 PDT
Created attachment 90280 [details]
Initial patch
Comment 2 WebKit Review Bot 2011-04-19 17:46:00 PDT
Attachment 90280 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1

Source/WebKit/chromium/tests/data/htmlserializer/top_frame.html:1:  One or more unexpected \r (^M) found; better to use only a \n  [whitespace/carriage_return] [1]
Suppressing further [whitespace/carriage_return] reports for this file.
Source/WebKit/chromium/tests/data/htmlserializer/blank_frames.html:1:  One or more unexpected \r (^M) found; better to use only a \n  [whitespace/carriage_return] [1]
Suppressing further [whitespace/carriage_return] reports for this file.
Source/WebKit/chromium/tests/data/htmlserializer/iframe2.html:1:  One or more unexpected \r (^M) found; better to use only a \n  [whitespace/carriage_return] [1]
Suppressing further [whitespace/carriage_return] reports for this file.
Total errors found: 58 in 18 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 3 WebKit Review Bot 2011-04-19 18:07:16 PDT
Attachment 90280 [details] did not build on mac:
Build output: http://queues.webkit.org/results/8471501
Comment 4 Jay Civelli 2011-04-19 18:28:49 PDT
Created attachment 90288 [details]
Fixing style
Comment 5 WebKit Review Bot 2011-04-19 20:28:14 PDT
Attachment 90288 [details] did not build on chromium:
Build output: http://queues.webkit.org/results/8475511
Comment 6 Adam Barth 2011-04-19 22:28:31 PDT
Comment on attachment 90288 [details]
Fixing style

View in context: https://bugs.webkit.org/attachment.cgi?id=90288&action=review

This looks great.  The comments below are just a ton of nits.  Once we get this working, we should remove the Save Web Page As code in favor of this code path.

> Source/WebCore/editing/MarkupAccumulator.h:82
> +    virtual void appendCustomAttributes(Vector<UChar>&, Element*, Namespaces*) { }

Please don't put implementations of virtual methods in headers.  That just makes the linker work hard.

> Source/WebKit/chromium/public/WebHTMLSerializer.h:46
> +// In the serialized version, any JavaScript tag or link is removed (event

JavaScript tag -> script tags

link -> ???

> Source/WebKit/chromium/public/WebHTMLSerializer.h:57
> +        SerializedResource() { }
> +        SerializedResource(const WebURL& url, const WebCString& mimeType, const WebCString& data)
> +            : url(url), mimeType(mimeType), data(data) { }

This isn't proper WebKit style.

> Source/WebKit/chromium/public/WebHTMLSerializer.h:62
> +    WEBKIT_API static void serialize(WebView*, WebVector<SerializedResource>* resources);

This feels like a method of WebView rather than a static, but those are questions for fishd.

> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:61
> +static const WebCore::QualifiedName* ignoredTags[] = {
> +    &HTMLNames::scriptTag,
> +    &HTMLNames::noscriptTag,
> +};

These could be references instead of pointers.

> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:72
> +class WebHTMLSerializerImpl::SerializerMarkupAccumulator : public MarkupAccumulator {

I'd just use a normal class name and put all this stuff in an anonymous namespace.

> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:99
> +    : MarkupAccumulator(nodes, AbsoluteURLs), m_serializer(serializer)

This should be on two lines for proper WebKit style.

> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:118
> +    // FIXME: for object (plugins) tags and video tag we could replace them by an image fo their current contents.

fo -> of

> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:124
> +    if (!element->hasTagName(HTMLNames::iframeTag) && !element->hasTagName(HTMLNames::frameTag))
> +        return;

What about other FrameOwnerElement subclasses, like <object> ?

> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:129
> +    if (url.isValid() && url != blankURL())
> +        return;

comparing against blankURL isn't right.  You should look to see if it has the scheme "about"

> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:132
> +    url = m_serializer->getURLForBlankFrame(frameElement->contentFrame());

getURLForBlankFrame ->urlForBlankFrame (or something...  we don't start function names with "get").

> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:133
> +    RefPtr<Attribute> attribute = Attribute::create(HTMLNames::srcAttr, AtomicString(url.string()));

You don't need to call AtomicString explicitly.

> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:144
> +    : m_resources(resources), m_blankFrameCounter(0)

Same style problem.

> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:154
> +    if (!url.isValid() || url == blankURL())
> +        // For blank frames we generate a fake URL so they can be referenced by their containing frame.
> +        url = getURLForBlankFrame(frame);

Multi-line ifs should have { } even if they only have one statement.  Also, same blankURL bug.

> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:180
> +            KURL url = document->completeURL(imageElem->getAttribute(HTMLNames::srcAttr));

This won't be the URL of the image in the case where there's a redirect.

> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:183
> +        } else if (element->hasTagName(HTMLNames::linkTag) && equalIgnoringCase(element->getAttribute(HTMLNames::typeAttr), "text/css")) {

This is wrong.  You want to know whether the rel attribute contains stylesheet.  You should be able to ask the HTMLLinkElement if it's a stylesheet.

> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:186
> +            if (sheet->isCSSStyleSheet()) {

Yeah, looks like you've already do that.  You should remove the typeAttr check.

> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:189
> +                // Note that serializeCSSStyleSheet took care of adding the style sheet to the resources vector.

Can you change this into an ASSERT?

> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:200
> +    Frame* childFrame = frame->tree()->firstChild();
> +    while (childFrame) {

This looks like a for loop in disguise.

> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:247
> +    String mimeType = MIMETypeRegistry::getMIMETypeForExtension(image->image()->filenameExtension());

This is wrong.

> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:297
> +    String url("http://chromium_phony_frame/");

Oh man.  That's not a good idea.  How about "wyciwyg:" as the stem?

> Source/WebKit/chromium/src/WebHTMLSerializerImpl.h:60
> +    class SerializerMarkupAccumulator;
> +    friend class SerializerMarkupAccumulator;

This doen't necessarily need to be a friend.  We already have the API layer as a shield.

> Source/WebKit/chromium/src/WebHTMLSerializerImpl.h:62
> +    // Serializes the style-sheet back to text and adds it to the resources if URL is not-empty.

style-sheet => stylesheet

> Source/WebKit/chromium/tests/WebHTMLSerializerTest.cpp:61
> +    WebHTMLSerializerTest() : m_htmlMimeType(WebString::fromUTF8("text/html")), m_cssMimeType(WebString::fromUTF8("text/css")), m_pngMimeType(WebString::fromUTF8("img/png")) { }

Same style problem.
Comment 7 Alexey Proskuryakov 2011-04-19 23:55:34 PDT
Subclassing WebCore classes is not how we customize behavior from WebKit. This should be a client or a strategy call.

Can you explain what the end goal here is? One that I can think of sounds questionable, but maybe it's just me.
Comment 8 Adam Barth 2011-04-20 00:04:48 PDT
> Subclassing WebCore classes is not how we customize behavior from WebKit. This should be a client or a strategy call.

That's not really what's going on in this patch.  No WebCore behavior is being customized.

There's a larger question of whether this code should be in WebCore or WebKit.  In discussing this with Jay, I think moving more of it to WebCore would make sense as this logic isn't overly specific to Chromium.

> Can you explain what the end goal here is? One that I can think of sounds questionable, but maybe it's just me.

This code supports a browser feature much like the "Save as web archive" feature in Safari.  Currently Chrome has a "Save as web page (complete)", but this is a different approach to that feature.  I'd like to replace the existing approach (see http://trac.webkit.org/browser/trunk/Source/WebKit/chromium/src/WebPageSerializerImpl.cpp but please don't cry) with this approach, but that might or might not happen.
Comment 9 Alexey Proskuryakov 2011-04-20 09:07:31 PDT
> That's not really what's going on in this patch.  No WebCore behavior is being customized.

If you mean that nothing observable from DOM is being customized, then sure. But that's not the point. Having a virtual method that's being overridden from WebKit (instead of using a client call) is exactly what is wrong about this patch technically - there is no exception like "we can avoid using a client-based design if we're implementing a feature that's only accessible from browser UI".

I suspect that you may feel uneasy about clients since they are commonly per-frame. Well, since we're exposing serializer as a separate WebCore "API", then there can be per-serializer clients, too.

Another approach is of course to have the logic in WebCore and to set a boolean flag in serializer instance.

> This code supports a browser feature much like the "Save as web archive" feature in Safari.  Currently Chrome has a "Save as web page (complete)", but this is a different approach to that feature.

Can you describe what the difference with the desired Chromium behavior is? Can that be something that all WebKit-based browsers want?
Comment 10 Adam Barth 2011-04-20 09:28:32 PDT
> If you mean that nothing observable from DOM is being customized, then sure. But that's not the point. Having a virtual method that's being overridden from WebKit (instead of using a client call) is exactly what is wrong about this patch technically

That's exactly what a client is, right?  The only difference is that clients are pure virtual.

> Can you describe what the difference with the desired Chromium behavior is?

Your question isn't well-posed.  The difference between the desired Chromium behavior and what?

> Can that be something that all WebKit-based browsers want?

Potentially.  There isn't really a "correct" way to save a live web page into a file.  There are a bunch of trade-offs.  This approach represents one set of trade-offs.  The existing code represents a different set of trade-offs.  Many browsers have this sort of save-as feature, so it might well be valuable to other WebKit-based browsers, depending on how they view the various trade-offs.
Comment 11 Alexey Proskuryakov 2011-04-20 10:10:54 PDT
> That's exactly what a client is, right?  The only difference is that clients are pure virtual.

Yes, it's a major distinction. If you don't like this design principle, this bug is not a place to discuss it.

> Your question isn't well-posed.  The difference between the desired Chromium behavior and what?

I can't do anything to make the question any more clear. You said that "this is a different approach", I asked what the difference was, and now you are asking "difference with what?"

> Potentially.  There isn't really a "correct" way to save a live web page into a file.

Agreed. We have tons of customizations of this kind in WebCore, sometimes implemented with client calls, other times with preferences. See e.g. platform editing behaviors, where WebCore is customized to do different things for the same user actions.

I'm asking someone to describe the desired behavior in plain English, to avoid trying to guess the purpose from poorly commented and admittedly bad ("but please don't cry ") code. I presume this is not a trade secret.
Comment 12 Adam Barth 2011-04-20 11:05:19 PDT
One major difference between WebPageSerializer and WebHTMLSerializer is that WebPageSerializer includes inline script whereas WebHTMLSerializer does not.  Also, WebHTMLSerializer might replace a canvas with a static image whereas WebPageSerializer would not.

Generalizing WebHTMLSerializer's perspective is that you want a static version of the page (but not a screenshot because you want text-reflow, etc, to work).  WebPageSerializer is trying to create a local version of the page that works as closely as possible to the hosted version.  WebPageSerializer roughly uses the same approach as Firefox's "save complete web page", which has a number of unfixable bugs:

<script>
document.write("This text should appear once");
</script>

In WebPageSerializer, the text will appear twice but with WebHTMLSerializer the text appears only once.

<button>Hello</button>
<script>
document.getElementsByTagName("button")[0].addEventListender("click", function () {
  alert("Hi there");
}, false);
</script>

In WebPageSerializer, the button works but with WebHTMLSerializer the button does not.

Now, aside from the differences in approach, WebPageSerializer has very poor code quality and lots of bugs.  I wasn't involved in reviewing the patch that added it, and I think the project should not have accepted it at that level of quality.  I re-wrote it from scratch a number of months ago, but never got around to finishing it because other higher-priority things came up.

Do you have other questions?
Comment 13 Alexey Proskuryakov 2011-04-20 11:27:13 PDT
Thank you - that's a great explanation, and I think that it's good idea to consider keeping more of the code in WebCore.
Comment 14 Jay Civelli 2011-04-21 11:25:34 PDT
Comment on attachment 90288 [details]
Fixing style

View in context: https://bugs.webkit.org/attachment.cgi?id=90288&action=review

>> Source/WebCore/editing/MarkupAccumulator.h:82
>> +    virtual void appendCustomAttributes(Vector<UChar>&, Element*, Namespaces*) { }
> 
> Please don't put implementations of virtual methods in headers.  That just makes the linker work hard.

Done.

>> Source/WebKit/chromium/public/WebHTMLSerializer.h:46
>> +// In the serialized version, any JavaScript tag or link is removed (event
> 
> JavaScript tag -> script tags
> 
> link -> ???

Corrected.

>> Source/WebKit/chromium/public/WebHTMLSerializer.h:57
>> +            : url(url), mimeType(mimeType), data(data) { }
> 
> This isn't proper WebKit style.

Move to WebPageSerialization.h, constructor actually removed.

>> Source/WebKit/chromium/public/WebHTMLSerializer.h:62
>> +    WEBKIT_API static void serialize(WebView*, WebVector<SerializedResource>* resources);
> 
> This feels like a method of WebView rather than a static, but those are questions for fishd.

Per fishd, moved it to WebPageSerializer.

>> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:61
>> +};
> 
> These could be references instead of pointers.

makes compiler unhappy:
"Array of references are illegal"

>> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:72
>> +class WebHTMLSerializerImpl::SerializerMarkupAccumulator : public MarkupAccumulator {
> 
> I'd just use a normal class name and put all this stuff in an anonymous namespace.

OK, done.

>> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:99
>> +    : MarkupAccumulator(nodes, AbsoluteURLs), m_serializer(serializer)
> 
> This should be on two lines for proper WebKit style.

Done.

>> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:118
>> +    // FIXME: for object (plugins) tags and video tag we could replace them by an image fo their current contents.
> 
> fo -> of

Done.

>> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:124
>> +        return;
> 
> What about other FrameOwnerElement subclasses, like <object> ?

Ah, good point. Now dealing with that.

>> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:129
>> +        return;
> 
> comparing against blankURL isn't right.  You should look to see if it has the scheme "about"

Done.

>> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:132
>> +    url = m_serializer->getURLForBlankFrame(frameElement->contentFrame());
> 
> getURLForBlankFrame ->urlForBlankFrame (or something...  we don't start function names with "get").

Done

>> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:133
>> +    RefPtr<Attribute> attribute = Attribute::create(HTMLNames::srcAttr, AtomicString(url.string()));
> 
> You don't need to call AtomicString explicitly.

Removed.

>> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:144
>> +    : m_resources(resources), m_blankFrameCounter(0)
> 
> Same style problem.

Fixed.

>> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:154
>> +        url = getURLForBlankFrame(frame);
> 
> Multi-line ifs should have { } even if they only have one statement.  Also, same blankURL bug.

Done.

>> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:180
>> +            KURL url = document->completeURL(imageElem->getAttribute(HTMLNames::srcAttr));
> 
> This won't be the URL of the image in the case where there's a redirect.

I think that's OK. All that matters is that the image URL in the serialized frame matches the URL for the image we store. We don't have to resolve the redirects.

>> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:183
>> +        } else if (element->hasTagName(HTMLNames::linkTag) && equalIgnoringCase(element->getAttribute(HTMLNames::typeAttr), "text/css")) {
> 
> This is wrong.  You want to know whether the rel attribute contains stylesheet.  You should be able to ask the HTMLLinkElement if it's a stylesheet.

Removed.

>> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:189
>> +                // Note that serializeCSSStyleSheet took care of adding the style sheet to the resources vector.
> 
> Can you change this into an ASSERT?

OK.

>> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:200
>> +    while (childFrame) {
> 
> This looks like a for loop in disguise.

Disguise removed.

>> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:247
>> +    String mimeType = MIMETypeRegistry::getMIMETypeForExtension(image->image()->filenameExtension());
> 
> This is wrong.

OK, now using the mime type from the response that created the resource. Is that better?

>> Source/WebKit/chromium/src/WebHTMLSerializerImpl.cpp:297
>> +    String url("http://chromium_phony_frame/");
> 
> Oh man.  That's not a good idea.  How about "wyciwyg:" as the stem?

OK.

>> Source/WebKit/chromium/src/WebHTMLSerializerImpl.h:60
>> +    friend class SerializerMarkupAccumulator;
> 
> This doen't necessarily need to be a friend.  We already have the API layer as a shield.

Not a friend anymore.

>> Source/WebKit/chromium/src/WebHTMLSerializerImpl.h:62
>> +    // Serializes the style-sheet back to text and adds it to the resources if URL is not-empty.
> 
> style-sheet => stylesheet

Done.

>> Source/WebKit/chromium/tests/WebHTMLSerializerTest.cpp:61
>> +    WebHTMLSerializerTest() : m_htmlMimeType(WebString::fromUTF8("text/html")), m_cssMimeType(WebString::fromUTF8("text/css")), m_pngMimeType(WebString::fromUTF8("img/png")) { }
> 
> Same style problem.

Fixed.
Comment 15 Jay Civelli 2011-04-21 11:27:05 PDT
Created attachment 90562 [details]
Moved code to WebCore.

As discussed, I moved this code to WebCore/page.
Comment 16 Adam Barth 2011-04-21 12:04:52 PDT
Comment on attachment 90562 [details]
Moved code to WebCore.

View in context: https://bugs.webkit.org/attachment.cgi?id=90562&action=review

This looks great.  A few minor comments.

> Source/WebCore/page/PageSerializer.cpp:56
> +namespace {

Apparently we're supposed to use the static keyword in favor of anonymous namespaces, but I see that you're using this because you want the SerializerMarkupAccumulator class to have internal linkage.

> Source/WebCore/page/PageSerializer.cpp:127
> +    url = m_serializer->urlForBlankFrame(frame);
> +    RefPtr<Attribute> attribute = Attribute::create(HTMLNames::srcAttr, url.string());
> +    appendAttribute(out, element, *attribute, namespaces);

This isn't quite correct in the case of an <object> tag, which uses the data attribute for the frame's URL.  HTMLFrameElementBase::isURLAttribute exists to solve this problem in other places.  See also HTMLObjectElement::imageSourceAttributeName for a slightly different solution to a similar problem.

> Source/WebCore/page/PageSerializer.cpp:146
> +

extra blank line here

> Source/WebCore/page/PageSerializer.cpp:163
> +    if (!url.isValid() || url == blankURL()) {

This should use the protocolIs check instead of comparing against blankURL.

> Source/WebCore/page/PageSerializer.cpp:168
> +    if (m_resourceURLs.contains(url)) {

Maybe we should convert all the URLs be unique?

> Source/WebCore/page/PageSerializer.cpp:191
> +            HTMLImageElement* imageElem = static_cast<HTMLImageElement*>(element);

imageElem => imageElement (we don't like abbreviations)

> Source/WebCore/page/PageSerializer.cpp:193
> +            CachedImage* cachedImg = imageElem->cachedImage();

cachedImage.

> Source/WebCore/page/PageSerializer.cpp:199
> +                KURL url = document->completeURL(linkElement->getAttribute(HTMLNames::hrefAttr));

This isn't really correct.  There might have been a redirect.  You probably want StyleSheet::finalURL, but that might cause problems when loading the style sheet...  I guess it depends on how we deserialize these.  I'm not sure what to do here.

> Source/WebCore/page/PageSerializer.cpp:207
> +                serializeCSSStyleSheet(static_cast<CSSStyleSheet*>(sheet), KURL());

What about style attributes?  Seems like you'll want to grab the images and such out of those as well.

> Source/WebCore/page/PageSerializer.cpp:217
> +    String cssText;

We should use StringBuilder instead.  Calling append a lot on String is slow.

> Source/WebCore/page/PageSerializer.cpp:307
> +    String url("wyciwyg://frame/");
> +    url.append(String::number(m_blankFrameCounter++));
> +    KURL fakeURL(ParsedURLString, url);

You should use makeString instead of calling append.  I'm not sure it matters, but it's the new hotness.

> Source/WebCore/page/PageSerializer.h:92
> +    // The list of resources retreived so far.
> +    Vector<Resource>* m_resources;
> +
> +    // The list of URLs we have retrieved so far, use to ensure we only retrieved each resource once.
> +    ListHashSet<KURL> m_resourceURLs;
> +
> +    // The URLs used to identify blank frames.
> +    HashMap<Frame*, KURL> m_blankFrameURLs;
> +
> +    // Used to generate unique fake URLs for blank frames.
> +    unsigned m_blankFrameCounter;

We usually skip these comments and remove all the blank lines between member variables.

> Source/WebKit/chromium/src/WebPageSerializer.cpp:195
> +        resource.data = WebCString(iter->data->data(), iter->data->size());

This creates a copy of all the images?  There isn't a way to just take a reference to the SharedData ?
Comment 17 Adam Barth 2011-04-21 12:06:16 PDT
You should add the new WebCore files to all the other build systems.
Comment 18 Alexey Proskuryakov 2011-04-21 12:26:27 PDT
Comment on attachment 90562 [details]
Moved code to WebCore.

View in context: https://bugs.webkit.org/attachment.cgi?id=90562&action=review

Looks pretty good to me, too.

The new serializer uses utf-8 for all content, correct? That seems less than perfect, since page behavior does depend on encoding (most commonly to encode forms and request URLs). Is the encoding transferred out of band somewhere?

>> Source/WebCore/page/PageSerializer.cpp:56
>> +namespace {
> 
> Apparently we're supposed to use the static keyword in favor of anonymous namespaces, but I see that you're using this because you want the SerializerMarkupAccumulator class to have internal linkage.

We usually don't care about linkage utility classes in WebCore namespace. So I suggest just removing the anonymous namespace.

> Source/WebCore/page/PageSerializer.cpp:138
> +PageSerializer::Resource::Resource() { }

Braces should be on separate lines.

> Source/WebCore/page/PageSerializer.cpp:171
> +        // FIXME: we could have 2 frame with the same URL but which were dynamically changed and have now
> +        //        different content. So we should serialize both and somehow rename the frame src in the
> +        //        containing frame. Arg!

We don't align code and comments like that, please just use one space.

> Source/WebCore/page/PageSerializer.cpp:235
> +            // FIXME: add support for font face rule. It is not clear to me at this point if the actual otf/eot file can
> +            //        be retrieved from the CSSFontFaceRule object.

Same comment about indentation.

> Source/WebCore/page/PageSerializer.cpp:305
> +    String url("wyciwyg://frame/");

What will WebCore do with this wyciwyg scheme later? The result of this serialization seems useless to non-Chromium ports, and this should either be addressed, or made clear through naming and use of #ifs.

> Source/WebCore/page/PageSerializer.cpp:316
> +} // namespace WebCore
> +
> +
> +

Lots of empty lines here.

I'm not a fan of comments like "namespace WebCore" - most of the time you don't question file structure, so they are useless and add visual noise. And when you do, they are likely to be misleading - the only way to match braces is to use your text editor features, comments can't tell the truth.

> Source/WebCore/page/PageSerializer.h:51
> +// It serializes all the page frames and retreives resources such as images and CSS stylesheets.

Typo: "retreives".

> Source/WebCore/page/PageSerializer.h:59
> +        Resource(const KURL&, const CString& mimeType, PassRefPtr<SharedBuffer> data);

Please don't use CString here. It's a fairly strange a dangerous class.

> Source/WebCore/page/PageSerializer.h:82
> +    // The list of resources retreived so far.

Typo: "retreived". I'm not sure if this comment adds any information.
Comment 19 Adam Barth 2011-04-21 13:43:49 PDT
Comment on attachment 90562 [details]
Moved code to WebCore.

View in context: https://bugs.webkit.org/attachment.cgi?id=90562&action=review

>> Source/WebCore/page/PageSerializer.cpp:305
>> +    String url("wyciwyg://frame/");
> 
> What will WebCore do with this wyciwyg scheme later? The result of this serialization seems useless to non-Chromium ports, and this should either be addressed, or made clear through naming and use of #ifs.

wyciwyg doesn't mean anything special to Chromium either.  I'm not sure how this is supposed to work.  Maybe the code that uses the serialized version is supposed to re-write these URLs to be local to the disk?  I just recommended that Jay use wyciwyg because Firefox uses that scheme for a similar purpose.
Comment 20 Jay Civelli 2011-04-21 14:58:02 PDT
Comment on attachment 90562 [details]
Moved code to WebCore.

View in context: https://bugs.webkit.org/attachment.cgi?id=90562&action=review

>> Source/WebCore/page/PageSerializer.cpp:127
>> +    appendAttribute(out, element, *attribute, namespaces);
> 
> This isn't quite correct in the case of an <object> tag, which uses the data attribute for the frame's URL.  HTMLFrameElementBase::isURLAttribute exists to solve this problem in other places.  See also HTMLObjectElement::imageSourceAttributeName for a slightly different solution to a similar problem.

I see. Since there is no getURLAttribute method I ended up iterating over the attributes and calling isURLAttribute to figure out the right one. Less than ideal, but I am not sure what else to do (short of introducing a new getURLAttribute method to Element).

>> Source/WebCore/page/PageSerializer.cpp:146
>> +
> 
> extra blank line here

Removed.

>> Source/WebCore/page/PageSerializer.cpp:163
>> +    if (!url.isValid() || url == blankURL()) {
> 
> This should use the protocolIs check instead of comparing against blankURL.

Switched to protocolIs("about").
Do you know if there could be cases where an frame loads something other than about:blank?

>> Source/WebCore/page/PageSerializer.cpp:168
>> +    if (m_resourceURLs.contains(url)) {
> 
> Maybe we should convert all the URLs be unique?

Yeah, we could unique frame URLs but that would require some more changes to MarkupAccumulator.
I'll do that in a subsequent patch if that's OK with you.

>> Source/WebCore/page/PageSerializer.cpp:191
>> +            HTMLImageElement* imageElem = static_cast<HTMLImageElement*>(element);
> 
> imageElem => imageElement (we don't like abbreviations)

Done.

>> Source/WebCore/page/PageSerializer.cpp:193
>> +            CachedImage* cachedImg = imageElem->cachedImage();
> 
> cachedImage.

Done.

>> Source/WebCore/page/PageSerializer.cpp:199
>> +                KURL url = document->completeURL(linkElement->getAttribute(HTMLNames::hrefAttr));
> 
> This isn't really correct.  There might have been a redirect.  You probably want StyleSheet::finalURL, but that might cause problems when loading the style sheet...  I guess it depends on how we deserialize these.  I'm not sure what to do here.

I am not sure I understand the problem.
Let's say the page has:
<link rel="stylesheet" type="text/css" href="styles.css" />
and styles.css is a redirect to something else.
As long as the link is serialized as is (without it's href being resolved), then its URL will match the URL we are using for the CSS resource. And during deserialization, they'll still match.

>> Source/WebCore/page/PageSerializer.cpp:207
>> +                serializeCSSStyleSheet(static_cast<CSSStyleSheet*>(sheet), KURL());
> 
> What about style attributes?  Seems like you'll want to grab the images and such out of those as well.

It does it as part of the serialization process. (this is validated by the WebPageNewSerializeTest::CSSResources test).

>> Source/WebCore/page/PageSerializer.cpp:217
>> +    String cssText;
> 
> We should use StringBuilder instead.  Calling append a lot on String is slow.

Done.

>> Source/WebCore/page/PageSerializer.cpp:307
>> +    KURL fakeURL(ParsedURLString, url);
> 
> You should use makeString instead of calling append.  I'm not sure it matters, but it's the new hotness.

Done, I totally want to use the hippest APIs.

>> Source/WebCore/page/PageSerializer.h:92
>> +    unsigned m_blankFrameCounter;
> 
> We usually skip these comments and remove all the blank lines between member variables.

Removed the comments.

>> Source/WebKit/chromium/src/WebPageSerializer.cpp:195
>> +        resource.data = WebCString(iter->data->data(), iter->data->size());
> 
> This creates a copy of all the images?  There isn't a way to just take a reference to the SharedData ?

Yes, it copies all resources.
There is no way in the current Chromium WebKit API to create a wrapper around SharedData.
I guess I could do that if you think that's necessary.
Comment 21 Jay Civelli 2011-04-21 14:59:57 PDT
Created attachment 90610 [details]
Addressed Adam's and Alexey's comments
Comment 22 Jay Civelli 2011-04-21 15:02:17 PDT
Comment on attachment 90562 [details]
Moved code to WebCore.

View in context: https://bugs.webkit.org/attachment.cgi?id=90562&action=review

>>> Source/WebCore/page/PageSerializer.cpp:56
>>> +namespace {
>> 
>> Apparently we're supposed to use the static keyword in favor of anonymous namespaces, but I see that you're using this because you want the SerializerMarkupAccumulator class to have internal linkage.
> 
> We usually don't care about linkage utility classes in WebCore namespace. So I suggest just removing the anonymous namespace.

Removed the anonymous namespace.

>> Source/WebCore/page/PageSerializer.cpp:138
>> +PageSerializer::Resource::Resource() { }
> 
> Braces should be on separate lines.

Done.

>> Source/WebCore/page/PageSerializer.cpp:171
>> +        //        containing frame. Arg!
> 
> We don't align code and comments like that, please just use one space.

Done.

>> Source/WebCore/page/PageSerializer.cpp:235
>> +            //        be retrieved from the CSSFontFaceRule object.
> 
> Same comment about indentation.

Done.

>> Source/WebCore/page/PageSerializer.cpp:316
>> +
> 
> Lots of empty lines here.
> 
> I'm not a fan of comments like "namespace WebCore" - most of the time you don't question file structure, so they are useless and add visual noise. And when you do, they are likely to be misleading - the only way to match braces is to use your text editor features, comments can't tell the truth.

Removed empty lines and comment.

>> Source/WebCore/page/PageSerializer.h:51
>> +// It serializes all the page frames and retreives resources such as images and CSS stylesheets.
> 
> Typo: "retreives".

Fixed.

>> Source/WebCore/page/PageSerializer.h:59
>> +        Resource(const KURL&, const CString& mimeType, PassRefPtr<SharedBuffer> data);
> 
> Please don't use CString here. It's a fairly strange a dangerous class.

Now using String.

>> Source/WebCore/page/PageSerializer.h:82
>> +    // The list of resources retreived so far.
> 
> Typo: "retreived". I'm not sure if this comment adds any information.

Removed comments.
Comment 23 WebKit Review Bot 2011-04-21 15:03:37 PDT
Attachment 90610 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1

Source/WebCore/editing/MarkupAccumulator.h:96:  The parameter name "element" adds no information, so it should be removed.  [readability/parameter_name] [5]
Total errors found: 1 in 19 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 24 Jay Civelli 2011-04-21 15:04:09 PDT
On patch number 4 I fixed the encoding by intercepting the content meta attribute and replacing it with UTF-8 (as well as the CSS charset) per Adam suggestion.
I still need to add the new files to all build systems.
Comment 25 Adam Barth 2011-04-21 15:06:11 PDT
Comment on attachment 90562 [details]
Moved code to WebCore.

View in context: https://bugs.webkit.org/attachment.cgi?id=90562&action=review

>>> Source/WebCore/page/PageSerializer.cpp:127
>>> +    appendAttribute(out, element, *attribute, namespaces);
>> 
>> This isn't quite correct in the case of an <object> tag, which uses the data attribute for the frame's URL.  HTMLFrameElementBase::isURLAttribute exists to solve this problem in other places.  See also HTMLObjectElement::imageSourceAttributeName for a slightly different solution to a similar problem.
> 
> I see. Since there is no getURLAttribute method I ended up iterating over the attributes and calling isURLAttribute to figure out the right one. Less than ideal, but I am not sure what else to do (short of introducing a new getURLAttribute method to Element).

You can add the method to HTMLFrameOwnerElement, right?

>>> Source/WebCore/page/PageSerializer.cpp:163
>>> +    if (!url.isValid() || url == blankURL()) {
>> 
>> This should use the protocolIs check instead of comparing against blankURL.
> 
> Switched to protocolIs("about").
> Do you know if there could be cases where an frame loads something other than about:blank?

Sure.  WebKit treats about:foo just like about:blank.

<iframe src="about:foo"></script>

>>> Source/WebCore/page/PageSerializer.cpp:168
>>> +    if (m_resourceURLs.contains(url)) {
>> 
>> Maybe we should convert all the URLs be unique?
> 
> Yeah, we could unique frame URLs but that would require some more changes to MarkupAccumulator.
> I'll do that in a subsequent patch if that's OK with you.

Sounds good.

>>> Source/WebCore/page/PageSerializer.cpp:199
>>> +                KURL url = document->completeURL(linkElement->getAttribute(HTMLNames::hrefAttr));
>> 
>> This isn't really correct.  There might have been a redirect.  You probably want StyleSheet::finalURL, but that might cause problems when loading the style sheet...  I guess it depends on how we deserialize these.  I'm not sure what to do here.
> 
> I am not sure I understand the problem.
> Let's say the page has:
> <link rel="stylesheet" type="text/css" href="styles.css" />
> and styles.css is a redirect to something else.
> As long as the link is serialized as is (without it's href being resolved), then its URL will match the URL we are using for the CSS resource. And during deserialization, they'll still match.

Right, the danger is that you're taking bytes that you got from one URL and saying that they came from another URL.  That's the first step towards a security vulnerability.  Once we re-write all the URLs to use dummy URLs, we won't have that problem.

>>> Source/WebCore/page/PageSerializer.cpp:207
>>> +                serializeCSSStyleSheet(static_cast<CSSStyleSheet*>(sheet), KURL());
>> 
>> What about style attributes?  Seems like you'll want to grab the images and such out of those as well.
> 
> It does it as part of the serialization process. (this is validated by the WebPageNewSerializeTest::CSSResources test).

Ah, great.  I missed that.

>>> Source/WebKit/chromium/src/WebPageSerializer.cpp:195
>>> +        resource.data = WebCString(iter->data->data(), iter->data->size());
>> 
>> This creates a copy of all the images?  There isn't a way to just take a reference to the SharedData ?
> 
> Yes, it copies all resources.
> There is no way in the current Chromium WebKit API to create a wrapper around SharedData.
> I guess I could do that if you think that's necessary.

Sounds like a performance optimization we could consider in the future.  Perhaps a FIXME comment?

> Source/WebKit/chromium/tests/WebPageNewSerializerTest.cpp:183
> +TEST_F(WebPageNewSerializeTest, CSSResources)

I guess I don't see how this tests the case I was thinking about.  What about:

<div style="width:10; heigh:10; background-image: url(blue_background.png)">Hi</div>

?
Comment 26 WebKit Review Bot 2011-04-21 15:26:30 PDT
Attachment 90610 [details] did not build on chromium:
Build output: http://queues.webkit.org/results/8498121
Comment 27 Adam Barth 2011-04-21 16:21:22 PDT
Comment on attachment 90610 [details]
Addressed Adam's and Alexey's comments

View in context: https://bugs.webkit.org/attachment.cgi?id=90610&action=review

I feel like I'm nit-picking at this point.  The main things to address are moving the function to HTMLFrameOwnerElement and style attributes.

>> Source/WebCore/editing/MarkupAccumulator.h:96
>>      void appendOpenTag(Vector<UChar>& out, Element* element, Namespaces*);
>>      void appendCloseTag(Vector<UChar>& out, Element* element);
>> -    void appendAttribute(Vector<UChar>& out, Element* element, const Attribute&, Namespaces*);
>> +    virtual void appendAttribute(Vector<UChar>& out, Element* element, const Attribute&, Namespaces*);
> 
> The parameter name "element" adds no information, so it should be removed.  [readability/parameter_name] [5]

I'd fix all the instance of this issue in this block of code.

> Source/WebCore/page/PageSerializer.cpp:65
> +static const WebCore::QualifiedName* frameOwnerSrcAttributes[] = {

This array does not appear to be used.

> Source/WebCore/page/PageSerializer.cpp:81
> +    // FIXME: ideally HTMLFrameOwnerElement would have a method to return the URL attribute name.

I'd just add this method to HTMLFrameOwnerElement

> Source/WebCore/page/PageSerializer.cpp:133
> +    if (element->hasTagName(HTMLNames::metaTag) && attribute.name() == HTMLNames::contentAttr) {

Do we want to make sure the http-equiv attribute of this element is Content-Type ?  Otherwise, you could be munging other http-equiv meta tags.

> Source/WebCore/page/PageSerializer.cpp:136
> +        if (charset != "utf-8" && charset != "utf8") {

Doe we need a case-insensitive compare here?  If not, we should ASSERT that charset is lowercase.

> Source/WebCore/page/PageSerializer.cpp:142
> +    }

This isn't the only case.  There's also:

<meta charset="utf-8">

> Source/WebCore/page/PageSerializer.cpp:166
> +    if (!urlAttribute) {
> +        LOG_ERROR("No URL attribute found for frame owner.");
> +        return;
> +    }

Why is this an error?  It can definitely happen.

> Source/WebCore/page/PageSerializer.cpp:219
> +    Vector<Node*> nodes;
> +    SerializerMarkupAccumulator accumulator(this, &nodes);
> +    CString frameHTML = accumulator.serializeNodes(document->documentElement(), 0, IncludeNode).utf8();    
> +    m_resources->append(Resource(url, String("text/html"), SharedBuffer::create(frameHTML.data(), frameHTML.length())));
> +    m_resourceURLs.add(url);

I wonder if you want to add a <meta charset="UTF-8"> to the serialization just in case there's no meta element.

Also, is document always text/html?  What if it's application/svg+xml ?
Comment 28 WebKit Review Bot 2011-04-21 16:32:52 PDT
Attachment 90610 [details] did not build on chromium:
Build output: http://queues.webkit.org/results/8497170
Comment 29 Alexey Proskuryakov 2011-04-21 16:58:33 PDT
> On patch number 4 I fixed the encoding by intercepting the content meta attribute and replacing 
> it with UTF-8 (as well as the CSS charset) per Adam suggestion.

What I was saying was that ideally, one should preserve the original encoding if they want to restore the page later. Encoding affects the behavior, so changing it to utf-8 will break some pages, even though they will be decoded without problem.
Comment 30 Adam Barth 2011-04-21 17:35:58 PDT
(In reply to comment #29)
> > On patch number 4 I fixed the encoding by intercepting the content meta attribute and replacing 
> > it with UTF-8 (as well as the CSS charset) per Adam suggestion.
> 
> What I was saying was that ideally, one should preserve the original encoding if they want to restore the page later. Encoding affects the behavior, so changing it to utf-8 will break some pages, even though they will be decoded without problem.

Did you have something specific in mind?  This approach changes a lot of things that are observable to the page, so some number of pages will break.
Comment 31 Alexey Proskuryakov 2011-04-21 19:54:35 PDT
From the top of my head, I have two variants:
- encode documents using document.charset, add a META to make sure that we decode correctly (least confusing);
- encode using utf-8, change META if it exists, store the encoding out of band (or perhaps in a private attribute), so that it could be restored after parsing.

The first option seems preferable.

> This approach changes a lot of things that are observable to the page, so some number of pages will break.

Yes, I'm not pushing this as a pre-requisite for landing this patch. But encoding is something that may be easier to get right from the start than to change later.
Comment 32 Adam Barth 2011-04-21 20:13:43 PDT
(In reply to comment #31)
> From the top of my head, I have two variants:
> - encode documents using document.charset, add a META to make sure that we decode correctly (least confusing);
> - encode using utf-8, change META if it exists, store the encoding out of band (or perhaps in a private attribute), so that it could be restored after parsing.
> 
> The first option seems preferable.

The approach in Jay's previous patch is like (2) but it doesn't restore the encoding after parsing.  What goes wrong if we don't restore the encoding?
Comment 33 Alexey Proskuryakov 2011-04-21 20:42:03 PDT
I briefly mentioned that in comment 18. Some uses I could remember or easily find:
- for encoding form data;
- by XSSFilter, basically for the same purpose;
- by Document::completeURL();
- as a default encoding for some text-based subresources (which would obviously affect new one that are dynamically added after restoring);
- it changes rendering of a backslash/Yen character in Japanese.

Losing the encoding would cause a number of subtle bugs.
Comment 34 Adam Barth 2011-04-21 22:12:58 PDT
Ok.  Jay, what do you think about option (1):

- encode documents using document.charset, add a META to make sure that we decode correctly
Comment 35 Jay Civelli 2011-04-22 08:50:59 PDT
Option 1 sounds good, I'll go ahead and implement it this way.
I guess it probably does not matter for CSS subresources whether we enforce the charset or whether we force UTF-8, but I guess for consistency I'll preserve the encoding for these as well.
Comment 36 Jay Civelli 2011-04-22 13:49:26 PDT
Created attachment 90753 [details]
Now keeping original files encoding

With this new patch we now keep the original charset of the page and append our own meta tag to specify it (and we remove any meta tag specifying the charset).
Also addressed other comments from Adam and Alexey.
Comment 37 Jay Civelli 2011-04-22 13:52:00 PDT
Comment on attachment 90562 [details]
Moved code to WebCore.

View in context: https://bugs.webkit.org/attachment.cgi?id=90562&action=review

>>>> Source/WebCore/page/PageSerializer.cpp:127
>>>> +    appendAttribute(out, element, *attribute, namespaces);
>>> 
>>> This isn't quite correct in the case of an <object> tag, which uses the data attribute for the frame's URL.  HTMLFrameElementBase::isURLAttribute exists to solve this problem in other places.  See also HTMLObjectElement::imageSourceAttributeName for a slightly different solution to a similar problem.
>> 
>> I see. Since there is no getURLAttribute method I ended up iterating over the attributes and calling isURLAttribute to figure out the right one. Less than ideal, but I am not sure what else to do (short of introducing a new getURLAttribute method to Element).
> 
> You can add the method to HTMLFrameOwnerElement, right?

OK, done. Had to name it getFrameURLAttribute() as there is already an Element::getURLAttribute method.

>>>> Source/WebKit/chromium/src/WebPageSerializer.cpp:195
>>>> +        resource.data = WebCString(iter->data->data(), iter->data->size());
>>> 
>>> This creates a copy of all the images?  There isn't a way to just take a reference to the SharedData ?
>> 
>> Yes, it copies all resources.
>> There is no way in the current Chromium WebKit API to create a wrapper around SharedData.
>> I guess I could do that if you think that's necessary.
> 
> Sounds like a performance optimization we could consider in the future.  Perhaps a FIXME comment?

Added FIXME comment.

>> Source/WebKit/chromium/tests/WebPageNewSerializerTest.cpp:183
>> +TEST_F(WebPageNewSerializeTest, CSSResources)
> 
> I guess I don't see how this tests the case I was thinking about.  What about:
> 
> <div style="width:10; heigh:10; background-image: url(blue_background.png)">Hi</div>
> 
> ?

I does it, see css_test_page.html line 80.
Comment 38 Jay Civelli 2011-04-22 13:53:22 PDT
Comment on attachment 90610 [details]
Addressed Adam's and Alexey's comments

View in context: https://bugs.webkit.org/attachment.cgi?id=90610&action=review

>>> Source/WebCore/editing/MarkupAccumulator.h:96
>>> +    virtual void appendAttribute(Vector<UChar>& out, Element* element, const Attribute&, Namespaces*);
>> 
>> The parameter name "element" adds no information, so it should be removed.  [readability/parameter_name] [5]
> 
> I'd fix all the instance of this issue in this block of code.

Done.

>> Source/WebCore/page/PageSerializer.cpp:65
>> +static const WebCore::QualifiedName* frameOwnerSrcAttributes[] = {
> 
> This array does not appear to be used.

Ah right. Removed.

>> Source/WebCore/page/PageSerializer.cpp:81
>> +    // FIXME: ideally HTMLFrameOwnerElement would have a method to return the URL attribute name.
> 
> I'd just add this method to HTMLFrameOwnerElement

Done.

>> Source/WebCore/page/PageSerializer.cpp:133
>> +    if (element->hasTagName(HTMLNames::metaTag) && attribute.name() == HTMLNames::contentAttr) {
> 
> Do we want to make sure the http-equiv attribute of this element is Content-Type ?  Otherwise, you could be munging other http-equiv meta tags.

Good point. Now that we are keeping the original encoding I am skipping any <meta http-equiv="Content-Type"...

>> Source/WebCore/page/PageSerializer.cpp:136
>> +        if (charset != "utf-8" && charset != "utf8") {
> 
> Doe we need a case-insensitive compare here?  If not, we should ASSERT that charset is lowercase.

This code is gone.

>> Source/WebCore/page/PageSerializer.cpp:142
>> +    }
> 
> This isn't the only case.  There's also:
> 
> <meta charset="utf-8">

Also excluding these now.

>> Source/WebCore/page/PageSerializer.cpp:166
>> +    }
> 
> Why is this an error?  It can definitely happen.

OK, removed the LOG_ERROR. I was not sure which kind of frame owner would not have a src (since frame, object, embed would).

>> Source/WebCore/page/PageSerializer.cpp:219
>> +    m_resourceURLs.add(url);
> 
> I wonder if you want to add a <meta charset="UTF-8"> to the serialization just in case there's no meta element.
> 
> Also, is document always text/html?  What if it's application/svg+xml ?

I added a method to Document to get the contentType and it deals with xhtml and svg, and use that now.
Comment 39 WebKit Review Bot 2011-04-22 14:22:46 PDT
Attachment 90753 [details] did not build on chromium:
Build output: http://queues.webkit.org/results/8495506
Comment 40 Adam Barth 2011-04-22 14:40:07 PDT
Comment on attachment 90753 [details]
Now keeping original files encoding

View in context: https://bugs.webkit.org/attachment.cgi?id=90753&action=review

> Source/WebCore/dom/Document.cpp:1128
> +String Document::contentType()
> +{
> +    String content;
> +    if (m_document->isXHTMLDocument())
> +        content = "application/xhtml+xml";
> +    else if (m_document->isSVGDocument())
> +        content = "image/svg+xml";
> +    else
> +        content = "text/html";
> +
> +    return makeString(content, "; charset=", charset());
> +}

This doesn't seem right.  For example, what if the Content-Type is text/plain?  You need to ask the ResourceResponse that contained the document originally.  You probably need to ask DocumentLoader.
Comment 41 Alexey Proskuryakov 2011-04-22 14:47:29 PDT
I don't see a problem with that. Current document encoding is authoritative, as that's what is used in all the cases I mentioned. An HTTP  response may not have any charset in Content-Type, or it may have an invalid one.

On the other hand, I don't think that this function is needed. Can't we just use <meta charset>, not <meta http-equiv>?

Also, <meta>s of all sorts only work in HTML. You should be adding an encoding to <?xml?> for XML, and there is nothing that can be done for text.
Comment 42 Jay Civelli 2011-04-22 14:57:22 PDT
The reason I added this function is because in order to add the meta tag as we discussed before, I removed any existing meta-tags with a content-type.
May be that's a bad idea?
It seemed just trickier to add the meta only if it's not there. (I would have to wait for the end of the header, and I assumed such a tag should probably be one of the first tags in teh header as <title> and friends do need the encoding).
 
As a result of removing the existing the content-type meta tags, I removed the mime type of the document which is what I am attempting to regenerate.
My approach seems broken.

May be I should preparse the header before I even start serializing the DOM to figure-out if I need to specify the charset myself?
What do you guys think?
Comment 43 Simon Fraser (smfr) 2011-04-22 15:37:39 PDT
What is the serialization format? Is it something that should be standardized?
Comment 44 WebKit Review Bot 2011-04-22 16:38:48 PDT
Attachment 90753 [details] did not build on chromium:
Build output: http://queues.webkit.org/results/8495597
Comment 45 Adam Barth 2011-04-22 16:44:05 PDT
> What is the serialization format? Is it something that should be standardized?

The serialization format is HTML.
Comment 46 WebKit Review Bot 2011-04-22 17:39:32 PDT
Attachment 90753 [details] did not build on chromium:
Build output: http://queues.webkit.org/results/8487928
Comment 47 Jay Civelli 2011-04-22 17:43:55 PDT
Created attachment 90815 [details]
Only specifying the charset if the page does not do so.

I now only add the charset to pages that don't specify it already.
Also I specify the XML charset for XML documents (including XHTML and SVG) and ignore other types.
Comment 48 WebKit Review Bot 2011-04-22 20:04:09 PDT
Attachment 90815 [details] did not build on chromium:
Build output: http://queues.webkit.org/results/8497594
Comment 49 Alexey Proskuryakov 2011-04-22 21:36:44 PDT
Comment on attachment 90815 [details]
Only specifying the charset if the page does not do so.

View in context: https://bugs.webkit.org/attachment.cgi?id=90815&action=review

I have a number of comments, but I only looked into a few aspects of  the patch.

> Source/WebCore/ChangeLog:13
> +        * Android.mk:
> +        * CMakeLists.txt:
> +        * WebCore.gypi:

Still need many more build systems.

> Source/WebCore/ChangeLog:26
> +        * page/PageSerializer.h: Added.

Despite trying to follow this patch for a while now, I'm still not sure how to explain the relation of new code to what we have now (e.g. Web archives). I don't have any specific suggestions or requests, but if you could make it more clear, that would be great.

> Source/WebCore/dom/Document.cpp:1116
> +String Document::mimeType() const

This should be more explicit about what it returns. Like maybe suggestedMIMETypeForSerialization?

> Source/WebCore/html/HTMLFrameOwnerElement.cpp:114
> +    for (size_t i = 0; i < attributeMap->length(); ++i) {
> +        Attribute* attribute = attributeMap->attributeItem(i);
> +        if (isURLAttribute(attribute))

I bet we can do much better than iterate. Also, it's really confusing what this function even does. getSourceURLAttributeNameIfPresent?

From briefly looking at call sites, it appears that those should check that the elements are instances of HTMLFrameElementBase somehow, and then just use src. They won't know what to do with potential future frame owner elements anyway, so being future-proof beyond adding assertions doesn't really help.

> Source/WebCore/page/PageSerializer.cpp:91
> +    if (element->hasAttribute(HTMLNames::charsetAttr))
> +        return true;
> +
> +    String attributeValue;
> +    if (!(attributeValue = element->getAttribute(HTMLNames::http_equivAttr)).isNull()
> +        && attributeValue.lower().stripWhiteSpace() == "content-type")
> +        return true;

This logic doesn't match HTML5. Actual parsing logic lives in HTMLMetaCharsetParser, and I think that this check should be unified with it.

> Source/WebCore/page/PageSerializer.cpp:119
> +    // MarkupAccumulator does not serialie the <?xml ... line, so we add it explicitely to ensure the right encoding is specified.

Typos: serialie, explicitely.

I suspect that it's a bug that MarkupAccumulator doesn't do that. Specifically, bug 25206 tracks a consequence of that.

You should add guards (at least in the form of regression tests) to avoid getting two XML declarations - currently, this patch doesn't include any XML tests AFAICS. Obviously, it would be even better if you'd be willing to investigate and fix this in a prequel patch.

> Source/WebCore/page/PageSerializer.cpp:157
> +        String metaStart("<meta charset=\"");

I've lost track of what the best way to build strings is today. StringBuilder perhaps? Or just "<meta charset=\"" + m_document->charset() + "\">"?

> Source/WebCore/page/PageSerializer.cpp:237
> +    TextEncoding& textEncoding = TextEncoding(document->charset());

TextEncoding textEncoding(document->charset());

> Source/WebCore/page/PageSerializer.cpp:243
> +    CString frameHTML = textEncoding.encode(text.characters(), text.length(), QuestionMarksForUnencodables);

Why QuestionMarksForUnencodables here and elsewhere? The page may have characters outside the original encoding repertoire inserted via DOM. It's very easy to avoid losing information by using other options here.

> Source/WebCore/page/PageSerializer.cpp:307
> +        TextEncoding& textEncoding = TextEncoding(styleSheet->charset());

TextEncoding textEncoding(styleSheet->charset());

> Source/WebCore/page/PageSerializer.cpp:314
> +        m_resources->append(Resource(url, String("text/css"), SharedBuffer::create(text.data(), text.length())));

We probably need need to ensure @charset for those. Thankfully, it's much easier than HTML! It's also less important, and could easily wait for a follow-up patch.
Comment 50 Adam Barth 2011-04-22 22:25:24 PDT
> Despite trying to follow this patch for a while now, I'm still not sure how to explain the relation of new code to what we have now (e.g. Web archives). I don't have any specific suggestions or requests, but if you could make it more clear, that would be great.

Web archive is an Apple-proprietary format, which is coupled to the Objective-C ecosystem (e.g., via the use of NSKeyedEncoder).  MHTML is another serialization format, which is used by IE and Opera.

One perspective is that this mechanism is at a lower layer than the particular final serialization format.  This code serializes the DOM to HTML and encapsulates some subresources (such as images).  You could take the output of this class an complete the serialization to either web archive, MHTML, or some other on-disk format.

There might be some opportunity to re-use some of this mechanism in WebKit's web archive codepath, but the clients of that code path might want to make different trade-offs (e.g., about inline scripts, as discussed above).  Another perspective is that MarkupAccumulator is doing most of the heavy lifting here and is the most valuable component to re-use.
Comment 51 WebKit Review Bot 2011-04-22 22:44:55 PDT
Attachment 90815 [details] did not build on chromium:
Build output: http://queues.webkit.org/results/8495723
Comment 52 WebKit Review Bot 2011-04-22 23:45:54 PDT
Attachment 90815 [details] did not build on chromium:
Build output: http://queues.webkit.org/results/8503062
Comment 53 Simon Fraser (smfr) 2011-04-23 09:17:23 PDT
FYI one requirement we've found necessary that is poorly met by WebArchives is to be able to explode the serialized form into the original files, in order to edit those files to create bug reductions.

Would the current code be usable if we wanted, for example, to create a bundle (directory-based) archive, or a zip archive?
Comment 54 Adam Barth 2011-04-23 09:20:57 PDT
> Would the current code be usable if we wanted, for example, to create a bundle (directory-based) archive, or a zip archive?

Yep.  Chrome's current "Save As (complete)" feature produces a directory-based representation of the page.  The main HTML file is stored in next to a folder containing the subresources as individual files.  In this model, the exact on-disk representation is chosen by the embedder.
Comment 55 Jay Civelli 2011-04-25 12:28:32 PDT
Comment on attachment 90815 [details]
Only specifying the charset if the page does not do so.

View in context: https://bugs.webkit.org/attachment.cgi?id=90815&action=review

>> Source/WebCore/ChangeLog:13
>> +        * WebCore.gypi:
> 
> Still need many more build systems.

OK, I tried to add all of them. I am still missing the xcode one, I'll do it next (I don't think I can do it in a text editor without screwing things up, so I'll have to move my patch to my Mac).

>> Source/WebCore/ChangeLog:26
>> +        * page/PageSerializer.h: Added.
> 
> Despite trying to follow this patch for a while now, I'm still not sure how to explain the relation of new code to what we have now (e.g. Web archives). I don't have any specific suggestions or requests, but if you could make it more clear, that would be great.

As Adam said, the goal is to provide another way to save a page to disk. One of my objective is to have a subsequent patch that adds MHTML generation support (by leveraging the Archive/ArchiveFactory classes).

>> Source/WebCore/dom/Document.cpp:1116
>> +String Document::mimeType() const
> 
> This should be more explicit about what it returns. Like maybe suggestedMIMETypeForSerialization?

I am happy to change the name, but is the MIME type returned only specific to serialization? Are there cases where the document MIME type would not be considered to be one of these?

>> Source/WebCore/html/HTMLFrameOwnerElement.cpp:114
>> +        if (isURLAttribute(attribute))
> 
> I bet we can do much better than iterate. Also, it's really confusing what this function even does. getSourceURLAttributeNameIfPresent?
> 
> From briefly looking at call sites, it appears that those should check that the elements are instances of HTMLFrameElementBase somehow, and then just use src. They won't know what to do with potential future frame owner elements anyway, so being future-proof beyond adding assertions doesn't really help.

I believe since object and embed tags can be used as frames, then we cannot assume these tags are going to be HTMLFrameElementBase. And in the object tag case, the attribute to look at is not src but data.
I think ideally HTMLFrameOwnerElement would have a method that returns the actual URL attribute instead of isURLAttribute(). Implementing this (if this is actually the right think to do) might impact several more files, and I'd rather not do it as part this patch if that's OK with you.

>> Source/WebCore/page/PageSerializer.cpp:91
>> +        return true;
> 
> This logic doesn't match HTML5. Actual parsing logic lives in HTMLMetaCharsetParser, and I think that this check should be unified with it.

I tried to unify the code, but it turned out I had to use a Vector to unify the attributes which are of different types in the 2 locations. Please let me know if you can think of a better way to do this.

>> Source/WebCore/page/PageSerializer.cpp:119
>> +    // MarkupAccumulator does not serialie the <?xml ... line, so we add it explicitely to ensure the right encoding is specified.
> 
> Typos: serialie, explicitely.
> 
> I suspect that it's a bug that MarkupAccumulator doesn't do that. Specifically, bug 25206 tracks a consequence of that.
> 
> You should add guards (at least in the form of regression tests) to avoid getting two XML declarations - currently, this patch doesn't include any XML tests AFAICS. Obviously, it would be even better if you'd be willing to investigate and fix this in a prequel patch.

Typo fixed.
OK, I added a regression test for now.
I think I have I know how to fix the MarkupAccumulator bug, I'll try to do that in a follow-up patch.

>> Source/WebCore/page/PageSerializer.cpp:157
>> +        String metaStart("<meta charset=\"");
> 
> I've lost track of what the best way to build strings is today. StringBuilder perhaps? Or just "<meta charset=\"" + m_document->charset() + "\">"?

Switched to use makeString.

>> Source/WebCore/page/PageSerializer.cpp:237
>> +    TextEncoding& textEncoding = TextEncoding(document->charset());
> 
> TextEncoding textEncoding(document->charset());

Fixed.

>> Source/WebCore/page/PageSerializer.cpp:243
>> +    CString frameHTML = textEncoding.encode(text.characters(), text.length(), QuestionMarksForUnencodables);
> 
> Why QuestionMarksForUnencodables here and elsewhere? The page may have characters outside the original encoding repertoire inserted via DOM. It's very easy to avoid losing information by using other options here.

OK, I switched to using EntitiesForUnencodables that will encode characters outside of the encoding repertoire to XML entity.

>> Source/WebCore/page/PageSerializer.cpp:307
>> +        TextEncoding& textEncoding = TextEncoding(styleSheet->charset());
> 
> TextEncoding textEncoding(styleSheet->charset());

Done.

>> Source/WebCore/page/PageSerializer.cpp:314
>> +        m_resources->append(Resource(url, String("text/css"), SharedBuffer::create(text.data(), text.length())));
> 
> We probably need need to ensure @charset for those. Thankfully, it's much easier than HTML! It's also less important, and could easily wait for a follow-up patch.

OK, I added a FIXME and will keep that for a follow-up patch.
Comment 56 Jay Civelli 2011-04-25 12:29:40 PDT
Created attachment 90935 [details]
Addressing Alexey's comments
Comment 57 Early Warning System Bot 2011-04-25 12:49:43 PDT
Attachment 90935 [details] did not build on qt:
Build output: http://queues.webkit.org/results/8506427
Comment 58 Adam Barth 2011-04-25 14:12:38 PDT
Comment on attachment 90935 [details]
Addressing Alexey's comments

View in context: https://bugs.webkit.org/attachment.cgi?id=90935&action=review

I feel like we're getting the point where we should land this patch and iterate with incremental patches.  (Minor nits below.)

> Source/WebCore/html/parser/HTMLMetaCharsetParser.cpp:118
> +TextEncoding HTMLMetaCharsetParser::getEncodingFromMetaAttributes(const Vector<pair<AtomicString, String> >& attributes)

We usually don't start functions with the word "get".

> Source/WebCore/html/parser/HTMLMetaCharsetParser.h:52
> +    static TextEncoding getEncodingFromMetaAttributes(const Vector<pair<AtomicString, String> >&);

const Vector<pair<AtomicString, String> > is not a very semantic type...

> Source/WebCore/page/PageSerializer.cpp:57
> +

We shouldn't have this blank line here.

> Source/WebCore/page/PageSerializer.cpp:128
> +        if (head) {

Prefer early return.
Comment 59 Alexey Proskuryakov 2011-04-25 15:30:03 PDT
>> This should be more explicit about what it returns. Like maybe suggestedMIMETypeForSerialization?
>
> I am happy to change the name, but is the MIME type returned only specific to serialization? Are there 
> cases where the document MIME type would not be considered to be one of these?

Maybe suggestedMIMEType() is OK indeed. mimeType() feels like it is "real", original or authoritative in some way, which is probably more than what is offered here.

Another case where we need a MIME type is to fix bug 11049. I haven't checked if what you add matches what we need there (we'll need the test both spec and important browsers). Maybe you'll want to fix that bug as a follow-up, too?

> I tried to unify the code, but it turned out I had to use a Vector to unify the attributes which are of 
> different types in the 2 locations. Please let me know if you can think of a better way to do this.

This is not exactly what I had in mind. I've been thinking about just having a check for every single <meta> be unified with HTMLCharsetParser, not about passing all metas to it. Not sure which is better.
Comment 60 Alexey Proskuryakov 2011-04-25 15:48:09 PDT
Comment on attachment 90935 [details]
Addressing Alexey's comments

View in context: https://bugs.webkit.org/attachment.cgi?id=90935&action=review

> This is not exactly what I had in mind

Please ignore this comment - I've now looked at the code, and understood what it's doing.

I agree that this should be landed sooner rather than later, and I avoided nitpicking on actual serialization code. r- for still not having all project file changes, and I think that my code quality comments below need to be addressed, too. But we are getting close, hopefully the next iteration will be landable.

> Source/WebCore/html/HTMLFrameOwnerElement.cpp:116
> +    for (size_t i = 0; i < attributeMap->length(); ++i) {
> +        Attribute* attribute = attributeMap->attributeItem(i);
> +        if (isURLAttribute(attribute))
> +            return &(attribute->name());
> +    }

This still stands out to me as something that should be fixed in initial patch. Is there some difficulty getting rid of the loop? The only non-null result this functions can ever return is srcAttr, so why does it even exist?

>> Source/WebCore/html/parser/HTMLMetaCharsetParser.cpp:118
>> +TextEncoding HTMLMetaCharsetParser::getEncodingFromMetaAttributes(const Vector<pair<AtomicString, String> >& attributes)
> 
> We usually don't start functions with the word "get".

Yes, in WebKit coding style, "get..." is for functions that return a result via a reference (or pointer) argument.

> Source/WebCore/page/PageSerializer.cpp:74
> +    for (size_t i = 0; i < WTF_ARRAY_LENGTH(ignoredTags); ++i) {
> +        if (element->hasTagName(*ignoredTags[i]))
> +            return true;
> +    }
> +    return false;

I suggest changing this to:

    return element->hasTagName(HTMLNames::scriptTag) || element->hasTagName(HTMLNames:: noscriptTag);

It's much shorter, and the current code isn't scalable anyway (you'd need to start using a HashMap with many ignored tags).

> Source/WebCore/page/PageSerializer.cpp:107
> +    PageSerializer* m_serializer;

What guarantees that this doesn't become a dangling pointer?

> Source/WebCore/page/PageSerializer.cpp:108
> +    Document* m_document;

And this?

> Source/WebCore/page/PageSerializer.cpp:124
> +    // HTML might or not have their encoding specified, we'll add it if it's missing.

This is not sufficient. It is very common to have a <meta> that is wrong - HTTP headers override <meta>, so people leave broken ones in files all the time.

> Source/WebCore/page/PageSerializer.cpp:183
> +    RefPtr<Attribute> attribute = Attribute::create(HTMLNames::srcAttr, url.string());
> +    appendAttribute(out, element, *attribute, namespaces);

Do you really need to use the internal Attribute class directly? What about Element::setAttribute()?

> Source/WebCore/page/PageSerializer.cpp:233
> +    if (!textEncoding.isValid()) {

You can just ASSERT, document->encoding() is never invalid.
Comment 61 Jay Civelli 2011-04-26 00:37:45 PDT
Comment on attachment 90935 [details]
Addressing Alexey's comments

View in context: https://bugs.webkit.org/attachment.cgi?id=90935&action=review

>> Source/WebCore/html/HTMLFrameOwnerElement.cpp:116
>> +    }
> 
> This still stands out to me as something that should be fixed in initial patch. Is there some difficulty getting rid of the loop? The only non-null result this functions can ever return is srcAttr, so why does it even exist?

As I mentioned in my previous comment, the problem is with object tags that I believe don't use src but data. SInce I believe this is the only weird case, I simplified the method.

>>> Source/WebCore/html/parser/HTMLMetaCharsetParser.cpp:118
>>> +TextEncoding HTMLMetaCharsetParser::getEncodingFromMetaAttributes(const Vector<pair<AtomicString, String> >& attributes)
>> 
>> We usually don't start functions with the word "get".
> 
> Yes, in WebKit coding style, "get..." is for functions that return a result via a reference (or pointer) argument.

Done.

>> Source/WebCore/html/parser/HTMLMetaCharsetParser.h:52
>> +    static TextEncoding getEncodingFromMetaAttributes(const Vector<pair<AtomicString, String> >&);
> 
> const Vector<pair<AtomicString, String> > is not a very semantic type...

Now typedefed.

>> Source/WebCore/page/PageSerializer.cpp:57
>> +
> 
> We shouldn't have this blank line here.

Removed.

>> Source/WebCore/page/PageSerializer.cpp:74
>> +    return false;
> 
> I suggest changing this to:
> 
>     return element->hasTagName(HTMLNames::scriptTag) || element->hasTagName(HTMLNames:: noscriptTag);
> 
> It's much shorter, and the current code isn't scalable anyway (you'd need to start using a HashMap with many ignored tags).

Good idea, done.

>> Source/WebCore/page/PageSerializer.cpp:107
>> +    PageSerializer* m_serializer;
> 
> What guarantees that this doesn't become a dangling pointer?

It is an internal class instanciated by PageSerializer that never outlives the PageSerializer or Document.

>> Source/WebCore/page/PageSerializer.cpp:124
>> +    // HTML might or not have their encoding specified, we'll add it if it's missing.
> 
> This is not sufficient. It is very common to have a <meta> that is wrong - HTTP headers override <meta>, so people leave broken ones in files all the time.

OK, reverted to what I was doing previously, which is to always remove meta charset tags and always add our own.

>> Source/WebCore/page/PageSerializer.cpp:128
>> +        if (head) {
> 
> Prefer early return.

Done.

>> Source/WebCore/page/PageSerializer.cpp:183
>> +    appendAttribute(out, element, *attribute, namespaces);
> 
> Do you really need to use the internal Attribute class directly? What about Element::setAttribute()?

I have tried hard in this patch to do the serialization without modifying the DOM to avoid any unintended consequences.
Wouldn't setting the src attribute on the frame cause it to load the fake URL?

>> Source/WebCore/page/PageSerializer.cpp:233
>> +    if (!textEncoding.isValid()) {
> 
> You can just ASSERT, document->encoding() is never invalid.

Done.
Comment 62 Jay Civelli 2011-04-26 00:52:02 PDT
(In reply to comment #59)
> >> This should be more explicit about what it returns. Like maybe suggestedMIMETypeForSerialization?
> >
> > I am happy to change the name, but is the MIME type returned only specific to serialization? Are there 
> > cases where the document MIME type would not be considered to be one of these?
> 
> Maybe suggestedMIMEType() is OK indeed. mimeType() feels like it is "real", original or authoritative in some way, which is probably more than what is offered here.
OK, changed to sugestedMimeType.

> Another case where we need a MIME type is to fix bug 11049. I haven't checked if what you add matches what we need there (we'll need the test both spec and important browsers). Maybe you'll want to fix that bug as a follow-up, too?
I'll try to take a look as a follow up.
 
> > I tried to unify the code, but it turned out I had to use a Vector to unify the attributes which are of 
> > different types in the 2 locations. Please let me know if you can think of a better way to do this.
> 
> This is not exactly what I had in mind. I've been thinking about just having a check for every single <meta> be unified with HTMLCharsetParser, not about passing all metas to it. Not sure which is better.
Comment 63 Jay Civelli 2011-04-26 01:09:36 PDT
Created attachment 91081 [details]
Patch
Comment 64 Alexey Proskuryakov 2011-04-26 08:35:15 PDT
Comment on attachment 91081 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=91081&action=review

Looks good to me.

> Source/WebCore/html/HTMLFrameOwnerElement.cpp:112
> +    return hasTagName(HTMLNames::objectTag) ? HTMLNames::dataAttr : HTMLNames::srcAttr;

I like this much better without a loop.

This returns a correct result for <embed>, but what about <applet>? As long as this is a method on HTMLFrameOwnerElement, it needs to do the right thing in all cases, and assert that an unexpected case didn't arise. In page serialization code, it would be more acceptable to only handle the cases you support. Perhaps this can be moved to PageSerializer.cpp.

> Source/WebCore/html/parser/HTMLMetaCharsetParser.cpp:108
> +    for (HTMLToken::AttributeList::const_iterator iter = tokenAttributes.begin();
> +         iter != tokenAttributes.end(); ++iter) {

This line is sufficiently short, no need to wrap it.

> Source/WebCore/page/PageSerializer.cpp:74
> +        attributes.append(make_pair(item->name().toString(), item->value()));

Will this code be fooled by namespaced attributes?

> Source/WebCore/page/PageSerializer.cpp:271
> +        if (!textEncoding.isValid()) {
> +            LOG_ERROR("Unsupported charset '%s', defaulting to UTF-8.", styleSheet->charset().ascii().data());

Missed this last time. I don't think that a stylesheet charset can be invalid either.
Comment 65 Jay Civelli 2011-04-26 11:01:41 PDT
Comment on attachment 91081 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=91081&action=review

>> Source/WebCore/html/HTMLFrameOwnerElement.cpp:112
>> +    return hasTagName(HTMLNames::objectTag) ? HTMLNames::dataAttr : HTMLNames::srcAttr;
> 
> I like this much better without a loop.
> 
> This returns a correct result for <embed>, but what about <applet>? As long as this is a method on HTMLFrameOwnerElement, it needs to do the right thing in all cases, and assert that an unexpected case didn't arise. In page serialization code, it would be more acceptable to only handle the cases you support. Perhaps this can be moved to PageSerializer.cpp.

Moved that code back to PageSerializer (where it was originally)

>> Source/WebCore/html/parser/HTMLMetaCharsetParser.cpp:108
>> +         iter != tokenAttributes.end(); ++iter) {
> 
> This line is sufficiently short, no need to wrap it.

Done.

>> Source/WebCore/page/PageSerializer.cpp:74
>> +        attributes.append(make_pair(item->name().toString(), item->value()));
> 
> Will this code be fooled by namespaced attributes?

Added a FIXME to look into this as a follow up patch if that's OK with you.

>> Source/WebCore/page/PageSerializer.cpp:271
>> +            LOG_ERROR("Unsupported charset '%s', defaulting to UTF-8.", styleSheet->charset().ascii().data());
> 
> Missed this last time. I don't think that a stylesheet charset can be invalid either.

Fixed.
Comment 66 Jay Civelli 2011-04-26 11:03:40 PDT
Created attachment 91128 [details]
Patch
Comment 67 Early Warning System Bot 2011-04-26 11:21:32 PDT
Attachment 91128 [details] did not build on qt:
Build output: http://queues.webkit.org/results/8508715
Comment 68 Jay Civelli 2011-04-26 11:36:16 PDT
Created attachment 91131 [details]
Patch
Comment 69 Early Warning System Bot 2011-04-26 11:57:21 PDT
Attachment 91131 [details] did not build on qt:
Build output: http://queues.webkit.org/results/8509709
Comment 70 Jay Civelli 2011-04-26 12:09:04 PDT
Created attachment 91142 [details]
Patch
Comment 71 WebKit Review Bot 2011-04-26 13:31:31 PDT
Attachment 91128 [details] did not build on mac:
Build output: http://queues.webkit.org/results/8507716
Comment 72 Adam Barth 2011-04-26 17:05:26 PDT
Comment on attachment 91142 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=91142&action=review

> Source/WebCore/page/PageSerializer.cpp:74
> +        // FIXME: we should deal appropriately with the attribute if they have a namespace.

we => We

> Source/WebCore/page/PageSerializer.cpp:88
> +    // FIXME: we should support all frame owners including applets.

we => We
Comment 73 Jay Civelli 2011-04-26 17:33:02 PDT
Created attachment 91194 [details]
Patch
Comment 74 Jay Civelli 2011-04-26 17:45:10 PDT
Created attachment 91196 [details]
Patch
Comment 75 WebKit Review Bot 2011-04-26 22:27:40 PDT
Attachment 91196 [details] did not build on mac:
Build output: http://queues.webkit.org/results/8508862
Comment 76 Jay Civelli 2011-04-26 23:00:10 PDT
Created attachment 91236 [details]
Patch for landing
Comment 77 Jay Civelli 2011-04-26 23:04:55 PDT
Created attachment 91237 [details]
Patch for landing
Comment 78 WebKit Review Bot 2011-04-26 23:26:39 PDT
Attachment 91196 [details] did not build on mac:
Build output: http://queues.webkit.org/results/8513237
Comment 79 WebKit Commit Bot 2011-04-27 05:55:08 PDT
Comment on attachment 91237 [details]
Patch for landing

Rejecting attachment 91237 [details] from commit-queue.

Failed to run "['./Tools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '--bot-id=eseidel-cq-sf', 'la..." exit_code: 1

Last 500 characters of output:
n/refs/remotes/origin/master/.rev_map.268f45cc-cd09-0410-ab3c-d52691b4dbfc ...
Currently at 85045 = c90ce6df6de66c3e8aa4e4647defcc79fbeb71d8
r85046 = 56faf6b1e02973a2afa610db422557d9b295b465
r85047 = 36d92b65972e8c122cdf958588b4d52a4d988a12
r85048 = 9f2306577fb4fcb9c7c377fcc4873aaa774e2e1a
Done rebuilding .git/svn/refs/remotes/origin/master/.rev_map.268f45cc-cd09-0410-ab3c-d52691b4dbfc
First, rewinding head to replay your work on top of it...
Fast-forwarded master to refs/remotes/origin/master.

Full output: http://queues.webkit.org/results/8508948
Comment 80 Jay Civelli 2011-04-27 08:51:41 PDT
Created attachment 91293 [details]
Patch for landing
Comment 81 WebKit Commit Bot 2011-04-27 20:36:54 PDT
Comment on attachment 91293 [details]
Patch for landing

Rejecting attachment 91293 [details] from commit-queue.

Failed to run "['./Tools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '--bot-id=cr-jail-4', 'apply-..." exit_code: 2

Last 500 characters of output:
hing file Source/WebKit/chromium/tests/data/pageserializer/import_style_from_link.css
patching file Source/WebKit/chromium/tests/data/pageserializer/import_styles.css
patching file Source/WebKit/chromium/tests/data/pageserializer/link_styles.css
patching file Source/WebKit/chromium/tests/data/pageserializer/simple.xhtml
patching file Source/WebKit/chromium/tests/data/pageserializer/top_frame.html

Failed to run "[u'/mnt/git/webkit-commit-queue/Tools/Scripts/svn-apply', u'--force']" exit_code: 1

Full output: http://queues.webkit.org/results/8518206
Comment 82 Alexey Proskuryakov 2011-04-27 21:02:54 PDT
Probably failing on CRLF line endings in vcproj, and will need to be landed manually. By the way, xcodeproj changes are not in ChangeLog.
Comment 83 Adam Barth 2011-04-28 02:13:46 PDT
Created attachment 91452 [details]
patch (updated to avoid Android.mk)
Comment 84 Adam Barth 2011-04-28 15:04:09 PDT
Comment on attachment 91452 [details]
patch (updated to avoid Android.mk)

Clearing flags on attachment: 91452

Committed r85244: <http://trac.webkit.org/changeset/85244>
Comment 85 Adam Barth 2011-04-28 15:04:20 PDT
All reviewed patches have been landed.  Closing bug.
Comment 86 WebKit Review Bot 2011-04-28 15:52:47 PDT
http://trac.webkit.org/changeset/85244 might have broken WinCE Release (Build)