Bug 65803 - Initial pass at a new XML tree builder
: Initial pass at a new XML tree builder
Status: RESOLVED FIXED
: WebKit
New Bugs
: 528+ (Nightly build)
: Unspecified Unspecified
: P2 Normal
Assigned To:
:
:
:
: 64396
  Show dependency treegraph
 
Reported: 2011-08-05 18:10 PST by
Modified: 2011-08-09 14:32 PST (History)


Attachments
Patch (16.80 KB, patch)
2011-08-05 18:36 PST, Jeffrey Pfau
no flags Review Patch | Details | Formatted Diff | Diff
Patch (30.95 KB, patch)
2011-08-08 14:23 PST, Jeffrey Pfau
no flags Review Patch | Details | Formatted Diff | Diff
Patch (31.05 KB, patch)
2011-08-08 18:16 PST, Jeffrey Pfau
no flags Review Patch | Details | Formatted Diff | Diff
Patch for landing (30.98 KB, patch)
2011-08-09 14:01 PST, Jeffrey Pfau
no flags Review Patch | Details | Formatted Diff | Diff


Note

You need to log in before you can comment on or make changes to this bug.


Description From 2011-08-05 18:10:59 PST
Initial pass at a new XML tree builder
------- Comment #1 From 2011-08-05 18:36:17 PST -------
Created an attachment (id=103137) [details]
Patch
------- Comment #2 From 2011-08-05 19:25:30 PST -------
(From update of attachment 103137 [details])
View in context: https://bugs.webkit.org/attachment.cgi?id=103137&action=review

Overall, I would split most of this code out into a new object.  NewXMLDocumentParser is going to be a large complex class.  The more we can move off into smaller classes, the better off our lives will be.  For HTML, we split this code into two pieces (the TreeBuilder and the ConstructionSite), but that's probably not necessary for XML because this part of the algorithm is simpler.

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:124
>  void NewXMLDocumentParser::finish()
>  {
> +    m_finishWasCalled = true;

Should we assert that !m_finishWasCalled ?  I think finish can only be called once.

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:171
> +void NewXMLDocumentParser::pushCurrentNode(ContainerNode* n)

n => node ?

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:213
> +    ExceptionCode ec = 0;
> +
> +    document()->setXMLVersion(String::adopt(token.xmlVersion()), ec);
> +    document()->setXMLStandalone(token.xmlStandalone(), ec);

Does this ec just vanish?

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:221
> +    String extId(token.publicIdentifier().data(), token.publicIdentifier().size());
> +    String sysId(token.systemIdentifier().data(), token.systemIdentifier().size());

We should use complete words in variable names.

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:232
> +    if ((extId == "-//W3C//DTD XHTML 1.0 Transitional//EN")
> +        || (extId == "-//W3C//DTD XHTML 1.1//EN")
> +        || (extId == "-//W3C//DTD XHTML 1.0 Strict//EN")
> +        || (extId == "-//W3C//DTD XHTML 1.0 Frameset//EN")
> +        || (extId == "-//W3C//DTD XHTML Basic 1.0//EN")
> +        || (extId == "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN")
> +        || (extId == "-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN")
> +        || (extId == "-//WAPFORUM//DTD XHTML Mobile 1.0//EN")

These string constants should be AtomicStrings (as should extId) so we can do this comparison much faster.

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:237
> +void NewXMLDocumentParser::processStartTag(const AtomicXMLToken& token)

This function is too long and should be broken up into it's component pieces.

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:247
> +            if (attribute->name().prefix() == "xmlns")

This should also be an atomic string comparison.

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:274
> +    if (token.attributes()) {
> +        for (unsigned i = 0; i < token.attributes()->length(); ++i) {
> +            Attribute* attribute = token.attributes()->attributeItem(i);
> +            ExceptionCode ec = 0;
> +            if (attribute->name().prefix() == "xmlns")
> +                newElement->setAttributeNS(XMLNSNames::xmlnsNamespaceURI, "xmlns:" + attribute->name().localName(), attribute->value(), ec);
> +            else if (attribute->name() == "xmlns")
> +                newElement->setAttributeNS(XMLNSNames::xmlnsNamespaceURI, xmlnsAtom, attribute->value(), ec);
> +            else {
> +                QualifiedName qName(attribute->prefix(), attribute->localName(), namespaceForPrefix(attribute->prefix(), nullAtom));
> +                newElement->setAttribute(qName, attribute->value());
> +            }
> +        }
> +    }

For example, this is a self-contained, logical unity.  Maybe this should be a method on the token to give its attributes to the element?

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:290
> +    if (isFirstElement && document()->frame())
> +        document()->frame()->loader()->dispatchDocumentElementAvailable();
> +
> +    m_currentNodeItem = m_currentNodeStack.last();

dispatchDocumentElementAvailable is likely to do a lot of work.  Do we want to make sure this object is in a consistent state before calling out into a bunch of complex code that my re-enter us or assume that we're in a consistent state?

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:298
> +    ContainerNode* n = m_currentNode;
> +    n->finishParsingChildren();

finishParsingChildren can run arbitrary code and re-enter this function.  There's no guantee that n will still be alive after calling this.  At the very least, we should grab a RefPtr to n before calling out to arbitrary code, but we should also ensure that we're in a consistent state.  (Also, we should rename n to something like node.)

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:358
> +        HTMLEntitySearch search;
> +        const AtomicString& name = token.name();
> +        for (unsigned i = 0; i < name.length(); ++i) {
> +            search.advance(name[i]);
> +            if (!search.isEntityPrefix())
> +                stopParsing();
> +                return;
> +        }
> +        search.advance(';');
> +        ExceptionCode ec = 0;
> +        UChar32 entityValue = search.currentValue();
> +        if (entityValue <= 0xFFFF)
> +            m_leafTextNode->appendData(String(reinterpret_cast<UChar*>(&entityValue), 1), ec);
> +        else {
> +            UChar utf16Pair[2] = { U16_LEAD(entityValue), U16_TRAIL(entityValue) };
> +            m_leafTextNode->appendData(String(utf16Pair, 2), ec);
> +        }

This also looks like a self-contained block of code that should be factored out.

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:380
> +        AtomicString amp("amp");
> +        AtomicString apos("apos");
> +        AtomicString gt("gt");
> +        AtomicString lt("lt");
> +        AtomicString quot("quot");
> +        ExceptionCode ec = 0;
> +
> +        if (token.name() == amp)
> +            m_leafTextNode->appendData("&", ec);
> +        else if (token.name() == apos)
> +            m_leafTextNode->appendData("'", ec);
> +        else if (token.name() == gt)
> +            m_leafTextNode->appendData(">", ec);
> +        else if (token.name() == lt)
> +            m_leafTextNode->appendData("<", ec);
> +        else if (token.name() == quot)
> +            m_leafTextNode->appendData("\"", ec);
> +        else {
> +            stopParsing();
> +            return;
> +        }

I'd also factor this out into a static free function.  The constants should be DEFINE_STATIC_LOCAL so we only need to allocate / hash them once.

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:395
> +void NewXMLDocumentParser::enterText()
> +{
> +    if (!m_sawFirstElement) {
> +        // FIXME: ensure it's just whitespace
> +        return;
> +    }
> +
> +    if (!m_leafTextNode) {
> +        m_leafTextNode = Text::create(document(), "");
> +        m_currentNode->parserAddChild(m_leafTextNode.get());
> +    }
> +}

Can we optimize this to allocate the text node with the first text token directly?  I'm not sure if it's expensive to allocate text node with an empty string and then fill it, but it could well cost an extra malloc.  This operation happens many, many times while parsing.

> Source/WebCore/xml/parser/NewXMLDocumentParser.h:67
> +        ContainerNode* m_node;
> +        AtomicString m_namespace;

Data members should be private.  You can expose getter/setting functions if you like.
------- Comment #3 From 2011-08-08 14:23:36 PST -------
Created an attachment (id=103292) [details]
Patch
------- Comment #4 From 2011-08-08 15:52:12 PST -------
(From update of attachment 103292 [details])
View in context: https://bugs.webkit.org/attachment.cgi?id=103292&action=review

Lots of detailed comments.  Mostly r- for manual refcounting.

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:87
> +    m_finishWasCalled = true;

I thought you were going to ASSERT(!m_finishWasCalled) here?

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:118
>  void NewXMLDocumentParser::prepareToStopParsing()
>  {
> +    ScriptableDocumentParser::prepareToStopParsing();
>  }
>  
>  void NewXMLDocumentParser::stopParsing()
>  {
> +    ScriptableDocumentParser::stopParsing();
>  }

Why do these exist if we're just going to call through to the base class?

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:82
> +    default:
> +            break;

Bad indent.  Usually we skip the default case to have the compiler yell at us if we forget any cases.

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:91
> +    if (node != m_document)
> +        node->ref();

Why do we need to use manual ref counting?  Can't we use RefPtr to do this work automatically?

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:95
> +    m_currentNode = node;

How does a currentNode differ from a currentNodeItem?

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:102
> +    if (!m_currentNode)
> +        return;

When would we ever pop the current node when m_currentNode is null?

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:106
> +    if (m_currentNode != m_document)
> +        m_currentNode->deref();

Frown.

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:127
> +    m_document->setXMLVersion(String::adopt(token.xmlVersion()), ec);

We should use the copying string constructor rather than adopt here.  The copying constructor is faster for small strings because adopt incurs an extra malloc.

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:133
> +    if (ec)
> +        m_parser->stopParsing();

Do we need to check for ec after each call?  What if the second call clobbers the ec from the first?

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:148
> +    AtomicString publicIdentifier(token.publicIdentifier().data(), token.publicIdentifier().size());
> +    AtomicString systemIdentifier(token.systemIdentifier().data(), token.systemIdentifier().size());

The design for the other fields is to have the AtomicXMLToken create these atomic strings on construction.  We'll need to do the work either way, but doing it that way lets us save work if we need to look at these values in multiple places.  That's probably not much of a win here, but it's useful to be consistent.

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:160
> +        || (publicIdentifier == xhtmlMobile)
> +       )

These lines should be merged.

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:166
> +    exitText();

Why don't the processing instruction or the xml declaration cause use to exitText?  Do we need to assert that there's not text to exit?

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:214
> +

Extra blank line.

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:215
> +    enterText(String(token.characters().data(), token.characters().size()));

So, you've got an extra memcpy here.  You're memcpying data into this string, and then you'll end up memcpying into an existing text node, if there is one.  That turned out to be costly in the HTML parser, which is why we came up with the notion of an external character buffer.  You want to avoid memcpying the data into the String until as late as possible.  Doing the memcpy here is probably too early.

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:220
> +    RefPtr<CDATASection> cdata = CDATASection::create(m_document, token.data());

No exitText?

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:231
> +    m_currentNode->parserAddChild(comment.get());
> +    if (!comment->attached())
> +        comment->attach();

This stanza is repeated all over the place.  Maybe it should be factored out?

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:236
> +    // FIXME: support internal subset

Comments should be complete sentences.

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:246
> +    if (token.attributes()) {

Prefer early exit.

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:247
> +        for (unsigned i = 0; i < token.attributes()->length(); ++i) {

unsigned => size_t

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:260
> +    if (token.attributes()) {

Prefer early exit.

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:261
> +        for (unsigned i = 0; i < token.attributes()->length(); ++i) {

unsigned => size_t

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:262
> +            Attribute* attribute = token.attributes()->attributeItem(i);

It seems wasteful to iterate over the attributes twice.  Can we do the work in a single pass over the attributes?

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:265
> +                newElement->setAttributeNS(XMLNSNames::xmlnsNamespaceURI, "xmlns:" + attribute->name().localName(), attribute->value(), ec);

Why is there an ec parameter here?  Do we need to handle exceptions?  Can this run JavaScript synchronously (e.g., in a DOM mutation event)?

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:295
> +        m_leafTextNode->appendData("\"", ec);

Can these trigger synchronous execution of JavaScript?  In what cases could an ec be generated?

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:309
> +    for (unsigned i = 0; i < name.length(); ++i) {

unsigned => size_t

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:318
> +       enterText(String(reinterpret_cast<UChar*>(&entityValue), 1));

Yuck.  That's an extra malloc.  That's going to kill performance if there's a long string of entities.  This is related to enterText requiring us to copy into a String too early.

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:321
> +        UChar utf16Pair[2] = { U16_LEAD(entityValue), U16_TRAIL(entityValue) };
> +        enterText(String(utf16Pair, 2));

Same problem here.

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:337
> +        m_leafTextNode->appendData(text, ec);

Is this going to involve copying the whole string over and over again?  We might want to use something like StringBuilder to receive all these string fragments and the actually create the text node during exitText.

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:356
> +    if (!prefix.isNull()) {
> +        if (m_currentNodeItem.hasNamespaceURI(prefix))

These can be merged onto one line.  Why isNull versus isEmpty?

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:366
> +    if (parent) {

Prefer early exit.
------- Comment #5 From 2011-08-08 16:57:42 PST -------
(In reply to comment #4)
> > Source/WebCore/xml/parser/XMLTreeBuilder.cpp:91
> > +    if (node != m_document)
> > +        node->ref();
> 
> Why do we need to use manual ref counting?  Can't we use RefPtr to do this work automatically?

I don't really know why that was here. It was in the libxml2 version, and I figured it was there for a reason. Either that, or "it was in the HTML version", go for a lot of the things mentioned here.

> > Source/WebCore/xml/parser/XMLTreeBuilder.cpp:102
> > +    if (!m_currentNode)
> > +        return;
> 
> When would we ever pop the current node when m_currentNode is null?

This is one instance of "it was in the libxml2 version".

> > Source/WebCore/xml/parser/XMLTreeBuilder.cpp:148
> > +    AtomicString publicIdentifier(token.publicIdentifier().data(), token.publicIdentifier().size());
> > +    AtomicString systemIdentifier(token.systemIdentifier().data(), token.systemIdentifier().size());
> 
> The design for the other fields is to have the AtomicXMLToken create these atomic strings on construction.  We'll need to do the work either way, but doing it that way lets us save work if we need to look at these values in multiple places.  That's probably not much of a win here, but it's useful to be consistent.

The Atomic[Markup]Tokens use the DTD class from the [Markup]Token so they can A) only have the object if it's needed and B) can take the pointer from the non-atomic token. This is what the HTML tokens did when I took a look at them.

> > Source/WebCore/xml/parser/XMLTreeBuilder.cpp:215
> > +    enterText(String(token.characters().data(), token.characters().size()));
> 
> So, you've got an extra memcpy here.  You're memcpying data into this string, and then you'll end up memcpying into an existing text node, if there is one.  That turned out to be costly in the HTML parser, which is why we came up with the notion of an external character buffer.  You want to avoid memcpying the data into the String until as late as possible.  Doing the memcpy here is probably too early.

I'm not really sure how to reconcile the problems of "create the token with the string passed in" with "don't copy the string", given that I need to get the string to enterText somehow.

> > Source/WebCore/xml/parser/XMLTreeBuilder.cpp:247
> > +        for (unsigned i = 0; i < token.attributes()->length(); ++i) {
> 
> unsigned => size_t

Also in the libxml2 version

> > Source/WebCore/xml/parser/XMLTreeBuilder.cpp:262
> > +            Attribute* attribute = token.attributes()->attributeItem(i);
> 
> It seems wasteful to iterate over the attributes twice.  Can we do the work in a single pass over the attributes?

I agree, and this was originally one loop, but we need the namespace before we can create the attribute (to ensure that it's of the right subclass) and, unless we loop through twice to get the namespaces first for the attributes anyway, we can't create a NamedNodeMap with the right namespaces. Theoretically, a map of prefixes to a list of attributes with that prefix could fix this problem, but that seems like it might not be optimal.

> > Source/WebCore/xml/parser/XMLTreeBuilder.cpp:265
> > +                newElement->setAttributeNS(XMLNSNames::xmlnsNamespaceURI, "xmlns:" + attribute->name().localName(), attribute->value(), ec);
> 
> Why is there an ec parameter here?  Do we need to handle exceptions?  Can this run JavaScript synchronously (e.g., in a DOM mutation event)?

It seems to be a formality. setAttributeNS only sets ec if the URI is null and the prefix is non-empty.

> > Source/WebCore/xml/parser/XMLTreeBuilder.cpp:295
> > +        m_leafTextNode->appendData("\"", ec);
> 
> Can these trigger synchronous execution of JavaScript?  In what cases could an ec be generated?

Likewise, this ec is actually unused within appendData.

> > Source/WebCore/xml/parser/XMLTreeBuilder.cpp:337
> > +        m_leafTextNode->appendData(text, ec);
> 
> Is this going to involve copying the whole string over and over again?  We might want to use something like StringBuilder to receive all these string fragments and the actually create the text node during exitText.

This sounds reasonable. I'll try that.

> > Source/WebCore/xml/parser/XMLTreeBuilder.cpp:356
> > +    if (!prefix.isNull()) {
> > +        if (m_currentNodeItem.hasNamespaceURI(prefix))
> 
> These can be merged onto one line.  Why isNull versus isEmpty?

The prefix won't come out of the tokenizer and be empty and non-null. I can ASSERT that.
------- Comment #6 From 2011-08-08 18:16:01 PST -------
Created an attachment (id=103324) [details]
Patch
------- Comment #7 From 2011-08-08 22:49:33 PST -------
(From update of attachment 103324 [details])
View in context: https://bugs.webkit.org/attachment.cgi?id=103324&action=review

I'm slightly worried about re-entrancy problems.  That was the hardest part about getting the HTML parser correct.  I don't want to hold you up, because ironing out all those issues can be tricky, but it's something to keep in mind that we might need to return to at a later date.

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:93
> +    m_currentNodeStack.append(m_currentStackItem);

For a later patch, you might want to experiment with using a Vector and using a linked list.  We found that a linked list was faster for the HTML parser, but the HTML parser doesn't have as strong a stack discipline.

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:101
> +    m_currentNodeStack.removeLast();

Is there a reason to keep m_currentNodeStack separate?  It seems like you can always just use m_currentNodeStack.last().

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:102
> +    m_currentStackItem = m_currentNodeStack.last();

Does this actually mutate m_currentStackItem?

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:111
> +    RefPtr<ProcessingInstruction> pi = ProcessingInstruction::create(m_document, token.target(), token.data());
> +    add(pi);

I would just combine these lines.

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:171
> +    RefPtr<Element> newElement = m_document->createElement(qName, true);
> +    if (!newElement) {

How can this fail?

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:186
> +    if (token.selfClosing())
> +        newElement->finishParsingChildren();
> +    else
> +        pushCurrentNode(newElement.get());
> +
> +    m_currentStackItem = m_currentNodeStack.last();

I'm still worried about getting re-entered via finishParsingChildren().  We need to make sure we're in a consistent state before calling out to arbitrary JavaScript.

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:218
> +    RefPtr<CDATASection> cdata = CDATASection::create(m_document, token.data());
> +    add(cdata);

I'd combine these lines.

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:226
> +    RefPtr<Comment> comment = Comment::create(m_document, token.comment());
> +    add(comment);

These too.

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:314
> +       appendToText(reinterpret_cast<UChar*>(&entityValue), 1);

Bad indent.

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:343
> +    if (!m_sawFirstElement) {
> +        // FIXME: ensure it's just whitespace
> +        return;
> +    }

Is this a parse error?  I guess that's what the FIXME comment is about.

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:354
> +    RefPtr<Text> text = Text::create(m_document, m_leafText->toString());
> +    add(text);

I'd combine these lines.

> Source/WebCore/xml/parser/XMLTreeBuilder.cpp:375
> +        m_namespace = parent->m_namespace;
> +        m_scopedNamespaces = parent->m_scopedNamespaces;

Bad indent.

> Source/WebCore/xml/parser/XMLTreeBuilder.h:107
> +    NodeStackItem m_currentStackItem;
> +    Vector<NodeStackItem> m_currentNodeStack;

I'm not sure whether m_currentStackItem is really adding any value given that m_currentNodeStack knows what the current item is anyway.
------- Comment #8 From 2011-08-08 22:53:03 PST -------
Yeah, the general thing to take away from this discussion is that the libxml2 version is not very high quality.  The unsigned vs. size_t thing has to do with 64 bit machines.  I believe unsigned is 32bits and size_t is 32 bits on 32 bit machines and 64 bits on 64 bit machines.  If you're dealing with something like the difference between two memory addresses, you need something like size_t that scales with the architecture, otherwise you get integer overflow vulnerabilities.

Long story short: always use size_t when dealing with sizes or offsets in memory.
------- Comment #9 From 2011-08-08 23:17:08 PST -------
(In reply to comment #7)
> > Source/WebCore/xml/parser/XMLTreeBuilder.cpp:101
> > +    m_currentNodeStack.removeLast();
> 
> Is there a reason to keep m_currentNodeStack separate?  It seems like you can always just use m_currentNodeStack.last().

It is necessary to keep the top stack item separate from the working stack item because processStartTag can mutate the namespaces in m_currentStackItem (which are then used in the element and attribute creation). If the tag is not self-closing, these changes are pushed onto the stack. Otherwise, the changes are reset at the end of the function. It's probably better to only set m_currentStackItem inside of processStartTag and have it be either undefined outside of processStartTag or just passed to the functions that need to mutate it or use it (processNamespaces and processAttributes respectively).
> 
> > Source/WebCore/xml/parser/XMLTreeBuilder.cpp:102
> > +    m_currentStackItem = m_currentNodeStack.last();
> 
> Does this actually mutate m_currentStackItem?

See above

> 
> > Source/WebCore/xml/parser/XMLTreeBuilder.cpp:111
> > +    RefPtr<ProcessingInstruction> pi = ProcessingInstruction::create(m_document, token.target(), token.data());
> > +    add(pi);
> 
> I would just combine these lines.

Probably a good idea. I'll do that before I land it.

> > Source/WebCore/xml/parser/XMLTreeBuilder.cpp:171
> > +    RefPtr<Element> newElement = m_document->createElement(qName, true);
> > +    if (!newElement) {
> 
> How can this fail?

I'm not sure, but I can take a closer look before I land it, and remove it if it can't.

> > Source/WebCore/xml/parser/XMLTreeBuilder.cpp:343
> > +    if (!m_sawFirstElement) {
> > +        // FIXME: ensure it's just whitespace
> > +        return;
> > +    }
> 
> Is this a parse error?  I guess that's what the FIXME comment is about.

The XML spec stipulates that only comments, processing instructions and whitespace can appear outside of the root element, save for the declarations, which can only appear before everything else. However, the tokenizer spits out whitespace the same as regular text nodes, so we have to ignore whitespace tokens before the first element. Technically, this whitespace is "markup" and not "character data", but the tokenizer doesn't know that.

> > Source/WebCore/xml/parser/XMLTreeBuilder.h:107
> > +    NodeStackItem m_currentStackItem;
> > +    Vector<NodeStackItem> m_currentNodeStack;
> 
> I'm not sure whether m_currentStackItem is really adding any value given that m_currentNodeStack knows what the current item is anyway.

Again, see above comments
------- Comment #10 From 2011-08-08 23:53:41 PST -------
(In reply to comment #9)
> (In reply to comment #7)
> > > Source/WebCore/xml/parser/XMLTreeBuilder.cpp:101
> > > +    m_currentNodeStack.removeLast();
> > 
> > Is there a reason to keep m_currentNodeStack separate?  It seems like you can always just use m_currentNodeStack.last().
> 
> It is necessary to keep the top stack item separate from the working stack item because processStartTag can mutate the namespaces in m_currentStackItem (which are then used in the element and attribute creation). If the tag is not self-closing, these changes are pushed onto the stack. Otherwise, the changes are reset at the end of the function. It's probably better to only set m_currentStackItem inside of processStartTag and have it be either undefined outside of processStartTag or just passed to the functions that need to mutate it or use it (processNamespaces and processAttributes respectively).

I would pass it around as an argument rather than storing it as a member, but that's a judgment call.  Keeping it as a member makes me think that's it's always something interesting (e.g., the top of the stack).
------- Comment #11 From 2011-08-09 14:01:07 PST -------
Created an attachment (id=103392) [details]
Patch for landing
------- Comment #12 From 2011-08-09 14:32:19 PST -------
(From update of attachment 103392 [details])
Clearing flags on attachment: 103392

Committed r92709: <http://trac.webkit.org/changeset/92709>
------- Comment #13 From 2011-08-09 14:32:24 PST -------
All reviewed patches have been landed.  Closing bug.