Bug 64566 - New XML tokenizer
Summary: New XML tokenizer
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: XML (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Vicki Pfau
URL:
Keywords:
Depends on: 64764 65000
Blocks: 64396
  Show dependency treegraph
 
Reported: 2011-07-14 15:31 PDT by Vicki Pfau
Modified: 2011-07-26 19:26 PDT (History)
5 users (show)

See Also:


Attachments
Patch (52.30 KB, patch)
2011-07-25 17:04 PDT, Vicki Pfau
no flags Details | Formatted Diff | Diff
Patch (52.33 KB, patch)
2011-07-25 18:19 PDT, Vicki Pfau
no flags Details | Formatted Diff | Diff
Patch (53.52 KB, patch)
2011-07-26 10:59 PDT, Vicki Pfau
no flags Details | Formatted Diff | Diff
Patch for landing (54.23 KB, patch)
2011-07-26 16:54 PDT, Vicki Pfau
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Vicki Pfau 2011-07-14 15:31:29 PDT
For the new XML parser, we'll need a tokenizer for the source before we can parse the tokens.
Comment 1 Vicki Pfau 2011-07-25 17:04:43 PDT
Created attachment 101943 [details]
Patch
Comment 2 WebKit Review Bot 2011-07-25 17:07:01 PDT
Attachment 101943 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1

Source/WebCore/xml/parser/XMLTokenizer.cpp:227:  Non-label code inside switch statements should be indented.  [whitespace/indent] [4]
Total errors found: 1 in 11 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 3 WebKit Review Bot 2011-07-25 17:55:07 PDT
Comment on attachment 101943 [details]
Patch

Attachment 101943 [details] did not pass cr-mac-ews (chromium):
Output: http://queues.webkit.org/results/9253101
Comment 4 WebKit Review Bot 2011-07-25 18:09:23 PDT
Comment on attachment 101943 [details]
Patch

Attachment 101943 [details] did not pass mac-ews (mac):
Output: http://queues.webkit.org/results/9253107
Comment 5 WebKit Review Bot 2011-07-25 18:14:03 PDT
Comment on attachment 101943 [details]
Patch

Attachment 101943 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/9250169
Comment 6 Vicki Pfau 2011-07-25 18:19:39 PDT
Created attachment 101957 [details]
Patch
Comment 7 Adam Barth 2011-07-25 18:21:01 PDT
Comment on attachment 101943 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=101943&action=review

This looks like a great start.  Obviously, it would be nice for it to build before landing.  :)

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:62
> +        m_token.print();

DEBUG only?

This function will obviously evolve.  :)

> Source/WebCore/xml/parser/XMLTokenizer.cpp:47
> +inline bool isNumeric(UChar cc)

cc => c (I need to go through the HTML parser and fix this there too.)

> Source/WebCore/xml/parser/XMLTokenizer.cpp:55
> +inline bool isLetter(UChar cc)
> +{
> +    return (cc >= 'A' && cc <= 'Z') || (cc >= 'a' && cc <= 'z');
> +}

I think these functions already exist in ASCIITypes.h

> Source/WebCore/xml/parser/XMLTokenizer.cpp:62
> +inline bool isValidNameStart(UChar cc)

You might find that this function is hot, but optimizations can come later.

> Source/WebCore/xml/parser/XMLTokenizer.cpp:472
> +        DEFINE_STATIC_LOCAL(String, xmlString, ("xml "));

Is the space necessary after the xml?  There can't be other sorts of whitespace?

> Source/WebCore/xml/parser/XMLTokenizer.h:110
> +    bool gotError() const { return m_parseError; }

I'm not in love with gotError as a name.  Maybe errorDuringParsing ?  In any case, you should rename the member variable to match the accessor.

> Source/WebCore/xml/parser/XMLTokenizer.h:112
> +    virtual bool shouldSkipNullCharacters() const { return false; }

I thought this wasn't virtual anymore?
Comment 8 Adam Barth 2011-07-25 18:21:28 PDT
Comment on attachment 101957 [details]
Patch

(modulo comments above)
Comment 9 WebKit Review Bot 2011-07-25 18:23:49 PDT
Attachment 101957 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1

Source/WebCore/xml/parser/XMLTokenizer.cpp:227:  Non-label code inside switch statements should be indented.  [whitespace/indent] [4]
Total errors found: 1 in 11 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 10 Vicki Pfau 2011-07-25 18:37:55 PDT
(In reply to comment #7)
> (From update of attachment 101943 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=101943&action=review
> 
> This looks like a great start.  Obviously, it would be nice for it to build before landing.  :)

That's what I get for only checking to make sure it builds in Debug!


> > Source/WebCore/xml/parser/XMLTokenizer.cpp:55
> > +inline bool isLetter(UChar cc)
> > +{
> > +    return (cc >= 'A' && cc <= 'Z') || (cc >= 'a' && cc <= 'z');
> > +}
> 
> I think these functions already exist in ASCIITypes.h

I see an ASCIICTypes.h, which uses unsigned short and not UChar. Close enough?

> > Source/WebCore/xml/parser/XMLTokenizer.cpp:472
> > +        DEFINE_STATIC_LOCAL(String, xmlString, ("xml "));
> 
> Is the space necessary after the xml?  There can't be other sorts of whitespace?

Good catch. This can be any whitespace accepted by the tokenizer as whitespace. I'll fix that.

> > Source/WebCore/xml/parser/XMLTokenizer.h:112
> > +    virtual bool shouldSkipNullCharacters() const { return false; }
> 
> I thought this wasn't virtual anymore?

Whoops, this snuck in under my radar. I implemented the tokenizer before I re-optimized the merged token class, and forgot to remove this.
Comment 11 Collabora GTK+ EWS bot 2011-07-25 19:21:25 PDT
Comment on attachment 101957 [details]
Patch

Attachment 101957 [details] did not pass gtk-ews (gtk):
Output: http://queues.webkit.org/results/9249240
Comment 12 Vicki Pfau 2011-07-26 10:59:14 PDT
Created attachment 102024 [details]
Patch
Comment 13 WebKit Review Bot 2011-07-26 11:02:19 PDT
Attachment 102024 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1

Source/WebCore/xml/parser/XMLTokenizer.cpp:212:  Non-label code inside switch statements should be indented.  [whitespace/indent] [4]
Total errors found: 1 in 11 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 14 Vicki Pfau 2011-07-26 15:09:19 PDT
Comment on attachment 102024 [details]
Patch

I realize now I forgot to fix one of the problems mentioned by abarth's previous comments
Comment 15 Vicki Pfau 2011-07-26 16:54:54 PDT
Created attachment 102074 [details]
Patch for landing
Comment 16 WebKit Review Bot 2011-07-26 16:57:43 PDT
Attachment 102074 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1

Source/WebCore/xml/parser/XMLTokenizer.cpp:212:  Non-label code inside switch statements should be indented.  [whitespace/indent] [4]
Source/WebCore/xml/parser/XMLTokenizer.cpp:490:  An else should appear on the same line as the preceding }  [whitespace/newline] [4]
Total errors found: 2 in 11 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 17 WebKit Review Bot 2011-07-26 19:25:58 PDT
Comment on attachment 102074 [details]
Patch for landing

Clearing flags on attachment: 102074

Committed r91811: <http://trac.webkit.org/changeset/91811>
Comment 18 WebKit Review Bot 2011-07-26 19:26:04 PDT
All reviewed patches have been landed.  Closing bug.