Bug 51311 - SegmentedString should provide column position
Summary: SegmentedString should provide column position
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: DOM (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks: 52013
  Show dependency treegraph
 
Reported: 2010-12-19 16:02 PST by Peter Rybin
Modified: 2011-01-11 08:45 PST (History)
6 users (show)

See Also:


Attachments
Proposed Patch. (17.69 KB, patch)
2010-12-19 16:19 PST, Peter Rybin
no flags Details | Formatted Diff | Diff
Patch (18.25 KB, patch)
2010-12-22 12:27 PST, Peter Rybin
no flags Details | Formatted Diff | Diff
Patch (23.62 KB, patch)
2010-12-25 15:37 PST, Peter Rybin
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Peter Rybin 2010-12-19 16:02:58 PST
Script engine needs line/column position of <script> tag. Currently there is no way column value could be obtained.
SegmentedString should provide it.
Comment 1 Peter Rybin 2010-12-19 16:04:48 PST
This bug spun off from bug https://bugs.webkit.org/show_bug.cgi?id=45271 "HTML parser should provide script column position within HTML document to JavaScript engine"
Comment 2 Peter Rybin 2010-12-19 16:19:50 PST
Created attachment 76962 [details]
Proposed Patch.
Comment 3 Peter Rybin 2010-12-19 16:27:15 PST
Comment on attachment 76962 [details]
Proposed Patch.

 Comments/concerns:

1. SegmentedString now uses WTH::ZeroBasedNumber, but it's a dependency on JavaScriptCode/wtf/text (from WebCore/platform/text).

2. SegmentedString counts column number and also line number. We can remove this functionality from HTMLTokenizer now and stop passing lineNumber by reference to SegmentedString on each call.

3. I remember Adam was concerned about forks that appeared inside advance* methods. I'm afraid I don't see how I could do without them.

4. HTMLTreeBuilder uses m_parser new field, but it has to upcast in order to call textPosition. I don't see how to make it better.
Comment 4 Adam Barth 2010-12-21 00:47:54 PST
Comment on attachment 76962 [details]
Proposed Patch.

View in context: https://bugs.webkit.org/attachment.cgi?id=76962&action=review

> WebCore/html/parser/HTMLDocumentParser.cpp:82
> -    , m_treeBuilder(HTMLTreeBuilder::create(m_tokenizer.get(), document, reportErrors, usePreHTML5ParserQuirks(document)))
> +    , m_treeBuilder(HTMLTreeBuilder::create(this, m_tokenizer.get(), document, reportErrors, usePreHTML5ParserQuirks(document)))

Hum...  I'm not sure we want to teach the tree builder about the document parser...

> WebCore/html/parser/HTMLInputStream.h:140
> +        // Generated script part inherits current position.

Comments should be complete sentences.

> WebCore/html/parser/HTMLTreeBuilder.cpp:2783
> -    TextPosition0 position = TextPosition0(WTF::ZeroBasedNumber::fromZeroBasedInt(m_tokenizer->lineNumber()), WTF::ZeroBasedNumber::base());
> +    ScriptableDocumentParser* parserUpcast = m_parser;
> +    TextPosition0 position = parserUpcast->textPosition();

Why do we need to do this?  HTMLDocumentParser inherits from ScriptableDocumentParser...  Rather than pass in the whole HTMLDocumentParser, we should just pass in the textPosition.

> WebCore/html/parser/HTMLTreeBuilder.h:248
> +    HTMLDocumentParser* m_parser;

What's the point of storing this object in a member variable if you only ever use it in the constructor?

> WebCore/platform/text/SegmentedString.cpp:56
> +    m_lineBeginCharactersConsumedValue = other.m_lineBeginCharactersConsumedValue;

m_lineBeginCharactersConsumedValue is a mouthful.  What does this variable represent?

> WebCore/platform/text/SegmentedString.cpp:248
> +          ++lineNumber;
> +          ++m_currentLine;

Four-space indent please.

> WebCore/platform/text/SegmentedString.cpp:249
> +          m_lineBeginCharactersConsumedValue = numberOfCharactersConsumed();

I see.  This probably should be called m_numberOfCharactersConsumedPriorToCurrentLine to match m_numberOfCharactersConsumedPriorToCurrentString.

> WebCore/platform/text/SegmentedString.cpp:262
> +WTF::ZeroBasedNumber SegmentedString::currentColumn() const

So, this function is moderately slow.  Might be worth mentioning in the header.

> WebCore/platform/text/SegmentedString.cpp:272
> +void SegmentedString::setCurrentPosition(WTF::ZeroBasedNumber line, WTF::ZeroBasedNumber column, int offset)
> +{
> +    m_currentLine = line.zeroBasedInt();
> +    m_lineBeginCharactersConsumedValue = numberOfCharactersConsumedSlow() + offset - column.zeroBasedInt();
> +}

This offset business is pretty mysterious.  Can we use a more descriptive variable name?

> WebCore/platform/text/SegmentedString.h:167
> +            if (newLineFlag)
> +                m_lineBeginCharactersConsumedValue = numberOfCharactersConsumed();

Isn't the whole point of this codepath to avoid branching on newLineFlag?

> WebCore/platform/text/SegmentedString.h:193
> -            lineNumber += (*m_currentString.m_current == '\n') & m_currentString.doNotExcludeLineNumbers();
> +            int newLineFlag = (*m_currentString.m_current == '\n') & m_currentString.doNotExcludeLineNumbers();
> +            lineNumber += newLineFlag;
> +            m_currentLine += newLineFlag;
> +            if (newLineFlag)
> +                m_lineBeginCharactersConsumedValue = numberOfCharactersConsumed() + 1;

Same here.  These branches are bad for performance.  Did you benchmark this patch using the html-parser benchmark in WebCore/benchmarks ?
Comment 5 Adam Barth 2010-12-21 00:50:08 PST
> 1. SegmentedString now uses WTH::ZeroBasedNumber, but it's a dependency on JavaScriptCode/wtf/text (from WebCore/platform/text).

That's fine.  JavaScriptCore/wtf is a very low-level library.

> 2. SegmentedString counts column number and also line number. We can remove this functionality from HTMLTokenizer now and stop passing lineNumber by reference to SegmentedString on each call.

Great.  We should only track this in one place.

> 3. I remember Adam was concerned about forks that appeared inside advance* methods. I'm afraid I don't see how I could do without them.

Can you run the benchmark with and without your patch so we can see how much performance we're talking about?  Per-character branches usually are visible on the benchmark.

> 4. HTMLTreeBuilder uses m_parser new field, but it has to upcast in order to call textPosition. I don't see how to make it better.

Can't you just pass in the textPosition instead of passing in the whole object?

In general, this looks pretty good.  There are just some details to square away.
Comment 6 Peter Rybin 2010-12-21 14:57:35 PST
> > 3. I remember Adam was concerned about forks that appeared inside advance* methods. I'm afraid I don't see how I could do without them.
> 
> Can you run the benchmark with and without your patch so we can see how much performance we're talking about?  Per-character branches usually are visible on the benchmark.

I ran 3 times before and after, and in average I have:
4354 (base) and 4377 (with patch). It's definitely visible, but I'm not sure I see a really better solution.

> > 4. HTMLTreeBuilder uses m_parser new field, but it has to upcast in order to call textPosition. I don't see how to make it better.
> 
> Can't you just pass in the textPosition instead of passing in the whole object?

I'm sorry I don't quite see how I can pass it other way. Now I do a callback from tree builder as deep as to SegmentedString when I need a column value (at script tag only).

Passing it directly from parser as a parameter might be a bit messy. Here's a stack I would have to go through:
        WebCore::HTMLTreeBuilder::processScriptStartTag()
        WebCore::HTMLTreeBuilder::processStartTagForInHead()
        WebCore::HTMLTreeBuilder::processStartTag()
        WebCore::HTMLTreeBuilder::processToken()
        WebCore::HTMLTreeBuilder::constructTreeFromAtomicToken()
        WebCore::HTMLTreeBuilder::constructTreeFromToken()

Alternatively I can create a variable somewhere, pass it by reference into the tree builder and update this reference on each token in parser. It seems to be quite fragile to me.

Do I miss something?
Comment 7 Adam Barth 2010-12-22 00:40:28 PST
> > Can you run the benchmark with and without your patch so we can see how much performance we're talking about?  Per-character branches usually are visible on the benchmark.
> 
> I ran 3 times before and after, and in average I have:
> 4354 (base) and 4377 (with patch). It's definitely visible, but I'm not sure I see a really better solution.

Ok.  I think we can live with a 0.5% regression.

> > > 4. HTMLTreeBuilder uses m_parser new field, but it has to upcast in order to call textPosition. I don't see how to make it better.
> > 
> > Can't you just pass in the textPosition instead of passing in the whole object?
> 
> I'm sorry I don't quite see how I can pass it other way. Now I do a callback from tree builder as deep as to SegmentedString when I need a column value (at script tag only).
> 
> Passing it directly from parser as a parameter might be a bit messy. Here's a stack I would have to go through:
>         WebCore::HTMLTreeBuilder::processScriptStartTag()
>         WebCore::HTMLTreeBuilder::processStartTagForInHead()
>         WebCore::HTMLTreeBuilder::processStartTag()
>         WebCore::HTMLTreeBuilder::processToken()
>         WebCore::HTMLTreeBuilder::constructTreeFromAtomicToken()
>         WebCore::HTMLTreeBuilder::constructTreeFromToken()
> 
> Alternatively I can create a variable somewhere, pass it by reference into the tree builder and update this reference on each token in parser. It seems to be quite fragile to me.
> 
> Do I miss something?

Ah, I think the change to HTMLTreeBuilder::processScriptStartTag isn't in your patch.  :)
Comment 8 Peter Rybin 2010-12-22 12:23:01 PST
Comment on attachment 76962 [details]
Proposed Patch.

View in context: https://bugs.webkit.org/attachment.cgi?id=76962&action=review

>> WebCore/html/parser/HTMLDocumentParser.cpp:82
>> +    , m_treeBuilder(HTMLTreeBuilder::create(this, m_tokenizer.get(), document, reportErrors, usePreHTML5ParserQuirks(document)))
> 
> Hum...  I'm not sure we want to teach the tree builder about the document parser...

I need a parser to obtain a current position (line/column) when we run into script tag. Technically I only need something like position tracker, that would return me current position, but I didn't want to create a new interface.
I thought that since we already know about html tokenizer, html parser dependency doesn't make it much worse.

>> WebCore/html/parser/HTMLInputStream.h:140
>> +        // Generated script part inherits current position.
> 
> Comments should be complete sentences.

Done

>> WebCore/html/parser/HTMLTreeBuilder.cpp:2783
>> +    TextPosition0 position = parserUpcast->textPosition();
> 
> Why do we need to do this?  HTMLDocumentParser inherits from ScriptableDocumentParser...  Rather than pass in the whole HTMLDocumentParser, we should just pass in the textPosition.

m_parser->textPosition() does not compile, because textPosition has private access in HTMLDocumentParser (I do not know what for).
I cannot pass textPosition, because it's a variable. I need its current value when I get to <script> tag.

>> WebCore/html/parser/HTMLTreeBuilder.h:248
>> +    HTMLDocumentParser* m_parser;
> 
> What's the point of storing this object in a member variable if you only ever use it in the constructor?

That's not true. I store it in constructor and I use in processScriptStartTag.

>> WebCore/platform/text/SegmentedString.cpp:248
>> +          ++m_currentLine;
> 
> Four-space indent please.

Sorry.

>> WebCore/platform/text/SegmentedString.cpp:249
>> +          m_lineBeginCharactersConsumedValue = numberOfCharactersConsumed();
> 
> I see.  This probably should be called m_numberOfCharactersConsumedPriorToCurrentLine to match m_numberOfCharactersConsumedPriorToCurrentString.

Thanks. Done.

>> WebCore/platform/text/SegmentedString.cpp:262
>> +WTF::ZeroBasedNumber SegmentedString::currentColumn() const
> 
> So, this function is moderately slow.  Might be worth mentioning in the header.

I put a comment. I wouldn't say it's really slow though. It has no cycles, only 4-5 arithmetics and calls.

>> WebCore/platform/text/SegmentedString.cpp:272
>> +}
> 
> This offset business is pretty mysterious.  Can we use a more descriptive variable name?

I renamed them and added a comment.

>> WebCore/platform/text/SegmentedString.h:167
>> +                m_lineBeginCharactersConsumedValue = numberOfCharactersConsumed();
> 
> Isn't the whole point of this codepath to avoid branching on newLineFlag?

Yes. I cannot avoid braching, but I tried to preserve the old code style ( += hocus). If it doesn't make sense now, let me put increment inside condition.

>> WebCore/platform/text/SegmentedString.h:193
>> +                m_lineBeginCharactersConsumedValue = numberOfCharactersConsumed() + 1;
> 
> Same here.  These branches are bad for performance.  Did you benchmark this patch using the html-parser benchmark in WebCore/benchmarks ?

Yes. I posted numbers in previous reply.
Comment 9 Peter Rybin 2010-12-22 12:27:15 PST
Created attachment 77247 [details]
Patch
Comment 10 Eric Seidel (no email) 2010-12-24 01:51:41 PST
You should mention performance in your ChangeLog.  did you run WebCore/benchmarks/html-parser.html?
Comment 11 Peter Rybin 2010-12-24 10:07:11 PST
(In reply to comment #10)
> You should mention performance in your ChangeLog.  did you run WebCore/benchmarks/html-parser.html?

Yep, my numbers (mentioned above) are:
on test_shell WebCore/benchmarks/html-parser.html averaged by 3 runs: 4354 (base) and 4377 (with patch).

I'll put it in Changelog as percents in next patch.
Comment 12 Adam Barth 2010-12-24 10:20:11 PST
Comment on attachment 77247 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=77247&action=review

Thanks for sticking with this patch.  Some minor nits below.

> WebCore/html/parser/HTMLTreeBuilder.cpp:359
> -HTMLTreeBuilder::HTMLTreeBuilder(HTMLTokenizer* tokenizer, DocumentFragment* fragment, Element* contextElement, FragmentScriptingPermission scriptingPermission, bool usePreHTML5ParserQuirks)
> +HTMLTreeBuilder::HTMLTreeBuilder(HTMLDocumentParser* parser, HTMLTokenizer* tokenizer, DocumentFragment* fragment, Element* contextElement, FragmentScriptingPermission scriptingPermission, bool usePreHTML5ParserQuirks)

This part of the change still makes me sad.  If we're going to expose the whole HTMLDocumentParser, they we probably don't need to pass the tokenizer separately since you can get that from the parser.  Maybe we should pass a pointer to the input stream instead?

> WebCore/html/parser/HTMLTreeBuilder.cpp:2783
> +    ScriptableDocumentParser* parserUpcast = m_parser;
> +    TextPosition0 position = parserUpcast->textPosition();

I don't understand why wer need to do this casting.  Isn't textPosition available on HTMLDocumentParser directory?
Comment 13 Adam Barth 2010-12-24 10:21:14 PST
(In reply to comment #12)
> (From update of attachment 77247 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=77247&action=review
> 
> > WebCore/html/parser/HTMLTreeBuilder.cpp:2783
> > +    ScriptableDocumentParser* parserUpcast = m_parser;
> > +    TextPosition0 position = parserUpcast->textPosition();
> 
> I don't understand why wer need to do this casting.  Isn't textPosition available on HTMLDocumentParser directory?

You mention above that it's private on HTMLDocumentParser.  You should just make it public.
Comment 14 Peter Rybin 2010-12-25 15:32:54 PST
> > WebCore/html/parser/HTMLTreeBuilder.cpp:359
> > -HTMLTreeBuilder::HTMLTreeBuilder(HTMLTokenizer* tokenizer, DocumentFragment* fragment, Element* contextElement, FragmentScriptingPermission scriptingPermission, bool usePreHTML5ParserQuirks)
> > +HTMLTreeBuilder::HTMLTreeBuilder(HTMLDocumentParser* parser, HTMLTokenizer* tokenizer, DocumentFragment* fragment, Element* contextElement, FragmentScriptingPermission scriptingPermission, bool usePreHTML5ParserQuirks)

> This part of the change still makes me sad.  If we're going to expose the whole HTMLDocumentParser, they we probably don't need to pass the tokenizer separately since you can get that from the parser.  Maybe we should pass a pointer to the input stream instead?

Totally agree. I would prefer the first -- to get tokenizer from parser. After all it's probably just natural that tokenizer is a part of parser.

Exposing html stream seems less elegant to me.
Comment 15 Peter Rybin 2010-12-25 15:37:13 PST
Created attachment 77445 [details]
Patch
Comment 16 Adam Barth 2010-12-25 16:56:14 PST
Comment on attachment 77445 [details]
Patch

Ok.  Presumably we need one more patch to hook up the column number to the scripting engine and to add tests?
Comment 17 Peter Rybin 2010-12-25 17:27:32 PST
(In reply to comment #16)
> (From update of attachment 77445 [details])
> Ok.  Presumably we need one more patch to hook up the column number to the scripting engine and to add tests?

Actually we have already done scripting engine ( https://bugs.webkit.org/show_bug.cgi?id=45271 ), and I even checked up the Eclipse-based debugger for Chrome -- it sees correct columns already.

However I completely missed scripts inside attributes. Probably I should make a follow-up for them.

As to tests, my colleagues suggested I used their inspector-related testing framework that should give me access to column positions. For now it seems to me  an only way of getting column number from tests.
Comment 18 WebKit Commit Bot 2010-12-25 17:49:19 PST
Comment on attachment 77445 [details]
Patch

Clearing flags on attachment: 77445

Committed r74663: <http://trac.webkit.org/changeset/74663>
Comment 19 WebKit Commit Bot 2010-12-25 17:49:27 PST
All reviewed patches have been landed.  Closing bug.
Comment 20 WebKit Review Bot 2010-12-25 21:24:59 PST
http://trac.webkit.org/changeset/74663 might have broken GTK Linux 32-bit Release
The following tests are not passing:
media/controls-strict.html
Comment 21 Peter Rybin 2010-12-26 04:25:53 PST
(In reply to comment #20)
> http://trac.webkit.org/changeset/74663 might have broken GTK Linux 32-bit Release
> The following tests are not passing:
> media/controls-strict.html

I guess it's a false alert, the test didn't show up in
http://build.webkit.org/changes/24250
Comment 22 Andy Estes 2011-01-06 16:17:34 PST
The assertion added to HTMLDocumentParser::textPosition() (ASSERT(m_tokenizer->lineNumber() == line.zeroBasedInt()) is getting hit in debug builds. Visiting en.wikipedia.org is an easy way to trigger it.
Comment 23 Andy Estes 2011-01-06 16:24:50 PST
Filed https://bugs.webkit.org/show_bug.cgi?id=52033 to track the regression.
Comment 24 Peter Rybin 2011-01-11 06:29:07 PST
(In reply to comment #23)
> Filed https://bugs.webkit.org/show_bug.cgi?id=52033 to track the regression.

Sorry for not responding to it earlier -- Christmas holidays in Russia. As I understand, the bug is fixed now.
Comment 25 Adam Barth 2011-01-11 08:45:27 PST
> Sorry for not responding to it earlier -- Christmas holidays in Russia. As I understand, the bug is fixed now.

Yep.  No worries.  It gave me a chance to learn more about this code.