51311 – SegmentedString should provide column position

RESOLVED FIXED 51311

SegmentedString should provide column position

https://bugs.webkit.org/show_bug.cgi?id=51311

Summary SegmentedString should provide column position

Peter Rybin

Reported 2010-12-19 16:02:58 PST

Script engine needs line/column position of <script> tag. Currently there is no way column value could be obtained. SegmentedString should provide it.

Attachments
Proposed Patch. (17.69 KB, patch) 2010-12-19 16:19 PST, Peter Rybin	no flags	Details Formatted Diff Diff
Patch (18.25 KB, patch) 2010-12-22 12:27 PST, Peter Rybin	no flags	Details Formatted Diff Diff
Patch (23.62 KB, patch) 2010-12-25 15:37 PST, Peter Rybin	no flags	Details Formatted Diff Diff
Show Obsolete (2) View All Add attachment proposed patch, testcase, etc.

Peter Rybin

Comment 1 2010-12-19 16:04:48 PST

This bug spun off from bug https://bugs.webkit.org/show_bug.cgi?id=45271 "HTML parser should provide script column position within HTML document to JavaScript engine"

Peter Rybin

Comment 2 2010-12-19 16:19:50 PST

Created attachment 76962 [details] Proposed Patch.

Peter Rybin

Comment 3 2010-12-19 16:27:15 PST

Comment on attachment 76962 [details] Proposed Patch. Comments/concerns: 1. SegmentedString now uses WTH::ZeroBasedNumber, but it's a dependency on JavaScriptCode/wtf/text (from WebCore/platform/text). 2. SegmentedString counts column number and also line number. We can remove this functionality from HTMLTokenizer now and stop passing lineNumber by reference to SegmentedString on each call. 3. I remember Adam was concerned about forks that appeared inside advance* methods. I'm afraid I don't see how I could do without them. 4. HTMLTreeBuilder uses m_parser new field, but it has to upcast in order to call textPosition. I don't see how to make it better.

Adam Barth

Comment 4 2010-12-21 00:47:54 PST

Comment on attachment 76962 [details] Proposed Patch. View in context: https://bugs.webkit.org/attachment.cgi?id=76962&action=review > WebCore/html/parser/HTMLDocumentParser.cpp:82 > - , m_treeBuilder(HTMLTreeBuilder::create(m_tokenizer.get(), document, reportErrors, usePreHTML5ParserQuirks(document))) > + , m_treeBuilder(HTMLTreeBuilder::create(this, m_tokenizer.get(), document, reportErrors, usePreHTML5ParserQuirks(document))) Hum... I'm not sure we want to teach the tree builder about the document parser... > WebCore/html/parser/HTMLInputStream.h:140 > + // Generated script part inherits current position. Comments should be complete sentences. > WebCore/html/parser/HTMLTreeBuilder.cpp:2783 > - TextPosition0 position = TextPosition0(WTF::ZeroBasedNumber::fromZeroBasedInt(m_tokenizer->lineNumber()), WTF::ZeroBasedNumber::base()); > + ScriptableDocumentParser* parserUpcast = m_parser; > + TextPosition0 position = parserUpcast->textPosition(); Why do we need to do this? HTMLDocumentParser inherits from ScriptableDocumentParser... Rather than pass in the whole HTMLDocumentParser, we should just pass in the textPosition. > WebCore/html/parser/HTMLTreeBuilder.h:248 > + HTMLDocumentParser* m_parser; What's the point of storing this object in a member variable if you only ever use it in the constructor? > WebCore/platform/text/SegmentedString.cpp:56 > + m_lineBeginCharactersConsumedValue = other.m_lineBeginCharactersConsumedValue; m_lineBeginCharactersConsumedValue is a mouthful. What does this variable represent? > WebCore/platform/text/SegmentedString.cpp:248 > + ++lineNumber; > + ++m_currentLine; Four-space indent please. > WebCore/platform/text/SegmentedString.cpp:249 > + m_lineBeginCharactersConsumedValue = numberOfCharactersConsumed(); I see. This probably should be called m_numberOfCharactersConsumedPriorToCurrentLine to match m_numberOfCharactersConsumedPriorToCurrentString. > WebCore/platform/text/SegmentedString.cpp:262 > +WTF::ZeroBasedNumber SegmentedString::currentColumn() const So, this function is moderately slow. Might be worth mentioning in the header. > WebCore/platform/text/SegmentedString.cpp:272 > +void SegmentedString::setCurrentPosition(WTF::ZeroBasedNumber line, WTF::ZeroBasedNumber column, int offset) > +{ > + m_currentLine = line.zeroBasedInt(); > + m_lineBeginCharactersConsumedValue = numberOfCharactersConsumedSlow() + offset - column.zeroBasedInt(); > +} This offset business is pretty mysterious. Can we use a more descriptive variable name? > WebCore/platform/text/SegmentedString.h:167 > + if (newLineFlag) > + m_lineBeginCharactersConsumedValue = numberOfCharactersConsumed(); Isn't the whole point of this codepath to avoid branching on newLineFlag? > WebCore/platform/text/SegmentedString.h:193 > - lineNumber += (*m_currentString.m_current == '\n') & m_currentString.doNotExcludeLineNumbers(); > + int newLineFlag = (*m_currentString.m_current == '\n') & m_currentString.doNotExcludeLineNumbers(); > + lineNumber += newLineFlag; > + m_currentLine += newLineFlag; > + if (newLineFlag) > + m_lineBeginCharactersConsumedValue = numberOfCharactersConsumed() + 1; Same here. These branches are bad for performance. Did you benchmark this patch using the html-parser benchmark in WebCore/benchmarks ?

Adam Barth

Comment 5 2010-12-21 00:50:08 PST

> 1. SegmentedString now uses WTH::ZeroBasedNumber, but it's a dependency on JavaScriptCode/wtf/text (from WebCore/platform/text). That's fine. JavaScriptCore/wtf is a very low-level library. > 2. SegmentedString counts column number and also line number. We can remove this functionality from HTMLTokenizer now and stop passing lineNumber by reference to SegmentedString on each call. Great. We should only track this in one place. > 3. I remember Adam was concerned about forks that appeared inside advance* methods. I'm afraid I don't see how I could do without them. Can you run the benchmark with and without your patch so we can see how much performance we're talking about? Per-character branches usually are visible on the benchmark. > 4. HTMLTreeBuilder uses m_parser new field, but it has to upcast in order to call textPosition. I don't see how to make it better. Can't you just pass in the textPosition instead of passing in the whole object? In general, this looks pretty good. There are just some details to square away.

Peter Rybin

Comment 6 2010-12-21 14:57:35 PST

> > 3. I remember Adam was concerned about forks that appeared inside advance* methods. I'm afraid I don't see how I could do without them. > > Can you run the benchmark with and without your patch so we can see how much performance we're talking about? Per-character branches usually are visible on the benchmark. I ran 3 times before and after, and in average I have: 4354 (base) and 4377 (with patch). It's definitely visible, but I'm not sure I see a really better solution. > > 4. HTMLTreeBuilder uses m_parser new field, but it has to upcast in order to call textPosition. I don't see how to make it better. > > Can't you just pass in the textPosition instead of passing in the whole object? I'm sorry I don't quite see how I can pass it other way. Now I do a callback from tree builder as deep as to SegmentedString when I need a column value (at script tag only). Passing it directly from parser as a parameter might be a bit messy. Here's a stack I would have to go through: WebCore::HTMLTreeBuilder::processScriptStartTag() WebCore::HTMLTreeBuilder::processStartTagForInHead() WebCore::HTMLTreeBuilder::processStartTag() WebCore::HTMLTreeBuilder::processToken() WebCore::HTMLTreeBuilder::constructTreeFromAtomicToken() WebCore::HTMLTreeBuilder::constructTreeFromToken() Alternatively I can create a variable somewhere, pass it by reference into the tree builder and update this reference on each token in parser. It seems to be quite fragile to me. Do I miss something?

Adam Barth

Comment 7 2010-12-22 00:40:28 PST

> > Can you run the benchmark with and without your patch so we can see how much performance we're talking about? Per-character branches usually are visible on the benchmark. > > I ran 3 times before and after, and in average I have: > 4354 (base) and 4377 (with patch). It's definitely visible, but I'm not sure I see a really better solution. Ok. I think we can live with a 0.5% regression. > > > 4. HTMLTreeBuilder uses m_parser new field, but it has to upcast in order to call textPosition. I don't see how to make it better. > > > > Can't you just pass in the textPosition instead of passing in the whole object? > > I'm sorry I don't quite see how I can pass it other way. Now I do a callback from tree builder as deep as to SegmentedString when I need a column value (at script tag only). > > Passing it directly from parser as a parameter might be a bit messy. Here's a stack I would have to go through: > WebCore::HTMLTreeBuilder::processScriptStartTag() > WebCore::HTMLTreeBuilder::processStartTagForInHead() > WebCore::HTMLTreeBuilder::processStartTag() > WebCore::HTMLTreeBuilder::processToken() > WebCore::HTMLTreeBuilder::constructTreeFromAtomicToken() > WebCore::HTMLTreeBuilder::constructTreeFromToken() > > Alternatively I can create a variable somewhere, pass it by reference into the tree builder and update this reference on each token in parser. It seems to be quite fragile to me. > > Do I miss something? Ah, I think the change to HTMLTreeBuilder::processScriptStartTag isn't in your patch. :)

Peter Rybin

Comment 8 2010-12-22 12:23:01 PST

Comment on attachment 76962 [details] Proposed Patch. View in context: https://bugs.webkit.org/attachment.cgi?id=76962&action=review >> WebCore/html/parser/HTMLDocumentParser.cpp:82 >> + , m_treeBuilder(HTMLTreeBuilder::create(this, m_tokenizer.get(), document, reportErrors, usePreHTML5ParserQuirks(document))) > > Hum... I'm not sure we want to teach the tree builder about the document parser... I need a parser to obtain a current position (line/column) when we run into script tag. Technically I only need something like position tracker, that would return me current position, but I didn't want to create a new interface. I thought that since we already know about html tokenizer, html parser dependency doesn't make it much worse. >> WebCore/html/parser/HTMLInputStream.h:140 >> + // Generated script part inherits current position. > > Comments should be complete sentences. Done >> WebCore/html/parser/HTMLTreeBuilder.cpp:2783 >> + TextPosition0 position = parserUpcast->textPosition(); > > Why do we need to do this? HTMLDocumentParser inherits from ScriptableDocumentParser... Rather than pass in the whole HTMLDocumentParser, we should just pass in the textPosition. m_parser->textPosition() does not compile, because textPosition has private access in HTMLDocumentParser (I do not know what for). I cannot pass textPosition, because it's a variable. I need its current value when I get to <script> tag. >> WebCore/html/parser/HTMLTreeBuilder.h:248 >> + HTMLDocumentParser* m_parser; > > What's the point of storing this object in a member variable if you only ever use it in the constructor? That's not true. I store it in constructor and I use in processScriptStartTag. >> WebCore/platform/text/SegmentedString.cpp:248 >> + ++m_currentLine; > > Four-space indent please. Sorry. >> WebCore/platform/text/SegmentedString.cpp:249 >> + m_lineBeginCharactersConsumedValue = numberOfCharactersConsumed(); > > I see. This probably should be called m_numberOfCharactersConsumedPriorToCurrentLine to match m_numberOfCharactersConsumedPriorToCurrentString. Thanks. Done. >> WebCore/platform/text/SegmentedString.cpp:262 >> +WTF::ZeroBasedNumber SegmentedString::currentColumn() const > > So, this function is moderately slow. Might be worth mentioning in the header. I put a comment. I wouldn't say it's really slow though. It has no cycles, only 4-5 arithmetics and calls. >> WebCore/platform/text/SegmentedString.cpp:272 >> +} > > This offset business is pretty mysterious. Can we use a more descriptive variable name? I renamed them and added a comment. >> WebCore/platform/text/SegmentedString.h:167 >> + m_lineBeginCharactersConsumedValue = numberOfCharactersConsumed(); > > Isn't the whole point of this codepath to avoid branching on newLineFlag? Yes. I cannot avoid braching, but I tried to preserve the old code style ( += hocus). If it doesn't make sense now, let me put increment inside condition. >> WebCore/platform/text/SegmentedString.h:193 >> + m_lineBeginCharactersConsumedValue = numberOfCharactersConsumed() + 1; > > Same here. These branches are bad for performance. Did you benchmark this patch using the html-parser benchmark in WebCore/benchmarks ? Yes. I posted numbers in previous reply.

Peter Rybin

Comment 9 2010-12-22 12:27:15 PST

Created attachment 77247 [details] Patch

Eric Seidel (no email)

Comment 10 2010-12-24 01:51:41 PST

You should mention performance in your ChangeLog. did you run WebCore/benchmarks/html-parser.html?

Peter Rybin

Comment 11 2010-12-24 10:07:11 PST

(In reply to comment #10) > You should mention performance in your ChangeLog. did you run WebCore/benchmarks/html-parser.html? Yep, my numbers (mentioned above) are: on test_shell WebCore/benchmarks/html-parser.html averaged by 3 runs: 4354 (base) and 4377 (with patch). I'll put it in Changelog as percents in next patch.

Adam Barth

Comment 12 2010-12-24 10:20:11 PST

Comment on attachment 77247 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=77247&action=review Thanks for sticking with this patch. Some minor nits below. > WebCore/html/parser/HTMLTreeBuilder.cpp:359 > -HTMLTreeBuilder::HTMLTreeBuilder(HTMLTokenizer* tokenizer, DocumentFragment* fragment, Element* contextElement, FragmentScriptingPermission scriptingPermission, bool usePreHTML5ParserQuirks) > +HTMLTreeBuilder::HTMLTreeBuilder(HTMLDocumentParser* parser, HTMLTokenizer* tokenizer, DocumentFragment* fragment, Element* contextElement, FragmentScriptingPermission scriptingPermission, bool usePreHTML5ParserQuirks) This part of the change still makes me sad. If we're going to expose the whole HTMLDocumentParser, they we probably don't need to pass the tokenizer separately since you can get that from the parser. Maybe we should pass a pointer to the input stream instead? > WebCore/html/parser/HTMLTreeBuilder.cpp:2783 > + ScriptableDocumentParser* parserUpcast = m_parser; > + TextPosition0 position = parserUpcast->textPosition(); I don't understand why wer need to do this casting. Isn't textPosition available on HTMLDocumentParser directory?

Adam Barth

Comment 13 2010-12-24 10:21:14 PST

(In reply to comment #12) > (From update of attachment 77247 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=77247&action=review > > > WebCore/html/parser/HTMLTreeBuilder.cpp:2783 > > + ScriptableDocumentParser* parserUpcast = m_parser; > > + TextPosition0 position = parserUpcast->textPosition(); > > I don't understand why wer need to do this casting. Isn't textPosition available on HTMLDocumentParser directory? You mention above that it's private on HTMLDocumentParser. You should just make it public.

Peter Rybin

Comment 14 2010-12-25 15:32:54 PST

> > WebCore/html/parser/HTMLTreeBuilder.cpp:359 > > -HTMLTreeBuilder::HTMLTreeBuilder(HTMLTokenizer* tokenizer, DocumentFragment* fragment, Element* contextElement, FragmentScriptingPermission scriptingPermission, bool usePreHTML5ParserQuirks) > > +HTMLTreeBuilder::HTMLTreeBuilder(HTMLDocumentParser* parser, HTMLTokenizer* tokenizer, DocumentFragment* fragment, Element* contextElement, FragmentScriptingPermission scriptingPermission, bool usePreHTML5ParserQuirks) > This part of the change still makes me sad. If we're going to expose the whole HTMLDocumentParser, they we probably don't need to pass the tokenizer separately since you can get that from the parser. Maybe we should pass a pointer to the input stream instead? Totally agree. I would prefer the first -- to get tokenizer from parser. After all it's probably just natural that tokenizer is a part of parser. Exposing html stream seems less elegant to me.

Peter Rybin

Comment 15 2010-12-25 15:37:13 PST

Created attachment 77445 [details] Patch

Adam Barth

Comment 16 2010-12-25 16:56:14 PST

Comment on attachment 77445 [details] Patch Ok. Presumably we need one more patch to hook up the column number to the scripting engine and to add tests?

Peter Rybin

Comment 17 2010-12-25 17:27:32 PST

(In reply to comment #16) > (From update of attachment 77445 [details]) > Ok. Presumably we need one more patch to hook up the column number to the scripting engine and to add tests? Actually we have already done scripting engine ( https://bugs.webkit.org/show_bug.cgi?id=45271 ), and I even checked up the Eclipse-based debugger for Chrome -- it sees correct columns already. However I completely missed scripts inside attributes. Probably I should make a follow-up for them. As to tests, my colleagues suggested I used their inspector-related testing framework that should give me access to column positions. For now it seems to me an only way of getting column number from tests.

WebKit Commit Bot

Comment 18 2010-12-25 17:49:19 PST

Comment on attachment 77445 [details] Patch Clearing flags on attachment: 77445 Committed r74663: <http://trac.webkit.org/changeset/74663>

WebKit Commit Bot

Comment 19 2010-12-25 17:49:27 PST

All reviewed patches have been landed. Closing bug.

WebKit Review Bot

Comment 20 2010-12-25 21:24:59 PST

http://trac.webkit.org/changeset/74663 might have broken GTK Linux 32-bit Release The following tests are not passing: media/controls-strict.html

Peter Rybin

Comment 21 2010-12-26 04:25:53 PST

(In reply to comment #20) > http://trac.webkit.org/changeset/74663 might have broken GTK Linux 32-bit Release > The following tests are not passing: > media/controls-strict.html I guess it's a false alert, the test didn't show up in http://build.webkit.org/changes/24250

Andy Estes

Comment 22 2011-01-06 16:17:34 PST

The assertion added to HTMLDocumentParser::textPosition() (ASSERT(m_tokenizer->lineNumber() == line.zeroBasedInt()) is getting hit in debug builds. Visiting en.wikipedia.org is an easy way to trigger it.

Andy Estes

Comment 23 2011-01-06 16:24:50 PST

Filed https://bugs.webkit.org/show_bug.cgi?id=52033 to track the regression.

Peter Rybin

Comment 24 2011-01-11 06:29:07 PST

(In reply to comment #23) > Filed https://bugs.webkit.org/show_bug.cgi?id=52033 to track the regression. Sorry for not responding to it earlier -- Christmas holidays in Russia. As I understand, the bug is fixed now.

Adam Barth

Comment 25 2011-01-11 08:45:27 PST

> Sorry for not responding to it earlier -- Christmas holidays in Russia. As I understand, the bug is fixed now. Yep. No worries. It gave me a chance to learn more about this code.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware All

OS All

Product WebKit

Component DOM

Assignee

Nobody

Reported

2010-12-19 16:02 PST

Modified

2011-01-11 08:45 PST History

CC List

6 users Show

URL

Keywords

Depends on

Blocks

52013

Dependencies

tree graph