64132 – Implement WebVTT Cue Text Parsing rules and DOM construction for WebVTT elements

RESOLVED FIXED 64132

Implement WebVTT Cue Text Parsing rules and DOM construction for WebVTT elements

https://bugs.webkit.org/show_bug.cgi?id=64132

Summary Implement WebVTT Cue Text Parsing rules and DOM construction for WebVTT elements

Anna Cavender

Reported 2011-07-07 16:26:06 PDT

WebVTT Cue Text Parsing rules : http://www.whatwg.org/specs/web-apps/current-work/multipage/the-iframe-element.html#webvtt-cue-text-parsing-rules WebVTT Cue Text DOM construction rules : http://www.whatwg.org/specs/web-apps/current-work/multipage/the-iframe-element.html#webvtt-cue-text-dom-construction-rules Thanks to WebKit, we get some of these for free: WebVTT Italics Objects () WebVTT Bold Objects () WebVTT Underline Objects () WebVTT Ruby Objects (<ruby></ruby>) WebVTT Ruby Text Objects (<rt></rt>) But there are some we will need to implement: WebVTT Class Objects (<c></c>) WebVTT Voice Objects (<v></v>) WebVTT Leaf Node Objects WebVTT Text Objects WebVTT Timestamp Objects (<[time]>)

Attachments
first pass at WebVTT cue text parser/tokenizer (33.75 KB, patch) 2011-08-08 16:01 PDT, Anna Cavender	no flags	Details Formatted Diff Diff
adding classes and title found during parsing to fragment tree (34.20 KB, patch) 2011-08-09 11:25 PDT, Anna Cavender	no flags	Details Formatted Diff Diff
addressing Adam's comments (34.93 KB, patch) 2011-08-09 16:27 PDT, Anna Cavender	no flags	Details Formatted Diff Diff
using DataVector typedef and fixing a few bugs (37.71 KB, patch) 2011-08-24 12:00 PDT, Anna Cavender	no flags	Details Formatted Diff Diff
updating to ToT (37.70 KB, patch) 2011-08-29 13:04 PDT, Anna Cavender	no flags	Details Formatted Diff Diff
Show Obsolete (4) View All Add attachment proposed patch, testcase, etc.

Anna Cavender

Comment 1 2011-08-08 16:01:01 PDT

Created attachment 103308 [details] first pass at WebVTT cue text parser/tokenizer

Anna Cavender

Comment 2 2011-08-09 11:25:53 PDT

Created attachment 103376 [details] adding classes and title found during parsing to fragment tree

WebKit Review Bot

Comment 3 2011-08-09 11:29:24 PDT

Attachment 103376 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/CMakeLists.txt', u'Source/W..." exit_code: 1 Source/WebCore/platform/track/WebVTTTokenizer.cpp:86: Non-label code inside switch statements should be indented. [whitespace/indent] [4] Total errors found: 1 in 15 files If any of these errors are false positives, please file a bug against check-webkit-style.

Adam Barth

Comment 4 2011-08-09 12:41:49 PDT

Comment on attachment 103376 [details] adding classes and title found during parsing to fragment tree View in context: https://bugs.webkit.org/attachment.cgi?id=103376&action=review This looks great, but I'd like to see another round. There seems to be a bunch of AtomicString/String confusion. AtomicStrings are strings where we've hashed the contents of the string in a giant hash table to use the same storage as other instances of the same string. They're useful when comparing strings for equality (because you can just compare their addresses) and when you want lots of common strings to share the same storage. The trade-off is that they're slower to construct (because you need to hash their contents). > Source/WebCore/platform/track/WebVTTParser.cpp:232 > + switch (m_token.type()) { > + case WebVTTTokenTypes::Character: I'd put this switch statement into its own function to keep the loop small and tight. For HTML and XML, we put this function into it's own class, but that's probably not necessary here. > Source/WebCore/platform/track/WebVTTParser.cpp:234 > + AtomicString content(m_token.characters().data(), m_token.characters().size()); Why AtomicString? String is probably more appropriate if you're going to just put this in a Text node. > Source/WebCore/platform/track/WebVTTParser.cpp:240 > + case WebVTTTokenTypes::StartTag: > + { These { braces aren't in the correct style. These two lines should be merged and the trailing } should be aligned with the "c" in case. > Source/WebCore/platform/track/WebVTTParser.cpp:248 > + else if (AtomicString(m_token.name().data()) == "c") Why bother to malloc a string if it's just one character long? You can just compare the length and the value of the character. Also, what happened to the length? > Source/WebCore/platform/track/WebVTTParser.cpp:271 > + if (tokenTagName == iTag > + || tokenTagName == bTag > + || tokenTagName == uTag > + || tokenTagName == rubyTag > + || tokenTagName == rtTag > + || AtomicString(m_token.name().data()) == "c" > + || AtomicString(m_token.name().data()) == "v") Rather than copy/pasting this code, I would create a free inline function. > Source/WebCore/platform/track/WebVTTParser.cpp:276 > + unsigned position = 0; You sure this shouldn't be size_t ? > Source/WebCore/platform/track/WebVTTParser.cpp:279 > + m_currentNode->parserAddChild(ProcessingInstruction::create(document, "timestamp", AtomicString(m_token.characters().data(), m_token.characters().size()))); AtomicString should probably be String again. > Source/WebCore/platform/track/WebVTTTokenizer.cpp:82 > + UChar cc = m_inputStreamPreprocessor.nextInputCharacter(); I presume the spec says to do the same preprocessing that we do for HTML (e.g., \r\n collapsing, etc).

Vicki Pfau

Comment 5 2011-08-09 13:46:19 PDT

Comment on attachment 103376 [details] adding classes and title found during parsing to fragment tree View in context: https://bugs.webkit.org/attachment.cgi?id=103376&action=review > Source/WebCore/platform/track/WebVTTToken.h:130 > + String m_annotation; > + String m_classes; > + String m_currentBuffer; These probably shouldn't be Strings. When you append a single character to a string, as you do in this tokenizer, it resizes (and therefore memcpys) the string. Since it does this in a loop, the loop becomes very inefficient. The reason that the other tokenizers use a Vector of UChars is because Vectors are more optimized for small resizes.

Anna Cavender

Comment 6 2011-08-09 16:26:58 PDT

Comment on attachment 103376 [details] adding classes and title found during parsing to fragment tree Thanks for the fast response, Adam! And thank you for the explanation too. View in context: https://bugs.webkit.org/attachment.cgi?id=103376&action=review >> Source/WebCore/platform/track/WebVTTParser.cpp:232 >> + case WebVTTTokenTypes::Character: > > I'd put this switch statement into its own function to keep the loop small and tight. For HTML and XML, we put this function into it's own class, but that's probably not necessary here. Done. >> Source/WebCore/platform/track/WebVTTParser.cpp:234 >> + AtomicString content(m_token.characters().data(), m_token.characters().size()); > > Why AtomicString? String is probably more appropriate if you're going to just put this in a Text node. Done. >> Source/WebCore/platform/track/WebVTTParser.cpp:240 >> + { > > These { braces aren't in the correct style. These two lines should be merged and the trailing } should be aligned with the "c" in case. Done. >> Source/WebCore/platform/track/WebVTTParser.cpp:248 >> + else if (AtomicString(m_token.name().data()) == "c") > > Why bother to malloc a string if it's just one character long? You can just compare the length and the value of the character. Also, what happened to the length? Done. Good point. >> Source/WebCore/platform/track/WebVTTParser.cpp:271 >> + || AtomicString(m_token.name().data()) == "v") > > Rather than copy/pasting this code, I would create a free inline function. Done. Nice. >> Source/WebCore/platform/track/WebVTTParser.cpp:276 >> + unsigned position = 0; > > You sure this shouldn't be size_t ? collectTimeStamp takes an unsigned, but I wrote that function too, so I'm open to suggestions :). >> Source/WebCore/platform/track/WebVTTParser.cpp:279 >> + m_currentNode->parserAddChild(ProcessingInstruction::create(document, "timestamp", AtomicString(m_token.characters().data(), m_token.characters().size()))); > > AtomicString should probably be String again. Done. >> Source/WebCore/platform/track/WebVTTToken.h:130 >> + String m_currentBuffer; > > These probably shouldn't be Strings. When you append a single character to a string, as you do in this tokenizer, it resizes (and therefore memcpys) the string. Since it does this in a loop, the loop becomes very inefficient. The reason that the other tokenizers use a Vector of UChars is because Vectors are more optimized for small resizes. Done. Changed to UChar Vectors. Thanks for the explanation. >> Source/WebCore/platform/track/WebVTTTokenizer.cpp:82 >> + UChar cc = m_inputStreamPreprocessor.nextInputCharacter(); > > I presume the spec says to do the same preprocessing that we do for HTML (e.g., \r\n collapsing, etc). Hmm. It doesn't say that specifically, but using the existing preprocessor made sense to me. >> Source/WebCore/platform/track/WebVTTTokenizer.cpp:86 >> + WEBVTT_BEGIN_STATE(DataState) { > > Non-label code inside switch statements should be indented. [whitespace/indent] [4] Funny, these are labels, but the style checker can't tell that.

Anna Cavender

Comment 7 2011-08-09 16:27:37 PDT

Created attachment 103415 [details] addressing Adam's comments

Vicki Pfau

Comment 8 2011-08-10 18:11:22 PDT

Comment on attachment 103415 [details] addressing Adam's comments View in context: https://bugs.webkit.org/attachment.cgi?id=103415&action=review > Source/WebCore/platform/track/WebVTTToken.h:97 > + Vector<UChar, 32> classes() This should be const Vector<UChar, 32>& classes() const, or const DataVector& classes() const (see below) to avoid copying > Source/WebCore/platform/track/WebVTTToken.h:115 > + Vector<UChar, 32> annotation() This should be const Vector<UChar, 32>& annotation() const, or const DataVector& annotation() const (see below) to avoid copying > Source/WebCore/platform/track/WebVTTToken.h:129 > + Vector<UChar, 32> m_annotation; > + Vector<UChar, 32> m_classes; If these are about the same length as a tag name or an attribute name, this is what the MarkupTokenBase::DataVector type (which is just a typedef'd Vector) is for, so it might make sense to use those here.

Anna Cavender

Comment 9 2011-08-24 11:59:51 PDT

Comment on attachment 103415 [details] addressing Adam's comments Thanks Jeffrey. View in context: https://bugs.webkit.org/attachment.cgi?id=103415&action=review >> Source/WebCore/platform/track/WebVTTToken.h:97 >> + Vector<UChar, 32> classes() > > This should be const Vector<UChar, 32>& classes() const, or const DataVector& classes() const (see below) to avoid copying Done. >> Source/WebCore/platform/track/WebVTTToken.h:115 >> + Vector<UChar, 32> annotation() > > This should be const Vector<UChar, 32>& annotation() const, or const DataVector& annotation() const (see below) to avoid copying Done. >> Source/WebCore/platform/track/WebVTTToken.h:129 >> + Vector<UChar, 32> m_classes; > > If these are about the same length as a tag name or an attribute name, this is what the MarkupTokenBase::DataVector type (which is just a typedef'd Vector) is for, so it might make sense to use those here. Good idea, thanks. Done.

Anna Cavender

Comment 10 2011-08-24 12:00:10 PDT

Created attachment 105034 [details] using DataVector typedef and fixing a few bugs

WebKit Review Bot

Comment 11 2011-08-24 12:02:59 PDT

Attachment 105034 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/CMakeLists.txt', u'Source/W..." exit_code: 1 Source/WebCore/platform/track/WebVTTTokenizer.cpp:86: Non-label code inside switch statements should be indented. [whitespace/indent] [4] Total errors found: 1 in 15 files If any of these errors are false positives, please file a bug against check-webkit-style.

Anna Cavender

Comment 12 2011-08-29 13:04:09 PDT

Created attachment 105513 [details] updating to ToT

WebKit Review Bot

Comment 13 2011-08-29 13:07:27 PDT

Attachment 105513 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/CMakeLists.txt', u'Source/W..." exit_code: 1 Source/WebCore/platform/track/WebVTTTokenizer.cpp:86: Non-label code inside switch statements should be indented. [whitespace/indent] [4] Total errors found: 1 in 15 files If any of these errors are false positives, please file a bug against check-webkit-style.

Adam Barth

Comment 14 2011-08-29 13:14:43 PDT

Comment on attachment 105513 [details] updating to ToT View in context: https://bugs.webkit.org/attachment.cgi?id=105513&action=review > Source/WebCore/platform/track/WebVTTParser.cpp:355 > + default: > + break; Generally we skip the default case in WebKit so the compiler tells us if we forgot any cases.

WebKit Review Bot

Comment 15 2011-08-29 13:47:20 PDT

Comment on attachment 105513 [details] updating to ToT Clearing flags on attachment: 105513 Committed r94011: <http://trac.webkit.org/changeset/94011>

WebKit Review Bot

Comment 16 2011-08-29 13:47:26 PDT

All reviewed patches have been landed. Closing bug.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware Unspecified

OS Unspecified

Product WebKit

Component Media

Assignee

Anna Cavender

Reported

2011-07-07 16:26 PDT

Modified

2011-08-29 13:47 PDT History

CC List

8 users Show

URL

Keywords

Depends on

65884

Blocks

62882

Dependencies

tree graph