Bug 64132 - Implement WebVTT Cue Text Parsing rules and DOM construction for WebVTT elements
Summary: Implement WebVTT Cue Text Parsing rules and DOM construction for WebVTT elements
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Media (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Anna Cavender
URL:
Keywords:
Depends on: 65884
Blocks: 62882
  Show dependency treegraph
 
Reported: 2011-07-07 16:26 PDT by Anna Cavender
Modified: 2011-08-29 13:47 PDT (History)
8 users (show)

See Also:


Attachments
first pass at WebVTT cue text parser/tokenizer (33.75 KB, patch)
2011-08-08 16:01 PDT, Anna Cavender
no flags Details | Formatted Diff | Diff
adding classes and title found during parsing to fragment tree (34.20 KB, patch)
2011-08-09 11:25 PDT, Anna Cavender
no flags Details | Formatted Diff | Diff
addressing Adam's comments (34.93 KB, patch)
2011-08-09 16:27 PDT, Anna Cavender
no flags Details | Formatted Diff | Diff
using DataVector typedef and fixing a few bugs (37.71 KB, patch)
2011-08-24 12:00 PDT, Anna Cavender
no flags Details | Formatted Diff | Diff
updating to ToT (37.70 KB, patch)
2011-08-29 13:04 PDT, Anna Cavender
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Anna Cavender 2011-07-07 16:26:06 PDT
WebVTT Cue Text Parsing rules : http://www.whatwg.org/specs/web-apps/current-work/multipage/the-iframe-element.html#webvtt-cue-text-parsing-rules
WebVTT Cue Text DOM construction rules : http://www.whatwg.org/specs/web-apps/current-work/multipage/the-iframe-element.html#webvtt-cue-text-dom-construction-rules

Thanks to WebKit, we get some of these for free:
WebVTT Italics Objects (<i></i>)
WebVTT Bold Objects (<b></b>)
WebVTT Underline Objects (<u></u>)
WebVTT Ruby Objects (<ruby></ruby>)
WebVTT Ruby Text Objects (<rt></rt>)

But there are some we will need to implement:
WebVTT Class Objects (<c></c>)
WebVTT Voice Objects (<v></v>)
WebVTT Leaf Node Objects
WebVTT Text Objects
WebVTT Timestamp Objects (<[time]>)
Comment 1 Anna Cavender 2011-08-08 16:01:01 PDT
Created attachment 103308 [details]
first pass at WebVTT cue text parser/tokenizer
Comment 2 Anna Cavender 2011-08-09 11:25:53 PDT
Created attachment 103376 [details]
adding classes and title found during parsing to fragment tree
Comment 3 WebKit Review Bot 2011-08-09 11:29:24 PDT
Attachment 103376 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/CMakeLists.txt', u'Source/W..." exit_code: 1

Source/WebCore/platform/track/WebVTTTokenizer.cpp:86:  Non-label code inside switch statements should be indented.  [whitespace/indent] [4]
Total errors found: 1 in 15 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 4 Adam Barth 2011-08-09 12:41:49 PDT
Comment on attachment 103376 [details]
adding classes and title found during parsing to fragment tree

View in context: https://bugs.webkit.org/attachment.cgi?id=103376&action=review

This looks great, but I'd like to see another round.  There seems to be a bunch of AtomicString/String confusion.  AtomicStrings are strings where we've hashed the contents of the string in a giant hash table to use the same storage as other instances of the same string.  They're useful when comparing strings for equality (because you can just compare their addresses) and when you want lots of common strings to share the same storage.  The trade-off is that they're slower to construct (because you need to hash their contents).

> Source/WebCore/platform/track/WebVTTParser.cpp:232
> +        switch (m_token.type()) {
> +        case WebVTTTokenTypes::Character:

I'd put this switch statement into its own function to keep the loop small and tight.  For HTML and XML, we put this function into it's own class, but that's probably not necessary here.

> Source/WebCore/platform/track/WebVTTParser.cpp:234
> +            AtomicString content(m_token.characters().data(), m_token.characters().size());

Why AtomicString?  String is probably more appropriate if you're going to just put this in a Text node.

> Source/WebCore/platform/track/WebVTTParser.cpp:240
> +        case WebVTTTokenTypes::StartTag:
> +            {

These { braces aren't in the correct style.  These two lines should be merged and the trailing } should be aligned with the "c" in case.

> Source/WebCore/platform/track/WebVTTParser.cpp:248
> +            else if (AtomicString(m_token.name().data()) == "c")

Why bother to malloc a string if it's just one character long?  You can just compare the length and the value of the character.  Also, what happened to the length?

> Source/WebCore/platform/track/WebVTTParser.cpp:271
> +             if (tokenTagName == iTag
> +                || tokenTagName == bTag
> +                || tokenTagName == uTag
> +                || tokenTagName == rubyTag
> +                || tokenTagName == rtTag
> +                || AtomicString(m_token.name().data()) == "c"
> +                || AtomicString(m_token.name().data()) == "v")

Rather than copy/pasting this code, I would create a free inline function.

> Source/WebCore/platform/track/WebVTTParser.cpp:276
> +            unsigned position = 0;

You sure this shouldn't be size_t ?

> Source/WebCore/platform/track/WebVTTParser.cpp:279
> +                m_currentNode->parserAddChild(ProcessingInstruction::create(document, "timestamp", AtomicString(m_token.characters().data(), m_token.characters().size())));

AtomicString should probably be String again.

> Source/WebCore/platform/track/WebVTTTokenizer.cpp:82
> +    UChar cc = m_inputStreamPreprocessor.nextInputCharacter();

I presume the spec says to do the same preprocessing that we do for HTML (e.g., \r\n collapsing, etc).
Comment 5 Vicki Pfau 2011-08-09 13:46:19 PDT
Comment on attachment 103376 [details]
adding classes and title found during parsing to fragment tree

View in context: https://bugs.webkit.org/attachment.cgi?id=103376&action=review

> Source/WebCore/platform/track/WebVTTToken.h:130
> +    String m_annotation;
> +    String m_classes;
> +    String m_currentBuffer;

These probably shouldn't be Strings. When you append a single character to a string, as you do in this tokenizer, it resizes (and therefore memcpys) the string. Since it does this in a loop, the loop becomes very inefficient. The reason that the other tokenizers use a Vector of UChars is because Vectors are more optimized for small resizes.
Comment 6 Anna Cavender 2011-08-09 16:26:58 PDT
Comment on attachment 103376 [details]
adding classes and title found during parsing to fragment tree

Thanks for the fast response, Adam!  And thank you for the explanation too.

View in context: https://bugs.webkit.org/attachment.cgi?id=103376&action=review

>> Source/WebCore/platform/track/WebVTTParser.cpp:232
>> +        case WebVTTTokenTypes::Character:
> 
> I'd put this switch statement into its own function to keep the loop small and tight.  For HTML and XML, we put this function into it's own class, but that's probably not necessary here.

Done.

>> Source/WebCore/platform/track/WebVTTParser.cpp:234
>> +            AtomicString content(m_token.characters().data(), m_token.characters().size());
> 
> Why AtomicString?  String is probably more appropriate if you're going to just put this in a Text node.

Done.

>> Source/WebCore/platform/track/WebVTTParser.cpp:240
>> +            {
> 
> These { braces aren't in the correct style.  These two lines should be merged and the trailing } should be aligned with the "c" in case.

Done.

>> Source/WebCore/platform/track/WebVTTParser.cpp:248
>> +            else if (AtomicString(m_token.name().data()) == "c")
> 
> Why bother to malloc a string if it's just one character long?  You can just compare the length and the value of the character.  Also, what happened to the length?

Done.  Good point.

>> Source/WebCore/platform/track/WebVTTParser.cpp:271
>> +                || AtomicString(m_token.name().data()) == "v")
> 
> Rather than copy/pasting this code, I would create a free inline function.

Done.  Nice.

>> Source/WebCore/platform/track/WebVTTParser.cpp:276
>> +            unsigned position = 0;
> 
> You sure this shouldn't be size_t ?

collectTimeStamp takes an unsigned, but I wrote that function too, so I'm open to suggestions :).

>> Source/WebCore/platform/track/WebVTTParser.cpp:279
>> +                m_currentNode->parserAddChild(ProcessingInstruction::create(document, "timestamp", AtomicString(m_token.characters().data(), m_token.characters().size())));
> 
> AtomicString should probably be String again.

Done.

>> Source/WebCore/platform/track/WebVTTToken.h:130
>> +    String m_currentBuffer;
> 
> These probably shouldn't be Strings. When you append a single character to a string, as you do in this tokenizer, it resizes (and therefore memcpys) the string. Since it does this in a loop, the loop becomes very inefficient. The reason that the other tokenizers use a Vector of UChars is because Vectors are more optimized for small resizes.

Done.  Changed to UChar Vectors.  Thanks for the explanation.

>> Source/WebCore/platform/track/WebVTTTokenizer.cpp:82
>> +    UChar cc = m_inputStreamPreprocessor.nextInputCharacter();
> 
> I presume the spec says to do the same preprocessing that we do for HTML (e.g., \r\n collapsing, etc).

Hmm.  It doesn't say that specifically, but using the existing preprocessor made sense to me.

>> Source/WebCore/platform/track/WebVTTTokenizer.cpp:86
>> +    WEBVTT_BEGIN_STATE(DataState) {
> 
> Non-label code inside switch statements should be indented.  [whitespace/indent] [4]

Funny, these are labels, but the style checker can't tell that.
Comment 7 Anna Cavender 2011-08-09 16:27:37 PDT
Created attachment 103415 [details]
addressing Adam's comments
Comment 8 Vicki Pfau 2011-08-10 18:11:22 PDT
Comment on attachment 103415 [details]
addressing Adam's comments

View in context: https://bugs.webkit.org/attachment.cgi?id=103415&action=review

> Source/WebCore/platform/track/WebVTTToken.h:97
> +    Vector<UChar, 32> classes()

This should be const Vector<UChar, 32>& classes() const, or const DataVector& classes() const (see below) to avoid copying

> Source/WebCore/platform/track/WebVTTToken.h:115
> +    Vector<UChar, 32> annotation()

This should be const Vector<UChar, 32>& annotation() const, or const DataVector& annotation() const (see below) to avoid copying

> Source/WebCore/platform/track/WebVTTToken.h:129
> +    Vector<UChar, 32> m_annotation;
> +    Vector<UChar, 32> m_classes;

If these are about the same length as a tag name or an attribute name, this is what the MarkupTokenBase::DataVector type (which is just a typedef'd Vector) is for, so it might make sense to use those here.
Comment 9 Anna Cavender 2011-08-24 11:59:51 PDT
Comment on attachment 103415 [details]
addressing Adam's comments

Thanks Jeffrey.

View in context: https://bugs.webkit.org/attachment.cgi?id=103415&action=review

>> Source/WebCore/platform/track/WebVTTToken.h:97
>> +    Vector<UChar, 32> classes()
> 
> This should be const Vector<UChar, 32>& classes() const, or const DataVector& classes() const (see below) to avoid copying

Done.

>> Source/WebCore/platform/track/WebVTTToken.h:115
>> +    Vector<UChar, 32> annotation()
> 
> This should be const Vector<UChar, 32>& annotation() const, or const DataVector& annotation() const (see below) to avoid copying

Done.

>> Source/WebCore/platform/track/WebVTTToken.h:129
>> +    Vector<UChar, 32> m_classes;
> 
> If these are about the same length as a tag name or an attribute name, this is what the MarkupTokenBase::DataVector type (which is just a typedef'd Vector) is for, so it might make sense to use those here.

Good idea, thanks.  Done.
Comment 10 Anna Cavender 2011-08-24 12:00:10 PDT
Created attachment 105034 [details]
using DataVector typedef and fixing a few bugs
Comment 11 WebKit Review Bot 2011-08-24 12:02:59 PDT
Attachment 105034 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/CMakeLists.txt', u'Source/W..." exit_code: 1

Source/WebCore/platform/track/WebVTTTokenizer.cpp:86:  Non-label code inside switch statements should be indented.  [whitespace/indent] [4]
Total errors found: 1 in 15 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 12 Anna Cavender 2011-08-29 13:04:09 PDT
Created attachment 105513 [details]
updating to ToT
Comment 13 WebKit Review Bot 2011-08-29 13:07:27 PDT
Attachment 105513 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/CMakeLists.txt', u'Source/W..." exit_code: 1

Source/WebCore/platform/track/WebVTTTokenizer.cpp:86:  Non-label code inside switch statements should be indented.  [whitespace/indent] [4]
Total errors found: 1 in 15 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 14 Adam Barth 2011-08-29 13:14:43 PDT
Comment on attachment 105513 [details]
updating to ToT

View in context: https://bugs.webkit.org/attachment.cgi?id=105513&action=review

> Source/WebCore/platform/track/WebVTTParser.cpp:355
> +    default:
> +        break;

Generally we skip the default case in WebKit so the compiler tells us if we forgot any cases.
Comment 15 WebKit Review Bot 2011-08-29 13:47:20 PDT
Comment on attachment 105513 [details]
updating to ToT

Clearing flags on attachment: 105513

Committed r94011: <http://trac.webkit.org/changeset/94011>
Comment 16 WebKit Review Bot 2011-08-29 13:47:26 PDT
All reviewed patches have been landed.  Closing bug.