Bug 119423 - Improve srcset parser
Summary: Improve srcset parser
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Images (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Romain Perier
URL:
Keywords: InRadar
Depends on: 110252
Blocks:
  Show dependency treegraph
 
Reported: 2013-08-01 19:09 PDT by Dean Jackson
Modified: 2013-09-17 13:05 PDT (History)
10 users (show)

See Also:


Attachments
Patch (4.66 KB, patch)
2013-08-19 12:24 PDT, Romain Perier
no flags Details | Formatted Diff | Diff
Patch (7.82 KB, patch)
2013-08-27 10:23 PDT, Romain Perier
no flags Details | Formatted Diff | Diff
Patch (9.99 KB, patch)
2013-09-02 03:59 PDT, Romain Perier
no flags Details | Formatted Diff | Diff
Patch (10.12 KB, patch)
2013-09-02 07:51 PDT, Romain Perier
no flags Details | Formatted Diff | Diff
Patch (17.71 KB, patch)
2013-09-10 04:20 PDT, Romain Perier
no flags Details | Formatted Diff | Diff
Patch (17.94 KB, patch)
2013-09-10 13:05 PDT, Romain Perier
no flags Details | Formatted Diff | Diff
Patch (23.92 KB, patch)
2013-09-13 04:27 PDT, Romain Perier
no flags Details | Formatted Diff | Diff
Patch (30.14 KB, patch)
2013-09-17 08:37 PDT, Romain Perier
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Dean Jackson 2013-08-01 19:09:48 PDT
The srcset parser creates intermediate arrays of strings (multiple times). We should make a better parser than walks along the attribute value to extract the candidates.
Comment 1 Radar WebKit Bug Importer 2013-08-01 19:09:57 PDT
<rdar://problem/14628284>
Comment 2 Dean Jackson 2013-08-01 19:10:22 PDT
Note that the parser is duplicated in the preload scanner and the image element.
Comment 3 Dean Jackson 2013-08-12 14:14:40 PDT
(In reply to comment #2)
> Note that the parser is duplicated in the preload scanner and the image element.

It's not duplicated - ignore this. Still needs improvement though :)
Comment 4 Benjamin Poulain 2013-08-12 14:35:11 PDT
(In reply to comment #3)
> (In reply to comment #2)
> > Note that the parser is duplicated in the preload scanner and the image element.
> 
> It's not duplicated - ignore this. Still needs improvement though :)

Romain is getting started on this, he emailed me yesterday.
Comment 5 Romain Perier 2013-08-19 12:24:54 PDT
Created attachment 209108 [details]
Patch
Comment 6 Romain Perier 2013-08-19 12:25:29 PDT
See a first proposal in attachment
Comment 7 Benjamin Poulain 2013-08-20 01:22:34 PDT
Comment on attachment 209108 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=209108&action=review

I am going to bed, I'll finish the review on the next iteration :)

General comments:
-This looks very efficient allocation wise. (did you benchmark it?)
-You should have more variables with better names instead of reusing index and from.
-If you can avoid loading a character twice from memory, that is great.
-We should probably extract the whole parsing code in a static function with a explanative name (parseImagesWithScaleFromSrcsetAttribute()?). That's usually how we document things in WebKit.

> Source/WebCore/ChangeLog:14
> +        when parsing. That is more efficient on ARM devices and avoids trashing the data cache.

You can probably drop the "on ARM". I have the feeling this is gonna be faster everywhere.

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:325
> +    size_t from = 0;

"from" is a little generic.
What about imageCandidateStringStart?

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:328
>  

Extra space.

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:329
> +        if (from >= srcSetAttribute.length())

You could keep srcSetAttribute.length() in a temporary variable outside the loop. This would avoid querying for the string impl for each iterations.

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:330
> +            break;

Since this is the only break condition for the loop, why not have while(from >= srcSetAttributeLength)?

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:335
> +        // Collect a sequence of characters that are not space, it's the url

Missing period.

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:340
> +        size_t index = srcSetAttribute.find(isNotHTMLSpace, from);
> +        if (index >= separator) {
> +            from = separator + 1;
>              continue;
> -        if (!data.last().endsWith('x'))
> +        }

I would start with this, then find the separator.

The reason has to do with how you read the memory:
Let's say you have "    foo.jpg 2x". When executing "srcSetAttribute.find(',', from)", you will go over the spaces and find the separator. Then, you execute "srcSetAttribute.find(isNotHTMLSpace, from)" which will read the memory again from the beginning to find the character 'f'.

If you inverses the operations. You first skip all the HTMLSpace. Then you never have to go over them again. You take the position you found, and you look for the separator: "srcSetAttribute.find(',', index).

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:341
> +        from = index;

index is too generic. The previous "index" is "startOfTheURLString"

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:342
> +        index = srcSetAttribute.find(isHTMLSpace, index);

index -> endOfTheURLString?

You could also imagine doing this before finding the separator (to avoid loading the same characters again from memory).

In that case, "foo,bar.jpg 2x, would parse "foo,bar.jpg" as the URL. I don't know if that is okay? (hum, do we have tests for that kind of input anyway?)

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:347
> +        size_t url = from, spaceSep = index;

Each statement should get its own line. (WebKit coding style)

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:349
> +        // Collect a sequence of characters that are neither "," nor space, it's the descriptor

Missing period.
Comment 8 Romain Perier 2013-08-27 10:23:40 PDT
Created attachment 209785 [details]
Patch
Comment 9 Romain Perier 2013-08-27 10:29:25 PDT
This patch should fit your needs
Comment 10 Benjamin Poulain 2013-08-27 20:53:40 PDT
Comment on attachment 209785 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=209785&action=review

I just had a quick look but this seems good.

You need to rebaseline though. The patch does not apply on the bots.

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:328
>  

You can remove this space.

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:360
> +        imageScaleEnd = (imageScaleEnd == notFound) ? srcSetLen: imageScaleEnd;

missing space before :

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:365
> +            separator = srcSetAttribute.find(",", separator + 1);

"," -> ','

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:384
> +        image.imageURL = decodeURLEscapeSequences(StringImpl::createWithoutCopying(srcSetAttribute.characters() + imageUrlStart, imageUrlEnd - imageUrlStart));

You could do without the StirngImpl::createWithoutCopying(). It is not gonna be a big deal here to create a single string, and it will be safer not to share the memory.
Comment 11 Dean Jackson 2013-08-28 14:27:49 PDT
Great stuff Romain.

One area I'm especially interested in is data: URLs inside srcset. Ted tells me the spec says that we shouldn't universally split on commas. I wonder if we need more tests of this. Similarly, the inline URL might be a data: SVG which has inline commas as well.
Comment 12 Romain Perier 2013-08-29 05:22:40 PDT
(In reply to comment #11)
> Great stuff Romain.
> 
> One area I'm especially interested in is data: URLs inside srcset.
>
Mhhh right


> Ted tells me the spec says that we shouldn't universally split on commas.

Yes, but you can detect and skip invalid candidates (and switch to another one)

> I wonder if we need more tests of this. Similarly, the inline URL might be a >data: SVG which has inline commas as well.

It's an issue that I need to resolve, good catch. I will look at it.
Comment 13 Theresa O'Connor 2013-08-29 09:41:08 PDT
If your implementation follows the "Processing the image candidates" algorithm as specified, you won't have an issue with commas in data URLs.

http://www.whatwg.org/specs/web-apps/current-work/multipage/embedded-content-1.html#processing-the-image-candidates
http://www.w3.org/html/wg/drafts/srcset/w3c-srcset/#processing-the-image-candidates
Comment 14 Romain Perier 2013-09-02 03:59:37 PDT
Created attachment 210274 [details]
Patch
Comment 15 Romain Perier 2013-09-02 04:17:57 PDT
(In reply to comment #11)
> Great stuff Romain.
> 
> One area I'm especially interested in is data: URLs inside srcset. Ted tells me the spec says that we shouldn't universally split on commas. I wonder if we need more tests of this. Similarly, the inline URL might be a data: SVG which has inline commas as well.

Btw, as discussed with Benjamin an SVG might contain commas but also spaces... (the URL delimiter). The parser can handle several commas in the data but not space as it's the end of the url... (how do you make difference between a normal space and the space delimiter ? And yes I have fallowed the specification)
Comment 16 Romain Perier 2013-09-02 07:51:11 PDT
Created attachment 210285 [details]
Patch
Comment 17 Dean Jackson 2013-09-03 18:19:04 PDT
(In reply to comment #15)
> (In reply to comment #11)
> > Great stuff Romain.
> > 
> > One area I'm especially interested in is data: URLs inside srcset. Ted tells me the spec says that we shouldn't universally split on commas. I wonder if we need more tests of this. Similarly, the inline URL might be a data: SVG which has inline commas as well.
> 
> Btw, as discussed with Benjamin an SVG might contain commas but also spaces... (the URL delimiter). The parser can handle several commas in the data but not space as it's the end of the url... (how do you make difference between a normal space and the space delimiter ? And yes I have fallowed the specification)

I guess there are limits to what data: content we can put in a srcset attribute without having to escape things. I don' t think we should worry too much about it.
Comment 18 Dean Jackson 2013-09-03 18:28:03 PDT
Comment on attachment 210285 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=210285&action=review

I think this looks almost ready - just some minor comments.

However, the fact that this is custom parsing code makes me nervous. How confident are we that our tests cover all ranges of correct and incorrect input? It would be nice if we had some stress tests for really bad input (e.g. malformed data:, encoded commas, etc).

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:322
> +static void parseImagesWithScaleFromSrcsetAttribute(const String& srcSetAttribute, ImageCandidates& imageCandidates)

I guess for consistency "FromSrcsetAttribute" should be "FromSrcSetAttribute"

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:329
> +        // Collect a sequence of characters that are not space, it's the url.

I don't think we need this comment. Or it should be "Find the next non-space character in the attribute."

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:350
> +        // Collect a sequence of characters that are neither ',' nor space, it's the descriptor.
> +        size_t imageScaleStart = srcSetAttribute.find(isNotHTMLSpace, imageUrlEnd + 1);

The comment doesn't match the code. isNotHTMLSpace doesn't check for ','.
Comment 19 Romain Perier 2013-09-04 01:09:26 PDT
(In reply to comment #18)
> (From update of attachment 210285 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=210285&action=review
> 
> I think this looks almost ready - just some minor comments.
> 
> However, the fact that this is custom parsing code makes me nervous. How confident are we that our tests cover all ranges of correct and incorrect input? 

See fast/hidpi/image-srcset-invalid-inputs.html which contains a various range of invalid cases. However It does not include data: , perhaps I should write a separated unit test for data: (including valid and invalid cases). For valid inputs, we already have several unit tests. I could improve them by adding some weird valid inputs (containing a lot of spaces for example). 

What do you think ?
My implementation follows the specification , the only difference is that it handles better invalid inputs not covered by the
high level algorithm in the spec, nothing more, really :)

> It would be nice if we had some stress tests for really bad input (e.g. malformed data:, encoded commas, etc).
> 
> > Source/WebCore/html/parser/HTMLParserIdioms.cpp:322
> > +static void parseImagesWithScaleFromSrcsetAttribute(const String& srcSetAttribute, ImageCandidates& imageCandidates)
> 
> I guess for consistency "FromSrcsetAttribute" should be "FromSrcSetAttribute"
>

sure

> > Source/WebCore/html/parser/HTMLParserIdioms.cpp:329
> > +        // Collect a sequence of characters that are not space, it's the url.
> 
> I don't think we need this comment. Or it should be "Find the next non-space character in the attribute."
> 

This comment is for the block of code which extracts the url, not the for the line

> > Source/WebCore/html/parser/HTMLParserIdioms.cpp:350
> > +        // Collect a sequence of characters that are neither ',' nor space, it's the descriptor.
> > +        size_t imageScaleStart = srcSetAttribute.find(isNotHTMLSpace, imageUrlEnd + 1);
> 
> The comment doesn't match the code. isNotHTMLSpace doesn't check for ','.

Same thing as the previous comment, it is for the block of code which extracts the descriptor, not the for the line.

However if these comments are really useless, I can remove them, sure.
Comment 20 Romain Perier 2013-09-04 08:47:02 PDT
For really bad inputs I have something like the following test in mind:

<html>
<head>
<script src="resources/srcset-helper.js"></script>
</head>

<body id="body">
    <div>This test passes if this img tag below is empty and displays nothing. It ensures that the srcset attribute supports invalid inputs</div>
    <img height="100" width="100" src="" srcset="1x,,  ,   x    ,2x  , foo.jpg, 3x, bar.jpg 4x 100h, foo.jpg 5, bar.jpg dx,foo.jpg,bar.jpg, data:image/png;base64,ThisIsAMalFormedData%/,withForbidden,Characters:!?~&#') 6x, data: 7x, data:;charset= 8x, data:;charset=us-ASCII 9x, data:,BadData 10x, data:foo/bar,foobar 11x, data:foo/bar, 12x"></img>
</body>
</html>


The test passes. I can improve tests for valid inputs by adding some weird cases for data:URL , but nothing more (I have no other ideas)
Comment 21 Benjamin Poulain 2013-09-04 15:15:50 PDT
(In reply to comment #19)
> (In reply to comment #18)
> > (From update of attachment 210285 [details] [details])
> > View in context: https://bugs.webkit.org/attachment.cgi?id=210285&action=review
> > 
> > I think this looks almost ready - just some minor comments.
> > 
> > However, the fact that this is custom parsing code makes me nervous. How confident are we that our tests cover all ranges of correct and incorrect input? 
> 
> See fast/hidpi/image-srcset-invalid-inputs.html which contains a various range of invalid cases. However It does not include data: , perhaps I should write a separated unit test for data: (including valid and invalid cases). For valid inputs, we already have several unit tests. I could improve them by adding some weird valid inputs (containing a lot of spaces for example). 
> 
> What do you think ?

New separate test for data is a good idea.
Probably at least 2 tests, one for various valid input, and one for bad input.
Comment 22 Benjamin Poulain 2013-09-04 15:44:13 PDT
Comment on attachment 210285 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=210285&action=review

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:325
> +    unsigned srcSetLen = srcSetAttribute.length();

Len -> Length

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:328
>  

Blank line here.

>>> Source/WebCore/html/parser/HTMLParserIdioms.cpp:329
>>> +        // Collect a sequence of characters that are not space, it's the url.
>> 
>> I don't think we need this comment. Or it should be "Find the next non-space character in the attribute."
> 
> This comment is for the block of code which extracts the url, not the for the line

I think it is great to use spec text as comments. You should have "5." in front to make it more obvious.

You could also have a comment linking the spec before the algorithm.

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:330
> +        size_t imageUrlStart = srcSetAttribute.find(isNotHTMLSpace, imageCandidateStart);

"4. Splitting loop: Skip whitespace."?

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:333
> +            imageCandidateStart = srcSetLen;
>              continue;

Shouldn't you just break here and be done with it?
Here all you will have is imageCandidateStart = srcSetLen, go back to the branch (imageCandidateStart < srcSetLen) and exit the loop.

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:338
> +        if (srcSetAttribute[imageUrlStart] == ',') {
> +            imageCandidateStart = imageUrlStart + 1;
>              continue;
> +        }

Which cases does this cover?
The spec says "Collect a sequence of characters that are not space characters, and let that be url.", so it is not clear why the initial comma would not be part of the URL.

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:342
> +            imageCandidateStart = srcSetLen;
> +            continue;

"break"?

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:347
> +        if (srcSetAttribute[imageUrlEnd - 1] == ',') {
> +            imageCandidateStart = imageUrlEnd;
> +            continue;
> +        }

An empty descriptor should still get the URL in the raw candidates.
The density would default to 1.

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:349
> +        // Collect a sequence of characters that are neither ',' nor space, it's the descriptor.

7.


Gotta go, I'll finish the review on the next iteration.
Comment 23 Romain Perier 2013-09-05 04:15:48 PDT
Rapid reply

(In reply to comment #22)
> (From update of attachment 210285 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=210285&action=review
> 
> > Source/WebCore/html/parser/HTMLParserIdioms.cpp:325
> > +    unsigned srcSetLen = srcSetAttribute.length();
> 
> Len -> Length
> 

OK


> 
> >>> Source/WebCore/html/parser/HTMLParserIdioms.cpp:329
> >>> +        // Collect a sequence of characters that are not space, it's the url.
> >> 
> >> I don't think we need this comment. Or it should be "Find the next non-space character in the attribute."
> > 
> > This comment is for the block of code which extracts the url, not the for the line
> 
> I think it is great to use spec text as comments. You should have "5." in front to make it more obvious.
> 
> You could also have a comment linking the spec before the algorithm.

Yep, I agree. Good idea.

> 
> > Source/WebCore/html/parser/HTMLParserIdioms.cpp:330
> > +        size_t imageUrlStart = srcSetAttribute.find(isNotHTMLSpace, imageCandidateStart);
> 
> "4. Splitting loop: Skip whitespace."?

+1

> 
> > Source/WebCore/html/parser/HTMLParserIdioms.cpp:333
> > +            imageCandidateStart = srcSetLen;
> >              continue;
> 

Right, good catch !

> Shouldn't you just break here and be done with it?
> Here all you will have is imageCandidateStart = srcSetLen, go back to the branch (imageCandidateStart < srcSetLen) and exit the loop.
> 
> > Source/WebCore/html/parser/HTMLParserIdioms.cpp:338
> > +        if (srcSetAttribute[imageUrlStart] == ',') {
> > +            imageCandidateStart = imageUrlStart + 1;
> >              continue;
> > +        }
> 
> Which cases does this cover?
> The spec says "Collect a sequence of characters that are not space characters, and let that be url.", so it is not clear why the initial comma would not be part of the URL.

It covers this case srcset="foo.jpg 1x,, bar.jpg 2x",  or this one srcset="foo.jpg,          , bar.jpg 2x"
If the current candidate does not contain letters (!= space) , the first match will be a comma... if we don't do this step, the next url might be "," or even weird ",bar.jpg" (if the next element does not begin with a space).

> 
> > Source/WebCore/html/parser/HTMLParserIdioms.cpp:342
> > +            imageCandidateStart = srcSetLen;
> > +            continue;
> 
> "break"?

+1

> 
> > Source/WebCore/html/parser/HTMLParserIdioms.cpp:347
> > +        if (srcSetAttribute[imageUrlEnd - 1] == ',') {
> > +            imageCandidateStart = imageUrlEnd;
> > +            continue;
> > +        }
> 
> An empty descriptor should still get the URL in the raw candidates.
> The density would default to 1.

Have a look at the current parsing algorithm in the repository. It skips the current candidate if this one does not include a descriptor. I just wanted to avoid a regression.
What do we do for this case ? 1. We follow the spec,  2. We suppose that implicit descriptor is a bad idea because it might be misunderstood by the web developer, it would be better to only accept explicit descriptors.
As srcset might be a cross-browser feature, this is probably better to choose 1, imho.  Benjamin ? Dean ? opinion ?
Comment 24 Benjamin Poulain 2013-09-05 10:27:50 PDT
(In reply to comment #23)
> > > Source/WebCore/html/parser/HTMLParserIdioms.cpp:347
> > > +        if (srcSetAttribute[imageUrlEnd - 1] == ',') {
> > > +            imageCandidateStart = imageUrlEnd;
> > > +            continue;
> > > +        }
> > 
> > An empty descriptor should still get the URL in the raw candidates.
> > The density would default to 1.
> 
> Have a look at the current parsing algorithm in the repository. It skips the current candidate if this one does not include a descriptor. I just wanted to avoid a regression.
> What do we do for this case ? 1. We follow the spec,  2. We suppose that implicit descriptor is a bad idea because it might be misunderstood by the web developer, it would be better to only accept explicit descriptors.
> As srcset might be a cross-browser feature, this is probably better to choose 1, imho.  Benjamin ? Dean ? opinion ?

It is better to follow the spec.
This is still an experimental feature. It is not a problem changing the behavior and updating the tests accordingly.

Supporting candidates without descriptor seems like a sensible choice. I could see people ignoring "src" entirely and writing this kind of declaration:
    <img srcset="regular.png, highres.png 2x">
Comment 25 Romain Perier 2013-09-10 04:20:49 PDT
Created attachment 211189 [details]
Patch
Comment 26 Romain Perier 2013-09-10 04:23:47 PDT
See the new patch in attachment:

1. I have improved tests
2. I have added new ones (for data uri schemes)
3. I have fixed what was wrong
Comment 27 Romain Perier 2013-09-10 13:05:41 PDT
Created attachment 211229 [details]
Patch
Comment 28 Romain Perier 2013-09-10 13:07:18 PDT
Rebased with latest commits
Comment 29 Benjamin Poulain 2013-09-11 20:52:22 PDT
Comment on attachment 211229 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=211229&action=review

This looks great. Some nitpick bellow.

Let's iterate on this quickly so you can move on.

> Source/WebCore/ChangeLog:7
> +

You also updated the behaviors to follow the spec regarding data URLs. You should mention the behavior change in the description.

> Source/WebCore/ChangeLog:8
> +        No new tests, covered by existing tests.

This is no longer true.

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:323
> +// http://www.w3.org/html/wg/drafts/srcset/w3c-srcset/Overview.html#processing-the-image-candidates

Draft change by definition.
IMHO, it is good to link the published version with a date: http://www.w3.org/TR/2013/WD-html-srcset-20130228/#processing-the-image-candidates

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:324
> +// Note: Some items in comments below make reference to the spec (prefixed by numbers).

You don't need this line, that is common enough in WebKit.

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:334
> +        // 5. Collect a sequence of characters that are not space characters, and let that be url.

This should be moved to #344 now.

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:336
> +        size_t imageUrlStart = srcSetAttribute.find(isNotHTMLSpace, imageCandidateStart); // 4. Splitting loop: Skip whitespace.

Move the comment above the block.

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:343
> +        // If The current candidate is either totally empty or only contains space, skipping.
> +        if (srcSetAttribute[imageUrlStart] == ',') {
> +            imageCandidateStart = imageUrlStart + 1;
>              continue;
> +        }

I kind of agree with your reasoning here. It is probably worth mailing the standard group to have this clarified in the spec.

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:344
> +        size_t imageUrlEnd = srcSetAttribute.find(isHTMLSpace, imageUrlStart + 1);

The comment 5. would be good here.

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:346
> +        if (imageUrlEnd == notFound)
> +            break;

Not sure I agree with this.
if you have srcset="   foobar.jpg", you would fail here.

Is there a test covering such case?

(*)

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:349
> +            imageUrlEnd--;

--imageUrlEnd; usually in WebKit.

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:354
> +
> +            // 7. Collect a sequence of characters that are not "," (U+002C) characters, and let that be descriptors.
> +

You can remove the blank lines above and above and bellow the comment.

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:357
> +            size_t imageScaleStart = srcSetAttribute.find(isNotHTMLSpace, imageUrlEnd + 1);
> +            if (imageScaleStart == notFound)
> +                break;

Same as for (*), I am not sure about this.
You could have srcset="   foobar.jpg   "
Again, do we have tests for this?

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:365
> +                // Be sure that there are no other descriptors.

Be sure that -> Make sure there are no...

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:367
> +                while ((separator < srcSetLength - 1) && isHTMLSpace(srcSetAttribute[separator]))
> +                    separator++;

I think this may be abusing the separator variable. I would use an other variable and assign it to separator once you found a comma.

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:368
> +                if ((separator < srcSetLength - 1) && srcSetAttribute[separator] != ',') {

I would add a comment explaining this failure case.

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:371
> +                        separator = srcSetLength;

break?

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:381
> +                imgScaleFactor = charactersToFloat(srcSetAttribute.characters() + imageScaleStart, imageScaleEnd - imageScaleStart - 1, &validScaleFactor);

The -1 is not obvious here.
To explain it, you can just add more temporary variables. E.g.:
size_t scaleFactorStringLength = imageScaleEnd - imageScaleStart;
size_t scaleFactorStringLengthWithoutUnit = scaleFactorStringLength - 1;
imgScaleFactor = charactersToFloat(srcSetAttribute.characters() + imageScaleStart, scaleFactorStringLengthWithoutUnit, &validScaleFactor);

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:390
> -        image.imageURL = decodeURLEscapeSequences(data[0]);
> +        image.imageURL = decodeURLEscapeSequences(StringImpl::createWithoutCopying(srcSetAttribute.characters() + imageUrlStart, imageUrlEnd - imageUrlStart));

I feel like you mentioned why createWithoutCopying is safe here but I forgot. Could you explain again?
Comment 30 Romain Perier 2013-09-13 04:27:49 PDT
Created attachment 211535 [details]
Patch
Comment 31 Romain Perier 2013-09-13 05:19:15 PDT
Everything should be fixed.
 
> I feel like you mentioned why createWithoutCopying is safe here but I forgot. Could you explain again?

That's simple:
- StringImpl:createWithoutCopying() creates a StringImpl which shares the same buffer as srcsetAttribute
- String is created on the fly from the StringImpl object, it will delegate string features to it and share the same buffer, as well
- decodeURLEscapeSequences returns a copy of the input String in all cases.

So the resulting image.imageURL is a copy and does not share memory :)
Comment 32 Yoav Weiss 2013-09-13 13:51:00 PDT
There's no need to decodeURLEscapeSequences() the output URL. Data URIs work fine with the new spec-compliant parser (which splits the attribute value according to white spaces, rather than commas)
Comment 33 Romain Perier 2013-09-13 13:53:49 PDT
The question being do we need to support escaped data:URLs or not , I would say yes
Comment 34 Yoav Weiss 2013-09-13 14:13:05 PDT
> The question being do we need to support escaped data:URLs or not , I would say yes

Escaped data URIs don't seem to be supported on src on neither Blink nor Gecko.
Comment 35 Benjamin Poulain 2013-09-16 21:07:31 PDT
Comment on attachment 211535 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=211535&action=review

This looks great. Some minor comments bellow.

Can you please also add tests for a srcset descriptor with spaces around (if there is no such test yet). Something like srcset="    foobar.jpg", srcset="fobar.jpg    " and srcset="   foobar.jpg   ". (+ the variations with a scale factor if needed).

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:361
> +                size_t comma = imageScaleEnd;

comma -> commaPosition

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:362
> +                // Make sure there are no other descriptors

Missing period.

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:391
> +        image.imageURL = decodeURLEscapeSequences(StringImpl::createWithoutCopying(srcSetAttribute.characters() + imageUrlStart, imageUrlEnd - imageUrlStart));

Please check decodeURLEscapeSequences and do a follow up if the current behavior is erroneous.
Comment 36 Romain Perier 2013-09-17 08:37:26 PDT
Created attachment 211910 [details]
Patch
Comment 37 Romain Perier 2013-09-17 08:44:40 PDT
As discussed together, see the spec for data:URLs with escaped characters http://tools.ietf.org/html/rfc2397  (sorry this is the only document I found...).


Also I tested the following example with different browsers:
"
<html>
<img 
src="data:image/svg+xml,<svg%20xmlns='http://www.w3.org/2000/svg'%20width='100'%20height='100'><linearGradient%20id='gradient'><stop%20offset='10%'%20stop-color='#F00'/><stop%20offset='90%'%20stop-color='#fcc'/></linearGradient>
<rect%20fill='url(#gradient)'%20x='0'%20y='0'%20width='100%'%20height='100%'/></svg>"/>
</html>
"

Results:
1. It works with chromium/blink 29
2. It does not work with firefox 23/24
3. It does not work with IE 10

I think we should support this feature for src and srcset attributes. Also, as it's not syntaxically possible to support plain/text svgs in data:URLs (a svg contains space, and space is a delimiter for URLs), escaping spaces would be a good solution, imho.
Comment 38 WebKit Commit Bot 2013-09-17 13:05:43 PDT
Comment on attachment 211910 [details]
Patch

Clearing flags on attachment: 211910

Committed r155988: <http://trac.webkit.org/changeset/155988>
Comment 39 WebKit Commit Bot 2013-09-17 13:05:49 PDT
All reviewed patches have been landed.  Closing bug.