Bug 60637 - We should have a Content-Type parser
Summary: We should have a Content-Type parser
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebCore Misc. (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Jay Civelli
URL:
Keywords:
Depends on:
Blocks: 7168
  Show dependency treegraph
 
Reported: 2011-05-11 10:15 PDT by Jay Civelli
Modified: 2011-05-11 19:41 PDT (History)
5 users (show)

See Also:


Attachments
Patch (12.06 KB, patch)
2011-05-11 10:33 PDT, Jay Civelli
no flags Details | Formatted Diff | Diff
Patch (12.46 KB, patch)
2011-05-11 11:30 PDT, Jay Civelli
no flags Details | Formatted Diff | Diff
Patch (11.86 KB, patch)
2011-05-11 14:10 PDT, Jay Civelli
no flags Details | Formatted Diff | Diff
ContentTypeParser unit-tests (5.14 KB, application/octet-stream)
2011-05-11 17:25 PDT, Jay Civelli
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Jay Civelli 2011-05-11 10:15:26 PDT
We should have a Content-Type parser that can parse the MIME type and all specified parameters.
This is needed for MHTML reading support.
The code in HTTPParsers only parse the MIME type and charset parameter.
Comment 1 Jay Civelli 2011-05-11 10:33:07 PDT
Created attachment 93146 [details]
Patch
Comment 2 Darin Adler 2011-05-11 10:47:03 PDT
Comment on attachment 93146 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=93146&action=review

Didn’t have chance for a whole review, but did have a couple comments.

> Source/JavaScriptCore/wtf/ASCIICType.h:70
> +    inline bool isASCIIControl(char c) { return c <= ' '; }

All functions in this file should be overridden for all the same types, not just for char.

This new function will return true for 0x80-0xFF on platforms where char is signed, so it’s incorrect and false for platforms where char is unsigned, so either way it’s incorrect.

I’m not sure that “control character” is a clearly-enough-defined concept that this helper belongs in this file; I’m especially unsure that space should be considered a control character.

> Source/WebCore/platform/network/ContentTypeParser.cpp:41
> +static void skipWhiteSpaces(const String& input, size_t& startIndex)

This skips spaces only. The term “white spaces” is usually reserved for places where there are things other than a space.
Comment 3 Jay Civelli 2011-05-11 11:30:36 PDT
Created attachment 93151 [details]
Patch
Comment 4 Jay Civelli 2011-05-11 11:31:27 PDT
Comment on attachment 93146 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=93146&action=review

>> Source/JavaScriptCore/wtf/ASCIICType.h:70
>> +    inline bool isASCIIControl(char c) { return c <= ' '; }
> 
> All functions in this file should be overridden for all the same types, not just for char.
> 
> This new function will return true for 0x80-0xFF on platforms where char is signed, so it’s incorrect and false for platforms where char is unsigned, so either way it’s incorrect.
> 
> I’m not sure that “control character” is a clearly-enough-defined concept that this helper belongs in this file; I’m especially unsure that space should be considered a control character.

You are right, space is not a control character (but DEL is that I had forgotten). I fixed that and tried to fix the signed/unsigned problems.
I thought ASCII control characters were common enough (see http://en.wikipedia.org/wiki/Ascii#ASCII_control_characters) that the function could be helpful to others.
Let me know if you feel strongly about it, in which case I'll move this function it to ContentTypeParser.cpp.

>> Source/WebCore/platform/network/ContentTypeParser.cpp:41
>> +static void skipWhiteSpaces(const String& input, size_t& startIndex)
> 
> This skips spaces only. The term “white spaces” is usually reserved for places where there are things other than a space.

Good point, changed.
Comment 5 Adam Barth 2011-05-11 11:50:15 PDT
Comment on attachment 93151 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=93151&action=review

I would have written this slightly differently, but this looks fine.  The approach you've used makes it slightly difficult to verify that you're always bounds checking correctly.  It would also be nice to put the grammar from the RFC into a comment in the code for documentation.

> Source/WebCore/platform/network/ContentTypeParser.cpp:95
> +    , m_inputIndex(0)

That's the point of this member?

> Source/WebCore/platform/network/ContentTypeParser.cpp:123
> +    size_t index = strlen(contentTypeParameterName);

It feels like contentTypeParameterName should be a DECLARE_STATIC_LOCAL(String, contentTypeParameterName, ("Content-Type")) in this function.  Then you could just use contentTypeParameterName.length().

> Source/WebCore/platform/network/ContentTypeParser.cpp:146
> +        // Should we tolerate spaces here?

What does the RFC say?

> Source/WebCore/platform/network/ContentTypeParser.cpp:170
> +        m_parameters.add(key, value);

What happens if there are more than one parameter with the same name?
Comment 6 Adam Barth 2011-05-11 11:52:17 PDT
If you're interested, here's a parser I wrote recently that uses a slightly different style:

http://trac.webkit.org/browser/trunk/Source/WebCore/page/ContentSecurityPolicy.cpp

That parser is somewhat different that this one because it uses a segment-and-parse strategy to get the particular error handling we're supposed to use.
Comment 7 Adam Barth 2011-05-11 11:54:25 PDT
Comment on attachment 93151 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=93151&action=review

I'm going to mark this R+, but other folks should feel free to comment as well.  Please address the (minor) feedback above before landing.  I presume this is going to be tested in subsequent MHTML patchs.

> Source/JavaScriptCore/wtf/ASCIICType.h:76
> +    inline bool isASCIIControl(char c) { return (c >= 0 && c < ' ') || c == 0x7F; }
> +    inline bool isASCIIControl(unsigned short c) { return c < ' ' || c == 0x7F; }
> +#if !COMPILER(MSVC) || defined(_NATIVE_WCHAR_T_DEFINED)
> +    inline bool isASCIIControl(wchar_t c) { return (c >= 0 && c < ' ') || c == 0x7F; }
> +#endif
> +    inline bool isASCIIControl(int c) { return (c >= 0 && c < ' ') || c == 0x7F; }
> +    inline bool isASCIIControl(unsigned c) { return c < ' ' || c == 0x7F; }

This feels like a lot of copy/paste, but I guess that's how this file rolls.
Comment 8 Darin Adler 2011-05-11 12:03:48 PDT
Comment on attachment 93151 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=93151&action=review

>> Source/JavaScriptCore/wtf/ASCIICType.h:76
>> +    inline bool isASCIIControl(unsigned c) { return c < ' ' || c == 0x7F; }
> 
> This feels like a lot of copy/paste, but I guess that's how this file rolls.

The one call site using this function isn’t getting a lot of benefit from it, because it has a lot of other special cases. I don’t think adding this is a good idea.

Also, this implementation assumes that char is signed. On platforms where char is unsigned we might get a warning, and a warning that may be treated as an error, that >= 0 is always true.

> Source/WebCore/platform/network/ContentTypeParser.cpp:49
> +    return !isASCIIControl(c) && c != ' ' && c != '"' && c != '(' && c != ')' && c != ',' && c != '/' && (c < ':' || c > '@') && (c < '[' || c > ']');

With so many specific rules here already, I think that the isASCIIControl logic should be just inlined here.

What does this function do for 0x80-0xFF. I can’t tell!

I suggest not abbreviated character as Char in this function.

> Source/WebCore/platform/network/ContentTypeParser.h:42
> +// FIXME: add support for comments.
> +class ContentTypeParser {
> +public:

Is this class used at all yet? Where are the test cases?

I don’t understand why m_inputIndex is a data member. It seem to be used only during the parsing process.
Comment 9 Adam Barth 2011-05-11 13:20:30 PDT
> Is this class used at all yet? Where are the test cases?

I asked Jay to split this part of the patch out from Bug 7168.  The test cases are in that bug (and I'm going to ask him to add more when that patch is ready for review).
Comment 10 Jay Civelli 2011-05-11 14:09:50 PDT
Comment on attachment 93151 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=93151&action=review

>>> Source/JavaScriptCore/wtf/ASCIICType.h:76
>>> +    inline bool isASCIIControl(unsigned c) { return c < ' ' || c == 0x7F; }
>> 
>> This feels like a lot of copy/paste, but I guess that's how this file rolls.
> 
> The one call site using this function isn’t getting a lot of benefit from it, because it has a lot of other special cases. I don’t think adding this is a good idea.
> 
> Also, this implementation assumes that char is signed. On platforms where char is unsigned we might get a warning, and a warning that may be treated as an error, that >= 0 is always true.

OK, I removed it and added it to the caller.

>> Source/WebCore/platform/network/ContentTypeParser.cpp:49
>> +    return !isASCIIControl(c) && c != ' ' && c != '"' && c != '(' && c != ')' && c != ',' && c != '/' && (c < ':' || c > '@') && (c < '[' || c > ']');
> 
> With so many specific rules here already, I think that the isASCIIControl logic should be just inlined here.
> 
> What does this function do for 0x80-0xFF. I can’t tell!
> 
> I suggest not abbreviated character as Char in this function.

Inlined isASCIIControl, now checking the token is ASCII, renamed without abbreviation.

>> Source/WebCore/platform/network/ContentTypeParser.cpp:95
>> +    , m_inputIndex(0)
> 
> That's the point of this member?

Ah! Unused, removed.

>> Source/WebCore/platform/network/ContentTypeParser.cpp:123
>> +    size_t index = strlen(contentTypeParameterName);
> 
> It feels like contentTypeParameterName should be a DECLARE_STATIC_LOCAL(String, contentTypeParameterName, ("Content-Type")) in this function.  Then you could just use contentTypeParameterName.length().

Cool, I did not know about that macro. Now using it.

>> Source/WebCore/platform/network/ContentTypeParser.cpp:146
>> +        // Should we tolerate spaces here?
> 
> What does the RFC say?

It says there should be no space. I was wondering wether we should be lenient or not.

>> Source/WebCore/platform/network/ContentTypeParser.cpp:170
>> +        m_parameters.add(key, value);
> 
> What happens if there are more than one parameter with the same name?

Changed the code to keep the last one, and added a comment in the header to document that behavior.

>> Source/WebCore/platform/network/ContentTypeParser.h:42
>> +public:
> 
> Is this class used at all yet? Where are the test cases?
> 
> I don’t understand why m_inputIndex is a data member. It seem to be used only during the parsing process.

m_inputIndex was unused and has been removed.
This is going to be used by https://bugs.webkit.org/show_bug.cgi?id=7168 where we need to parse multipart Content-Type, such as:
Content-Type: multipart/related; boundary="boundary-example"; type="text/html"

Also I have a bunch of unit-tests for this code that will make it to TestWebKitAPI at some point.
Comment 11 Jay Civelli 2011-05-11 14:10:16 PDT
Created attachment 93182 [details]
Patch
Comment 12 Adam Barth 2011-05-11 14:40:38 PDT
Comment on attachment 93182 [details]
Patch

ok
Comment 13 WebKit Commit Bot 2011-05-11 17:17:36 PDT
Comment on attachment 93182 [details]
Patch

Clearing flags on attachment: 93182

Committed r86289: <http://trac.webkit.org/changeset/86289>
Comment 14 WebKit Commit Bot 2011-05-11 17:17:43 PDT
All reviewed patches have been landed.  Closing bug.
Comment 15 Jay Civelli 2011-05-11 17:25:28 PDT
Created attachment 93218 [details]
ContentTypeParser unit-tests

These are unit-tests used when writing ContentTypeParser.
They'll be integrated to TestWebKitAPI once dslomov is done with transtionning it to GTest.
Comment 16 WebKit Review Bot 2011-05-11 19:41:44 PDT
http://trac.webkit.org/changeset/86289 might have broken GTK Linux 32-bit Release