60637 – We should have a Content-Type parser

RESOLVED FIXED 60637

We should have a Content-Type parser

https://bugs.webkit.org/show_bug.cgi?id=60637

Summary We should have a Content-Type parser

Jay Civelli

Reported 2011-05-11 10:15:26 PDT

We should have a Content-Type parser that can parse the MIME type and all specified parameters. This is needed for MHTML reading support. The code in HTTPParsers only parse the MIME type and charset parameter.

Attachments
Patch (12.06 KB, patch) 2011-05-11 10:33 PDT, Jay Civelli	no flags	Details Formatted Diff Diff
Patch (12.46 KB, patch) 2011-05-11 11:30 PDT, Jay Civelli	no flags	Details Formatted Diff Diff
Patch (11.86 KB, patch) 2011-05-11 14:10 PDT, Jay Civelli	no flags	Details Formatted Diff Diff
ContentTypeParser unit-tests (5.14 KB, application/octet-stream) 2011-05-11 17:25 PDT, Jay Civelli	no flags	Details
Show Obsolete (2) View All Add attachment proposed patch, testcase, etc.

Jay Civelli

Comment 1 2011-05-11 10:33:07 PDT

Created attachment 93146 [details] Patch

Darin Adler

Comment 2 2011-05-11 10:47:03 PDT

Comment on attachment 93146 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=93146&action=review Didn’t have chance for a whole review, but did have a couple comments. > Source/JavaScriptCore/wtf/ASCIICType.h:70 > + inline bool isASCIIControl(char c) { return c <= ' '; } All functions in this file should be overridden for all the same types, not just for char. This new function will return true for 0x80-0xFF on platforms where char is signed, so it’s incorrect and false for platforms where char is unsigned, so either way it’s incorrect. I’m not sure that “control character” is a clearly-enough-defined concept that this helper belongs in this file; I’m especially unsure that space should be considered a control character. > Source/WebCore/platform/network/ContentTypeParser.cpp:41 > +static void skipWhiteSpaces(const String& input, size_t& startIndex) This skips spaces only. The term “white spaces” is usually reserved for places where there are things other than a space.

Jay Civelli

Comment 3 2011-05-11 11:30:36 PDT

Created attachment 93151 [details] Patch

Jay Civelli

Comment 4 2011-05-11 11:31:27 PDT

Comment on attachment 93146 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=93146&action=review >> Source/JavaScriptCore/wtf/ASCIICType.h:70 >> + inline bool isASCIIControl(char c) { return c <= ' '; } > > All functions in this file should be overridden for all the same types, not just for char. > > This new function will return true for 0x80-0xFF on platforms where char is signed, so it’s incorrect and false for platforms where char is unsigned, so either way it’s incorrect. > > I’m not sure that “control character” is a clearly-enough-defined concept that this helper belongs in this file; I’m especially unsure that space should be considered a control character. You are right, space is not a control character (but DEL is that I had forgotten). I fixed that and tried to fix the signed/unsigned problems. I thought ASCII control characters were common enough (see http://en.wikipedia.org/wiki/Ascii#ASCII_control_characters) that the function could be helpful to others. Let me know if you feel strongly about it, in which case I'll move this function it to ContentTypeParser.cpp. >> Source/WebCore/platform/network/ContentTypeParser.cpp:41 >> +static void skipWhiteSpaces(const String& input, size_t& startIndex) > > This skips spaces only. The term “white spaces” is usually reserved for places where there are things other than a space. Good point, changed.

Adam Barth

Comment 5 2011-05-11 11:50:15 PDT

Comment on attachment 93151 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=93151&action=review I would have written this slightly differently, but this looks fine. The approach you've used makes it slightly difficult to verify that you're always bounds checking correctly. It would also be nice to put the grammar from the RFC into a comment in the code for documentation. > Source/WebCore/platform/network/ContentTypeParser.cpp:95 > + , m_inputIndex(0) That's the point of this member? > Source/WebCore/platform/network/ContentTypeParser.cpp:123 > + size_t index = strlen(contentTypeParameterName); It feels like contentTypeParameterName should be a DECLARE_STATIC_LOCAL(String, contentTypeParameterName, ("Content-Type")) in this function. Then you could just use contentTypeParameterName.length(). > Source/WebCore/platform/network/ContentTypeParser.cpp:146 > + // Should we tolerate spaces here? What does the RFC say? > Source/WebCore/platform/network/ContentTypeParser.cpp:170 > + m_parameters.add(key, value); What happens if there are more than one parameter with the same name?

Adam Barth

Comment 6 2011-05-11 11:52:17 PDT

If you're interested, here's a parser I wrote recently that uses a slightly different style: http://trac.webkit.org/browser/trunk/Source/WebCore/page/ContentSecurityPolicy.cpp That parser is somewhat different that this one because it uses a segment-and-parse strategy to get the particular error handling we're supposed to use.

Adam Barth

Comment 7 2011-05-11 11:54:25 PDT

Comment on attachment 93151 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=93151&action=review I'm going to mark this R+, but other folks should feel free to comment as well. Please address the (minor) feedback above before landing. I presume this is going to be tested in subsequent MHTML patchs. > Source/JavaScriptCore/wtf/ASCIICType.h:76 > + inline bool isASCIIControl(char c) { return (c >= 0 && c < ' ') || c == 0x7F; } > + inline bool isASCIIControl(unsigned short c) { return c < ' ' || c == 0x7F; } > +#if !COMPILER(MSVC) || defined(_NATIVE_WCHAR_T_DEFINED) > + inline bool isASCIIControl(wchar_t c) { return (c >= 0 && c < ' ') || c == 0x7F; } > +#endif > + inline bool isASCIIControl(int c) { return (c >= 0 && c < ' ') || c == 0x7F; } > + inline bool isASCIIControl(unsigned c) { return c < ' ' || c == 0x7F; } This feels like a lot of copy/paste, but I guess that's how this file rolls.

Darin Adler

Comment 8 2011-05-11 12:03:48 PDT

Comment on attachment 93151 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=93151&action=review >> Source/JavaScriptCore/wtf/ASCIICType.h:76 >> + inline bool isASCIIControl(unsigned c) { return c < ' ' || c == 0x7F; } > > This feels like a lot of copy/paste, but I guess that's how this file rolls. The one call site using this function isn’t getting a lot of benefit from it, because it has a lot of other special cases. I don’t think adding this is a good idea. Also, this implementation assumes that char is signed. On platforms where char is unsigned we might get a warning, and a warning that may be treated as an error, that >= 0 is always true. > Source/WebCore/platform/network/ContentTypeParser.cpp:49 > + return !isASCIIControl(c) && c != ' ' && c != '"' && c != '(' && c != ')' && c != ',' && c != '/' && (c < ':' || c > '@') && (c < '[' || c > ']'); With so many specific rules here already, I think that the isASCIIControl logic should be just inlined here. What does this function do for 0x80-0xFF. I can’t tell! I suggest not abbreviated character as Char in this function. > Source/WebCore/platform/network/ContentTypeParser.h:42 > +// FIXME: add support for comments. > +class ContentTypeParser { > +public: Is this class used at all yet? Where are the test cases? I don’t understand why m_inputIndex is a data member. It seem to be used only during the parsing process.

Adam Barth

Comment 9 2011-05-11 13:20:30 PDT

> Is this class used at all yet? Where are the test cases? I asked Jay to split this part of the patch out from Bug 7168. The test cases are in that bug (and I'm going to ask him to add more when that patch is ready for review).

Jay Civelli

Comment 10 2011-05-11 14:09:50 PDT

Comment on attachment 93151 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=93151&action=review >>> Source/JavaScriptCore/wtf/ASCIICType.h:76 >>> + inline bool isASCIIControl(unsigned c) { return c < ' ' || c == 0x7F; } >> >> This feels like a lot of copy/paste, but I guess that's how this file rolls. > > The one call site using this function isn’t getting a lot of benefit from it, because it has a lot of other special cases. I don’t think adding this is a good idea. > > Also, this implementation assumes that char is signed. On platforms where char is unsigned we might get a warning, and a warning that may be treated as an error, that >= 0 is always true. OK, I removed it and added it to the caller. >> Source/WebCore/platform/network/ContentTypeParser.cpp:49 >> + return !isASCIIControl(c) && c != ' ' && c != '"' && c != '(' && c != ')' && c != ',' && c != '/' && (c < ':' || c > '@') && (c < '[' || c > ']'); > > With so many specific rules here already, I think that the isASCIIControl logic should be just inlined here. > > What does this function do for 0x80-0xFF. I can’t tell! > > I suggest not abbreviated character as Char in this function. Inlined isASCIIControl, now checking the token is ASCII, renamed without abbreviation. >> Source/WebCore/platform/network/ContentTypeParser.cpp:95 >> + , m_inputIndex(0) > > That's the point of this member? Ah! Unused, removed. >> Source/WebCore/platform/network/ContentTypeParser.cpp:123 >> + size_t index = strlen(contentTypeParameterName); > > It feels like contentTypeParameterName should be a DECLARE_STATIC_LOCAL(String, contentTypeParameterName, ("Content-Type")) in this function. Then you could just use contentTypeParameterName.length(). Cool, I did not know about that macro. Now using it. >> Source/WebCore/platform/network/ContentTypeParser.cpp:146 >> + // Should we tolerate spaces here? > > What does the RFC say? It says there should be no space. I was wondering wether we should be lenient or not. >> Source/WebCore/platform/network/ContentTypeParser.cpp:170 >> + m_parameters.add(key, value); > > What happens if there are more than one parameter with the same name? Changed the code to keep the last one, and added a comment in the header to document that behavior. >> Source/WebCore/platform/network/ContentTypeParser.h:42 >> +public: > > Is this class used at all yet? Where are the test cases? > > I don’t understand why m_inputIndex is a data member. It seem to be used only during the parsing process. m_inputIndex was unused and has been removed. This is going to be used by https://bugs.webkit.org/show_bug.cgi?id=7168 where we need to parse multipart Content-Type, such as: Content-Type: multipart/related; boundary="boundary-example"; type="text/html" Also I have a bunch of unit-tests for this code that will make it to TestWebKitAPI at some point.

Jay Civelli

Comment 11 2011-05-11 14:10:16 PDT

Created attachment 93182 [details] Patch

Adam Barth

Comment 12 2011-05-11 14:40:38 PDT

Comment on attachment 93182 [details] Patch ok

WebKit Commit Bot

Comment 13 2011-05-11 17:17:36 PDT

Comment on attachment 93182 [details] Patch Clearing flags on attachment: 93182 Committed r86289: <http://trac.webkit.org/changeset/86289>

WebKit Commit Bot

Comment 14 2011-05-11 17:17:43 PDT

All reviewed patches have been landed. Closing bug.

Jay Civelli

Comment 15 2011-05-11 17:25:28 PDT

Created attachment 93218 [details] ContentTypeParser unit-tests These are unit-tests used when writing ContentTypeParser. They'll be integrated to TestWebKitAPI once dslomov is done with transtionning it to GTest.

WebKit Review Bot

Comment 16 2011-05-11 19:41:44 PDT

http://trac.webkit.org/changeset/86289 might have broken GTK Linux 32-bit Release

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware Unspecified

OS Unspecified

Product WebKit

Component WebCore Misc.

Assignee

Jay Civelli

Reported

2011-05-11 10:15 PDT

Modified

2011-05-11 19:41 PDT History

CC List

5 users Show

URL

Keywords

Depends on

Blocks

7168

Dependencies

tree graph