Bug 60637

Summary: We should have a Content-Type parser
Product: WebKit Reporter: Jay Civelli <jcivelli>
Component: WebCore Misc.Assignee: Jay Civelli <jcivelli>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, commit-queue, ddkilzer, eric, webkit.review.bot
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Bug Depends on:    
Bug Blocks: 7168    
Attachments:
Description Flags
Patch
none
Patch
none
Patch
none
ContentTypeParser unit-tests none

Jay Civelli
Reported 2011-05-11 10:15:26 PDT
We should have a Content-Type parser that can parse the MIME type and all specified parameters. This is needed for MHTML reading support. The code in HTTPParsers only parse the MIME type and charset parameter.
Attachments
Patch (12.06 KB, patch)
2011-05-11 10:33 PDT, Jay Civelli
no flags
Patch (12.46 KB, patch)
2011-05-11 11:30 PDT, Jay Civelli
no flags
Patch (11.86 KB, patch)
2011-05-11 14:10 PDT, Jay Civelli
no flags
ContentTypeParser unit-tests (5.14 KB, application/octet-stream)
2011-05-11 17:25 PDT, Jay Civelli
no flags
Jay Civelli
Comment 1 2011-05-11 10:33:07 PDT
Darin Adler
Comment 2 2011-05-11 10:47:03 PDT
Comment on attachment 93146 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=93146&action=review Didn’t have chance for a whole review, but did have a couple comments. > Source/JavaScriptCore/wtf/ASCIICType.h:70 > + inline bool isASCIIControl(char c) { return c <= ' '; } All functions in this file should be overridden for all the same types, not just for char. This new function will return true for 0x80-0xFF on platforms where char is signed, so it’s incorrect and false for platforms where char is unsigned, so either way it’s incorrect. I’m not sure that “control character” is a clearly-enough-defined concept that this helper belongs in this file; I’m especially unsure that space should be considered a control character. > Source/WebCore/platform/network/ContentTypeParser.cpp:41 > +static void skipWhiteSpaces(const String& input, size_t& startIndex) This skips spaces only. The term “white spaces” is usually reserved for places where there are things other than a space.
Jay Civelli
Comment 3 2011-05-11 11:30:36 PDT
Jay Civelli
Comment 4 2011-05-11 11:31:27 PDT
Comment on attachment 93146 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=93146&action=review >> Source/JavaScriptCore/wtf/ASCIICType.h:70 >> + inline bool isASCIIControl(char c) { return c <= ' '; } > > All functions in this file should be overridden for all the same types, not just for char. > > This new function will return true for 0x80-0xFF on platforms where char is signed, so it’s incorrect and false for platforms where char is unsigned, so either way it’s incorrect. > > I’m not sure that “control character” is a clearly-enough-defined concept that this helper belongs in this file; I’m especially unsure that space should be considered a control character. You are right, space is not a control character (but DEL is that I had forgotten). I fixed that and tried to fix the signed/unsigned problems. I thought ASCII control characters were common enough (see http://en.wikipedia.org/wiki/Ascii#ASCII_control_characters) that the function could be helpful to others. Let me know if you feel strongly about it, in which case I'll move this function it to ContentTypeParser.cpp. >> Source/WebCore/platform/network/ContentTypeParser.cpp:41 >> +static void skipWhiteSpaces(const String& input, size_t& startIndex) > > This skips spaces only. The term “white spaces” is usually reserved for places where there are things other than a space. Good point, changed.
Adam Barth
Comment 5 2011-05-11 11:50:15 PDT
Comment on attachment 93151 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=93151&action=review I would have written this slightly differently, but this looks fine. The approach you've used makes it slightly difficult to verify that you're always bounds checking correctly. It would also be nice to put the grammar from the RFC into a comment in the code for documentation. > Source/WebCore/platform/network/ContentTypeParser.cpp:95 > + , m_inputIndex(0) That's the point of this member? > Source/WebCore/platform/network/ContentTypeParser.cpp:123 > + size_t index = strlen(contentTypeParameterName); It feels like contentTypeParameterName should be a DECLARE_STATIC_LOCAL(String, contentTypeParameterName, ("Content-Type")) in this function. Then you could just use contentTypeParameterName.length(). > Source/WebCore/platform/network/ContentTypeParser.cpp:146 > + // Should we tolerate spaces here? What does the RFC say? > Source/WebCore/platform/network/ContentTypeParser.cpp:170 > + m_parameters.add(key, value); What happens if there are more than one parameter with the same name?
Adam Barth
Comment 6 2011-05-11 11:52:17 PDT
If you're interested, here's a parser I wrote recently that uses a slightly different style: http://trac.webkit.org/browser/trunk/Source/WebCore/page/ContentSecurityPolicy.cpp That parser is somewhat different that this one because it uses a segment-and-parse strategy to get the particular error handling we're supposed to use.
Adam Barth
Comment 7 2011-05-11 11:54:25 PDT
Comment on attachment 93151 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=93151&action=review I'm going to mark this R+, but other folks should feel free to comment as well. Please address the (minor) feedback above before landing. I presume this is going to be tested in subsequent MHTML patchs. > Source/JavaScriptCore/wtf/ASCIICType.h:76 > + inline bool isASCIIControl(char c) { return (c >= 0 && c < ' ') || c == 0x7F; } > + inline bool isASCIIControl(unsigned short c) { return c < ' ' || c == 0x7F; } > +#if !COMPILER(MSVC) || defined(_NATIVE_WCHAR_T_DEFINED) > + inline bool isASCIIControl(wchar_t c) { return (c >= 0 && c < ' ') || c == 0x7F; } > +#endif > + inline bool isASCIIControl(int c) { return (c >= 0 && c < ' ') || c == 0x7F; } > + inline bool isASCIIControl(unsigned c) { return c < ' ' || c == 0x7F; } This feels like a lot of copy/paste, but I guess that's how this file rolls.
Darin Adler
Comment 8 2011-05-11 12:03:48 PDT
Comment on attachment 93151 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=93151&action=review >> Source/JavaScriptCore/wtf/ASCIICType.h:76 >> + inline bool isASCIIControl(unsigned c) { return c < ' ' || c == 0x7F; } > > This feels like a lot of copy/paste, but I guess that's how this file rolls. The one call site using this function isn’t getting a lot of benefit from it, because it has a lot of other special cases. I don’t think adding this is a good idea. Also, this implementation assumes that char is signed. On platforms where char is unsigned we might get a warning, and a warning that may be treated as an error, that >= 0 is always true. > Source/WebCore/platform/network/ContentTypeParser.cpp:49 > + return !isASCIIControl(c) && c != ' ' && c != '"' && c != '(' && c != ')' && c != ',' && c != '/' && (c < ':' || c > '@') && (c < '[' || c > ']'); With so many specific rules here already, I think that the isASCIIControl logic should be just inlined here. What does this function do for 0x80-0xFF. I can’t tell! I suggest not abbreviated character as Char in this function. > Source/WebCore/platform/network/ContentTypeParser.h:42 > +// FIXME: add support for comments. > +class ContentTypeParser { > +public: Is this class used at all yet? Where are the test cases? I don’t understand why m_inputIndex is a data member. It seem to be used only during the parsing process.
Adam Barth
Comment 9 2011-05-11 13:20:30 PDT
> Is this class used at all yet? Where are the test cases? I asked Jay to split this part of the patch out from Bug 7168. The test cases are in that bug (and I'm going to ask him to add more when that patch is ready for review).
Jay Civelli
Comment 10 2011-05-11 14:09:50 PDT
Comment on attachment 93151 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=93151&action=review >>> Source/JavaScriptCore/wtf/ASCIICType.h:76 >>> + inline bool isASCIIControl(unsigned c) { return c < ' ' || c == 0x7F; } >> >> This feels like a lot of copy/paste, but I guess that's how this file rolls. > > The one call site using this function isn’t getting a lot of benefit from it, because it has a lot of other special cases. I don’t think adding this is a good idea. > > Also, this implementation assumes that char is signed. On platforms where char is unsigned we might get a warning, and a warning that may be treated as an error, that >= 0 is always true. OK, I removed it and added it to the caller. >> Source/WebCore/platform/network/ContentTypeParser.cpp:49 >> + return !isASCIIControl(c) && c != ' ' && c != '"' && c != '(' && c != ')' && c != ',' && c != '/' && (c < ':' || c > '@') && (c < '[' || c > ']'); > > With so many specific rules here already, I think that the isASCIIControl logic should be just inlined here. > > What does this function do for 0x80-0xFF. I can’t tell! > > I suggest not abbreviated character as Char in this function. Inlined isASCIIControl, now checking the token is ASCII, renamed without abbreviation. >> Source/WebCore/platform/network/ContentTypeParser.cpp:95 >> + , m_inputIndex(0) > > That's the point of this member? Ah! Unused, removed. >> Source/WebCore/platform/network/ContentTypeParser.cpp:123 >> + size_t index = strlen(contentTypeParameterName); > > It feels like contentTypeParameterName should be a DECLARE_STATIC_LOCAL(String, contentTypeParameterName, ("Content-Type")) in this function. Then you could just use contentTypeParameterName.length(). Cool, I did not know about that macro. Now using it. >> Source/WebCore/platform/network/ContentTypeParser.cpp:146 >> + // Should we tolerate spaces here? > > What does the RFC say? It says there should be no space. I was wondering wether we should be lenient or not. >> Source/WebCore/platform/network/ContentTypeParser.cpp:170 >> + m_parameters.add(key, value); > > What happens if there are more than one parameter with the same name? Changed the code to keep the last one, and added a comment in the header to document that behavior. >> Source/WebCore/platform/network/ContentTypeParser.h:42 >> +public: > > Is this class used at all yet? Where are the test cases? > > I don’t understand why m_inputIndex is a data member. It seem to be used only during the parsing process. m_inputIndex was unused and has been removed. This is going to be used by https://bugs.webkit.org/show_bug.cgi?id=7168 where we need to parse multipart Content-Type, such as: Content-Type: multipart/related; boundary="boundary-example"; type="text/html" Also I have a bunch of unit-tests for this code that will make it to TestWebKitAPI at some point.
Jay Civelli
Comment 11 2011-05-11 14:10:16 PDT
Adam Barth
Comment 12 2011-05-11 14:40:38 PDT
Comment on attachment 93182 [details] Patch ok
WebKit Commit Bot
Comment 13 2011-05-11 17:17:36 PDT
Comment on attachment 93182 [details] Patch Clearing flags on attachment: 93182 Committed r86289: <http://trac.webkit.org/changeset/86289>
WebKit Commit Bot
Comment 14 2011-05-11 17:17:43 PDT
All reviewed patches have been landed. Closing bug.
Jay Civelli
Comment 15 2011-05-11 17:25:28 PDT
Created attachment 93218 [details] ContentTypeParser unit-tests These are unit-tests used when writing ContentTypeParser. They'll be integrated to TestWebKitAPI once dslomov is done with transtionning it to GTest.
WebKit Review Bot
Comment 16 2011-05-11 19:41:44 PDT
http://trac.webkit.org/changeset/86289 might have broken GTK Linux 32-bit Release
Note You need to log in before you can comment on or make changes to this bug.