Bug 25064 - Implement a content sniffer
Summary: Implement a content sniffer
Status: RESOLVED WONTFIX
Alias: None
Product: WebKit
Classification: Unclassified
Component: Platform (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Normal
Assignee: Adam Barth
URL:
Keywords:
Depends on:
Blocks: 25939
  Show dependency treegraph
 
Reported: 2009-04-06 16:08 PDT by Adam Barth
Modified: 2011-05-23 13:44 PDT (History)
12 users (show)

See Also:


Attachments
Step 1: Add sniffer entry points (12.21 KB, patch)
2009-05-06 15:30 PDT, Adam Barth
mjs: review-
Details | Formatted Diff | Diff
Step 2: Hook sniffer into network stack (23.94 KB, patch)
2009-05-06 16:40 PDT, Adam Barth
no flags Details | Formatted Diff | Diff
Step 2: Hook sniffer into network stack (improved) (30.72 KB, patch)
2009-05-06 16:53 PDT, Adam Barth
no flags Details | Formatted Diff | Diff
Step 2: Hook sniffer into network stack (ready for review) (33.44 KB, patch)
2009-05-06 18:38 PDT, Adam Barth
mjs: review-
Details | Formatted Diff | Diff
Step 3: Implement sniffing algorithm (18.49 KB, patch)
2009-05-17 19:45 PDT, Adam Barth
mjs: review-
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Adam Barth 2009-04-06 16:08:16 PDT
Currently, WebKit relies on the networking library to perform content sniffing.  This means every port has to roll their own, creating compatibility / security differences between WebKit ports.  I spoke to Maceij on #webkit today, and we agreed that we should have our own sniffer in WebKit.

We should base our content sniffer on the HTML 5 sniffing algorithm.  Ideally, we'd converge with Firefox on a common, standardized algorithm.  http://src.chromium.org/viewvc/chrome/trunk/src/net/base/mime_sniffer.cc might be a decent starting place.

If no one else gets to this soon, I'll probably be able to work on this in a few months.
Comment 1 Adam Barth 2009-05-06 15:30:49 PDT
Created attachment 30069 [details]
Step 1: Add sniffer entry points

Starting to work on this.  Adding entry points to the sniffing algorithm.
Comment 2 Adam Barth 2009-05-06 16:40:55 PDT
Created attachment 30073 [details]
Step 2: Hook sniffer into network stack

Work in progress.  I'll nominate for review when its appropriate.  Let me know if you have feedback about the design of how I've hooked this in.
Comment 3 Adam Barth 2009-05-06 16:53:10 PDT
Created attachment 30074 [details]
Step 2: Hook sniffer into network stack (improved)

Better separation of concerns.
Comment 4 Adam Barth 2009-05-06 18:38:32 PDT
Created attachment 30080 [details]
Step 2: Hook sniffer into network stack (ready for review)

This version is ready for review.
Comment 5 Adam Barth 2009-05-17 19:45:18 PDT
Created attachment 30433 [details]
Step 3: Implement sniffing algorithm

This is a port of the chromium sniffer.  This is probably a reasonable starting point.  We can tune it in subsequent steps, if you like.  This code is BSD licensed in the Chromium tree, which should fine to incorporate into WebKit.
Comment 6 Maciej Stachowiak 2009-05-22 21:36:23 PDT
Comment on attachment 30069 [details]
Step 1: Add sniffer entry points

This part looks ok. Though I am not sure it makes sense to send by itself.
Comment 7 Maciej Stachowiak 2009-05-22 21:42:40 PDT
Comment on attachment 30080 [details]
Step 2: Hook sniffer into network stack (ready for review)

The low-level details of the code look ok, but I'm not entirely happy with the design approach. It seems like overkill to create a generic forwarding resource handle client just to attach some built-in behavior to didReceiveResponse. I think it would make more sense to implement an appropriate method in ResourceHandleBase and require ResourceHandle subclasses to call that method at the appropriate point.

Also:

This adds a call to shouldSniffMIMEType, but no call to sniffMIMEType. Part 3 doesn't call it either, so how does any actual sniffing ever happen?

It might actually be simpler to combine this and part 1 with the patch to add the sniffer logic. Seems like it would be easier to review it all as a unit so that the reviewer can follow the logic.

r- for the above.
Comment 8 Adam Barth 2009-05-22 21:52:13 PDT
I'm happy to collapse these steps into one patch if that would be easier to review.  I was worried about creating something that was too big to review effectively.

No actual sniffing is going on yet.  The remaining steps are:

4) Call the sniffer once we receive bytes from the network (i.e., in didReceiveData).
5) Buffer network bytes if we can't guess the type from the first bytes we receive.
6) Disable the existing platform-level sniffer.

Would you like all these in one patch?
Comment 9 Maciej Stachowiak 2009-05-22 21:56:47 PDT
Comment on attachment 30433 [details]
Step 3: Implement sniffing algorithm

This can't land for at least the following reasons:

1) In light of the r- of part 2.
2) Because there are no test cases included.

Here are some other comments on the details:

A) These macros seem a little clunky. Perhaps it would be better to have a .in-type file to generate a source file that has the static sniffing tables.

> #define MAGIC_NUMBER(mimeType, magic) \
>   { (mimeType), (magic), sizeof(magic)-1, false },


B) For the sniffing rules that don't come from the HTML5 spec, should they be proposed for HTML5?

C) On sniffing feeds - I think the security risk of misinterpreting a non-feed XML document as a feed is low, so I would suggest to err on the side of being more generous. I think the code here will not sniff a feed in all the cases that Safari will, which is probably unwise.

D) Nothing in this patch set seems to turn off the built-in sniffing of NSURL, CFNetwork, or other network layers that may do their own sniffing.

In addition, I'll try to do some research on CFNetwork's current sniffing rules to see if there are any important ones that should be added here.

Nice work so far, please revise per above.
Comment 10 Maciej Stachowiak 2009-05-22 21:57:19 PDT
Comment on attachment 30069 [details]
Step 1: Add sniffer entry points

flagging as - in retrospect, because it doesn't make sense to land this in isolation, and the patchset should probably be combined anyway.
Comment 11 Adam Barth 2011-05-23 10:14:55 PDT
Sounds like we're not planning to do this anytime soon.