Bug 24051 - Soup backend needs content sniffing capabilities
Summary: Soup backend needs content sniffing capabilities
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebKitGTK (show other bugs)
Version: 528+ (Nightly build)
Hardware: PC All
: P2 Normal
Assignee: Gustavo Noronha (kov)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-02-19 18:20 PST by Gustavo Noronha (kov)
Modified: 2009-03-06 08:04 PST (History)
3 users (show)

See Also:


Attachments
implement content sniffing (3.90 KB, patch)
2009-02-19 18:21 PST, Gustavo Noronha (kov)
no flags Details | Formatted Diff | Diff
use new API in libsoup to request that content sniffing is done (1.84 KB, patch)
2009-02-20 13:39 PST, Gustavo Noronha (kov)
no flags Details | Formatted Diff | Diff
content sniffing for the soup backend (4.59 KB, patch)
2009-03-06 06:07 PST, Gustavo Noronha (kov)
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Gustavo Noronha (kov) 2009-02-19 18:20:54 PST
When server sends no Content-Type, we need to try to detect it through sniffing. People testing the download patch noticed that some pages asked for multiple downloads of random URLs while loading. Those are javascript, html snippets, and other kinds of files for which the server is not sending a Content-Type header.
Comment 1 Gustavo Noronha (kov) 2009-02-19 18:21:36 PST
Created attachment 27819 [details]
implement content sniffing
Comment 2 Dan Winship 2009-02-20 06:52:33 PST
> +    // We still don't know anything about Content-Type, so we will try
> +    // sniffing the contents of the file, and then report that we got
> +    // headers
> +    if (!soup_message_headers_get_content_type(msg->response_headers, NULL))
> +        return;

This assumes that if Content-Type is present, that it's correct. I'm not sure that's how other browsers work (and I think you definitely want to sniff if Content-Type is "application/octet-stream", right?)

Having to delay calling didReceiveResponse is annoying, because maybe the response is coming in in small packets, and so the first got-chunk might not have a lot of data in it, so ideally, you'd want gotChunkCallback to check the chunk size, and if it's less than a certain amount, buffer it and wait for more data (or the finished signal) before sniffing it... (which would totally be a pain, yes. Maybe not worth it. What do other backends do?).


There was a thread on webkit-dev a few months ago about "Harmonizing content sniffing" between web browsers: https://lists.webkit.org/pipermail/webkit-dev/2008-November/005697.html, which links to a page showing what some existing browsers (including Safari) appear to do with certain important web types. It might be useful to compare g_content_type_guess()'s results against those; we may need a web-specific wrapper around g_content_type_guess() in order to get better results.

If the right answer ends up being a complicated mix of trusting Content-Type vs not trusting it, and overriding some of g_content_type_guess()'s guesses, then maybe we should have a soup_message_guess_content_type().
Comment 3 Gustavo Noronha (kov) 2009-02-20 07:10:17 PST
(In reply to comment #2)
> > +    // We still don't know anything about Content-Type, so we will try
> > +    // sniffing the contents of the file, and then report that we got
> > +    // headers
> > +    if (!soup_message_headers_get_content_type(msg->response_headers, NULL))
> > +        return;
> 
> This assumes that if Content-Type is present, that it's correct. I'm not sure
> that's how other browsers work (and I think you definitely want to sniff if
> Content-Type is "application/octet-stream", right?)

Hmm, that's right. I can add that case as another exception. In my tests believing the Content-Type and only sniffing empty ones is already a big improvement over no sniffing at all, anyway.

> Having to delay calling didReceiveResponse is annoying, because maybe the
> response is coming in in small packets, and so the first got-chunk might not
> have a lot of data in it, so ideally, you'd want gotChunkCallback to check the
> chunk size, and if it's less than a certain amount, buffer it and wait for more
> data (or the finished signal) before sniffing it... (which would totally be a
> pain, yes. Maybe not worth it. What do other backends do?).

I can try to implement that. Other backends seem to delegate this job to their http libraries. The library probably does the necessary buffering and detects the type before reporting headers received to the client.

> There was a thread on webkit-dev a few months ago about "Harmonizing content
> sniffing" between web browsers:
> https://lists.webkit.org/pipermail/webkit-dev/2008-November/005697.html, which
> links to a page showing what some existing browsers (including Safari) appear
> to do with certain important web types. It might be useful to compare
> g_content_type_guess()'s results against those; we may need a web-specific
> wrapper around g_content_type_guess() in order to get better results.

True. I would like to have that work done after we land basic content sniffing first, though, because having no sniffing at all causes some sites to not render correctly, and with the download patch applied, all of the empty content types turn into download requests.
Comment 4 Gustavo Noronha (kov) 2009-02-20 07:18:54 PST
Comment on attachment 27819 [details]
implement content sniffing

Discussing with Dan we decided this makes sense in libsoup itself, so I'm clearing the review flag and will cook a patch to libsoup instead, and later create a patch to use the new API here.
Comment 5 Gustavo Noronha (kov) 2009-02-20 13:37:06 PST
I've written a patch for libsoup: http://bugzilla.gnome.org/show_bug.cgi?id=572589. I'm going to attach the patch that needs to be applied to WebKit after that is landed here.
Comment 6 Gustavo Noronha (kov) 2009-02-20 13:39:10 PST
Created attachment 27838 [details]
use new API in libsoup to request that content sniffing is done

I'm not setting review flag yet, as the other patch needs to be reviewed and applied to libsoup first. Notice that this patch also replaces the use of a deprecated flag with the current approach used by libsoup for not copying the downloaded data to the message_body.
Comment 7 Gustavo Noronha (kov) 2009-03-06 06:07:38 PST
Created attachment 28356 [details]
content sniffing for the soup backend

I am proposing this, since we still have no support for this in soup, but I have prepared a more robust patch to soup, and will rework the support for sniffing in our soup backend as soon as the patch is accepted there.
Comment 8 Holger Freyther 2009-03-06 07:43:12 PST
Comment on attachment 28356 [details]
content sniffing for the soup backend

> From 2c0d6818a8549c54ff66e9f0ee6a14e2c4c0be86 Mon Sep 17 00:00:00 2001
> From: Gustavo Noronha Silva <kov@kov.eti.br>
> Date: Fri, 6 Mar 2009 10:30:52 -0300
> Subject: [PATCH] 2009-03-06  Gustavo Noronha Silva  <gns@gnome.org>
> 
>         Reviewed by NOBODY (OOPS!).
> 
>         https://bugs.webkit.org/show_bug.cgi?id=24051
>         Soup backend needs content sniffing capabilities
> 
>         Perform content sniffing when using soup, so that we have a chance
>         of figuring out the Content-Type of the file if it's not sent by
>         the server.
> 
>         * platform/network/ResourceHandleInternal.h:
>         (WebCore::ResourceHandleInternal::ResourceHandleInternal):
>         * platform/network/soup/ResourceHandleSoup.cpp:
>         (WebCore::gotHeadersCallback):
>         (WebCore::gotChunkCallback):
> ---
>  WebCore/ChangeLog                                  |   17 ++++++++++++++++
>  WebCore/platform/network/ResourceHandleInternal.h  |    2 +
>  .../platform/network/soup/ResourceHandleSoup.cpp   |   21 +++++++++++++++++++-
>  3 files changed, 39 insertions(+), 1 deletions(-)
> 
> diff --git a/WebCore/ChangeLog b/WebCore/ChangeLog
> index 614e0db..2a18f28 100644
> --- a/WebCore/ChangeLog
> +++ b/WebCore/ChangeLog
> @@ -1,3 +1,20 @@
> +2009-03-06  Gustavo Noronha Silva  <gns@gnome.org>
> +
> +        Reviewed by NOBODY (OOPS!).
> +
> +        https://bugs.webkit.org/show_bug.cgi?id=24051
> +        Soup backend needs content sniffing capabilities
> +
> +        Perform content sniffing when using soup, so that we have a chance
> +        of figuring out the Content-Type of the file if it's not sent by
> +        the server.
> +
> +        * platform/network/ResourceHandleInternal.h:
> +        (WebCore::ResourceHandleInternal::ResourceHandleInternal):
> +        * platform/network/soup/ResourceHandleSoup.cpp:
> +        (WebCore::gotHeadersCallback):
> +        (WebCore::gotChunkCallback):
> +
>  2009-03-06  Hironori Bono  <hbono@chromium.org>
>  
>          Reviewed by Alexey Proskuryakov.
> diff --git a/WebCore/platform/network/ResourceHandleInternal.h b/WebCore/platform/network/ResourceHandleInternal.h
> index a55f945..875854b 100644
> --- a/WebCore/platform/network/ResourceHandleInternal.h
> +++ b/WebCore/platform/network/ResourceHandleInternal.h
> @@ -111,6 +111,7 @@ namespace WebCore {
>  #if USE(SOUP)
>              , m_msg(0)
>              , m_cancelled(false)
> +            , m_reportedHeaders(false)
>              , m_gfile(0)
>              , m_inputStream(0)
>              , m_cancellable(0)
> @@ -183,6 +184,7 @@ namespace WebCore {
>          SoupMessage* m_msg;
>          ResourceResponse m_response;
>          bool m_cancelled;
> +        bool m_reportedHeaders;
>          GFile* m_gfile;
>          GInputStream* m_inputStream;
>          GCancellable* m_cancellable;
> diff --git a/WebCore/platform/network/soup/ResourceHandleSoup.cpp b/WebCore/platform/network/soup/ResourceHandleSoup.cpp
> index 82ed142..2169450 100644
> --- a/WebCore/platform/network/soup/ResourceHandleSoup.cpp
> +++ b/WebCore/platform/network/soup/ResourceHandleSoup.cpp
> @@ -214,6 +214,14 @@ static void gotHeadersCallback(SoupMessage* msg, gpointer data)
>      if (!SOUP_STATUS_IS_SUCCESSFUL(msg->status_code))
>          return;
>  
> +    soup_message_set_flags(msg, SOUP_MESSAGE_OVERWRITE_CHUNKS);

okay, we can kill that too?


> +
> +    // We still don't know anything about Content-Type, so we will try
> +    // sniffing the contents of the file, and then report that we got
> +    // headers
> +    if (!soup_message_headers_get_content_type(msg->response_headers, NULL))
> +        return;
> +

according to Danw the header callback is only called once, so this guard is good enough.


do you happen to have a test case for that as well?
Comment 9 Gustavo Noronha (kov) 2009-03-06 07:54:55 PST
(In reply to comment #8)
> > +    soup_message_set_flags(msg, SOUP_MESSAGE_OVERWRITE_CHUNKS);
> 
> okay, we can kill that too?

Not really. We need to replace this with a call to soup_message_body_set_accumulate(msg->request_body, FALSE); I'm going to prepare a patch =).


> do you happen to have a test case for that as well?
 
I usually test this by going to failblog.org and seeing if the top and top-right corner ads load correctly, or if they become download requests. You can do the same test in lkml.org, and clicking a message.
Comment 10 Gustavo Noronha (kov) 2009-03-06 07:55:45 PST
Comment on attachment 28356 [details]
content sniffing for the soup backend

Landed as r41481; clearing the flag because I will use the same bug to post two more patches.