Bug 14298

Summary: REGRESSION: Wrong base URL used for pages loaded by custom protocol handler
Product: WebKit Reporter: Rush Manbert <rush>
Component: Page Loading Assignee: Nobody <webkit-unassigned>
Status: UNCONFIRMED ---    
Severity: Major CC: abarth, aggarwalrachit33, airlinesphonenumber888, airlinesreservationsnumber, ap, azscreenrecorders1, baekchoonhwa, CassieGriffin, cogniscientofficial, dadwarner11, davidkeery, johnyaw652, ngockhanhlam87, ramtin.beheshti, sampledata17, showboxofficial01, Stengle123, thenarant, vidavera, vimalrajravi1, zwalikhan2211
Priority: P1 Keywords: InRadar, Regression
Version: 419.x   
Hardware: Mac   
OS: OS X 10.4   
Bug Depends on:    
Bug Blocks: 37641    
Attachments:
Description Flags
Test case project that does not contain WebKit.app none

Description Rush Manbert 2007-06-21 16:15:46 PDT
Summary: 
In the current shipping version of the WebKit framework, if I load the HTML for a web page using a custom protocol and set my desired base URL in the NSURLResponse by calling:
	[NSURLResponse initWithURL: MIMEType: expectedContentLength: textEncodingName:]
then that base URL is used to resolve any relative resource references in the loaded HTML. For instance, if I set the base URL to "file://localhost//Users/rmanbert/something/" and my HTML contains this element:
	<script type="text/javascript" src="test.js"></script>
then the full URL that is used will be "file://localhost//Users/rmanbert/something/test.js"
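For readers unfamiliar with custom protocol handlers, the setup described above can be sketched roughly as follows. This is a minimal illustration, not the code from the attached project: the class name SketchProtocol, the "special" scheme check, and the placeholder HTML are assumptions, and the sketch assumes ARC (under manual reference counting the response object would need an autorelease).

```objc
#import <Foundation/Foundation.h>

// Hypothetical handler; names and HTML are illustrative only.
@interface SketchProtocol : NSURLProtocol
@end

@implementation SketchProtocol

+ (BOOL)canInitWithRequest:(NSURLRequest *)request {
    // Claim every request that uses the custom scheme.
    return [[[request URL] scheme] isEqualToString:@"special"];
}

+ (NSURLRequest *)canonicalRequestForRequest:(NSURLRequest *)request {
    return request;
}

- (void)startLoading {
    NSString *page = @"<html><head>"
                      "<script type=\"text/javascript\" src=\"test.js\"></script>"
                      "</head><body></body></html>";
    NSData *html = [page dataUsingEncoding:NSUTF8StringEncoding];

    // The response URL deliberately differs from [[self request] URL];
    // this is the value the report expects to act as the base URL.
    NSURL *baseURL = [NSURL URLWithString:@"file://localhost//Users/rmanbert/something/"];
    NSURLResponse *response = [[NSURLResponse alloc] initWithURL:baseURL
                                                        MIMEType:@"text/html"
                                           expectedContentLength:(NSInteger)[html length]
                                                textEncodingName:@"utf-8"];

    id<NSURLProtocolClient> client = [self client];
    [client URLProtocol:self didReceiveResponse:response
     cacheStoragePolicy:NSURLCacheStorageNotAllowed];
    [client URLProtocol:self didLoadData:html];
    [client URLProtocolDidFinishLoading:self];
}

- (void)stopLoading {
    // Nothing to cancel: all data is delivered synchronously in startLoading.
}

@end
```

The handler would be installed once at launch with [NSURLProtocol registerClass:[SketchProtocol class]].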

In the nightly build of WebKit, this behavior is different. It appears that the base URL I set in my NSURLResponse is being ignored, and the base URL from the NSURLRequest is being used instead. Given the same example above, if the NSURLRequest that I retrieve in my startLoading method of my protocol handler has a URL of the form "special://loadString/", but I set the URL in the NSURLResponse as I did above, then when my script element is processed, WebKit resolves the relative reference as "special://loadString/test.js".
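The two candidate base URLs can be compared outside WebKit with Foundation's own relative URL support. This is only a sketch for inspecting the difference; it says nothing about how WebKit resolves references internally, and no particular output is claimed here.

```objc
#import <Foundation/Foundation.h>

int main(void) {
    @autoreleasepool {
        // URL set in the NSURLResponse (the base the report expects).
        NSURL *responseURL = [NSURL URLWithString:@"file://localhost//Users/rmanbert/something/"];
        // URL carried by the NSURLRequest (the base the nightly appears to use).
        NSURL *requestURL = [NSURL URLWithString:@"special://loadString/"];

        NSURL *fromResponse = [NSURL URLWithString:@"test.js" relativeToURL:responseURL];
        NSURL *fromRequest = [NSURL URLWithString:@"test.js" relativeToURL:requestURL];

        // Log what Foundation produces for each base.
        NSLog(@"resolved against response URL: %@", [fromResponse absoluteString]);
        NSLog(@"resolved against request URL:  %@", [fromRequest absoluteString]);
    }
    return 0;
}
```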

I rely on being able to set the base URL in my startLoading method, and this change in behavior has broken my app. I also believe it is incorrect behavior because people may use custom protocols to implement all sorts of different content loading schemes. Consider a case where different "special://" URLs are used as selectors for HTML content that is baked into the app for security reasons. The easiest way to load a JavaScript library from within the HTML would be to use a relative URL and manage the base URL in the startLoading method. It would be entirely reasonable to set the base URL to reference the application bundle resource folder, but you certainly would not want the base URL to be something like "special:///selector1".

In my case, I have an app that must run on the Mac and on Windows. I have a custom markup language that extends XHTML. Displaying my content relies on transforming the XHTML source with XSLT, as well as performing other manipulations on the DOM tree. In order to minimize browser-related problems, I use libxml and libxslt to process my source into an HTML string, and I use a custom protocol handler to load the HTML string into my WebKit view. I use JavaScript libraries that are loaded by the HTML, so I need to control the base URL.

Steps to Reproduce: 
I have attached (well, I hope I will be able to attach) a sample project in the file "WebKitBadBaseUrl.zip" that demonstrates the problem. It is based on Apple's Special Picture Protocol sample project, and I have modified it to simulate the way my application loads its web pages. The basic method for loading a page with a URL of the form "special:///file.html" is to read the contents of the file into a string. This simulates my rendering process. I then put the HTML string, plus a string that represents the path portion of the desired base URL, into an NSDictionary, which I attach to a new NSURLRequest. This request carries the URL "something://loadString/", which is exactly what my real application does. We then rely on the startLoading method in SpecialProtocol.m to extract the HTML string and the base URL path. The string becomes the returned data and the path gets incorporated into a file: URL to make the desired base URL.
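The hand-off described above might look roughly like this. It is a sketch under assumptions: the key names and helper functions are hypothetical, and attaching the dictionary with +[NSURLProtocol setProperty:forKey:inRequest:] is one plausible mechanism rather than a confirmed detail of the attached project.

```objc
#import <Foundation/Foundation.h>

// Hypothetical key names; the attached project may use different ones.
static NSString * const kSketchPayloadKey  = @"SketchPayload";
static NSString * const kSketchHTMLKey     = @"html";
static NSString * const kSketchBasePathKey = @"basePath";

// Builds the request that carries the rendered HTML and the base URL path.
static NSURLRequest *SketchMakeRequest(NSURL *url, NSString *html, NSString *basePath) {
    NSMutableURLRequest *request = [NSMutableURLRequest requestWithURL:url];
    NSDictionary *payload = [NSDictionary dictionaryWithObjectsAndKeys:
                                html, kSketchHTMLKey,
                                basePath, kSketchBasePathKey,
                                nil];
    [NSURLProtocol setProperty:payload forKey:kSketchPayloadKey inRequest:request];
    return request;
}

// Intended to be called from startLoading: recovers the HTML as response data
// and turns the stored path into a file: URL to use as the response URL.
static NSURL *SketchBaseURLForRequest(NSURLRequest *request, NSData **outBody) {
    NSDictionary *payload = [NSURLProtocol propertyForKey:kSketchPayloadKey
                                                inRequest:request];
    NSString *html = [payload objectForKey:kSketchHTMLKey];
    NSString *basePath = [payload objectForKey:kSketchBasePathKey];
    if (outBody != NULL) {
        *outBody = [html dataUsingEncoding:NSUTF8StringEncoding];
    }
    return [NSURL fileURLWithPath:basePath isDirectory:YES];
}
```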

The HTML contains a <script> tag that references a file that is stored in the application bundle resources folder. If the base URL works as expected, the web page is displayed. It contains a single line of text. If the base URL does not get set as desired, then this is detected in:
-(NSURLRequest *)webView:(WebView *)sender resource: willSendRequest: redirectResponse: fromDataSource:
in MyController.m. It displays an alert and exits the program with an error.
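A check of that kind could be sketched as below. The class name and the exact test are assumptions (the real check lives in MyController.m in the attached project and may differ), and this version only logs instead of showing an alert and exiting.

```objc
#import <Cocoa/Cocoa.h>
#import <WebKit/WebKit.h>

// Hypothetical resource load delegate; set with [webView setResourceLoadDelegate:].
@interface SketchController : NSObject
@end

@implementation SketchController

- (NSURLRequest *)webView:(WebView *)sender
                 resource:(id)identifier
          willSendRequest:(NSURLRequest *)request
         redirectResponse:(NSURLResponse *)redirectResponse
           fromDataSource:(WebDataSource *)dataSource {
    NSURL *url = [request URL];
    // test.js is expected to be requested through a file: URL. Any other
    // scheme means the base URL from the NSURLResponse was not used.
    BOOL isScript = [[[url path] lastPathComponent] isEqualToString:@"test.js"];
    if (isScript && ![url isFileURL]) {
        NSLog(@"Base URL error: test.js requested as %@", [url absoluteString]);
    }
    return request;
}

@end
```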

To demonstrate proper and improper behavior:
1) Build the project for Debug under OS X 10.4.x
2) Run the program and note that the web page is displayed. It displays the text "Content from test.html". This is the proper behavior because the relative reference of "test.js" in the HTML was resolved to the proper file URL for the test.js file stored in the application bundle resource folder.
3) Now we need to make the application run with the nightly build WebKit. Here is one method:
  a) Go to the SpecialPictureProtocol project window and open the "Executables" section
  b) Right click on SpecialPictureProtocol and select "Get Info" from the drop down menu
  c) Click on the "Arguments" tab
  d) Add two environment variables in the lower panel. Their names and values are as follows:
      WEBKIT_UNSET_DYLD_FRAMEWORK_PATH   YES
      DYLD_FRAMEWORK_PATH    $(PROJECT_DIR)/WebKit.app/Contents/Resources

      See the file CustomExecutableArgumentsSettings.tiff in the project directory for a screenshot.
4) Once the environment variables have been set, run the application again. This time you should see an error alert saying that there is a base URL error.
   

Expected Results: 
I expected the base URL that I set in my startLoading method to be used when resolving relative URLs within the loaded HTML.

Actual Results: 
The "dummy" special protocol URL was used as the base URL. This is the URL that was in the NSURLRequest that caused us to invoke startLoading.

Regression: 
This works as expected in WebKit frameworks released with 10.4.x. It does not work as expected in the nightly build.

Notes: 
The demonstration Xcode project is in the attached file "WebKitBadBaseUrl.zip".

I was advised on the webkit-dev mailing list to file bugs with Apple and here, so that's what I'm doing.
Comment 1 Rush Manbert 2007-06-21 16:25:50 PDT
My test case project zip file is 7.7 Mb, which is over the allowed size. It contained the copy of WebKit.app that I had tested against.

I have removed WebKit.app from the project zip file and will attach it. This means that when you get to step 3 in the Steps to Reproduce section, you first need to copy a current version of WebKit.app into the project directory. The version that I used was 419.3, but you can probably use any current version. Alternatively, you could just put the project on a Leopard system and run against the WebKit framework that it contains. That shows the same error.
Comment 2 Rush Manbert 2007-06-21 16:27:47 PDT
Created attachment 15170 [details]
Test case project that does not contain WebKit.app

You must copy a current WebKit.app into the project directory before you can follow the steps to reproduce in the original bug report.
Comment 3 Alexey Proskuryakov 2007-06-22 02:49:29 PDT
I haven't attempted to verify this, but P1/Regression handling seems appropriate anyway.

If you have already filed a Radar bug, please cross-reference it with this Bugzilla one, and change "NeedsRadar" keyword to "InRadar"!
Comment 4 Rush Manbert 2007-06-22 09:30:51 PDT
The Radar bug is 5286150. I have added a comment to it that cross references to this bug.
Comment 5 Alexey Proskuryakov 2008-01-05 07:35:50 PST
I'm not really familiar with custom protocol loaders, so it seems strange to me that such a loader can affect document url or base url somehow. Is this documented or at least hinted at anywhere?

Of course, the most obvious and reliable way to change base URL when loading is to inject a BASE tag in the HTML source.
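As a rough illustration of that suggestion, a string-level injection could look like the following. The function name is hypothetical, and the sketch makes simplifying assumptions: it only recognizes a literal "<head>" tag with no attributes, leaves the input untouched if the text "<base" appears anywhere in it, and does not HTML-escape the URL. XHTML output would need a self-closing form of the element instead.

```objc
#import <Foundation/Foundation.h>

// Returns html with a BASE element added right after the opening <head> tag.
// Returns html unchanged if there is no literal "<head>" or if "<base" already occurs.
static NSString *SketchInjectBase(NSString *html, NSURL *baseURL) {
    NSRange existing = [html rangeOfString:@"<base" options:NSCaseInsensitiveSearch];
    if (existing.location != NSNotFound) {
        return html;
    }
    NSRange head = [html rangeOfString:@"<head>" options:NSCaseInsensitiveSearch];
    if (head.location == NSNotFound) {
        return html;
    }
    NSString *tag = [NSString stringWithFormat:@"<base href=\"%@\">",
                                               [baseURL absoluteString]];
    NSMutableString *result = [NSMutableString stringWithString:html];
    [result insertString:tag atIndex:NSMaxRange(head)];
    return result;
}
```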
Comment 6 Rush Manbert 2008-01-08 13:14:45 PST
(In reply to comment #5)
Well, the hint is that when you initialize the NSURLResponse object that carries the metadata for the data you're loading, one of the things you specify is the URL. The documentation does not say how that URL is used. I will note, however, that the W3C HTML 4.01 specification contains a section titled "Links". It has a subsection that discusses URI resolution. I have copied it below:

***************************************************
12.4.1 Resolving relative URIs

User agents must calculate the base URI for resolving relative URIs according to [RFC1808], section 3. The following describes how [RFC1808] applies specifically to HTML.

User agents must calculate the base URI according to the following precedences (highest priority to lowest):

   1. The base URI is set by the BASE element.
   2. The base URI is given by meta data discovered during a protocol interaction, such as an HTTP header (see [RFC2616]).
   3. By default, the base URI is that of the current document. Not all HTML documents have a base URI (e.g., a valid HTML document may appear in an email and may not be designated by a URI). Such HTML documents are considered erroneous if they contain relative URIs and rely on a default base URI.

Additionally, the OBJECT and APPLET elements define attributes that take precedence over the value set by the BASE element. Please consult the definitions of these elements for more information about URI issues specific to them.
***********************************************************

To me, item 2, which specifies the base URI as given by the metadata, together with the description (in Apple's documentation) of NSURLResponse as containing the metadata, implies that the URL in the NSURLResponse is indeed supposed to be used when resolving URIs in the document. But I may be wrong.
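Under that reading, the precedence list quoted above reduces to a simple fallback chain. The sketch below is only a restatement of the quoted ordering; the function name is hypothetical, and mapping item 2 onto the NSURLResponse URL is the interpretation argued here, not something the documentation confirms.

```objc
#import <Foundation/Foundation.h>

// Picks the base URL following the HTML 4.01 section 12.4.1 ordering.
// Any argument may be nil when that source is absent.
static NSURL *SketchEffectiveBaseURL(NSURL *baseElementHref,
                                     NSURL *responseURL,
                                     NSURL *documentURL) {
    if (baseElementHref != nil) {
        return baseElementHref;   // 1. BASE element
    }
    if (responseURL != nil) {
        return responseURL;       // 2. metadata from the protocol interaction (assumed mapping)
    }
    return documentURL;           // 3. URL of the current document
}
```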

When I originally wrote my custom protocol handler, I found that, after redirecting the original load request to a new custom protocol request, I was immediately seeing a load request for my CSS stylesheet, but the relative URL had been expanded using the custom protocol info, like this:
    myProtocol://myOp/myStylesheet.css

but myStylesheet.css was actually a file, so I needed the URL as
    file://localhost/myStylesheet.css

and that provided the hook for me to resolve the actual full path to the stylesheet because the file: protocol handler would call my delegate method.

As an experiment, since nothing said how the NSURLResponse URL was used, I initialized it to the full file: URL for the page that I had rendered and loaded with my custom protocol. The URL that I had set into the NSURLResponse was used to generate the base URL and I started to see the expected URLs for my page resources. I had opened an Apple TSI in order to get help with this, and I was working with a support guy there. I told him how I had solved the problem and he didn't tell me that it was the wrong thing to do. (He didn't tell me it was the right thing to do either. ;-)

So all I can really say with authority is that it previously worked that way and now it doesn't, but I think the HTML spec implies that the previous behavior was correct. I don't understand why NSURLResponse has a URL field if it doesn't modify base URL. What good is it otherwise? Most of the examples I have seen just initialize it with the URL from the request, but they're not trying to achieve what I am. I have seen some examples where people seem to initialize it with a cache URL, but most of those posts were discussing the fact that they were having caching problems. It would be good if Apple documented how the URL in NSURLResponse gets used, but I can't find anything.

I implemented my browser in Windows using the IE COM control, and because of how it works, I did implement the code to insert a <base> element (or detect that one is already there and change its href attribute). Also, because of a separate resource caching bug in WebKit, I have implemented code that resolves all of my relative references in the page at the time I render it into an HTML string. These changes have worked around the behavior of the newest WebKit release, but it sort of scares me to do it this way. It means that I must find all of the XHTML element types that can contain URLs, examine their attributes, and do the right thing. I think I have the correct list, but I may have missed something. Being able to set the base URL by initializing the NSURLResponse pushed the identification problem off to WebKit, which already knows how to do all of that. Having to do it myself is a lot less elegant and more error-prone.