Bug 22131 - Ports should be able to provide their own visited link hashing and types
Summary: Ports should be able to provide their own visited link hashing and types
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Platform (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Normal
Assignee: Brett Wilson (Google)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-11-07 15:36 PST by Brett Wilson (Google)
Modified: 2008-11-21 09:09 PST (History)
3 users (show)

See Also:


Attachments
Patch for using 64 bits from a separate file (28.70 KB, patch)
2008-11-10 16:36 PST, Brett Wilson (Google)
eric: review+
Details | Formatted Diff | Diff
Patch with Eric's suggsetions (28.73 KB, patch)
2008-11-20 09:54 PST, Brett Wilson (Google)
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Brett Wilson (Google) 2008-11-07 15:36:02 PST
The current visited link system uses 32-bit hash values computed by Document to identify URLs.

Chromium needs to implement its own hashing for visited link coloring. We use 64 bit values because one of the design goals of history is to scale to ~1 million URLs (all your history forever) without any problems. Using a 32-bit hash would give unacceptably high collision rates. Safari only keeps history for a month, so this is less of an issue.

Chromium's visited link system is also a complex multiprocess hashtable which internally computes its own hash values. It uses salting (to avoid things like bug 22130) that must be kept in sync with all processes and its own hash algorithm, so this logic can not be moved to WebCore.

I think this can be done very cleanly be creating a new LinkHash.h file. It would contain a typedef for the hash value (so it can be 32 or 64 bits), and also a function for computing a hash of a string (moving the code from Document::visitedLinkHash). This would be nice for both the current WebKit code, as well as provide the ability for ports to provide their own hashing.
Comment 1 Brett Wilson (Google) 2008-11-10 16:36:51 PST
Created attachment 25033 [details]
Patch for using 64 bits from a separate file

This implements 64-bit hashes and separates them out into another file only. It doesn't actually use the additional 32 bits since I wasn't sure which hashing algorithm to use. The easy way out would be to just generate a salt on startup (I think the hashes are not persisted, right?) and use that as the top 32-bits.

For reference, Chromium uses MD5.

There was some additional discussion in this thread:
https://lists.webkit.org/pipermail/webkit-dev/2008-November/005641.html
Comment 2 Eric Seidel (no email) 2008-11-19 17:34:25 PST
Comment on attachment 25033 [details]
Patch for using 64 bits from a separate file

This looks great!

You use "hash" as an overloaded term, we need to qualify it's usage to make more sense.

 38 // Just use the low 32-bits of the 64-bit hash as the "hash" (which must be 32 bits).

Should be something like:
// Use the low 32 bits of the 64-bit LinkHash as the key for the HashSet


LinkHashHash needs a better name.  yes, I understand that I follows the pattern of other *Hash structs, but it's confusing. :)  LinkHashHashFunctions would be better.

Otherwise you're good to go.  I don't need to see this again.  There are obviously things which could be used to further improve hashing here, but this is a nice cleanup step at least.
Comment 3 Eric Seidel (no email) 2008-11-19 17:34:50 PST
CCing darin and mjs just in case they would have otherwise missed this.
Comment 4 Brett Wilson (Google) 2008-11-20 09:54:57 PST
Created attachment 25316 [details]
Patch with Eric's suggsetions

I'll wait for Darin or Maciej's input before checking this in.
Comment 5 Darin Adler 2008-11-20 10:12:27 PST
Comment on attachment 25316 [details]
Patch with Eric's suggsetions

This looks great to me. Excellent direction to go.

> +// Use the low 32-bits of the 64-bit LinkHash as the key for HashSets.
> +struct LinkHashHashFunction {

I understand the need/desire to not call this LinkHashHash. However, the other such structs are not named "hash function" for a reason -- they specify more than just the hash function; it's a complete specification of "how to hash this" that includes equality and key comparison.

I can certainly live with the current name, but maybe HashForLinkHash or LinkHashing would be better. Or perhaps the integer type is what should be renamed; it could be called LinkHashValue. Sorry I can't think of something better.

By the way, Maciej and I have an ongoing debate about the best types to use for 32-bit and 64-bit values. If we wanted to use the stdint.h types, then there are thousands of places where we do int and unsigned where we should do int32_t and uint32_t, and even then we'd have problems because those types don't work well with functions like printf and strtod and the like. But if we're using unsigned for uint32_t, then it seems strange to use uint64_t instead of just unsigned long long, which works on all the platforms we need it to. Anyway, no reason for you to get caught up in this debate. Consider this an aside.

> +// Resolves the potentially relative URL "attributeURL" relative to the given
> +// base URL, and returns the hash of the string that will be used for visited
> +// link coloring. It will return the special value of 0 if attributeURL does not
> +// look like a relative URL.
> +LinkHash visitedLinkHash(const KURL& base, const AtomicString& attributeURL);

I think the comment about the value of 0 is confusing. For one thing, attributeURL can be an absolute URL. And the implementation of this function only returns 0 if the passed in string is empty, so I think "does not look like a relative URL" is unnecessarily vague and confusing. Maybe there's some future direction you're pointing at here?

I have to admit I don't follow the rationale for landing code that makes 32-bit hashes and pretending they are 64-bit hashes and then rehashes it. We want actual 64-bit hashes for this change to really pay off, right? Please, lets do that. I don't think we need to use MD5 just to get good 64-bit hashes, but maybe that's a good choice. I'd hate to add an external dependency just for that, though. I don't understand the road map here.
Comment 6 Brett Wilson (Google) 2008-11-20 10:41:19 PST
(In reply to comment #5)
> I have to admit I don't follow the rationale for landing code that makes 32-bit
> hashes and pretending they are 64-bit hashes and then rehashes it. We want
> actual 64-bit hashes for this change to really pay off, right? Please, lets do
> that. I don't think we need to use MD5 just to get good 64-bit hashes, but
> maybe that's a good choice. I'd hate to add an external dependency just for
> that, though. I don't understand the road map here.

My original version of this code (which I didn't upload) had an ifdef for LinkHash so it could be 64 bits under PLATFORM(CHROMIUM) and remain 32 bits on other platforms. Maciej said on IRC he would prefer it be 64 bits for everybody.

I already have a hash function and everything implemented on the Chromium side that gives 64 bit values, and this code/requirement can't easily be changed. On the other hand, you guys have only 32-bit string hashing, and people on IRC (and you in your post here) seem undecided about what 64-bit hash function you want on your end.

If you tell me what you want, I'm happy to integrate it here for you. But I was trying not to get bogged down in this decision if you can't decide yet. If you would prefer, I can go back to my previous approach to keep you guys on 32 bit string hashes.
Comment 7 Darin Adler 2008-11-20 11:36:47 PST
(In reply to comment #6)
> My original version of this code (which I didn't upload) had an ifdef for
> LinkHash so it could be 64 bits under PLATFORM(CHROMIUM) and remain 32 bits on
> other platforms. Maciej said on IRC he would prefer it be 64 bits for
> everybody.

Yes, I think it should be 64-bit for everybody. I can't see any reason why some ports would want a 32-bit hash function.

> I already have a hash function and everything implemented on the Chromium side
> that gives 64 bit values, and this code/requirement can't easily be changed. On
> the other hand, you guys have only 32-bit string hashing, and people on IRC
> (and you in your post here) seem undecided about what 64-bit hash function you
> want on your end.

OK. But "you guys have only 32-bit string hashing" is referring to the WebKit project and you're a part of it! Lets find a 64-bit hash function. I wouldn't say so much that we're undecided. No one has made a concrete proposal yet.

Another way to put this is that I'd like to see code that treats everything as if it was 64-bit, all the way down to the actual hash computation function itself. If we're temporarily stuck using the 32-bit hash function, then only that single call site should know it. I don't want code that knows the hash is really 32-bit to be distributed in multiple places.

Then we just have to change that one spot to do the good hashing.

What about the "salting" issue?
Comment 8 Brett Wilson (Google) 2008-11-20 12:40:45 PST
It's not practical to move our visited link hashing logic into WebCore. It's a separate component of Chromium that manages the visited link database file I/O, multiprocess issues, etc. In this way, you can think of it like Qt's visited link system, where it needs to call out to a separate component.

The difference is that our implementation works pretty well with the way that your code handles visited link hashing, and it's possible to integrate the two in a reasonable way, while I don't think Qt has this option (I don't know what they can do to make it work, actually). We would integrate by routing the hash requests through the bridge layer to our application-specific visited link system. It would return the hash value which fits nicely with your current system.

Our visited link hash table maintains a hash table of 64-bit URL hashes with salting. We don't store the URL. URLs are added in the browser process where they're hashed (and where we don't use WebKit types), and the salt and hash table information is shared with each renderer process.

This hashtable is serialized to disk, so changing any aspect of the hash function would require regenerating every user's visited link database. In fact, even if you decided to use the exact same hash function as us, I would be inclined to keep our implementation separate. The visited link system is a nice self-contained component that internally handles this kind of thing. Parts of it run in the browser process where we don't have WebKit types.

We could create a bridging layer to call back to the WebKit hash function for adding URLs. But since we store URLs as 8-bit, doing bridging in this direction would require a string conversion or KURL construction for every addition to the database to WebCore::String types. 

Because of this, it makes a lot more sense to bridge the opposite way. We implement the visited link hashing function as a single line of code to call into our shared visited link system. This is already set up to work with 8-bit strings, which is also how we implement KURL, so there are no conversions necessary.

But even if your hashing function was identical to ours, I would still bridge out for this computation for the purposes of clean separation of our hashing code and backwards compatibility: our on-disk structures on our user's machines would then depend directly on how your hash functions are implemented so any changes on your end would break our existing users.

I hope that you agree this is a reasonable approach. This is for our protection as well as yours. I don't think WebKit would look kindly on me complaining about changes to the hashing algorithm because it breaks our users.

I'm happy to improve WebKit for the sake of improving WebKit and I'll implement hashing function you would prefer for your use. This is why I posted a question about what approach you wanted when I uploaded my patch in comment 1. Since you said there were no proposals, it were up to me I would use 16-bits of MD5 since that's what I already worked out for Chrome. I can implement this for this patch if you would like, or I can implement a different algorithm.
Comment 9 Brett Wilson (Google) 2008-11-20 12:42:15 PST
(In reply to comment #7)
> Another way to put this is that I'd like to see code that treats everything as
> if it was 64-bit, all the way down to the actual hash computation function
> itself. If we're temporarily stuck using the 32-bit hash function, then only
> that single call site should know it. I don't want code that knows the hash is
> really 32-bit to be distributed in multiple places.
> 
> Then we just have to change that one spot to do the good hashing.

This is already the case with my patch. Nothing makes any assumptions about the number of used bits in the hash outside of the hash function.

> What about the "salting" issue?

This patch does not address salting.
Comment 10 Darin Adler 2008-11-20 12:58:45 PST
OK, everything seems fine. It would be even better if we made the hash function be 64-bit and address the salting issue, but neither seems to be a hard requirement for checking in this change.
Comment 11 Brett Wilson (Google) 2008-11-20 13:25:05 PST
Thanks! (Sorry for the long previous post, I didn't want you to feel that I was trying to replace stuff in WebKit when I should be fixing it.)