Bug 150137
| Field | Value |
|---|---|
| Summary | IdentifierRepHash has a mismatch between hash() and equal() |
| Product | WebKit |
| Component | JavaScriptCore |
| Status | NEW |
| Severity | Normal |
| Priority | P2 |
| Version | WebKit Nightly Build |
| Hardware | All |
| OS | All |
| Reporter | Filip Pizlo <fpizlo> |
| Assignee | Nobody <webkit-unassigned> |
| CC | darin, msaboff, ysuzuki |
Filip Pizlo
IdentifierRepHash inherits equal() from PtrHash, so it defines equality as pointer equality. But it implements hash() by calling the StringImpl's hash method.
This seems either incorrect or inefficient.
The hash() method returns a hash value based on the string's contents, not its identity. So non-identical strings might end up with the same contents and hence the same hash. Also, that hash takes more effort to compute than the normal PtrHash.
Quite likely, the PtrHash::hash method would be most efficient, since it doesn't require loading from the pointer, and loads are usually the more expensive operations.
Also, this means that we cannot have a HashMap with IdentifierRepHash that doesn't ref its strings and outlives some of the strings that are in it. This seems dirty, but it might actually be profitable in some cases, like InferredTypeTable. It's OK for that table to have an entry for a dangling StringImpl*, because all this means is that if some StringImpl gets allocated at the same memory address, we'll already have a hypothesis about its type. It might be a wrong hypothesis, but that's still sound - at worst we'll come up with a looser type than what we could have done. This might be a profitable optimization if the reduction in ref/deref traffic produced a big enough speed-up.
Anyway, it seems like IdentifierRepHash is doing either too much work in hash() or too little work in equal(), and we should fix it to be consistent, probably by just replacing it with PtrHash.
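To make the mismatch concrete, here is a minimal sketch using hypothetical stand-in types (the real WTF/JSC definitions differ; `StringImpl`, `PtrHashLike`, and `IdentifierRepHashLike` below are simplifications for illustration). Two distinct objects with equal content hashes collide under the content-based hash() even though equal() says they are different keys:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in: real WTF::StringImpl memoizes its content hash.
struct StringImpl {
    unsigned cachedHash;
    unsigned existingHash() const { return cachedHash; }
};

// Pointer-identity hashing in the spirit of PtrHash: mix the pointer
// bits; no load from *p is required.
struct PtrHashLike {
    static unsigned hash(StringImpl* p)
    {
        return static_cast<unsigned>(reinterpret_cast<std::uintptr_t>(p) >> 4);
    }
    static bool equal(StringImpl* a, StringImpl* b) { return a == b; }
};

// The reported mismatch: equal() is pointer identity (inherited from
// PtrHash), but hash() loads the string and returns its content hash.
struct IdentifierRepHashLike {
    static unsigned hash(StringImpl* p) { return p->existingHash(); }
    static bool equal(StringImpl* a, StringImpl* b) { return a == b; }
};
```

This is still a valid hash trait (equal keys hash equally, since a pointer is trivially equal to itself), which is why it works at all; the bug report's point is that the two halves are inconsistent in spirit, not that lookups break.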
Yusuke Suzuki
I don't remember the exact bug number (maybe it was in the context of Symbol and IdentifierRepHash), but IIRC, when I attempted to change this to PtrHash, I heard the following reason why we did not use PtrHash: we intentionally use StringHasher instead of PtrHash because it gives a nicer distribution for strings and reduces collisions in the hash table.
So when changing this, I think measuring collisions and performance would be good.
Yusuke Suzuki
(In reply to comment #1)
> I don't remember the exact bug number (maybe it was in the context of Symbol
> and IdentifierRepHash), but IIRC, when I attempted to change this to PtrHash,
> I heard the following reason why we did not use PtrHash: we intentionally use
> StringHasher instead of PtrHash because it gives a nicer distribution for
> strings and reduces collisions in the hash table.
> So when changing this, I think measuring collisions and performance would be good.
Personally, I'd guess PtrHash gives a nice distribution. PtrHash maps an intptr_t => unsigned, while StringHasher maps a UChar[arbitrary length] => unsigned. If PtrHash works well in general, it should provide a nice distribution in this case too.
Filip Pizlo
(In reply to comment #2)
> (In reply to comment #1)
> > I don't remember the exact bug number (maybe it was in the context of
> > Symbol and IdentifierRepHash), but IIRC, when I attempted to change this to
> > PtrHash, I heard the following reason why we did not use PtrHash: we
> > intentionally use StringHasher instead of PtrHash because it gives a nicer
> > distribution for strings and reduces collisions in the hash table.
> > So when changing this, I think measuring collisions and performance would be good.
>
> Personally, I'd guess PtrHash gives a nice distribution. PtrHash maps an
> intptr_t => unsigned, while StringHasher maps a UChar[arbitrary length] =>
> unsigned. If PtrHash works well in general, it should provide a nice
> distribution in this case too.
Do you recommend measuring distribution directly, or do you think it's OK to just see how performance is affected overall?
We may encounter a trade-off between a fast-to-compute hash function that gives worse distribution and a slow-to-compute hash function that gives better distribution. I don't have a philosophical preference between the two, but I suspect that unless the distribution is really bad, the faster-to-compute hash function will give better overall performance. Maybe the best thing to do is just to look at end-to-end performance. Probably it doesn't matter at all; in that case, PtrHash is nicer because it's the less surprising choice.
Darin Adler
As I recall, we were originally using PtrHash for this purpose (or a similar one in WebCore?) and Gavin changed to use the string hash instead to get a measurable performance boost. But I don’t think the change was to recompute a string hash each time; our strings memoize the result of the string hash function.
Yusuke Suzuki
(In reply to comment #3)
> Do you recommend measuring distribution directly, or do you think it's OK to
> just see how performance is affected overall?
I was thinking about HashTable::dumpStats.
DUMP_HASHTABLE_STATS can dump the statistics including collisions.
> We may encounter a trade-off between a fast-to-compute hash function that
> gives worse distribution and a slow-to-compute hash function that gives
> better distribution. I don't have a philosophical preference between the
> two, but I suspect that unless the distribution is really bad, the
> faster-to-compute hash function will give better overall performance. Maybe
> the best thing to do is just to look at end-to-end performance. Probably
> it doesn't matter at all; in that case, PtrHash is nicer because it's the
> less surprising choice.
Agreed. If the change does not cause any performance regression and there are enough tests, I think that indirectly indicates the distribution of PtrHash is not so bad.