Bug 132588 - support for navigator.hardwareConcurrency
Status: RESOLVED FIXED
Product: WebKit
Classification: Unclassified
Component: WebCore Misc.
Version: 528+ (Nightly build)
Hardware: Unspecified
OS: Unspecified
Importance: P2 Normal
Assigned To: Rik Cabanier
Depends on:
Blocks:
 
Reported: 2014-05-05 17:04 PDT by Rik Cabanier
Modified: 2015-05-03 22:37 PDT
CC: 26 users



Attachments
Patch (4.26 KB, patch)
2014-05-06 12:26 PDT, Rik Cabanier
no flags Details | Formatted Diff | Diff
Archive of layout-test-results from webkit-ews-13 for mac-mountainlion-wk2 (461.99 KB, application/zip)
2014-05-06 13:24 PDT, Build Bot
no flags Details
Patch (5.08 KB, patch)
2014-05-06 15:46 PDT, Rik Cabanier
no flags Details | Formatted Diff | Diff
Patch (5.33 KB, patch)
2014-05-08 09:44 PDT, Rik Cabanier
no flags Details | Formatted Diff | Diff
Patch (5.70 KB, patch)
2014-05-09 10:41 PDT, Rik Cabanier
no flags Details | Formatted Diff | Diff
Patch (5.28 KB, patch)
2014-05-10 13:55 PDT, Rik Cabanier
no flags Details | Formatted Diff | Diff
Patch (39.80 KB, patch)
2014-05-11 16:23 PDT, Rik Cabanier
no flags Details | Formatted Diff | Diff
Patch for landing (39.82 KB, patch)
2014-05-18 09:57 PDT, Rik Cabanier
no flags Details | Formatted Diff | Diff
Patch for landing (39.71 KB, patch)
2014-05-18 12:59 PDT, Rik Cabanier
no flags Details | Formatted Diff | Diff

Description Rik Cabanier 2014-05-05 17:04:33 PDT
There's a thread on blink-dev [1] and whatwg [2] about adding a new property on the navigator object that returns the maximum number of tasks that can run in parallel.
The spec is not yet written, but there's a proposal on a wiki. [3]

Blink is going to implement this [4]. Mozilla is on the fence and there's been no signal from Microsoft.

1: https://groups.google.com/a/chromium.org/forum/#!topic/blink-dev/B6pQClqfCp4
2: http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2014-May/254200.html
3: http://wiki.whatwg.org/wiki/NavigatorCores
4: https://groups.google.com/a/chromium.org/forum/#!topic/blink-dev/xwl0ab20hVc
Comment 1 Geoffrey Garen 2014-05-05 17:37:36 PDT
Is this physical cores or logical cores?

Is this total in existence or total currently available, given other tasks on the system?

What is our response to the fingerprinting problem?
Comment 2 Alexey Proskuryakov 2014-05-05 17:57:02 PDT
In addition to more bits for fingerprinting, it's directly exploitable data - the web page will know whether you have expensive hardware.
Comment 3 Rik Cabanier 2014-05-05 20:25:07 PDT
(In reply to comment #2)
> In addition to more bits for fingerprinting, it's directly exploitable data - the web page will know whether you have expensive hardware.

How so? There are 8-core ARM systems which will be common in a year or two.
There's also an 8-core Intel Atom which is low cost. 
MacBook Airs (which are expensive) are still stuck on 4 cores.
Comment 4 Brady Eidson 2014-05-05 20:40:39 PDT
My turn to chime in.

- Reiterating the fingerprinting problem that's already been mentioned that we have no response to.
- "Number of physical cores" does not correlate to the number of simultaneous tasks supported.
- "Number of physical cores" does not correlate to computing resources currently available.

There's a modern approach to "let me do as much work as possible without hindering overall performance on the system" that at least some systems support.  Grand Central Dispatch on OS X, for example, will execute more tasks simultaneously when the resources are available but scale back when other apps demand some resources.

Is there a real problem that having this bit available would solve that couldn't also be solved by a GCD-style mechanism?
Comment 5 Filip Pizlo 2014-05-05 21:28:37 PDT
(In reply to comment #4)
> My turn to chime in.
> 
> - Reiterating the fingerprinting problem that's already been mentioned that we have no response to.
> - "Number of physical cores" does not correlate to the number of simultaneous tasks supporter
> - "Number of physical cores" does not correlate to computing resources currently available.
> 

I believe that they want the equivalent of hw.availcpu which isn't what either of your bullets tell you. 

> There modern approach to "let me do as much work as possible without hindering overall performance on the system" that at least some systems support.  Grand Central Dispatch on OS X, for example, will executed more tasks simultaneously when the resources are available but scale back when other apps demand some resources.
> 
> Is there a real problem that having this bit available would solve that couldn't also be solved by a GCD-style mechanism?

Yes. Absolutely. Numerical parallel algorithms are often written in a worker-per-CPU style and this ends up being faster than the mini-task model of GCD. 

But that's sort of beside the point - there are many things about GCD that would make it hard to do on the web anytime in the near future. Probably the biggest is that it relies on shared memory, which web workers don't give you.
Comment 6 Filip Pizlo 2014-05-05 21:30:35 PDT
(In reply to comment #2)
> In addition to more bits for fingerprinting, it's directly exploitable data - the web page will know whether you have expensive hardware.

How is this exploitable?

I don't buy that the number of cores reveals so many bits that we should care.
Comment 7 Rik Cabanier 2014-05-05 21:33:12 PDT
(In reply to comment #4)
> My turn to chime in.
> 
> - Reiterating the fingerprinting problem that's already been mentioned that we have no response to.

People on the Blink team don't share this concern. I also don't see how valuable this information is, especially since it's easy to infer.

> - "Number of physical cores" does not correlate to the number of simultaneous tasks supporter
> - "Number of physical cores" does not correlate to computing resources currently available.

Does WebKit limit the number of Web Worker threads like Gecko does, or does it launch as many as the author requests?
If it limits the amount, that should be the number that this API returns.
If not, this API should return the number of logical CPU instances.

> There modern approach to "let me do as much work as possible without hindering overall performance on the system" that at least some systems support.  Grand Central Dispatch on OS X, for example, will executed more tasks simultaneously when the resources are available but scale back when other apps demand some resources.
> Is there a real problem that having this bit available would solve that couldn't also be solved by a GCD-style mechanism?

Even with a GCD-like solution, an author would often want to know how many threads can run at the same time so a task is not broken up in too many pieces or too many tasks are launched.
A GCD-like solution is much more difficult to engineer and to agree upon (and IMO for most cases overkill.) Desktop apps have been doing fine with just knowing the number of CPUs; what makes the web platform different?
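The "not broken up in too many pieces" concern can be sketched like this (hypothetical helper; `cores` stands in for whatever number the API would return):

```javascript
// Sketch: split a job into one slice per available core, so the work
// is neither broken into too many pieces nor launched as too many tasks.
function chunkForCores(items, cores) {
  const size = Math.ceil(items.length / cores);
  const chunks = [];
  for (let i = 0; i < items.length; i += size)
    chunks.push(items.slice(i, i + size));
  return chunks;
}

// 10 work items on a 4-core machine -> 4 slices of at most 3 items.
console.log(chunkForCores([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 4));
// -> [ [1,2,3], [4,5,6], [7,8,9], [10] ]
```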
Comment 8 Alexey Proskuryakov 2014-05-05 22:08:26 PDT
> How so? There are 8-core ARM systems which will be common in a year or two.
> There's also an 8-core Intel Atom which is low cost. 
> MacBook Air's (which are expensive) are still stuck on 4 cores.

The web page can already tell if you are on a Mac or on an iPhone, so adding the number of cores to the mix will give quite a bit of insight.

> How is this exploitable?

User's hardware will feed into one or more of the choices advertisers have on AdWords. This proposed feature is not about numeric algorithms - every extra bit of insight into who the user is significantly increases the price of ads, that's all. Yes, we are not talking about disclosing whether the user is a pregnant woman, which would reportedly make the ads 15x more valuable, but if you can achieve even a slight degree of correlation with the user's income, that's money.
Comment 9 Filip Pizlo 2014-05-05 22:18:12 PDT
(In reply to comment #8)
> > How so? There are 8-core ARM systems which will be common in a year or two.
> > There's also an 8-core Intel Atom which is low cost. 
> > MacBook Air's (which are expensive) are still stuck on 4 cores.
> 
> The web page can already tell if you are on a Mac or on an iPhone, so adding the number of cores to the mix will give quite a bit of insight.
> 
> > How is this exploitable?
> 
> User's hardware will feed into one or more of the choices advertisers have on AdWords. This proposed feature is not about numeric algorithms - having each split bit of insight into who the user is significantly increases the price of ads, that's all. Yes, we are not talking about disclosing whether the user is a pregnant woman, which would reportedly make the ads 15x more valuable, but if you can achieve even a slight degree of correlation with user's income, that's money.

I get it. This reveals some tiny amount of information. I just don't see how it's enough to block a useful feature.
Comment 10 Brady Eidson 2014-05-06 11:48:12 PDT
(In reply to comment #7)
> (In reply to comment #4)
> > My turn to chime in.
> > 
> > - Reiterating the fingerprinting problem that's already been mentioned that we have no response to.
> 
> People on the blink team don't share this concern. I also don't see how valuable this information is, especially since it's easy to infer

Various security- and privacy-conscious groups have demonstrated just how readily individuals can be uniquely identified from very few bits of info (and this is definitely a bit).  These groups are active in standards communities, such as the W3C.  I've attended these discussions face to face, and learned a lot about the fingerprinting problem from them.

It seems bizarre to me that the Blink team would discount what they've demonstrated.

Do they not care about the privacy aspect of "anonymous" tracking?
Or do they have evidence that what these groups have demonstrated is wrong?  Has this evidence been shared?

> > - "Number of physical cores" does not correlate to the number of simultaneous tasks supporter
> > - "Number of physical cores" does not correlate to computing resources currently available.
> 
> Does WebKit limit the number of Web Worker threads like gecko does, or does it launch as many as the author requests?
> If it limits the amount, that should be the number that this API returns.
> If not, this API should return the number of logical CPU instances.

In either case, this returns a static number that is almost always incorrect for the stated use cases of the API, because the number of workers *allowed* is not the same as the number of cores *available*.
> 
> > There modern approach to "let me do as much work as possible without hindering overall performance on the system" that at least some systems support.  Grand Central Dispatch on OS X, for example, will executed more tasks simultaneously when the resources are available but scale back when other apps demand some resources.
> > Is there a real problem that having this bit available would solve that couldn't also be solved by a GCD-style mechanism?
> 
> Even with a GCD-like solution, an author would often want to know how many threads can run at the same time so a task is not broken up in too many pieces or too many tasks are launched.

I'm not sure I can think of a case where a parallelizable task

> Desktop apps have been doing fine with just knowing the number of CPUs; 

They have?

The main reason GCD style mechanisms were brought into existence was because parallelization was seen as the future of improving system performance, with more and more processes and threads in flight at any given time on the machine.  Simply knowing the number of CPUs was no longer good enough to let everyone play nice.  Any one process is not in a position to know just how many resources are available at any time, nor is it in a position to adapt to rapidly changing system load.

While I have sympathy for hard core gamers who dedicate their entire machine to playing one game, or to scientific applications where the entire machine is dedicated to a single long running calculation, those are not the primary users of a modern machine.  Modern machines are expected to multitask well without the user realizing what's going on behind the scenes.

In a web environment, say you're running some scientific app in one tab, but also have a background tab open to an audio player and a 3rd tab with a casual game open.  Should all tabs suffer because the first is hogging everything?

>what makes the web platform different?

In most ways the web platform is not different, with my previous comment applying equally to it.

But, yes, the web platform *is* different in one important way - it is much newer and has less legacy than "the operating system", and therefore we can avoid some of the same mistakes.
Comment 11 Brady Eidson 2014-05-06 11:59:58 PDT
(In reply to comment #5)
> (In reply to comment #4)
> > My turn to chime in.
> > 
> > - Reiterating the fingerprinting problem that's already been mentioned that we have no response to.
> > - "Number of physical cores" does not correlate to the number of simultaneous tasks supporter
> > - "Number of physical cores" does not correlate to computing resources currently available.
> > 
> 
> I believe that they want the equivalent of hw.availcpu which isn't what either of your bullets tell you. 

If it is meant to adapt to dynamically changing system load, is there a notification API to tell the page when the value has changed?

Because I doubt we'd want an API that requires polling.

Also, this is not what Rik suggested it is...  I definitely agree that we're discussing an unspec'ed feature!  :)

> > There modern approach to "let me do as much work as possible without hindering overall performance on the system" that at least some systems support.  Grand Central Dispatch on OS X, for example, will executed more tasks simultaneously when the resources are available but scale back when other apps demand some resources.
> > 
> > Is there a real problem that having this bit available would solve that couldn't also be solved by a GCD-style mechanism?
> 
> Yes. Absolutely. Numerical parallel algorithms are often written in a worker-per-CPU style and this ends up being faster than the mini-task model of GCD. 

I'm not up to date with state-of-the-art numerical parallel algorithms, so I'll definitely believe that this is how they are often written.

But are you actually saying that they cannot *possibly* be written with a mini-task model to perform almost as well?

Here's what I'm imagining.  It's simple and possibly quite naive, but it seems right to me.
An algorithm that wants to run 8 threads for 10 seconds each.
Scenario 1 - In the "I know I have 8 CPU cores" model, it spawns those 8 threads
Scenario 2 - In the "GCD mini task" scenario it queues up 8 tasks that might take ~1/4 second to complete, and when one completes it kicks off the next one.

In a perfectly ideal world of zero system load:
Scenario 1 spreads 80 seconds of work over 8 threads, and finishes in 10 seconds.
Scenario 2 spreads 80 seconds of work over 320 mini tasks, and finishes in 10 seconds + some GCD overhead time.

In a real world where the user is actively using their system:
Scenario 1 spreads 80 seconds of work over 8 threads, finishes in (possibly much) more than 10 seconds, and causes the system to be sluggish to the user in the meantime.
Scenario 2 spreads 80 seconds of work over 320 mini tasks, finishes in (possibly much) more than 10 seconds, but the system remains perfectly responsive to the user in the meantime.

While 8 threads is the winner in the zero-load case, it's only the winner by a small amount.
The mini-task model is the clear winner in the "system is being used" case, with a significant advantage.
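The mini-task scheduling described above can be sketched as follows (single-threaded here purely for illustration; a real implementation would hand tasks to workers or GCD queues):

```javascript
// Sketch of the mini-task model: a fixed number of "lanes" pull small
// tasks off a shared queue, so a lane that finishes early immediately
// picks up the next piece of work instead of one big job being pinned
// to each thread for its whole duration.
async function runQueue(tasks, lanes) {
  let next = 0;
  const results = new Array(tasks.length);
  async function lane() {
    while (next < tasks.length) {
      const i = next++;              // claim the next task index
      results[i] = await tasks[i](); // run it, record its result
    }
  }
  await Promise.all(Array.from({ length: lanes }, () => lane()));
  return results;
}
```

Queueing 320 quarter-second tasks with 8 lanes approximates Scenario 2 above.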
Comment 12 Brady Eidson 2014-05-06 12:05:58 PDT
(In reply to comment #11)
> (In reply to comment #5)
> > (In reply to comment #4)
> > > My turn to chime in.
> > > 
> > > - Reiterating the fingerprinting problem that's already been mentioned that we have no response to.
> > > - "Number of physical cores" does not correlate to the number of simultaneous tasks supporter
> > > - "Number of physical cores" does not correlate to computing resources currently available.
> > > 
> > 
> > I believe that they want the equivalent of hw.availcpu which isn't what either of your bullets tell you. 
> 
> If it is meant to adapt to dynamically changing system load, is there a notification API to tell the page when the value has changed?

Note that if this value *does* dynamically change, and we added a fudge factor, that would likely solve the fingerprinting concern.
Comment 13 Rik Cabanier 2014-05-06 12:26:29 PDT
Created attachment 230924 [details]
Patch
Comment 14 Filip Pizlo 2014-05-06 12:38:48 PDT
Comment on attachment 230924 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=230924&action=review

R=me but consider using hw.availcpu.  It *may* be what you actually want, but I'll let you decide.

I realize that there is still some discussion, but LGTM from me.  I'm leaving the "r?" to see if we achieve some consensus.

> Source/WebCore/page/Navigator.cpp:148
> +    sysctlbyname("hw.ncpu", &ncpu, &len, 0, 0);

I *think* you want hw.availcpu.  It's the same as ncpu in almost all cases, except where the developer has told the kernel to pretend like the machine has fewer cores for testing purposes.  Basically, if the kernel is told, via some boot-arg, that it should pretend like there are only 2 cores even though the machine has 16 cores, then ncpu==16 and availcpu==2.  In all other cases ncpu==availcpu.

So, this doesn't matter for end users - either value will be the same.  But parallel app developers often like to tell Instruments, or some other tool, to tell the system to limit the number of cores for testing.  It would be cool if WebKit relayed that information to the app.  All you have to do to make the "right thing" happen is just use hw.availcpu anywhere you think you want hw.ncpu. :-)
Comment 15 Brady Eidson 2014-05-06 12:45:21 PDT
Comment on attachment 230924 [details]
Patch

This patch
1 - Doesn't address the fingerprinting issue
2 - Implements a feature that a number of contributors/committers/reviewers are currently against.
Comment 16 Alexey Proskuryakov 2014-05-06 12:46:26 PDT
I still object to this.
Comment 17 Rik Cabanier 2014-05-06 12:53:31 PDT
(In reply to comment #15)
> (From update of attachment 230924 [details])
> This patch
> 1 - Doesn't address the fingerprinting issue
> 2 - Implements a feature that a number of contributers/committers/reviewers are currently against.

I will not land this patch until these issues and the name are resolved.
Comment 18 Filip Pizlo 2014-05-06 12:56:30 PDT
(In reply to comment #11)
> (In reply to comment #5)
> > (In reply to comment #4)
> > > My turn to chime in.
> > > 
> > > - Reiterating the fingerprinting problem that's already been mentioned that we have no response to.
> > > - "Number of physical cores" does not correlate to the number of simultaneous tasks supporter
> > > - "Number of physical cores" does not correlate to computing resources currently available.
> > > 
> > 
> > I believe that they want the equivalent of hw.availcpu which isn't what either of your bullets tell you. 
> 
> If it is meant to adapt to dynamically changing system load, is there a notification API to tell the page when the value has changed?

No.  hw.availcpu is used primarily, if not exclusively, by clients that only ask it once and never ask it again.

JavaScriptCore is one example of a framework that calls hw.availcpu.  We call it exactly once.

> 
> Because I doubt we'd want an API that requires polling.

You wouldn't do that.

> 
> Also, this is not what Rik suggested it is...  I definitely agree that we're discussing an unspec'ed feature!  :)

It is exactly what Rik suggested.  hw.availcpu is just the Instruments-friendly way of saying hw.ncpu and it's generally recommended that you use it anywhere that you would have used hw.ncpu.

> 
> > > There modern approach to "let me do as much work as possible without hindering overall performance on the system" that at least some systems support.  Grand Central Dispatch on OS X, for example, will executed more tasks simultaneously when the resources are available but scale back when other apps demand some resources.
> > > 
> > > Is there a real problem that having this bit available would solve that couldn't also be solved by a GCD-style mechanism?
> > 
> > Yes. Absolutely. Numerical parallel algorithms are often written in a worker-per-CPU style and this ends up being faster than the mini-task model of GCD. 
> 
> I'm not up to date with start of the art numerical parallel algorithms, so I'll definitely believe that this is how they are often written.
> 
> But are you actually saying that they cannot *possibly* be written with a mini-task model to perform almost as well?

Hold on a second.  Are you actually proposing GCD and shared memory for the web?  And if so, are you actually saying that we should hold off on all possible enhancements to the non-GCD message-passing concurrency model until the GCD+shared-memory model is implemented?

OK, now to answer your question: it's an open academic question whether all parallel algorithms can be efficiently and productively converted to use a task model.

Intuitively, the efficiency of tasks is never going to be as good: the way that parallel algorithms are written in terms of tasks is to create many small tasks and let the load balancer work it out.  But this prevents you from using algorithms that rely on projecting a topology onto the cores, and it incurs scheduling overhead.  The topology bit is the most important.  Efficient parallel algorithms assume that moving data from one core to another is expensive, and are engineered to minimize the number of times that this happens.  But to do this, you basically want to be able to say: create me a thread per CPU.  It turns out that thread scheduling algorithms will basically do the right thing if you do this.  If the machine is not otherwise under load, then each thread will basically stay on the same CPU for the duration of the algorithm, since the OS's thread migration heuristics understand that moving data between CPUs is costly.  If the machine does come under load, then the OS will put in a best effort to make things sensible.  In the worst case, it causes some load unbalancing.  Often these algorithms will have some amount of load balancing to handle this, but it's true that the whole point of the task model was to make this simpler.  It's just an open academic question whether it actually does make this simpler for algorithms where moving data between CPUs must be made explicit.

The other issue is productivity.  It's a fun exercise to implement the MPI-based parallel sort where a CPU is like a first-class thing - the entire algorithm is written around one thread per CPU.  Then it's even funner to compare this to a task-based sort.  Much to my surprise, the MPI-based one is simpler.  It's important to be careful when drawing conclusions from this.  In general, it is absolutely true that task-based shared memory is easier.  But if the reason why you're using tasks is to achieve a parallel speed-up on a complex algorithm then it's not necessarily the fastest path to a working solution.

The JavaScriptCore GC was a great example.  It was written in a short time and it relies heavily on thread-per-CPU.  Later, Geoff found a way to do it using GCD.  But it was more complicated and slower.  He had ideas of how to make it faster, but that would make it even more complicated.  Again, let's not try to generalize this: the point is that to write good code you need to choose the right tool for the job and sometimes thread-per-CPU is better.

> 
> Here's what I'm imagining.  It's simple and possibly quite naive, but it seems right to me.
> An algorithm that wants to run 8 threads for 10 seconds each.
> Scenario 1 - In the "I know I have 8 CPU cores" model, it spawns those 8 threads
> Scenario 2 - In the "GCD mini task" scenario it queues up 8 tasks that might take ~1/4 second to complete, and when one completes it kicks off the next one.
> 
> In a perfectly ideal world of zero system load:
> Scenario 1 spreads 80 seconds of work over 8 threads, and finishes in 80 seconds.
> Scenario 2 spreads 80 seconds of work over 320 mini tasks, and finishes in 80 seconds + some GCD overhead time.
> 
> In a real world where the user is actively using their system:
> Scenario 1 spreads 80 seconds of work over 8 threads, finishes in (possibly much) more than 80 seconds, and causes the system to be sluggish to the user in the meantime.
> Scenario 2 spreads 80 seconds of work over 320 mini tasks, finishes in (possible much) more than 80 seconds, but the system remains perfectly responsive to the user in the meantime.
> 
> While 8 threads is the winner in the zero-load case, it's only the winner by a small amount.
> The mini-task model is the clear winner in the "system is being used" case, with a significant advantage.

Yeah.  But my argument isn't solely about scheduling overhead.

In the model you propose, it's not clear to what extent GCD would know the cost model of moving a task spawned on one CPU to a different CPU and the consequences it could have for memory traffic.  I suspect it doesn't because it's optimized for responsiveness - it'll try to work-steal tasks onto whatever CPU is spare.  Scenario 2 with a loaded system may actually be *slower* than scenario 1 with a loaded system for some algorithms that try to ensure that their working set is local and cached.

Also, even if we assumed that GCD did have exactly the right heuristics, then algorithms that rely on careful data distribution would probably still be easier to write using threads.  I would agree that it's possible that GCD would outperform threads in that case, if GCD had some oracle that told it when and how to do load balancing.  But ultimately the point of making something parallel is as a productive way to speed up your code.  Look, I'd take a serial performance boost over one that involves threads any day of the week.  That's a no-brainer!  The only good reason for using parallelism is when the quickest path to achieving a speed-up requires it.  But then you want to make it easy to do and often, knowing the number of cores allows you to do it the quickest.

And, speaking of quickest, let's not forget that GCD is a shared memory model.  The very ability to shuffle tasks around between cores requires the ability to share memory!  We don't have shared memory concurrency in JavaScript.  I believe that it would be *far better* if JavaScript had shared memory concurrency, and if it did have such a thing, then clearly a GCD-like programming model would be the right way to go.  But we don't have any of this and it would take a long time to get it.
Comment 19 Build Bot 2014-05-06 13:24:18 PDT
Comment on attachment 230924 [details]
Patch

Attachment 230924 [details] did not pass mac-wk2-ews (mac-wk2):
Output: http://webkit-queues.appspot.com/results/6180927516442624

New failing tests:
svg/text/non-bmp-positioning-lists.svg
fast/dom/navigator-detached-no-crash.html
Comment 20 Build Bot 2014-05-06 13:24:23 PDT
Created attachment 230928 [details]
Archive of layout-test-results from webkit-ews-13 for mac-mountainlion-wk2

The attached test failures were seen while running run-webkit-tests on the mac-wk2-ews.
Bot: webkit-ews-13  Port: mac-mountainlion-wk2  Platform: Mac OS X 10.8.5
Comment 21 Brady Eidson 2014-05-06 15:17:06 PDT
(In reply to comment #18)
> (In reply to comment #11)
> > Also, this is not what Rik suggested it is...  I definitely agree that we're discussing an unspec'ed feature!  :)
> 
> It is exactly what Rik suggested.  hw.availcpu is just the Instruments-friendly way of saying hw.ncpu and it's generally recommended that you use it anywhere that you would have used hw.ncpu.

I misunderstood "availcpu" to mean "currently available cpus", but I guess it is "physically available CPUs"

If this really is a static, never-changing value, then the fingerprinting problem remains unsolved.

> > > > There modern approach to "let me do as much work as possible without hindering overall performance on the system" that at least some systems support.  Grand Central Dispatch on OS X, for example, will executed more tasks simultaneously when the resources are available but scale back when other apps demand some resources.
> > > > 
> > > > Is there a real problem that having this bit available would solve that couldn't also be solved by a GCD-style mechanism?
> > > 
> > > Yes. Absolutely. Numerical parallel algorithms are often written in a worker-per-CPU style and this ends up being faster than the mini-task model of GCD. 
> > 
> > I'm not up to date with start of the art numerical parallel algorithms, so I'll definitely believe that this is how they are often written.
> > 
> > But are you actually saying that they cannot *possibly* be written with a mini-task model to perform almost as well?
> 
> Hold on a second.  Are you actually proposing GCD and shared memory for the web?

GCD-style scheduling, yes.

Shared memory, no.

>  And if so, are you actually saying that we should hold off on all possible enhancements to the non-GCD message-passing concurrency model until the GCD+shared-memory model is implemented?

Absolutely not.

Suggesting we hold off on a *single* enhancement that involves adding new API that:
1 - Might not make sense now
2 - Might not make sense in the future
3 - Has to be supported forever
4 - Exposes user data in the meanwhile
...does not translate to "don't improve anything at all, ever"

Also note that I proposed GCD-style scheduling as a possible alternative to "let a web page know how many physical cores your CPU has so it can try to hog all of them at once"

There are likely other alternatives as well.

> OK, now to answer your question: it's an open academic question whether all parallel algorithms can be efficiently and productively converted to use a task model.
> 
> The topology bit is the most important.  Efficient parallel algorithms assume that moving data from one core to another is expensive, and are engineered to minimize the number of times that this happens.  But to do this, you basically want to be able to say: create me a thread per CPU.  It turns out that thread schedule algorithms will basically do the right thing if you do this.  If the machine is not otherwise under load, then each thread will basically stay on the same CPU for the duration of the algorithm, since the OS's thread migration heuristics understand that moving data between CPUs is costly.  If the machine does come under load, then the OS will put in a best effort to make things sensible.  In the worst case, it causes some load unbalancing.  Often these algorithms will have some amount of load balancing to handle this, but it's true that the whole point of the task model was to make this simpler.  It's just an open academic question as to whether it actually does make this simpler for algorithms whether moving data between CPUs must be made explicit.

My TL;DR of the above:
- An algorithm would love to be able to say "run this 1 thread on this 1 CPU"
- Thread schedulers try their best
- They can never succeed fully on a machine where the user is active because there's variable load on the machine
- The task model may or may not affect the "data moving between CPUs is a big penalty" goal.

I would posit that in the very same way that a thread scheduler tries its best to keep 1 thread on 1 CPU whenever possible, a task scheduler can also try its best to keep a pool of tasks that are known to be related on 1 CPU whenever possible.

>  But if the reason why you're using tasks is to achieve a parallel speed-up on a complex algorithm then it's not necessarily the fastest path to a working solution.

The reason is not to make things faster.  The reason is to share system resources effectively.

Telling an app "You are allowed to use all resources in the system" was a no-brainer back in the MS-DOS days.

Today, we need to be able to tell an app "You are allowed to use all of the resources in the system that are available, but you will not be allowed to do so to the detriment of the other apps on the system, so at any given moment you may have more or less resources available than the previous moment."

> 
> The JavaScriptCore GC was a great example.  It was written in a short time and it relies heavily on thread-per-CPU.  Later, Geoff found a way to do it using GCD.  But it was more complicated and slower.  He had ideas of how to make it faster, but that would make it even more complicated.  Again, let's not try to generalize this: the point is that to write good code you need to choose the right tool for the job and sometimes thread-per-CPU is better.

I'm not arguing that a task model is always a better tool than thread-per-CPU.  Of course not.
Coding a task model will always be harder, in fact!

But there are tradeoffs everywhere.

"Sync XHR" let developers write good code.  Easy to write, easy to read, easy to understand.

If it had never existed and somebody was proposing it today, I would argue it was a terrible idea.

> > Here's what I'm imagining.  It's simple and possibly quite naive, but it seems right to me.
> > An algorithm that wants to run 8 threads for 10 seconds each.
> > Scenario 1 - In the "I know I have 8 CPU cores" model, it spawns those 8 threads
> > Scenario 2 - In the "GCD mini task" scenario it queues up 8 tasks that might take ~1/4 second to complete, and when one completes it kicks off the next one.
> > 
> > In a perfectly ideal world of zero system load:
> > Scenario 1 spreads 80 seconds of work over 8 threads, and finishes in 80 seconds.
> > Scenario 2 spreads 80 seconds of work over 320 mini tasks, and finishes in 80 seconds + some GCD overhead time.
> > 
> > In a real world where the user is actively using their system:
> > Scenario 1 spreads 80 seconds of work over 8 threads, finishes in (possibly much) more than 80 seconds, and causes the system to be sluggish to the user in the meantime.
> > Scenario 2 spreads 80 seconds of work over 320 mini tasks, finishes in (possible much) more than 80 seconds, but the system remains perfectly responsive to the user in the meantime.
> > 
> > While 8 threads is the winner in the zero-load case, it's only the winner by a small amount.
> > The mini-task model is the clear winner in the "system is being used" case, with a significant advantage.
> 
> Yeah.  But my argument isn't solely about scheduling overhead.
> 
> In the model you propose, it's not clear to what extent GCD would know the cost model of moving a task spawned on one CPU to a different CPU and the consequences it could have for memory traffic.  I suspect it doesn't because it's optimized for responsiveness - it'll try to work-steal tasks onto whatever CPU is spare.  Scenario 2 with a loaded system may actually be *slower* than scenario 1 with a loaded system for some algorithms that try to ensure that their working set is local and cached.

By associating tasks together, a GCD-style scheduler can have the same CPU affinity hints that a thread scheduler has, so it would be capable of the same *best* case scenario minus scheduling overhead.

e.g. for the science professor who starts a long-running simulation then heads home for the evening with his computer otherwise idle, it would hum along swimmingly.

> And, speaking of quickest, let's not forget that GCD is a shared memory model.  The very ability to shuffle tasks around between cores requires the ability to share memory!  We don't have shared memory concurrency in JavaScript.  I believe that it would be *far better* if JavaScript had shared memory concurrency, and if it did have such a thing, then clearly a GCD-like programming model would be the right way to go.  But we don't have any of this and it would take a long time to get it.

"GCD for native OS X apps" assumes shared memory, yes.  But I think there's a lot of directions one could go designing a mini-task scheduler that do not rely on shared memory and wouldn't pay a substantial performance penalty.
Comment 22 Rik Cabanier 2014-05-06 15:46:25 PDT
Created attachment 230942 [details]
Patch
Comment 23 Filip Pizlo 2014-05-06 16:08:14 PDT
(In reply to comment #21)
> (In reply to comment #18)
> > (In reply to comment #11)
> > > Also, this is not what Rik suggested it is...  I definitely agree that we're discussing an unspec'ed feature!  :)
> > 
> > It is exactly what Rik suggested.  hw.availcpu is just the Instruments-friendly way of saying hw.ncpu and it's generally recommended that you use it anywhere that you would have used hw.ncpu.
> 
> I misunderstood "availcpu" to mean "currently available cpus", but I guess it is "physically available CPUs"
> 
> If this really is a static, never-changing value, then the fingerprinting problem remains unsolved.

The size of the window reveals more information than this.  Bigger monitors mean better hardware!  I think this is a weak argument.

> 
> > > > > There's a modern approach to "let me do as much work as possible without hindering overall performance on the system" that at least some systems support.  Grand Central Dispatch on OS X, for example, will execute more tasks simultaneously when the resources are available but scale back when other apps demand some resources.
> > > > > 
> > > > > Is there a real problem that having this bit available would solve that couldn't also be solved by a GCD-style mechanism?
> > > > 
> > > > Yes. Absolutely. Numerical parallel algorithms are often written in a worker-per-CPU style and this ends up being faster than the mini-task model of GCD. 
> > > 
> > > I'm not up to date with state of the art numerical parallel algorithms, so I'll definitely believe that this is how they are often written.
> > > 
> > > But are you actually saying that they cannot *possibly* be written with a mini-task model to perform almost as well?
> > 
> > Hold on a second.  Are you actually proposing GCD and shared memory for the web?
> 
> GCD-style scheduling, yes.
> 
> Shared memory, no.
> 
> >  And if so, are you actually saying that we should hold off on all possible enhancements to the non-GCD message-passing concurrency model until the GCD+shared-memory model is implemented?
> 
> Absolutely not.
> 
> Suggesting we hold off on a *single* enhancement that involves adding new API that:
> 1 - Might not make sense now
> 2 - Might not make sense in the future
> 3 - Has to be supported forever
> 4 - Exposes user data in the meanwhile
> ...does not translate to "don't improve anything at all, ever"
> 
> Also note that I proposed GCD-style scheduling as a possible alternative to "let a web page know how many physical cores your CPU has so it can try to hog all of them at once"
> 
> There are likely other alternatives as well.
> 
> > OK, now to answer your question: it's an open academic question whether all parallel algorithms can be efficiently and productively converted to use a task model.
> > 
> > The topology bit is the most important.  Efficient parallel algorithms assume that moving data from one core to another is expensive, and are engineered to minimize the number of times that this happens.  But to do this, you basically want to be able to say: create me a thread per CPU.  It turns out that thread scheduling algorithms will basically do the right thing if you do this.  If the machine is not otherwise under load, then each thread will basically stay on the same CPU for the duration of the algorithm, since the OS's thread migration heuristics understand that moving data between CPUs is costly.  If the machine does come under load, then the OS will put in a best effort to make things sensible.  In the worst case, it causes some load unbalancing.  Often these algorithms will have some amount of load balancing to handle this, but it's true that the whole point of the task model was to make this simpler.  It's just an open academic question as to whether it actually does make this simpler for algorithms where moving data between CPUs must be made explicit.
> 
> My TL;DR of the above:
> -An algorithm would love to be able to say "run this 1 thread on this 1 cpu"
> -Thread schedulers try their best
> -They can never succeed fully on a machine where the user is active because there's variable load on the machine
> -The task model may or may not affect the "data moving between CPUs is a big penalty" goal.
> 
> I would posit that the very same way that a thread scheduler tries its best to keep 1 thread on 1 CPU whenever possible, a task scheduled can also try its best to keep a pool of tasks that are known to be related on 1 CPU whenever possible.

If you want to write a parallel algorithm using web workers, then the best bet of how many workers to start is to match the number of cores.  This is how parallel code tends to get written using threads.

Consider what you would do without this API.  You'd still start a bunch of workers and have them do work.  You'd just have no idea how many to start.  You'd choose some number - say, 8.  That would be guaranteed to be suboptimal on any device that doesn't have 8 cores, and it would be pretty much optimal for those machines that have exactly 8 cores.

The point of this API would be to *allow* the app to play nice.

The lack of this API doesn't mean that apps can't do the bad thing - it just means that an app that wants to behave nicely won't have any way to do so.
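A minimal sketch of that "play nice" sizing logic (the helper name and the fallback/clamp values here are illustrative assumptions, not part of the proposal or of this patch):

```javascript
// Hypothetical helper: pick a worker-pool size from a reported core
// count.  The name and the fallback/clamp values are made up for
// illustration; nothing here is from the spec proposal or the patch.
function chooseWorkerCount(reported, fallback = 4, max = 8) {
  if (!Number.isInteger(reported) || reported < 1) return fallback;
  return Math.min(reported, max); // behave nicely: don't grab every core
}

// In a browser this would consult the proposed property; the guard lets
// the sketch run in environments without `navigator`.
const reportedCores =
  typeof navigator !== "undefined" ? navigator.hardwareConcurrency : undefined;
const poolSize = chooseWorkerCount(reportedCores);
```

Without the property, the fallback guess is exactly the blind "choose some number - say, 8" problem described above.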

> 
> >  But if the reason why you're using tasks is to achieve a parallel speed-up on a complex algorithm then it's not necessarily the fastest path to a working solution.
> 
> The reason is not to make things faster.  The reason is to share system resources effectively.
> 
> Telling an app "You are allowed to use all resources in the system" was a no-brainer back in the MS-DOS days.
> 
> Today, we need to be able to tell an app "You are allowed to use all of the resources in the system that are available, but you will not be allowed to do so to the detriment of the other apps on the system, so at any given moment you may have more or less resources available than the previous moment."

But starting N threads if you have N processors doesn't allow the app to hog all of the resources, any more than running a single-threaded app on one processor would.  If other apps want to use any of the CPUs, they will be allowed to do so - after all, the OS can do context switching and it has a pretty good model of latency vs. throughput tasks.  The latency ones end up getting higher priority.

> 
> > 
> > The JavaScriptCore GC was a great example.  It was written in a short time and it relies heavily on thread-per-CPU.  Later, Geoff found a way to do it using GCD.  But it was more complicated and slower.  He had ideas of how to make it faster, but that would make it even more complicated.  Again, let's not try to generalize this: the point is that to write good code you need to choose the right tool for the job and sometimes thread-per-CPU is better.
> 
> I'm not arguing that a task model is always a better tool than thread-per-CPU.  Of course not.
> Coding a task model will always be harder, in fact!
> 
> But there's tradeoffs everywhere.
> 
> "Sync XHR" let developers write good code.  Easy to write, easy to read, easy to understand.
> 
> If it had never existed and somebody was proposing it today, I would argue it was a terrible idea.

Sync XHR is a bad analogy to what is being asked for here - that *did* allow a web app to stall something that is meaningful to the user.

Also, we *already have* web workers and it's already possible to start a bunch of workers and have them spin up your CPUs.  The only thing this API changes is that it allows an app to not be too intrusive, by limiting the number of workers it starts to some sensible number.

> 
> > > Here's what I'm imagining.  It's simple and possibly quite naive, but it seems right to me.
> > > An algorithm that wants to run 8 threads for 10 seconds each.
> > > Scenario 1 - In the "I know I have 8 CPU cores" model, it spawns those 8 threads
> > > Scenario 2 - In the "GCD mini task" scenario it queues up 8 tasks that might take ~1/4 second to complete, and when one completes it kicks off the next one.
> > > 
> > > In a perfectly ideal world of zero system load:
> > > Scenario 1 spreads 80 seconds of work over 8 threads, and finishes in 80 seconds.
> > > Scenario 2 spreads 80 seconds of work over 320 mini tasks, and finishes in 80 seconds + some GCD overhead time.
> > > 
> > > In a real world where the user is actively using their system:
> > > Scenario 1 spreads 80 seconds of work over 8 threads, finishes in (possibly much) more than 80 seconds, and causes the system to be sluggish to the user in the meantime.
> > > Scenario 2 spreads 80 seconds of work over 320 mini tasks, finishes in (possible much) more than 80 seconds, but the system remains perfectly responsive to the user in the meantime.
> > > 
> > > While 8 threads is the winner in the zero-load case, it's only the winner by a small amount.
> > > The mini-task model is the clear winner in the "system is being used" case, with a significant advantage.
> > 
> > Yeah.  But my argument isn't solely about scheduling overhead.
> > 
> > In the model you propose, it's not clear to what extent GCD would know the cost model of moving a task spawned on one CPU to a different CPU and the consequences it could have for memory traffic.  I suspect it doesn't because it's optimized for responsiveness - it'll try to work-steal tasks onto whatever CPU is spare.  Scenario 2 with a loaded system may actually be *slower* than scenario 1 with a loaded system for some algorithms that try to ensure that their working set is local and cached.
> 
> By associating tasks together, a GCD-style scheduler can have the same CPU affinity hints that a thread scheduler has, so it would be capable of the same *best* case scenario minus scheduling overhead.

Right, it *can* have.  Whether this gives you good performance is still an open question.

> 
> i.e. For the science professor that starts a long running simulation then heads home for the evening with his computer otherwise idle, it would hum along swimmingly.
> 
> > And, speaking of quickest, let's not forget that GCD is a shared memory model.  The very ability to shuffle tasks around between cores requires the ability to share memory!  We don't have shared memory concurrency in JavaScript.  I believe that it would be *far better* if JavaScript had shared memory concurrency, and if it did have such a thing, then clearly a GCD-like programming model would be the right way to go.  But we don't have any of this and it would take a long time to get it.
> 
> "GCD for native OS X apps" assumes shared memory, yes.  But I think there's a lot of directions one could go designing a mini-task scheduler that do not rely on shared memory and wouldn't pay a substantial performance penalty.

You can't do GCD-like scheduling in JavaScript without shared memory.  It's almost impossible - as in, you could do it with some brute force but it would be useless for the purpose of getting things to run fast.

GCD-like scheduling involves creating closures that carry code and data.  Crucially, there is no cost to "sending" the data other than the hardware cost of cache coherence.  In particular, there is no need to serialize the transitive closure of objects reachable from the data that the code may capture.  Having software that does this serialization would be sufficiently expensive that you would have to use very coarse-grained tasks rather than mini tasks.  Coarse-grained tasks are just web workers. ;-)

Let me just rant on this a bit more.  To make a JavaScript GCD-like system without shared memory, you'd need to provide for serialization of all of the data that the closure wants.  It turns into a hairy disaster very quickly.  In fact, there's an experimental language at IBM that has been in development for over 10 years, called, ironically, X10.  It serializes closures.  It was supposed to revolutionize parallel computing.  Hardly anybody uses this language.  Automatically serializing all captured state turned out to be super slow, so they started hacking the language with weird type rules that control what gets serialized, what gets proxied, what is immutable, etc.  I think that this aspect of the language has been universally panned.  Anyway, X10 is a dead language - and I think a major part of its failure is that it tried to mix the separate-heap (i.e. no shared memory) model with the task model.

And just for completeness, there are languages like ML and Erlang that allow separate heaps *and* tasks but they do that by making all objects immutable.  Of course that will never happen in JavaScript since it's an imperative language down to the core.

So it's a bit subtle: if you're willing to give up mutability, then anything is possible.  If you've got shared memory, then anything is possible.  But if you require a heap-per-task model then things get super hairy, and that is why workers work the way they do.
Comment 24 Brady Eidson 2014-05-06 17:21:10 PDT
(In reply to comment #23)
> (In reply to comment #21)
> > (In reply to comment #18)
> > > (In reply to comment #11)
> > > > Also, this is not what Rik suggested it is...  I definitely agree that we're discussing an unspec'ed feature!  :)
> > > 
> > > It is exactly what Rik suggested.  hw.availcpu is just the Instruments-friendly way of saying hw.ncpu and it's generally recommended that you use it anywhere that you would have used hw.ncpu.
> > 
> > I misunderstood "availcpu" to mean "currently available cpus", but I guess it is "physically available CPUs"
> > 
> > If this really is a static, never-changing value, then the fingerprinting problem remains unsolved.
> 
> The size of the window reveals more information than this.  Bigger monitors mean better hardware!  I think this is a weak argument.

It seems that a few folks in this discussion are not quite familiar with what browser fingerprinting means.

It's not about deciding something absolute and known about the user's hardware.
e.g. "This guy has a fully loaded Mac Pro with 3 high-res monitors!"

It's about uniquely identifying individual users based on bits of entropy from their browser, even if the user tries all the normal tricks to remain private (block cookies, private browsing, do not track, etc)
e.g. "From this user agent I've learned 21 bits of information about this unique user's browser.  If I see a visit from another user agent in the future with exactly the same 21 bits of data, I can be somewhat sure he is the same user."

Here's minutes from the face-to-face I'm directly familiar with:
http://www.w3.org/wiki/Fingerprinting

This writeup from the w3c also has links to more great papers:
https://www.w3.org/2014/strint/papers/41.pdf

Window size can be set programmatically and changes dynamically, so it's not a great example of a fingerprinting bit.

This is why I would be *supportive* of a property that reflects dynamic load on the system, and even more supportive of a property whose precision is low enough such that we can add a random fudge factor.

Such a property would be useless as a bit of entropy.

> ...more stuff about threads and GCD and such below...

Right now I've only had time to respond to the fingerprinting concern.  I'll look at these later.
Comment 25 Brady Eidson 2014-05-07 21:30:31 PDT
r-

It seems there are two unresolved questions here:
1 - This introduces a new fingerprinting vector that I believe we can avoid.
2 - Many contributors are not happy with the actual use-case of the API itself.

On point #1, Filip and I had a discussion IRL today and I think we came up with a good solution. [1]

On point #2, I think there's more work proponents of this API have to do in convincing the opponents.

*[1] - I will detail this conversation in a separate comment.
Comment 26 Brady Eidson 2014-05-07 21:57:06 PDT
In person, Filip argued that the numerical algorithms that benefit the most from this currently only scale well up to 8 cores.

Past 8 threads, they start to lose additional efficiency gains from adding more threads, or at least from using absolutely all possible cores.

He also argued that, from a fingerprinting perspective, this is barely a bit of entropy, as values would typically be powers of two - 1, 2, 4, 8, 16, etc. - with the only common sizes being 1, 2, 4, and 8.

I reminded him that this is not actually true - Apple, for example, has shipped devices with 1, 2, 4, 6, 8, 12, 16, and 24 cores.  Additionally when you look at Intel's processor family (just to pick on desktop CPUs), you can have any number of logical cores that is a multiple of 2, up to about 24.

So in this conversation:
1 - I conceded that 1, 2, 4, and 8 cores are so common that they aren't a meaningful fingerprint.
2 - We both agreed that having one of the "weirder" number of cores above 8 is a much more meaningful fingerprinting bit.
3 - Filip conceded that possible answers higher than 8 aren't meaningful to known high performance programming techniques.
...and finally...
4 - We both agreed that for an actual number of cores higher than 8, the answer could be designed both to negate the fingerprinting concern *and* not hurt known high performance programming techniques.

With that conclusion we thought of two techniques that satisfy the compromise:
1 - For "number of cores more than 8", just always answer 8.
2 - For "number of cores more than 8", give a fudge factor that changes.  For example on one page visit a 12 core machine might report 10, and on another visit it might report 14.

As long as other people concerned about fingerprinting agree with the above, I think we can consider the fingerprinting problem solved.
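As a sketch, the two techniques could look like this (the function names and the exact fudge range are made up for illustration; this is not the code under review):

```javascript
// Sketch of the two proposed anti-fingerprinting techniques.
function clampedCores(actual) {
  // Technique 1: "number of cores more than 8" just always answers 8.
  return Math.min(actual, 8);
}

function fudgedCores(actual, rand = Math.random) {
  // Technique 2: above 8 cores, report a value fudged by -2..+2, drawn
  // once per page visit so repeat visits see different answers.  The
  // rand parameter is injectable only so the sketch is testable.
  if (actual <= 8) return actual;
  const fudge = Math.floor(rand() * 5) - 2; // integer in -2..+2
  return Math.max(8, actual + fudge);
}
```

Either way, the common 1/2/4/8 machines answer exactly, and only the "weirder" counts get obscured.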
Comment 27 Rik Cabanier 2014-05-07 22:05:44 PDT
(In reply to comment #26)
> 
> With that conclusion we thought of two techniques that satisfy the compromise:
> 1 - For "number of cores more than 8", just always answer 8.

Sounds reasonable.

> 2 - For "number of cores more than 8", give a fudge factor that changes.  For example on one page visit a 12 core machine might report 10, and on another visit it might report 14.

Maybe not more, as that might throw off optimizations.

I would be happy with either choice.
Comment 28 Brady Eidson 2014-05-08 07:37:29 PDT
(In reply to comment #27)
> (In reply to comment #26)
> > 2 - For "number of cores more than 8", give a fudge factor that changes.  For example on one page visit a 12 core machine might report 10, and on another visit it might report 14.
> 
> Maybe not more as that might throw off optimizations.

Filip was quite convincing when he described that a common technique for these types of algorithms was to over-commit by a thread or two, as usually no appreciable slowdown is seen.

I'm not sure if that's universally applicable to all scenarios, though.
Comment 29 Filip Pizlo 2014-05-08 07:48:07 PDT
(In reply to comment #28)
> (In reply to comment #27)
> > (In reply to comment #26)
> > > 2 - For "number of cores more than 8", give a fudge factor that changes.  For example on one page visit a 12 core machine might report 10, and on another visit it might report 14.
> > 
> > Maybe not more as that might throw off optimizations.
> 
> Filip was quite convincing when he described that a common technique for these types of algorithms was to over-commit by a thread or two, as usually no appreciable slowdown is seen.
> 
> I'm not sure if that's universally applicable to all scenarios, though.

It may not be. My point was actually that whenever I've written parallel algorithms I've observed that if you lie to it and say you have one or two more cores than you really do, then the algorithm may perform a bit differently but never much worse. 

But the school of thought I come from, back when I was doing this in academia, called for any parallel algorithm to be tested by varying the number of threads from N = 1 to N = ncpu * 2, on the grounds that you want to see how it scales and you also want to see what happens under pathological load. You expect a perfect algorithm to see no degradation for N > ncpu, but in practice all of them degrade. My recollection, though, is that the degradation doesn't actually start until you're at N = ncpu * 1.5 or so.

At a meta level though, I don't see the point of telling a web page how many cores you actually have, if you have an unusually large number of them.  The web site is likely to be tested on the "median" core counts like 2, 4 and 8 - it's unlikely that you'll get much value add above 8 except maybe in very special cases***. So, I really like the idea of WebKit returning essentially min(hw.availcpu, 8).

*** Photoshop!  Its various filters could effectively use all of my cores and I would want it to if I was using its web version. But it's already the case that Photoshop users tweak its performance settings - at least I always have for as long as I can remember. It wouldn't be too offensive to say that Photoshop defaults to 8 cores on my 128-core machine, but then I can go into the settings and tell it to crank it up.
Comment 30 Rik Cabanier 2014-05-08 09:44:37 PDT
Created attachment 231074 [details]
Patch
Comment 31 Rik Cabanier 2014-05-08 09:47:18 PDT
(In reply to comment #29)
> (In reply to comment #28)
> > (In reply to comment #27)
> > > (In reply to comment #26)
> > > > 2 - For "number of cores more than 8", give a fudge factor that changes.  For example on one page visit a 12 core machine might report 10, and on another visit it might report 14.
> > > 
> > > Maybe not more as that might throw off optimizations.
> > 
> > Filip was quite convincing when he described that a common technique for these types of algorithms was to over-commit by a thread or two, as usually no appreciable slowdown is seen.
> > 
> > I'm not sure if that's universally applicable to all scenarios, though.
> 
> It may not be. My point was actually that whenever I've written parallel algorithms I've observed that if you lie to it and say you have one or two more cores than you really do, then the algorithm may perform a bit differently but never much worse. 
> 
> But, the school of thought I come from when I was doing this in academia called for any parallel algorithm to be tested by varying the number of threads from N = 1 to N = ncpu * 2, on the grounds that you want to see how it scales and you also want to see what happens under pathological load. You expect a perfect algorithm to see no degradation above N > ncpu, but in practice all of them degrade. My recollection, though, is that the degradation doesn't actually start until you're at N = ncpu * 1.5 or so.
> 
> At a meta level though, I don't see the point of telling a web page that you have all of the cores you actually have, if you have an unusually large number of cores. The web site is likely to be tested on the "median" core counts like 2, 4 and 8 - it's unlikely that you'll get much value add above 8 except maybe in very special cases***. So, I really like the idea of WebKit returning essentially min(hw.availcpu, 8). 

I agree with this and updated the patch so it returns a maximum of 8.

> *** Photoshop!  It's various filters could effectively use all of my cores and I would want it to if I was using its web version. But it's already the case that Photoshop users tweak it's performance setting a - at least I always have for as long as I can remember. It wouldn't be too offensive to say that Photoshop defaults to 8 cores on my 128 core machine, but then I can go into the settings and tell it to crank it up.

Yeah, to do PS-like optimizations, we'll need more information about the memory size, scratch disks, etc.
Maybe later :-)
Comment 32 Brady Eidson 2014-05-08 10:13:13 PDT
Okay, I'm happy with returning a max of 8 cores.

I think there's still unhappiness with the entire concept from various others, so I'm not going to r+.

(But won't r-)
Comment 33 Geoffrey Garen 2014-05-08 11:46:18 PDT
Not a specific commentary on this patch, but relevant: In the web context, even if we have navigator.hardwareConcurrency, we will definitely also want some kind of WorkerSet, which has these abilities:

    - Spawns and kills workers automatically based on system load and program load
    - postMessage to the WorkerSet forwards to an automatically selected child worker, which is currently idle

The reason we will want this -- and what is different between the web and our experience with threads -- is that workers don't have condition variables. So, there's no efficient way for a pre-spawned worker to ask, "Is there more work for me to do?", and for the scheduler to choose the right worker to say "yes" to.

You could postMessage to a round-robin-selected worker in an array of navigator.hardwareConcurrency workers, but that worker might be busy handling messages you posted previously, so you'd get bad concurrency or no concurrency.

The only reliable way to postMessage to an idle worker, if there is no WorkerSet, is to hand-roll a WorkerSet, run a meta-program in each of its child workers, pass an initial message giving the meta-program a UID and a child program, and require the meta-program to postMessage back to the hand-rolled WorkerSet with its UID when its child program completes any task. This probably also requires all child programs to be written to a specific meta-program API, to provide the necessary hooks.

The hand-rolled WorkerSet is a pretty spaghetti way to program and, at a minimum, it doubles the synchronization overhead over a native WorkerSet solution. Also, it doesn't change the fundamental equation: Some version of WorkerSet is a requirement of any high performance parallel worker program.

Just spawning navigator.hardwareConcurrency workers only works in the trivial and rare case of "embarrassingly parallel" programs, where the programmer knows, in advance, exactly how much computation needs to happen, and the computation involves no internal dependencies. In any meaningful workload, you need a WorkerSet abstraction.
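A minimal sketch of such a hand-rolled WorkerSet, with the worker factory injected (it stands in for something like `() => new Worker("task.js")`) so the idle-tracking logic is visible on its own; the class and method names are made up for illustration:

```javascript
// Minimal sketch of the hand-rolled "WorkerSet" described above.  Each
// worker created by the factory is expected to postMessage back to the
// set when its current task completes; the onmessage hook models that.
class WorkerSet {
  constructor(size, factory) {
    this.idle = [];
    this.queue = [];
    for (let i = 0; i < size; i++) {
      const worker = factory(i);
      worker.onmessage = () => this.release(worker); // completion hook
      this.idle.push(worker);
    }
  }
  post(message) {
    const worker = this.idle.pop();
    if (worker) worker.postMessage(message); // hand off to an idle worker
    else this.queue.push(message);           // all busy: queue for later
  }
  release(worker) {
    const next = this.queue.shift();
    if (next !== undefined) worker.postMessage(next);
    else this.idle.push(worker);
  }
}
```

This captures only the "route work to an idle worker" half; the UID/meta-program plumbing needed with real Workers is exactly the spaghetti the comment above describes.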
Comment 34 Eli Grey (:sephr) 2014-05-08 12:04:32 PDT
(In reply to comment #8)
> The web page can already tell if you are on a Mac or on an iPhone, so adding the number of cores to the mix will give quite a bit of insight.
> 
> > How is this exploitable?
> 
> User's hardware will feed into one or more of the choices advertisers have on AdWords. This proposed feature is not about numeric algorithms - having each split bit of insight into who the user is significantly increases the price of ads, that's all. Yes, we are not talking about disclosing whether the user is a pregnant woman, which would reportedly make the ads 15x more valuable, but if you can achieve even a slight degree of correlation with user's income, that's money.

I can already approximate your core count with the timing-attack-based navigator.cores polyfill I made, which is mentioned on the whatwg wiki. It's not the most accurate ever, especially under high system load, but I can quite reliably detect whether you're on a dual-core, quad-core, or higher (Mac Pro) Mac product.
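For illustration, the core idea behind such a timing attack can be sketched as a pure estimator over measured wall times (this is not the actual polyfill code; the function name and tolerance value are made-up assumptions):

```javascript
// Sketch of the timing side channel: run the same fixed busy-loop on k
// parallel workers for increasing k, and estimate the core count as the
// largest k whose wall-clock time stays close to the single-worker time.
// `timings` maps worker count -> measured wall time; a real polyfill
// would gather it by spawning Workers, here it is injected so the
// estimator is a pure, testable function.
function estimateCores(timings, tolerance = 1.25) {
  const base = timings[1];
  let estimate = 1;
  for (const [k, elapsed] of Object.entries(timings)) {
    if (elapsed <= base * tolerance) estimate = Math.max(estimate, Number(k));
  }
  return estimate;
}
```

On a quiet quad-core machine, the jump in wall time between k = 4 and k = 8 is what gives the count away; system load blurs the signal, which matches the accuracy caveat above.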
Comment 35 Eli Grey (:sephr) 2014-05-08 12:09:49 PDT
(In reply to comment #33)
> Not a specific commentary on this patch, but relevant: In the web context, even if we have navigator.hardwareConcurrency, we will definitely also want some kind of WorkerSet, which has these abilities:

We already have a 'WorkerSet' for managing all of these workers: it's called your OS scheduler.

> Just spawning navigator.hardwareConcurrency workers only works in the trivial and rare case of "embarrassingly parallel" programs, where the programmer knows, in advance, exactly how much computation needs to happen, and the computation involves no internal dependencies. In any meaningful workload, you need a WorkerSet abstraction.

Embarrassingly parallel algorithms are the *only* use case that my navigator.cores proposal is for. If you have other needs, such as background processing for a non-parallelizable algorithm, then one worker per discrete "job" is already enough.
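The embarrassingly parallel case reduces to slicing the input into one chunk per reported core. A rough sketch (the worker script name "job.js" in the comment is made up):

```javascript
// Split an embarrassingly parallel job into one slice per reported core.
// If there are fewer items than cores, fewer slices are produced.
function partition(items, cores) {
  const size = Math.ceil(items.length / cores);
  const slices = [];
  for (let i = 0; i < items.length; i += size)
    slices.push(items.slice(i, i + size));
  return slices;
}

// In a page, each slice would then go to its own worker, e.g.:
//   partition(data, navigator.hardwareConcurrency)
//     .forEach(s => new Worker("job.js").postMessage(s));
```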
Comment 36 Filip Pizlo 2014-05-08 12:18:13 PDT
(In reply to comment #33)
> Not a specific commentary on this patch, but relevant: In the web context, even if we have navigator.hardwareConcurrency, we will definitely also want some kind of WorkerSet, which has these abilities:
> 
>     - Spawns and kills workers automatically based on system load and program load
>     - postMessage to the WorkerSet forwards to an automatically selected child worker, which is currently idle
> 
> The reason we will want this -- and what is different between the web and our experience with threads -- is that workers don't have condition variables. So, there's no efficient way for a pre-spawned worker to ask, "Is there more work for me to do?", and for the scheduler to choose the right worker to say "yes" to.
> 
> You could postMessage to a round-robin-selected worker in an array of navigator.hardwareConcurrency workers, but that worker might be busy handling messages you posted previously, so you'd get bad concurrency or no concurrency.
> 
> The only reliable way to postMessage to an idle worker, if there is no WorkerSet, is to hand-roll a WorkerSet, run a meta-program in each of its child workers, pass an initial message giving the meta-program a UID and a child program, and require the meta-program to postMessage back to the hand-rolled WorkerSet with its UID when its child program completes any task. This probably also requires all child programs to be written to a specific meta-program API, to provide the necessary hooks.
> 
> The hand-rolled WorkerSet is a pretty spaghetti way to program and, at a minimum, it doubles the synchronization overhead over a native WorkerSet solution. Also, it doesn't change the fundamental equation: Some version of WorkerSet is a requirement of any high performance parallel worker program.
> 
> Just spawning navigator.hardwareConcurrency workers only works in the trivial and rare case of "embarrassingly parallel" programs, where the programmer knows, in advance, exactly how much computation needs to happen, and the computation involves no internal dependencies. In any meaningful workload, you need a WorkerSet abstraction.

I think that either WorkerSet can be implemented as a library, or we just need to figure out what is it about Workers that makes this hard to implement as a library.

Also, I don't think that it's fair to frame this as "the common case of WorkerSet" versus "the rare case of starting N workers".  Probably if the web really honestly never adopted shared memory, then the way to write large-scale concurrent code would become erlangesque: you'd start a worker for each object in your system, and this "object" would export its "methods" as messages that it could receive and the return value of those methods would be expressed as a second message it would send back to you.  There is probably an infinity of other styles for how workers could be used.
Comment 37 Alexey Proskuryakov 2014-05-08 12:31:07 PDT
> I can already approximate your core count with the timing attack-based navigator.cores polyfill 

This is not practical for the use case - an ad network can't spin all your CPU cores for a time that is sufficient to measure the number of cores indirectly, and it can't defer loading the ads until after that's done.

But the polyfill seems like it would be OK for the rare Photoshop-like use case indeed, making the proposed API less compelling.
Comment 38 Filip Pizlo 2014-05-08 12:34:59 PDT
(In reply to comment #35)
> (In reply to comment #33)
> > Not a specific commentary on this patch, but relevant: In the web context, even if we have navigator.hardwareConcurrency, we will definitely also want some kind of WorkerSet, which has these abilities:
> 
> We already have a 'WorkerSet' for managing all of these workers, it's called your OS scheduler.

I agree with the navigator.hardwareConcurrency proposal, but I will respond to this anyway because you're missing the point and I don't want others to also become confused.

A WorkerSet would give you a clean way of sending a message to one of many workers based on whichever one is idle.

The way to do this with workers right now would be one of:

- Implement your own work queue.  Probably, you'd do it by having a scoreboard of which workers are working on what, so you'd know which worker is idle when work becomes available.  You'd have your own worker pool and you could even poll hardwareConcurrency to adapt the worker pool size.  The claim is that with the current worker API, this could be sufficiently ugly that it would be worthwhile for the browser to provide such an abstraction.

- Start a fresh worker every time there was work to do and let it die when it was done.  This would work, but with how workers are currently implemented, it would be inefficient when the work items are smallish.  It's not entirely clear that workers could be reimplemented to make such a "short-running worker" idiom faster than what a worker pool could do: a worker is a thread after all, and so it's a strong hint (or rather, a commandment) to the OS scheduler that you want it to be interleaved with all other threads.  A worker in a worker pool would run each task to completion.  Fewer interleaving opportunities typically mean less overhead.

Neither of those things seems appealing.
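The first option, the hand-rolled scoreboard pool, might look roughly like this. It is a hypothetical sketch: the worker objects are stand-ins exposing postMessage plus an onmessage hook, so the dispatch logic is visible without real Web Workers or the meta-program plumbing a real page would need.

```javascript
// Hand-rolled worker pool: tasks queue up and are forwarded only to workers
// known to be idle, per the scoreboard idea described above. A "worker" is
// any object with postMessage and an onmessage hook fired when it finishes.
class WorkerPool {
  constructor(workers) {
    this.idle = [...workers]; // scoreboard: who is free right now
    this.queue = [];          // tasks waiting for a free worker
    for (const w of workers) {
      w.onmessage = () => {   // worker reports its task is done
        this.idle.push(w);
        this.drain();
      };
    }
  }
  post(task) {
    this.queue.push(task);
    this.drain();
  }
  drain() {
    // Forward work only while both an idle worker and a pending task exist.
    while (this.idle.length && this.queue.length)
      this.idle.pop().postMessage(this.queue.shift());
  }
}
```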

If we thought that WorkerSet was the right abstraction then it would coincidentally also give you navigator.hardwareConcurrency because a WorkerSet could automatically start the right number of workers based on current system load.  In fact it could do somewhat better than hardwareConcurrency since it would have a story for adapting dynamically.

I suspect that we want WorkerSet in addition to, rather than instead of, navigator.hardwareConcurrency.
Comment 39 Filip Pizlo 2014-05-08 12:35:52 PDT
(In reply to comment #37)
> > I can already approximate your core count with the timing attack-based navigator.cores polyfill 
> 
> This is not practical for the use case - an ad network can't spin all your CPU cores for a time that is sufficient to measure the number of cores indirectly, and it can't defer loading the ads until after that's done.
> 
> But the polyfill seems like it would be OK for the rare Photoshop-like use case indeed, making the proposed API less compelling.

This would significantly increase the barrier of entry for people writing parallel code.

Are you still worried about this even with the min(hw.availcpu, 8) proposal?
Comment 40 Alexey Proskuryakov 2014-05-08 12:58:15 PDT
> Are you still worried about this even with the min(hw.availcpu, 8) proposal?

Not as much as without the cap, although any additional data that is exposed (and unchanged in private browsing mode, and unchanged across browser resets) helps fingerprinting. At best, we're splitting all users into five equal buckets (1, 2, 4, 6, 8), so they will be 5x easier to identify.

The approach feels so last-century though that I don't see how it's worth the long-term liability that exposing a web API imposes.
Comment 41 Eli Grey (:sephr) 2014-05-08 13:18:11 PDT
(In reply to comment #37)
> This is not practical for the use case - an ad network can't spin all your CPU cores for a time that is sufficient to measure the number of cores indirectly, and it can't defer loading the ads until after that's done.
> 
> But the polyfill seems like it would be OK for the rare Photoshop-like use case indeed, making the proposed API less compelling.

Unfortunately, not all ad networks are nice. Also, the ads do not have to be targeted to the core count on first load if native navigator.cores is not available. The value can be cached, and the execution of the polyfill can be postponed until idle. Estimation can pause immediately when the user moves their mouse, so the user may not even notice it.

Some ad networks wouldn't care about wasting your CPU resources if it means they can target for people with more expensive devices. You will have enough info from just cores + OS + UA + estimation runtime (measures speed per core).
Comment 42 Brady Eidson 2014-05-08 13:38:17 PDT
(In reply to comment #40)

> The approach feels so last-century though that I don't see how it's worth the long-term liability that exposing a web API imposes.

I feel this way, too.  Deeply.  Adding an API now means supporting it "forever", and I think this is definitely *NOT* an API that's useful forever.  Hell, I'd wager it'll be pointless in a few years.

But my primary concern is really the fingerprinting...

> > Are you still worried about this even with the min(hw.availcpu, 8) proposal?
> 
> Not as much as without the cap, although any additional data that is exposed (and unchanged in private browsing mode, and unchanged across browser resets) helps fingerprinting. At best, we're splitting all users into five equal buckets (1, 2, 4, 6, 8), so they will be 5x easier to identify.

AFAIK, there's never been a machine that has 6 *logical* cores.

And the machines with only 1 core are extremely endangered, if not extinct.

So it's about 2, 4, and 8.  

I can imagine that in 5 years, the result of the API will almost always be 8, because supported devices will all have 8 or more cores.

I think if the CSC and math geeks all conclude that the API has real usefulness and is necessary, then the added fingerprinting it allows is an acceptable tradeoff.
Comment 43 Rik Cabanier 2014-05-08 14:06:30 PDT
(In reply to comment #42)
> (In reply to comment #40)
> 
> > The approach feels so last-century though that I don't see how it's worth the long-term liability that exposing a web API imposes.
> 
> I feel this way, too.  Deeply.  Adding an API now means supporting it "forever", and I think this is definitely *NOT* an API that's useful forever.  Hell, I'd wager it'll be pointless in a few years.
> 
> But my primary concern is really the fingerprinting...
> 
> > > Are you still worried about this even with the min(hw.availcpu, 8) proposal?
> > 
> > Not as much as without the cap, although any additional data that is exposed (and unchanged in private browsing mode, and unchanged across browser resets) helps fingerprinting. At best, we're splitting all users into five equal buckets (1, 2, 4, 6, 8), so they will be 5x easier to identify.
> 
> AFAIK, there's never been a machine that has 6 *logical* cores.

AMD released machines with 3 and 6 cores: 
- http://en.wikipedia.org/wiki/AMD_Phenom
- http://www.anandtech.com/show/3674/amds-sixcore-phenom-ii-x6-1090t-1055t-reviewed
 
> And the machines with only 1 core are extremely endangered, if not extinct.

Intel is still releasing single-core CPUs:
http://ark.intel.com/products/family/43521/Intel-Celeron-Processor/desktop#@Desktop
although they have hyper-threading enabled by default.

> So it's about 2, 4, and 8.  
> 
> I can imagine that in 5 years, the result of the API will almost always be 8, because supported devices will all have 8 or more cores.

By then, we either have a better solution or we can increase it to 16.

> I think if the CSC and math geeks all conclude that the API has real usefulness and is necessary, then the added fingerprinting it allows is an acceptable tradeoff.
Comment 44 Rik Cabanier 2014-05-08 16:01:15 PDT
What are people's opinions about this after this change?
Alexey, Oliver, Benjamin?
Comment 45 Eli Grey (:sephr) 2014-05-08 17:35:14 PDT
I would prefer 16 as a max value if there is going to be a max value at all. Current workstations often have 32 or 16 logical cores (16 or 8 physical with Intel HT), and I want to be able to have full access for LZMA2 compression.

I don't think there should be a limit at all because I can already detect if you have over 8 cores at http://wg.oftn.org/projects/core-estimator/demo/ (assuming low-moderate system load), and I will still be able to detect that you have over 8 cores after this patch as it only obscures the value. This hinders battery life in the race to idle/sleep if a parallel application artificially needs to take longer than it would normally.

Access to the full logical core count is okay, as the OS scheduler already knows which threads to prioritize to keep the system responsive. Worker threads are not high priority.
Comment 46 Rik Cabanier 2014-05-08 18:56:30 PDT
(In reply to comment #45)
> I would prefer 16 as a max value if there is going to be a max value at all. Current workstations are often 32 or 16 logical cores (16 or 8 physical with Intel HT), and I want to be able to have full access for LZMA2 compression.
> 
> I don't think there should be a limit at all because I can already detect if you have over 8 cores at http://wg.oftn.org/projects/core-estimator/demo/ (assuming low-moderate system load), and I will still be able to detect that you have over 8 cores after this patch as it only obscures the value. This hinders battery life in the race to idle/sleep if a parallel application artificially needs to take longer than it would normally.
> 
> Access to the full logical core count is okay, as the OS scheduler already knows which threads to prioritize to keep the system responsive. Worker threads are not high priority.

The reason we limited the core count was to address the fingerprinting concern. (Yes, we know it can be figured out with a polyfill, but historically people on the web platform have not been persuaded by that argument.)

The vast majority of systems in the world will have 8 or fewer CPUs, so this should cover almost all users.
Is JavaScript efficient enough to not saturate the memory bandwidth if you run more than 8 tasks? Have you done tests to determine if efficiency scales?
Comment 47 Ryosuke Niwa 2014-05-08 19:06:05 PDT
(In reply to comment #46)
> (In reply to comment #45)
>
> The reason we limited the core count was to address the fingerprinting concern. (Yes, we know it can be figured out with a polyfill, but historically people on the web platform have not been persuaded by that argument.)
> 
> The vast majority of systems in the world will have 8 or fewer CPUs, so this should cover almost all users.
> Is JavaScript efficient enough to not saturate the memory bandwidth if you run more than 8 tasks? Have you done tests to determine if efficiency scales?

Note that we could further mitigate the possibility of detecting the real number of cores using polyfill by restricting the number of logical cores we use to run workers.
Comment 48 Brady Eidson 2014-05-08 21:59:41 PDT
(In reply to comment #45)
> I would prefer 16 as a max value if there is going to be a max value at all. Current workstations are often 32 or 16 logical cores (16 or 8 physical with Intel HT), and I want to be able to have full access for LZMA2 compression.

16 - on today's hardware - rips the fingerprinting problem wide open again.  Since the proportion of folks with 12 or 16 (or more) cores is so small compared to the proportion with 8 or fewer, those folks can get fingerprinted *much* more effectively.

One thing many of us like about the 8 limit is that it folds these "small minority" users in with the 8-core crowd, making the fingerprinting bit much less valuable.
Comment 49 Brady Eidson 2014-05-08 22:00:27 PDT
(In reply to comment #47)
> (In reply to comment #46)
> > (In reply to comment #45)
> >
> > The reason we limited the core count was to address the fingerprinting concern. (Yes, we know it can be figured out with a polyfill, but historically people on the web platform have not been persuaded by that argument.)
> > 
> > The vast majority of systems in the world will have 8 or fewer CPUs, so this should cover almost all users.
> > Is JavaScript efficient enough to not saturate the memory bandwidth if you run more than 8 tasks? Have you done tests to determine if efficiency scales?
> 
> Note that we could further mitigate the possibility of detecting the real number of cores using polyfill by restricting the number of logical cores we use to run workers.

+1
Comment 50 Rik Cabanier 2014-05-08 23:12:00 PDT
(In reply to comment #47)
...
> Note that we could further mitigate the possibility of detecting the real number of cores using polyfill by restricting the number of logical cores we use to run workers.

Firefox limits the number of running workers per origin to 20, which seems high to me. WebKit could do something similar.
Running workers != total workers: it would be the number of workers that are actually running in parallel, not those blocked on a message queue or async event.
Comment 51 Darin Adler 2014-05-09 09:46:17 PDT
Comment on attachment 231074 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=231074&action=review

> Source/WebCore/ChangeLog:8
> +        Added a new API that returns the number of CPU cores up to 8.

Where does the arbitrary constant 8 come from?
Comment 52 Darin Adler 2014-05-09 09:47:03 PDT
Need some comment about the 8 rationale in the code. I see the discussion here in the bug now, but the code will last much longer than the bug report.
Comment 53 Rik Cabanier 2014-05-09 10:41:03 PDT
Created attachment 231165 [details]
Patch
Comment 54 Rik Cabanier 2014-05-09 15:07:32 PDT
(In reply to comment #52)
> Need some comment about the 8 rationale in the code. I see the discussion here in the bug now, but the code will last much longer than the bug report.

Is it OK in the latest update?
Comment 55 Darin Adler 2014-05-10 10:48:32 PDT
(In reply to comment #54)
> (In reply to comment #52)
> > Need some comment about the 8 rationale in the code. I see the discussion here in the bug now, but the code will last much longer than the bug report.
> 
> Is it OK in the latest update?

It’s good that there’s a comment. Formatting of the comment is wrong (comment inside a function should be indented). The number 8 should be a named constant. The comment is a bit too long. I would suggest something closer to this:

    // Enforce a maximum for the number of cores reported to mitigate
    // fingerprinting for the minority of machines with large numbers of cores.
    // If machines with more than 8 cores become commonplace, we should bump this number.
    const unsigned maxCoresToReport = 8;
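Rendered as JavaScript for illustration (the real patch is C++ in Navigator.cpp; the names here merely mirror it), the suggested clamp amounts to:

```javascript
// Cap the reported value at a named constant, per the suggested comment;
// the actual core count comes from the platform (WTF::numberOfProcessorCores
// in the patch), passed in here as a plain number for illustration.
const maxCoresToReport = 8;
function clampedConcurrency(actualCores) {
  return Math.min(actualCores, maxCoresToReport);
}
```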
Comment 56 Ryosuke Niwa 2014-05-10 11:06:03 PDT
Comment on attachment 231165 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=231165&action=review

This feature definitely needs a build flag.

> Source/WebCore/page/Navigator.cpp:151
> +#if PLATFORM(MAC) || PLATFORM(IOS)

Why can't we push this into some platform file?
Also, given JSC starts up concurrent threads for marking and JITing,
don't we have the same code there?
If so, we should move it to WTF to share the code.

> LayoutTests/fast/dom/navigator-hardwareConcurrency.html:11
> +		document.write("PASS");

Wrong indentation. Maybe there is a tab character here?
Comment 57 Rik Cabanier 2014-05-10 13:55:56 PDT
Created attachment 231233 [details]
Patch
Comment 58 Ryosuke Niwa 2014-05-10 16:04:43 PDT
Comment on attachment 231233 [details]
Patch

Since this is not a standardized API, there should be a build flag to disable it.
If it is, please refer to a specific version of the spec you used to implement it.
Comment 59 Eli Grey (:sephr) 2014-05-11 09:41:49 PDT
(In reply to comment #58)
> If it is, please refer to a specific version of the spec you used to implement it.

Rik's patch implements Navigator Hardware Concurrency version 1.0

1.0 permalink: http://wiki.whatwg.org/index.php?title=Navigator_HW_Concurrency&oldid=9585
Comment 60 Rik Cabanier 2014-05-11 16:23:14 PDT
Created attachment 231269 [details]
Patch
Comment 61 Rik Cabanier 2014-05-11 16:25:28 PDT
(In reply to comment #58)
> (From update of attachment 231233 [details])
> Since this is not a standardized API, there should be a build flag to disable it.
> If it is, please refer to a specific version of the spec you used to implement it.

I updated the patch to add a build flag. Not set to review yet as I want to wait for the EWS bots to pass.
Comment 62 Ryosuke Niwa 2014-05-11 16:57:27 PDT
(In reply to comment #59)
> (In reply to comment #58)
> > If it is, please refer to a specific version of the spec you used to implement it.
> 
> Rik's patch implements Navigator Hardware Concurrency version 1.0
> 
> 1.0 permalink: http://wiki.whatwg.org/index.php?title=Navigator_HW_Concurrency&oldid=9585

That's a proposal, not a specification.  There is a huge difference between someone making a proposal and it becoming a standard and/or being endorsed by other browser vendors.

Per previously agreed WebKit policy, this feature should be implemented behind a build flag.
Comment 63 Rik Cabanier 2014-05-12 16:22:28 PDT
(In reply to comment #62)
> 
> Per previously agreed WebKit policy, this feature should be implemented behind a build flag.

The patch is building on all platforms. Can you review it again?
Comment 64 Filip Pizlo 2014-05-12 16:41:05 PDT
Comment on attachment 231269 [details]
Patch

This looks good to me.  Rik, can you wait a bit before landing?  Judging by the feedback so far, there may yet be more feedback.
Comment 65 Rik Cabanier 2014-05-12 16:46:32 PDT
(In reply to comment #64)
> (From update of attachment 231269 [details])
> This looks good to me.  Rik, can you wait a bit before landing?  Judging by the feedback so far, there may yet be more feedback.

Will do. Should I wait until Thursday?
Comment 66 Ryosuke Niwa 2014-05-12 17:51:50 PDT
Comment on attachment 231269 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=231269&action=review

> Source/WebCore/page/Navigator.cpp:149
> +    if (WTF::numberOfProcessorCores() > maxCoresToReport)
> +        return maxCoresToReport;
> +
> +    return WTF::numberOfProcessorCores();

We should be storing the value returned by WTF::numberOfProcessorCores() to a local variable instead of calling it twice
even if numberOfProcessorCores had a static cache.

> LayoutTests/fast/dom/navigator-hardwareConcurrency.html:1
> +<html>

Missing DOCTYPE. There's no need to use quirks mode here.

> LayoutTests/fast/dom/navigator-hardwareConcurrency.html:11
> +		document.write("PASS");

There is a tab character here.

> LayoutTests/fast/dom/navigator-hardwareConcurrency.html:13
> +		document.write("Fail, navigator.hardwareConcurrency is " + concurrency);

Ditto.
Comment 67 Rik Cabanier 2014-05-12 20:12:02 PDT
Comment on attachment 231269 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=231269&action=review

>> Source/WebCore/page/Navigator.cpp:149
>> +    return WTF::numberOfProcessorCores();
> 
> We should be storing the value returned by WTF::numberOfProcessorCores() to a local variable instead of calling it twice
> even if numberOfProcessorCores had a static cache.

why is that?
It seems that this would waste 4 bytes + the static initializer code.

>> LayoutTests/fast/dom/navigator-hardwareConcurrency.html:11
>> +		document.write("PASS");
> 
> There is a tab character here.

hmm. How did this happen?
will fix.
Comment 68 Alexey Proskuryakov 2014-05-12 22:09:22 PDT
Comment on attachment 231269 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=231269&action=review

I still think that it is undesirable to add this feature to the web platform. Please don't count the code comments below as general support.

> Source/WebCore/page/Navigator.cpp:144
> +    const int maxCoresToReport = 8;

Shouldn't this be as low as 2 for iOS?

> Source/WebCore/page/Navigator.cpp:146
> +    if (WTF::numberOfProcessorCores() > maxCoresToReport)

WTF headers should export public symbols to the global namespace with "using". Callers should not explicitly specify WTF like here.
Comment 69 Rik Cabanier 2014-05-12 23:20:22 PDT
(In reply to comment #68)
> (From update of attachment 231269 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=231269&action=review
> 
> I still think that it is undesirable to add this feature to the web platform. Please don't count the code comments below as general support.

How can we resolve this?

> > Source/WebCore/page/Navigator.cpp:144
> > +    const int maxCoresToReport = 8;
> 
> Shouldn't this be as low as 2 for iOS?

It will always be 2 on iOS (unless there's a new device with more).
People can already figure out what iOS device they're running, so they know how many CPUs it has.

> > Source/WebCore/page/Navigator.cpp:146
> > +    if (WTF::numberOfProcessorCores() > maxCoresToReport)
> 
> WTF headers should export public symbols to the global namespace with "using". Callers should not explicitly specify WTF like here.

I find 100's of instances of 'WTF::' in the codebase. I can update the patch.
Comment 70 Alexey Proskuryakov 2014-05-13 09:47:30 PDT
> > I still think that it is undesirable to add this feature to the web platform. Please don't count the code comments below as general support.
> 
> How can we resolve this?

This was discussed extensively in this bug, it's just not the right way to do multithreading, and not a reasonable API to expose and to support going forward. I don't think that repeating the same considerations will be productive.

I think that the best reason for WebKit to support navigator.hardwareConcurrency would be to not stay behind in website compatibility - assuming that multiple other engines implement it, and assuming that any important sites start using it. But being the first to expose this feature seems pointless and harmful. What are we trying to demo here - that websites can burn through your laptop battery by mining bitcoins even faster than before?

> It will always be 2 on iOS (unless there's a new device with more).

Right, I don't want to worry about it when/if such devices appear. Whenever this happens, these devices will be a rarity at first.

> I find 100's of instances of 'WTF::' in the codebase.

A lot of these are outright mistakes (like instances of "WTF::PassRefPtr" or "WTF::OrdinalNumber") that should be corrected as a matter of routine any time related code is touched. Some are consequences of mistakes made in WTF - there are a few public symbols that are not exported correctly, and that needs to be cleaned up eventually, preferably in separate patches. And a few cases are where disambiguation is truly required (like WTF::bind vs. std::bind), but that's a rare exception.
Comment 71 Rik Cabanier 2014-05-13 10:47:32 PDT
(In reply to comment #70)
> > > I still think that it is undesirable to add this feature to the web platform. Please don't count the code comments below as general support.
> > 
> > How can we resolve this?
> 
> This was discussed extensively in this bug, it's just not the right way to do multithreading, and not a reasonable API to expose and to support going forward. I don't think that repeating the same considerations will be productive.
> 
> I think that the best reason for WebKit to support navigator.hardwareConcurrency would be to not stay behind in website compatibility - assuming that multiple other engines implement it, and assuming that any important sites start using it. But being the first to expose this feature seems pointless and harmful. 

WebKit will not be the first. Blink already approved that they will implement and ship this. I prepared a patch for Firefox as well but we're still going through the (same) issues.

>What are we trying to demo here - that websites can burn through your laptop 
>battery by mining bitcoins even faster than before?

No, they can burn your laptop just as well, but you might get more coins since you can balance the load.

> > It will always be 2 on iOS (unless there's a new device with more).
> 
> Right, I don't want to worry about it when/if such devices appear. Whenever this happens, these devices will be a rarity at first.

OK. I will update so iOS only returns 2.

> 
> > I find 100's of instances of 'WTF::' in the codebase.
> 
> A lot of these are outright mistakes (like instances of "WTF::PassRefPtr" or "WTF::OrdinalNumber") that should be corrected as a matter of routine any time related code is touched. Some are consequences of mistakes made in WTF - there are a few public symbols that are not exported correctly, and that needs to be cleaned up eventually, preferably in separate patches. And a few cases are where disambiguation is truly required (like WTF::bind vs. std::bind), but that's a rare exception.

Got it! I will update the patch.
Comment 72 Filip Pizlo 2014-05-13 10:54:58 PDT
(In reply to comment #70)
> > > I still think that it is undesirable to add this feature to the web platform. Please don't count the code comments below as general support.
> > 
> > How can we resolve this?
> 
> This was discussed extensively in this bug, it's just not the right way to do multithreading, 

This statement has been roundly debunked in this bug.  Please don't restate points that have already been refuted without at least pointing to something specific you disagree with.

> and not a reasonable API to expose and to support going forward. I don't think that repeating the same considerations will be productive.
> 
> I think that the best reason for WebKit to support navigator.hardwareConcurrency would be to not stay behind in website compatibility - assuming that multiple other engines implement it, and assuming that any important sites start using it. But being the first to expose this feature seems pointless and harmful. What are we trying to demo here - that websites can burn through your laptop battery by mining bitcoins even faster than before?
> 
> > It will always be 2 on iOS (unless there's a new device with more).
> 
> Right, I don't want to worry about it when/if such devices appear. Whenever this happens, these devices will be a rarity at first.
> 
> > I find 100's of instances of 'WTF::' in the codebase.
> 
> A lot of these are outright mistakes (like instances of "WTF::PassRefPtr" or "WTF::OrdinalNumber") that should be corrected as a matter of routine any time related code is touched. Some are consequences of mistakes made in WTF - there are a few public symbols that are not exported correctly, and that needs to be cleaned up eventually, preferably in separate patches. And a few cases are where disambiguation is truly required (like WTF::bind vs. std::bind), but that's a rare exception.
Comment 73 Alexey Proskuryakov 2014-05-13 11:08:54 PDT
> This statement has been roundly debunked in this bug.  Please don't restate points that have already been refuted without at least pointing to something specific you disagree with.

The number of cores has very little to do with how many parallel tasks I want a web page to use. Exposing it may make it slightly easier to implement something that required a polyfill before, but demonstrating that is only a tiny step towards proving that this addition is an overall improvement to the platform (otherwise, we'd all be busy implementing ActiveX and/or NaCl, as running native code is obviously a sure way to achieve parity with native applications).
Comment 74 Filip Pizlo 2014-05-13 11:12:09 PDT
(In reply to comment #73)
> > This statement has been roundly debunked in this bug.  Please don't restate points that have already been refuted without at least pointing to something specific you disagree with.
> 
> The number of cores has very little to do with how many parallel tasks I want a web page to use. Exposing it may make it slightly easier to implement something that required a polyfill before, but demonstrating that is only a tiny step towards proving that this addition is an overall improvement to the platform (otherwise, we'd all be busy implementing ActiveX and/or NaCl, as running native code is obviously a sure way to achieve parity with native applications).

The number of cores is an upper bound on the number of parallel tasks I want a web page to use, in the sense that it would definitely not be efficient if a web page spawned more tasks than processors just because it didn't know how many cores you had.

Also we are arranging to return a tighter upper bound - it may be 8 even if you have more than 8 cores.
Comment 75 Alexey Proskuryakov 2014-05-14 11:29:53 PDT
> only a tiny step towards proving that this addition is an overall improvement to the platform

To elaborate, here are some of the considerations that lead me to not being supportive of the proposed API.

1. Does the new feature make something new possible, that wasn't possible before?

In this case, it's not quite clear. There is a polyfill, which actually produces dramatically better results than the proposed API, but is costly to use. The results are better because (1) they are not clamped, and (2) they might adapt to current load (I didn't actually test what the polyfill says when one of my cores is busy with another process).

There was a claim that the polyfill will "significantly increase the barrier of entry for people writing parallel code", which I find inaccurate. Adding a line like <script src="core-estimator.min.js"></script> does not significantly increase the complexity of writing parallel computational algorithms in JS.

2. Is this new functionality important, will it be used on many sites, or on sites that many people use?

Making better use of processor cores in the browser engine is super important in general. 

Talking about this particular feature, it seems unlikely that there will be any win for general browsing, even long term. It may benefit some kind of "Photoshop online". Even games historically had much difficulty using multiple cores (although the wiki claims that physics engines are highly parallelizable, so maybe that has changed). Other use cases on the wiki seem so esoteric that they are probably not worth supporting with this feature today.

3. Is the new functionality in line with long term direction of the platform?

Seems like we all agree that it's not.

This consideration is difficult to apply in practice though - the web platform is all made of bad features anyway, so adding one more will not kill it.

Also, making some moves in the wrong direction could still help us gain experience, assess the real level of interest, and prioritize future efforts accordingly.

4. What are the dangers?

All the dangers are somewhat muted due to the existence of a polyfill. However, the actual proposed API differs in a few ways - it's very lightweight, so it could be used in applications that are lightweight (little CPU use, tiny code). Also, it gives results that never change for a given machine, while the polyfill is less than 100% consistent.

This feature has at least medium privacy cost, as it makes fingerprinting easier. I think that we now estimate that it makes fingerprinting 3x easier, which is non-trivial. We've been going to great lengths to make it harder (e.g., a weak random number generator in multipart form boundaries was deemed unacceptable, because an ad network could compute your random seed, and thus track you until WebProcess relaunch). This will be a long term liability too (will we need to add a lower bound when almost all machines are 2+ cores?)

The original proposal also exposed directly usable information about user hardware in some cases ("Hello new Mac Pro user!"). I think that this was mostly addressed, although disclosing information never helps privacy, and there could be cases we didn't think of.

Another reason why the privacy cost is relatively high is that the API results have to be the same in private browsing mode, you can't have a clean "session".

5. What's the exit strategy if everything falls apart?

We can always hardcode "return 2;". This will be about as embarrassing as "C:\fakepath\" is today (it's something we have in form submission). Anyway, there is a workable exit strategy.

6. What if we don't implement this?

We might feel competitive pressure due to not doing well on synthetic demos, and possibly on "Photoshop online" or games.

Given all the above, I think that we should be pushing against this feature, but not as strongly as being the only engine that refuses to implement it.
Comment 76 Rik Cabanier 2014-05-14 13:58:14 PDT
(In reply to comment #75)
> > only a tiny step towards proving that this addition is an overall improvement to the platform
> 
> To elaborate, here are some of the considerations that lead me to not being supportive of the proposed API.
> 
> 1. Does the new feature make something new possible, that wasn't possible before?
> 
> In this case, it's not quite clear. There is a polyfill, which actually produces dramatically better results than the proposed API, but is costly to use. The results are better because (1) they are not clamped, and (2) they might adapt to current load (I didn't actually test what the polyfill says when one of my cores is busy with another process).
> 
> There was a claim that the polyfill will "significantly increase the barrier of entry for people writing parallel code", which I find inaccurate. Adding a line like <script src="core-estimator.min.js"></script> does not significantly increase the complexity of writing parallel computational algorithms in JS.

The polyfill is imprecise and takes some time to run. To be precise (which is needed), it would have to run quite a bit longer.


> 2. Is this new functionality important, will it be used on many sites, or on sites that many people use?
> 
> Making better use of processor cores in the browser engine is super important in general. 
> 
> Talking about this particular feature, it seems unlikely that there will be any win for general browsing, even long term. It may benefit some kind of "Photoshop online". Even games historically had much difficulty using multiple cores (although the wiki claims that physics engines are highly parallelizable, so maybe that has changed). Other use cases on the wiki seem so esoteric that they are probably not worth supporting with this feature today.

I think you will find that this is no longer the case. This is a well-understood problem. (i.e., why do you think gaming consoles have so many CPU cores?)
It's hard to say how native applications are using this, but I did a GitHub search on this feature in other VMs yesterday:
Python:
    multiprocessing.cpu_count()
    11,295 results
    https://github.com/search?q=multiprocessing.cpu_count%28%29+extension%3Apy&type=Code&ref=advsearch&l=
Java:
    Runtime.getRuntime().availableProcessors()
    23,967 results
    https://github.com/search?q=availableProcessors%28%29+extension%3Ajava&type=Code&ref=searchresults
node.js is also exposing it:
    require('os').cpus()
    4,851 results
    https://github.com/search?q=require%28%27os%27%29.cpus%28%29+extension%3Ajs&type=Code&ref=searchresults
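On the web side, the consuming pattern would presumably mirror the Node.js idiom above: read the value once, with a fallback for engines that don't expose it (the fallback of 2 here is an arbitrary conservative choice, not from the proposal):

```javascript
// Read the advertised parallelism, falling back to a conservative 2
// in environments (or older browsers) that don't expose it.
const cores =
  (typeof navigator !== 'undefined' && navigator.hardwareConcurrency) || 2;

console.log(`would start up to ${cores} workers`);
```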


> 3. Is the new functionality in line with long term direction of the platform?
> 
> Seems like we all agree that it's not.

I don't think that is true. We agree it's not the final solution but it will be part of it.

> This consideration is difficult to apply in practice though - the web platform is all made of bad features anyway, so adding one more will not kill it.
> 
> Also, making some moves in the wrong direction could still help us gain experience, assess the real level of interest, and prioritize future efforts accordingly.
> 
> 4. What are the dangers?
> 
> All the dangers are somewhat muted due to the existence of a polyfill. However, the actual proposed API differs in a few ways - it's very lightweight, so it could be used in applications that are lightweight (little CPU use, tiny code). Also, it gives results that never change for a given machine, while the polyfill is less than 100% consistent.
> 
> This feature has at least medium privacy cost, as it makes fingerprinting easier. I think that we now estimate that it makes fingerprinting 3x easier, which is non-trivial. We've been going to great lengths to make it harder (e.g., a weak random number generator in multipart form boundaries was deemed unacceptable, because an ad network could compute your random seed, and thus track you until WebProcess relaunch). This will be a long term liability too (will we need to add a lower bound when almost all machines are 2+ cores?)
> 
> The original proposal also exposed directly usable information about user hardware in some cases ("Hello new Mac Pro user!"). I think that this was mostly addressed, although disclosing information never helps privacy, and there could be cases we didn't think of.
> 
> Another reason why the privacy cost is relatively high is that the API results have to be the same in private browsing mode, you can't have a clean "session".

I'm unsure how this is different from any other parameter in Navigator.

> 5. What's the exit strategy if everything falls apart?
> 
> We can always hardcode "return 2;". This will be about as embarrassing as "C:\fakepath\" is today (it's something we have in form submission). Anyway, there is a workable exit strategy.

True
Comment 77 Filip Pizlo 2014-05-14 14:06:56 PDT
(In reply to comment #75)
> > only a tiny step towards proving that this addition is an overall improvement to the platform
> 
> To elaborate, here are some of the considerations that lead me to not being supportive of the proposed API.
> 
> 1. Does the new feature make something new possible, that wasn't possible before?
> 
> In this case, it's not quite clear. There is a polyfill, which actually produces dramatically better results than the proposed API, but is costly to use. The results are better because (1) they are not clamped, and (2) they might adapt to current load (I didn't actually test what the polyfill says when one of my cores is busy with another process).
> 
> There was a claim that the polyfill will "significantly increase the barrier of entry for people writing parallel code", which I find inaccurate. Adding a line like <script src="core-estimator.min.js"></script> does not significantly increase the complexity of writing parallel computational algorithms in JS.
> 
> 2. Is this new functionality important, will it be used on many sites, or on sites that many people use?
> 
> Making better use of processor cores in the browser engine is super important in general. 
> 
> Talking about this particular feature, it seems unlikely that there will be any win for general browsing, even long term. It may benefit some kind of "Photoshop online". Even games historically had much difficulty using multiple cores (although the wiki claims that physics engines are highly parallelizable, so maybe that has changed). Other use cases on the wiki seem so esoteric that they are probably not worth supporting with this feature today.
> 
> 3. Is the new functionality in line with long term direction of the platform?
> 
> Seems like we all agree that it's not.
> 
> This consideration is difficult to apply in practice though - the web platform is all made of bad features anyway, so adding one more will not kill it.
> 
> Also, making some moves in the wrong direction could still help us gain experience, assess the real level of interest, and prioritize future efforts accordingly.

I strongly disagree with this point.  For as long as I've been writing multithreaded code, I've always relied on some API to tell me the number of cores.  JavaScriptCore uses it heavily.  I've also used it in a past life when I was writing GUIs for some scientific visualizations and wanted to shard some 2D image analysis.

GCD-like (or more broadly, minitask-like) models work for some subset of parallel computing tasks but certainly not for all of them.  The most fundamental unit of abstraction that a programmer will ultimately want is a thread, and he will eventually want to know what the maximum number of threads is that it makes sense to start.

> 
> 4. What are the dangers?
> 
> All the dangers are somewhat muted due to the existence of a polyfill. However, the actual proposed API differs in a few ways - it's very lightweight, so it could be used in applications that are lightweight (little CPU use, tiny code). Also, it gives results that never change for a given machine, while the polyfill is less than 100% consistent.
> 
> This feature has at least medium privacy cost, as it makes fingerprinting easier. I think that we now estimate that it makes fingerprinting 3x easier, which is non-trivial. We've been going to great lengths to make it harder (e.g., a weak random number generator in multipart form boundaries was deemed unacceptable, because an ad network could compute your random seed, and thus track you until WebProcess relaunch). This will be a long term liability too (will we need to add a lower bound when almost all machines are 2+ cores?)

I disagree that it makes fingerprinting 3x easier.  I suspect it really only reveals 1 bit of information, since it's already possible to get a rough bound on the number of cores from inspecting other aspects of a person's computer.

> 
> The original proposal also exposed directly usable information about user hardware in some cases ("Hello new Mac Pro user!"). I think that this was mostly addressed, although disclosing information never helps privacy, and there could be cases we didn't think of.
> 
> Another reason why the privacy cost is relatively high is that the API results have to be the same in private browsing mode, you can't have a clean "session".
> 
> 5. What's the exit strategy if everything falls apart?
> 
> We can always hardcode "return 2;". This will be about as embarrassing as "C:\fakepath\" is today (it's something we have in form submission). Anyway, there is a workable exit strategy.
> 
> 6. What if we don't implement this?
> 
> We might feel competitive pressure due to not doing well on synthetic demos, and possibly on "Photoshop online" or games.
> 
> Given all the above, I think that we should be pushing against this feature, but not as strongly as being the only engine that refuses to implement it.

Personally, I would prefer for WebKit to be on the forefront of improving the sorry state of concurrency on the web, and to me there is high cost to us rejecting a perfectly sensible API that is known to have productive and meaningful uses.
Comment 78 Alexey Proskuryakov 2014-05-14 15:38:56 PDT
> The polyfill is imprecise and takes some time to run. To be precise (which is needed), it would have to run quite a bit longer.

Why are you saying that it needs to be precise? The very reason why it's imprecise is that some cores may be busy, in which case we don't want to use them anyway. It only needs to be precise if the purpose is nefarious (back to fingerprinting).

You did not comment on how clamping makes the polyfill superior for legitimate applications.

> This is a well-understood problem. (ie why do you think gaming consoles have so many CPU cores?)

Ok.

> > Another reason why the privacy cost is relatively high is that the API results have to be the same in private browsing mode, you can't have a clean "session".

> I'm unsure how this is different from any other parameter in Navigator

Now I get to say that this is a well understood problem :-)

If you want deep details, there is a lot in <https://wiki.mozilla.org/Fingerprinting> and in linked articles. Navigator properties such as navigator.plugins can and should be cloaked in private browsing mode without breaking user experience much. This is not true for hardwareConcurrency - we can of course make it always return a constant in private browsing mode, but that would have a big undesirable effect on pages that need it.


> GCD-like (or more broadly, minitask-like) models work for some subset of parallel computing tasks but certainly not for all of them.

Do we need to reimplement all of the tasks in JavaScript though? If adding the ability to perform visualization makes regular web browsing worse, it's not an easy question to answer whether we want this ability.

There are certainly other ways to adapt to system load that are more declarative and give the engine and the OS a way to reconcile appetite of a single isolated web page with system responsiveness (e.g. by limiting the number of workers that run at the same time, and notifying the page about changes dynamically).

If you see the hardwareConcurrency as part of future Web, then do you also see all the other related low level knowledge as part of it (number of processors vs. number of cores on each processor, core affinity, cache size and so on)? I think that this is a rabbit hole.

> I disagree that it makes fingerprinting 3x easier.  I suspect it really only reveals 1 bit of information

Even so, 2x is not pleasant either. Not a show stopper (otherwise I'd just r-), but given the trivial utility of the feature, this can't be laughed off.

> Personally, I would prefer for WebKit to be on the forefront of improving the sorry state of concurrency on the web, and to me there is high cost to us rejecting a perfectly sensible API that is known to have productive and meaningful uses.

I agree with this statement (heck, I implemented Web Workers for us to be where the forefront was at the time). I agree even though poor adoption of Web Workers shows how much we overestimated the interest in such technologies.

But I consider this statement too general and too strong when we are talking about hardwareConcurrency.
Comment 79 Luke Diggins 2014-05-14 23:57:04 PDT
(In reply to comment #78)
> > The polyfill is imprecise and takes some time to run. To be precise (which is needed), it would have to run quite a bit longer.
> 
> Why are you saying that it needs to be precise? The very reason why it's imprecise is that some cores may be busy, in which case we don't want to use them anyway. It only needs to be precise if the purpose is nefarious (back to fingerprinting).

From what I have read, the polyfill is usually accurate when load is low, and *overestimates* to various degrees when load is high. This is intuitive behaviour to me: a logical core alternating between two tasks compared to one splits time more noticeably than the same core alternating between say ten tasks compared to nine. This makes the polyfill far more useful for fingerprinting, where other information can be used to distinguish between weak overloaded machines and high end hardware, and a potential pitfall for legitimate developers.

> > GCD-like (or more broadly, minitask-like) models work for some subset of parallel computing tasks but certainly not for all of them.
> 
> Do we need to reimplement all of the tasks in JavaScript though? If adding the ability to perform visualization makes regular web browsing worse, it's not an easy question to answer whether we want this ability.
> 
> There are certainly other ways to adapt to system load that are more declarative and give the engine and the OS a way to reconcile appetite of a single isolated web page with system responsiveness (e.g. by limiting the number of workers that run at the same time, and notifying the page about changes dynamically).
> 
> If you see the hardwareConcurrency as part of future Web, then do you also see all the other related low level knowledge as part of it (number of processors vs. number of cores on each processor, core affinity, cache size and so on)? I think that this is a rabbit hole.

I cannot claim to be an expert like Rik, but his statistics fit my personal understanding, which is that plenty of applications written in other languages have benefited from knowing the number of logical processors and only the number of logical processors. Those seeking such fine-tuned multi-threaded performance are probably firmly focused on native applications anyway.

> > Personally, I would prefer for WebKit to be on the forefront of improving the sorry state of concurrency on the web, and to me there is high cost to us rejecting a perfectly sensible API that is known to have productive and meaningful uses.
> 
> I agree with this statement (heck, I implemented Web Workers for us to be where the forefront was at the time). I agree even though poor adoption of Web Workers shows how much we overestimated the interest in such technologies.
> 
> But I consider this statement too general and too strong when we are talking about hardwareConcurrency.

Web workers are a fantastic innovation, but in my experience, they are still tricky to implement well. I just started writing an application in which I want to perform a basic simulation many (n) times, then pool the results for statistical analysis. From my previous experience in concurrent programming, my planned approach was to start a thread on each logical processor (p), run n/p loops on each, and combine the arrays of result objects in the main thread. It was incredibly frustrating that the first line, trivial and fundamental in Java, could not be done. (As an aside, I'm also returning the data as typed arrays now, which is inconvenient but still less messy than manually resetting the prototype of that many serialised objects.)
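The n/p split described above doesn't need any worker machinery to illustrate; a hypothetical helper (not part of the patch) dividing n simulation runs as evenly as possible across p workers might look like:

```javascript
// Divide n runs across p workers as evenly as possible.
function chunkCounts(n, p) {
  const base = Math.floor(n / p);
  const extra = n % p;
  // The first `extra` workers take one additional run each.
  return Array.from({ length: p }, (_, i) => base + (i < extra ? 1 : 0));
}

console.log(chunkCounts(10, 4)); // [3, 3, 2, 2]
```

Each worker then runs its assigned count and posts its result array back to the main thread for pooling.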
Comment 80 Luke Diggins 2014-05-15 00:29:10 PDT
Seeing as I've taken the time to register, I may as well add: thumbs up to limiting the number of possible property values. Returning only 2, 4 or 8 should be enough to avoid creating twice as many threads as needed on cheap dual-core machines, without leaving hyperthreaded quad-core machines less than half utilised.
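Clamping to a small set also bounds the fingerprinting cost raised earlier: the property can reveal at most log2 of the number of distinct observable values. For the {2, 4, 8} set suggested above:

```javascript
// A property reveals at most log2(distinct observable values) bits.
const observable = [2, 4, 8];
const bits = Math.log2(observable.length);

console.log(bits.toFixed(2)); // ≈ 1.58 bits for three possible values
```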
Comment 81 Rik Cabanier 2014-05-15 11:53:41 PDT
we were going to wait until Thursday (= today) to check this in but I'm unsure if I should.
Alexey, do you still want me to wait?
Comment 82 Alexey Proskuryakov 2014-05-15 12:47:15 PDT
My preference is to wait a year or two. I wrote a very detailed comment explaining why. Nothing in it was "debunked" in any way, except possibly the claim about games, which I don't know enough about to agree or disagree with.
Comment 83 Rik Cabanier 2014-05-15 13:16:02 PDT
(In reply to comment #78)
> > The polyfill is imprecise and takes some time to run. To be precise (which is needed), it would have to run quite a bit longer.
> 
> Why are you saying that it needs to be precise? The very reason why it's imprecise is that some cores may be busy, in which case we don't want to use them anyway. It only needs to be precise if the purpose is nefarious (back to fingerprinting).

Since it's imprecise, the same application would end up with different performance characteristics from run to run, which would be odd.
Also, since this test would typically happen during startup, other things (such as parsing and downloads) will affect the estimate.

> You did not comment on how clamping makes the polyfill superior for legitimate applications.
> 
> > This is a well-understood problem. (ie why do you think gaming consoles have so many CPU cores?)
> 
> Ok.
> 
> > > Another reason why the privacy cost is relatively high is that the API results have to be the same in private browsing mode, you can't have a clean "session".
> 
> > I'm unsure how this is different from any other parameter in Navigator
> 
> Now I get to say that this is a well understood problem :-)
> 
> If you want deep details, there is a lot in <https://wiki.mozilla.org/Fingerprinting> and in linked articles. Navigator properties such as navigator.plugins can and should be cloaked in private browsing mode without breaking user experience much. This is not true for hardwareConcurrency - we can of course make it always return a constant in private browsing mode, but that would have a big undesirable effect on pages that need it.

Ah, I didn't know that, but it's good that you believe we don't have to do this for private browsing.

> > GCD-like (or more broadly, minitask-like) models work for some subset of parallel computing tasks but certainly not for all of them.
> 
> Do we need to reimplement all of the tasks in JavaScript though? If adding the ability to perform visualization makes regular web browsing worse, it's not an easy question to answer whether we want this ability.

Not sure why it would degrade the visualization. Workers could (should?) run at a slightly lower priority than the main thread.
Again, authors can already do this.

> There are certainly other ways to adapt to system load that are more declarative and give the engine and the OS a way to reconcile appetite of a single isolated web page with system responsiveness (e.g. by limiting the number of workers that run at the same time, and notifying the page about changes dynamically).

Yes. This is how it's implemented in mozilla. The return value is 8 or the maximum number of workers per domain.
I'm unsure if dynamic changing of the worker pool buys you much.

> If you see the hardwareConcurrency as part of future Web, then do you also see all the other related low level knowledge as part of it (number of processors vs. number of cores on each processor, core affinity, cache size and so on)? I think that this is a rabbit hole.

I don't think that level of detail will be needed but who knows what the future will hold...

> > I disagree that it makes fingerprinting 3x easier.  I suspect it really only reveals 1 bit of information
> 
> Even so, 2x is not pleasant either. Not a show stopper (otherwise I'd just r-), but given the trivial utility of the feature, this can't be laughed off.
> 
> > Personally, I would prefer for WebKit to be on the forefront of improving the sorry state of concurrency on the web, and to me there is high cost to us rejecting a perfectly sensible API that is known to have productive and meaningful uses.
> 
> I agree with this statement (heck, I implemented Web Workers for us to be where the forefront was at the time). I agree even though poor adoption of Web Workers shows how much we overestimated the interest in such technologies.

yes, it's unfortunate that they haven't been super successful. It took a while for applications to use threads so maybe we just need to wait and make small improvements along the way.
Comment 84 Filip Pizlo 2014-05-16 21:09:41 PDT
I spoke with Alexey about this late this afternoon.  My understanding is that he intentionally did *not* R- the patch in its current form because although he opposes this, he won't object strongly enough to roll it out if it were landed.

He would like to see a demonstration that the polyfill is bad - either because it is inaccurate or because it takes too much time.

But, I think that the best thing to do at this point is to land this patch as is.

Thanks for all of the feedback, everyone!  And Rik, thanks for doing this and being patient through the entire process.
Comment 85 Alexey Proskuryakov 2014-05-16 22:52:11 PDT
> He would like to see a demonstration that the polyfill is bad - either because it is inaccurate or because it takes too much time.

Yep. The name polyfill implies that it's reasonably good, but if it's actually a demo that's completely unacceptable in production, this certainly changes a lot.

One of the best working principles of web API design is "paving the cowpaths". If people can make something work with effort, it's usually best to wait and see where the paths are (no matter how much your previous non-web experience suggests that it's a great idea that everyone should love). If it's not possible at all, then sometimes you have to take a leap of faith (and usually fail, but whatever).

> But, I think that the best thing to do at this point is to land this patch as is.

I'm not sure why you say so. Running an experiment with the "polyfill" is not a multi-week project, and it's much better to base decisions on facts than on guesswork.
Comment 86 Rik Cabanier 2014-05-17 19:47:34 PDT
(In reply to comment #85)
> > He would like to see a demonstration that the polyfill is bad - either because it is inaccurate or because it takes too much time.
> 
> Yep. The name polyfill implies that it's reasonably good, but if it's actually a demo that's completely unacceptable in production, this certainly changes a lot.

I tried running the polyfill on a couple of devices and browsers to see how good it was.

MacBook Pro 8 cpu
Chrome: found 8 CPUs took 2.2s
Firefox: found 4 CPUs took 4.5s
Safari: found 3 CPUs took 2s

Windows Core i5 with 4 cpus
Chrome: found 4 CPUs took 2.9s
Firefox: found 2 CPUs took 3.3s
IE: all over the map, 2 to 64 CPUs, took 5-95s

Windows Phone dual Core
IE 10: 3 CPUs took 49s

Apple iPad 2 dual core
Safari: 2 CPUs took 12.9s

This was running the benchmark page on an idle machine. Doing anything else during the test results in the wrong result. In production, the results might be more varied as the system might be busy downloading resources, compiling JavaScript, etc.

Given these results, I don't think the polyfill is good enough for estimating the number of cores.
Comment 87 Rik Cabanier 2014-05-17 19:50:04 PDT
(In reply to comment #86)
> Given these results, I don't think the polyfill is good enough for estimating the number of cores.

Also, it takes a long time to get a result on some machines which might be unacceptable.
Comment 88 Alexey Proskuryakov 2014-05-17 21:50:59 PDT
Yeah, that's pretty bad, and pretty much invalidates all arguments based on the existence of the polyfill (pro or contra, alike).
Comment 89 Filip Pizlo 2014-05-17 21:54:09 PDT
(In reply to comment #86)
> (In reply to comment #85)
> > > He would like to see a demonstration that the polyfill is bad - either because it is inaccurate or because it takes too much time.
> > 
> > Yep. The name polyfill implies that it's reasonably good, but if it's actually a demo that's completely unacceptable in production, this certainly changes a lot.
> 
> I tried running the polyfill on a couple of devices and browsers to see how good it was.
> 
> MacBook Pro 8 cpu
> Chrome: found 8 CPUs took 2.2s
> Firefox: found 4 CPUs took 4.5s
> Safari: found 3 CPUs took 2s
> 
> Windows Core i5 with 4 cpus
> Chrome: found 4 CPUs took 2.9s
> Firefox: found 2 CPUs took 3.3s
> IE: all over the map 2 to 64 CPUS took 5-95s
> 
> Windows Phone dual Core
> IE 10: 3 CPUs took 49s
> 
> Apple iPad 2 dual core
> Safari: 2 CPUs took 12.9s
> 
> This was running the benchmark page on an idle machine. Doing anything else during the test results in the wrong result. In production, the results might be more varied as the system might be busy downloading resources, compiling JavaScript, etc.
> 
> Given these results, I don't think the polyfill is good enough for estimating the number of cores.

Yes.  I think you should go ahead and land your patch.
Comment 90 Alexey Proskuryakov 2014-05-17 22:00:54 PDT
FWIW, one thing that seems likely based on these results is that algorithms using navigator.hardwareConcurrency will see poor results. If a trivial algorithm that spins cores sees such dramatically different behaviors across platforms, then real algorithms won't be able to depend on the number of cores either, and will need hacky manual adjustments.
Comment 91 Filip Pizlo 2014-05-17 22:33:04 PDT
(In reply to comment #90)
> FWIW, one thing that seems likely based on these results is that algorithms using navigator.hardwareConcurrency will see poor results. If a trivial algorithm that spins cores sees such dramatically different behaviors across platforms, then real algorithms won't be able to depend on the number of cores either, and will need hacky manual adjustments.

JavaScriptCore uses the native equivalent of hardwareConcurrency to decide how many threads to start and it doesn't have hacks (manual or otherwise) to work around the scheduling fuzz that derailed the polyfill.

It's true that if a parallel algorithm starts ncpu threads, it won't see ncpu performance scale-up on *every* execution.  That's fine.  Like most performance optimizations, the goal is to win in the average rather than to win all the time.  And, on average, starting ncpu threads results in better performance than starting any other number of threads.

Nobody actually cares if your parallel algorithm achieved the speed-up it was intended to achieve; all that matters is whether or not there was a speed-up and whether the amount of resources you used to achieve it was the minimum amount necessary.  NCPU is the best estimate available for answering this question for many algorithms, including those that we use in WebKit.

We've beat this to death.  It makes sense to land this.  It's a great feature to add to the web platform and we should welcome it.
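The "start ncpu threads" guidance above translates into a small amount of page code. A hedged sketch (the helper name and the fallback/cap values are illustrative choices, not part of any spec; only the `navigator.hardwareConcurrency` attribute itself is what this bug adds):

```javascript
// workerPoolSize: choose a Worker pool size from hardwareConcurrency,
// with a conservative fallback for engines that don't expose it and a
// cap to avoid oversubscribing unusually wide machines.
function workerPoolSize(nav) {
  const cores = (nav && typeof nav.hardwareConcurrency === 'number')
    ? nav.hardwareConcurrency
    : 2; // fallback when the attribute is absent (assumed default)
  return Math.max(1, Math.min(cores, 16)); // illustrative cap
}

// In a page this would be called as workerPoolSize(navigator) before
// spawning Workers.
```

As the preceding comment argues, sizing to ncpu wins on average even though any single run may be perturbed by other load on the machine.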
Comment 92 Luke Diggins 2014-05-17 22:41:04 PDT
(In reply to comment #90)
> FWIW, one thing that seems likely based on these results is that algorithms using navigator.hardwareConcurrency will see poor results. If a trivial algorithm that spins cores sees such dramatically different behaviors across platforms, then real algorithms won't be able to depend on the number of cores either, and will need hacky manual adjustments.

To me, those results reflect more on the individual browsers, with Chrome spot on and IE all over the place, than on the long-established measure debated at length here. In any case, loading up the CPU just to guess the number of processors is a poor kind of manual adjustment if your aim is to reduce overhead. Certainly I won't be doing that in my webapp.

Anyway, thanks again to Rik and the reviewers; I'm looking forward to seeing this become available, as it's an essential tool for my webapp.
Comment 93 Alexey Proskuryakov 2014-05-17 22:47:00 PDT
> We've beat this to death.  It makes sense to land this.  It's a great feature to add to the web platform and we should welcome it.

Phil, so far you were the only WebKit reviewer who publicly supported adding the feature (there was general support from Ben, but not specifically for exposing the number of cores). It's OK if you r+ the patch by yourself and get it landed over objections, but please don't speak for everyone.
Comment 94 Filip Pizlo 2014-05-17 22:51:39 PDT
(In reply to comment #93)
> > We've beat this to death.  It makes sense to land this.  It's a great feature to add to the web platform and we should welcome it.
> 
> Phil, so far you were the only WebKit reviewer who publicly supported adding the feature (there was general support from Ben, but not specifically for exposing the number of cores). It's OK if you r+ the patch by yourself and get it landed over objections, but please don't speak for everyone.

You're the only one continuing to object after all of the details have been discussed.  If someone else has objections to this API, they should speak up now.  I think it's OK for Rik to land this given all of the support it's received.
Comment 95 Rik Cabanier 2014-05-18 09:57:01 PDT
Created attachment 231657 [details]
Patch for landing
Comment 96 WebKit Commit Bot 2014-05-18 09:58:28 PDT
Comment on attachment 231657 [details]
Patch for landing

Rejecting attachment 231657 [details] from commit-queue.

Failed to run "['/Volumes/Data/EWS/WebKit/Tools/Scripts/webkit-patch', '--status-host=webkit-queues.appspot.com', '--bot-id=webkit-cq-01', 'apply-attachment', '--no-update', '--non-interactive', 231657, '--port=mac']" exit_code: 2 cwd: /Volumes/Data/EWS/WebKit

Last 500 characters of output:
/Scripts/webkitperl/FeatureList.pm
Hunk #1 succeeded at 336 (offset -4 lines).
patching file LayoutTests/ChangeLog
Hunk #1 succeeded at 1 with fuzz 3.
patching file LayoutTests/fast/dom/navigator-detached-no-crash-expected.txt
patching file LayoutTests/fast/dom/navigator-hardwareConcurrency-expected.txt
patching file LayoutTests/fast/dom/navigator-hardwareConcurrency.html

Failed to run "[u'/Volumes/Data/EWS/WebKit/Tools/Scripts/svn-apply', '--force']" exit_code: 1 cwd: /Volumes/Data/EWS/WebKit

Full output: http://webkit-queues.appspot.com/results/5436550225592320
Comment 97 Rik Cabanier 2014-05-18 12:59:24 PDT
Created attachment 231664 [details]
Patch for landing
Comment 98 WebKit Commit Bot 2014-05-18 13:37:13 PDT
Comment on attachment 231664 [details]
Patch for landing

Clearing flags on attachment: 231664

Committed r169017: <http://trac.webkit.org/changeset/169017>
Comment 99 WebKit Commit Bot 2014-05-18 13:37:25 PDT
All reviewed patches have been landed.  Closing bug.
Comment 100 Rik Cabanier 2014-05-18 16:02:53 PDT
(In reply to comment #93)
> > We've beat this to death.  It makes sense to land this.  It's a great feature to add to the web platform and we should welcome it.
> 
> Phil, so far you were the only WebKit reviewer who publicly supported adding the feature (there was general support from Ben, but not specifically for exposing the number of cores). It's OK if you r+ the patch by yourself and get it landed over objections, but please don't speak for everyone.

Thanks for exploring the issues around this problem. I'm sure more people will have doubts about this attribute, and this thread should settle most of them.
I will refer people to this discussion if they have questions.