Bug 204871 - RESOLVED DUPLICATE of bug 215248
sched_yield can cause calling thread to be preempted for 10 ms on Darwin
https://bugs.webkit.org/show_bug.cgi?id=204871
Summary: sched_yield can cause calling thread to be preempted for 10 ms on Darwin
Ben Nham
Reported 2019-12-04 20:07:20 PST
We currently use sched_yield in both bmalloc and WTF::Lock. This can lead to the calling thread being preempted for up to 10 ms even if the lock was relinquished much earlier than that, because of the way this is implemented on Darwin:

- sched_yield calls the swtch_pri syscall
- swtch_pri depresses the calling thread's priority to 0 for thread_depress_time milliseconds
- thread_depress_time is set to one scheduling quantum in sched_*.c
- most of our devices run with a timer tick of 10 ms, so thread_depress_time is 10 ms
- if the system is busy with enough >0 pri work for 10 ms to saturate all possible CPUs, then the calling thread blocks for 10 ms

We actually see exactly this in a number of MobileSafari launch tests (this originally came up as a launch test variability issue from the perf team).

An alternative is to use thread_switch, which can depress the priority of the calling thread for a configurable amount of time with millisecond granularity.
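For context, here is a rough sketch of the two yield flavors discussed in this report. The helper names are made up for illustration, and the 1 ms depress interval matches the suggestion in comment 2 below; this is not WebKit's actual yield code.

```cpp
// Sketch only: the two Darwin yield primitives discussed in this bug.
#include <mach/mach.h>           // swtch_pri, thread_switch prototypes
#include <mach/thread_switch.h>  // SWITCH_OPTION_DEPRESS
#include <sched.h>               // sched_yield

// What sched_yield() boils down to on Darwin: swtch_pri() depresses the
// caller's priority for a full scheduler quantum (~10 ms with a 10 ms tick).
inline void yieldFullQuantum()
{
    swtch_pri(0);
}

// The proposed alternative: depress the caller's priority for only 1 ms.
inline void yieldWithShortDepress()
{
    thread_switch(MACH_PORT_NULL, SWITCH_OPTION_DEPRESS, 1 /* ms */);
}
```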
Attachments
Patch (3.73 KB, patch)
2019-12-04 20:15 PST, Ben Nham
fpizlo: review-
Ben Nham
Comment 1 2019-12-04 20:09:08 PST
Ben Nham
Comment 2 2019-12-04 20:10:42 PST
I ran this by Daniel in Core OS (who is one of the scheduler maintainers) and he said that thread_switch(MACH_PORT_NULL, SWITCH_OPTION_DEPRESS, 1) would do basically what sched_yield does, except only depress the priority for 1 ms instead of 10 ms. Of course the kernel folks would rather we use os_unfair_lock rather than rolling our own locks but I think that's probably a separate discussion.
Filip Pizlo
Comment 3 2019-12-04 20:14:59 PST
(In reply to Ben Nham from comment #2)
> I ran this by Daniel in Core OS (who is one of the scheduler maintainers)
> and he said that thread_switch(MACH_PORT_NULL, SWITCH_OPTION_DEPRESS, 1)
> would do basically what sched_yield does, except only depress the priority
> for 1 ms instead of 10 ms.
>
> Of course the kernel folks would rather we use os_unfair_lock rather than
> rolling our own locks but I think that's probably a separate discussion.

We use sched_yield because it’s faster than what Daniel suggests. We have tested this many times.

We use our own locks because they are faster for WebKit and because they can fit into 2 bits. We use that fact in JSCell. Even WTF::Lock is only one byte, which is 4x smaller than os_unfair_lock.
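To make the size argument above concrete, here is a minimal sketch of a one-byte lock next to os_unfair_lock. The TinyLock type is invented for illustration and is not WTF::Lock's actual algorithm (WTF::Lock parks contended threads via ParkingLot rather than yielding in a loop); it assumes Darwin's <os/lock.h> and a toolchain where std::atomic<uint8_t> occupies one byte.

```cpp
#include <atomic>
#include <cstdint>
#include <os/lock.h>  // os_unfair_lock, for the size comparison only
#include <sched.h>

// Illustrative one-byte test-and-set lock with a yield in its contention path.
struct TinyLock {
    std::atomic<uint8_t> isLocked { 0 };

    void lock()
    {
        // A real implementation would spin a bounded number of times and then
        // park the thread; yielding forever is only for illustration.
        while (isLocked.exchange(1, std::memory_order_acquire))
            sched_yield();
    }

    void unlock() { isLocked.store(0, std::memory_order_release); }
};

static_assert(sizeof(TinyLock) == 1, "fits in one byte, like WTF::Lock");
static_assert(sizeof(os_unfair_lock) == 4, "os_unfair_lock is four bytes");
```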
Ben Nham
Comment 4 2019-12-04 20:15:27 PST
Reopening to attach new patch.
Ben Nham
Comment 5 2019-12-04 20:15:28 PST
Ben Nham
Comment 6 2019-12-04 20:19:11 PST
How are we testing thread_switch vs sched_yield? There is some difference in using a Mach trap vs. a BSD syscall but I am somewhat surprised it's that large.

I have brought up lowering the amount of time that swtch_pri depresses the priority as well, but didn't get any response about that.

In any case, we are definitely seeing 10 ms yields on phones (see the radar); whether the calling thread could have woken up earlier is harder to determine from the profile that I was looking at. When the calling thread is the main thread then those 10 ms can count.
Ben Nham
Comment 7 2019-12-04 20:26:27 PST
Actually both swtch_pri and thread_switch seem to be Mach traps. There is a bit more logic in the latter call on the kernel side, but I am somewhat surprised that there would be that much of a difference. But if you have a benchmark in mind that I can try out I would be happy to do that.
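As a possible starting point for the benchmark question, here is a rough microbenchmark sketch (function name and iteration count chosen arbitrarily) comparing the raw per-call cost of sched_yield against the thread_switch variant on an otherwise idle system. It deliberately does not model contended locks, which is what the perf concerns in this thread are really about, so it is a sanity check rather than the requested perf data.

```cpp
#include <chrono>
#include <cstdio>
#include <mach/mach.h>
#include <mach/thread_switch.h>
#include <sched.h>

// Average nanoseconds per call of a yield primitive, measured naively.
template<typename F>
static double nsPerCall(F&& yieldOnce, int iterations = 100000)
{
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        yieldOnce();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(end - start).count() / iterations;
}

int main()
{
    std::printf("sched_yield:   %.0f ns/call\n", nsPerCall([] { sched_yield(); }));
    std::printf("thread_switch: %.0f ns/call\n",
        nsPerCall([] { thread_switch(MACH_PORT_NULL, SWITCH_OPTION_DEPRESS, 1); }));
    return 0;
}
```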
Filip Pizlo
Comment 8 2019-12-04 20:42:43 PST
(In reply to Ben Nham from comment #6)
> How are we testing thread_switch vs sched_yield? There is some difference in
> using a Mach trap vs. a BSD syscall but I am somewhat surprised it's that
> large.
>
> I have brought up lowering the amount of time that swtch_pri depresses the
> priority as well, but didn't get any response about that.
>
> In any case, we are definitely seeing 10 ms yields on phones (see the
> radar); whether the calling thread could have woken up earlier is harder to
> determine from the profile that I was looking at. When the calling thread is
> the main thread then those 10 ms can count.

I don’t think you’re understanding me. Your patch is going to be a performance regression.
Filip Pizlo
Comment 9 2019-12-04 20:43:25 PST
Comment on attachment 384875 [details]
Patch

We have tried this change before. It was bad for perf. Unless you’ve got overwhelming perf data, this ain’t a change we are going to land.
Filip Pizlo
Comment 10 2019-12-04 20:46:19 PST
(In reply to Ben Nham from comment #7)
> Actually both swtch_pri and thread_switch seem to be Mach traps. There is a
> bit more logic in the latter call on the kernel side, but I am somewhat
> surprised that there would be that much of a difference. But if you have a
> benchmark in mind that I can try out I would be happy to do that.

Any benchmark? We have tried your exact change about 10 times throughout the history of our locks. We have run all the benchmarks. We know it’s a bad idea.
Ben Nham
Comment 11 2020-08-08 17:30:27 PDT
We hadn't re-run the benchmarks since xnu made another change to thread_switch a couple of years ago to more closely match the behavior of swtch_pri. I wasn't able to finish running the benchmarks due to some automation issues, but it looks like Saam was able to do it last week. Duping to Saam's bug, which has the exact same patch.

*** This bug has been marked as a duplicate of bug 215248 ***