My initial implementation of the retreating wavefront barrier required a store-load fence. This is not necessary if you do some double-buffering.
Created attachment 298366
work in progress

I ought to be able to benchmark this soon.
<rdar://problem/29934388>
Created attachment 298439
all of the optimizations _except_ the rescan/fence removal
It appears that the barrier based on the rescan bit is best for programs that repeatedly store into the same object. The barrier based on an optimized and inlined store-load fence is best for programs that obey the generational hypothesis. I think that the latter kind of barrier is also giving better splay-latency numbers. I'm going to create a new bug for the optimized barrier (without the rescan log) and get that landed. I'll leave this bug in limbo for now. Removing the fence seems to be a good idea sometimes, but not always.
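For anyone following along, here is a minimal sketch of the general shape of a retreating wavefront barrier that uses a store-load fence. It is not the code from any of the attachments. All of the names (`Cell`, `Heap`, `CellState`, `writeBarrier`, `storeField`, `collectorIsMarking`) are hypothetical, and the three-state scheme is an assumption made for illustration. It does not attempt to show the rescan-bit or double-buffering variant.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical cell states, assumed for this sketch only.
enum class CellState : uint8_t { White, Grey, Black };

struct Cell {
    std::atomic<CellState> state{CellState::White};
    std::atomic<Cell*> field{nullptr};
};

struct Heap {
    // Assumed flag: true while a concurrent collector may be scanning cells.
    std::atomic<bool> collectorIsMarking{false};
    // Cells the collector must visit again (touched by the mutator thread only).
    std::vector<Cell*> rememberedSet;
};

// Runs after a store into `owner`. Returns true if `owner` was re-greyed.
inline bool writeBarrier(Heap& heap, Cell* owner)
{
    assert(owner != nullptr);

    // Fast path: a cell that is not black needs no work. Stores into young,
    // not-yet-scanned cells are expected to stop here.
    if (owner->state.load(std::memory_order_relaxed) != CellState::Black)
        return false;

    if (heap.collectorIsMarking.load(std::memory_order_relaxed)) {
        // Store-load fence: order the preceding field store before the
        // re-check of the cell state below.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        if (owner->state.load(std::memory_order_relaxed) != CellState::Black)
            return false;
    }

    // Retreat the wavefront: the cell goes back to grey and gets rescanned.
    owner->state.store(CellState::Grey, std::memory_order_relaxed);
    heap.rememberedSet.push_back(owner);
    return true;
}

// Store first, then run the barrier on the cell that was written to.
inline bool storeField(Heap& heap, Cell* owner, Cell* value)
{
    owner->field.store(value, std::memory_order_relaxed);
    return writeBarrier(heap, owner);
}
```

In this sketch the fence is only reached when the stored-into cell is already black and the collector is running, so a program that mostly stores into new objects rarely pays for it. A program that keeps storing into the same old object can reach the slow path each time the collector turns that object black again.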
Created attachment 298443
it seems to work

But it doesn't seem as good as the TSO barrier, after the TSO barrier gets all of the other optimizations.