Summary:     Use std::fill_n() instead of for loops in AudioParam::calculateFinalValues()
Product:     WebKit
Component:   Web Audio
Reporter:    Chris Dumez <cdumez>
Assignee:    Chris Dumez <cdumez>
Status:      RESOLVED FIXED
Severity:    Normal
Priority:    P2
Keywords:    InRadar
Version:     WebKit Nightly Build
Hardware:    Unspecified
OS:          Unspecified
CC:          cdumez, darin, eric.carlson, ews-watchlist, ggaren, glenn, jer.noble, philipj, sam, sergio, webkit-bug-importer
Bug Blocks:  212611
Attachments: Patch (attachment 411460)
Description (Chris Dumez, 2020-10-15 10:43:05 PDT)

Created attachment 411460 [details]
Patch

Comment on attachment 411460 [details] Patch

r=me
Comment 3 (Darin Adler)

Comment on attachment 411460 [details] Patch
View in context: https://bugs.webkit.org/attachment.cgi?id=411460&action=review

> Source/WebCore/ChangeLog:8
> +        Use std::fill_n() instead of for loops in AudioParam::calculateFinalValues().

Does this get us any vectorization or parallelization? I know that's the long-term ambition of the C++ library; they even have std::execution::seq/par/par_unseq/unseq that you can pass as the first argument.

Comment 4 (Chris Dumez)

(In reply to Darin Adler from comment #3)
I actually don't know. I figured the code was more concise with std::fill_n() and we *might* get more optimized code. I guess I could write a simple benchmark to compare the two.

Also note that we could very easily add a VectorMath function that uses vDSP_vfill() [1] to guarantee we get vectorization. What do you think?

[1] https://developer.apple.com/documentation/accelerate/1450501-vdsp_vfill?language=objc

Comment 5

Comment on attachment 411460 [details] Patch
View in context: https://bugs.webkit.org/attachment.cgi?id=411460&action=review

Like most optimization situations, I think:

1) We should optimize if it makes a measurable difference.
2) Outside of that, we should choose an idiom that is both easy to understand and reasonably optimized by default.

I think that using std::fill_n() already accomplishes (2), and we could go further if we measure something that shows an optimization opportunity.

Comment 6 (Sam Weinig)

(In reply to Chris Dumez from comment #4)
I think using Accelerate as much as possible in VectorMath is what we should be doing. I really see it as a "platform" abstraction around it.

I still think using std::fill_n() here for the non-HAVE(ACCELERATE) case is the right way to go. It doesn't guarantee any vectorization, but since it is closer to the compiler, if autovectorization is in the compiler, it is more likely the compiler will ensure it works with standard library idioms.

Comment 7 (Chris Dumez)

(In reply to Sam Weinig from comment #6)
Yes, I think this is a good idea. I will still benchmark std::fill_n() and vDSP_vfill() because I am curious now :)

Comment 8 (Chris Dumez)

Interestingly, std::fill_n() seems consistently faster for large arrays than vDSP_vfill() on my MacBook Pro:

    std::fill_n() took 1.0884ms
    vDSP::vfill() took 1.20753ms

Benchmark:

    // clang++ -O2 -std=c++14 -framework Accelerate fill_benchmark.cpp -o fill_benchmark

    #include <algorithm>
    #include <chrono>
    #include <iostream>
    #include <Accelerate/Accelerate.h>

    int main()
    {
        constexpr unsigned N = 524288;
        const float pi = 3.1415926535;
        float array1[N];
        float array2[N];

        auto start = std::chrono::steady_clock::now();
        std::fill_n(array1, N, pi);
        auto end = std::chrono::steady_clock::now();

        std::chrono::duration<double> diff = end - start;
        std::cout << "std::fill_n() took " << diff.count() * 1000 << "ms\n";

        start = std::chrono::steady_clock::now();
        vDSP_vfill(&pi, array2, 1, N);
        end = std::chrono::steady_clock::now();

        diff = end - start;
        std::cout << "vDSP::vfill() took " << diff.count() * 1000 << "ms\n";

        std::cout << array1[2000] << " " << array2[2000] << "\n";
        return 0;
    }

Comment 9 (Sam Weinig)

Interesting. Probably worth a radar to the Accelerate folks.

(In reply to Sam Weinig from comment #9)
Ok, rdar://problem/70351530.

Committed r268553: <https://trac.webkit.org/changeset/268553>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 411460 [details].