I can see that the prefix sum is defined and used in the CUDA functions. I may be overlooking something, but I could not find the parallel prefix sum in the Python implementation selective_scan_ref. Is it only for ...
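For context on the question, here is a minimal sketch of how a step-by-step loop and a prefix scan can compute the same values. It assumes a first-order linear recurrence h_t = a_t * h_(t-1) + b_t with h starting at 0, which is an assumption made for illustration. This is not the code from the repository under discussion, and the names `combine`, `sequential_recurrence`, and `scan_recurrence` are hypothetical.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[float, float]  # (a, b) represents the affine step h -> a * h + b


def combine(left: Pair, right: Pair) -> Pair:
    """Compose two affine steps, applying `left` first and `right` second."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def sequential_recurrence(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Plain loop: h_t = a_t * h_(t-1) + b_t, starting from h = 0."""
    h = 0.0
    out = []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def scan_recurrence(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Hillis-Steele style inclusive scan over (a, b) pairs.

    The inner loop over i has no dependencies between iterations, so on
    parallel hardware each i could be handled by its own thread. Here it
    runs serially, purely to show the data flow.
    """
    pairs = list(zip(a, b))
    n = len(pairs)
    step = 1
    while step < n:
        nxt = list(pairs)
        for i in range(step, n):
            nxt[i] = combine(pairs[i - step], pairs[i])
        pairs = nxt
        step *= 2
    # With h starting at 0, the accumulated offset b is the state h_t.
    return [b_t for _, b_t in pairs]
```

Because `combine` is associative, the scan needs about log2(n) rounds instead of n dependent steps. A sequential reference can use the plain loop and still agree numerically (up to floating-point rounding) with a kernel that uses the scan.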
Learn about parallel prefix scan algorithms; compare their performance using different synchronization primitives; gain experience implementing your own barrier primitive; understand the performance ...
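The first goals above can be illustrated with a minimal sketch of a barrier-synchronized inclusive prefix sum. It assumes a Hillis-Steele scan with double buffering and uses the standard library's `threading.Barrier` in place of a hand-written barrier. The function name `barrier_prefix_sum` and the default of 4 threads are hypothetical choices for illustration. It is meant to show the synchronization pattern only, since CPython threads are unlikely to give a real speedup here.

```python
import threading
from typing import List


def barrier_prefix_sum(values: List[int], num_threads: int = 4) -> List[int]:
    """Inclusive prefix sum using a Hillis-Steele scan across threads."""
    n = len(values)
    if n == 0:
        return []
    num_threads = max(1, min(num_threads, n))

    # Two buffers: each round reads one and writes the other.
    buffers = [list(values), list(values)]
    barrier = threading.Barrier(num_threads)

    def worker(tid: int) -> None:
        cur = 0   # index of the buffer to read in this round
        step = 1  # distance to the partner element
        while step < n:
            read, write = buffers[cur], buffers[1 - cur]
            # Each thread handles indices tid, tid + num_threads, ...
            for i in range(tid, n, num_threads):
                if i >= step:
                    write[i] = read[i] + read[i - step]
                else:
                    write[i] = read[i]
            # Nobody may start the next round until every write is done.
            barrier.wait()
            cur = 1 - cur
            step *= 2

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # The result is in whichever buffer the final round wrote to.
    rounds = 0
    step = 1
    while step < n:
        rounds += 1
        step *= 2
    return buffers[rounds % 2]
```

Double buffering is what allows a single barrier per round: a thread that passes the barrier only overwrites the buffer that every thread has already finished reading. Swapping `threading.Barrier` for a custom barrier with the same `wait()` behavior would be one way to compare synchronization primitives.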