This affects the inner core of nearly all kernels and thus impacts
almost all hash modes. The only functional change is that we now
manually unroll the individual steps of the transform() functions,
saving a small amount of constant memory.
In most cases, JIT compilers would likely detect the unused constant
buffer and remove it automatically, but this makes it explicit.
Tested on newer NVIDIA devices: no speed change observed.
Tested on older NVIDIA devices: visible speed increase.
Tested on AMD devices: visible speed increase across all tested GPUs.
Not yet tested: CPUs, Intel iGPUs, Intel dGPUs.
Updated kernel declarations from "KERNEL_FQ void HC_ATTR_SEQ" to "KERNEL_FQ KERNEL_FA void". Please update your custom plugin kernels accordingly.
Added spilling size as a factor in calculating usable memory per device. This is based on undocumented variables and may not be 100% accurate, but it works well in practice.
Added a compiler hint to scrypt-based kernels indicating the guaranteed maximum thread count per kernel invocation.
Removed redundant kernel code 29800, as it is identical to 27700, and updated the plugin.