Fixed a long-standing critical bug on Apple Intel with Metal by patching inc_rp_optimized.cl.
Tested on Apple Intel and Apple Silicon with Metal/OpenCL, and on Linux with CUDA, HIP, and OpenCL (GPU/CPU).
Metal Backend: parallelize pipeline state object (PSO) compilation internally
Set the unexported setting setShouldMaximizeConcurrentCompilation to speed up the kernel build process on Apple Metal (Metal 3 or later only)
Updated kernel declarations from "KERNEL_FQ void HC_ATTR_SEQ" to "KERNEL_FQ KERNEL_FA void". Please update your custom plugin kernels accordingly.
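For plugin authors, the declaration change can be sketched as follows. This is a minimal illustration, not code from the repository: the macros are given empty placeholder definitions so the snippet compiles as plain C, and the kernel name, parameters, and body are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder definitions (assumption): in the real kernels these macros
   come from the shared include files and expand to backend-specific
   qualifiers and attributes. They are left empty here for illustration. */
#define KERNEL_FQ
#define KERNEL_FA

/* Old style, shown for comparison only:
   KERNEL_FQ void HC_ATTR_SEQ my_plugin_kernel (uint32_t *out, uint32_t gid) */

/* New style: KERNEL_FA follows KERNEL_FQ and precedes the return type.
   my_plugin_kernel and its parameters are hypothetical. */
KERNEL_FQ KERNEL_FA void my_plugin_kernel (uint32_t *out, uint32_t gid)
{
  assert (out != 0);

  out[gid] = gid * 2u;
}
```

In a custom kernel, only the declaration line should need to change; the parameter list and body stay as they were.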
Added spilling size as a factor in calculating usable memory per device. This is based on undocumented variables and may not be 100% accurate, but it works well in practice.
Added a compiler hint to scrypt-based kernels indicating the guaranteed maximum thread count per kernel invocation.
Removed the redundant kernel code for 29800, which was identical to 27700, and updated the plugin.
Substituted long parameter lists in ~2900 kernel function declarations with macros. This cleans up the code, reduces the probability of copy-paste errors, and highlights the differences between kernel functions. It also reduces the size of the OpenCL folder by ~3 MB.
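The idea behind replacing parameter lists with macros can be sketched in plain C. All names here (DEMO_KERN_ATTR_COMMON, DEMO_KERN_ATTR_SALTED, and the two demo kernels) are hypothetical and chosen for illustration; the actual macro names and parameters in the kernel sources differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical macro bundling the parameters that every kernel shares. */
#define DEMO_KERN_ATTR_COMMON \
  const uint32_t *in, uint32_t *out, const uint32_t count

/* Hypothetical variant: the common parameters plus one extra, so the
   difference between kernel flavours is visible at a glance. */
#define DEMO_KERN_ATTR_SALTED \
  DEMO_KERN_ATTR_COMMON, const uint32_t salt

/* Declaration using the shared list only. */
void demo_kernel_plain (DEMO_KERN_ATTR_COMMON)
{
  assert (in != 0 && out != 0);

  for (uint32_t i = 0; i < count; i++) out[i] = in[i] + 1u;
}

/* Declaration using the shared list plus the extra salt parameter. */
void demo_kernel_salted (DEMO_KERN_ATTR_SALTED)
{
  assert (in != 0 && out != 0);

  for (uint32_t i = 0; i < count; i++) out[i] = in[i] ^ salt;
}
```

Because the shared list is written once, a change to it propagates to every declaration that uses the macro, which is where the reduction in copy-paste errors comes from.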
Renamed pure kernels to default kernels
Replaced long option --length-limit-disable with --optimized-kernel-enable
Replaced short option -L with -O
Disabled --optimized-kernel-enable by default