- Replaced inline asm in hc_byte_perm() with __builtin_amdgcn_perm()
- Replaced inline asm in hc_bytealign() with __builtin_amdgcn_alignbyte()
- Defined HC_INLINE as default for HIP, significantly boosting performance of pure kernels
- Removed IS_ROCM from inc_vendor.h as it's no longer needed
- Removed backend-specific code from several hash-modes and inc_rp_optimized.cl, as hc_bytealign_S() is now available on all backends
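For readers unfamiliar with the byte-align operation mentioned in the entries above, here is a minimal portable sketch of its commonly documented semantics: the two 32-bit inputs are treated as one 64-bit value, shifted right by a whole number of bytes, and the low 32 bits are kept. The name `ref_bytealign` is hypothetical and this is not the code from the kernel headers; it only illustrates the behaviour the builtin is assumed to provide.

```c
#include <stdint.h>

/* Hypothetical reference helper (not from the source tree).
 * Concatenates hi:lo into a 64-bit value, shifts it right by
 * (sel & 3) bytes and returns the low 32 bits. */
static uint32_t ref_bytealign (uint32_t hi, uint32_t lo, uint32_t sel)
{
  const uint64_t wide = ((uint64_t) hi << 32) | (uint64_t) lo;

  /* Only the two low bits of the selector are used. */
  return (uint32_t) (wide >> ((sel & 3) * 8));
}
```

With `hi = 0x11223344` and `lo = 0x55667788`, a selector of 1 yields `0x44556677`: one byte of `hi` is shifted into the top of `lo`.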
Fixed parameter types in inc_hash_blake2b.cl and inc_hash_blake2s.cl for the FINAL value.
Added kernel code for -m 15400 to s04/s08/m04/m08, even if not needed, to help autotune find optimal workitem settings.
Fixed a rare autotune case (e.g. in mode 18600) where threads_min was not a multiple of kernel_preferred_wgs_multiple. As long as only threads_min is affected and not threads_max, threads_min is now raised to at least kernel_preferred_wgs_multiple.
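One possible reading of this adjustment, as a rough sketch only: the helper name `fix_threads_min` and the exact condition are assumptions made for illustration, not the actual autotune code. Here "affected" is interpreted as "not a multiple of the preferred work-group size multiple".

```c
#include <stdint.h>

/* Hypothetical helper illustrating one interpretation of the fix.
 * If threads_min is misaligned while threads_max is aligned,
 * make sure threads_min is at least wgs_multiple. */
static uint32_t fix_threads_min (uint32_t threads_min, uint32_t threads_max, uint32_t wgs_multiple)
{
  if (wgs_multiple == 0) return threads_min; /* nothing to align to */

  const int min_affected = (threads_min % wgs_multiple) != 0;
  const int max_affected = (threads_max % wgs_multiple) != 0;

  if (min_affected && !max_affected && threads_min < wgs_multiple)
  {
    return wgs_multiple; /* assumption: raise to the multiple itself */
  }

  return threads_min;
}
```

Under this reading, `threads_min = 1`, `threads_max = 1024` and a multiple of 64 gives a new `threads_min` of 64, while an already aligned `threads_min` is left unchanged.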
Improved autotune logic for best thread count: double thread count until reaching the device's preferred multiple, then increase in steps of that multiple while comparing efficiency vs. runtime, and select the configuration with best efficiency, not highest thread count.
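The search described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual autotune implementation: `pick_threads`, `measure_ms_fn` and `toy_runtime_ms` are hypothetical names, "efficiency" is assumed to mean threads per millisecond, and the toy cost model stands in for a real kernel timing.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: returns the measured runtime in ms for a thread count. */
typedef double (*measure_ms_fn) (uint32_t threads, void *ctx);

/* Toy cost model (assumption): flat runtime up to 256 threads, linear growth after. */
static double toy_runtime_ms (uint32_t threads, void *ctx)
{
  (void) ctx;

  return (threads <= 256) ? 1.0 : (double) threads / 128.0;
}

/* Doubles the thread count until the preferred multiple is reached, then
 * steps by that multiple, and returns the candidate with the best
 * efficiency (assumed here to be threads per ms). */
static uint32_t pick_threads (uint32_t threads_min, uint32_t threads_max, uint32_t multiple, measure_ms_fn measure, void *ctx)
{
  uint32_t best_threads = threads_min;
  double   best_eff     = 0.0;

  uint32_t t = threads_min;

  while (t <= threads_max)
  {
    const double ms  = measure (t, ctx);
    const double eff = (ms > 0.0) ? (double) t / ms : 0.0;

    /* Strict comparison: on a tie, the smaller thread count wins. */
    if (eff > best_eff) { best_eff = eff; best_threads = t; }

    uint32_t next;

    if (t < multiple)
    {
      next = t * 2;                         /* phase 1: double */
      if (next > multiple) next = multiple; /* land exactly on the multiple */
    }
    else
    {
      next = t + multiple;                  /* phase 2: step by the multiple */
    }

    if (next <= t) break; /* guards against zero steps and overflow */

    t = next;
  }

  return best_threads;
}
```

With the toy model, `pick_threads (1, 1024, 64, toy_runtime_ms, NULL)` tries 1, 2, 4, ..., 64, 128, 192, 256, 320, ... and settles on 256 rather than 1024, because efficiency drops once the runtime starts to grow.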
Always set funnelshift support to true for HIP devices, as HIP always reports it as false.
Set minimum loop count to 250 for all VeraCrypt modes with PIM brute-force support.