Improved handling of an autotune edge case. In theory, increasing
accel early can improve accuracy, and in practice it does, but it
also prevents increasing the thread count because the run is more
likely to hit the runtime limit. However, we want to prioritize
threads over accel.
This change may slightly reduce performance for algorithms that
benefit from high accel and low thread counts (e.g., 7800, 14900),
but those can be managed by limiting thread count or, preferably,
by setting OPTS_TYPE_NATIVE_THREADS.
Added OPTS_TYPE_NATIVE_THREADS to 7800, 7810, and 14900.
Also fixed encoder bugs in hash-modes 29920 and 29940, identified
using the new test_edge.sh script. The encoders in these modules
failed to properly terminate the output string.
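
For illustration, the class of bug described above (an encoder that does not terminate its output string) can be sketched as follows. This is a minimal, generic example and not the code from modules 29920 or 29940; the function name encode_hex and its signature are hypothetical.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical encoder (not taken from the actual modules): writes the
// lowercase hex form of `in` into `out` and returns the number of
// characters written, excluding the terminator, or -1 if `out` is too small.
static int encode_hex (const unsigned char *in, size_t in_len, char *out, size_t out_size)
{
  static const char tbl[] = "0123456789abcdef";

  // Two output characters per input byte, plus one byte for the terminator.
  if (out_size < in_len * 2 + 1) return -1;

  size_t pos = 0;

  for (size_t i = 0; i < in_len; i++)
  {
    out[pos++] = tbl[in[i] >> 4];
    out[pos++] = tbl[in[i] & 0x0f];
  }

  // The step that is easy to miss: without it, callers that treat `out`
  // as a C string read whatever bytes were left in the buffer.
  out[pos] = '\0';

  return (int) pos;
}
```

A missing terminator often goes unnoticed when the output buffer happens to be zeroed beforehand, which is one reason edge-case tests are useful for this kind of bug.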
Update default hash settings to 64MiB:3:4 for Argon2 in -m 70000, following RFC 9106 recommendations.
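
As a rough sketch of what these defaults mean, assuming the triple 64MiB:3:4 is read as memory:iterations:parallelism: Argon2 expresses its memory cost in KiB, so 64 MiB corresponds to m=65536. The struct and function names below are hypothetical and only illustrate the conversion and the usual m=,t=,p= parameter notation.

```c
#include <stdio.h>

// Hypothetical container for the three Argon2 cost parameters.
typedef struct
{
  unsigned m_kib; // memory cost in KiB
  unsigned t;     // number of passes (iterations)
  unsigned p;     // degree of parallelism (lanes)

} argon2_cost_t;

// Convert a memory size given in MiB into the KiB unit used by Argon2.
static argon2_cost_t cost_from_mib (unsigned mem_mib, unsigned t, unsigned p)
{
  argon2_cost_t c = { mem_mib * 1024u, t, p };

  return c;
}

// Format the parameters in the common "m=...,t=...,p=..." notation.
// Returns the snprintf result (characters needed, excluding terminator).
static int format_cost (const argon2_cost_t *c, char *out, size_t out_size)
{
  return snprintf (out, out_size, "m=%u,t=%u,p=%u", c->m_kib, c->t, c->p);
}
```

With cost_from_mib (64, 3, 4), format_cost produces the string m=65536,t=3,p=4.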
Add option OPTS_TYPE_THREAD_MULTI_DISABLE: allows plugin developers to disable scaling of the password candidate batch size by the device thread count. This can be useful for very slow hash algorithms that use threads differently, e.g., when the algorithm itself allows parallelization. Note: the thread count for the device can still be set normally.
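
The effect on the batch size can be sketched roughly as below. This is a simplified model, not the actual backend code: it assumes the batch size is normally the product of compute units, threads, and accel, and the function batch_size and its parameters are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

// Simplified model (assumption): the number of password candidates per
// batch is compute units * accel, additionally multiplied by the thread
// count unless that scaling has been disabled.
static uint64_t batch_size (uint32_t processors, uint32_t threads, uint32_t accel, bool thread_multi_disable)
{
  uint64_t size = (uint64_t) processors * accel;

  // With the scaling disabled, threads no longer enlarge the batch; they
  // stay available for work inside a single candidate.
  if (thread_multi_disable == false) size *= threads;

  return size;
}
```

In this model, a device with 32 compute units, 256 threads, and accel 4 gets 32768 candidates per batch by default and 128 when the scaling is disabled, while the thread count itself is unchanged.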
Add options OPTI_TYPE_SLOW_HASH_DIMY_INIT/LOOP/COMP: enable 2D launches for the slow hash init/loop/comp kernels with dimensions X and Y. The Y value must be set via the salt->salt_dimy attribute.
Change autotune kernel-loops start value to the lowest multiple of the target hash iteration count, if kernel_loops_min permits.
Fix a bug in autotune where kernel_threads_max was not respected during the initial init and loop-prepare kernel runs.
Fix the automatic reduction of the kernel-accel maximum based on available memory per device by accounting for the additional size needed to handle register spilling.
Fix the tools/benchmark_deep.pl script to recognize benchmark masks more reliably.