This change affects three key areas, each improving autotuning:
- Autotune refactoring itself
The main autotune algorithm had become too complex to maintain and has
now been rewritten from scratch. The engine is now closer to the old
v6.0.0 version, using a much more straightforward approach.
Additionally, the backend is now informed when the autotune engine runs
its operations and runs an extra invisible kernel invocation. This
significantly improves runtime accuracy because the same caching
mechanisms which kick in normal cracking sessions now also apply during
autotuning. This leads to more consistent and reliable automatic
workload tuning.
- Benchmarking and '--speed-only' accuracy bugs fixed
Benchmark runtimes had become too short, especially since the default
benchmark mask changed from '?b?b?b?b?b?b?b' to '?a?a?a?a?a?a?a?a'. For
very fast hashes like NTLM, benchmarks often stopped immediately when
base words needed to be regenerated, producing highly inaccurate
results.
This issue also misled users tuning '-n' values, as manually
oversubscribing kernels could mask the problem, creating the impression
that increasing '-n' had a larger impact on performance than it truly
does. While '-n' still has an effect, it’s not as significant. With this
fix, users achieve the same speed without needing to tune '-n' manually.
The bug was fixed by enforcing a minimum benchmark runtime of 4 seconds,
regardless of kernel runtime or kernel type. This ensures more stable
and realistic benchmark results, but typically increasing the benchmark
duration by up to 4 seconds.
- Kernel-Threads set to 32 and plugin configuration cleanup
Some plugin configurations existed solely to work around the old
benchmarking bug and can now be removed. For example,
'OPTS_TYPE_MAXIMUM_THREADS' is no longer required and has been removed
from all plugins, although the parameter itself remains to avoid
breaking custom plugins.
Because increasing threads beyond 32 no longer offers meaningful
performance gains, the default is now capped at 32 (unless overridden
with '-T'). This simplifies GPU memory management. Currently, work-item
counts are indirectly limited by buffer sizes (e.g., 'pws_buf[]'), which
must not exceed 4 GiB (a hard-coded limit). This buffer size depends on
the product of 'kernel-accel', 'kernel-threads', and the device’s
compute units. By reducing the default threads from 1024 to 32, there is
now more space available for base words.