Improved shared memory handling for -m 10700. Removed the hard-coded limit of 256 threads; the device's shared memory pool is now checked dynamically and the thread count adapted accordingly.
Implemented a feature request to display non-default session names early during startup.
Added a check for the number of registers required by a kernel (CUDA and HIP only). This allows us to estimate the maximum threads per block before entering the auto-tune engine and make pre-adjustments (see the sketch below).
Fixed Metal command encoder argument to work with the new auto-tuner's extra kernel invocation.
Fixed incorrect host memory calculation logic during automatic kernel-accel reduction for scrypt-based algorithms. This ensures memory constraints are respected.
Improved several plugins by setting maximum loop counts, and others by using the OPTS_TYPE_NATIVE_THREADS option.
Fixed compilation on Apple platforms by excluding '#include <sys/sysinfo.h>'.
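For the register-count check mentioned above, a minimal sketch of the idea using the CUDA driver API could look like this. It is illustrative only, not the actual hashcat implementation (the function name is made up; HIP exposes equivalent attribute queries):

    #include <cuda.h>

    // Rough estimate of the maximum threads per block based on how many
    // registers one thread of the kernel needs. Error handling trimmed.
    static int estimate_max_threads (CUfunction func, CUdevice dev)
    {
      int regs_per_thread = 0;
      int regs_per_block  = 0;

      cuFuncGetAttribute   (&regs_per_thread, CU_FUNC_ATTRIBUTE_NUM_REGS, func);
      cuDeviceGetAttribute (&regs_per_block, CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK, dev);

      if (regs_per_thread <= 0) return -1; // no register information available

      return regs_per_block / regs_per_thread; // upper bound for the auto-tuner
    }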
This change affects three key areas, each improving autotuning:
- Autotune refactoring itself
The main autotune algorithm had become too complex to maintain and has
now been rewritten from scratch. The engine is now closer to the old
v6.0.0 version, using a much more straightforward approach.
Additionally, the backend is now informed when the autotune engine runs
its operations, and an extra, invisible kernel invocation is performed.
This significantly improves runtime accuracy, because the same caching
mechanisms that kick in during normal cracking sessions now also apply
during autotuning. This leads to more consistent and reliable automatic
workload tuning.
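A minimal, self-contained sketch of the warm-up idea (all names are
placeholders, not hashcat's real interfaces): the untimed extra
invocation makes the measured pass hit warm caches, just like during a
real attack.

    #include <time.h>

    static double now_ms (void)
    {
      struct timespec ts;

      clock_gettime (CLOCK_MONOTONIC, &ts);

      return (ts.tv_sec * 1000.0) + (ts.tv_nsec / 1000000.0);
    }

    static void run_kernel (int accel, int threads)
    {
      (void) accel; (void) threads; // stand-in for a real kernel launch
    }

    static double timed_run (int accel, int threads)
    {
      run_kernel (accel, threads);  // extra, untimed warm-up invocation

      const double t0 = now_ms ();

      run_kernel (accel, threads);  // only this pass is measured

      return now_ms () - t0;
    }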
- Benchmarking and '--speed-only' accuracy bugs fixed
Benchmark runtimes had become too short, especially since the default
benchmark mask changed from '?b?b?b?b?b?b?b' to '?a?a?a?a?a?a?a?a'. For
very fast hashes like NTLM, benchmarks often stopped immediately when
base words needed to be regenerated, producing highly inaccurate
results.
This issue also misled users tuning '-n' values, as manually
oversubscribing kernels could mask the problem, creating the impression
that increasing '-n' had a larger impact on performance than it truly
does. While '-n' still has an effect, it’s not as significant. With this
fix, users achieve the same speed without needing to tune '-n' manually.
The bug was fixed by enforcing a minimum benchmark runtime of 4 seconds,
regardless of kernel runtime or kernel type. This ensures more stable
and realistic benchmark results, but typically increases the benchmark
duration by up to 4 seconds.
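As a rough illustration of the 4-second floor (not the actual engine
code), the benchmark keeps re-invoking the kernel until at least 4
seconds of wall time have passed and only then derives the speed:

    // Reuses the now_ms()/run_kernel() placeholders from the previous sketch
    double now_ms (void);
    void   run_kernel (int accel, int threads);

    #define MIN_BENCH_RUNTIME_MS 4000.0

    // candidates_per_run: number of password candidates one invocation tests
    static double benchmark_speed (double candidates_per_run)
    {
      double tested  = 0;
      double elapsed = 0;

      const double t0 = now_ms ();

      do
      {
        run_kernel (0, 0);        // placeholder kernel launch

        tested  += candidates_per_run;
        elapsed  = now_ms () - t0;

      } while (elapsed < MIN_BENCH_RUNTIME_MS);

      return tested / (elapsed / 1000.0); // hashes per second
    }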
- Kernel-Threads set to 32 and plugin configuration cleanup
Some plugin configurations existed solely to work around the old
benchmarking bug and can now be removed. For example,
'OPTS_TYPE_MAXIMUM_THREADS' is no longer required and has been removed
from all plugins, although the parameter itself remains to avoid
breaking custom plugins.
Because increasing threads beyond 32 no longer offers meaningful
performance gains, the default is now capped at 32 (unless overridden
with '-T'). This simplifies GPU memory management. Currently, work-item
counts are indirectly limited by buffer sizes (e.g., 'pws_buf[]'), which
must not exceed 4 GiB (a hard-coded limit). This buffer size depends on
the product of 'kernel-accel', 'kernel-threads', and the device’s
compute units. By reducing the default threads from 1024 to 32, there is
now more space available for base words.
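To see why the lower default matters, here is a back-of-the-envelope
calculation; the per-entry size and device figures are assumptions
picked for illustration, not hashcat's real constants:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int main (void)
    {
      const uint64_t entry_size    = 256;  // assumed bytes per base word entry
      const uint64_t compute_units = 128;  // example device
      const uint64_t kernel_accel  = 1024; // example accel value

      // pws_buf grows with accel * threads * compute units
      const uint64_t old_buf = entry_size * compute_units * kernel_accel * 1024; // 1024 threads
      const uint64_t new_buf = entry_size * compute_units * kernel_accel *   32; //   32 threads

      printf ("1024 threads: %" PRIu64 " GiB\n", old_buf >> 30); // 32 GiB, far over the 4 GiB cap
      printf ("  32 threads: %" PRIu64 " GiB\n", new_buf >> 30); //  1 GiB, leaves headroom

      return 0;
    }

With these assumed numbers, 1024 threads would blow through the 4 GiB
cap and force kernel-accel down, while 32 threads keep the same accel
value well within budget.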
General:
The logic for calculating the SCRYPT workload has been moved
from module_extra_buffer_size() to module_extra_tuningdb_block().
Previously, module_extra_tuningdb_block() just returned values from a
static tuning file. Now it actually computes tuning values on the fly,
based on the device's resources and the SCRYPT parameters. This was
always possible; it just wasn't used that way until now.
After running the calculation, the calculated kernel_accel value
is injected into the tuning database as if it had come from a
file. The tmto value is stored internally.
Users can still override kernel-threads, kernel-accel, and
scrypt-tmto via the command line or via the tuningdb file.
module_extra_tuningdb_block():
This is now where kernel_accel and tmto are automatically
calculated.
The logic for accel and tmto is now separated and more flexible.
Whether the user relies on defaults, tuningdb entries, or manual
command line overrides, the code tries to make smart choices based
on what's actually available on the device.
First, it tries to find a kernel_accel value that fits into
available memory. It starts with a base value and simulates
tmto=1 or 2 (which is typically a good choice on GPUs).
It also leaves room for other buffers (like pws[], tmps[], etc.).
If the result is close to the actual processor count,
it gets clamped.
This value is then added to the tuning database, so hashcat can pick
it up during startup.
Once that's set, it derives tmto using available memory, thread
count, and the actual SCRYPT parameters.
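Put together, the selection could be sketched roughly like this. It is
simplified and hedged: the names, the base value of 8, the clamping
rule, and folding the accel and tmto steps into one loop are all
illustrative choices, not the module's actual code.

    #include <stdint.h>

    typedef struct
    {
      uint64_t scrypt_N, scrypt_r;   // SCRYPT parameters
      uint64_t free_mem;             // usable device memory in bytes
      uint64_t reserved;             // room kept for pws[], tmps[], ...
      uint32_t compute_units;
      uint32_t kernel_threads;
    } dev_info_t;

    // Memory the SCRYPT B[] buffer would need for a given accel/tmto combo
    static uint64_t scrypt_b_size (const dev_info_t *d, uint32_t accel, uint32_t tmto)
    {
      const uint64_t per_workitem = (128 * d->scrypt_r * d->scrypt_N) >> tmto;

      return per_workitem * d->compute_units * d->kernel_threads * accel;
    }

    static uint32_t find_kernel_accel (const dev_info_t *d, uint32_t *tmto_out)
    {
      const uint64_t budget = d->free_mem - d->reserved;

      for (uint32_t tmto = 1; tmto <= 2; tmto++)       // tmto=1..2 is typically good on GPUs
      {
        for (uint32_t accel = 8; accel >= 1; accel--)  // 8 is an arbitrary base value for the sketch
        {
          if (scrypt_b_size (d, accel, tmto) > budget) continue;

          if (accel > d->compute_units) accel = d->compute_units; // clamp to a sane range

          *tmto_out = tmto;

          return accel;                                // injected into the tuning database
        }
      }

      *tmto_out = 2;

      return 1;                                        // nothing fits: smallest possible workload
    }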
module_extra_buffer_size():
This function now just returns the size of the SCRYPT B[] buffer,
based on the tmto that was already calculated.
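As a worked example (assumed figures, for illustration only): classic
SCRYPT parameters N=16384, r=8 need 128 * r * N = 16 MiB of B[] per
work-item without a tradeoff; with tmto=2 only every fourth element is
kept, so 4 MiB per work-item. On a device running 32 compute units *
32 threads * kernel_accel=1 = 1024 work-items, the function would
report 1024 * 4 MiB = 4 GiB.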
kernel_threads:
Defaults are now set to 32 threads in most cases. On AMD GPUs,
64 threads might give a slight performance bump, but 32 is more
consistent and reliable.
For very memory-heavy algorithms (like Ethereum Wallet), it
scales down the thread count.
Here's a rough reference for other SCRYPT-based modes:
- 64 MiB: 16 threads
- 256 MiB: 4 threads
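A hedged sketch of that scaling rule: the thresholds mirror the rough
reference above, but the function itself is illustrative, not the
plugin code.

    // Pick a SCRYPT thread count from the per-hash memory requirement
    static unsigned int scrypt_kernel_threads (unsigned long long mem_per_hash)
    {
      if (mem_per_hash >= (256ULL << 20)) return  4; // >= 256 MiB per hash
      if (mem_per_hash >= ( 64ULL << 20)) return 16; // >=  64 MiB per hash

      return 32;                                     // default for everything lighter
    }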
Tuning files:
All built-in tuningdb entries have been removed, because they
shouldn’t be needed anymore. But you can still add custom entries
if needed. There’s even a commented-out example in the tuningdb
file for mode 22700.
Free memory handling:
Getting the actual amount of free GPU memory is critical for
this to work right. Unfortunately, none of the common GPGPU APIs
give reliable numbers. We now query low-level interfaces like
SYSFS (AMD) and NVML (NVIDIA). Support for those APIs is in
place already, except for ADL, which still needs to be added.
Because of this, hwmon support (which handles those low-level
queries) can no longer be disabled.
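For reference, a minimal standalone query of free VRAM through NVML
could look like the sketch below (error paths trimmed); on AMD, the
same information can be read from the amdgpu sysfs files named in the
comment. This only illustrates the interfaces involved, it is not
hashcat's hwmon code.

    #include <nvml.h>
    #include <stdio.h>

    int main (void)
    {
      nvmlDevice_t dev;
      nvmlMemory_t mem;

      if (nvmlInit () != NVML_SUCCESS) return 1;

      if (nvmlDeviceGetHandleByIndex (0, &dev)   != NVML_SUCCESS) return 1;
      if (nvmlDeviceGetMemoryInfo    (dev, &mem) != NVML_SUCCESS) return 1;

      printf ("free VRAM: %llu MiB\n", (unsigned long long) (mem.free >> 20));

      // AMD (amdgpu) equivalent via SYSFS:
      //   /sys/class/drm/card0/device/mem_info_vram_total
      //   /sys/class/drm/card0/device/mem_info_vram_used

      nvmlShutdown ();

      return 0;
    }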