Improved shared memory handling for -m 10700. Removed the hard-coded limit of 256 threads; the device's shared memory pool is now checked dynamically and the thread count adapted accordingly.
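A minimal sketch of the idea (OpenCL shown; not hashcat's actual code, and the per-thread footprint value is only a placeholder): query the device's local/shared memory size and derive the thread limit from it instead of hard-coding 256.

```c
// Sketch only: derive a thread limit for a shared-memory-heavy kernel from
// the device's local memory size instead of a hard-coded 256 threads.
#include <CL/cl.h>
#include <stddef.h>

#define PER_THREAD_SHMEM 128 // assumed per-thread shared/local memory footprint in bytes

static size_t max_threads_for_shmem (cl_device_id device)
{
  cl_ulong local_mem_size = 0;
  size_t   max_wg_size    = 0;

  clGetDeviceInfo (device, CL_DEVICE_LOCAL_MEM_SIZE,      sizeof (local_mem_size), &local_mem_size, NULL);
  clGetDeviceInfo (device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof (max_wg_size),    &max_wg_size,    NULL);

  size_t threads = (size_t) (local_mem_size / PER_THREAD_SHMEM);

  if (threads > max_wg_size) threads = max_wg_size;
  if (threads == 0)          threads = 1;

  return threads;
}
```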
Implemented a feature request to display non-default session names early during startup.
Added a check for the number of registers required by a kernel (CUDA and HIP only). This allows us to estimate the max threads per block before entering the auto-tune engine and make pre-adjustments.
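As a rough illustration (CUDA driver API shown; the HIP path is analogous via hipFuncGetAttribute and HIP_FUNC_ATTRIBUTE_NUM_REGS; this is a sketch, not the actual implementation), the register count can be turned into a threads-per-block estimate like this:

```c
// Sketch: estimate a threads-per-block upper bound from the kernel's
// per-thread register usage and the device's per-block register budget.
#include <cuda.h>

static int estimate_max_threads_per_block (CUfunction func, CUdevice dev)
{
  int regs_per_thread = 0;
  int regs_per_block  = 0;

  cuFuncGetAttribute   (&regs_per_thread, CU_FUNC_ATTRIBUTE_NUM_REGS,                  func);
  cuDeviceGetAttribute (&regs_per_block,  CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK, dev);

  if (regs_per_thread <= 0) return 1024; // no register info available: fall back to the device maximum

  int threads = regs_per_block / regs_per_thread;

  threads = (threads / 32) * 32; // round down to a multiple of the warp size

  if (threads < 32)   threads = 32;
  if (threads > 1024) threads = 1024;

  return threads;
}
```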
Fixed Metal command encoder argument to work with the new auto-tuner's extra kernel invocation.
Fixed incorrect host memory calculation logic during automatic kernel-accel reduction for scrypt-based algorithms. This ensures memory constraints are respected.
Improved several plugins by setting maximum loop counts, and others by using the OPTS_TYPE_NATIVE_THREADS option.
Fixed compilation on Apple platforms by excluding '#include <sys/sysinfo.h>'.
- Integrated occupancy hints from vendor APIs (CUDA, HIP) to set a
dynamic threads-per-block limit per kernel instead of using static
values (see the occupancy/spill sketch after this list).
- Added `find_tuning_function()` to identify the relevant kernel.
- Autotuner now runs in three stages: threads -> loops -> accel. The
first two stages now stop increasing when the tested kernel runtime
gets too close to the target runtime (96ms for `-w 3`), leaving
headroom for the next stage to make finer adjustments.
- Accel tuning now uses a capped floating-point multiplier instead of
powers of two.
- Removed workarounds for missing thread autotuning in plugins.
- Removed the hardcoded 4 GiB host memory limit for accel. Added a
cross-platform `get_free_memory()` (sketched after this list) to check
actual free RAM during GPU initialization, preventing underutilization
of high-end GPUs like the 4090. If needed, users can still cap memory
usage with `-T` or `-n`.
- Updated enums for ROCm 6.4.x and CUDA 12.9.
- Added code to detect kernel register spilling, so that enough global
memory can be kept free for the runtime to handle spills efficiently
(also shown in the occupancy/spill sketch after this list).
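The occupancy hint and the spill check can be pictured roughly as follows (CUDA driver API; HIP offers analogous calls such as hipModuleOccupancyMaxPotentialBlockSize; a sketch under those assumptions, not the actual code):

```c
// Sketch: query an occupancy-based block-size hint and check whether the
// kernel spills registers into local (per-thread global) memory.
#include <cuda.h>
#include <stdio.h>

static void query_kernel_hints (CUfunction func)
{
  int min_grid_size = 0;
  int block_size    = 0;

  // block size the driver considers optimal for this kernel's resource usage
  cuOccupancyMaxPotentialBlockSize (&min_grid_size, &block_size, func, NULL, 0, 0);

  // local memory per thread > 0 usually indicates register spilling
  int local_bytes = 0;

  cuFuncGetAttribute (&local_bytes, CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES, func);

  printf ("occupancy block-size hint: %d\n", block_size);
  printf ("local memory per thread  : %d bytes%s\n", local_bytes, (local_bytes > 0) ? " (likely spilling)" : "");
}
```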
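A cross-platform free-RAM query in the spirit of `get_free_memory()` might look like the following (the exact sources and structure are assumptions, not the actual implementation; note how <sys/sysinfo.h> is only pulled in on Linux):

```c
// Sketch: report free physical memory on Windows, macOS and Linux.
#if defined (_WIN32)
#include <windows.h>
#elif defined (__APPLE__)
#include <mach/mach.h>
#else
#include <sys/sysinfo.h>
#endif

#include <stdint.h>

static uint64_t get_free_memory (void)
{
  #if defined (_WIN32)
  MEMORYSTATUSEX ms;
  ms.dwLength = sizeof (ms);
  GlobalMemoryStatusEx (&ms);
  return (uint64_t) ms.ullAvailPhys;
  #elif defined (__APPLE__)
  mach_port_t host = mach_host_self ();
  vm_size_t page_size = 0;
  host_page_size (host, &page_size);
  vm_statistics64_data_t vm_stat;
  mach_msg_type_number_t count = HOST_VM_INFO64_COUNT;
  host_statistics64 (host, HOST_VM_INFO64, (host_info64_t) &vm_stat, &count);
  return (uint64_t) vm_stat.free_count * (uint64_t) page_size;
  #else
  struct sysinfo si;
  sysinfo (&si);
  return (uint64_t) si.freeram * (uint64_t) si.mem_unit;
  #endif
}
```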
This change affects three key areas, each improving autotuning:
- Autotune refactoring itself
The main autotune algorithm had become too complex to maintain and has
now been rewritten from scratch. The engine is now closer to the old
v6.0.0 version, using a much more straightforward approach.
Additionally, the backend is now informed when the autotune engine runs
its operations and runs an extra invisible kernel invocation. This
significantly improves runtime accuracy because the same caching
mechanisms which kick in during normal cracking sessions now also apply during
autotuning. This leads to more consistent and reliable automatic
workload tuning.
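The effect of the extra, unmeasured invocation can be sketched like this (illustrative only; the function names and signatures are invented for the example, not hashcat's API):

```c
// Sketch: run one untimed "warm-up" invocation so caches are hot, then time
// the invocation that the autotune engine actually evaluates.
#include <stdint.h>

typedef void     (*kernel_run_fn) (void *ctx, uint32_t threads, uint32_t accel, uint32_t loops);
typedef uint64_t (*timer_fn)      (void); // monotonic clock in nanoseconds

static double measure_kernel_ms (kernel_run_fn run, timer_fn now, void *ctx,
                                 uint32_t threads, uint32_t accel, uint32_t loops)
{
  // extra invisible invocation: not measured, only warms caches and JIT state
  run (ctx, threads, accel, loops);

  const uint64_t t0 = now ();

  run (ctx, threads, accel, loops);

  const uint64_t t1 = now ();

  return (double) (t1 - t0) / 1000000.0;
}
```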
- Benchmarking and '--speed-only' accuracy bugs fixed
Benchmark runtimes had become too short, especially since the default
benchmark mask changed from '?b?b?b?b?b?b?b' to '?a?a?a?a?a?a?a?a'. For
very fast hashes like NTLM, benchmarks often stopped immediately when
base words needed to be regenerated, producing highly inaccurate
results.
This issue also misled users tuning '-n' values, as manually
oversubscribing kernels could mask the problem, creating the impression
that increasing '-n' had a larger impact on performance than it truly
does. While '-n' still has an effect, it’s not as significant. With this
fix, users achieve the same speed without needing to tune '-n' manually.
The bug was fixed by enforcing a minimum benchmark runtime of 4 seconds,
regardless of kernel runtime or kernel type. This ensures more stable
and realistic benchmark results, though it typically increases the
benchmark duration by up to 4 seconds.
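Conceptually the fix behaves like the loop below (illustrative only; names and signatures are invented for the example): the benchmark workload is repeated until at least 4 seconds have elapsed, so very fast kernels no longer stop after a single, too-short measurement.

```c
// Sketch: enforce a minimum benchmark runtime before computing the speed.
#include <stdint.h>

#define MIN_BENCH_RUNTIME_NS 4000000000ULL // 4 seconds

typedef uint64_t (*timer_fn)      (void);      // monotonic clock in nanoseconds
typedef uint64_t (*bench_step_fn) (void *ctx); // runs one batch, returns hashes computed

static double benchmark_speed (bench_step_fn step, timer_fn now, void *ctx)
{
  uint64_t total_hashes = 0;
  uint64_t elapsed      = 0;

  const uint64_t start = now ();

  do
  {
    total_hashes += step (ctx);

    elapsed = now () - start;

  } while (elapsed < MIN_BENCH_RUNTIME_NS);

  return (double) total_hashes / ((double) elapsed / 1e9); // hashes per second
}
```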
- Kernel-Threads set to 32 and plugin configuration cleanup
Some plugin configurations existed solely to work around the old
benchmarking bug and can now be removed. For example,
'OPTS_TYPE_MAXIMUM_THREADS' is no longer required and has been removed
from all plugins, although the parameter itself remains to avoid
breaking custom plugins.
Because increasing threads beyond 32 no longer offers meaningful
performance gains, the default is now capped at 32 (unless overridden
with '-T'). This simplifies GPU memory management. Currently, work-item
counts are indirectly limited by buffer sizes (e.g., 'pws_buf[]'), which
must not exceed 4 GiB (a hard-coded limit). This buffer size depends on
the product of 'kernel-accel', 'kernel-threads', and the device’s
compute units. By reducing the default threads from 1024 to 32, there is
now more space available for base words.
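For a rough sense of the numbers (all constants in this example are assumptions, not hashcat's actual values), the buffer-size product can be worked through like this:

```c
// Illustrative arithmetic: how the password buffer scales with the
// accel x threads x compute-units product, and why dropping the default
// thread count from 1024 to 32 leaves room under the 4 GiB cap.
#include <stdint.h>
#include <stdio.h>

#define SIZE_4GIB (4ULL * 1024 * 1024 * 1024)

int main (void)
{
  const uint64_t compute_units = 128;  // e.g. SMs on a large GPU (assumed)
  const uint64_t kernel_accel  = 1024; // assumed accel value
  const uint64_t entry_size    = 256;  // assumed bytes per base-word entry

  for (uint64_t kernel_threads = 32; kernel_threads <= 1024; kernel_threads *= 32)
  {
    const uint64_t pws_buf_size = compute_units * kernel_accel * kernel_threads * entry_size;

    printf ("threads=%4llu -> pws_buf=%llu MiB (%s 4 GiB cap)\n",
            (unsigned long long) kernel_threads,
            (unsigned long long) (pws_buf_size / (1024 * 1024)),
            (pws_buf_size > SIZE_4GIB) ? "exceeds" : "within");
  }

  return 0;
}
```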