## OpenCL Backend
- added hc_clCreateBuffer wrapper, hc_clCreateBuffer_pre
- updated HC_OCL_CREATEBUFFER macro
- updated the other two hc_clEnqueueWriteBuffer() calls from CL_FALSE to CL_TRUE
## Metal Backend
- added hc_mtlFinish
- updated hc_mtlCreateBuffer
## Memory
- added hc_alloc_aligned and hc_free_aligned
- renamed hcmalloc_aligned and hcfree_aligned to hcmalloc_bridge_aligned and hcfree_bridge_aligned
## Backend & Bridge
- updated references of hcmalloc_aligned and hcfree_aligned to the new memory defined functions
- Added OS detection to allow conditional execution of platform-specific initialization steps
- Implemented macOS-specific cache cleaning to ensure consistent benchmark results
- Updated script logic to align with other test suites, including improved handling of workdir setup and execution timing measurements
Update driver version checks to v7 base in code
Rename rules/top10_2023.rule -> rules/top10_2025.rule
Fix the new BitLocker minimum password length in the unit test
Plugin developers can now use the FORCED_THREAD_COUNT macro to enforce
a thread count based on kernel logic and report it back to the host
from module_jit_build_options().
This works similarly to FIXED_LOCAL_SIZE, but with an important
difference: FIXED_LOCAL_SIZE also affects the JIT compiler by setting
a runtime-specific attribute that allows it to optimize for the fixed
thread size. FORCED_THREAD_COUNT does not trigger that behavior.
The downside of FORCED_THREAD_COUNT is that it disables use of
multi-dimensional JIT-optimized kernels. This means we cannot use it
for kernels like Argon2.
However, we still need to dynamically enforce a thread size for Argon2,
because the Argon2 implementation assumes 32 threads for GPUs
and 1 for CPUs. This conflicts with GPUs whose native thread size
is not 32, such as Intel discrete GPUs where the native size is 8.
Also added a case-specific optimization for Argon2. When all hashes
share the same parallelism configuration and/or the same memory size
per password candidate, we can hardcode those values. This is similar
to how we optimize our scrypt kernels and allows the compiler to make
better unrolling decisions.
Fixed a bug in the RC4 crypto primitives that can cause false
negatives on GPUs for which module_jit_build_options() does not
result in a thread size of 32. This typically affects Intel GPUs,
since those often have a native thread size of 8, but not NVIDIA
or AMD.
The problem was in the key lookup logic in inc_cipher_rc4.cl. It was
written in a way that assumes 32 threads are always running: if the
kernel is launched with only 8 threads, the lookup can result in
out-of-bounds reads and writes.
To fix this, we now enforce that all kernels using the RC4 primitives
are launched with 32 threads (on GPUs).
The following hash modes were updated to include this fix:
7500, 9700, 9710, 9720, 9800, 9810, 9820, 10400, 10410, 10420, 10500,
10510, 13100, 18200, 25400, 33500, 33501, and 33502.
Re-enabled USE_BITSELECT for Intel GPUs.
Optimize the vector version of hc_swap32() to use a USE_SWIZZLE-based
technique on OpenCL when neither USE_BITSELECT nor USE_ROTATE is set.
- solved TODOs in hc_fstat()
- fix memory leaks on Metal Backend
- using HC_OCL_CREATEBUFFER macro for buffer allocation and openclMemoryFlags array to configure the memory flags with OpenCL
- convert the last remaining CL_FALSE arguments to CL_TRUE in hc_clEnqueueWriteBuffer() calls
- hide pyenv stderr on test_edge.sh
- do not allow --slow-candidates (-S) in benchmark mode
Detect the highest supported OpenCL version at runtime and use the
appropriate -cl-std= flag when compiling kernels. This improves
compatibility with the Intel NEO driver. Note: behavior is untested
on other platforms (NVIDIA, AMD, Apple, etc.). Feedback will be
monitored.
Add tuningdb entries for discrete Intel GPUs. Copy over hash-mode
patterns that benefit from vectorizing on scalar compute units, based
on existing AMD and NVIDIA entries. This change also removes the
artificial thread limit previously enforced for discrete Intel GPUs.
Disable automatic vector width detection from the OpenCL runtime
except on CPU, where it remains in use.
Until now, support for Metal has been developed using an Apple M1 and
a 10-year-old Intel-based Apple machine as a basis, to verify that the
changes also remained compatible with very old devices.
Recently, the code has been tested on an Apple M4 Pro, with a
performance increase of about 3.7x compared to the M1.
The code has also been sporadically tested on an Apple device with a
discrete AMD GPU, but the performance was very low.
With this patch, I revisited memory management on Metal, initially
creating an easily configurable array mapped 1:1 to the buffers
allocated by hashcat. The configuration covers the Storage Mode
associated with each buffer, plus an ad hoc modification that converts
buffers with a SHARED Storage Mode to MANAGED if the device is a
discrete GPU rather than Apple Silicon (M*) or an integrated (Intel)
one.
The result was excellent, as some very quick tests showed: for
example, Argon2 went from 10 H/s (1330.58 ms) to 465 H/s (57.26 ms).
That is a 4550% increase in hash rate and a roughly 23x reduction in
execution time on the GPU!
In addition to the array for configuring the buffer storage modes, a
macro, HC_MTL_CREATEBUFFER, has also been created. It generates the
code that calls hc_mtlCreateBuffer, making the call sites much more
readable than before.
In summary, this patch lays the groundwork for further improvements to the hashcat core, both on Metal itself and also for other runtimes, particularly OpenCL.
- Took keyfile and keyboard handling from 293xx and moved to 62xx
- Applied performance optimizations based on OPTS_TYPE parameters,
fixed/limited kernel-loops/threads from 62xx and moved to 293xx