Until now, Metal support has been developed using an Apple M1 and a 10-year-old Intel-based Apple machine as the baseline, to verify that changes remained compatible with very old devices.
Recently, the code has been tested on an Apple M4 Pro, showing a performance increase of about 3.7x compared to the M1.
The code has also been sporadically tested on an Apple device with a discrete AMD GPU, but the performance was very low.
With this patch, I revisited memory management on Metal, starting with an easily configurable array mapped 1:1 to the buffers allocated by hashcat.
The configuration covers the Storage Mode assigned to each buffer, plus an ad hoc rule that switches buffers from SHARED to MANAGED Storage Mode when the device is a discrete GPU rather than an Apple Silicon (M*) or integrated Intel one.
The result was excellent, as some very quick tests showed: for example, argon2 went from 10 H/s (1330.58 ms) to 465 H/s (57.26 ms)!
That is a 4550% increase in computational power, with GPU execution time improving by roughly 23x (2223%)!
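To illustrate the idea, here is a minimal C sketch of such a storage-mode mapping. The type, table, and helper names below (mtl_storage_mode_t, BUFFER_STORAGE_MAP, select_storage_mode) and the per-buffer assignments are hypothetical and are not taken from hashcat's actual Metal backend; they only show the 1:1 mapping and the SHARED-to-MANAGED promotion for discrete GPUs described above.

```c
/*
 * Illustrative sketch only: the type, table, and helper below are
 * hypothetical and do not claim to match hashcat's actual Metal backend
 * code; they just show the 1:1 buffer-to-storage-mode mapping and the
 * SHARED -> MANAGED promotion on discrete GPUs.
 */

#include <stdbool.h>

typedef enum
{
  STORAGE_SHARED  = 0, // single allocation visible to both CPU and GPU
  STORAGE_MANAGED = 1, // mirrored CPU/GPU copies with explicit sync
  STORAGE_PRIVATE = 2, // GPU-only memory, no CPU access
} mtl_storage_mode_t;

// One entry per buffer allocated by the backend, mapped 1:1 by index.
// The assignments here are examples, not hashcat's real configuration.
static const mtl_storage_mode_t BUFFER_STORAGE_MAP[] =
{
  STORAGE_PRIVATE, // e.g. d_pws_buf
  STORAGE_SHARED,  // e.g. d_tmps
  STORAGE_SHARED,  // e.g. d_hashes
};

// On a discrete GPU (neither Apple Silicon nor integrated Intel),
// promote SHARED buffers to MANAGED to avoid slow host-coherent
// traffic across the PCIe bus.
static mtl_storage_mode_t select_storage_mode (const mtl_storage_mode_t configured,
                                               const bool is_discrete_gpu)
{
  if (is_discrete_gpu == true && configured == STORAGE_SHARED)
  {
    return STORAGE_MANAGED;
  }

  return configured;
}
```

The promotion only kicks in where the unified-memory assumption breaks down: on Apple Silicon and integrated GPUs, SHARED is already the fast path, while on a discrete GPU a MANAGED buffer keeps a device-side copy instead of reading host memory over the bus.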
In addition to the array for configuring the buffer storage modes, a new macro, HC_MTL_CREATEBUFFER, has been introduced.
It generates the code that calls hc_mtlCreateBuffer, making the allocation sites much more readable than before.
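As a rough idea of the pattern, the sketch below shows how such a macro could look. The parameter list, the struct field names, and the hc_mtlCreateBuffer signature shown here are assumptions for illustration and are not copied from the actual patch.

```c
/*
 * Illustrative sketch only: the real hc_mtlCreateBuffer signature and the
 * actual HC_MTL_CREATEBUFFER body in hashcat may differ. Everything below
 * is an assumption that shows the pattern, not the project's code.
 */

#include <stddef.h>

typedef void *mtl_mem_t; // hypothetical handle type for a Metal buffer

// Hypothetical wrapper: allocate a Metal buffer of `size` bytes using the
// storage mode configured for this buffer slot; returns -1 on failure.
int hc_mtlCreateBuffer (void *hashcat_ctx, void *metal_device,
                        size_t size, int storage_mode, mtl_mem_t *out_buf);

// Expands the buffer's short name into a checked allocation call, so each
// allocation site collapses to a single readable line.
#define HC_MTL_CREATEBUFFER(ctx, dev_param, name)                        \
  do {                                                                   \
    if (hc_mtlCreateBuffer ((ctx),                                       \
                            (dev_param)->metal_device,                   \
                            (dev_param)->size_##name,                    \
                            (dev_param)->storage_mode_##name,            \
                            &(dev_param)->metal_d_##name) == -1)         \
    {                                                                    \
      return -1;                                                         \
    }                                                                    \
  } while (0)

// Usage (inside a function returning int):
//   HC_MTL_CREATEBUFFER (hashcat_ctx, device_param, pws_buf);
//   HC_MTL_CREATEBUFFER (hashcat_ctx, device_param, tmps);
```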
In summary, this patch lays the groundwork for further improvements to the hashcat core, both for Metal itself and for other runtimes, particularly OpenCL.
Fix deprecation warning on m30906.pm
Fix pipeline error with -m 32600 on Apple
Update edge_test.sh
Fix edge test vector generation for hash-types 28501, 28502, 28503, 28504, 28505, 28506
Fix old, critical bug on Apple Intel with Metal by patching inc_rp_optimized.cl.
Tested on Apple Intel and Silicon with Metal/OpenCL and on Linux with CUDA, HIP, OpenCL GPU/CPU
Metal Backend: parallelize pipeline state object (PSO) compilation internally
Set the unexported setting setShouldMaximizeConcurrentCompilation to speed up the kernel build process on Apple Metal (Metal >= 3 only)
Hashcat is evolving, both in its core and in the supported algorithms.
To uncover bugs in the code, I implemented edge case testing to verify the settings defined in the algorithm-specific test modules (e.g., m00000.pm), as well as the behavior of the pure and optimized kernels across the different attack modes (-a0, -a1, etc.).