1
0
mirror of https://github.com/hashcat/hashcat.git synced 2024-11-29 11:28:15 +00:00

Add documentation about the 04|08|16 fast kernels

This commit is contained in:
Marcus T 2021-10-18 19:38:16 -04:00 committed by GitHub
parent 50fc474f25
commit f225d07fa5
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

View File

@ -657,6 +657,8 @@ Each fast hash kernel source in optimized mode has to provide the following kern
As always, the XXXXX is the hash mode with leading zeros. The `m` or `s` defines the multi-hash and single-hash implementation. In single hashes, often we can store the target hash on the register which makes the final test much faster compared to checking it on GPU memory. The `m` and `s` therefore often look almost the same. The only difference is that in the `s` kernel at some point you will store the target hash in a register. The final comparison function macro for `m` is COMPARE_M_SIMD() and for `s` is COMPARE_S_SIMD(). As always, the XXXXX is the hash mode with leading zeros. The `m` or `s` defines the multi-hash and single-hash implementation. In single hashes, often we can store the target hash on the register which makes the final test much faster compared to checking it on GPU memory. The `m` and `s` therefore often look almost the same. The only difference is that in the `s` kernel at some point you will store the target hash in a register. The final comparison function macro for `m` is COMPARE_M_SIMD() and for `s` is COMPARE_S_SIMD().
For single-hash this will add code to do on-register comparison. For multi-hash this will add the code to run the bloom filter and a binary tree search. For both cases, the macros expect you to provide 4 times 32 bit values in the same order as you have configured in the module functions module_dgst_pos0() - module_dgst_pos3(). Note that it always has to be 4 times 32 bit values, also for hashes which provide much more or much less bits output size. See the sections about `module_dgst_pos0()` - `module_dgst_pos3()` for details. For single-hash this will add code to do on-register comparison. For multi-hash this will add the code to run the bloom filter and a binary tree search. For both cases, the macros expect you to provide 4 times 32 bit values in the same order as you have configured in the module functions module_dgst_pos0() - module_dgst_pos3(). Note that it always has to be 4 times 32 bit values, also for hashes which provide much more or much less bits output size. See the sections about `module_dgst_pos0()` - `module_dgst_pos3()` for details.
The `[04|08|16]` from the function name denotes the maximum password candidate length in 4-byte words. The `04` function limits the candidates to 16 bytes, `08` limits to 32 bytes, and `16` to 64 bytes. Only the brute-force `a3` set of kernels need to implement all of the `04`, `08`, and `16` length variants. The `a0` and `a1` sets of kernels should only implement the `04` function and create the `08` and `16` variants as empty stubs. The candidate generation for the fast `a0` and `a1` kernels are proportionally slow enough that the length optimizations of the `08` and `16` variants simply don't provide significant benefit over the "pure" variant.
There are some kernels where using vector data types are beneficial even if they are executed on compute devices which have no native support for vector data types. A good example is NTLM running on a high end GPU. The performance gain comes from how the algorithm works and that there is in total 60 instructions that can be precomputed based only on the scalar base password. The base password never changes. The vectorization is done only in the inner loop, but from there it can access the precomputed (scalar) values from the outer loop. It saves both, instructions and resources. This is done automatically by the compiler because the structure of hashcat kernels allows the compiler to optimize it. There are some kernels where using vector data types are beneficial even if they are executed on compute devices which have no native support for vector data types. A good example is NTLM running on a high end GPU. The performance gain comes from how the algorithm works and that there is in total 60 instructions that can be precomputed based only on the scalar base password. The base password never changes. The vectorization is done only in the inner loop, but from there it can access the precomputed (scalar) values from the outer loop. It saves both, instructions and resources. This is done automatically by the compiler because the structure of hashcat kernels allows the compiler to optimize it.
#### Kernel: fast hash type (pure) #### #### Kernel: fast hash type (pure) ####