From b2911a9a5fc0b0ca0a2837cb90cf7f0034df6f7a Mon Sep 17 00:00:00 2001
From: Chick3nman
Date: Fri, 16 Apr 2021 14:56:15 -0500
Subject: [PATCH] Add SCRYPT manual tuning information

---
 hashcat.hctune | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 88 insertions(+)

diff --git a/hashcat.hctune b/hashcat.hctune
index f9655a3fb..879792ddc 100644
--- a/hashcat.hctune
+++ b/hashcat.hctune
@@ -378,6 +378,94 @@ DEVICE_TYPE_GPU * 15700 1 1
 DEVICE_TYPE_CPU * 22700 1 N 1
 DEVICE_TYPE_GPU * 22700 1 N 1
 
+## Here's an example of how to manually tune SCRYPT algorithm kernels for your hardware.
+## Manual tuning on a GPU will yield very good results. On a CPU there is typically no significant change.
+##
+## First, you need to know the parameters of your SCRYPT hash: N, r and p.
+##
+## For the default SCRYPT reference those are N=14, r=8 and p=1, but these will likely not match the parameters used by real-world applications.
+## By reference, the N value represents an exponent (2^N, which we calculate as 1 bit shifted left by N).
+## Hashcat expects this N value in decimal: 1 << 14 = 16384
+##
+## Now that you have the 3 configuration items in decimal, multiply them by 128 (the underlying crypto primitive block size).
+## For example: 128 * 16384 * 8 * 1 = 16777216 = 16MB
+## This is the amount of memory required on the GPU to compute the hash of one password candidate.
+##
+## Hashcat computes multiple password candidates in parallel - this is what allows for full utilization of the device.
+## The number of password candidates Hashcat can run in parallel is VRAM limited and depends on:
+##
+## 1. Compute devices' native compute units
+## 2. Compute devices' native thread count
+## 3. Artificial multiplier (--kernel-accel aka -n)
+##
+## In order to find out these values:
+##
+## 1. On startup Hashcat will show: * Device #1: GeForce GTX 980, 3963/4043 MB, 16MCU. The 16 MCU is the number of compute units on that device.
+## 2.
Native thread counts are fixed values: CPU=1, GPU-Intel=8, GPU-AMD=64 (wavefronts), GPU-NVIDIA=32 (warps)
+##
+## Now simply multiply them together. For my GTX980: 16 * 32 * 16777216 = 8589934592 = 8GB
+##
+## So what this means is that if we want to actually make use of all computing resources, this GPU would require 8GB of GPU RAM.
+## However, it doesn't have that:
+##
+## Device #1: GeForce GTX 980, 3963/4043 MB, 16MCU. We only have 4043 MB (4GB minus some overhead from the OS).
+##
+## So how do we deal with this? This is where SCRYPT TMTO (time-memory trade-off) kicks in. The SCRYPT algorithm is designed in such a way that we
+## can precompute that 16MB buffer from a self-chosen offset. Going into detail here on how this actually works is not important.
+##
+## What's relevant to us is that we can halve the buffer size, but in doing so we pay with twice the computation time.
+## We can repeat this as often as we want. That's why it's a trade-off.
+##
+## This mechanic can be set manually using --scrypt-tmto on the command line, but you typically won't need to.
+##
+## So back to our problem. We need 8GB of memory but have only 4GB.
+## Actually, it's not the full 4GB. The OS needs some of it and Hashcat needs some of it to store password candidates and other things.
+## If you run a headless server it should be safe to subtract a fixed value of 200MB from whatever you have in your GPU.
+##
+## So let's divide our required memory (8GB) by 2 until it fits in our VRAM minus 200MB.
+##
+## (8GB >> 0) = 8GB < 3.8GB = No, Does not fit
+## (8GB >> 1) = 4GB < 3.8GB = No, Does not fit
+## (8GB >> 2) = 2GB < 3.8GB = Yes!
+##
+## This process is automated in Hashcat, but it is important to understand what's happening here.
+## Because of the small overhead from the OS and Hashcat we pay a very high price.
+## Even though it is just 200MB, it forces us to increase the TMTO by another step.
+## In terms of speed, we now get only 1/4 of what we could achieve on that same GPU if it had 8.2GB of RAM.
+## But now we end up in a situation where we waste 1.8GB of RAM, which costs us ((1.8GB/16MB)>>1) candidates/second.
+##
+## This is where we can step in with manual tuning. We can override the above algorithm slightly to our advantage.
+## If we know that the resources we need are close to what we have (in this case 3.8GB <-> 4.0GB),
+## we could decide to throw away some of our compute units so that we will no longer need 4.0GB but only 3.8GB
+## and therefore we do not need to increase the TMTO by another step to fit in VRAM.
+##
+## If we cut down our 16 MCU to only 15 MCU or 14 MCU using --kernel-accel (-n), we end up with:
+##
+## 16 * 32 * 16777216 = 8589934592 / 2 = 4294967296 = 4.00GB < 3.80GB = Nope, next
+## 15 * 32 * 16777216 = 8053063680 / 2 = 4026531840 = 3.84GB < 3.80GB = Nope, next
+## 14 * 32 * 16777216 = 7516192768 / 2 = 3758096384 = 3.58GB < 3.80GB = Yes!
+##
+## So we can throw away 2/16 compute units, but save half of the computation trade-off on the rest of the compute device.
+## On my GTX980, this improves the performance from 163 H/s to 201 H/s.
+## You don't need to control --scrypt-tmto manually because now that the multiplier (-n) is smaller than the native value
+## Hashcat will automatically realize it can decrease the TMTO by one.
+##
+## At this point, you have found the optimal base value for your compute device. In this case: 14.
+##
+## Depending on the hardware, especially hardware with very slow memory access such as a GPU,
+## there's a good chance that it's cheaper (faster) to compute an extra step in the GPU registers.
+## So if we increase the TMTO again by one, this gives an extra speed boost.
+##
+## On my GTX980, this improves the performance from 201 H/s to 255 H/s.
+## Again, there's no need to control this with --scrypt-tmto. Hashcat will realize it has to increase the TMTO again.
+##
+## Altogether, you can control all of this by using the -n parameter on the command line.
+## This is not ideal in a production environment because you must use the --force flag.
+## The best way to set this is to store it in this hashcat.hctune file. This avoids the need to bypass any warnings.
+##
+## Find the ideal -n value, then store it here along with the proper compute device name.
+## Formatting guidelines are available at the top of this document.
+
 GeForce_GTX_980 * 8900 1 28 1
 GeForce_GTX_980 * 9300 1 128 1
 GeForce_GTX_980 * 15700 1 1 1
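
The arithmetic in the tuning notes above can be reproduced with a short Python sketch. It is not part of the patch and does not call Hashcat or mirror its internal autotuner. The function names are hypothetical, and the memory budget is an assumption: the notes' "3.8GB" is read as 3,800,000,000 bytes, which is the reading that matches the 16/15/14 table.

```python
SCRYPT_BLOCK_BYTES = 128  # block-size factor quoted in the tuning notes


def bytes_per_candidate(n_exponent, r, p):
    """Memory for one password candidate: 128 * (1 << N) * r * p."""
    return SCRYPT_BLOCK_BYTES * (1 << n_exponent) * r * p


def tmto_steps_to_fit(required_bytes, budget_bytes):
    """Count how often the requirement must be halved until it is below the budget."""
    if budget_bytes <= 0:
        raise ValueError("budget_bytes must be positive")
    steps = 0
    while (required_bytes >> steps) >= budget_bytes:
        steps += 1
    return steps


def largest_multiplier(per_candidate, threads, budget_bytes, tmto, max_multiplier):
    """Largest multiplier whose footprint fits the budget at a given TMTO step (0 if none)."""
    for multiplier in range(max_multiplier, 0, -1):
        if ((multiplier * threads * per_candidate) >> tmto) < budget_bytes:
            return multiplier
    return 0


if __name__ == "__main__":
    # Values taken from the notes: N=14, r=8, p=1, 16 MCU, 32 threads (NVIDIA warp).
    per_candidate = bytes_per_candidate(14, 8, 1)    # 16777216 bytes
    budget = 3_800_000_000                           # assumption: "3.8GB" in bytes

    for multiplier in (16, 15, 14):
        required = multiplier * 32 * per_candidate
        steps = tmto_steps_to_fit(required, budget)
        print(multiplier, required, steps)

    print(largest_multiplier(per_candidate, 32, budget, tmto=1, max_multiplier=16))
```

Under these assumptions the multipliers 16 and 15 need two halvings while 14 needs only one, which is why the notes settle on 14 as the base value.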