From b2911a9a5fc0b0ca0a2837cb90cf7f0034df6f7a Mon Sep 17 00:00:00 2001
From: Chick3nman <admin@chick3nman.com>
Date: Fri, 16 Apr 2021 14:56:15 -0500
Subject: [PATCH 1/3] Add SCRYPT manual tuning information

---
 hashcat.hctune | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 88 insertions(+)
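
Reviewer note: the arithmetic in the added comments can be checked with a short Python sketch. Nothing below is Hashcat code. The function names are hypothetical, and treating the reported 4043 MB as MiB and subtracting the suggested 200MB allowance are assumptions taken from the comment text.

```python
# Sketch of the memory arithmetic from the tuning comments.
# All names are hypothetical; this is not how Hashcat implements it.

def scrypt_bytes_per_candidate(n_log2: int, r: int, p: int) -> int:
    """Return 128 * N * r * p, where N = 1 << n_log2."""
    return 128 * (1 << n_log2) * r * p


def full_speed_bytes(compute_units: int, native_threads: int, per_candidate: int) -> int:
    """Memory needed to keep every native thread busy with no TMTO."""
    return compute_units * native_threads * per_candidate


def tmto_steps_to_fit(required: int, available: int) -> int:
    """Halve the requirement until it is below the available memory."""
    steps = 0
    while (required >> steps) >= available:
        steps += 1
    return steps


per_candidate = scrypt_bytes_per_candidate(14, 8, 1)  # N=14, r=8, p=1
required = full_speed_bytes(16, 32, per_candidate)    # 16 MCU, 32 threads (NVIDIA)
available = (4043 - 200) * 1024 * 1024                # assumed: MiB minus 200MB allowance

print(per_candidate)                           # bytes for one password candidate
print(required)                                # bytes for the whole device
print(tmto_steps_to_fit(required, available))  # number of halvings needed
```

With the GTX 980 figures from the comments this gives 16777216 bytes per candidate, 8589934592 bytes for the full device, and two halving steps, which matches the worked example.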

diff --git a/hashcat.hctune b/hashcat.hctune
index f9655a3fb..879792ddc 100644
--- a/hashcat.hctune
+++ b/hashcat.hctune
@@ -378,6 +378,94 @@ DEVICE_TYPE_GPU                                 *       15700   1       1
 DEVICE_TYPE_CPU                                 *       22700   1       N       1
 DEVICE_TYPE_GPU                                 *       22700   1       N       1
 
+## Here's an example of how to manually tune SCRYPT algorithm kernels for your hardware.
+## Manually tuning on GPU will yield very good results. For CPU there is not typically a significant change.
+##
+## First, you need to know the parameters of your SCRYPT hash: N, r and p.
+##
+## For the default SCRYPT reference those are N=14, r=8 and p=1, but these will likely not match the paremeters used by real-world applications.
+## By reference, the N value represents an exponent (2^N, which we calculate as 1 bit shifted left by N).
+## Hashcat expects this N value in decimal: 1 << 14 = 16384
+##
+## Now that you have the 3 configuration items in decimal, multiply them with 128 (underlaying crypto primitive block size).
+## For example: 128 * 16384 * 8 * 1 = 16777216 = 16MB
+## This is the amount of memory required on the GPU to compute the hash of one password candidate.
+##
+## Hashcat computes multiple password candidates in parallel - this is what allows for full utilization of the device.
+## The number of password candidates Hashcat can run in parallel is VRAM limited and depends on:
+##
+## 1. Compute devices' native compute units
+## 2. Compute devices' native thread count
+## 3. Artificial multiplier (--kernel-accel aka -n)
+##
+## In order to find out these values:
+## 
+## 1. On startup Hashcat will show: * Device #1: GeForce GTX 980, 3963/4043 MB, 16MCU. The 16 MCU is the number of compute units on that device.
+## 2. Native thread counts are fixed values: CPU=1, GPU-Intel=8, GPU-AMD=64 (wavefronts), GPU-NVIDIA=32 (warps)
+##
+## Now simply multiply them together. For my GTX980: 16 * 32 * 16777216 = 8589934592 = 8GB
+##
+## So what this means is that if we want to actually make use of all computing resource, this GPU would require 8GB of GPU RAM.
+## However, it doesn't have that:
+##
+## Device #1: GeForce GTX 980, 3963/4043 MB, 16MCU. We only have 4043 MB (4GB minus some overhead from the OS).
+##
+## So how do we deal with this? This is were SCRYPT TMTO(time-memory trde off) kicks in. The SCRYPT algorithm is designed in such a way that we
+## can precomputate that 16MB buffer from a self-choosen offset. Going into detail here on how this actually works is not important.
+## 
+## What's relevant to us is that we can half the buffer size, but in doing so we pay with twice the computation time. 
+## We can repeat this as often as we want. That's why it's a trade-off.
+##
+## This mechanic can be manually set using --scrypt-tmto on the commandline, but won't typically need to.
+## 
+## So back to our problem. We need 8GB of memory but have only 4GB.
+## Actually, it's not full 4GB. The OS needs some of it and Hashcat needs some of it to store password candidates and other things.
+## If you run a headless server it should be safe to subtract a fixed value of 200MB from whatever you have in your GPU.
+##
+## So lets divide our required memory(8GB) by 2 until it fits in our VRAM -200MB.
+##
+##   (8GB >> 0) = 8GB < 3.8GB = No, Does not fit
+##   (8GB >> 1) = 4GB < 3.8GB = No, Does not fit
+##   (8GB >> 2) = 2GB < 3.8GB = Yes! 
+##
+## This process is automated in Hashcat, but it is important to understand what's happening here.
+## Because of the little overhead from the OS and Hashcat we pay a very high price.
+## Even though it is just 200MB, it forces us to increase the TMTO by another step.
+## In terms of speed, the speed is now only 1/4 of what we could archieve on that same GPU if it had only 8.2GB ram.
+## But now we end up in a situation that we waste 1.8GB RAM which costs us ((1.8GB/16MB)>>1) candidates/second.
+##
+## This is where we can step in with manual tuning. We can override the above algorithm slightly to our advantage.
+## If we know that we the resources we need are close to what we have (in this case 3.8GB <-> 4.0GB)
+## We could decide to throw away some of our compute units so that we will no longer need 4.0GB but only 3.8GB
+## and therefore we do not need to increase the TMTO by another step to fit in VRAM.
+##
+## If we cut down our 16 MCU to only 15 MCU or 14 MCU using --kernel-accel(-n), we end up with:
+## 
+##   16 * 32 * 16777216 = 8589934592 / 2 = 4294967296 = 4.00GB < 3.80GB = Nope, next
+##   15 * 32 * 16777216 = 8053063680 / 2 = 4026531840 = 3.84GB < 3.80GB = Nope, next
+##   14 * 32 * 16777216 = 7516192768 / 2 = 3758096384 = 3.58GB < 3.80GB = Yes!
+##
+## So we can throw away 2/16 compute units, but save half of the computation trade-off on the rest of the compute device.
+## On my GTX980, this improves the performance from 163 H/s to 201 H/s.
+## You don't need to control --scrypt-tmto manually because now that the multiplier (-n) is smaller than the native value
+## Hashcat will automatically realize it can decrease the TMTO by one.
+##
+## At this point, you found the optimal base value for your compute device. In this case: 14.
+##
+## Depending on our hardware, especially with hardware with very slow memory access like GPU
+## there's a good chance that it's cheaper (faster) to compute an extra step on the GPU register.
+## So if we increase the TMTO again by one, this gives an extra speed update.
+##
+## On my GTX980, this improves the performance from 201 H/s to 255 H/s.
+## Again, there's no need to control this with --scrypt-tmto. Hashcat will realize it has to increase the TMTO again.
+##
+## All together, you can control all of this by using the -n parameter in the command line. 
+## This is not ideal in a production environment because you must use the --force flag.
+## The best way to set this is using this Hashcat.hctune file to store it. This avoids the need to bypass any warnings.
+##
+## Find the ideal -n value, then store it here along with the proper compute device name. 
+## Formatting guidelines are availabe at the top of this document.
+
 GeForce_GTX_980                                 *       8900    1      28       1
 GeForce_GTX_980                                 *       9300    1     128       1
 GeForce_GTX_980                                 *       15700   1       1       1

From 380cf61424795570b8c1238708984d090b451cd7 Mon Sep 17 00:00:00 2001
From: Chick3nman <admin@chick3nman.com>
Date: Fri, 16 Apr 2021 15:11:03 -0500
Subject: [PATCH 2/3] Fix typo, spelling

---
 hashcat.hctune | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hashcat.hctune b/hashcat.hctune
index 879792ddc..e10885513 100644
--- a/hashcat.hctune
+++ b/hashcat.hctune
@@ -383,7 +383,7 @@ DEVICE_TYPE_GPU                                 *       22700   1       N
 ##
 ## First, you need to know the parameters of your SCRYPT hash: N, r and p.
 ##
-## For the default SCRYPT reference those are N=14, r=8 and p=1, but these will likely not match the paremeters used by real-world applications.
+## In the SCRYPT reference implementation those parameters are N=14, r=8 and p=1, but these will likely not match the parameters used by real-world applications.
 ## By reference, the N value represents an exponent (2^N, which we calculate as 1 bit shifted left by N).
 ## Hashcat expects this N value in decimal: 1 << 14 = 16384
 ##
@@ -428,7 +428,7 @@ DEVICE_TYPE_GPU                                 *       22700   1       N
 ##   (8GB >> 1) = 4GB < 3.8GB = No, Does not fit
 ##   (8GB >> 2) = 2GB < 3.8GB = Yes! 
 ##
-## This process is automated in Hashcat, but it is important to understand what's happening here.
+## This process is automated in Hashcat, but it is important to understand what's actually happening here.
 ## Because of the little overhead from the OS and Hashcat we pay a very high price.
 ## Even though it is just 200MB, it forces us to increase the TMTO by another step.
 ## In terms of speed, the speed is now only 1/4 of what we could archieve on that same GPU if it had only 8.2GB ram.

From d298af00a02ea31a131fde97df6df1cc5e04c429 Mon Sep 17 00:00:00 2001
From: Chick3nman <admin@chick3nman.com>
Date: Fri, 16 Apr 2021 15:48:35 -0500
Subject: [PATCH 3/3] Grammar changes, phrasing

---
 hashcat.hctune | 65 +++++++++++++++++++++++++-------------------------
 1 file changed, 33 insertions(+), 32 deletions(-)
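
Reviewer note: the --kernel-accel table reformatted in this patch can be reproduced with the sketch below. The helper names are hypothetical. The 3800 MiB budget is an assumption that reads the table's "3.80GB" the same way the table reads 3840 MiB as "3.84GB". With the unrounded 4043 - 200 = 3843 MiB, the 15-unit row would fit by about 3 MiB, so the budget choice affects the result.

```python
# Sketch: find the largest compute-unit count that fits at one TMTO step.
# Names and the 3800 MiB budget are assumptions for illustration only.

PER_CANDIDATE = 128 * (1 << 14) * 8 * 1  # bytes per candidate (N=14, r=8, p=1)
NATIVE_THREADS = 32                      # NVIDIA native thread count
BUDGET_MIB = 3800                        # assumed reading of the table's "3.80GB"


def mib_after_tmto(units: int, tmto: int) -> int:
    """Memory in MiB for `units` compute units after `tmto` halvings."""
    total_bytes = units * NATIVE_THREADS * PER_CANDIDATE
    return (total_bytes >> tmto) // (1024 * 1024)


def largest_fitting_units(max_units: int, tmto: int, budget_mib: int) -> int:
    """Scan downward from max_units and return the first count that fits."""
    for units in range(max_units, 0, -1):
        if mib_after_tmto(units, tmto) < budget_mib:
            return units
    return 0


for units in (16, 15, 14):
    print(units, mib_after_tmto(units, 1))

print(largest_fitting_units(16, 1, BUDGET_MIB))
```

Under these assumptions the three rows come out to 4096, 3840 and 3584 MiB, and the scan settles on 14 units, the same base value the comments arrive at.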

diff --git a/hashcat.hctune b/hashcat.hctune
index e10885513..de055ea92 100644
--- a/hashcat.hctune
+++ b/hashcat.hctune
@@ -379,71 +379,71 @@ DEVICE_TYPE_CPU                                 *       22700   1       N
 DEVICE_TYPE_GPU                                 *       22700   1       N       1
 
 ## Here's an example of how to manually tune SCRYPT algorithm kernels for your hardware.
-## Manually tuning on GPU will yield very good results. For CPU there is not typically a significant change.
+## Manually tuning the GPU will yield increased performance. There is typically no noticeable change to CPU performance.
 ##
 ## First, you need to know the parameters of your SCRYPT hash: N, r and p.
 ##
-## In the SCRYPT reference implementation those parameters are N=14, r=8 and p=1, but these will likely not match the parameters used by real-world applications.
-## By reference, the N value represents an exponent (2^N, which we calculate as 1 bit shifted left by N).
-## Hashcat expects this N value in decimal: 1 << 14 = 16384
+## The reference SCRYPT parameter values are N=14, r=8 and p=1, but these will likely not match the parameters used by real-world applications.
+## For reference, the N value represents an exponent (2^N, which we calculate by bit shifting 1 left by N bits).
+## Hashcat expects this N value in decimal format: 1 << 14 = 16384
 ##
-## Now that you have the 3 configuration items in decimal, multiply them with 128 (underlaying crypto primitive block size).
+## Now that you have the 3 configuration items in decimal format, multiply them by 128 (underlying crypto primitive block size).
 ## For example: 128 * 16384 * 8 * 1 = 16777216 = 16MB
-## This is the amount of memory required on the GPU to compute the hash of one password candidate.
+## This is the amount of memory required for the GPU to compute the hash of one password candidate.
 ##
 ## Hashcat computes multiple password candidates in parallel - this is what allows for full utilization of the device.
-## The number of password candidates Hashcat can run in parallel is VRAM limited and depends on:
+## The number of password candidates that Hashcat can run in parallel is VRAM limited and depends on:
 ##
 ## 1. Compute devices' native compute units
 ## 2. Compute devices' native thread count
 ## 3. Artificial multiplier (--kernel-accel aka -n)
 ##
-## In order to find out these values:
+## In order to find these values:
 ## 
 ## 1. On startup Hashcat will show: * Device #1: GeForce GTX 980, 3963/4043 MB, 16MCU. The 16 MCU is the number of compute units on that device.
 ## 2. Native thread counts are fixed values: CPU=1, GPU-Intel=8, GPU-AMD=64 (wavefronts), GPU-NVIDIA=32 (warps)
 ##
-## Now simply multiply them together. For my GTX980: 16 * 32 * 16777216 = 8589934592 = 8GB
+## Now multiply them together. For my GTX980: 16 * 32 * 16777216 = 8589934592 = 8GB
 ##
-## So what this means is that if we want to actually make use of all computing resource, this GPU would require 8GB of GPU RAM.
+## If we want to actually make use of all computing resources, this GPU would require 8GB of GPU RAM.
 ## However, it doesn't have that:
 ##
 ## Device #1: GeForce GTX 980, 3963/4043 MB, 16MCU. We only have 4043 MB (4GB minus some overhead from the OS).
 ##
-## So how do we deal with this? This is were SCRYPT TMTO(time-memory trde off) kicks in. The SCRYPT algorithm is designed in such a way that we
-## can precomputate that 16MB buffer from a self-choosen offset. Going into detail here on how this actually works is not important.
+## How do we deal with this? This is where SCRYPT TMTO (time-memory trade-off) kicks in. The SCRYPT algorithm is designed in such a way that we
+## can pre-compute that 16MB buffer from a self-chosen offset. Details on how this actually works are not important for this process.
 ## 
-## What's relevant to us is that we can half the buffer size, but in doing so we pay with twice the computation time. 
+## What's relevant to us is that we can halve the buffer size, but we pay with twice the computation time. 
 ## We can repeat this as often as we want. That's why it's a trade-off.
 ##
-## This mechanic can be manually set using --scrypt-tmto on the commandline, but won't typically need to.
+## This mechanic can be manually set using --scrypt-tmto on the command line, but this is not the best way.
 ## 
-## So back to our problem. We need 8GB of memory but have only 4GB.
-## Actually, it's not full 4GB. The OS needs some of it and Hashcat needs some of it to store password candidates and other things.
+## Back to our problem. We need 8GB of memory but have only ~4GB.
+## It's not a full 4GB. The OS needs some of it and Hashcat needs some of it to store password candidates and other things.
 ## If you run a headless server it should be safe to subtract a fixed value of 200MB from whatever you have in your GPU.
 ##
-## So lets divide our required memory(8GB) by 2 until it fits in our VRAM -200MB.
+## So let's divide our required memory (8GB) by 2 until it fits in our VRAM - 200MB.
 ##
-##   (8GB >> 0) = 8GB < 3.8GB = No, Does not fit
-##   (8GB >> 1) = 4GB < 3.8GB = No, Does not fit
-##   (8GB >> 2) = 2GB < 3.8GB = Yes! 
+## (8GB >> 0) = 8GB < 3.8GB = No, Does not fit
+## (8GB >> 1) = 4GB < 3.8GB = No, Does not fit
+## (8GB >> 2) = 2GB < 3.8GB = Yes! 
 ##
-## This process is automated in Hashcat, but it is important to understand what's actually happening here.
-## Because of the little overhead from the OS and Hashcat we pay a very high price.
+## This process is automated in Hashcat, but it is important to understand what's happening here.
+## Because of the light overhead from the OS and Hashcat, we pay a very high price.
 ## Even though it is just 200MB, it forces us to increase the TMTO by another step.
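-## In terms of speed, the speed is now only 1/4 of what we could archieve on that same GPU if it had only 8.2GB ram.
+## In terms of speed, we now get only 1/4 of what we could achieve on that same GPU if it had just 8.2GB of RAM.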
 ## But now we end up in a situation that we waste 1.8GB RAM which costs us ((1.8GB/16MB)>>1) candidates/second.
 ##
-## This is where we can step in with manual tuning. We can override the above algorithm slightly to our advantage.
-## If we know that we the resources we need are close to what we have (in this case 3.8GB <-> 4.0GB)
-## We could decide to throw away some of our compute units so that we will no longer need 4.0GB but only 3.8GB
-## and therefore we do not need to increase the TMTO by another step to fit in VRAM.
+## This is where manual tuning can come into play.
+## If we know that the resources we need are close to what we have (in this case 3.8GB <-> 4.0GB),
+## we could decide to throw away some of our compute units so that we will no longer need 4.0GB but only 3.8GB.
+## Therefore, we do not need to increase the TMTO by another step to fit in VRAM.
 ##
 ## If we cut down our 16 MCU to only 15 MCU or 14 MCU using --kernel-accel(-n), we end up with:
 ## 
-##   16 * 32 * 16777216 = 8589934592 / 2 = 4294967296 = 4.00GB < 3.80GB = Nope, next
-##   15 * 32 * 16777216 = 8053063680 / 2 = 4026531840 = 3.84GB < 3.80GB = Nope, next
-##   14 * 32 * 16777216 = 7516192768 / 2 = 3758096384 = 3.58GB < 3.80GB = Yes!
+## 16 * 32 * 16777216 = 8589934592 / 2 = 4294967296 = 4.00GB < 3.80GB = Nope, next
+## 15 * 32 * 16777216 = 8053063680 / 2 = 4026531840 = 3.84GB < 3.80GB = Nope, next
+## 14 * 32 * 16777216 = 7516192768 / 2 = 3758096384 = 3.58GB < 3.80GB = Yes!
 ##
 ## So we can throw away 2/16 compute units, but save half of the computation trade-off on the rest of the compute device.
 ## On my GTX980, this improves the performance from 163 H/s to 201 H/s.
@@ -452,20 +452,21 @@ DEVICE_TYPE_GPU                                 *       22700   1       N
 ##
 ## At this point, you found the optimal base value for your compute device. In this case: 14.
 ##
-## Depending on our hardware, especially with hardware with very slow memory access like GPU
+## Depending on our hardware, especially hardware with very slow memory access like a GPU,
 ## there's a good chance that it's cheaper (faster) to compute an extra step on the GPU register.
-## So if we increase the TMTO again by one, this gives an extra speed update.
+## So if we increase the TMTO again by one, this gives an extra speed boost.
 ##
 ## On my GTX980, this improves the performance from 201 H/s to 255 H/s.
 ## Again, there's no need to control this with --scrypt-tmto. Hashcat will realize it has to increase the TMTO again.
 ##
 ## All together, you can control all of this by using the -n parameter in the command line. 
 ## This is not ideal in a production environment because you must use the --force flag.
-## The best way to set this is using this Hashcat.hctune file to store it. This avoids the need to bypass any warnings.
+## The best way to set this is by using this hashcat.hctune file to store it. This avoids the need to bypass any warnings.
 ##
 ## Find the ideal -n value, then store it here along with the proper compute device name. 
-## Formatting guidelines are availabe at the top of this document.
+## Formatting guidelines are available at the top of this document.
 
+
 GeForce_GTX_980                                 *       8900    1      28       1
 GeForce_GTX_980                                 *       9300    1     128       1
 GeForce_GTX_980                                 *       15700   1       1       1