For both of those reasons, Bitcoin Core eventually limited support for
bloom filters to only clients on IP addresses that were explicitly
allowed by the node operator. This meant that an alternative method for
helping SPV clients find their transactions was needed.

=== Compact Block Filters

// https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2016-May/012636.html

An idea was posted to the Bitcoin-Dev mailing list by an anonymous
developer in 2016: reverse the bloom filter process. With a BIP37
bloom filter, each client hashes their addresses to create a bloom
filter, and nodes hash parts of each transaction to attempt to match
that filter. In the new proposal, nodes hash parts of each transaction
in a block to create a bloom filter, and clients hash their addresses to
attempt to match that filter. If a client finds a match, it downloads
the entire block.

[NOTE]
====
Despite the similarities in names, BIP152 _compact blocks_ and
BIP157/158 _compact block filters_ are unrelated.
====

This allows nodes to create a single filter for every block, which they
can save to disk and serve over and over, eliminating the
denial-of-service vulnerabilities of BIP37. Clients don't give full
nodes any information about their past or future addresses. They only
download blocks, which may contain thousands of transactions that
weren't created by the client. They can even download each matching
block from a different peer, making it harder for full nodes to connect
transactions belonging to a single client across multiple blocks.

This idea for server-generated filters doesn't offer perfect privacy and
it still places some costs on full nodes (and it does require SPV
clients to use more bandwidth for the block download), but it is much more
private and reliable than BIP37 client-requested bloom filters.

After the description of the original idea based on bloom filters,
developers realized there was a better data structure for
server-generated filters: Golomb-Rice Coded Sets (GCS).

==== Golomb-Rice Coded Sets (GCS)

Imagine that Alice wants to send a list of numbers to Bob. The simple
way to do that is to just send him the entire list of numbers:

----
849
653
476
900
379
----

But there's a more efficient way. First, Alice puts the list in
numerical order:

----
379
476
653
849
900
----

Then, Alice sends the first number. For the remaining numbers, she
sends the difference between that number and the preceding number. For
example, for the second number, she sends 97 (476 - 379); for the third
number, she sends 177 (653 - 476); and so on:

----
379
97
177
196
51
----

We can see that taking the differences between adjacent numbers in an
ordered list produces numbers that are shorter than the original
numbers. Upon receiving this list, Bob can reconstruct the original
list by simply adding each number to its predecessor. That means we
save space without losing any information, which is called _lossless
encoding_.
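
The process is simple to implement. Here's a short Python sketch of the
encoding and decoding steps described above (an illustration of the
idea, not code from any BIP):

[source,python]
----
def delta_encode(numbers):
    """Sort the numbers, then return the first one followed by the
    difference between each number and its predecessor."""
    ordered = sorted(numbers)
    return [ordered[0]] + [b - a for a, b in zip(ordered, ordered[1:])]

def delta_decode(deltas):
    """Rebuild the ordered list by keeping a running total."""
    numbers = [deltas[0]]
    for d in deltas[1:]:
        numbers.append(numbers[-1] + d)
    return numbers

print(delta_encode([849, 653, 476, 900, 379]))  # [379, 97, 177, 196, 51]
print(delta_decode([379, 97, 177, 196, 51]))    # [379, 476, 653, 849, 900]
----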

If we randomly select numbers within a fixed range of values, then the
more numbers we select, the smaller the average (mean) size of the
differences. That means the amount of data we need to transfer doesn't
increase as fast as the length of our list increases (up to a point).

Even more usefully, the length of the randomly selected numbers in a
list of differences is naturally biased towards smaller lengths.
Consider selecting two random numbers from 1 to 6; this is the same
as rolling two dice. There are 36 distinct combinations of two dice:

[cols="1,1,1,1,1,1"]
|===
| 1 1 | 1 2 | 1 3 | 1 4 | 1 5 | 1 6
| 2 1 | 2 2 | 2 3 | 2 4 | 2 5 | 2 6
| 3 1 | 3 2 | 3 3 | 3 4 | 3 5 | 3 6
| 4 1 | 4 2 | 4 3 | 4 4 | 4 5 | 4 6
| 5 1 | 5 2 | 5 3 | 5 4 | 5 5 | 5 6
| 6 1 | 6 2 | 6 3 | 6 4 | 6 5 | 6 6
|===

Let's find the difference between the larger of the numbers and the
smaller of the numbers:

[cols="1,1,1,1,1,1"]
|===
| 0 | 1 | 2 | 3 | 4 | 5
| 1 | 0 | 1 | 2 | 3 | 4
| 2 | 1 | 0 | 1 | 2 | 3
| 3 | 2 | 1 | 0 | 1 | 2
| 4 | 3 | 2 | 1 | 0 | 1
| 5 | 4 | 3 | 2 | 1 | 0
|===

If we count the frequency of each difference occurring, we see that the
small differences are much more likely to occur than the large
differences:

[cols="1,1"]
|===
| Difference | Occurrences
| 0 | 6
| 1 | 10
| 2 | 8
| 3 | 6
| 4 | 4
| 5 | 2
|===
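
We can verify this table with a few lines of Python:

[source,python]
----
from collections import Counter
from itertools import product

# Count how often each absolute difference appears among the
# 36 possible combinations of two dice.
diffs = Counter(abs(a - b) for a, b in product(range(1, 7), repeat=2))
print(sorted(diffs.items()))
# [(0, 6), (1, 10), (2, 8), (3, 6), (4, 4), (5, 2)]
----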

If we know that we might need to store large numbers (because large
differences can happen, even if they are rare) but we'll most often need
to store small numbers, we can encode each number using a system that
uses less space for small numbers and extra space for large numbers.
On average, that system will perform better than using the same amount
of space for every number.

Golomb coding provides that facility. Rice coding is a subset of Golomb
coding that's more convenient to use in some situations, including the
application of Bitcoin block filters.
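
As a rough illustration of how Rice coding achieves this, the sketch
below encodes a number as a bit string: the quotient in unary (a run of
1 bits ended by a 0) followed by a fixed number of remainder bits. It
shows the structure of the code, not the exact serialization BIP158
specifies:

[source,python]
----
def rice_encode(n, p):
    """Rice-code integer n with parameter p: the quotient (n >> p)
    in unary, then the low p bits of n in binary."""
    quotient = n >> p
    remainder = n & ((1 << p) - 1)
    return '1' * quotient + '0' + format(remainder, '0{}b'.format(p))

# Small numbers get short codes; rare large numbers cost more bits.
for n in (5, 39, 100):
    print(n, rice_encode(n, 5))
----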

==== What data to include in a block filter

Our primary goal is to allow wallets to learn whether a block contains a
transaction affecting that wallet. For a wallet to be effective, it
needs to learn two types of information:

1. When it has received money. Specifically, when a transaction
output contains a scriptPubKey that the wallet controls (such as by
controlling the authorized private key).

2. When it has spent money. Specifically, when a transaction input
references a previous transaction output that the wallet controlled.

A secondary goal during the design of compact block filters was to allow
the wallet receiving the filter to verify that it received an accurate
filter from a peer. For example, if the wallet downloaded the block
from which the filter was created, the wallet could generate its own
filter. It could then compare its filter to the peer's filter and
verify that they were identical, proving the peer had generated an
accurate filter.

For both the primary and secondary goals to be met, a filter would need
to reference two types of information:

1. The scriptPubKey for every output in every transaction in a block.

2. The outpoint for every input in every transaction in a block.

An early design for compact block filters included both of those pieces
of information, but it was realized that there was a more efficient way
to accomplish the primary goal if we sacrificed the secondary goal. In
the new design, a block filter would still reference two types of
information, but they'd be more closely related:

1. As before, the scriptPubKey for every output in every transaction in a
block.

2. In a change, it would also reference the scriptPubKey of the output
referenced by the outpoint for every input in every transaction in a
block. In other words, the scriptPubKey being spent (see the sketch
after this list).
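
Here's a minimal sketch of collecting these items from a block, assuming
hypothetical block, transaction, and lookup_spent_output objects (BIP158
itself has a few extra rules, such as skipping the coinbase input, which
has no previous output):

[source,python]
----
def filter_items(block, lookup_spent_output):
    """Gather every scriptPubKey created by a block's outputs plus
    every scriptPubKey being spent by its inputs. Using a set
    removes redundant copies."""
    items = set()
    for tx in block.transactions:
        for output in tx.outputs:
            items.add(output.script_pubkey)
        if not tx.is_coinbase:
            for tx_input in tx.inputs:
                # The spent scriptPubKey comes from previously
                # validated blocks (the UTXO set), not this block.
                spent = lookup_spent_output(tx_input.outpoint)
                items.add(spent.script_pubkey)
    return items
----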

This had several advantages. First, it meant that wallets didn't need
to track outpoints; they could instead just scan for the
scriptPubKeys to which they expected to receive money. Second, any time a
later transaction in a block spends the output of an earlier
transaction in the same block, they'll both reference the same
scriptPubKey. More than one reference to the same scriptPubKey is
redundant in a compact block filter, so the redundant copies can be
removed, shrinking the size of the filters.

When full nodes validate a block, they need access to the scriptPubKeys
for both the current transaction outputs in a block and the transaction
outputs from previous blocks that are being referenced in inputs, so
they're able to build compact block filters in this simplified model.
But a block itself doesn't include the scriptPubKeys from transactions
included in previous blocks, so there's no convenient way for a client
to verify a block filter was built correctly. However, there is an
alternative that can help a client detect if a peer is lying to it:
obtaining the same filter from multiple peers.

==== Downloading block filters from multiple peers

A peer can provide a wallet with an inaccurate filter. There are two
ways to create an inaccurate filter. The peer can create a filter that
references transactions that don't actually appear in the associated
block (a false positive). Alternatively, the peer can create a filter
that doesn't reference transactions that do actually appear in the
associated block (a false negative).

The first protection against an inaccurate filter is for a client to
obtain a filter from multiple peers. The BIP157 protocol allows a
client to download just a short 32-byte commitment to a filter to
determine whether each peer is advertising the same filter as all of the
client's other peers. That minimizes the amount of bandwidth the client
must expend to query many different peers for their filters, if all of
those peers agree.
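
Those 32-byte commitments are BIP157's _filter headers_, each of which
commits to the hash of the block's filter and to the previous filter
header, forming a chain. A sketch of the construction as we understand
it from the BIP:

[source,python]
----
import hashlib

def hash256(data):
    """Bitcoin's double SHA256."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def filter_header(filter_bytes, prev_header):
    """Commit to this block's filter and to the chain of all
    previous filters in a single 32-byte value."""
    return hash256(hash256(filter_bytes) + prev_header)

# The chain starts from 32 zero bytes at the first block.
header = filter_header(b'example serialized filter', bytes(32))
print(header.hex())
----

A client that fetches only these short headers from many peers can
cheaply detect disagreement and download the full (larger) filters only
when peers disagree.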

If two or more different peers have different filters for the same
block, the client can download all of them. It can then also download
the associated block. If the block contains any transaction related to
the wallet that is not part of one of the filters, then the wallet can
be sure that whichever peer created that filter was
inaccurate--Golomb-Rice Coded Sets (GCSes) will always include a
potential match.

Alternatively, if the block doesn't contain a transaction that the
filter said might match the wallet, that isn't proof that the filter was
inaccurate. To minimize the size of a GCS, we allow a certain number of
false positives. What the wallet can do is continue downloading
additional filters from the peer, either randomly or when they indicate
a match, and then track the false positive rate it observes. If it
differs significantly from the false positive rate that filters were
designed to use, the wallet can stop using that peer. In most cases,
the only consequence of the inaccurate filter is that the wallet uses
more bandwidth than expected.
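
A sketch of that bookkeeping, assuming the wallet records, for each
filter it checked, whether the filter matched but the downloaded block
held nothing for the wallet (the tolerance factor here is an arbitrary
choice for illustration, not from any BIP):

[source,python]
----
def observed_false_positive_rate(results):
    """results: one boolean per filter checked; True means the
    filter matched but the downloaded block contained nothing
    for the wallet (a false positive)."""
    return sum(results) / len(results)

def peer_seems_inaccurate(results, design_rate):
    """Flag a peer whose filters produce far more false positives
    than the design rate predicts."""
    return observed_false_positive_rate(results) > 10 * design_rate
----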

==== Reducing bandwidth with lossy encoding

The data we want to communicate about the transactions in a block is
their scriptPubKeys. ScriptPubKeys vary in length and follow patterns,
which means the differences between them won't be evenly distributed
like we want. However, we've already seen in many places in this book
that we can use a hash function to create a commitment to some data and
also produce a value that looks like a randomly selected number.

In other places in this book, we've used a cryptographically secure hash
function that provides assurances about the strength of its commitment
and how indistinguishable from random its output is. However, there are
faster and more configurable non-cryptographic hash functions, such as
the SipHash function we'll use for compact block filters.

The details of the algorithm used are described in BIP158, but the gist
is that each scriptPubKey is reduced to a 64-bit commitment using
SipHash and some arithmetic operations. You can think of this as
taking a set of large numbers and truncating them to shorter numbers, a
process that loses data (so it's called _lossy encoding_). By losing
some information, we don't need to store as much information later,
which saves space. In this case, we go from a typical scriptPubKey
that's 160 bits or longer down to just 64 bits.
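
A simplified sketch of that reduction appears below. BIP158 uses the
SipHash-2-4 function keyed with data from the block; as a stand-in, this
example truncates a keyed SHA256, which is slower but available in
Python's standard library. The final multiply-and-shift maps the 64-bit
hash into a smaller range without bias:

[source,python]
----
import hashlib

def hash_to_range(item, f, key):
    """Map an item (e.g., a scriptPubKey) to a number in [0, f)
    that looks randomly selected."""
    # Stand-in for BIP158's SipHash-2-4: take 64 bits of a
    # keyed SHA256.
    h64 = int.from_bytes(hashlib.sha256(key + item).digest()[:8], 'big')
    # Multiply-and-shift reduces the 64-bit value into [0, f).
    return (h64 * f) >> 64
----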

==== Using compact block filters

The 64-bit values for every commitment to a scriptPubKey in a block are
sorted, duplicate entries are removed, and the GCS is constructed by
finding the differences (deltas) between each entry. That compact block
filter is then distributed by peers to their clients (such as wallets).
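
Combining the earlier sketches, constructing the set of values behind a
filter might look like this, reusing hash_to_range from the previous
sketch (a real BIP158 filter would then Golomb-Rice code each delta
rather than keeping plain integers):

[source,python]
----
def build_filter_values(script_pubkeys, f, key):
    """Hash each scriptPubKey into [0, f), sort, remove
    duplicates, and delta-encode the result."""
    values = sorted({hash_to_range(spk, f, key) for spk in script_pubkeys})
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]
----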

A client uses the deltas to reconstruct the original commitments. The
client, such as a wallet, also takes all the scriptPubKeys it is
monitoring for and generates commitments in the same way as BIP158. It
checks whether any of its generated commitments match the commitments in
the filter.
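
Continuing the sketch, the client-side check might look like the
following, where the wallet hashes its own scriptPubKeys with the same
parameters the filter used:

[source,python]
----
def filter_matches(deltas, wallet_script_pubkeys, f, key):
    """Reconstruct the filter's sorted values from its deltas and
    check whether any of the wallet's own commitments appear."""
    values, total = set(), 0
    for d in deltas:
        total += d
        values.add(total)
    wanted = {hash_to_range(spk, f, key) for spk in wallet_script_pubkeys}
    return bool(values & wanted)
----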

Recall our example of the lossiness of compact block filters being
similar to truncating a number. Imagine a client is looking for a block
that contains the number 123456 and that an accurate (but lossy)
compact block filter contains the number 1234. When a client sees that
1234, it will download the associated block.

There's a 100% guarantee that an accurate filter containing 1234 will
allow a client to learn about a block containing 123456, called a _true
positive_. However, there's also a chance that the block might contain
123400, 123401, or almost a hundred other entries that are not what the
client is looking for (in this example), called a _false positive_.

A 100% true positive match rate is great. It means that a wallet can
depend on compact block filters to find every transaction affecting that
wallet. A non-zero false positive rate means that the wallet will end
up downloading some blocks that don't contain transactions interesting
to the wallet. The main consequence of this is that the client will use
extra bandwidth, and the actual false-positive rate for BIP158 compact
block filters is low enough that this is not a major problem. A false
positive rate can also help improve a client's privacy, as it does with
bloom filters, although anyone wanting the best possible privacy should
still use their own full node.

=== SPV Clients and Privacy

Clients that implement SPV have weaker privacy than a full node. A full