[Solved] handson-ml: Creating a test set with a hash

Hi @ageron, I really love your book and it is easy for me to understand, but in chapter 2 you wrote:

For example, you could compute a hash of each
instance’s identifier, keep only the last byte of the hash, and put the
instance in the test set if this value is lower or equal to 51 (~20% of 256).
This ensures that the test set will remain consistent across multiple runs,
even if you refresh the dataset. The new test set will contain 20% of the
new instances, but it will not contain any instance that was previously in
the training set

Could you give me some explanation: what is the hash of an instance's identifier, and how can I apply it to split the dataset into two parts, a training set and a test set? And one more question: are the instance's identifiers longitude, median_house_value, population, and so on? Is that correct?

Sorry for my bad English :( Thank you in advance.

14 Answers

✔️Accepted Answer

Thanks for your kind words and your question.

What we are trying to achieve is to have a stable test set across multiple runs of your program. If the test set changes every time you run the program, then you run the risk of gradually tweaking your program and your model to fit all of the data (including test instances), which is bad because it makes the evaluation of your model on the test set unreliable (the metric will be too optimistic).

Next question: how can you ensure that the test set is stable across multiple runs? Well, there are countless solutions. For example, you could simply randomly pick a set for each instance, just once, then write the sets in separate files. In subsequent runs of your program, you would just use these existing files. End of story. However, this extra preprocessing is not necessary if every instance has a unique ID: suppose the ID is a random integer, then you can simply decide that instances whose IDs end with 00 to 19 belong to the test set, and the rest belong to the training set (this would ensure that 20% of your instances go to the test set, and 80% go to the training set). So for example, the instance with ID=5910406 is in the test set, but the instance with ID=48290483 is not. This solution would work fine. However, database IDs are often sequential, not random, so you need a way to shuffle the IDs to ensure that the instances are randomly distributed across the training set and the test set. This leads us to hash functions.
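The "IDs ending in 00 to 19 go to the test set" rule above can be sketched in a few lines (a minimal sketch; the `instance_ids` list is hypothetical):

```python
# Split instances by the last two decimal digits of their ID:
# digits 00..19 (i.e., 20% of the 00..99 range) go to the test set.
def ends_in_00_to_19(identifier):
    return identifier % 100 < 20  # last two decimal digits in 00..19

instance_ids = [5910406, 48290483, 12345617, 777]  # hypothetical IDs
test_ids = [i for i in instance_ids if ends_in_00_to_19(i)]
train_ids = [i for i in instance_ids if not ends_in_00_to_19(i)]
print(test_ids)   # [5910406, 12345617]
print(train_ids)  # [48290483, 777]
```

As in the example above, ID=5910406 (ending in 06) lands in the test set, while ID=48290483 (ending in 83) lands in the training set.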

As Wikipedia says, "A hash function is any function that can be used to map data of arbitrary size to data of fixed size. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes.". The most famous example is the MD5 function:

>>> from hashlib import md5
>>> md5(b"This could be a very long document").digest()
b'\x98\x14\x8d\x9f\xca\xf4\xf69F\x82\xf9kF\xda\x01\x06'
>>> md5(b"This could be a very long document").digest()
b'\x98\x14\x8d\x9f\xca\xf4\xf69F\x82\xf9kF\xda\x01\x06'
>>> md5(b"This may be a very long document").digest()
b'}\xeb\xdb\xdc\r\x81D\xa0\xa1\\\x19\x8c\xccT:-'

As this example shows, the MD5 function is deterministic: if you run it twice on the same data, it outputs the same result twice. However, if you run it on two different documents (even if they differ by only one bit), it outputs very different results, seemingly random. This is exactly what we need to shuffle the instance IDs! What we can do is just run the instance IDs through MD5 (or any other hash function), and this will give us a new, seemingly random ID (but actually stable across multiple runs, since hash functions are deterministic).

Now MD5 outputs a 128 bit number, so if we want to have 20% of the instances in the test set, we can just compute the MD5 of each instance's ID, and if the result is lower than 2^128 * 20%, it goes to the test set, otherwise it goes to the training set. Unfortunately, things get slightly ugly in Python when dealing with MD5, since it actually returns a byte array (with 16 bytes), not a large integer. We could convert the whole byte array to a long 128 bit integer (using the struct module), but instead it is simpler to just pick the last byte of the hash and use it. If we want 20% of our instances to go to the test set, we can just compute the MD5 of the ID, get the last byte out of the result (which will look like a random byte, from 0 to 255), and if that byte is lower or equal to 51 (which is about 20% of 256) then the instance goes to the test set, otherwise it goes to the training set. Phew... It sounds complicated, but in practice it just means using a one-line function to decide in which set each instance will go.
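That one-line function might look like this (a sketch; the function name is illustrative, and it assumes the IDs are 64-bit integers):

```python
from hashlib import md5
import numpy as np

def is_in_test_set(identifier, test_ratio=0.2):
    # Hash the 64-bit ID and keep the last byte of the digest (0..255);
    # bytes < 256 * 0.2 (i.e., <= 51) send ~20% of instances to the test set.
    return md5(np.int64(identifier)).digest()[-1] < 256 * test_ratio
```

Because MD5 is deterministic, calling this function on the same ID always returns the same answer, so the test set stays stable across runs.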

Lastly, what do we do if the dataset has no IDs at all? Well, we can just apply the MD5 function to each instance's input features (assuming they are unique and stable across runs). In the case of the California Housing dataset, you could use the longitude, latitude, median_house_value, population and so on. For example:

>>> longitude, latitude = -122.23, 37.88              # this instance's feature values
>>> median_house_value, population = 452600.0, 322.0  # (example values)
>>> doc = "\n".join([str(field)
...                  for field in (longitude, latitude, median_house_value, population)])
...
>>> h = md5(doc.encode("ascii")).digest()
>>> is_test = (h[-1] <= 51)   # True for roughly 20% of instances

I hope this helps!

Other Answers:

Hi @zahraegh ,
Thanks for your questions.

Here's a typical usage of the MD5 hash function:

>>> from hashlib import md5
>>> md5(b"hello")
<md5 HASH object @ 0x101988580>
>>> md5(b"hello").digest()
b']A@*\xbcK*v\xb9q\x9d\x91\x10\x17\xc5\x92'

You give it a byte array, of any length, and when you call digest() it returns a byte array of 128 bits (16 bytes), regardless of the length of the input byte array:

>>> len(md5(b"hello").digest())
16
>>> len(md5(b"hello").digest())*8
128
>>> len(md5(b"hello this is longer").digest())*8
128

The md5() function expects an input that implements the buffer API, such as a byte array, as above. But if you just give it an integer, it does not work, and it tells you why:

>>> md5(123)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object supporting the buffer API required

One workaround is to convert the integer to a string, then encode that string (e.g. using ASCII) to a byte-array. But that adds some unnecessary overhead. Instead, we can just create a NumPy int64 which is nice enough to implement the buffer API:

>>> import numpy as np
>>> md5(np.int64(123))
<md5 HASH object @ 0x101f2ba58>
>>> md5(np.int64(123)).digest()
b'\xf1\x8b\x8d\xbe\xfe\x02\xa0\xef\xce(\x1d\xebU\xa2\t\xcd'

Whether the dataset is big or small, we want to keep about 20% for the test set, and 80% for the training set. We don't need a one-to-one mapping from instance to hash: all we need is a deterministic mapping from each instance in the dataset to either the training set or the test set, so it does not matter if two instances end up with the same hash. We just need the mapping to be deterministic, so that a given instance always gets assigned to the same subset (train or test). And it should also be pseudo-random, so that instances get well distributed across the training set and the test set.
If the IDs in the original dataset are not random (for example, say they are just incremental: 1, 2, 3, 4...), and if we use a method that does not randomize sufficiently, then there is a risk that the training set and the test set will not look like each other: for example, the training set may have all the old instances, while the test set may have all the new ones. This would be pretty bad. Fortunately, the MD5 hash function (and most commonly used hash functions, such as SHA-1 and so on) has both of these properties: it is deterministic (all hash functions are, by definition), and it is sufficiently pseudo-random for our use case (MD5 is not random enough for some use cases, such as cryptography, but it is random enough for splitting a dataset).
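You can check this pseudo-randomness empirically: even with purely sequential IDs, the last-byte rule puts roughly 20% of instances in the test set, spread evenly across old and new IDs alike (a minimal sketch, reusing the MD5 last-byte rule from above):

```python
from hashlib import md5
import numpy as np

ids = np.arange(20_000)  # purely sequential IDs: 0, 1, 2, ...
in_test = np.array([md5(np.int64(i)).digest()[-1] < 256 * 0.2 for i in ids])
print(in_test.mean())           # close to 0.20 overall...
print(in_test[:10_000].mean())  # ...and also close to 0.20 in the "old" first half
```

If the split were not pseudo-random, the "old" half of the IDs could have a very different test-set fraction than the "new" half; with MD5 both fractions stay near 20%.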

Hope this helps.

Thank you for a clear explanation.
Just one thing I do not fully understand. My question is:
Can we get exactly 20% of the instances of the dataset if we compute the MD5 hash function and take all the instances whose hash's last byte is lower or equal to 51 (20% of 256)?
I mean, it depends on the distribution of the hash function. In my opinion, if we want 20% of the dataset to go to the test set, we must know the distribution of the hash function applied to our dataset and choose the right number (maybe not 51) to split our dataset. Please correct me if I am wrong.

Hi @SumoBBQ,

The MD5 hash function acts like a pseudo-random number generator with a uniform probability distribution (as do most hash functions, in fact). So roughly 52/256 = 20.3125% of hashes will have a last byte strictly lower than 52 (i.e., lower or equal to 51).
You can get slightly closer to 20% by taking the hashes whose last byte is strictly lower than 51 (i.e., lower or equal to 50), since 51/256 = 19.921875%. And you can get even closer to exactly 20% by taking the whole hash and converting it to a 128-bit number (MD5 hashes have 128 bits), then checking whether or not that number is smaller than 2^128 * 20%.
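The full-hash variant can be written with `int.from_bytes` (a sketch; the function name is illustrative, and it again assumes 64-bit integer IDs):

```python
from hashlib import md5
import numpy as np

def in_test_set_exact(identifier, test_ratio=0.2):
    # Interpret all 16 bytes of the digest as one 128-bit integer,
    # then compare against test_ratio of the full 2**128 range.
    # (The threshold is a float, so it is only approximate in the
    # lowest bits, which is more than precise enough for splitting.)
    h = int.from_bytes(md5(np.int64(identifier)).digest(), "big")
    return h < 2**128 * test_ratio
```

This uses the whole 128-bit hash instead of a single byte, so the test-set fraction is no longer restricted to multiples of 1/256.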

Hope this helps.
