Solved: handson-ml MNIST dataset

Hi,

I just noticed that the MNIST dataset was removed from the basic datasets in both sklearn and TensorFlow, which makes it difficult to run the examples in chapter 3.

I would be grateful if you could help revise the code so that we can continue using MNIST as the example.

Best regards,

Charles

37 Answers

✔️Accepted Answer

Scikit-Learn used to download MNIST from mldata.org, which was unfortunately pretty unstable and was eventually shut down. You can still use fetch_mldata() if you downloaded the dataset before mldata.org was shut down, as it will use the dataset stored in your cache. However, since Scikit-Learn 0.20, you should use fetch_openml() instead:

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, cache=True)

For most cases, this should work fine. However, it does not return the exact same dataset as fetch_mldata() did. Indeed, the targets are now strings instead of unsigned 8-bit integers, and also it returns the unsorted MNIST dataset, whereas fetch_mldata() returned the dataset sorted by target (the training set and the test set were sorted separately). In general, this is fine, but if you want to get the exact same results as before, you need to sort the dataset using the following function:

import numpy as np

def sort_by_target(mnist):
    reorder_train = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[:60000])]))[:, 1]
    reorder_test = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[60000:])]))[:, 1]
    mnist.data[:60000] = mnist.data[reorder_train]
    mnist.target[:60000] = mnist.target[reorder_train]
    mnist.data[60000:] = mnist.data[reorder_test + 60000]
    mnist.target[60000:] = mnist.target[reorder_test + 60000]

try:
    from sklearn.datasets import fetch_openml
    mnist = fetch_openml('mnist_784', version=1, cache=True)
    mnist.target = mnist.target.astype(np.int8) # fetch_openml() returns targets as strings
    sort_by_target(mnist) # fetch_openml() returns an unsorted dataset
except ImportError:
    from sklearn.datasets import fetch_mldata
    mnist = fetch_mldata('MNIST original')
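As a side note, here is a minimal sketch of the same sorting idea using np.argsort, tried on a tiny synthetic array rather than the real dataset. It is not the code used in the notebooks. The function name sort_by_target_argsort and the toy data are made up for illustration. A stable sort should give the same ordering as sorting (target, index) pairs, and this version returns new arrays instead of modifying the dataset in place.

```python
import numpy as np

def sort_by_target_argsort(data, target, n_train=60000):
    """Return copies of data/target with the train and test parts each sorted by label."""
    data = np.asarray(data)
    target = np.asarray(target)
    # A stable sort keeps the original relative order of samples within each label.
    order_train = np.argsort(target[:n_train], kind="stable")
    # Test indices are computed locally, then shifted past the training part.
    order_test = np.argsort(target[n_train:], kind="stable") + n_train
    order = np.concatenate([order_train, order_test])
    return data[order], target[order]

# Toy stand-in for MNIST: 6 "training" rows followed by 4 "test" rows.
toy_target = np.array([3, 1, 2, 1, 0, 3, 2, 0, 1, 0], dtype=np.int8)
toy_data = np.arange(10).reshape(10, 1)  # row i holds the value i
sorted_data, sorted_target = sort_by_target_argsort(toy_data, toy_target, n_train=6)
print(sorted_target)
```

On the real dataset you would pass the data and the integer targets with the default n_train=60000.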

I'm updating all the notebooks that use fetch_mldata() to use fetch_openml() instead.
Hope this helps.

Other Answers:

Note that TensorFlow's method to load MNIST was also deprecated. :(

from tensorflow.examples.tutorials.mnist import input_data # DEPRECATED!
mnist = input_data.read_data_sets("/tmp/data/")

Fortunately, Keras also has a method to load MNIST, and since it is now part of TensorFlow (in tf.keras), you can use:

from tensorflow.keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()

The main differences are as follows:

| Loading MNIST | Sklearn fetch_mldata() (deprecated) | Sklearn fetch_openml() | TensorFlow input_data() (deprecated) | Keras load_data() |
| --- | --- | --- | --- | --- |
| Sorted by label | Yes | No | No | No |
| Image dtype | uint8 (0 to 255) | float64 (0.0 to 255.0) | float32 (0.0 to 1.0) | uint8 (0 to 255) |
| Image shape | [784] | [784] | [784] | [28, 28] |
| Label dtype | float (0.0 to 9.0) | string ('0' to '9') | uint8 (0 to 9) | uint8 (0 to 9) |
| Split | No split, but by convention: Train 0..59999, Test 60000..69999 | No split, but by convention: Train 0..59999, Test 60000..69999 | Validation 0..4999, Train 5000..59999, Test 60000..69999 | Train 0..59999, Test 60000..69999 |
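If you need to move between these formats, something like the following might help. It is only a sketch run on fake arrays, and the helper names (keras_to_flat_float, labels_to_uint8, split_validation) are hypothetical, not part of any library. It flattens Keras-style [28, 28] uint8 images to [784] float32 values in 0.0 to 1.0, converts string or float labels to uint8, and holds out the first samples as a validation set, as the old TensorFlow loader did.

```python
import numpy as np

def keras_to_flat_float(images):
    """Flatten [n, 28, 28] uint8 images to [n, 784] float32 scaled to 0.0..1.0."""
    images = np.asarray(images)
    return images.reshape(len(images), -1).astype(np.float32) / 255.0

def labels_to_uint8(labels):
    """Convert labels such as '0'..'9' (strings) or 0.0..9.0 (floats) to uint8."""
    return np.asarray(labels).astype(np.float64).astype(np.uint8)

def split_validation(X_train, y_train, n_val=5000):
    """Hold out the first n_val training samples as a validation set."""
    return (X_train[n_val:], y_train[n_val:]), (X_train[:n_val], y_train[:n_val])

# Fake data standing in for MNIST: 10 all-white images with string labels.
fake_images = np.full((10, 28, 28), 255, dtype=np.uint8)
fake_labels = np.array([str(d) for d in range(10)])

X = keras_to_flat_float(fake_images)
y = labels_to_uint8(fake_labels)
(X_tr, y_tr), (X_val, y_val) = split_validation(X, y, n_val=3)
print(X.shape, X.dtype, y.dtype, X_tr.shape, X_val.shape)
```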

Just use:

from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')
mnist

This works with sklearn version 0.19.1.
Check your version with:

import sklearn
sklearn.__version__
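If you want to decide between the two loaders from the version string, a rough sketch could look like this. The helper name supports_fetch_openml is made up, and the 0.20 threshold comes from the accepted answer above. Pre-release version strings may need more careful handling than this.

```python
import re

def supports_fetch_openml(version):
    """Return True if a version string like '0.20.2' is at least 0.20."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % (version,))
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuples compare element by element, so (1, 3) >= (0, 20) is True.
    return (major, minor) >= (0, 20)

for v in ["0.19.1", "0.20.2", "1.3.0"]:
    print(v, supports_fetch_openml(v))
```

You would pass sklearn.__version__ to it in practice.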

fetch_openml() gave the following error on my Windows 10 machine.
"jsondecodeerror: Expecting value: line 1 column 1 (char 0)".

Then I upgraded scikit-learn from 0.20.0 to 0.20.2. This solved the problem and fetch_openml() worked fine.

FYI, when I tried the suggestion in comment #4

reorder_train = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[:60000])]))[:, 1]
reorder_test = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[60000:])]))[:, 1]
mnist.data[:60000] = mnist.data[reorder_train]
mnist.target[:60000] = mnist.target[reorder_train]
mnist.data[60000:] = mnist.data[reorder_test + 60000]
mnist.target[60000:] = mnist.target[reorder_test + 60000]

I got the error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jodavies/devl/test/ml/env/lib/python3.8/site-packages/pandas/core/frame.py", line 3030, in __getitem__
    indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
  File "/Users/jodavies/devl/test/ml/env/lib/python3.8/site-packages/pandas/core/indexing.py", line 1266, in     _get_listlike_indexer
    self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
  File "/Users/jodavies/devl/test/ml/env/lib/python3.8/site-packages/pandas/core/indexing.py", line 1308, in     _validate_read_indexer
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Int64Index([    1,    21,    34,    37,    51,    56,    63,    68,    69,\n               75,\n            ...\n            59910, 59917, 59927, 59939, 59942, 59948, 59969, 59973, 59990,\n            59992],\n           dtype='int64', length=60000)] are in the [columns]"

My best guess is that the data was returned as a NumPy array when the comment was written, but is now returned as a pandas DataFrame. I changed it to:

reorder_train = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[:60000])]))[:, 1]
reorder_test = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[60000:])]))[:, 1]
mnist.data.values[:60000] = mnist.data.values[reorder_train]
mnist.target.values[:60000] = mnist.target.values[reorder_train]
mnist.data.values[60000:] = mnist.data.values[reorder_test + 60000]
mnist.target.values[60000:] = mnist.target.values[reorder_test + 60000]

and it worked for me.
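One caveat: whether writing through .values changes the underlying DataFrame can depend on the pandas version and settings, so it may be safer to select rows by position and work on plain NumPy arrays. Recent scikit-learn versions also accept as_frame=False in fetch_openml(), which should return NumPy arrays directly. Below is a small sketch of position-based selection that works for both kinds of input. The helper name take_rows is hypothetical, and the demo uses toy arrays only.

```python
import numpy as np

def take_rows(obj, positions):
    """Select rows by integer position from a pandas object or a NumPy array."""
    if hasattr(obj, "iloc"):
        # pandas: obj[positions] looks the integers up as column labels,
        # which is what raises the KeyError above; .iloc selects rows by position.
        return obj.iloc[positions].to_numpy()
    return np.asarray(obj)[positions]

# Toy example: three rows whose labels are out of order.
labels = np.array([2, 0, 1])
rows = np.array([[20], [0], [10]])

order = np.argsort(labels, kind="stable")  # positions that sort the labels
sorted_rows = take_rows(rows, order)
sorted_labels = take_rows(labels, order)
print(sorted_labels, sorted_rows.ravel())
```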
