Datasets for deep learning

*What can a machine "learn" about?*

https://www.datasciencecentral.com/profiles/blogs/deep-learning-data-sets-for-every-data-scientist

(...)

Datasets for Deep Learning

1. MNIST – One of the most popular deep learning datasets, consisting of handwritten digits with 60,000 training examples and 10,000 test examples. Time spent on data pre-processing is minimal, so you can try out different deep recognition patterns and learning techniques on real-world data right away. The size of the dataset is nearly 50 MB.
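
As a sketch of how little setup is involved, here is one way to pull MNIST down with torchvision (every major framework ships a similar loader; the ./data path is just a placeholder):

```python
from torchvision import datasets, transforms

# Download the train and test splits to ./data; each item is an
# (image, label) pair, converted here to a 1x28x28 float tensor.
train = datasets.MNIST(root="./data", train=True, download=True,
                       transform=transforms.ToTensor())
test = datasets.MNIST(root="./data", train=False, download=True,
                      transform=transforms.ToTensor())

print(len(train), len(test))  # 60000 10000
print(train[0][0].shape)      # torch.Size([1, 28, 28])
```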

2. MS-COCO – A dataset for object detection, segmentation, and captioning. The features of the COCO dataset are: object segmentation, recognition in context, stuff segmentation, 330,000 images, 1.5 million object instances, 80 object categories, 91 stuff categories, five captions per image, and 250,000 people with keypoints. The size of the dataset is 25 GB.
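
The annotations are usually explored through the pycocotools API. A minimal sketch, assuming the 2017 train annotation file has already been downloaded and unzipped (the path is a placeholder):

```python
from pycocotools.coco import COCO

# Load the instance annotations (segmentation masks, boxes, keypoints).
coco = COCO("annotations/instances_train2017.json")

cat_ids = coco.getCatIds(catNms=["person"])        # id of the "person" category
img_ids = coco.getImgIds(catIds=cat_ids)           # images containing people
ann_ids = coco.getAnnIds(imgIds=img_ids[0], catIds=cat_ids)
anns = coco.loadAnns(ann_ids)                      # per-instance annotations

print(len(img_ids), "images,", len(anns), "person instances in the first one")
```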

3. ImageNet – A dataset of images organized according to the WordNet hierarchy. There are more than 100,000 phrases (synsets) in WordNet, and each phrase is illustrated by roughly 1,000 images on average. It is a huge dataset, roughly 150 GB in size.
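
ImageNet has to be downloaded manually (it is not freely redistributable), but once the images are extracted into the usual one-directory-per-synset layout, torchvision's generic ImageFolder picks up the class structure; a sketch, with the path and layout as assumptions:

```python
from torchvision import datasets, transforms

# Standard ImageNet preprocessing: resize, then crop to 224x224.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# Expects imagenet/train/<synset>/*.JPEG, one sub-directory per class.
train = datasets.ImageFolder("imagenet/train", transform=preprocess)
print(len(train.classes))  # number of synset directories found
```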

4. VisualQA – A dataset of open-ended questions about images that require an understanding of both vision and language to answer. Its features are: 265,016 images (COCO and abstract scenes), three questions per image, ten ground-truth answers per question, three plausible (but likely incorrect) answers per question, and an automatic evaluation metric. The size is 25 GB.

5. CIFAR-10 – An image classification dataset consisting of 60,000 images across ten classes. The dataset is divided into five training batches and one test batch, each containing 10,000 images. The size is 170 MB.
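
The batch files are handled for you by most loaders; for example, with torchvision plus a PyTorch DataLoader (the path and batch size are placeholders):

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Download CIFAR-10 and iterate over it in mini-batches.
train = datasets.CIFAR10(root="./data", train=True, download=True,
                         transform=transforms.ToTensor())
loader = DataLoader(train, batch_size=64, shuffle=True)

images, labels = next(iter(loader))
print(images.shape)  # torch.Size([64, 3, 32, 32])
```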

6. Fashion-MNIST – Contains 60,000 training images and 10,000 test images of fashion articles. It was created as a direct drop-in replacement for the MNIST dataset. The size is 30 MB.
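
Because it mirrors MNIST's format exactly, swapping it in is a one-line change in most frameworks; with torchvision, for instance, only the class name differs from the MNIST snippet above:

```python
from torchvision import datasets, transforms

# Same call signature as datasets.MNIST; only the class name changes.
train = datasets.FashionMNIST(root="./data", train=True, download=True,
                              transform=transforms.ToTensor())
print(train.classes)  # ['T-shirt/top', 'Trouser', 'Pullover', ...]
```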

7. Street View House Numbers (SVHN) – A dataset for object detection problems. It is similar to MNIST, with minimal data pre-processing required, but offers much more labeled data, collected from house numbers viewed in Google Street View. The size is 2.5 GB.
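
torchvision also ships a loader for the cropped-digit version of SVHN; note that it takes a split= argument ("train", "test", or "extra") rather than MNIST's train= flag:

```python
from torchvision import datasets, transforms

# Download the cropped 32x32 digit images for the train split.
train = datasets.SVHN(root="./data", split="train", download=True,
                      transform=transforms.ToTensor())
print(len(train))  # 73257 digit crops in the train split
```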

8. Sentiment140 – A Natural Language Processing dataset used for sentiment analysis, with emoticons pre-removed from the data. The final dataset has six features: tweet polarity, tweet ID, tweet date, query, username, and tweet text.
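
The data ships as a headerless CSV, so the six features above supply the column names; a sketch with pandas (the filename is the one used in the Sentiment140 archive, so adjust if your copy differs):

```python
import pandas as pd

# Column order follows the six features listed above.
cols = ["polarity", "id", "date", "query", "user", "text"]
df = pd.read_csv("training.1600000.processed.noemoticon.csv",
                 encoding="latin-1", names=cols)

print(df["polarity"].value_counts())  # 0 = negative, 4 = positive
```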

9. WordNet – A large database of English synsets: groups of synonyms that each describe a distinct concept. The size is nearly 10 MB.
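
WordNet is most conveniently queried through NLTK; a minimal sketch (the corpus is fetched once with nltk.download):

```python
import nltk
nltk.download("wordnet")  # one-time corpus download

from nltk.corpus import wordnet as wn

# Each synset is a group of synonyms describing one distinct concept.
for syn in wn.synsets("bank")[:3]:
    print(syn.name(), "-", syn.definition())
```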

10. Wikipedia Corpus – The full text of Wikipedia, consisting of nearly 1.9 billion words from more than four million articles. You can search it by word or phrase.

11. Free Spoken Digit – Inspired by MNIST, this dataset was created to identify spoken digits in audio samples. It is an open dataset, so it grows as more people contribute recordings. Its current characteristics: three speakers, 1,500 recordings, and English pronunciations. The size of the dataset is nearly 10 MB.
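
The labels are encoded in the file names themselves, which follow a {digit}_{speaker}_{index}.wav scheme, so no separate label file is needed; a sketch, with the local path as an assumption:

```python
import os

# Parse the digit and speaker straight out of each filename,
# e.g. "0_jackson_0.wav" -> digit 0, speaker "jackson".
recordings = "free-spoken-digit-dataset/recordings"
for fname in sorted(os.listdir(recordings))[:5]:
    stem, _ = os.path.splitext(fname)
    digit, speaker, index = stem.split("_")
    print(fname, "-> digit", digit, "spoken by", speaker)
```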

12. Free Music Archive – A music analysis dataset with high-quality audio, pre-computed features, and track- and user-level metadata. The size is almost 1,000 GB.

13. Ballroom – A dataset of ballroom dancing audio files, with excerpts of many dance styles provided in real audio format. It consists of 698 instances of about 30 seconds each, for a total duration of 20,940 seconds.

14. Million Song – Audio features and metadata for a million contemporary music tracks. It serves as a shortcut alternative to building such a large dataset yourself; note that it contains only derived features, no audio. The size is nearly 280 GB.

15. LibriSpeech – Approximately 1,000 hours of English speech, properly segmented and aligned. Pre-trained acoustic models built on this data are also available.
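
torchaudio ships a loader for the individual subsets; a sketch that fetches the 100-hour "clean" training subset (the ./data path is a placeholder):

```python
import torchaudio

# Downloads and extracts the train-clean-100 subset (several GB).
dataset = torchaudio.datasets.LIBRISPEECH("./data", url="train-clean-100",
                                          download=True)

# Each item pairs a waveform with its transcript and speaker metadata.
waveform, sample_rate, transcript, speaker_id, chapter_id, utt_id = dataset[0]
print(sample_rate, transcript[:50])
```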

16. VoxCeleb – A speaker identification dataset extracted from YouTube videos, consisting of 100,000 utterances by 1,251 celebrities. Genders are balanced and the speakers span a wide range of professions, accents, and so on. The intriguing task is to identify which celebrity a voice belongs to.

17. Urban Sound Classification – This dataset consists of 8,000 excerpts of urban sounds from ten classes. The training set is 3 GB and the test set is 2 GB.

18. IMDB Reviews – An ideal dataset for any movie junkie. It is used for binary sentiment classification and, apart from the train and test review examples, also includes unlabelled data. The size is 80 MB.
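
Keras bundles this dataset with the reviews already tokenized into integer word indices; a minimal sketch (the 10,000-word vocabulary cutoff is a common but arbitrary choice):

```python
from tensorflow.keras.datasets import imdb

# Keep only the 10,000 most frequent words; rarer words are dropped.
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)

print(len(x_train), len(x_test))  # 25000 25000
print(y_train[0])                 # 1 = positive, 0 = negative
```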

19. Twenty Newsgroups – A dataset of Usenet newsgroup posts: 1,000 articles drawn from each of twenty different newsgroups. Subject lines, signatures, and quotes are among the features. The size of the dataset is nearly 20 MB.
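
scikit-learn can fetch the dataset and, usefully, strip exactly the subject lines, signatures, and quotes mentioned above, so a classifier has to learn from the message bodies rather than those give-aways; a sketch:

```python
from sklearn.datasets import fetch_20newsgroups

# Remove headers, signature blocks, and quoted replies.
train = fetch_20newsgroups(subset="train",
                           remove=("headers", "footers", "quotes"))

print(len(train.data), "documents,", len(train.target_names), "newsgroups")
```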

20. Yelp Reviews – Released by Yelp for learning purposes, this dataset consists of user reviews and more than twenty thousand pictures. The JSON files are 2.66 GB, the SQL files 2.9 GB, and the photos 7.5 GB, all compressed together.
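
The review file is newline-delimited JSON and too large to load comfortably in one pass on most machines, so streaming it in chunks with pandas is a reasonable approach; a sketch (the filename follows recent Yelp dumps, so adjust to your copy):

```python
import pandas as pd

# Stream the reviews 100,000 rows at a time instead of loading all at once.
chunks = pd.read_json("yelp_academic_dataset_review.json",
                      lines=True, chunksize=100_000)

first = next(iter(chunks))
print(first[["stars", "text"]].head())
```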