dbcollection

A collection of popular datasets for deep learning.

dbcollection is a Python module for loading and managing datasets through a simple set of commands, designed with cross-platform and cross-language support in mind, and distributed under the MIT license. With this package, you get quick and simple access to a collection of datasets for a variety of tasks such as object classification, detection, human pose estimation, and captioning.

This package is available for Linux, macOS and Windows.


Why

Why use dbcollection?

dbcollection provides a cross-platform, cross-language framework to manage datasets and to quickly load/fetch data with minimal resources wasted. Plus, it contains a list of diverse datasets to work with.

The main goal of this project is to help save time for users when developing/deploying/sharing their work.

Reasons why you should use dbcollection

Here are some of the problems dbcollection tries to solve:

  • Downloading, extracting, and parsing different datasets without prior knowledge of their inner workings can be error-prone.
  • Constantly writing the same boilerplate code, or adapting existing code for new projects, is a hassle.
  • Disk space gets littered with data files as you work on different projects with no centralized storage.
  • Time and computer resources (mostly memory) are wasted when loading large datasets.
  • Many datasets distribute their (meta)data as .json files, which is fine for smaller datasets but is not an efficient way to store large amounts of (meta)data for large ones.
  • Trying a new dataset usually means spending a significant portion of your time learning how to load and parse it. And good luck if the dataset is distributed in some complicated, in-house format that can only be extracted with a specific toolbox (e.g., Caltech Pedestrian).
  • Having to learn a toolbox (and maybe a language) just to try out a dataset before doing anything serious is, at best, a tedious task.
  • If you start using a new language, you will probably end up rewriting the same scripts to load and parse your datasets.

Usage

Simple to use

Using the module is pretty straightforward. To import it, just do:

>>> import dbcollection as dbc

To load a dataset, you only need a single method, which returns a data loader object that you can then use to fetch data.

>>> mnist = dbc.load('mnist')

This data loader object contains information about the dataset’s name, task, data, cache paths, set splits, and some methods for querying and loading data from the HDF5 metadata file.

For example, if you want to know how the data is structured inside the metadata file, you can simply do the following:

>>> mnist.info()

> Set: test
   - classes,        shape = (10, 2),          dtype = uint8
   - images,         shape = (10000, 28, 28),  dtype = uint8,  (in 'object_ids', position = 0)
   - labels,         shape = (10000,),         dtype = uint8,  (in 'object_ids', position = 1)
   - object_fields,  shape = (2, 7),           dtype = uint8
   - object_ids,     shape = (10000, 2),       dtype = uint8

   (Pre-ordered lists)
   - list_images_per_class,  shape = (10, 1135),  dtype = int32

> Set: train
   - classes,        shape = (10, 2),          dtype = uint8
   - images,         shape = (60000, 28, 28),  dtype = uint8,  (in 'object_ids', position = 0)
   - labels,         shape = (60000,),         dtype = uint8,  (in 'object_ids', position = 1)
   - object_fields,  shape = (2, 7),           dtype = uint8
   - object_ids,     shape = (60000, 2),       dtype = uint8

   (Pre-ordered lists)
   - list_images_per_class,  shape = (10, 6742),  dtype = int32
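The `object_ids` and `object_fields` entries above implement a simple indirection: each row of `object_ids` stores, for one sample, the row index into every field named in `object_fields`, in the same order. A pure-Python sketch with toy data (hypothetical values, not the real MNIST metadata) illustrates the lookup:

```python
# Toy stand-ins for the HDF5 fields (hypothetical data, not real MNIST).
object_fields = ["images", "labels"]           # field names, in order
images = [[0] * 4, [1] * 4, [2] * 4]           # stand-ins for 28x28 arrays
labels = [7, 3, 9]
fields = {"images": images, "labels": labels}

# Row i of object_ids holds the row index of each field for sample i.
object_ids = [[0, 0], [1, 1], [2, 2]]

def fetch_object(i):
    """Resolve sample i into a dict mapping field name -> value."""
    return {name: fields[name][idx]
            for name, idx in zip(object_fields, object_ids[i])}

print(fetch_object(1))  # {'images': [1, 1, 1, 1], 'labels': 3}
```

This indirection is what lets a single `object_ids` table tie together fields of different shapes and sizes without duplicating data.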

Fetching data samples from a field is as easy as calling a method with the set name, field name, and the row id(s) you want to select. For example, to retrieve the first 10 images, all you need to do is the following:

>>> imgs = mnist.get('train', 'images', range(10))
>>> imgs.shape
(10, 28, 28)
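Note that string-valued fields such as `classes` and `object_fields` appear with dtype uint8 in the `info()` listing above: they are stored as fixed-size, zero-padded ASCII arrays. A small helper decodes them back into strings (the padding convention here is an assumption based on the shapes shown):

```python
def ascii_to_str(rows):
    """Decode a 2-D array of zero-padded ASCII codes into a list of strings."""
    return ["".join(chr(c) for c in row if c != 0) for row in rows]

# e.g. an 'object_fields' array with shape (2, 7): b'images\0', b'labels\0'
padded = [[105, 109, 97, 103, 101, 115, 0],
          [108, 97, 98, 101, 108, 115, 0]]
print(ascii_to_str(padded))  # ['images', 'labels']
```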

Note: For more information about using this module, please check the documentation or the available notebooks for guidance.

Features

Key features of the dbcollection library

  • Simple API to load/download/setup/manage datasets.
  • Simple API to fetch data from a dataset.
  • Store and pull data from disk or from memory, you choose!
  • Datasets only need to be set up/processed once, so the next time you use them they load instantly!
  • Cross-platform (Windows, Linux, macOS).
  • Cross-language (Python, Lua/Torch7, Matlab).
  • Easily extensible to other languages that support the HDF5 file format.
  • Concurrent/parallel data access thanks to HDF5.
  • Contains a diverse (and growing!) list of popular datasets for machine- and deep-learning tasks (object detection, action recognition, human pose estimation, etc.).

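Because the metadata lives in a standard HDF5 file, any language with HDF5 bindings can read it directly. A minimal sketch using h5py with a tiny synthetic file that mimics the `set/field` layout shown earlier (the real cache file path depends on your setup):

```python
import h5py
import numpy as np

# Write a tiny synthetic metadata file with a 'set/field' layout.
with h5py.File("toy_metadata.h5", "w") as f:
    f.create_dataset("train/labels", data=np.arange(5, dtype=np.uint8))

# Read it back -- any HDF5-capable language or tool could do the same.
with h5py.File("toy_metadata.h5", "r") as f:
    labels = f["train/labels"][:]

print(labels.tolist())  # [0, 1, 2, 3, 4]
```
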
Works with

Supported languages and platforms

Python

Lua/Torch7

Matlab

Install

Python

From PyPi

Installing dbcollection with pip is simple. Just run the following command:

$ pip install dbcollection

From source

To install dbcollection from source, follow these steps:

  • Clone the repo to your hard drive:

$ git clone --recursive https://github.com/dbcollection/dbcollection

  • cd into the dbcollection folder and run:

$ python setup.py install

Lua/Torch7

To install dbcollection's Lua/Torch7 API, you must first have the Python version installed on your system. If you do not have it installed already, you can install it via pip, conda, or from source. Here we'll use pip to install this package:

$ pip install dbcollection

After you have the Python version installed on your system, clone the Lua/Torch7 API repository:

$ git clone https://github.com/dbcollection/dbcollection-torch7

Then, all there is left to do is install the package via luarocks:

$ cd dbcollection-torch7/ && luarocks make rocks/*

Matlab

To install dbcollection's Matlab API, you must first have the Python version installed on your system. If you do not have it installed already, please see the previous steps to install dbcollection.

After you have the Python version installed, clone the Matlab API repository:

$ git clone https://github.com/dbcollection/dbcollection-matlab

Then, add dbcollection-matlab/ to your Matlab path:

addpath('<path>/dbcollection-matlab/');

Also, this package requires the JSONlab JSON encoder/decoder to work. To install it, just clone the repo to disk:

$ git clone https://github.com/fangq/jsonlab

and add it to your Matlab path:

addpath('/path/to/jsonlab');

Datasets

Some popular datasets currently available:

Caltech Pedestrian

CIFAR-10

CIFAR-100

COCO - Common Objects in Context

FLIC - Frames Labeled In Cinema

ILSVRC2012 - Imagenet Large Scale Visual Recognition Challenge 2012

INRIA Pedestrian

LSP - Leeds Sports Pose

LSPe - Leeds Sports Pose Extended

MNIST Handwritten Digit Database

MPII Human Pose

PASCAL VOC2007 - The PASCAL Visual Object Classes Challenge 2007

PASCAL VOC2012 - The PASCAL Visual Object Classes Challenge 2012

UCF101 - Action Recognition

Docs

To know more about this package and how to use it, please check out the documentation.


Contribute

Do you want to contribute?

All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome. If you would like to see additional languages being supported, please consider contributing to the project.

If you are interested in fixing issues and contributing directly to the code base, please see How to Contribute.