A collection of popular datasets for deep learning.
dbcollection
dbcollection is a Python module for loading and managing datasets with a small, simple set of commands. It is designed with cross-platform and cross-language support in mind and is distributed under the MIT license. With this package, you’ll have quick and simple access to a collection of datasets for a variety of tasks such as object classification, detection, human pose estimation, and captioning.
This package is available for Linux, macOS and Windows.
Why
Why use dbcollection?
dbcollection provides a cross-platform, cross-language framework to manage datasets and to quickly load/fetch data with minimal wasted resources. It also ships with a diverse list of datasets to work with.
The main goal of this project is to help save time for users when developing/deploying/sharing their work.
Reasons why you should use dbcollection
Here are some of the problems dbcollection tries to solve:
- Downloading, extracting and parsing different datasets without prior knowledge of their inner workings is error-prone.
- Constantly writing the same boilerplate code, or adapting existing code for new projects, is a hassle.
- Disk space gets littered with data files everywhere as you work on different projects with no centralized storage.
- Wasted time and computer resources (mostly memory) when loading large datasets.
- Many datasets use .json files to distribute their (meta)data, which is fine for smaller datasets but is not an efficient way to store large amounts of (meta)data for large datasets.
- Trying a new dataset usually means spending a significant portion of your time learning how to use it so you can load/parse it. And good luck if the dataset is distributed in some complicated, in-house format that can only be extracted with a specific toolbox (e.g., Caltech Pedestrian).
- Having to learn a toolbox (and maybe a language) just to fetch data from a dataset you only want to try out before doing anything serious is tedious at best and not viable at worst.
- If you start using a new language you’ll probably write the same scripts to load/parse your datasets.
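As a rough illustration of the point above about .json files, the following sketch compares the size of the same values stored as JSON text and as fixed-width binary. The data is made up for the example, and it uses Python's standard `struct` module rather than HDF5, so it only hints at one aspect (storage size) of the trade-off.

```python
import json
import struct

# Hypothetical metadata: 10,000 pixel coordinates, each small enough for 16 bits.
coords = [(i * 7919) % 60000 for i in range(10_000)]

# Text encoding: every number is spelled out as decimal digits plus separators.
json_bytes = json.dumps(coords).encode("utf-8")

# Fixed-width binary encoding: exactly 2 bytes per value (little-endian uint16).
binary_bytes = struct.pack(f"<{len(coords)}H", *coords)

print(f"JSON:   {len(json_bytes)} bytes")
print(f"binary: {len(binary_bytes)} bytes")

# The binary form decodes back to the same values without parsing any text.
decoded = list(struct.unpack(f"<{len(coords)}H", binary_bytes))
```

Size is only part of the story: a text file usually has to be parsed in full before any value can be used, whereas a binary container can be read selectively.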
Usage
Simple to use
Using the module is pretty straightforward. To import it just do:
>>> import dbcollection as dbc
To load a dataset, you only need a single method, which returns a data loader object that you can then use to fetch data.
>>> mnist = dbc.load('mnist')
This data loader object contains information about the dataset’s name, task, data, cache paths, set splits, and some methods for querying and loading data from the HDF5 metadata file.
For example, the loader can show you how the data is structured inside the metadata file.
To fetch data samples from a field, it is as easy as calling a method with the set name, the field name and the row id(s) you want to select. Retrieving the first 10 images, for instance, takes a single call.
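The access pattern described above (set name, field name, row ids) can be sketched with a toy in-memory stand-in. `ToyLoader`, its `get` method and the data below are hypothetical and exist only for this illustration. They are not dbcollection's classes, and the real loader reads from the HDF5 metadata file, so check the documentation for the actual method names and signatures.

```python
class ToyLoader:
    """Hypothetical in-memory stand-in for a data loader (not dbcollection's class)."""

    def __init__(self, data):
        # data maps: set name -> field name -> list of rows
        self._data = data

    def get(self, set_name, field, index=None):
        """Return rows of `field` in `set_name`.

        `index` may be None (all rows), a single int, or an iterable of ints.
        """
        rows = self._data[set_name][field]
        if index is None:
            return list(rows)
        if isinstance(index, int):
            return rows[index]
        return [rows[i] for i in index]


# Made-up data: 100 tiny 2x2 "images" and their labels.
toy = ToyLoader({
    "train": {
        "images": [[[i, i], [i, i]] for i in range(100)],
        "labels": [i % 10 for i in range(100)],
    }
})

# Select the first 10 rows of the "images" field in the "train" set.
first_ten = toy.get("train", "images", range(10))
print(len(first_ten))
```

Addressing data by set, field and row lets a caller pull only the rows it needs instead of loading a whole dataset into memory.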
Note: For more information about using this module, please check the documentation or the available notebooks for guidance.
Features
Key features of the dbcollection library
- Simple API to load/download/setup/manage datasets.
- Simple API to fetch data from a dataset.
- Store and pull data from disk or from memory, you choose!
- Datasets only need to be set up/processed once, so the next time you use them they load instantly!
- Cross-platform (Windows, Linux, macOS).
- Cross-language (Python, Lua/Torch7, Matlab).
- Easily extensible to other languages that support the HDF5 file format.
- Concurrent/parallel data access thanks to HDF5.
- Contains a diverse (and growing!) list of popular datasets for machine learning and deep learning tasks (object detection, action recognition, human pose estimation, etc.).
Install
Python
From PyPI
Installing dbcollection using pip is simple. Just run the following command:
$ pip install dbcollection
From source
To install dbcollection from source, follow these steps:
- Clone the repo to your hard drive:
$ git clone --recursive https://github.com/dbcollection/dbcollection
- cd to the dbcollection folder and run:
$ python setup.py install
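After either install route, one quick way to confirm that the module is visible to your interpreter is to ask Python's import machinery whether it can find it. This is a minimal sketch using only the standard library. The helper name `is_installed` is made up for this example and is not part of dbcollection.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the current Python path."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("dbcollection"):
    print("dbcollection is importable")
else:
    print("dbcollection not found; check that it was installed for this interpreter")
```

If the check fails right after a successful install, the usual cause is that pip and the `python` you are running belong to different environments.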
Lua/Torch7
To install dbcollection’s Lua/Torch7 API, you must first have the Python version installed on your system. If it is not installed yet, you can install it via pip, conda or from source. Here we’ll use pip to install this package:
$ pip install dbcollection
Once the Python version is installed, get the Lua/Torch7 API by cloning the following repository:
$ git clone https://github.com/dbcollection/dbcollection-torch7
Then, all there is to do is to install the package via luarocks:
$ cd dbcollection-torch7/ && luarocks make rocks/*
Matlab
To install dbcollection’s Matlab API, you must first have the Python version installed on your system. If it is not installed yet, please see the previous steps to install dbcollection.
Once the Python version is installed, get the Matlab API from the following repository:
$ git clone https://github.com/dbcollection/dbcollection-matlab
Then, add dbcollection-matlab/ to your Matlab path:
addpath('<path>/dbcollection-matlab/');
Also, this package requires the JSONlab JSON encoder/decoder to work. To install it, just clone the repo to disk:
$ git clone https://github.com/fangq/jsonlab
and add it to your Matlab path:
addpath('/path/to/jsonlab');
Datasets
Some popular datasets currently available:
Caltech Pedestrian
CIFAR-10
CIFAR-100
COCO - Common Objects in Context
FLIC - Frames Labeled In Cinema
ILSVRC2012 - Imagenet Large Scale Visual Recognition Challenge 2012
INRIA Pedestrian
LSP - Leeds Sports Pose
LSPe - Leeds Sports Pose Extended
MNIST Handwritten Digit Database
MPII Human Pose
PASCAL VOC2007 - The PASCAL Visual Object Classes Challenge 2007
PASCAL VOC2012 - The PASCAL Visual Object Classes Challenge 2012
UCF101 - Action Recognition
Docs
To learn more about this package and how to use it, please check out the documentation.
Contribute
Do you want to contribute?
All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome. If you would like to see additional languages being supported, please consider contributing to the project.
If you are interested in fixing issues and contributing directly to the code base, please see How to Contribute.