# Download The command `datum download` can be used to download datasets from Kaggle and TensorFlow dataset public libraries. ## Describing downloadable datasets This command reports various information about datasets that can be downloaded with the `download` command. The information is reported either as human-readable text (the default) or as a JSON object. The format can be selected with the `--report-format` option. Note that this command is only available for TensorFlow Datasets and cannot be used with Kaggle. When the JSON output format is selected, the output document has the following schema: ```json { "": { "default_output_format": "", "description": "", "download_size": , "home_url": "", "human_name": "", "num_classes": , "subsets": { "": { "num_items": }, ... }, "version": "" }, ... } ``` `home_url` may be `null` if there is no suitable web page for the dataset. `num_classes` may be `null` if the dataset does not involve classification. `version` currently contains the version number supplied by TFDS. In future versions of Datumaro, datasets might come from other sources; the way version numbers will be set for those is to be determined. New object members may be added in future versions of Datumaro. ### Usage ``` datum download tfds describe [-h] [--report-format {text,json}] [--report-file REPORT_FILE] ``` Parameters: - `-h`, `--help` - Print the help message and exit. - `--report-format` (text or json) - Format in which to report the information. By default, `text` is used. - `--report-file` (string) - File to which to write the report. By default, the report is written to the standard output stream. ## Downloading datasets This command downloads a publicly available dataset and saves it to a local directory. In terms of syntax, this command is similar to [`convert`](./convert.md), but instead of taking a local directory as the source, it takes a dataset ID. A list of supported datasets and output formats can be found in the `--help` output of this command. To use Datumaro ``download`` feature, you should install Datumaro with ``[tf,tfds]`` extras for TensorFlow Datasets or ``[kaggle]`` for Kaggle Datasets: ::::{tab-set} :::{tab-item} TensorFlow ```bash pip install datumaro[tf,tfds] ``` ::: :::{tab-item} Kaggle ```bash pip install datumaro[kaggle] ``` ::: :::: To use a proxy for downloading, configure it with the conventional [curl environment variables](https://everything.curl.dev/usingcurl/proxies/env). ### Usage ```console datum download [kaggle|tfds] get [-h] -i DATASET_ID [-f OUTPUT_FORMAT] [-o DST_DIR] [--overwrite] [-s SUBSET] [-- EXTRA_EXPORT_ARGS] ``` Parameters: - `-h`, `--help` - Print the help message and exit. - `-i`, `--dataset-id` (string) - ID of the dataset to download. - `-f`, `--output-format` (string) - Output format. By default, the format of the original dataset is used. - `-o, --output-dir` (string) - Output directory. By default, a subdirectory in the current directory is used. - `--overwrite` - Allows overwriting existing files in the output directory, when it is not empty. - `--subset` (string) - Which subset of the dataset to save. By default, all subsets are saved. Note that due to limitations of TFDS, all subsets are downloaded even if this option is specified. - `-- ` - Additional arguments for the format writer (use `-- -h` for help). Must be specified after the main command arguments. ### Examples - Download the MNIST dataset, saving it in the ImageNet text format ```console datum download tfds get -i tfds:mnist -f imagenet_txt -- --save-media ```