# Arrow

## Format specification

Apache Arrow is an in-memory columnar table format specification with support in multiple languages. This format supports exporting a dataset into a single Arrow file or into multiple shards.
## Table schema

| id     | subset | media.type | media.path | media.bytes | media.attributes | annotations | attributes |
|--------|--------|------------|------------|-------------|------------------|-------------|------------|
| string | string | uint32     | string     | binary      | binary           | binary      | binary     |
### id (string)

The ID of each entity. A tuple of (`id`, `subset`) is a unique key of each entity.
### subset (string)

The subset the entity belongs to. A tuple of (`id`, `subset`) is a unique key of each entity.
### media.type (uint32)

The type of media the entity has.

Supported media types:

- 0: `None`
- 2: `Image`
- 6: `PointCloud`
### media.path (string)

The path of the media. It can be an absolute or a relative path, or `/NOT/A/REAL/PATH` if the path is invalid.
### media.bytes (binary)

The binary data of the media. It can be `None` if one chooses not to save media when exporting.
### media.attributes (binary)

The attributes of the entity. Their contents depend on `media.type`. The byte order is little-endian.

The layout of this field is defined per media type (`Image` and `PointCloud`).
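The little-endian byte order mentioned above can be demonstrated with Python's standard `struct` module. The value packed here is arbitrary; the actual per-type field layout is not reproduced:

```python
import struct

# "<I" = little-endian unsigned 32-bit integer.
# Pack the media type code 2 (Image) and read it back.
payload = struct.pack("<I", 2)
(value,) = struct.unpack("<I", payload)
print(payload.hex(), value)  # least significant byte comes first
```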
### annotations (binary)

The annotations of the entity. The byte order is little-endian. An entity can have more than one annotation, serialized in sequence.

Supported annotation types:

- 1: `Label`
- 2: `Mask`
- 3: `Points`
- 4: `Polygon`
- 5: `PolyLine`
- 6: `Bbox`
- 7: `Caption`
- 8: `Cuboid3d`
- 11: `Ellipse`
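The numeric codes listed for media and annotation types can be collected into simple lookup tables. This is a convenience sketch built from the lists above; Datumaro's own enums may use different names internally:

```python
# Codes copied from the documented lists; illustrative only.
MEDIA_TYPES = {0: "None", 2: "Image", 6: "PointCloud"}
ANNOTATION_TYPES = {
    1: "Label", 2: "Mask", 3: "Points", 4: "Polygon",
    5: "PolyLine", 6: "Bbox", 7: "Caption", 8: "Cuboid3d",
    11: "Ellipse",
}
print(MEDIA_TYPES[2], ANNOTATION_TYPES[6])
```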
### attributes (binary)

The attributes of the entity. The byte order is little-endian.
## Convert Arrow dataset

An Arrow dataset can be converted in the following way:

```bash
datum convert --input-format arrow --input-path <path/to/dataset> \
    --output-format <desired_format> --output-dir <output/dir>
```

An Arrow dataset directory should have the following structure:

```
└─ Dataset/
   ├── <subset_name_1>-0-of-2.arrow
   ├── <subset_name_1>-1-of-2.arrow
   ├── <subset_name_2>-0-of-1.arrow
   └── ...
```

If your dataset does not follow the above directory structure, it cannot be detected and converted properly.
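A quick way to check whether file names match the layout above is a regular expression over the `<subset>-<index>-of-<total>.arrow` pattern. The helper name and the strictness of the check are this sketch's own choices:

```python
import re

# Matches the "<subset>-<index>-of-<total>.arrow" shard naming
# shown in the directory structure above (illustrative helper).
SHARD_RE = re.compile(r"^(?P<subset>.+)-(?P<index>\d+)-of-(?P<total>\d+)\.arrow$")

def is_shard_name(filename: str) -> bool:
    m = SHARD_RE.match(filename)
    return m is not None and int(m.group("index")) < int(m.group("total"))

print(is_shard_name("train-0-of-2.arrow"))  # True
print(is_shard_name("train.arrow"))         # False
```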
## Export to other formats

Datumaro can convert an Arrow dataset into any other format it supports. To get the expected result, convert the dataset to a format that supports the specified task (e.g. for panoptic segmentation: VOC, CamVID).

There are several ways to convert an Arrow dataset to other dataset formats using the CLI.

Convert a dataset from Arrow to VOC format:

```bash
datum convert --input-format arrow --input-path <path/to/dataset> \
    --output-format voc --output-dir <output/dir>
```

Or, using the Python API:

```python
import datumaro as dm

dataset = dm.Dataset.import_from('<path/to/dataset>', 'arrow')
dataset.export('save_dir', 'voc', save_media=True, image_ext="AS-IS", num_workers=4)
```
## Export to Arrow

There are several ways to convert a dataset to Arrow format.

Convert a dataset from VOC format to Arrow format:

```bash
# converting to arrow format from another format
datum convert --input-format voc --input-path <path/to/dataset> \
    --output-format arrow --output-dir <output/dir> -- --save-media
```

Extra options for exporting to Arrow format:

- `--save-media` allows exporting the dataset while saving media files (default: `False`)
- `--image-ext IMAGE_EXT` allows choosing which scheme to use for images when `--save-media` is `True` (default: `AS-IS`). Available options are `AS-IS`, `PNG`, `TIFF`, `JPEG/95`, `JPEG/75`, and `NONE`
- `--max-shard-size MAX_SHARD_SIZE` allows specifying the maximum number of dataset items per file when saving into Arrow format (default: `1000`)
- `--num-shards NUM_SHARDS` allows specifying the number of shards to generate; `--num-shards` and `--max-shard-size` are mutually exclusive (default: `None`)
- `--num-workers NUM_WORKERS` allows using multiprocessing for the export; if `num_workers = 0`, multiprocessing is not used (default: `0`)
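Under the assumption that `--max-shard-size` caps the number of items per file, the resulting shard count per subset is simply the item count divided by the shard size, rounded up. This arithmetic is a sketch of the assumed behavior, not taken from Datumaro's source:

```python
import math

def expected_num_shards(num_items: int, max_shard_size: int) -> int:
    # Ceiling division: each shard holds at most max_shard_size items;
    # even an empty subset is assumed to produce one file.
    return max(1, math.ceil(num_items / max_shard_size))

print(expected_num_shards(2500, 1000))  # 3 shards: 1000 + 1000 + 500 items
```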
## Examples

Examples of using this format from the code can be found in the format tests.