datumaro.components.dataset#

Functions

eager_mode([new_mode, dataset])

Classes

Dataset([source, infos, categories, ...])

Represents a dataset and contains meta-information about labels and dataset items.

DatasetSubset(parent, name)

StreamDataset([source, infos, categories, ...])

class datumaro.components.dataset.Dataset(source: IDataset | None = None, *, infos: Dict[str, Any] | None = None, categories: Dict[AnnotationType, Categories] | None = None, media_type: Type[MediaElement] | None = None, ann_types: Set[AnnotationType] | None = None, env: Environment | None = None)[source]#

Bases: IDataset

Represents a dataset and contains meta-information about labels and dataset items. Provides iteration over and access to dataset elements.

By default, all operations are performed lazily; this can be changed by modifying the eager property or by using the eager_mode context manager.

A Dataset is expected to have a single media type for its items. If the dataset is filled manually or from extractors and an item's media type does not match, an error is raised.
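
A minimal sketch of the lazy behavior described above, assuming dataset is an existing Dataset instance and that the built-in "reindex" transform plugin is available:

dataset.transform("reindex")     # recorded lazily; nothing is computed yet
dataset.init_cache()             # forces evaluation of the pending operations
assert dataset.is_cache_initialized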

classmethod from_iterable(iterable: ~typing.Iterable[~datumaro.components.dataset_base.DatasetItem], infos: ~typing.Dict[str, ~typing.Any] | None = None, categories: ~typing.Dict[~datumaro.components.annotation.AnnotationType, ~datumaro.components.annotation.Categories] | ~typing.List[str] | None = None, *, env: ~datumaro.components.environment.Environment | None = None, media_type: ~typing.Type[~datumaro.components.media.MediaElement] = <class 'datumaro.components.media.Image'>, ann_types: ~typing.Set[~datumaro.components.annotation.AnnotationType] | None = []) Dataset[source]#

Creates a new dataset from an iterable object producing dataset items, such as a generator or a list. It is a convenient way to create and fill a custom dataset.

Parameters:
  • iterable – An iterable which returns dataset items

  • infos – A dictionary of the dataset specific information

  • categories – A simple list of labels or complete information about labels. If not specified, an empty list of labels is assumed.

  • media_type – Media type for the dataset items. If the sequence contains items with mismatching media type, an error is raised during caching

  • env – A context for plugins, which will be used for this dataset. If not specified, the builtin plugins will be used.

Returns:

A new dataset with specified contents

Return type:

Dataset
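
For illustration, a sketch of building a small dataset from a list (the ids, labels, and annotations are made up for the example):

from datumaro.components.dataset import Dataset
from datumaro.components.dataset_base import DatasetItem
from datumaro.components.annotation import Bbox, Label

dataset = Dataset.from_iterable(
    [
        DatasetItem(id="img1", subset="train", annotations=[Label(0)]),
        DatasetItem(id="img2", subset="val",
                    annotations=[Bbox(0, 0, 10, 10, label=1)]),
    ],
    categories=["cat", "dog"],
)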

classmethod from_extractors(*sources: IDataset, env: Environment | None = None, merge_policy: str = 'exact') Dataset[source]#

Creates a new dataset from one or several `Extractor`s.

In case of a single input, creates a lazy wrapper around the input. In case of several inputs, merges them and caches the resulting dataset.

Parameters:
  • sources – one or many input extractors

  • env – A context for plugins, which will be used for this dataset. If not specified, the builtin plugins will be used.

  • merge_policy – Policy on how to merge multiple datasets. Possible options are “exact”, “intersect”, and “union”.

Returns:

A new dataset with contents produced by input extractors

Return type:

Dataset
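
A sketch of both cases, assuming dataset_a and dataset_b are existing datasets:

# lazy wrapper around a single source
wrapped = Dataset.from_extractors(dataset_a)

# merge two sources, keeping the union of their labels
merged = Dataset.from_extractors(dataset_a, dataset_b, merge_policy="union")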

define_infos(infos: Dict[str, Any]) None[source]#
define_categories(categories: Dict[AnnotationType, Categories]) None[source]#
init_cache() None[source]#
get_subset(name) DatasetSubset[source]#
subsets() Dict[str, DatasetSubset][source]#

Enumerates subsets in the dataset. Each subset can be a dataset itself.

infos() Dict[str, Any][source]#

Returns the dataset meta-information.

categories() Dict[AnnotationType, Categories][source]#

Returns meta-information about the dataset labels.

media_type() Type[MediaElement][source]#

Returns the media type of the dataset items.

All items are expected to have the same media type, which is constant and known immediately after object construction (i.e. it does not require dataset iteration).

ann_types() Set[AnnotationType][source]#

Returns the available task types, derived from the dataset annotation types.

get(id: str, subset: str | None = None) DatasetItem | None[source]#

Provides random access to dataset items.

get_annotated_items()[source]#
get_annotations()[source]#
get_datasetitem_by_path(path)[source]#
get_label_cat_names()[source]#
get_subset_info() str[source]#
get_infos() Tuple[str][source]#
get_categories_info() Tuple[str][source]#
put(item: DatasetItem, id: str | None = None, subset: str | None = None) None[source]#
remove(id: str, subset: str | None = None) None[source]#
filter(expr: str, *, filter_annotations: bool = False, remove_empty: bool = False) Dataset[source]#
filter(filter_func: Callable[[DatasetItem], bool] | Callable[[DatasetItem, Annotation], bool], *, filter_annotations: bool = False, remove_empty: bool = False) Dataset
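
For illustration, a sketch of both overloads (the XPath expression and the lambda are illustrative):

# XPath form: keep only "cat" annotations, dropping items left empty
only_cats = dataset.filter('/item/annotation[label="cat"]',
                           filter_annotations=True, remove_empty=True)

# callable form: keep only items from the "train" subset
train_only = dataset.filter(lambda item: item.subset == "train")
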
update(source: DatasetPatch | IDataset | Iterable[DatasetItem]) Dataset[source]#

Updates items of the current dataset from another dataset or an iterable (the source). Items from the source overwrite matching items in the current dataset. Unmatched items are just appended.

If the source is a DatasetPatch, the removed items in the patch will be removed in the current dataset.

If the source is a dataset, labels are matched. If the labels match, but the order is different, the annotation labels will be remapped to the current dataset label order during updating.

Returns: self
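
A brief sketch, assuming other_dataset is a dataset with matching labels:

dataset.update(other_dataset)                           # overwrite/append items
dataset.update([DatasetItem(id="img1", subset="val")])  # or update from items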

transform(method: str | Type[Transform], **kwargs) Dataset[source]#

Applies a transformation to the dataset items.

Results are stored in-place. Modifications are applied lazily. Transforms are not allowed to change the media type of dataset items.

Parameters:
  • method – The transformation to be applied to the dataset. If a string is passed, it is treated as a plugin name, which is searched for in the dataset environment.

  • **kwargs – Parameters for the transformation

Returns: self
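
A sketch of both ways to specify the transformation, assuming the built-in "reindex" transform (and its start parameter, plus the module path below) is available:

dataset.transform("reindex", start=0)   # look up the plugin by name

from datumaro.plugins.transforms import Reindex   # assumed module path
dataset.transform(Reindex, start=0)               # or pass the type directly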

select(pred: Callable[[DatasetItem], bool]) Dataset[source]#
property data_path: str | None#
property format: str | None#
property options: Dict[str, Any]#
property is_modified: bool#
get_patch() DatasetPatch[source]#
property env: Environment#
property is_cache_initialized: bool#
property is_eager: bool#
property is_bound: bool#
bind(path: str, format: str | None = None, *, options: Dict[str, Any] | None = None) None[source]#

Binds the dataset to a specific directory and allows setting default saving parameters.

Subsequent saves will be written to this directory by default and will use the stored parameters.

flush_changes()[source]#
export(save_dir: str, format: str | Type[Exporter], *, progress_reporter: ProgressReporter | None = None, error_policy: ExportErrorPolicy | None = None, **kwargs) None[source]#

Saves the dataset in some format.

Parameters:
  • save_dir – The output directory

  • format – The desired output format. If a string is passed, it is treated as a plugin name, which is searched for in the dataset environment.

  • progress_reporter – An object to report progress

  • error_policy – An object to report format-related errors

  • **kwargs – Parameters for the format
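
A sketch of a typical export call, assuming the "coco" exporter plugin is available and supports the save_media option:

dataset.export("output_dir/", "coco", save_media=True)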

save(save_dir: str | None = None, **kwargs) None[source]#
classmethod load(path: str, **kwargs) Dataset[source]#
classmethod import_from(path: str, format: str | None = None, *, env: Environment | None = None, progress_reporter: ProgressReporter | None = None, error_policy: ImportErrorPolicy | None = None, **kwargs) Dataset[source]#

Creates a Dataset instance from a dataset on the disk.

Parameters:
  • path – The input file or directory

  • format – Dataset format. If a string is passed, it is treated as a plugin name, which is searched for in the env plugin context. If not set, the format is detected automatically, using the env plugin context.

  • env – A plugin collection. If not set, the built-in plugins are used

  • progress_reporter – An object to report progress. Implies eager loading.

  • error_policy – An object to report format-related errors. Implies eager loading.

  • **kwargs – Parameters for the format
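
A brief sketch, assuming a VOC-formatted dataset on disk:

dataset = Dataset.import_from("path/to/voc_dataset", "voc")

# or let the format be detected automatically
dataset = Dataset.import_from("path/to/voc_dataset")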

static detect(path: str, *, env: Environment | None = None, depth: int = 2) str[source]#

Attempts to detect the dataset format of a given directory.

This function tries to detect a single format and fails if it’s not possible. Check Environment.detect_dataset() for a function that reports status for each format checked.

Parameters:
  • path – The directory to check

  • depth – The maximum depth for recursive search

  • env – A plugin collection. If not set, the built-in plugins are used
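
A sketch combining detection with import (the path is illustrative):

fmt = Dataset.detect("path/to/dataset")
dataset = Dataset.import_from("path/to/dataset", fmt)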

property is_stream: bool#

A Boolean indicating whether the dataset is a stream.

If the dataset is a stream, dataset items are generated on demand from its iterator.

clone() Dataset[source]#

Create a deep copy of this dataset.

Returns:

A cloned instance of the Dataset.

datumaro.components.dataset.eager_mode(new_mode: bool = True, dataset: Dataset | None = None) None[source]#
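
eager_mode is a context manager that toggles eager evaluation, either globally or for a single dataset; a brief sketch, assuming the "reindex" transform plugin is available:

from datumaro.components.dataset import eager_mode

with eager_mode(dataset=dataset):
    dataset.transform("reindex")   # applied immediately inside the block
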
class datumaro.components.dataset.Annotation(*, id: int = 0, attributes: Dict[str, Any] = _Nothing.NOTHING, group: int = 0, object_id: int = -1)[source]#

Bases: object

A base annotation class.

Derived classes must define the ‘_type’ class variable with a value from the AnnotationType enum.

Method generated by attrs for class Annotation.

id: int#
attributes: Dict[str, Any]#
group: int#
object_id: int#
property type: AnnotationType#
as_dict() Dict[str, Any][source]#

Returns a dictionary { field_name: value }

wrap(**kwargs)[source]#

Returns a modified copy of the object
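
A short sketch with a concrete annotation type (Bbox is used for illustration):

from datumaro.components.annotation import Bbox

bbox = Bbox(0, 0, 10, 10, label=1)
moved = bbox.wrap(group=2)    # modified copy; the original is unchanged
print(moved.as_dict())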

class datumaro.components.dataset.AnnotationType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: IntEnum

unknown = 0#
label = 1#
mask = 2#
points = 3#
polygon = 4#
polyline = 5#
bbox = 6#
caption = 7#
cuboid_3d = 8#
super_resolution_annotation = 9#
depth_annotation = 10#
ellipse = 11#
tabular = 13#
rotated_bbox = 14#
cuboid_2d = 15#
class datumaro.components.dataset.Any(*args, **kwargs)[source]#

Bases: object

Special type indicating an unconstrained type.

  • Any is compatible with every type.

  • Any is assumed to have all methods.

  • All values are assumed to be instances of Any.

Note that all the above statements are true from the point of view of static type checkers. At runtime, Any should not be used with instance checks.

class datumaro.components.dataset.DatasetBase(*, length: int | None = None, subsets: ~typing.Sequence[str] | None = None, media_type: ~typing.Type[~datumaro.components.media.MediaElement] = <class 'datumaro.components.media.Image'>, ann_types: ~typing.List[~datumaro.components.annotation.AnnotationType] | None = None, ctx: ~datumaro.components.contexts.importer.ImportContext | None = None)[source]#

Bases: _DatasetBase, CliPlugin

A base class for user-defined and built-in extractors. Should be used in cases where SubsetBase is not enough, or where its use causes problems with performance, implementation, etc.

media_type()[source]#

Returns the media type of the dataset items.

All items are expected to have the same media type, which is constant and known immediately after object construction (i.e. it does not require dataset iteration).

ann_types()[source]#

Returns the available task types, derived from the dataset annotation types.

exception datumaro.components.dataset.DatasetImportError[source]#

Bases: DatumaroError

class datumaro.components.dataset.DatasetItem(id: str, *, subset: str | None = None, media: str | MediaElement | None = None, annotations: List[Annotation] | None = None, attributes: Dict[str, Any] = None)[source]#

Bases: object

id: str#
subset: str#
media: MediaElement | None#
annotations: Annotations#
attributes: Dict[str, Any]#
wrap(**kwargs)[source]#
media_as(t: Type[T]) T[source]#
class datumaro.components.dataset.DatasetItemStorageDatasetView(parent: DatasetItemStorage, infos: Dict[str, Any], categories: Dict[AnnotationType, Categories], media_type: Type[MediaElement] | None, ann_types: Set[AnnotationType] | None)[source]#

Bases: IDataset

class Subset(parent: DatasetItemStorageDatasetView, name: str)[source]#

Bases: IDataset

put(item)[source]#
get(id, subset=None)[source]#

Provides random access to dataset items.

remove(id, subset=None)[source]#
get_subset(name)[source]#
subsets()[source]#

Enumerates subsets in the dataset. Each subset can be a dataset itself.

infos()[source]#

Returns the dataset meta-information.

categories()[source]#

Returns meta-information about the dataset labels.

media_type()[source]#

Returns the media type of the dataset items.

All items are expected to have the same media type, which is constant and known immediately after object construction (i.e. it does not require dataset iteration).

ann_types()[source]#

Returns the available task types, derived from the dataset annotation types.

infos()[source]#

Returns the dataset meta-information.

categories()[source]#

Returns meta-information about the dataset labels.

get_subset(name)[source]#
subsets()[source]#

Enumerates subsets in the dataset. Each subset can be a dataset itself.

get(id, subset=None)[source]#

Provides random access to dataset items.

media_type()[source]#

Returns the media type of the dataset items.

All items are expected to have the same media type, which is constant and known immediately after object construction (i.e. it does not require dataset iteration).

ann_types()[source]#

Returns the available task types, derived from the dataset annotation types.

class datumaro.components.dataset.DatasetPatch(data: DatasetItemStorage, infos: Dict[str, Any], categories: Dict[AnnotationType, Categories], updated_items: Dict[Tuple[str, str], ItemStatus], updated_subsets: Dict[str, ItemStatus] = None)[source]#

Bases: object

class DatasetPatchWrapper(patch: DatasetPatch, parent: IDataset)[source]#

Bases: DatasetItemStorageDatasetView

subsets()[source]#

Enumerates subsets in the dataset. Each subset can be a dataset itself.

property updated_subsets: Dict[str, ItemStatus]#
as_dataset(parent: IDataset) IDataset[source]#
class datumaro.components.dataset.DatasetStorage(source: IDataset | DatasetItemStorage, infos: Dict[str, Any] | None = None, categories: Dict[AnnotationType, Categories] | None = None, media_type: Type[MediaElement] | None = None, ann_types: Set[AnnotationType] | None = None)[source]#

Bases: IDataset

is_cache_initialized() bool[source]#
init_cache() None[source]#
infos() Dict[str, Any][source]#

Returns the dataset meta-information.

define_infos(infos: Dict[str, Any])[source]#
categories() Dict[AnnotationType, Categories][source]#

Returns meta-information about the dataset labels.

define_categories(categories: Dict[AnnotationType, Categories])[source]#
media_type() Type[MediaElement][source]#

Returns the media type of the dataset items.

All items are expected to have the same media type, which is constant and known immediately after object construction (i.e. it does not require dataset iteration).

ann_types() Set[AnnotationType][source]#

Returns the available task types, derived from the dataset annotation types.

put(item: DatasetItem) None[source]#
get(id: str, subset: str | None = None) DatasetItem | None[source]#

Provides random access to dataset items.

remove(id: str, subset: str | None = None) None[source]#
get_subset(name: str) IDataset[source]#
subsets() Dict[str, IDataset][source]#

Enumerates subsets in the dataset. Each subset can be a dataset itself.

get_annotated_items() int[source]#
get_annotations() int[source]#
get_datasetitem_by_path(path: str) DatasetItem | None[source]#
transform(method: Type[Transform], *args, **kwargs) None[source]#
has_updated_items()[source]#
get_patch() DatasetPatch[source]#
flush_changes()[source]#
update(source: DatasetPatch | IDataset | Iterable[DatasetItem])[source]#
class datumaro.components.dataset.DatasetSubset(parent: Dataset, name: str)[source]#

Bases: IDataset

put(item)[source]#
get(id, subset=None)[source]#

Provides random access to dataset items.

remove(id, subset=None)[source]#
get_subset(name)[source]#
subsets()[source]#

Enumerates subsets in the dataset. Each subset can be a dataset itself.

infos()[source]#

Returns the dataset meta-information.

categories()[source]#

Returns meta-information about the dataset labels.

media_type()[source]#

Returns the media type of the dataset items.

All items are expected to have the same media type, which is constant and known immediately after object construction (i.e. it does not require dataset iteration).

ann_types()[source]#

Returns the available task types, derived from the dataset annotation types.

get_annotated_items()[source]#
get_annotations()[source]#
as_dataset() Dataset[source]#
exception datumaro.components.dataset.DatumaroError[source]#

Bases: Exception

class datumaro.components.dataset.Environment(use_lazy_import: bool = True)[source]#

Bases: object

property extractors: DatasetBaseRegistry#
property importers: ImporterRegistry#
property exporters: ExporterRegistry#
property generators: GeneratorRegistry#
property transforms: TransformRegistry#
property validators: ValidatorRegistry#
load_plugins(plugins_dir)[source]#
register_plugins(plugins)[source]#
make_extractor(name, *args, **kwargs)[source]#
make_importer(name, *args, **kwargs)[source]#
make_exporter(name, *args, **kwargs)[source]#
make_transform(name, *args, **kwargs)[source]#
is_format_known(name)[source]#
detect_dataset(path: str, depth: int = 1, rejection_callback: Callable[[str, RejectionReason, str], None] | None = None) List[str][source]#
classmethod merge(envs: Sequence[Environment]) Environment[source]#
classmethod release_builtin_plugins()[source]#
class datumaro.components.dataset.ExportContext(progress_reporter=None, error_policy=None)[source]#

Bases: object

Method generated by attrs for class ExportContext.

progress_reporter: ProgressReporter#
error_policy: ExportErrorPolicy#
class datumaro.components.dataset.ExportErrorPolicy[source]#

Bases: object

report_item_error(error: Exception, *, item_id: Tuple[str, str]) None[source]#

Allows reporting a problem with a dataset item. If this function returns, the converter must skip the item.

report_annotation_error(error: Exception, *, item_id: Tuple[str, str]) None[source]#

Allows reporting a problem with a dataset item annotation. If this function returns, the converter must skip the annotation.

fail(error: Exception) NoReturn[source]#
class datumaro.components.dataset.Exporter(extractor: IDataset, save_dir: str, *, save_media: bool | None = None, image_ext: str | None = None, default_image_ext: str | None = None, save_dataset_meta: bool = False, stream: bool = False, ctx: ExportContext | None = None)[source]#

Bases: CliPlugin

DEFAULT_IMAGE_EXT = None#
classmethod build_cmdline_parser(**kwargs)[source]#
classmethod convert(extractor, save_dir, **options)[source]#
classmethod patch(dataset, patch, save_dir, **options)[source]#
apply()[source]#

Execute the data-format conversion

property can_stream: bool#

A flag indicating whether the exporter can export the dataset in a streaming manner.

class datumaro.components.dataset.IDataset[source]#

Bases: object

subsets() Dict[str, IDataset][source]#

Enumerates subsets in the dataset. Each subset can be a dataset itself.

get_subset(name) IDataset[source]#
infos() Dict[str, Any][source]#

Returns the dataset meta-information.

categories() Dict[AnnotationType, Categories][source]#

Returns meta-information about the dataset labels.

get(id: str, subset: str | None = None) DatasetItem | None[source]#

Provides random access to dataset items.

media_type() Type[MediaElement][source]#

Returns the media type of the dataset items.

All items are expected to have the same media type, which is constant and known immediately after object construction (i.e. it does not require dataset iteration).

ann_types() List[AnnotationType][source]#

Returns the available task types, derived from the dataset annotation types.

property is_stream: bool#

A Boolean indicating whether the dataset is a stream.

If the dataset is a stream, dataset items are generated on demand from its iterator.

class datumaro.components.dataset.Image(size: Tuple[int, int] | None = None, ext: str | None = None, *args, **kwargs)[source]#

Bases: MediaElement[ndarray]

classmethod from_file(path: str, *args, **kwargs)[source]#
classmethod from_numpy(data: ndarray | Callable[[], ndarray], *args, **kwargs)[source]#
classmethod from_bytes(data: bytes | Callable[[], bytes], *args, **kwargs)[source]#
property has_size: bool#

Indicates that size info is cached and won’t require image loading

property size: Tuple[int, int] | None#

Returns (H, W)

property ext: str | None#

Media file extension (with the leading dot)
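
A sketch of the two most common construction paths (the file name and array shape are illustrative):

import numpy as np
from datumaro.components.media import Image

img_from_disk = Image.from_file(path="image.jpg")   # loaded lazily

img_in_memory = Image.from_numpy(data=np.zeros((10, 20, 3), dtype=np.uint8))
print(img_in_memory.size)   # (10, 20), i.e. (H, W)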

class datumaro.components.dataset.ImportContext(progress_reporter=None, error_policy=None)[source]#

Bases: object

Method generated by attrs for class ImportContext.

progress_reporter: ProgressReporter#
error_policy: ImportErrorPolicy#
class datumaro.components.dataset.ImportErrorPolicy[source]#

Bases: object

report_item_error(error: Exception, *, item_id: Tuple[str, str]) None[source]#

Allows reporting a problem with a dataset item. If this function returns, the extractor must skip the item.

report_annotation_error(error: Exception, *, item_id: Tuple[str, str]) None[source]#

Allows reporting a problem with a dataset item annotation. If this function returns, the extractor must skip the annotation.

fail(error: Exception) NoReturn[source]#
class datumaro.components.dataset.ItemTransform(extractor: IDataset)[source]#

Bases: Transform

transform_item(item: DatasetItem) DatasetItem | None[source]#

Returns a modified copy of the input item.

Avoid changing and returning the input item, because it can lead to unexpected problems. Use wrap_item() or item.wrap() to simplify copying.
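
A minimal sketch of a custom ItemTransform following this contract (the transform itself is made up for the example):

from datumaro.components.dataset import ItemTransform

class UpperCaseIds(ItemTransform):
    """Toy transform that upper-cases item ids."""

    def transform_item(self, item):
        # return a modified copy instead of mutating the input
        return self.wrap_item(item, id=item.id.upper())

dataset.transform(UpperCaseIds)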

class datumaro.components.dataset.LabelCategories(items: List[str] = _Nothing.NOTHING, label_groups: List[LabelGroup] = _Nothing.NOTHING, *, attributes: Set[str] = _Nothing.NOTHING)[source]#

Bases: Categories

Method generated by attrs for class LabelCategories.

class Category(name, parent: str = '', attributes: Set[str] = _Nothing.NOTHING)[source]#

Bases: object

Method generated by attrs for class LabelCategories.Category.

name: str#
parent: str#
attributes: Set[str]#
class LabelGroup(name, labels: List[str] = [], group_type: GroupType = GroupType.EXCLUSIVE)[source]#

Bases: object

Method generated by attrs for class LabelCategories.LabelGroup.

name: str#
labels: List[str]#
group_type: GroupType#
items: List[str]#
label_groups: List[LabelGroup]#
classmethod from_iterable(iterable: Iterable[str | Tuple[str] | Tuple[str, str] | Tuple[str, str, List[str]]]) LabelCategories[source]#

Creates a LabelCategories object from an iterable.

Parameters:

iterable

This iterable object can be:

  • a list of str - will be interpreted as a list of Category names

  • a list of tuples of positional arguments - will be used to generate Categories with these arguments

Returns: a LabelCategories object
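
A sketch of the accepted input shapes (the names and attributes are illustrative):

from datumaro.components.annotation import LabelCategories

# plain names
cats = LabelCategories.from_iterable(["cat", "dog"])

# (name,), (name, parent), and (name, parent, attributes) tuples
cats = LabelCategories.from_iterable([
    ("cat",),
    ("kitten", "cat"),
    ("dog", "", ["tail_visible"]),
])
idx, category = cats.find("kitten")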

add(name: str, parent: str | None = None, attributes: Set[str] | None = None) int[source]#
add_label_group(name: str, labels: List[str], group_type: GroupType) int[source]#
find(name: str) Tuple[int | None, Category | None][source]#
class datumaro.components.dataset.MediaElement(*args, **kwargs)[source]#

Bases: Generic[AnyData]

as_dict() Dict[str, Any][source]#
from_self(**kwargs)[source]#
property type: MediaType#
property data: AnyData | None#
property has_data: bool#
property bytes: bytes | None#
save(fp: str | IOBase)[source]#
exception datumaro.components.dataset.MultipleFormatsMatchError(formats)[source]#

Bases: DatasetImportError

Method generated by attrs for class MultipleFormatsMatchError.

formats#
exception datumaro.components.dataset.NoMatchingFormatsError[source]#

Bases: DatasetImportError

class datumaro.components.dataset.NullProgressReporter[source]#

Bases: ProgressReporter

property period: float#

Returns reporting period.

For example, 0.1 would mean every 10%.

property interval: float#

Returns the reporting time interval in seconds.

start(total: int, *, desc: str | None = None)[source]#

Initializes the progress bar

report_status(progress: int)[source]#

Updates the progress bar

iter(iterable: Iterable[T], *, total: int | None = None, desc: str | None = None) Iterable[T][source]#

Traverses the iterable and reports progress simultaneously.

Starts and finishes the progress bar automatically.

Parameters:
  • iterable – An iterable to be traversed

  • total – The expected number of iterations. If not provided, will try to use iterable.__len__.

  • desc – The status message

Returns:

An iterable over elements of the input sequence

split(count: int) Tuple[ProgressReporter][source]#

Splits the progress bar into several independent parts. If the count is 0, it must return an empty tuple.

This class is supposed to manage the state of child progress bars and release their resources, if necessary.

class datumaro.components.dataset.ProgressReporter[source]#

Bases: object

Only one set of methods must be called:
  • start - report_status - finish

  • iter

  • split

This class is supposed to manage the state of child progress bars and release their resources, if necessary.
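
A minimal sketch of a custom reporter implementing the start - report_status - finish protocol (the period/interval values are arbitrary):

class PrintProgressReporter(ProgressReporter):
    @property
    def period(self) -> float:
        return 0.1   # report every 10%

    @property
    def interval(self) -> float:
        return 1.0   # at most once per second

    def start(self, total, *, desc=None):
        self._total = total
        print(f"started: {desc or ''} ({total} steps)")

    def report_status(self, progress):
        print(f"{progress}/{self._total}")

    def finish(self):
        print("done")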

property period: float#

Returns reporting period.

For example, 0.1 would mean every 10%.

property interval: float#

Returns the reporting time interval in seconds.

start(total: int, *, desc: str | None = None)[source]#

Initializes the progress bar

report_status(progress: int)[source]#

Updates the progress bar

finish()[source]#

Finishes the progress bar

iter(iterable: Iterable[T], *, total: int | None = None, desc: str | None = None) Iterable[T][source]#

Traverses the iterable and reports progress simultaneously.

Starts and finishes the progress bar automatically.

Parameters:
  • iterable – An iterable to be traversed

  • total – The expected number of iterations. If not provided, will try to use iterable.__len__.

  • desc – The status message

Returns:

An iterable over elements of the input sequence

split(count: int) Tuple[ProgressReporter, ...][source]#

Splits the progress bar into several independent parts. If the count is 0, it must return an empty tuple.

This class is supposed to manage the state of child progress bars and release their resources, if necessary.

class datumaro.components.dataset.Source(config=None)[source]#

Bases: Config

property is_generated: bool#
class datumaro.components.dataset.StreamDataset(source: IDataset | None = None, *, infos: Dict[str, Any] | None = None, categories: Dict[AnnotationType, Categories] | None = None, media_type: Type[MediaElement] | None = None, ann_types: Set[AnnotationType] | None = None, env: Environment | None = None)[source]#

Bases: Dataset

property is_eager: bool#
classmethod from_extractors(*sources: IDataset, env: Environment | None = None, merge_policy: str = 'exact') Dataset[source]#

Creates a new dataset from one or several `Extractor`s.

In case of a single input, creates a lazy wrapper around the input. In case of several inputs, unifies them and caches the resulting dataset. A regular dataset merge cannot be applied, since the item list cannot be accessed.

Parameters:
  • sources – one or many input extractors

  • env – A context for plugins, which will be used for this dataset. If not specified, the builtin plugins will be used.

  • merge_policy – Policy on how to merge multiple datasets. Possible options are “exact”, “intersect”, and “union”.

Returns:

A new dataset with contents produced by input extractors

Return type:

Dataset

class datumaro.components.dataset.StreamDatasetStorage(source: IDataset, infos: Dict[str, Any] | None = None, categories: Dict[AnnotationType, Categories] | None = None, media_type: Type[MediaElement] | None = None, ann_types: Set[AnnotationType] | None = None)[source]#

Bases: DatasetStorage

is_cache_initialized() bool[source]#
init_cache() None[source]#
property stacked_transform: IDataset#
put(item: DatasetItem) None[source]#
get(id: str, subset: str | None = None) DatasetItem | None[source]#

Provides random access to dataset items.

remove(id: str, subset: str | None = None) None[source]#
get_subset(name: str) IDataset[source]#
property subset_names#
subsets() Dict[str, IDataset][source]#

Enumerates subsets in the dataset. Each subset can be a dataset itself.

transform(method: Type[Transform], *args, **kwargs) None[source]#
get_annotated_items() int[source]#
get_annotations() int[source]#
get_datasetitem_by_path(path: str) DatasetItem | None[source]#
get_patch()[source]#
flush_changes()[source]#
update(source: DatasetPatch | IDataset | Iterable[DatasetItem])[source]#
infos() Dict[str, Any][source]#

Returns the dataset meta-information.

categories() Dict[AnnotationType, Categories][source]#

Returns meta-information about the dataset labels.

property is_stream: bool#

A Boolean indicating whether the dataset is a stream.

If the dataset is a stream, dataset items are generated on demand from its iterator.

exception datumaro.components.dataset.StreamedItemError[source]#

Bases: DatasetError

Method generated by attrs for class StreamedItemError.

class datumaro.components.dataset.TabularCategories(items: List[Category] = _Nothing.NOTHING, *, attributes: Set[str] = _Nothing.NOTHING)[source]#

Bases: Categories

Describes tabular data meta-information, such as column names and types.

Method generated by attrs for class TabularCategories.

class Category(name, dtype: Type[TableDtype], labels: Set[str | int] = _Nothing.NOTHING)[source]#

Bases: object

Method generated by attrs for class TabularCategories.Category.

name: str#
dtype: Type[TableDtype]#
labels: Set[str | int]#
items: List[Category]#
classmethod from_iterable(iterable: Iterable[Tuple[str, Type[TableDtype]] | Tuple[str, Type[TableDtype], Set[str]]]) TabularCategories[source]#

Creates a TabularCategories object from an iterable.

Parameters:

iterable – a list of (Category name, type) or (Category name, type, set of labels)

Returns: a TabularCategories object
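
A brief sketch with both accepted tuple shapes (the column names and types are illustrative):

cats = TabularCategories.from_iterable([
    ("age", int),
    ("label", str, {"cat", "dog"}),
])
idx, category = cats.find("label")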

add(name: str, dtype: Type[TableDtype], labels: Set[str] | None = None) int[source]#

Add a Tabular Category.

Parameters:
  • name (str) – Column name

  • dtype (type) – Type of the corresponding column. (str, int, or float)

  • labels (optional, set(str)) – Label values that the column can have.

Returns:

The index of the added category.

Return type:

int

find(name: str) Tuple[int | None, Category | None][source]#

Find Category information for the given column name.

Parameters:

name (str) – Column name

Returns:

The index and the Category information.

Return type:

tuple(int, Category)

class datumaro.components.dataset.Transform(extractor: IDataset)[source]#

Bases: DatasetBase, CliPlugin

A base class for dataset transformations that change dataset items or their annotations.

static wrap_item(item, **kwargs)[source]#
categories()[source]#

Returns meta-information about the dataset labels.

subsets()[source]#

Enumerates subsets in the dataset. Each subset can be a dataset itself.

media_type()[source]#

Returns the media type of the dataset items.

All items are expected to have the same media type, which is constant and known immediately after object construction (i.e. it does not require dataset iteration).

infos()[source]#

Returns the dataset meta-information.

exception datumaro.components.dataset.UnknownFormatError(format)[source]#

Bases: DatumaroError

Method generated by attrs for class UnknownFormatError.

format#
class datumaro.components.dataset.UserFunctionAnnotationsFilter(extractor: IDataset, filter_func: Callable[[DatasetItem, Annotation], bool], remove_empty: bool = False)[source]#

Bases: ItemTransform

Filter annotations using a user-provided Python function.

Parameters:
  • extractor – Datumaro Dataset to filter.

  • filter_func – A Python callable that takes DatasetItem and Annotation as its inputs and returns a boolean. If the return value is True, the Annotation will be retained. Otherwise, it is removed.

  • remove_empty – If True, a DatasetItem without any remaining annotations is removed after filtering its annotations. Otherwise, such items are kept.

Example

This is an example of removing bounding boxes larger than 50% of the image area:

from datumaro.components.dataset_base import DatasetItem
from datumaro.components.media import Image
from datumaro.components.annotation import Annotation, Bbox

def filter_func(item: DatasetItem, ann: Annotation) -> bool:
    # Keep annotations that are not Bboxes
    if not isinstance(ann, Bbox):
        return True

    h, w = item.media_as(Image).size
    image_size = h * w
    bbox_size = ann.h * ann.w

    # Accept Bboxes smaller than 50% of the image size
    return bbox_size < 0.5 * image_size

filtered = UserFunctionAnnotationsFilter(
    extractor=dataset, filter_func=filter_func)
# No bounding boxes with a size greater than 50% of their image
filtered_items = [item for item in filtered]

transform_item(item: DatasetItem) DatasetItem | None[source]#

Returns a modified copy of the input item.

Avoid changing and returning the input item, because it can lead to unexpected problems. Use wrap_item() or item.wrap() to simplify copying.

class datumaro.components.dataset.UserFunctionDatasetFilter(extractor: IDataset, filter_func: Callable[[DatasetItem], bool])[source]#

Bases: ItemTransform

Filter dataset items using a user-provided Python function.

Parameters:
  • extractor – Datumaro Dataset to filter.

  • filter_func – A Python callable that takes a DatasetItem as its input and returns a boolean. If the return value is True, that DatasetItem will be retained. Otherwise, it is removed.

Example

This is an example of filtering out dataset items whose images are larger than 1024 pixels:

from datumaro.components.dataset_base import DatasetItem
from datumaro.components.media import Image

def filter_func(item: DatasetItem) -> bool:
    h, w = item.media_as(Image).size
    # Keep only items whose image fits within 1024 pixels in both dimensions
    return h <= 1024 and w <= 1024

filtered = UserFunctionDatasetFilter(
    extractor=dataset, filter_func=filter_func)
# No items with an image height or width greater than 1024
filtered_items = [item for item in filtered]

transform_item(item: DatasetItem) DatasetItem | None[source]#

Returns a modified copy of the input item.

Avoid changing and returning the input item, because it can lead to unexpected problems. Use wrap_item() or item.wrap() to simplify copying.

class datumaro.components.dataset.XPathAnnotationsFilter(extractor: IDataset, xpath: str, remove_empty: bool = False)[source]#

Bases: ItemTransform

transform_item(item: DatasetItem) DatasetItem | None[source]#

Returns a modified copy of the input item.

Avoid changing and returning the input item, because it can lead to unexpected problems. Use wrap_item() or item.wrap() to simplify copying.

class datumaro.components.dataset.XPathDatasetFilter(extractor: IDataset, xpath: str)[source]#

Bases: ItemTransform

transform_item(item: DatasetItem) DatasetItem | None[source]#

Returns a modified copy of the input item.

Avoid changing and returning the input item, because it can lead to unexpected problems. Use wrap_item() or item.wrap() to simplify copying.
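
A sketch of both XPath filters, assuming dataset exists (the expressions are illustrative):

# keep a single item by its id
by_id = XPathDatasetFilter(extractor=dataset, xpath='/item[id="img1"]')

# keep only "cat" annotations, dropping items left empty
cats_only = XPathAnnotationsFilter(
    extractor=dataset, xpath='/item/annotation[label="cat"]',
    remove_empty=True)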

datumaro.components.dataset.contextmanager(func)[source]#

@contextmanager decorator.

Typical usage:

@contextmanager
def some_generator(<arguments>):
    <setup>
    try:
        yield <value>
    finally:
        <cleanup>

This makes this:

with some_generator(<arguments>) as <variable>:
    <body>

equivalent to this:

<setup>
try:
    <variable> = <value>
    <body>
finally:
    <cleanup>

datumaro.components.dataset.copy(x)[source]#

Shallow copy operation on arbitrary Python objects.

See the module’s __doc__ string for more info.

datumaro.components.dataset.deepcopy(x, memo=None, _nil=[])[source]#

Deep copy operation on arbitrary Python objects.

See the module’s __doc__ string for more info.

datumaro.components.dataset.logging_disabled(max_level=50)[source]#
datumaro.components.dataset.on_error_do(callback, *args, ignore_errors=False, kwargs=None)[source]#
datumaro.components.dataset.overload(func)[source]#

Decorator for overloaded functions/methods.

In a stub file, place two or more stub definitions for the same function in a row, each decorated with @overload.

For example:

@overload
def utf8(value: None) -> None: ...
@overload
def utf8(value: bytes) -> bytes: ...
@overload
def utf8(value: str) -> bytes: ...

In a non-stub file (i.e. a regular .py file), do the same but follow it with an implementation. The implementation should not be decorated with @overload:

@overload
def utf8(value: None) -> None: ...
@overload
def utf8(value: bytes) -> bytes: ...
@overload
def utf8(value: str) -> bytes: ...
def utf8(value):
    ...  # implementation goes here

The overloads for a function can be retrieved at runtime using the get_overloads() function.

datumaro.components.dataset.scoped(func, arg_name=None)[source]#

A function decorator that allows performing actions within the current scope, such as registering error and exit callbacks and context managers.