datumaro.components.algorithms.hash_key_inference.prune

Functions

match_num_item_for_cluster(ratio, ...)

Classes

Centroid(*args, **kwargs)

Select items through clustering with centers targeting the desired number.

ClusteredRandom(*args, **kwargs)

Select items through clustering and choose randomly within each cluster.

Entropy(*args, **kwargs)

Select items through clustering and choose them based on label entropy in each cluster.

NDRSelect(*args, **kwargs)

Select items based on NDR among each subset.

Prune(dataset[, cluster_method, hash_type])

Prune makes a representative and manageable subset.

PruneBase(*args, **kwargs)

QueryClust(*args, **kwargs)

Select items through clustering with inits that imply each label.

RandomSelect(*args, **kwargs)

Select items randomly from the dataset.

datumaro.components.algorithms.hash_key_inference.prune.match_num_item_for_cluster(ratio, dataset_len, cluster_num_item_list)
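
The function above is listed without a description. Judging only by its name and parameters, it appears to decide how many items to keep from each cluster so that the total matches the requested ratio. The sketch below shows one way such an allocation could work, using a largest-remainder split. The name allocate_per_cluster and the rounding rule are assumptions made for illustration and are not taken from the Datumaro source.

```python
import math


def allocate_per_cluster(ratio, dataset_len, cluster_sizes):
    """Hypothetical helper: split a target item count across clusters.

    Assumes cluster_sizes is non-empty and sums to a positive number.
    """
    # Total number of items to keep, rounded up.
    target = math.ceil(dataset_len * ratio)
    total = sum(cluster_sizes)

    # Exact proportional share for each cluster, then its integer part.
    exact = [target * size / total for size in cluster_sizes]
    quotas = [int(share) for share in exact]

    # Hand out the leftover items to the largest fractional remainders.
    leftover = target - sum(quotas)
    order = sorted(
        range(len(exact)),
        key=lambda i: exact[i] - quotas[i],
        reverse=True,
    )
    for i in order[:leftover]:
        quotas[i] += 1
    return quotas


quotas = allocate_per_cluster(0.5, 10, [6, 3, 1])
```

With these inputs the target is 5 items, and the per-cluster quotas always sum to that target.
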

class datumaro.components.algorithms.hash_key_inference.prune.PruneBase(*args, **kwargs)

Bases: ABC

abstract base(ratio: float, num_centers: int | None, labels: List[int] | None, database_keys: ndarray | None, item_list: List[DatasetItem], source: Dataset | None) -> Tuple[List[DatasetItem], Dict | None]

Run the pruning method implemented by this class.

Parameters:
  • ratio – Fraction of the dataset to keep after pruning.

  • num_centers – Number of cluster centers to use.

  • labels – One annotation label for each dataset item.

  • database_keys – Batch of hash keys as a NumPy array.

  • item_list – List of the dataset items in the dataset.

  • source – The whole dataset.

Returns:

A tuple of the selected items and the distances between each item and the clusters.
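
To make the contract of base concrete, here is a minimal sketch that mirrors the signature above with stand-in classes. PruneStrategy and KeepFirst are hypothetical names that do not exist in Datumaro, and the example deliberately does not import the library. It only illustrates the shape of the arguments and of the returned tuple.

```python
import math
from abc import ABC, abstractmethod


class PruneStrategy(ABC):
    """Hypothetical stand-in with the same method signature as PruneBase."""

    @abstractmethod
    def base(self, ratio, num_centers, labels, database_keys, item_list, source):
        """Return (selected_items, distances_or_None)."""


class KeepFirst(PruneStrategy):
    """Toy strategy: keep the first ceil(len * ratio) items."""

    def base(self, ratio, num_centers, labels, database_keys, item_list, source):
        keep = math.ceil(len(item_list) * ratio)
        # No clustering is done here, so there are no distances to report.
        return item_list[:keep], None


selected, distances = KeepFirst().base(0.5, None, None, None, list("abcde"), None)
```

A strategy that does not cluster can return None as the second element, which matches the Dict | None return annotation.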

class datumaro.components.algorithms.hash_key_inference.prune.RandomSelect(*args, **kwargs)

Bases: PruneBase

Select items randomly from the dataset.

base(ratio, num_centers, labels, database_keys, item_list, source)

Run the pruning method implemented by this class.

Parameters:
  • ratio – Fraction of the dataset to keep after pruning.

  • num_centers – Number of cluster centers to use.

  • labels – One annotation label for each dataset item.

  • database_keys – Batch of hash keys as a NumPy array.

  • item_list – List of the dataset items in the dataset.

  • source – The whole dataset.

Returns:

A tuple of the selected items and the distances between each item and the clusters.
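
Random selection by ratio can be sketched in a few lines of standard-library Python. The function name random_select, the seed parameter, and the choice to round up are assumptions for illustration. The actual class may seed or round differently.

```python
import math
import random


def random_select(item_list, ratio, seed=0):
    """Hypothetical helper: keep a random fraction of item_list."""
    # A private generator keeps the global random state untouched.
    rng = random.Random(seed)
    keep = math.ceil(len(item_list) * ratio)
    # Sampling without replacement, so no item is picked twice.
    return rng.sample(item_list, keep), None


items = list(range(10))
selected, distances = random_select(items, 0.5)
```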

class datumaro.components.algorithms.hash_key_inference.prune.Centroid(*args, **kwargs)

Bases: PruneBase

Select items through clustering with centers targeting the desired number.

base(ratio, num_centers, labels, database_keys, item_list, source)

Run the pruning method implemented by this class.

Parameters:
  • ratio – Fraction of the dataset to keep after pruning.

  • num_centers – Number of cluster centers to use.

  • labels – One annotation label for each dataset item.

  • database_keys – Batch of hash keys as a NumPy array.

  • item_list – List of the dataset items in the dataset.

  • source – The whole dataset.

Returns:

A tuple of the selected items and the distances between each item and the clusters.
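
One reading of the description above is that the number of cluster centers equals the number of items to keep, and that one item is taken per center. The sketch below covers only that last step: given centers that have already been computed, it picks the nearest unused item for each one. It assumes hash keys are binary codes compared by Hamming distance and represents them as plain integers. Both the helper names and that representation are assumptions, not the library's implementation.

```python
def hamming(a, b):
    """Number of differing bits between two integer-encoded binary codes."""
    return bin(a ^ b).count("1")


def nearest_to_centers(keys, centers):
    """Hypothetical helper: one distinct item index per center.

    Assumes len(centers) <= len(keys).
    """
    chosen = []
    for center in centers:
        # Skip items already taken by an earlier center.
        candidates = [i for i in range(len(keys)) if i not in chosen]
        best = min(candidates, key=lambda i: hamming(keys[i], center))
        chosen.append(best)
    return chosen


keys = [0b0000, 0b0001, 0b1111, 0b1110]
picked = nearest_to_centers(keys, centers=[0b0000, 0b1111])
```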

class datumaro.components.algorithms.hash_key_inference.prune.ClusteredRandom(*args, **kwargs)

Bases: PruneBase

Select items through clustering and choose randomly within each cluster.

base(ratio, num_centers, labels, database_keys, item_list, source)

Run the pruning method implemented by this class.

Parameters:
  • ratio – Fraction of the dataset to keep after pruning.

  • num_centers – Number of cluster centers to use.

  • labels – One annotation label for each dataset item.

  • database_keys – Batch of hash keys as a NumPy array.

  • item_list – List of the dataset items in the dataset.

  • source – The whole dataset.

Returns:

A tuple of the selected items and the distances between each item and the clusters.
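
The "choose randomly within each cluster" step can be illustrated as follows, assuming cluster assignments and per-cluster quotas are already known. The function sample_within_clusters and its arguments are hypothetical, and how Datumaro computes the clusters and quotas is not shown here.

```python
import random
from collections import defaultdict


def sample_within_clusters(item_list, cluster_ids, quotas, seed=0):
    """Hypothetical helper: draw quotas[c] random items from each cluster c."""
    rng = random.Random(seed)

    # Group the items by the cluster they were assigned to.
    members = defaultdict(list)
    for item, cluster_id in zip(item_list, cluster_ids):
        members[cluster_id].append(item)

    selected = []
    for cluster_id, quota in quotas.items():
        pool = members.get(cluster_id, [])
        # Never ask for more items than the cluster contains.
        selected.extend(rng.sample(pool, min(quota, len(pool))))
    return selected


items = ["a", "b", "c", "d", "e", "f"]
cluster_ids = [0, 0, 0, 1, 1, 2]
selected = sample_within_clusters(items, cluster_ids, {0: 2, 1: 1, 2: 1})
```

Sampling per cluster keeps small clusters represented, which plain random selection over the whole dataset does not guarantee.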

class datumaro.components.algorithms.hash_key_inference.prune.QueryClust(*args, **kwargs)

Bases: PruneBase

Select items through clustering with inits that imply each label.

base(ratio, num_centers, labels, database_keys, item_list, source)

Run the pruning method implemented by this class.

Parameters:
  • ratio – Fraction of the dataset to keep after pruning.

  • num_centers – Number of cluster centers to use.

  • labels – One annotation label for each dataset item.

  • database_keys – Batch of hash keys as a NumPy array.

  • item_list – List of the dataset items in the dataset.

  • source – The whole dataset.

Returns:

A tuple of the selected items and the distances between each item and the clusters.

class datumaro.components.algorithms.hash_key_inference.prune.Entropy(*args, **kwargs)

Bases: PruneBase

Select items through clustering and choose them based on label entropy in each cluster.

base(ratio, num_centers, labels, database_keys, item_list, source)

Run the pruning method implemented by this class.

Parameters:
  • ratio – Fraction of the dataset to keep after pruning.

  • num_centers – Number of cluster centers to use.

  • labels – One annotation label for each dataset item.

  • database_keys – Batch of hash keys as a NumPy array.

  • item_list – List of the dataset items in the dataset.

  • source – The whole dataset.

Returns:

A tuple of the selected items and the distances between each item and the clusters.
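
Label entropy within a cluster can be computed as the Shannon entropy of the label frequencies. The sketch below shows only that computation, with hypothetical helper names. How the resulting scores drive item selection is not described on this page, so that part is left out.

```python
import math
from collections import Counter, defaultdict


def label_entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    probabilities = [count / total for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in probabilities)


def entropy_per_cluster(labels, cluster_ids):
    """Hypothetical helper: map each cluster id to its label entropy."""
    grouped = defaultdict(list)
    for label, cluster_id in zip(labels, cluster_ids):
        grouped[cluster_id].append(label)
    return {cid: label_entropy(group) for cid, group in grouped.items()}


scores = entropy_per_cluster(
    labels=[0, 0, 1, 1, 2, 2],
    cluster_ids=[0, 0, 0, 0, 1, 1],
)
```

A cluster whose items all share one label scores 0 bits, while a cluster split evenly between two labels scores 1 bit.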

class datumaro.components.algorithms.hash_key_inference.prune.NDRSelect(*args, **kwargs)

Bases: PruneBase

Select items based on NDR among each subset.

base(ratio, num_centers, labels, database_keys, item_list, source)

Run the pruning method implemented by this class.

Parameters:
  • ratio – Fraction of the dataset to keep after pruning.

  • num_centers – Number of cluster centers to use.

  • labels – One annotation label for each dataset item.

  • database_keys – Batch of hash keys as a NumPy array.

  • item_list – List of the dataset items in the dataset.

  • source – The whole dataset.

Returns:

A tuple of the selected items and the distances between each item and the clusters.

class datumaro.components.algorithms.hash_key_inference.prune.Prune(dataset: Dataset, cluster_method: str = 'random', hash_type: str = 'img')

Bases: HashInference

Prune makes a representative and manageable subset.

get_pruned(ratio: float = 0.5) -> Dataset
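
A typical call sequence, based only on the signatures above, might look like the following sketch. The wrapper prune_half and its path and fmt arguments are hypothetical, and the Datumaro imports are placed inside the function so that defining it does not require the library. Only the default cluster_method value 'random' is used, because it is the one value this page confirms. Computing hash keys may need additional model files at run time, which is an assumption rather than something stated here.

```python
def prune_half(path, fmt):
    """Hypothetical wrapper: load a dataset and keep about half of it."""
    # Imported lazily so this sketch can be defined without Datumaro installed.
    from datumaro.components.dataset import Dataset
    from datumaro.components.algorithms.hash_key_inference.prune import Prune

    # path and fmt are placeholders for a real dataset location and format name.
    dataset = Dataset.import_from(path, fmt)

    # Defaults shown explicitly; they match the class signature above.
    pruner = Prune(dataset, cluster_method="random", hash_type="img")

    # ratio=0.5 is the documented default of get_pruned.
    return pruner.get_pruned(ratio=0.5)
```

The returned object is a Dataset, so it can be inspected or exported like any other dataset.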