Label Management
Manage, modify, and analyze labels in your TXT-format datasets with precision and ease.
Overview
cvPal's label management features provide comprehensive tools for modifying, analyzing, and maintaining labels in your datasets. Whether you need to remove unwanted labels, rename classes, or find specific images, these functions help you maintain clean and consistent datasets.
Label Management Features
- Remove Labels: delete unwanted labels and reindex
- Replace Labels: update label names in YAML
- Count Labels: analyze label distribution
- Find Images: locate images with specific labels
Remove Labels
Remove unwanted labels from your dataset and automatically reindex the remaining labels:
Remove Labels by Name
from cvpal.preprocessing import remove_label_from_txt_dataset

# Remove a label by name
remove_label_from_txt_dataset(
    dataset_path="/path/to/your/dataset",
    label_to_remove="unwanted_label",
)

# Example: Remove "noise" label
remove_label_from_txt_dataset(
    dataset_path="/path/to/dataset",
    label_to_remove="noise",
)
Remove Labels by Index
# Remove a label by index
remove_label_from_txt_dataset(
    dataset_path="/path/to/your/dataset",
    label_to_remove=2,  # Remove label at index 2
)

# Example: Remove the third label (index 2)
remove_label_from_txt_dataset(
    dataset_path="/path/to/dataset",
    label_to_remove=2,
)
What Happens When You Remove a Label
✅ Automatic Updates
- Label removed from data.yaml
- Class count (nc) decremented
- All TXT files updated
- Remaining labels reindexed
- Annotations with removed label deleted
⚠️ Important Notes
- Operation is irreversible
- Backup your dataset first
- Images with only removed label become empty
- Check for empty images after removal
- Verify label indices are correct
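To make the reindexing step concrete, here is a minimal sketch of what removing a label implies for a single annotation file. It assumes YOLO-style TXT lines of the form `class_id x_center y_center width height`; the helper `remove_class_from_lines` is hypothetical and only illustrates the idea, it is not cvPal's implementation.

```python
def remove_class_from_lines(lines, removed_idx):
    """Drop annotations of class `removed_idx` and shift higher class ids down by one.

    Hypothetical helper for illustration; assumes YOLO-style lines
    ("class_id x_center y_center width height").
    """
    kept = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        class_id = int(parts[0])
        if class_id == removed_idx:
            continue  # annotation for the removed label is deleted
        if class_id > removed_idx:
            class_id -= 1  # close the gap left by the removed label
        kept.append(" ".join([str(class_id)] + parts[1:]))
    return kept


# Made-up class names and annotations, for illustration only
names = ["cat", "noise", "dog"]
annotations = [
    "0 0.50 0.50 0.20 0.20",
    "1 0.10 0.10 0.05 0.05",
    "2 0.70 0.30 0.40 0.40",
]

removed_idx = names.index("noise")
new_names = [n for i, n in enumerate(names) if i != removed_idx]
new_annotations = remove_class_from_lines(annotations, removed_idx)

print(new_names)
print(new_annotations)
```

In this example "dog" moves from index 2 to index 1, which is why any script that refers to labels by index should be re-checked after a removal.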
Replace Labels
Update label names in your dataset's YAML configuration without affecting the actual annotations:
Replace Multiple Labels
from cvpal.preprocessing import replace_labels_in_txt_yaml

# Replace multiple labels at once
replace_labels_in_txt_yaml(
    dataset_path="/path/to/your/dataset",
    labels_dict={
        "old_label1": "new_label1",
        "old_label2": "new_label2",
        "cat": "feline",
        "dog": "canine",
    },
)

# Example: Standardize animal labels
replace_labels_in_txt_yaml(
    dataset_path="/path/to/dataset",
    labels_dict={
        "cat": "feline",
        "dog": "canine",
        "bird": "avian",
    },
)
Single Label Replacement
# Replace a single label
replace_labels_in_txt_yaml(
    dataset_path="/path/to/your/dataset",
    labels_dict={
        "person": "human",  # Rename "person" to "human"
    },
)

# Example: Fix typo in label name
replace_labels_in_txt_yaml(
    dataset_path="/path/to/dataset",
    labels_dict={
        "persn": "person",  # Fix typo
    },
)
Label Replacement Use Cases
Common Scenarios
- Fix typos in label names
- Standardize naming conventions
- Merge similar labels
- Update outdated terminology
- Prepare for model training
⚠️ Important Notes
- Only updates YAML file
- TXT files remain unchanged
- Label indices stay the same
- Backup before bulk changes
- Verify changes manually
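Because only the names in the YAML file change, a rename can be thought of as a lookup over the class-name list. The sketch below illustrates that idea on an in-memory list; `rename_labels` is a hypothetical helper, not part of cvPal, and the list stands in for the `names` entry of data.yaml.

```python
def rename_labels(names, labels_dict):
    """Return a new name list with old names replaced; positions are unchanged.

    Hypothetical helper for illustration only.
    """
    return [labels_dict.get(name, name) for name in names]


names = ["cat", "dog", "persn"]  # made-up class names
renamed = rename_labels(names, {"cat": "feline", "persn": "person"})
print(renamed)

# The index of each class is preserved, so the TXT annotations stay valid
same_index = names.index("cat") == renamed.index("feline")
print(same_index)

# Mapping two names to the same target yields a duplicate name,
# not a merged class: the annotations still use two different indices.
merged = rename_labels(["kitten", "cat"], {"kitten": "cat"})
print(merged)
```

The last case is worth keeping in mind when using a rename to merge similar labels: with a names-only change, the two classes would still be distinct in the annotation files.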
Count Labels
Analyze label distribution across train, test, and validation splits:
Basic Label Counting
from cvpal.preprocessing import count_labels_in_txt_dataset

# Get label counts for each split
label_counts = count_labels_in_txt_dataset("/path/to/your/dataset")

# Display results
print("Label distribution across splits:")
for split, counts in label_counts.items():
    print(f"\n{split.upper()}:")
    for label, count in counts.items():
        print(f"  {label}: {count}")
Advanced Analysis
import pandas as pd

# Convert to DataFrame for analysis (rows: labels, columns: splits)
df_counts = pd.DataFrame(label_counts).fillna(0)
print("\nLabel counts DataFrame:")
print(df_counts)

# Calculate totals
total_counts = df_counts.sum(axis=1)
print("\nTotal counts per label:")
print(total_counts.sort_values(ascending=False))

# Calculate percentages
percentages = (df_counts / df_counts.sum().sum() * 100).round(2)
print("\nLabel percentages:")
print(percentages)

# Find most/least common labels
most_common = total_counts.nlargest(5)
least_common = total_counts.nsmallest(5)
print(f"\nMost common labels: {most_common.to_dict()}")
print(f"Least common labels: {least_common.to_dict()}")

# Check for missing labels in splits
missing_in_splits = {}
for split in ['train', 'test', 'valid']:
    if split in df_counts.columns:
        missing = df_counts[df_counts[split] == 0].index.tolist()
        if missing:
            missing_in_splits[split] = missing

if missing_in_splits:
    print("\nLabels missing in certain splits:")
    for split, labels in missing_in_splits.items():
        print(f"{split}: {labels}")
else:
    print("\n✅ All labels present in all splits")
Class Imbalance Analysis
# Analyze class imbalance (uses total_counts and df_counts from the previous example)
max_count = total_counts.max()
min_count = total_counts.min()
imbalance_ratio = max_count / min_count
print(f"Class imbalance ratio: {imbalance_ratio:.2f}")

if imbalance_ratio > 10:
    print("⚠️ High class imbalance detected!")
    print("Consider data augmentation or resampling strategies")
elif imbalance_ratio > 5:
    print("⚠️ Moderate class imbalance detected")
    print("Monitor training performance carefully")
else:
    print("✅ Class distribution is relatively balanced")

# Identify problematic labels
problematic_labels = total_counts[total_counts < 10]
if len(problematic_labels) > 0:
    print(f"\nLabels with very few instances (<10): {problematic_labels.to_dict()}")
    print("Consider removing these labels or collecting more data")

# Calculate label coverage per split
for split in ['train', 'test', 'valid']:
    if split in df_counts.columns:
        coverage = (df_counts[split] > 0).sum() / len(df_counts) * 100
        print(f"{split} label coverage: {coverage:.1f}%")
Find Images with Specific Labels
Locate images containing specific labels for analysis, filtering, or quality control:
Find Images by Label Name
from cvpal.preprocessing import find_images_with_label_in_txt_type

# Find all images containing "cat" label
cat_images = find_images_with_label_in_txt_type(
    dataset_path="/path/to/your/dataset",
    label="cat",
    exclusive=False,  # Include images with other labels too
)
print(f"Found {len(cat_images)} images with 'cat' label")
for img_path, label_path in cat_images[:5]:  # Show first 5
    print(f"Image: {img_path}")
    print(f"Label: {label_path}")

# Find images containing "person" label
person_images = find_images_with_label_in_txt_type(
    dataset_path="/path/to/dataset",
    label="person",
    exclusive=False,
)
Find Images by Label Index
# Find images by label index
images_with_label_0 = find_images_with_label_in_txt_type(
    dataset_path="/path/to/your/dataset",
    label=0,  # First label (index 0)
    exclusive=False,
)
print(f"Found {len(images_with_label_0)} images with label index 0")

# Find images with specific label index
images_with_label_2 = find_images_with_label_in_txt_type(
    dataset_path="/path/to/dataset",
    label=2,  # Third label (index 2)
    exclusive=False,
)
Exclusive vs Inclusive Search
# Inclusive search (default): Find images with the label, regardless of other labels
inclusive_images = find_images_with_label_in_txt_type(
    dataset_path="/path/to/dataset",
    label="cat",
    exclusive=False,  # Images may have other labels too
)

# Exclusive search: Find images with ONLY this label
exclusive_images = find_images_with_label_in_txt_type(
    dataset_path="/path/to/dataset",
    label="cat",
    exclusive=True,  # Images must contain ONLY this label
)

print(f"Inclusive search: {len(inclusive_images)} images")
print(f"Exclusive search: {len(exclusive_images)} images")

# Compare results
print(f"Images with cat + other labels: {len(inclusive_images) - len(exclusive_images)}")
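The difference between the two modes can be expressed as a small predicate over the class ids found in one annotation file. This is a sketch of the semantics described above, not cvPal's code; the helper `matches` is hypothetical, and it assumes that several annotations of the same class still count as "only this label".

```python
def matches(class_ids, target, exclusive=False):
    """Return True if an image with these class ids satisfies the search.

    Hypothetical helper: `class_ids` are the class ids in one label file.
    """
    ids = set(class_ids)
    if target not in ids:
        return False  # the label must be present in both modes
    if exclusive:
        return ids == {target}  # no other class may appear
    return True


only_cats = [0, 0]        # two cats (class 0), nothing else
cat_and_dog = [0, 1]      # one cat, one dog
no_cat = [1]

inclusive_hits = [matches(ids, 0) for ids in (only_cats, cat_and_dog, no_cat)]
exclusive_hits = [matches(ids, 0, exclusive=True) for ids in (only_cats, cat_and_dog, no_cat)]

print(inclusive_hits)
print(exclusive_hits)
```

Under these assumptions, every exclusive match is also an inclusive match, which is why subtracting the two result counts gives the number of images that combine the label with others.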
Advanced Image Filtering
from pathlib import Path

from cvpal.preprocessing import find_images_with_label_in_txt_type


# Find images with multiple specific labels
def find_images_with_multiple_labels(dataset_path, labels, all_required=True):
    results = []
    for label in labels:
        label_images = find_images_with_label_in_txt_type(
            dataset_path=dataset_path,
            label=label,
            exclusive=False,
        )
        results.append({img[0] for img in label_images})

    if all_required:
        # Images that contain ALL specified labels
        common_images = set.intersection(*results)
    else:
        # Images that contain ANY of the specified labels
        common_images = set.union(*results)
    return list(common_images)


# Find images containing both "cat" and "dog"
cat_dog_images = find_images_with_multiple_labels(
    dataset_path="/path/to/dataset",
    labels=["cat", "dog"],
    all_required=True,
)
print(f"Images with both cat and dog: {len(cat_dog_images)}")

# Find images containing either "cat" or "dog"
cat_or_dog_images = find_images_with_multiple_labels(
    dataset_path="/path/to/dataset",
    labels=["cat", "dog"],
    all_required=False,
)
print(f"Images with cat or dog: {len(cat_or_dog_images)}")


# Analyze split distribution of found images
def analyze_split_distribution(image_paths):
    split_counts = {}
    for img_path in image_paths:
        # Extract split from a path such as <dataset>/<split>/images/<file>;
        # Path.parts works with both "/" and "\" separators
        split = Path(img_path).parts[-3]
        split_counts[split] = split_counts.get(split, 0) + 1
    return split_counts


# cat_images comes from the "Find Images by Label Name" example
cat_split_dist = analyze_split_distribution([img[0] for img in cat_images])
print(f"\nCat images by split: {cat_split_dist}")
Best Practices
✅ Safe Operations
- Always backup your dataset before modifications
- Test operations on a small subset first
- Verify label indices before removal
- Check for empty images after label removal
- Use descriptive label names
- Document your changes
⚠️ Common Pitfalls
- Removing labels without checking dependencies
- Not verifying label indices after changes
- Forgetting to update related documentation
- Making bulk changes without testing
- Ignoring class imbalance warnings
- Not checking for missing labels in splits
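Two of the practices above, backing up before a change and checking for empty images afterwards, can be scripted with the standard library alone. The following is a minimal sketch; the helper names and the `train/labels` layout in the demo are assumptions for illustration and are not part of cvPal.

```python
import shutil
import tempfile
from pathlib import Path


def backup_dataset(dataset_path, backup_path):
    """Copy the whole dataset directory; raises if backup_path already exists."""
    return Path(shutil.copytree(dataset_path, backup_path))


def find_empty_label_files(dataset_path):
    """Return TXT files with no annotation lines.

    Note: this scans every *.txt file under the dataset, so unrelated
    text files (for example a README.txt) would need to be filtered out.
    """
    return sorted(
        p for p in Path(dataset_path).rglob("*.txt")
        if not p.read_text().strip()
    )


# Demo on a throwaway directory with a hypothetical train/labels layout
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "dataset"
    labels = dataset / "train" / "labels"
    labels.mkdir(parents=True)
    (labels / "a.txt").write_text("0 0.5 0.5 0.2 0.2\n")
    (labels / "b.txt").write_text("")  # image left with no annotations

    backup = backup_dataset(dataset, Path(tmp) / "dataset_backup")
    empty_names = [p.name for p in find_empty_label_files(dataset)]
    backup_ok = (backup / "train" / "labels" / "a.txt").exists()

print(empty_names)
print(backup_ok)
```

Since label removal is described as irreversible, keeping the backup until the modified dataset has been verified is the simplest way to recover from a mistake.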
Common Workflows
Dataset Cleanup Workflow
from cvpal.preprocessing import (
    count_labels_in_txt_dataset,
    remove_label_from_txt_dataset,
)

# 1. Analyze current dataset
label_counts = count_labels_in_txt_dataset("/path/to/dataset")

# 2. Identify problematic labels
#    (a set avoids removing the same label twice when it is rare in several splits)
problematic_labels = set()
for split, counts in label_counts.items():
    for label, count in counts.items():
        if count < 5:  # Labels with very few instances
            problematic_labels.add(label)
print(f"Problematic labels: {sorted(problematic_labels)}")

# 3. Remove problematic labels
for label in sorted(problematic_labels):
    remove_label_from_txt_dataset("/path/to/dataset", label)

# 4. Verify cleanup
new_counts = count_labels_in_txt_dataset("/path/to/dataset")
print("After cleanup:")
print(new_counts)
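The workflow above flags a label when its count is below the threshold in any single split, so a label that is common in train but rare in test would also be removed. If that is not the intent, one option is to total the counts across splits first. The sketch below assumes the `{split: {label: count}}` structure used in the counting examples; the sample numbers and the helper `total_label_counts` are made up for illustration.

```python
from collections import Counter


def total_label_counts(label_counts):
    """Sum per-split counts into one Counter keyed by label (hypothetical helper)."""
    totals = Counter()
    for counts in label_counts.values():
        totals.update(counts)  # adds counts for labels seen in earlier splits
    return totals


# Made-up counts in the {split: {label: count}} shape
label_counts = {
    "train": {"cat": 120, "dog": 3},
    "valid": {"cat": 30, "dog": 1},
    "test": {"cat": 4},
}

totals = total_label_counts(label_counts)
rare_labels = sorted(label for label, n in totals.items() if n < 5)

print(dict(totals))
print(rare_labels)
```

Here "cat" has only 4 instances in the test split but 154 overall, so it is kept, while "dog" totals 4 and is still flagged.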
Label Standardization Workflow
from cvpal.preprocessing import (
    count_labels_in_txt_dataset,
    find_images_with_label_in_txt_type,
    replace_labels_in_txt_yaml,
    report,
)

# 1. Replace inconsistent label names
replace_labels_in_txt_yaml(
    dataset_path="/path/to/dataset",
    labels_dict={
        "cat": "feline",
        "dog": "canine",
        "person": "human",
        "car": "vehicle",
    },
)

# 2. Count labels to verify changes
label_counts = count_labels_in_txt_dataset("/path/to/dataset")

# 3. Find images with specific labels for verification
feline_images = find_images_with_label_in_txt_type(
    dataset_path="/path/to/dataset",
    label="feline",
    exclusive=False,
)
print(f"Found {len(feline_images)} images with 'feline' label")

# 4. Generate final report
df_report = report("/path/to/dataset")
print(f"Final dataset: {len(df_report)} images")
Quality Control Workflow
from cvpal.preprocessing import (
    count_labels_in_txt_dataset,
    find_images_with_label_in_txt_type,
    report,
)

# 1. Generate comprehensive report
df_report = report("/path/to/dataset")

# 2. Check for empty images
empty_images = df_report[df_report['num_of_labels'] == 0]
if len(empty_images) > 0:
    print(f"Found {len(empty_images)} empty images")
    print("Consider removing or relabeling these images")

# 3. Analyze label distribution
label_counts = count_labels_in_txt_dataset("/path/to/dataset")

# 4. Check for missing labels in splits
for split, counts in label_counts.items():
    # Collect every label that appears in the other splits
    all_labels = set()
    for other_split, other_counts in label_counts.items():
        if other_split != split:
            all_labels.update(other_counts.keys())
    missing = all_labels - set(counts.keys())
    if missing:
        print(f"Missing labels in {split}: {missing}")

# 5. Find images with rare labels for manual review
rare_labels = []
for split, counts in label_counts.items():
    for label, count in counts.items():
        if count < 10:
            rare_labels.append(label)

for label in set(rare_labels):
    images = find_images_with_label_in_txt_type(
        dataset_path="/path/to/dataset",
        label=label,
        exclusive=False,
    )
    print(f"Label '{label}' appears in {len(images)} images - consider review")