Core Features

Label Management

Manage, modify, and analyze labels in your TXT-format datasets with precision and ease.

Overview

cvPal's label management features provide comprehensive tools for modifying, analyzing, and maintaining labels in your datasets. Whether you need to remove unwanted labels, rename classes, or find specific images, these functions help you maintain clean and consistent datasets.

Label Management Features

Remove Labels

Delete unwanted labels and reindex

Replace Labels

Update label names in YAML

Count Labels

Analyze label distribution

Find Images

Locate images with specific labels

Remove Labels

Remove unwanted labels from your dataset and automatically reindex the remaining labels:

Remove Labels by Name

```python
from cvpal.preprocessing import remove_label_from_txt_dataset

# Remove a label by name
remove_label_from_txt_dataset(
    dataset_path="/path/to/your/dataset",
    label_to_remove="unwanted_label"
)

# Example: remove the "noise" label
remove_label_from_txt_dataset(
    dataset_path="/path/to/dataset",
    label_to_remove="noise"
)
```

Remove Labels by Index

```python
# Remove a label by index
remove_label_from_txt_dataset(
    dataset_path="/path/to/your/dataset",
    label_to_remove=2  # Remove the label at index 2
)

# Example: remove the third label (index 2)
remove_label_from_txt_dataset(
    dataset_path="/path/to/dataset",
    label_to_remove=2
)
```

What Happens When You Remove a Label

✅ Automatic Updates

  • Label removed from data.yaml
  • Class count (nc) decremented
  • All TXT files updated
  • Remaining labels reindexed
  • Annotations with the removed label deleted

⚠️ Important Notes

  • The operation is irreversible
  • Back up your dataset first
  • Images containing only the removed label become empty
  • Check for empty images after removal
  • Verify label indices are correct

Replace Labels

Update label names in your dataset's YAML configuration without affecting the actual annotations:

Replace Multiple Labels

```python
from cvpal.preprocessing import replace_labels_in_txt_yaml

# Replace multiple labels at once
replace_labels_in_txt_yaml(
    dataset_path="/path/to/your/dataset",
    labels_dict={
        "old_label1": "new_label1",
        "old_label2": "new_label2",
        "cat": "feline",
        "dog": "canine"
    }
)

# Example: standardize animal labels
replace_labels_in_txt_yaml(
    dataset_path="/path/to/dataset",
    labels_dict={
        "cat": "feline",
        "dog": "canine",
        "bird": "avian"
    }
)
```

Single Label Replacement

```python
# Replace a single label
replace_labels_in_txt_yaml(
    dataset_path="/path/to/your/dataset",
    labels_dict={
        "person": "human"  # Rename "person" to "human"
    }
)

# Example: fix a typo in a label name
replace_labels_in_txt_yaml(
    dataset_path="/path/to/dataset",
    labels_dict={
        "persn": "person"  # Fix typo
    }
)
```

Label Replacement Use Cases

πŸ“ Common Scenarios

  • β€’ Fix typos in label names
  • β€’ Standardize naming conventions
  • β€’ Merge similar labels
  • β€’ Update outdated terminology
  • β€’ Prepare for model training

⚠️ Important Notes

  • Only the YAML file is updated
  • TXT files remain unchanged
  • Label indices stay the same
  • Back up before bulk changes
  • Verify changes manually

Count Labels

Analyze label distribution across train, test, and validation splits:

Basic Label Counting

```python
from cvpal.preprocessing import count_labels_in_txt_dataset

# Get label counts for each split
label_counts = count_labels_in_txt_dataset("/path/to/your/dataset")

# Display results
print("Label distribution across splits:")
for split, counts in label_counts.items():
    print(f"\n{split.upper()}:")
    for label, count in counts.items():
        print(f"  {label}: {count}")
```

Advanced Analysis

```python
import pandas as pd

# Convert to a DataFrame for analysis (columns: splits, rows: labels)
df_counts = pd.DataFrame(label_counts).fillna(0)
print("\nLabel counts DataFrame:")
print(df_counts)

# Calculate totals per label across all splits
total_counts = df_counts.sum(axis=1)
print("\nTotal counts per label:")
print(total_counts.sort_values(ascending=False))

# Calculate each label's share of all annotations
percentages = (df_counts / df_counts.sum().sum() * 100).round(2)
print("\nLabel percentages:")
print(percentages)

# Find the most and least common labels
most_common = total_counts.nlargest(5)
least_common = total_counts.nsmallest(5)
print(f"\nMost common labels: {most_common.to_dict()}")
print(f"Least common labels: {least_common.to_dict()}")

# Check for labels missing from individual splits
missing_in_splits = {}
for split in ['train', 'test', 'valid']:
    if split in df_counts.columns:
        missing = df_counts[df_counts[split] == 0].index.tolist()
        if missing:
            missing_in_splits[split] = missing

if missing_in_splits:
    print("\nLabels missing in certain splits:")
    for split, labels in missing_in_splits.items():
        print(f"{split}: {labels}")
else:
    print("\n✅ All labels present in all splits")
```

Class Imbalance Analysis

```python
# Analyze class imbalance
max_count = total_counts.max()
min_count = total_counts.min()
imbalance_ratio = max_count / min_count
print(f"Class imbalance ratio: {imbalance_ratio:.2f}")

if imbalance_ratio > 10:
    print("⚠️ High class imbalance detected!")
    print("Consider data augmentation or resampling strategies")
elif imbalance_ratio > 5:
    print("⚠️ Moderate class imbalance detected")
    print("Monitor training performance carefully")
else:
    print("✅ Class distribution is relatively balanced")

# Identify labels with very few instances
problematic_labels = total_counts[total_counts < 10]
if len(problematic_labels) > 0:
    print(f"\nLabels with very few instances (<10): {problematic_labels.to_dict()}")
    print("Consider removing these labels or collecting more data")

# Calculate label coverage per split
for split in ['train', 'test', 'valid']:
    if split in df_counts.columns:
        coverage = (df_counts[split] > 0).sum() / len(df_counts) * 100
        print(f"{split} label coverage: {coverage:.1f}%")
```

Find Images with Specific Labels

Locate images containing specific labels for analysis, filtering, or quality control:

Find Images by Label Name

```python
from cvpal.preprocessing import find_images_with_label_in_txt_type

# Find all images containing the "cat" label
cat_images = find_images_with_label_in_txt_type(
    dataset_path="/path/to/your/dataset",
    label="cat",
    exclusive=False  # Include images that also have other labels
)

print(f"Found {len(cat_images)} images with 'cat' label")
for img_path, label_path in cat_images[:5]:  # Show the first 5
    print(f"Image: {img_path}")
    print(f"Label: {label_path}")

# Find images containing the "person" label
person_images = find_images_with_label_in_txt_type(
    dataset_path="/path/to/dataset",
    label="person",
    exclusive=False
)
```

Find Images by Label Index

```python
# Find images by label index
images_with_label_0 = find_images_with_label_in_txt_type(
    dataset_path="/path/to/your/dataset",
    label=0,  # First label (index 0)
    exclusive=False
)
print(f"Found {len(images_with_label_0)} images with label index 0")

# Find images with a specific label index
images_with_label_2 = find_images_with_label_in_txt_type(
    dataset_path="/path/to/dataset",
    label=2,  # Third label (index 2)
    exclusive=False
)
```

Exclusive vs Inclusive Search

```python
# Inclusive search (default): find images with the label, regardless of other labels
inclusive_images = find_images_with_label_in_txt_type(
    dataset_path="/path/to/dataset",
    label="cat",
    exclusive=False  # Images may have other labels too
)

# Exclusive search: find images with ONLY this label
exclusive_images = find_images_with_label_in_txt_type(
    dataset_path="/path/to/dataset",
    label="cat",
    exclusive=True  # Images must contain ONLY this label
)

print(f"Inclusive search: {len(inclusive_images)} images")
print(f"Exclusive search: {len(exclusive_images)} images")

# Compare results
print(f"Images with cat + other labels: {len(inclusive_images) - len(exclusive_images)}")
```

Advanced Image Filtering

```python
# Find images containing several specific labels
def find_images_with_multiple_labels(dataset_path, labels, all_required=True):
    results = []
    for label in labels:
        label_images = find_images_with_label_in_txt_type(
            dataset_path=dataset_path,
            label=label,
            exclusive=False
        )
        results.append({img[0] for img in label_images})
    if all_required:
        # Images that contain ALL specified labels
        common_images = set.intersection(*results)
    else:
        # Images that contain ANY of the specified labels
        common_images = set.union(*results)
    return list(common_images)

# Find images containing both "cat" and "dog"
cat_dog_images = find_images_with_multiple_labels(
    dataset_path="/path/to/dataset",
    labels=["cat", "dog"],
    all_required=True
)
print(f"Images with both cat and dog: {len(cat_dog_images)}")

# Find images containing either "cat" or "dog"
cat_or_dog_images = find_images_with_multiple_labels(
    dataset_path="/path/to/dataset",
    labels=["cat", "dog"],
    all_required=False
)
print(f"Images with cat or dog: {len(cat_or_dog_images)}")

# Analyze the split distribution of the found images
def analyze_split_distribution(image_paths):
    split_counts = {}
    for img_path in image_paths:
        # Assumes a <dataset>/<split>/images/<file> directory layout
        split = img_path.split('/')[-3]
        split_counts[split] = split_counts.get(split, 0) + 1
    return split_counts

cat_split_dist = analyze_split_distribution([img[0] for img in cat_images])
print(f"\nCat images by split: {cat_split_dist}")
```

Best Practices

✅ Safe Operations

  • Always back up your dataset before modifications
  • Test operations on a small subset first
  • Verify label indices before removal
  • Check for empty images after label removal
  • Use descriptive label names
  • Document your changes

⚠️ Common Pitfalls

  • Removing labels without checking dependencies
  • Not verifying label indices after changes
  • Forgetting to update related documentation
  • Making bulk changes without testing
  • Ignoring class imbalance warnings
  • Not checking for missing labels in splits

Common Workflows

Dataset Cleanup Workflow

```python
from cvpal.preprocessing import count_labels_in_txt_dataset, remove_label_from_txt_dataset

# 1. Analyze the current dataset
label_counts = count_labels_in_txt_dataset("/path/to/dataset")

# 2. Identify problematic labels (deduplicated across splits)
problematic_labels = set()
for split, counts in label_counts.items():
    for label, count in counts.items():
        if count < 5:  # Labels with very few instances
            problematic_labels.add(label)
print(f"Problematic labels: {problematic_labels}")

# 3. Remove the problematic labels
for label in problematic_labels:
    remove_label_from_txt_dataset("/path/to/dataset", label)

# 4. Verify the cleanup
new_counts = count_labels_in_txt_dataset("/path/to/dataset")
print("After cleanup:")
print(new_counts)
```

Label Standardization Workflow

```python
from cvpal.preprocessing import (
    replace_labels_in_txt_yaml,
    count_labels_in_txt_dataset,
    find_images_with_label_in_txt_type,
    report,
)

# 1. Replace inconsistent label names
replace_labels_in_txt_yaml(
    dataset_path="/path/to/dataset",
    labels_dict={
        "cat": "feline",
        "dog": "canine",
        "person": "human",
        "car": "vehicle"
    }
)

# 2. Count labels to verify the changes
label_counts = count_labels_in_txt_dataset("/path/to/dataset")

# 3. Find images with specific labels for verification
feline_images = find_images_with_label_in_txt_type(
    dataset_path="/path/to/dataset",
    label="feline",
    exclusive=False
)
print(f"Found {len(feline_images)} images with 'feline' label")

# 4. Generate a final report
df_report = report("/path/to/dataset")
print(f"Final dataset: {len(df_report)} images")
```

Quality Control Workflow

```python
from cvpal.preprocessing import report, count_labels_in_txt_dataset, find_images_with_label_in_txt_type

# 1. Generate a comprehensive report
df_report = report("/path/to/dataset")

# 2. Check for empty images
empty_images = df_report[df_report['num_of_labels'] == 0]
if len(empty_images) > 0:
    print(f"Found {len(empty_images)} empty images")
    print("Consider removing or relabeling these images")

# 3. Analyze the label distribution
label_counts = count_labels_in_txt_dataset("/path/to/dataset")

# 4. Check for labels missing from individual splits
for split, counts in label_counts.items():
    all_labels = set()
    for other_split, other_counts in label_counts.items():
        if other_split != split:
            all_labels.update(other_counts.keys())
    missing = all_labels - set(counts.keys())
    if missing:
        print(f"Missing labels in {split}: {missing}")

# 5. Find images with rare labels for manual review
rare_labels = set()
for split, counts in label_counts.items():
    for label, count in counts.items():
        if count < 10:
            rare_labels.add(label)

for label in rare_labels:
    images = find_images_with_label_in_txt_type(
        dataset_path="/path/to/dataset",
        label=label,
        exclusive=False
    )
    print(f"Label '{label}' appears in {len(images)} images - consider review")
```