Core Features

Dataset Merging

Combine multiple TXT-format datasets with intelligent label alignment and conflict resolution.

Overview

cvPal's dataset merging feature combines multiple YOLO-format datasets into a single unified dataset. It resolves label conflicts, reindexes class IDs, and preserves the train/test/valid split structure of the source datasets.

Key Features

Smart Merging

Intelligent label alignment and conflict resolution

Label Management

Automatic reindexing and label mapping

Structure Preservation

Maintains train/test/valid splits

Basic Usage

Merge multiple datasets using the merge_data_txt function:

Simple Merge Example

python
from cvpal.preprocessing import merge_data_txt

# Merge multiple datasets
unified_path, excluded_images = merge_data_txt(
    path="/path/to/primary/dataset",  # Primary dataset (reference)
    paths=["/path/to/dataset2", "/path/to/dataset3"],  # Additional datasets
    folder_name_provided="merged_dataset",  # Output folder name
    base_storage_path="/output/location",  # Where to save merged dataset
    parental_reference=False,  # Merge all labels
)

print(f"Merged dataset saved to: {unified_path}")
print(f"Excluded images: {len(excluded_images)}")

📝 Parameters

  • path - Primary dataset path (reference)
  • paths - List of additional dataset paths
  • folder_name_provided - Output folder name
  • base_storage_path - Output location (None for current dir)
  • parental_reference - Label merging strategy (True keeps only labels found in the primary dataset; False merges all labels)

🎯 Output

  • β€’ Unified dataset in specified folder
  • β€’ Preserved train/test/valid structure
  • β€’ Updated data.yaml with merged labels
  • β€’ Dictionary of excluded images (if any)
  • β€’ Reindexed label files

Merging Strategies

Choose the appropriate merging strategy based on your requirements:

Parental Reference Strategy

parental_reference=True - Only merge labels that exist in the primary dataset.

✅ Advantages

  • Maintains consistent label set
  • Prevents label explosion
  • Better for specific use cases
  • Easier model training

⚠️ Considerations

  • May exclude useful labels
  • Requires careful primary dataset selection
  • Some images may be excluded

Example

text
# Primary dataset: ["cat", "dog"]
# Additional dataset: ["cat", "bird", "fish"]
# Result: Only "cat" labels are merged, "bird" and "fish" are excluded
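
One way to picture this filtering is as a class-ID remap applied to each YOLO label row. The sketch below illustrates the general idea only and is not cvPal's implementation; `remap_to_primary` is a hypothetical helper, and the row layout `<class_id> <x> <y> <w> <h>` is assumed from the YOLO TXT format.

```python
def remap_to_primary(lines, source_names, primary_names):
    """Keep only annotations whose label exists in the primary dataset.

    Hypothetical helper for illustration; not part of cvPal.
    Each row is assumed to look like "<class_id> <x> <y> <w> <h>".
    """
    primary_index = {name: i for i, name in enumerate(primary_names)}
    kept = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank rows
        name = source_names[int(parts[0])]
        if name in primary_index:
            # Rewrite the class ID to the primary dataset's index.
            kept.append(" ".join([str(primary_index[name])] + parts[1:]))
    return kept


primary = ["cat", "dog"]
additional = ["cat", "bird", "fish"]
rows = [
    "0 0.5 0.5 0.2 0.2",  # cat: kept
    "1 0.3 0.3 0.1 0.1",  # bird: dropped
    "2 0.7 0.7 0.1 0.1",  # fish: dropped
]
print(remap_to_primary(rows, additional, primary))
# ['0 0.5 0.5 0.2 0.2']
```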

Inclusive Strategy

parental_reference=False - Merge all unique labels from all datasets.

✅ Advantages

  • Preserves all label information
  • Maximizes dataset diversity
  • No data loss
  • Better for general-purpose models

⚠️ Considerations

  • May create large label sets
  • Requires more training data
  • Potential class imbalance

Example

text
# Primary dataset: ["cat", "dog"]
# Additional dataset: ["cat", "bird", "fish"]
# Result: ["cat", "dog", "bird", "fish"] - all labels merged
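
The inclusive result can be thought of as an order-preserving union of label names, followed by a per-dataset map from old class IDs to merged ones. This is a minimal sketch under that assumption; `merge_label_names` is a hypothetical helper, and cvPal may order the merged labels differently.

```python
def merge_label_names(primary_names, *other_name_lists):
    """Union of label names: primary labels first, then new ones in order seen.

    Hypothetical helper for illustration; not part of cvPal.
    """
    merged = list(primary_names)
    seen = set(merged)
    for names in other_name_lists:
        for name in names:
            if name not in seen:
                seen.add(name)
                merged.append(name)
    return merged


primary = ["cat", "dog"]
additional = ["cat", "bird", "fish"]

merged = merge_label_names(primary, additional)
# Old class ID in the additional dataset -> class ID in the merged label list.
id_map = {old_id: merged.index(name) for old_id, name in enumerate(additional)}

print(merged)  # ['cat', 'dog', 'bird', 'fish']
print(id_map)  # {0: 0, 1: 2, 2: 3}
```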

Label Management

Additional functions for managing labels in your datasets:

Remove Labels

Remove specific labels from your dataset and automatically reindex remaining labels:

python
from cvpal.preprocessing import remove_label_from_txt_dataset

# Remove a label by name
remove_label_from_txt_dataset(
    dataset_path="/path/to/dataset",
    label_to_remove="unwanted_label",
)

# Remove a label by index
remove_label_from_txt_dataset(
    dataset_path="/path/to/dataset",
    label_to_remove=2,  # Remove label at index 2
)
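
To make "reindex remaining labels" concrete, here is a sketch of what removing one class does to YOLO label rows: rows for the removed class disappear, and every higher class ID shifts down by one. `drop_class` is a hypothetical helper written for illustration; it is not how cvPal is known to implement the function.

```python
def drop_class(lines, removed_id):
    """Remove rows for `removed_id` and shift higher class IDs down by one.

    Hypothetical helper for illustration; not part of cvPal.
    """
    result = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank rows
        class_id = int(parts[0])
        if class_id == removed_id:
            continue  # annotation of the removed label
        if class_id > removed_id:
            class_id -= 1  # close the gap left by the removed label
        result.append(" ".join([str(class_id)] + parts[1:]))
    return result


names = ["cat", "dog", "bird"]
rows = ["0 0.5 0.5 0.2 0.2", "1 0.4 0.4 0.1 0.1", "2 0.6 0.6 0.3 0.3"]

removed = names.index("dog")
names.pop(removed)

print(names)                      # ['cat', 'bird']
print(drop_class(rows, removed))  # ['0 0.5 0.5 0.2 0.2', '1 0.6 0.6 0.3 0.3']
```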

Replace Labels

Update label names in your dataset's YAML configuration:

python
from cvpal.preprocessing import replace_labels_in_txt_yaml

# Replace multiple labels
replace_labels_in_txt_yaml(
    dataset_path="/path/to/dataset",
    labels_dict={
        "old_label1": "new_label1",
        "old_label2": "new_label2",
        "cat": "feline",  # Rename "cat" to "feline"
    },
)
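
Because YOLO label files refer to classes by index, renaming only touches the list of names; the label files themselves stay valid. The sketch below shows that mapping on a plain list. `rename_labels` is a hypothetical helper, and the duplicate check is a design choice of this sketch rather than documented cvPal behaviour.

```python
def rename_labels(names, labels_dict):
    """Apply an old-name -> new-name mapping to a list of label names.

    Hypothetical helper for illustration; not part of cvPal.
    Names absent from `labels_dict` are left unchanged.
    """
    renamed = [labels_dict.get(name, name) for name in names]
    if len(set(renamed)) != len(renamed):
        # Two class IDs sharing one name would be ambiguous downstream.
        raise ValueError("renaming would produce duplicate label names")
    return renamed


print(rename_labels(["cat", "dog", "bird"], {"cat": "feline"}))
# ['feline', 'dog', 'bird']
```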

Count Label Occurrences

Analyze label distribution across train/test/valid splits:

python
from cvpal.preprocessing import count_labels_in_txt_dataset

# Get label counts for each split
label_counts = count_labels_in_txt_dataset("/path/to/dataset")

print("Label distribution:")
for split, counts in label_counts.items():
    print(f"\n{split.upper()}:")
    for label, count in counts.items():
        print(f"  {label}: {count}")
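
For intuition about what such a count involves, the sketch below tallies class IDs across the `.txt` files of one labels folder using only the standard library. It assumes one annotation per row with the class ID first; `count_class_ids` is a hypothetical helper and not cvPal's implementation.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_class_ids(labels_dir):
    """Count annotations per class ID across all .txt files in a folder.

    Hypothetical helper for illustration; not part of cvPal.
    """
    counts = Counter()
    for txt_file in sorted(Path(labels_dir).glob("*.txt")):
        for line in txt_file.read_text().splitlines():
            parts = line.split()
            if parts:
                counts[int(parts[0])] += 1
    return counts


# Demo on a throwaway folder with two label files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("0 0.5 0.5 0.2 0.2\n1 0.4 0.4 0.1 0.1\n")
    Path(tmp, "b.txt").write_text("0 0.3 0.3 0.1 0.1\n")
    names = ["cat", "dog"]
    by_name = {names[class_id]: n for class_id, n in count_class_ids(tmp).items()}

print(by_name)  # {'cat': 2, 'dog': 1}
```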

Find Images with Specific Labels

Locate images containing specific labels for analysis or filtering:

python
from cvpal.preprocessing import find_images_with_label_in_txt_type

# Find all images containing "cat" label
cat_images = find_images_with_label_in_txt_type(
    dataset_path="/path/to/dataset",
    label="cat",
    exclusive=False,  # Include images with other labels too
)

# Find images containing ONLY "cat" label
cat_only_images = find_images_with_label_in_txt_type(
    dataset_path="/path/to/dataset",
    label="cat",
    exclusive=True,  # Only images with this label
)

print(f"Found {len(cat_images)} images with cat label")
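
Going by the comments in the example above, the `exclusive` flag appears to distinguish "contains the label" from "contains nothing but the label". A minimal sketch of that test on a set of class IDs, with `matches_label` as a hypothetical helper and the semantics stated as an assumption:

```python
def matches_label(class_ids, target_id, exclusive=False):
    """Return True if an image's annotations match the target class.

    Hypothetical helper for illustration; not part of cvPal.
    exclusive=False: the target class appears at least once.
    exclusive=True:  the target class is the only class present.
    """
    ids = set(class_ids)
    if target_id not in ids:
        return False
    if exclusive:
        return ids == {target_id}
    return True


# Class IDs per image, with 0 standing for "cat" and 1 for "dog".
images = {"a.jpg": [0, 1], "b.jpg": [0, 0], "c.jpg": [1]}

with_cat = [name for name, ids in images.items() if matches_label(ids, 0)]
only_cat = [name for name, ids in images.items() if matches_label(ids, 0, exclusive=True)]

print(with_cat)  # ['a.jpg', 'b.jpg']
print(only_cat)  # ['b.jpg']
```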

Dataset Reporting

Generate comprehensive reports about your dataset structure and content:

Generate Dataset Report

Create a detailed pandas DataFrame report of your dataset:

python
from cvpal.preprocessing import report
import pandas as pd

# Generate comprehensive dataset report
df_report = report("/path/to/dataset")

# Display basic statistics
print(f"Total images: {len(df_report)}")
print(f"Average labels per image: {df_report['num_of_labels'].mean():.2f}")
print("Split distribution:")
print(df_report['directory'].value_counts())

# Analyze label distribution
all_labels = []
for labels in df_report['labels']:
    all_labels.extend(labels)
label_counts = pd.Series(all_labels).value_counts()
print("\nLabel frequency:")
print(label_counts)

# Save report to CSV
df_report.to_csv("dataset_report.csv", index=False)

📊 Report Columns

  • image_path - Path to image file
  • label_path - Path to label file
  • num_of_labels - Number of objects in image
  • labels - List of label names
  • directory - Split (train/test/valid)

🔍 Analysis Use Cases

  • Dataset quality assessment
  • Label distribution analysis
  • Split balance verification
  • Data cleaning and filtering
  • Training data preparation

Best Practices

✅ Dataset Preparation

  • Ensure all datasets have proper YAML files
  • Verify train/test/valid folder structure
  • Check label consistency across datasets
  • Backup original datasets before merging
  • Validate image-label file correspondence

🎯 Merging Strategy

  • Choose primary dataset carefully
  • Consider label overlap between datasets
  • Use parental_reference=True for focused tasks
  • Use parental_reference=False for diversity
  • Review excluded images after merging
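
The image-label correspondence check can be done before merging with a few lines of standard-library Python. The sketch below assumes each split keeps images and labels in separate folders and pairs them by file stem; the folder names, the suffix list, and `unmatched_files` itself are illustrative assumptions, not part of cvPal.

```python
import tempfile
from pathlib import Path

# Assumed set of image extensions; adjust to the dataset at hand.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def unmatched_files(images_dir, labels_dir):
    """Return (images without a label file, label files without an image).

    Hypothetical helper for illustration; files are paired by stem.
    """
    image_stems = {
        p.stem for p in Path(images_dir).iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
    }
    label_stems = {p.stem for p in Path(labels_dir).glob("*.txt")}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)


# Demo on a throwaway split folder.
with tempfile.TemporaryDirectory() as tmp:
    images, labels = Path(tmp, "images"), Path(tmp, "labels")
    images.mkdir()
    labels.mkdir()
    (images / "a.jpg").touch()
    (images / "b.jpg").touch()
    (labels / "a.txt").touch()
    (labels / "c.txt").touch()
    missing_labels, orphan_labels = unmatched_files(images, labels)

print(missing_labels)  # ['b']
print(orphan_labels)   # ['c']
```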

Common Issues & Solutions

Missing YAML Files

Issue: Datasets without data.yaml files cannot be merged.

Solution: Ensure each dataset has a proper data.yaml file with a 'names' field listing the label names.
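
A quick pre-flight check can flag datasets that would fail this requirement. The sketch below is a hypothetical helper, not a cvPal function; it looks for a `data.yaml` file in each dataset folder and does a crude textual search for a top-level `names:` key. It does not parse the YAML, so a real validation would use a YAML parser.

```python
import tempfile
from pathlib import Path


def yaml_problems(dataset_paths, yaml_name="data.yaml"):
    """Map each problematic dataset path to a short description of the issue.

    Hypothetical helper for illustration; the 'names' check is a plain
    text search for a top-level key, not a real YAML parse.
    """
    problems = {}
    for path in dataset_paths:
        yaml_file = Path(path, yaml_name)
        if not yaml_file.is_file():
            problems[str(path)] = f"missing {yaml_name}"
            continue
        lines = yaml_file.read_text().splitlines()
        if not any(line.startswith("names:") for line in lines):
            problems[str(path)] = "no top-level 'names' key found"
    return problems


# Demo: one dataset with a usable data.yaml, one without any.
with tempfile.TemporaryDirectory() as tmp:
    good, bad = Path(tmp, "good"), Path(tmp, "bad")
    good.mkdir()
    bad.mkdir()
    (good / "data.yaml").write_text("nc: 2\nnames: ['cat', 'dog']\n")
    found = yaml_problems([good, bad])
    print(list(found.values()))  # ['missing data.yaml']
```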

Label Conflicts

Issue: Different datasets use different names for the same object.

Solution: Use replace_labels_in_txt_yaml to standardize label names before merging.

Excluded Images

Issue: Some images are excluded during merging due to label filtering.

Solution: Review the excluded-images dictionary returned by merge_data_txt, and consider using parental_reference=False to preserve all data.