Core Features
Dataset Merging
Combine multiple TXT-format datasets with intelligent label alignment and conflict resolution.
Overview
cvPal's dataset merging feature allows you to combine multiple YOLO-format datasets into a single unified dataset. The system intelligently handles label conflicts, reindexes class IDs, and maintains data integrity across different dataset sources.
Key Features
Smart Merging
Intelligent label alignment and conflict resolution
Label Management
Automatic reindexing and label mapping
Structure Preservation
Maintains train/test/valid splits
Basic Usage
Merge multiple datasets using the merge_data_txt function:
Simple Merge Example
from cvpal.preprocessing import merge_data_txt

# Merge multiple datasets
unified_path, excluded_images = merge_data_txt(
    path="/path/to/primary/dataset",  # Primary dataset (reference)
    paths=["/path/to/dataset2", "/path/to/dataset3"],  # Additional datasets
    folder_name_provided="merged_dataset",  # Output folder name
    base_storage_path="/output/location",  # Where to save merged dataset
    parental_reference=False  # Merge all labels
)

print(f"Merged dataset saved to: {unified_path}")
print(f"Excluded images: {len(excluded_images)}")
Parameters
- path: Primary dataset path (reference)
- paths: List of additional dataset paths
- folder_name_provided: Output folder name
- base_storage_path: Output location (None for current dir)
- parental_reference: Label merging strategy
Output
- Unified dataset in specified folder
- Preserved train/test/valid structure
- Updated data.yaml with merged labels
- Dictionary of excluded images (if any)
- Reindexed label files
Merging Strategies
Choose the appropriate merging strategy based on your requirements:
Parental Reference Strategy
parental_reference=True - Only merge labels that exist in the primary dataset.
Advantages
- Maintains consistent label set
- Prevents label explosion
- Better for specific use cases
- Easier model training
Considerations
- May exclude useful labels
- Requires careful primary dataset selection
- Some images may be excluded
Example
# Primary dataset: ["cat", "dog"]
# Additional dataset: ["cat", "bird", "fish"]
# Result: Only "cat" labels are merged, "bird" and "fish" are excluded
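To make the filtering concrete, here is a minimal sketch of how class IDs from an additional dataset could be remapped onto the primary dataset's label list. It illustrates the idea only and is not cvPal's internal code; build_parental_mapping is a hypothetical helper name.

```python
def build_parental_mapping(primary_names, other_names):
    """Map class IDs of an additional dataset onto the primary label list.

    Labels that do not exist in the primary dataset get no entry, so their
    annotations would be dropped. Hypothetical helper, not part of cvPal.
    """
    primary_index = {name: idx for idx, name in enumerate(primary_names)}
    return {
        old_id: primary_index[name]
        for old_id, name in enumerate(other_names)
        if name in primary_index
    }


mapping = build_parental_mapping(["cat", "dog"], ["cat", "bird", "fish"])
print(mapping)  # {0: 0}
```

Only "cat" (ID 0 in both datasets) survives; "bird" and "fish" have no target ID in the primary label list.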
Inclusive Strategy
parental_reference=False - Merge all unique labels from all datasets.
Advantages
- Preserves all label information
- Maximizes dataset diversity
- No data loss
- Better for general-purpose models
Considerations
- May create large label sets
- Requires more training data
- Potential class imbalance
Example
# Primary dataset: ["cat", "dog"]
# Additional dataset: ["cat", "bird", "fish"]
# Result: ["cat", "dog", "bird", "fish"] - all labels merged
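The inclusive case can be sketched the same way: new labels are appended to the primary list and every class ID of the additional dataset receives a target ID. This is an assumption about how such a merge might work, not a description of cvPal's internals; build_inclusive_mapping is a hypothetical name.

```python
def build_inclusive_mapping(primary_names, other_names):
    """Return the merged label list and an old-ID -> new-ID mapping.

    Primary labels keep their IDs; unseen labels are appended in the order
    they appear. Hypothetical helper, not part of cvPal.
    """
    merged = list(primary_names)
    mapping = {}
    for old_id, name in enumerate(other_names):
        if name not in merged:
            merged.append(name)
        mapping[old_id] = merged.index(name)
    return merged, mapping


merged, mapping = build_inclusive_mapping(["cat", "dog"], ["cat", "bird", "fish"])
print(merged)   # ['cat', 'dog', 'bird', 'fish']
print(mapping)  # {0: 0, 1: 2, 2: 3}
```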
Label Management
Additional functions for managing labels in your datasets:
Remove Labels
Remove specific labels from your dataset and automatically reindex remaining labels:
from cvpal.preprocessing import remove_label_from_txt_dataset

# Remove a label by name
remove_label_from_txt_dataset(
    dataset_path="/path/to/dataset",
    label_to_remove="unwanted_label"
)

# Remove a label by index
remove_label_from_txt_dataset(
    dataset_path="/path/to/dataset",
    label_to_remove=2  # Remove label at index 2
)
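As a rough illustration of what "remove and reindex" means for a YOLO label file (one object per line: class ID followed by normalized box coordinates), the sketch below drops one class and shifts higher IDs down by one. It is a standalone example under that assumption, not the code cvPal runs; remove_class_from_lines is a hypothetical name.

```python
def remove_class_from_lines(lines, class_to_remove):
    """Drop annotations of one class and shift higher class IDs down by one.

    Hypothetical helper for illustration, not part of cvPal.
    """
    kept = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        class_id = int(parts[0])
        if class_id == class_to_remove:
            continue  # annotation of the removed class
        if class_id > class_to_remove:
            class_id -= 1  # close the gap left by the removed class
        kept.append(" ".join([str(class_id)] + parts[1:]))
    return kept


lines = ["0 0.5 0.5 0.2 0.2", "2 0.1 0.1 0.3 0.3", "3 0.7 0.7 0.1 0.1"]
result = remove_class_from_lines(lines, 2)
print(result)  # ['0 0.5 0.5 0.2 0.2', '2 0.7 0.7 0.1 0.1']
```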
Replace Labels
Update label names in your dataset's YAML configuration:
from cvpal.preprocessing import replace_labels_in_txt_yaml

# Replace multiple labels
replace_labels_in_txt_yaml(
    dataset_path="/path/to/dataset",
    labels_dict={
        "old_label1": "new_label1",
        "old_label2": "new_label2",
        "cat": "feline"  # Rename "cat" to "feline"
    }
)
Count Label Occurrences
Analyze label distribution across train/test/valid splits:
from cvpal.preprocessing import count_labels_in_txt_dataset

# Get label counts for each split
label_counts = count_labels_in_txt_dataset("/path/to/dataset")

print("Label distribution:")
for split, counts in label_counts.items():
    print(f"\n{split.upper()}:")
    for label, count in counts.items():
        print(f" {label}: {count}")
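For readers who want to see what such a count involves, the following sketch tallies class IDs across the .txt label files of a single folder using only the standard library. It assumes one object per line with the class ID first, and it is not how cvPal necessarily implements the function; count_class_ids is a hypothetical name.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_class_ids(labels_dir):
    """Count how often each class ID appears in a folder of YOLO label files.

    Hypothetical helper for illustration, not part of cvPal.
    """
    counts = Counter()
    for label_file in Path(labels_dir).glob("*.txt"):
        for line in label_file.read_text().splitlines():
            parts = line.split()
            if parts:
                counts[int(parts[0])] += 1
    return counts


# Demo on a throwaway folder with two label files
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "img1.txt").write_text("0 0.5 0.5 0.2 0.2\n1 0.3 0.3 0.1 0.1\n")
    Path(tmp, "img2.txt").write_text("0 0.4 0.4 0.2 0.2\n")
    demo_counts = count_class_ids(tmp)

print(sorted(demo_counts.items()))  # [(0, 2), (1, 1)]
```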
Find Images with Specific Labels
Locate images containing specific labels for analysis or filtering:
from cvpal.preprocessing import find_images_with_label_in_txt_type

# Find all images containing "cat" label
cat_images = find_images_with_label_in_txt_type(
    dataset_path="/path/to/dataset",
    label="cat",
    exclusive=False  # Include images with other labels too
)

# Find images containing ONLY "cat" label
cat_only_images = find_images_with_label_in_txt_type(
    dataset_path="/path/to/dataset",
    label="cat",
    exclusive=True  # Only images with this label
)

print(f"Found {len(cat_images)} images with cat label")
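The difference between the two modes comes down to a set comparison on the class IDs found in one image's label file. The sketch below shows that logic in isolation, based on the behavior described above; image_matches is a hypothetical helper, not a cvPal function.

```python
def image_matches(class_ids, target_id, exclusive=False):
    """Decide whether an image's class IDs satisfy the label filter.

    exclusive=False: the target must be present, other labels are allowed.
    exclusive=True: the target must be the only label in the image.
    Hypothetical helper for illustration, not part of cvPal.
    """
    ids = set(class_ids)
    if exclusive:
        return ids == {target_id}
    return target_id in ids


print(image_matches([0, 1], 0))                  # True
print(image_matches([0, 1], 0, exclusive=True))  # False
print(image_matches([0, 0], 0, exclusive=True))  # True
```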
Dataset Reporting
Generate comprehensive reports about your dataset structure and content:
Generate Dataset Report
Create a detailed pandas DataFrame report of your dataset:
from cvpal.preprocessing import report
import pandas as pd

# Generate comprehensive dataset report
df_report = report("/path/to/dataset")

# Display basic statistics
print(f"Total images: {len(df_report)}")
print(f"Average labels per image: {df_report['num_of_labels'].mean():.2f}")
print("Split distribution:")
print(df_report['directory'].value_counts())

# Analyze label distribution
all_labels = []
for labels in df_report['labels']:
    all_labels.extend(labels)

label_counts = pd.Series(all_labels).value_counts()
print("\nLabel frequency:")
print(label_counts)

# Save report to CSV
df_report.to_csv("dataset_report.csv", index=False)
Report Columns
- image_path: Path to image file
- label_path: Path to label file
- num_of_labels: Number of objects in image
- labels: List of label names
- directory: Split (train/test/valid)
Analysis Use Cases
- Dataset quality assessment
- Label distribution analysis
- Split balance verification
- Data cleaning and filtering
- Training data preparation
Best Practices
Dataset Preparation
- Ensure all datasets have proper YAML files
- Verify train/test/valid folder structure
- Check label consistency across datasets
- Backup original datasets before merging
- Validate image-label file correspondence
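The image-label correspondence check in the list above can be done with a few lines of standard-library Python before merging. This sketch assumes each split keeps images and labels in separate folders with matching file stems (a common YOLO layout, but an assumption here); find_unpaired and IMAGE_SUFFIXES are hypothetical names, not part of cvPal.

```python
import tempfile
from pathlib import Path

# Assumed set of image extensions; adjust to your data
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def find_unpaired(images_dir, labels_dir):
    """Return (images without a label file, label files without an image)."""
    image_stems = {
        p.stem for p in Path(images_dir).iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
    }
    label_stems = {p.stem for p in Path(labels_dir).glob("*.txt")}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)


# Demo on a throwaway folder structure
with tempfile.TemporaryDirectory() as tmp:
    images = Path(tmp, "images")
    labels = Path(tmp, "labels")
    images.mkdir()
    labels.mkdir()
    (images / "a.jpg").touch()
    (images / "b.png").touch()
    (labels / "a.txt").touch()
    (labels / "c.txt").touch()
    missing_labels, orphan_labels = find_unpaired(images, labels)

print(missing_labels)  # ['b']
print(orphan_labels)   # ['c']
```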
Merging Strategy
- Choose primary dataset carefully
- Consider label overlap between datasets
- Use parental_reference=True for focused tasks
- Use parental_reference=False for diversity
- Review excluded images after merging
Common Issues & Solutions
Missing YAML Files
Issue: Datasets without data.yaml files cannot be merged.
Solution: Ensure each dataset has a proper data.yaml file with a 'names' field containing the label names.
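A quick pre-flight check can catch this before a merge is attempted. The sketch below only tests that a data.yaml file exists at the root of each dataset folder (the location is an assumption) and does not validate its contents; datasets_missing_yaml is a hypothetical helper, not a cvPal function.

```python
import tempfile
from pathlib import Path


def datasets_missing_yaml(dataset_paths, yaml_name="data.yaml"):
    """Return the dataset paths that have no data.yaml at their root."""
    return [str(p) for p in dataset_paths if not Path(p, yaml_name).is_file()]


# Demo: one dataset with a YAML file, one without
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp, "dataset_a")
    good.mkdir()
    (good / "data.yaml").write_text("names: ['cat', 'dog']\n")
    bad = Path(tmp, "dataset_b")
    bad.mkdir()
    missing = datasets_missing_yaml([good, bad])
    missing_names = [Path(p).name for p in missing]

print(missing_names)  # ['dataset_b']
```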
Label Conflicts
Issue: Different datasets use different names for the same object.
Solution: Use replace_labels_in_txt_yaml to standardize label names before merging.
Excluded Images
Issue: Some images are excluded during merging due to label filtering.
Solution: Review the excluded images dictionary returned by merge_data_txt and consider using parental_reference=False to preserve all data.