cvpal.preprocessing Module
The cvpal.preprocessing module provides comprehensive tools for dataset operations, label management, data analysis, and preparation for computer vision tasks.
Open Source Module
This is part of the open-source cvpal Python package. For platform API endpoints, see the Platform API Reference.
Installation
The preprocessing module is included with the cvpal package:
```bash
pip install cvpal
```
Import
```python
from cvpal import preprocessing
```
Dataset Structure Requirements
The preprocessing module requires datasets to follow a specific structure for proper functionality:
Required Structure
```text
dataset/
├── images/              # Folder containing all images
│   ├── image1.jpg
│   ├── image2.png
│   └── ...
├── labels/              # Folder containing all label files
│   ├── image1.txt       # YOLO format labels
│   ├── image2.txt
│   └── ...
└── dataset.yaml         # Metadata file
```
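The label files referenced above use the YOLO text format: one object per line, written as `class_id x_center y_center width height`, with all coordinates normalized to the [0, 1] range. A minimal parser, shown only to illustrate the format (it is not part of cvpal):

```python
def parse_yolo_line(line: str):
    """Parse one YOLO label line: 'class_id x_center y_center width height'."""
    class_id, x_c, y_c, w, h = line.split()
    return int(class_id), float(x_c), float(y_c), float(w), float(h)

# Example: class 0 centered at (0.5, 0.5), covering half the image in each dimension
print(parse_yolo_line("0 0.5 0.5 0.5 0.5"))  # (0, 0.5, 0.5, 0.5, 0.5)
```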
YAML Metadata File
The dataset.yaml file must contain the following structure:
```yaml
# Dataset metadata
name: "my_dataset"
version: "1.0"
description: "Custom dataset for object detection"
created: "2024-01-15"
author: "Your Name"

# Dataset paths
images_path: "images"
labels_path: "labels"

# Classes and labels
classes:
  - "person"
  - "car"
  - "bicycle"
  - "dog"

# Dataset statistics (optional, can be auto-generated)
total_images: 1000
total_annotations: 5000
class_distribution:
  person: 2000
  car: 1500
  bicycle: 800
  dog: 700

# Additional metadata
license: "MIT"
tags: ["object-detection", "street-scene"]
resolution: "1920x1080"
```
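Before handing a dataset to the module, it can help to verify the layout yourself. A minimal sketch using only the standard library (the `check_dataset_layout` helper is illustrative, not a cvpal function):

```python
import os
import tempfile

def check_dataset_layout(dataset_path: str) -> list:
    """Return the required entries missing from the dataset root."""
    required = ["images", "labels", "dataset.yaml"]
    return [r for r in required if not os.path.exists(os.path.join(dataset_path, r))]

# Build a minimal skeleton in a temporary directory to demonstrate the check
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "images"))
    os.makedirs(os.path.join(root, "labels"))
    print(check_dataset_layout(root))  # ['dataset.yaml'] (metadata file still missing)
```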
Example Dataset
You can find a complete example dataset structure at:
https://github.com/muhamed555/cvpal/tree/main/examples/sample_dataset
Functions
merge_datasets
Merge multiple datasets into a single unified dataset with proper label mapping and conflict resolution.
Function Signature
```python
preprocessing.merge_datasets(
    dataset_paths: List[str],
    output_path: str,
    label_mapping: Optional[Dict[str, str]] = None,
    conflict_resolution: str = "rename"
) -> str
```
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset_paths | List[str] | - | List of paths to datasets to merge |
| output_path | str | - | Path where merged dataset will be saved |
| label_mapping | Dict[str, str] | None | Mapping to standardize labels across datasets |
| conflict_resolution | str | "rename" | How to handle filename conflicts: "rename", "skip", "overwrite" |
Returns
str - Path to the merged dataset
Example
```python
from cvpal import preprocessing

# Merge multiple datasets
dataset_paths = [
    "path/to/street_dataset",
    "path/to/pedestrian_dataset",
    "path/to/vehicle_dataset"
]

# Define label mapping to standardize classes
label_mapping = {
    "person": "pedestrian",
    "people": "pedestrian",
    "car": "vehicle",
    "truck": "vehicle",
    "bike": "bicycle"
}

merged_dataset = preprocessing.merge_datasets(
    dataset_paths=dataset_paths,
    output_path="merged_dataset",
    label_mapping=label_mapping,
    conflict_resolution="rename"
)

print(f"Merged dataset saved to: {merged_dataset}")
```
generate_report
Generate a comprehensive analysis report for a dataset including statistics, visualizations, and quality metrics.
Function Signature
```python
preprocessing.generate_report(
    dataset_path: str,
    output_path: Optional[str] = None,
    include_visualizations: bool = True,
    include_quality_metrics: bool = True
) -> Dict[str, Any]
```
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset_path | str | - | Path to the dataset to analyze |
| output_path | str | None | Path to save the report (optional) |
| include_visualizations | bool | True | Whether to include charts and plots |
| include_quality_metrics | bool | True | Whether to include data quality analysis |
Returns
Dict[str, Any] - Dictionary containing report data and statistics
Example
```python
from cvpal import preprocessing

# Generate comprehensive dataset report
report = preprocessing.generate_report(
    dataset_path="my_dataset",
    output_path="dataset_report.html",
    include_visualizations=True,
    include_quality_metrics=True
)

# Access report data
print(f"Total images: {report['total_images']}")
print(f"Total annotations: {report['total_annotations']}")
print(f"Class distribution: {report['class_distribution']}")
print(f"Average objects per image: {report['avg_objects_per_image']}")

# Quality metrics
if 'quality_metrics' in report:
    print(f"Label consistency: {report['quality_metrics']['label_consistency']}")
    print(f"Image quality score: {report['quality_metrics']['image_quality']}")
```
replace_labels
Replace or standardize labels across a dataset using a mapping dictionary.
Function Signature
```python
preprocessing.replace_labels(
    dataset_path: str,
    label_mapping: Dict[str, str],
    backup: bool = True
) -> Dict[str, int]
```
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset_path | str | - | Path to the dataset to modify |
| label_mapping | Dict[str, str] | - | Mapping from old labels to new labels |
| backup | bool | True | Whether to create backup before modification |
Returns
Dict[str, int] - Dictionary showing how many labels were replaced for each mapping
Example
```python
from cvpal import preprocessing

# Define label replacements
label_mapping = {
    "person": "pedestrian",
    "people": "pedestrian",
    "car": "vehicle",
    "truck": "vehicle",
    "bike": "bicycle",
    "motorcycle": "bicycle"
}

# Replace labels in dataset
replacement_stats = preprocessing.replace_labels(
    dataset_path="my_dataset",
    label_mapping=label_mapping,
    backup=True
)

# Check replacement statistics
print("Label replacement statistics:")
for old_label, new_label in label_mapping.items():
    count = replacement_stats.get(f"{old_label} -> {new_label}", 0)
    print(f"  {old_label} -> {new_label}: {count} labels replaced")
```
validate_dataset
Validate dataset structure, label format, and data integrity.
Function Signature
```python
preprocessing.validate_dataset(
    dataset_path: str,
    strict: bool = True
) -> Dict[str, Any]
```
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset_path | str | - | Path to the dataset to validate |
| strict | bool | True | Whether to use strict validation rules |
Returns
Dict[str, Any] - Validation results including errors, warnings, and statistics
Example
```python
from cvpal import preprocessing

# Validate dataset structure and integrity
validation_result = preprocessing.validate_dataset(
    dataset_path="my_dataset",
    strict=True
)

# Check validation results
if validation_result['is_valid']:
    print("✓ Dataset is valid!")
    print(f"Total images: {validation_result['total_images']}")
    print(f"Total annotations: {validation_result['total_annotations']}")
else:
    print("✗ Dataset validation failed!")
    print("Errors:")
    for error in validation_result['errors']:
        print(f"  - {error}")
    print("Warnings:")
    for warning in validation_result['warnings']:
        print(f"  - {warning}")
```
split_dataset
Split a dataset into train, validation, and test sets with proper stratification.
Function Signature
```python
preprocessing.split_dataset(
    dataset_path: str,
    output_path: str,
    train_ratio: float = 0.7,
    val_ratio: float = 0.2,
    test_ratio: float = 0.1,
    stratify: bool = True
) -> Dict[str, str]
```
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset_path | str | - | Path to the dataset to split |
| output_path | str | - | Base path for split datasets |
| train_ratio | float | 0.7 | Proportion for training set |
| val_ratio | float | 0.2 | Proportion for validation set |
| test_ratio | float | 0.1 | Proportion for test set |
| stratify | bool | True | Whether to maintain class distribution across splits |
Returns
Dict[str, str] - Dictionary with paths to train, val, and test datasets
Example
```python
from cvpal import preprocessing

# Split dataset into train/val/test sets
split_result = preprocessing.split_dataset(
    dataset_path="my_dataset",
    output_path="split_datasets",
    train_ratio=0.8,
    val_ratio=0.1,
    test_ratio=0.1,
    stratify=True
)

print("Dataset split completed!")
print(f"Train set: {split_result['train']}")
print(f"Validation set: {split_result['val']}")
print(f"Test set: {split_result['test']}")
```
Complete Example
Here's a complete example showing how to use the preprocessing module for dataset management:
```python
from cvpal import preprocessing
import os


def process_dataset_pipeline(input_datasets, output_dir):
    """Complete dataset processing pipeline"""
    os.makedirs(output_dir, exist_ok=True)

    # 1. Validate input datasets
    print("Validating input datasets...")
    for dataset_path in input_datasets:
        validation = preprocessing.validate_dataset(dataset_path)
        if not validation['is_valid']:
            print(f"✗ Dataset {dataset_path} is invalid!")
            return
        print(f"✓ Dataset {dataset_path} is valid")

    # 2. Merge datasets with label standardization
    print("\nMerging datasets...")
    label_mapping = {
        "person": "pedestrian",
        "people": "pedestrian",
        "car": "vehicle",
        "truck": "vehicle"
    }
    merged_path = preprocessing.merge_datasets(
        dataset_paths=input_datasets,
        output_path=os.path.join(output_dir, "merged_dataset"),
        label_mapping=label_mapping
    )
    print(f"✓ Merged dataset saved to: {merged_path}")

    # 3. Replace labels for consistency
    print("\nStandardizing labels...")
    replacement_stats = preprocessing.replace_labels(
        dataset_path=merged_path,
        label_mapping=label_mapping
    )
    print(f"✓ Replaced {sum(replacement_stats.values())} labels")

    # 4. Split into train/val/test
    print("\nSplitting dataset...")
    split_result = preprocessing.split_dataset(
        dataset_path=merged_path,
        output_path=os.path.join(output_dir, "split_datasets"),
        train_ratio=0.8,
        val_ratio=0.1,
        test_ratio=0.1
    )
    print("✓ Dataset split completed")

    # 5. Generate reports for each split
    print("\nGenerating reports...")
    for split_name, split_path in split_result.items():
        report = preprocessing.generate_report(
            dataset_path=split_path,
            output_path=os.path.join(output_dir, f"{split_name}_report.html")
        )
        print(f"✓ {split_name.capitalize()} report: {report['total_images']} images, "
              f"{report['total_annotations']} annotations")

    print("\n🎉 Dataset processing pipeline completed!")


# Example usage
input_datasets = [
    "dataset1",
    "dataset2",
    "dataset3"
]
process_dataset_pipeline(input_datasets, "processed_datasets")
```
Error Handling
The preprocessing module includes comprehensive error handling for common issues:
Invalid Dataset Structure
Missing required folders or files raise a `DatasetStructureError`:
```python
try:
    preprocessing.validate_dataset("invalid_dataset")
except DatasetStructureError as e:
    print(f"Dataset structure error: {e}")
```
Label Format Error
Invalid label files raise a `LabelFormatError`:
```python
try:
    preprocessing.replace_labels("dataset", {"old": "new"})
except LabelFormatError as e:
    print(f"Label format error: {e}")
```
Best Practices
Dataset Organization
- Always follow the required dataset structure (images/, labels/, dataset.yaml)
- Use descriptive names for your dataset and classes
- Include comprehensive metadata in the YAML file
- Create backups before making modifications
Label Management
- Use consistent label naming across datasets
- Validate labels before merging datasets
- Document label mappings for future reference
- Check for label conflicts when merging
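One simple way to check for label conflicts before merging is to compare the class lists declared in each dataset's `dataset.yaml`. The helper below is illustrative, not part of cvpal:

```python
from collections import defaultdict

def find_overlapping_classes(class_lists: dict) -> dict:
    """Map each class name to the datasets declaring it, keeping only overlaps."""
    owners = defaultdict(list)
    for dataset_name, classes in class_lists.items():
        for cls in classes:
            owners[cls].append(dataset_name)
    return {cls: ds for cls, ds in owners.items() if len(ds) > 1}

overlaps = find_overlapping_classes({
    "street_dataset": ["person", "car", "bicycle"],
    "vehicle_dataset": ["car", "truck"],
})
print(overlaps)  # {'car': ['street_dataset', 'vehicle_dataset']}
```

Classes that appear in more than one dataset are exactly the ones worth covering in an explicit `label_mapping` when calling `merge_datasets`.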
Quality Assurance
- Always validate datasets before processing
- Generate reports to understand data distribution
- Use stratified splitting for balanced datasets
- Monitor quality metrics during processing