cvpal.preprocessing Module

The cvpal.preprocessing module provides comprehensive tools for dataset operations, label management, data analysis, and preparation for computer vision tasks.

Open Source Module

This is part of the open-source cvpal Python package. For platform API endpoints, see the Platform API Reference.

Installation

The preprocessing module is included with the cvpal package:

bash
pip install cvpal

Import

python
from cvpal import preprocessing

Dataset Structure Requirements

The preprocessing module requires datasets to follow a specific structure for proper functionality:

Required Structure

text
dataset/
├── images/              # Folder containing all images
│   ├── image1.jpg
│   ├── image2.png
│   └── ...
├── labels/              # Folder containing all label files
│   ├── image1.txt       # YOLO format labels
│   ├── image2.txt
│   └── ...
└── dataset.yaml         # Metadata file
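
Each .txt file in labels/ follows the standard YOLO convention: one line per object, containing a numeric class ID followed by the normalized center x, center y, width, and height of the bounding box (all values between 0 and 1). The class IDs index into the classes list in dataset.yaml. For example:

text
0 0.512 0.430 0.120 0.340
1 0.250 0.675 0.300 0.180

Here the first line describes a box for class 0 ("person" in the example metadata below) and the second a box for class 1 ("car").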

YAML Metadata File

The dataset.yaml file must contain the following structure:

yaml
# Dataset metadata
name: "my_dataset"
version: "1.0"
description: "Custom dataset for object detection"
created: "2024-01-15"
author: "Your Name"

# Dataset paths
images_path: "images"
labels_path: "labels"

# Classes and labels
classes:
  - "person"
  - "car"
  - "bicycle"
  - "dog"

# Dataset statistics (optional, can be auto-generated)
total_images: 1000
total_annotations: 5000
class_distribution:
  person: 2000
  car: 1500
  bicycle: 800
  dog: 700

# Additional metadata
license: "MIT"
tags: ["object-detection", "street-scene"]
resolution: "1920x1080"
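
A quick way to sanity-check a metadata file is to load it with PyYAML (a separate dependency: pip install pyyaml) and inspect the fields. The snippet below only assumes the structure shown above:

python
import yaml

# Load the dataset metadata (structure as shown above)
with open("my_dataset/dataset.yaml", "r") as f:
    metadata = yaml.safe_load(f)

print(f"Dataset: {metadata['name']} v{metadata['version']}")
print(f"Classes: {metadata['classes']}")

# Statistics are optional, so fall back gracefully
print(f"Total images: {metadata.get('total_images', 'not recorded')}")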

Example Dataset

You can find a complete example dataset structure at:

https://github.com/muhamed555/cvpal/tree/main/examples/sample_dataset

Functions

merge_datasets

Merge multiple datasets into a single unified dataset with proper label mapping and conflict resolution.

Function Signature

python
preprocessing.merge_datasets(
    dataset_paths: List[str],
    output_path: str,
    label_mapping: Optional[Dict[str, str]] = None,
    conflict_resolution: str = "rename"
) -> str

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset_paths | List[str] | - | List of paths to datasets to merge |
| output_path | str | - | Path where merged dataset will be saved |
| label_mapping | Dict[str, str] | None | Mapping to standardize labels across datasets |
| conflict_resolution | str | "rename" | How to handle filename conflicts: "rename", "skip", or "overwrite" |

Returns

str - Path to the merged dataset

Example

python
from cvpal import preprocessing

# Merge multiple datasets
dataset_paths = [
    "path/to/street_dataset",
    "path/to/pedestrian_dataset",
    "path/to/vehicle_dataset"
]

# Define label mapping to standardize classes
label_mapping = {
    "person": "pedestrian",
    "people": "pedestrian",
    "car": "vehicle",
    "truck": "vehicle",
    "bike": "bicycle"
}

merged_dataset = preprocessing.merge_datasets(
    dataset_paths=dataset_paths,
    output_path="merged_dataset",
    label_mapping=label_mapping,
    conflict_resolution="rename"
)
print(f"Merged dataset saved to: {merged_dataset}")
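
When merging many datasets, it helps to see which class names each one declares before writing a label_mapping. A minimal sketch (the collect_classes helper is hypothetical, and it assumes each dataset ships the dataset.yaml described above):

python
import os
import yaml

def collect_classes(dataset_paths):
    """Read the classes list from each dataset's dataset.yaml."""
    classes_by_dataset = {}
    for path in dataset_paths:
        with open(os.path.join(path, "dataset.yaml"), "r") as f:
            classes_by_dataset[path] = set(yaml.safe_load(f)["classes"])
    return classes_by_dataset

# Print each dataset's vocabulary to spot overlaps and near-duplicates
for path, names in collect_classes(dataset_paths).items():
    print(path, "->", sorted(names))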

generate_report

Generate a comprehensive analysis report for a dataset including statistics, visualizations, and quality metrics.

Function Signature

python
preprocessing.generate_report(
    dataset_path: str,
    output_path: Optional[str] = None,
    include_visualizations: bool = True,
    include_quality_metrics: bool = True
) -> Dict[str, Any]

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset_path | str | - | Path to the dataset to analyze |
| output_path | str | None | Path to save the report (optional) |
| include_visualizations | bool | True | Whether to include charts and plots |
| include_quality_metrics | bool | True | Whether to include data quality analysis |

Returns

Dict[str, Any] - Dictionary containing report data and statistics

Example

python
from cvpal import preprocessing

# Generate a comprehensive dataset report
report = preprocessing.generate_report(
    dataset_path="my_dataset",
    output_path="dataset_report.html",
    include_visualizations=True,
    include_quality_metrics=True
)

# Access report data
print(f"Total images: {report['total_images']}")
print(f"Total annotations: {report['total_annotations']}")
print(f"Class distribution: {report['class_distribution']}")
print(f"Average objects per image: {report['avg_objects_per_image']}")

# Quality metrics
if 'quality_metrics' in report:
    print(f"Label consistency: {report['quality_metrics']['label_consistency']}")
    print(f"Image quality score: {report['quality_metrics']['image_quality']}")
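
Since the report is a plain dictionary, you can layer your own checks on top of it. For instance, this sketch flags classes whose annotation count falls well below the average; it relies only on the class_distribution key shown above:

python
# Flag classes with far fewer annotations than average
distribution = report['class_distribution']
average = sum(distribution.values()) / len(distribution)

for class_name, count in sorted(distribution.items(), key=lambda item: item[1]):
    if count < 0.5 * average:
        print(f"⚠️ '{class_name}' is underrepresented: {count} annotations (average {average:.0f})")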

replace_labels

Replace or standardize labels across a dataset using a mapping dictionary.

Function Signature

python
preprocessing.replace_labels(
    dataset_path: str,
    label_mapping: Dict[str, str],
    backup: bool = True
) -> Dict[str, int]

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset_path | str | - | Path to the dataset to modify |
| label_mapping | Dict[str, str] | - | Mapping from old labels to new labels |
| backup | bool | True | Whether to create a backup before modification |

Returns

Dict[str, int] - Dictionary showing how many labels were replaced for each mapping

Example

python
from cvpal import preprocessing

# Define label replacements
label_mapping = {
    "person": "pedestrian",
    "people": "pedestrian",
    "car": "vehicle",
    "truck": "vehicle",
    "bike": "bicycle",
    "motorcycle": "bicycle"
}

# Replace labels in the dataset
replacement_stats = preprocessing.replace_labels(
    dataset_path="my_dataset",
    label_mapping=label_mapping,
    backup=True
)

# Check replacement statistics
print("Label replacement statistics:")
for old_label, new_label in label_mapping.items():
    count = replacement_stats.get(f"{old_label} -> {new_label}", 0)
    print(f"  {old_label} -> {new_label}: {count} labels replaced")
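
Keep in mind that YOLO label files store numeric class IDs rather than names, so a rename ultimately means updating the classes list in dataset.yaml and remapping the IDs in every .txt file. The sketch below only illustrates that idea; it is not cvpal's internal implementation:

python
def remap_label_file(path, old_classes, new_classes, name_mapping):
    """Rewrite one YOLO .txt file so its class IDs index into new_classes."""
    remapped = []
    with open(path, "r") as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            old_name = old_classes[int(parts[0])]            # ID -> old name
            new_name = name_mapping.get(old_name, old_name)  # apply mapping
            parts[0] = str(new_classes.index(new_name))      # new name -> new ID
            remapped.append(" ".join(parts))
    with open(path, "w") as f:
        f.write("\n".join(remapped) + "\n")

# "person" and "people" both collapse onto the ID of "pedestrian"
remap_label_file(
    "my_dataset/labels/image1.txt",
    old_classes=["person", "people", "car"],
    new_classes=["pedestrian", "vehicle"],
    name_mapping={"person": "pedestrian", "people": "pedestrian", "car": "vehicle"},
)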

validate_dataset

Validate dataset structure, label format, and data integrity.

Function Signature

python
preprocessing.validate_dataset(
    dataset_path: str,
    strict: bool = True
) -> Dict[str, Any]

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset_path | str | - | Path to the dataset to validate |
| strict | bool | True | Whether to use strict validation rules |

Returns

Dict[str, Any] - Validation results including errors, warnings, and statistics

Example

python
from cvpal import preprocessing

# Validate dataset structure and integrity
validation_result = preprocessing.validate_dataset(
    dataset_path="my_dataset",
    strict=True
)

# Check validation results
if validation_result['is_valid']:
    print("✅ Dataset is valid!")
    print(f"Total images: {validation_result['total_images']}")
    print(f"Total annotations: {validation_result['total_annotations']}")
else:
    print("❌ Dataset validation failed!")
    print("Errors:")
    for error in validation_result['errors']:
        print(f"  - {error}")
    print("Warnings:")
    for warning in validation_result['warnings']:
        print(f"  - {warning}")
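
To get a feel for what structural validation involves, here is a minimal hand-rolled check that mirrors the required layout (a sketch, not cvpal's actual validation logic):

python
from pathlib import Path

def basic_structure_check(dataset_path):
    """Return a list of structural problems found in a dataset."""
    root = Path(dataset_path)
    errors = []
    # Required top-level entries
    for required in ("images", "labels", "dataset.yaml"):
        if not (root / required).exists():
            errors.append(f"missing {required}")
    # Every image should have a matching YOLO label file
    if not errors:
        for image in (root / "images").iterdir():
            if not (root / "labels" / (image.stem + ".txt")).exists():
                errors.append(f"no label file for {image.name}")
    return errors

for problem in basic_structure_check("my_dataset"):
    print(f"  - {problem}")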

split_dataset

Split a dataset into train, validation, and test sets with proper stratification.

Function Signature

python
preprocessing.split_dataset(
    dataset_path: str,
    output_path: str,
    train_ratio: float = 0.7,
    val_ratio: float = 0.2,
    test_ratio: float = 0.1,
    stratify: bool = True
) -> Dict[str, str]

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset_path | str | - | Path to the dataset to split |
| output_path | str | - | Base path for the split datasets |
| train_ratio | float | 0.7 | Proportion for the training set |
| val_ratio | float | 0.2 | Proportion for the validation set |
| test_ratio | float | 0.1 | Proportion for the test set |
| stratify | bool | True | Whether to maintain class distribution across splits |

Returns

Dict[str, str] - Dictionary with paths to train, val, and test datasets

Example

python
from cvpal import preprocessing

# Split dataset into train/val/test sets
split_result = preprocessing.split_dataset(
    dataset_path="my_dataset",
    output_path="split_datasets",
    train_ratio=0.8,
    val_ratio=0.1,
    test_ratio=0.1,
    stratify=True
)

print("Dataset split completed!")
print(f"Train set: {split_result['train']}")
print(f"Validation set: {split_result['val']}")
print(f"Test set: {split_result['test']}")
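
The three ratios are presumably expected to sum to 1.0 so that the splits cover the whole dataset (the defaults 0.7 + 0.2 + 0.1 do). A quick guard before calling split_dataset catches typos early:

python
import math

train_ratio, val_ratio, test_ratio = 0.8, 0.1, 0.1

# Guard against ratios that do not cover the whole dataset
assert math.isclose(train_ratio + val_ratio + test_ratio, 1.0), \
    "train/val/test ratios must sum to 1.0"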

Complete Example

Here's a complete example showing how to use the preprocessing module for dataset management:

python
from cvpal import preprocessing
import os

def process_dataset_pipeline(input_datasets, output_dir):
    """Complete dataset processing pipeline."""
    os.makedirs(output_dir, exist_ok=True)

    # 1. Validate input datasets
    print("Validating input datasets...")
    for dataset_path in input_datasets:
        validation = preprocessing.validate_dataset(dataset_path)
        if not validation['is_valid']:
            print(f"❌ Dataset {dataset_path} is invalid!")
            return
        print(f"✅ Dataset {dataset_path} is valid")

    # 2. Merge datasets with label standardization
    print("\nMerging datasets...")
    label_mapping = {
        "person": "pedestrian",
        "people": "pedestrian",
        "car": "vehicle",
        "truck": "vehicle"
    }
    merged_path = preprocessing.merge_datasets(
        dataset_paths=input_datasets,
        output_path=os.path.join(output_dir, "merged_dataset"),
        label_mapping=label_mapping
    )
    print(f"✅ Merged dataset saved to: {merged_path}")

    # 3. Replace labels for consistency
    print("\nStandardizing labels...")
    replacement_stats = preprocessing.replace_labels(
        dataset_path=merged_path,
        label_mapping=label_mapping
    )
    print(f"✅ Replaced {sum(replacement_stats.values())} labels")

    # 4. Split into train/val/test
    print("\nSplitting dataset...")
    split_result = preprocessing.split_dataset(
        dataset_path=merged_path,
        output_path=os.path.join(output_dir, "split_datasets"),
        train_ratio=0.8,
        val_ratio=0.1,
        test_ratio=0.1
    )
    print("✅ Dataset split completed")

    # 5. Generate reports for each split
    print("\nGenerating reports...")
    for split_name, split_path in split_result.items():
        report = preprocessing.generate_report(
            dataset_path=split_path,
            output_path=os.path.join(output_dir, f"{split_name}_report.html")
        )
        print(f"✅ {split_name.capitalize()} report: "
              f"{report['total_images']} images, {report['total_annotations']} annotations")

    print("\n🎉 Dataset processing pipeline completed!")

# Example usage
input_datasets = [
    "dataset1",
    "dataset2",
    "dataset3"
]
process_dataset_pipeline(input_datasets, "processed_datasets")

Error Handling

The preprocessing module includes comprehensive error handling for common issues:

Invalid Dataset Structure

Missing required folders or files raise a DatasetStructureError:

python
# Import path assumed; adjust to wherever cvpal exposes its exceptions
from cvpal.preprocessing import DatasetStructureError

try:
    preprocessing.validate_dataset("invalid_dataset")
except DatasetStructureError as e:
    print(f"Dataset structure error: {e}")

Label Format Error

Invalid label files raise a LabelFormatError:

python
# Import path assumed; adjust to wherever cvpal exposes its exceptions
from cvpal.preprocessing import LabelFormatError

try:
    preprocessing.replace_labels("dataset", {"old": "new"})
except LabelFormatError as e:
    print(f"Label format error: {e}")
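
Both exceptions can also be caught together when wrapping a multi-step pipeline; as above, the import path is an assumption:

python
# Import path assumed; adjust to wherever cvpal exposes its exceptions
from cvpal.preprocessing import DatasetStructureError, LabelFormatError
from cvpal import preprocessing

try:
    preprocessing.validate_dataset("my_dataset")
    preprocessing.replace_labels("my_dataset", {"person": "pedestrian"})
except (DatasetStructureError, LabelFormatError) as e:
    print(f"Preprocessing failed: {e}")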

Best Practices

Dataset Organization

  • Always follow the required dataset structure (images/, labels/, dataset.yaml)
  • Use descriptive names for your dataset and classes
  • Include comprehensive metadata in the YAML file
  • Create backups before making modifications

Label Management

  • Use consistent label naming across datasets
  • Validate labels before merging datasets
  • Document label mappings for future reference
  • Check for label conflicts when merging

Quality Assurance

  • Always validate datasets before processing
  • Generate reports to understand data distribution
  • Use stratified splitting for balanced datasets
  • Monitor quality metrics during processing