cvpal.preprocessing Module

The cvpal.preprocessing module provides comprehensive tools for dataset operations, label management, data analysis, and preparation for computer vision tasks.

Open Source Module

This is part of the open-source cvpal Python package. For platform API endpoints, see the Platform API Reference.

Installation

The preprocessing module is included with the cvpal package:

bash
pip install cvpal

Import

python
from cvpal import preprocessing

Dataset Structure Requirements

The preprocessing module requires datasets to follow a specific structure for proper functionality:

Required Structure

text
dataset/
├── images/              # Folder containing all images
│   ├── image1.jpg
│   ├── image2.png
│   └── ...
├── labels/              # Folder containing all label files
│   ├── image1.txt       # YOLO format labels
│   ├── image2.txt
│   └── ...
└── dataset.yaml         # Metadata file
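
Each .txt file in labels/ follows the standard YOLO convention: one line per object, containing a numeric class ID followed by the normalized center x, center y, width, and height of the bounding box (all values between 0 and 1). The class IDs index into the classes list in dataset.yaml. For example:

text
0 0.512 0.430 0.120 0.340
1 0.250 0.675 0.300 0.180

Here the first line describes a box for class 0 ("person" in the example metadata below) and the second a box for class 1 ("car").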

YAML Metadata File

The dataset.yaml file must contain the following structure:

yaml
# Dataset metadata
name: "my_dataset"
version: "1.0"
description: "Custom dataset for object detection"
created: "2024-01-15"
author: "Your Name"

# Dataset paths
images_path: "images"
labels_path: "labels"

# Classes and labels
classes:
  - "person"
  - "car"
  - "bicycle"
  - "dog"

# Dataset statistics (optional, can be auto-generated)
total_images: 1000
total_annotations: 5000
class_distribution:
  person: 2000
  car: 1500
  bicycle: 800
  dog: 700

# Additional metadata
license: "MIT"
tags: ["object-detection", "street-scene"]
resolution: "1920x1080"
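
A quick way to sanity-check a metadata file is to load it with PyYAML (a separate dependency: pip install pyyaml) and inspect the fields. The snippet below only assumes the structure shown above:

python
import yaml

# Load the dataset metadata (structure as shown above)
with open("my_dataset/dataset.yaml", "r") as f:
    metadata = yaml.safe_load(f)

print(f"Dataset: {metadata['name']} v{metadata['version']}")
print(f"Classes: {metadata['classes']}")

# Statistics are optional, so fall back gracefully
print(f"Total images: {metadata.get('total_images', 'not recorded')}")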

Example Dataset

You can find a complete example dataset structure at:

https://github.com/muhamed555/cvpal/tree/main/examples/sample_dataset

Functions

merge_datasets

Merge multiple datasets into a single unified dataset with proper label mapping and conflict resolution.

Function Signature

python
preprocessing.merge_datasets(
    dataset_paths: List[str],
    output_path: str,
    label_mapping: Optional[Dict[str, str]] = None,
    conflict_resolution: str = "rename"
) -> str

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset_paths | List[str] | - | List of paths to datasets to merge |
| output_path | str | - | Path where merged dataset will be saved |
| label_mapping | Dict[str, str] | None | Mapping to standardize labels across datasets |
| conflict_resolution | str | "rename" | How to handle filename conflicts: "rename", "skip", or "overwrite" |

Returns

str - Path to the merged dataset

Example

python
from cvpal import preprocessing

# Merge multiple datasets
dataset_paths = [
    "path/to/street_dataset",
    "path/to/pedestrian_dataset",
    "path/to/vehicle_dataset"
]

# Define label mapping to standardize classes
label_mapping = {
    "person": "pedestrian",
    "people": "pedestrian",
    "car": "vehicle",
    "truck": "vehicle",
    "bike": "bicycle"
}

merged_dataset = preprocessing.merge_datasets(
    dataset_paths=dataset_paths,
    output_path="merged_dataset",
    label_mapping=label_mapping,
    conflict_resolution="rename"
)
print(f"Merged dataset saved to: {merged_dataset}")
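
When merging many datasets, it helps to see which class names each one declares before writing a label_mapping. A minimal sketch (the collect_classes helper is hypothetical, and it assumes each dataset ships the dataset.yaml described above):

python
import os
import yaml

def collect_classes(dataset_paths):
    """Read the classes list from each dataset's dataset.yaml."""
    classes_by_dataset = {}
    for path in dataset_paths:
        with open(os.path.join(path, "dataset.yaml"), "r") as f:
            classes_by_dataset[path] = set(yaml.safe_load(f)["classes"])
    return classes_by_dataset

# Print each dataset's vocabulary to spot overlaps and near-duplicates
for path, names in collect_classes(dataset_paths).items():
    print(path, "->", sorted(names))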

generate_report

Generate a comprehensive analysis report for a dataset including statistics, visualizations, and quality metrics.

Function Signature

python
preprocessing.generate_report(
    dataset_path: str,
    output_path: Optional[str] = None,
    include_visualizations: bool = True,
    include_quality_metrics: bool = True
) -> Dict[str, Any]

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset_path | str | - | Path to the dataset to analyze |
| output_path | str | None | Path to save the report (optional) |
| include_visualizations | bool | True | Whether to include charts and plots |
| include_quality_metrics | bool | True | Whether to include data quality analysis |

Returns

Dict[str, Any] - Dictionary containing report data and statistics

Example

python
from cvpal import preprocessing

# Generate a comprehensive dataset report
report = preprocessing.generate_report(
    dataset_path="my_dataset",
    output_path="dataset_report.html",
    include_visualizations=True,
    include_quality_metrics=True
)

# Access report data
print(f"Total images: {report['total_images']}")
print(f"Total annotations: {report['total_annotations']}")
print(f"Class distribution: {report['class_distribution']}")
print(f"Average objects per image: {report['avg_objects_per_image']}")

# Quality metrics
if 'quality_metrics' in report:
    print(f"Label consistency: {report['quality_metrics']['label_consistency']}")
    print(f"Image quality score: {report['quality_metrics']['image_quality']}")
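
Since the report is a plain dictionary, you can layer your own checks on top of it. For instance, this sketch flags classes whose annotation count falls well below the average; it relies only on the class_distribution key shown above:

python
# Flag classes with far fewer annotations than average
distribution = report['class_distribution']
average = sum(distribution.values()) / len(distribution)

for class_name, count in sorted(distribution.items(), key=lambda item: item[1]):
    if count < 0.5 * average:
        print(f"⚠️ '{class_name}' is underrepresented: {count} annotations (average {average:.0f})")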

replace_labels

Replace or standardize labels across a dataset using a mapping dictionary.

Function Signature

python
preprocessing.replace_labels(
    dataset_path: str,
    label_mapping: Dict[str, str],
    backup: bool = True
) -> Dict[str, int]

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset_path | str | - | Path to the dataset to modify |
| label_mapping | Dict[str, str] | - | Mapping from old labels to new labels |
| backup | bool | True | Whether to create a backup before modification |

Returns

Dict[str, int] - Dictionary showing how many labels were replaced for each mapping

Example

python
from cvpal import preprocessing

# Define label replacements
label_mapping = {
    "person": "pedestrian",
    "people": "pedestrian",
    "car": "vehicle",
    "truck": "vehicle",
    "bike": "bicycle",
    "motorcycle": "bicycle"
}

# Replace labels in the dataset
replacement_stats = preprocessing.replace_labels(
    dataset_path="my_dataset",
    label_mapping=label_mapping,
    backup=True
)

# Check replacement statistics
print("Label replacement statistics:")
for old_label, new_label in label_mapping.items():
    count = replacement_stats.get(f"{old_label} -> {new_label}", 0)
    print(f"  {old_label} -> {new_label}: {count} labels replaced")
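
Keep in mind that YOLO label files store numeric class IDs rather than names, so a rename ultimately means updating the classes list in dataset.yaml and remapping the IDs in every .txt file. The sketch below only illustrates that idea; it is not cvpal's internal implementation:

python
def remap_label_file(path, old_classes, new_classes, name_mapping):
    """Rewrite one YOLO .txt file so its class IDs index into new_classes."""
    remapped = []
    with open(path, "r") as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            old_name = old_classes[int(parts[0])]            # ID -> old name
            new_name = name_mapping.get(old_name, old_name)  # apply mapping
            parts[0] = str(new_classes.index(new_name))      # new name -> new ID
            remapped.append(" ".join(parts))
    with open(path, "w") as f:
        f.write("\n".join(remapped) + "\n")

# "person" and "people" both collapse onto the ID of "pedestrian"
remap_label_file(
    "my_dataset/labels/image1.txt",
    old_classes=["person", "people", "car"],
    new_classes=["pedestrian", "vehicle"],
    name_mapping={"person": "pedestrian", "people": "pedestrian", "car": "vehicle"},
)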

validate_dataset

Validate dataset structure, label format, and data integrity.

Function Signature

python
preprocessing.validate_dataset(
    dataset_path: str,
    strict: bool = True
) -> Dict[str, Any]

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset_path | str | - | Path to the dataset to validate |
| strict | bool | True | Whether to use strict validation rules |

Returns

Dict[str, Any] - Validation results including errors, warnings, and statistics

Example

python
from cvpal import preprocessing

# Validate dataset structure and integrity
validation_result = preprocessing.validate_dataset(
    dataset_path="my_dataset",
    strict=True
)

# Check validation results
if validation_result['is_valid']:
    print("✅ Dataset is valid!")
    print(f"Total images: {validation_result['total_images']}")
    print(f"Total annotations: {validation_result['total_annotations']}")
else:
    print("❌ Dataset validation failed!")
    print("Errors:")
    for error in validation_result['errors']:
        print(f"  - {error}")
    print("Warnings:")
    for warning in validation_result['warnings']:
        print(f"  - {warning}")
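
To get a feel for what structural validation involves, here is a minimal hand-rolled check that mirrors the required layout (a sketch, not cvpal's actual validation logic):

python
from pathlib import Path

def basic_structure_check(dataset_path):
    """Return a list of structural problems found in a dataset."""
    root = Path(dataset_path)
    errors = []
    # Required top-level entries
    for required in ("images", "labels", "dataset.yaml"):
        if not (root / required).exists():
            errors.append(f"missing {required}")
    # Every image should have a matching YOLO label file
    if not errors:
        for image in (root / "images").iterdir():
            if not (root / "labels" / (image.stem + ".txt")).exists():
                errors.append(f"no label file for {image.name}")
    return errors

for problem in basic_structure_check("my_dataset"):
    print(f"  - {problem}")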

split_dataset

Split a dataset into train, validation, and test sets with proper stratification.

Function Signature

python
preprocessing.split_dataset(
    dataset_path: str,
    output_path: str,
    train_ratio: float = 0.7,
    val_ratio: float = 0.2,
    test_ratio: float = 0.1,
    stratify: bool = True
) -> Dict[str, str]

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset_path | str | - | Path to the dataset to split |
| output_path | str | - | Base path for the split datasets |
| train_ratio | float | 0.7 | Proportion for the training set |
| val_ratio | float | 0.2 | Proportion for the validation set |
| test_ratio | float | 0.1 | Proportion for the test set |
| stratify | bool | True | Whether to maintain class distribution across splits |

Returns

Dict[str, str] - Dictionary with paths to train, val, and test datasets

Example

python
from cvpal import preprocessing

# Split dataset into train/val/test sets
split_result = preprocessing.split_dataset(
    dataset_path="my_dataset",
    output_path="split_datasets",
    train_ratio=0.8,
    val_ratio=0.1,
    test_ratio=0.1,
    stratify=True
)

print("Dataset split completed!")
print(f"Train set: {split_result['train']}")
print(f"Validation set: {split_result['val']}")
print(f"Test set: {split_result['test']}")
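
The three ratios are presumably expected to sum to 1.0 so that the splits cover the whole dataset (the defaults 0.7 + 0.2 + 0.1 do). A quick guard before calling split_dataset catches typos early:

python
import math

train_ratio, val_ratio, test_ratio = 0.8, 0.1, 0.1

# Guard against ratios that do not cover the whole dataset
assert math.isclose(train_ratio + val_ratio + test_ratio, 1.0), \
    "train/val/test ratios must sum to 1.0"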

Complete Example

Here's a complete example showing how to use the preprocessing module for dataset management:

python
from cvpal import preprocessing
import os

def process_dataset_pipeline(input_datasets, output_dir):
    """Complete dataset processing pipeline."""
    os.makedirs(output_dir, exist_ok=True)

    # 1. Validate input datasets
    print("Validating input datasets...")
    for dataset_path in input_datasets:
        validation = preprocessing.validate_dataset(dataset_path)
        if not validation['is_valid']:
            print(f"❌ Dataset {dataset_path} is invalid!")
            return
        print(f"✅ Dataset {dataset_path} is valid")

    # 2. Merge datasets with label standardization
    print("\nMerging datasets...")
    label_mapping = {
        "person": "pedestrian",
        "people": "pedestrian",
        "car": "vehicle",
        "truck": "vehicle"
    }
    merged_path = preprocessing.merge_datasets(
        dataset_paths=input_datasets,
        output_path=os.path.join(output_dir, "merged_dataset"),
        label_mapping=label_mapping
    )
    print(f"✅ Merged dataset saved to: {merged_path}")

    # 3. Replace labels for consistency
    print("\nStandardizing labels...")
    replacement_stats = preprocessing.replace_labels(
        dataset_path=merged_path,
        label_mapping=label_mapping
    )
    print(f"✅ Replaced {sum(replacement_stats.values())} labels")

    # 4. Split into train/val/test
    print("\nSplitting dataset...")
    split_result = preprocessing.split_dataset(
        dataset_path=merged_path,
        output_path=os.path.join(output_dir, "split_datasets"),
        train_ratio=0.8,
        val_ratio=0.1,
        test_ratio=0.1
    )
    print("✅ Dataset split completed")

    # 5. Generate reports for each split
    print("\nGenerating reports...")
    for split_name, split_path in split_result.items():
        report = preprocessing.generate_report(
            dataset_path=split_path,
            output_path=os.path.join(output_dir, f"{split_name}_report.html")
        )
        print(f"✅ {split_name.capitalize()} report: "
              f"{report['total_images']} images, {report['total_annotations']} annotations")

    print("\n🎉 Dataset processing pipeline completed!")

# Example usage
input_datasets = [
    "dataset1",
    "dataset2",
    "dataset3"
]
process_dataset_pipeline(input_datasets, "processed_datasets")

Error Handling

The preprocessing module includes comprehensive error handling for common issues:

Invalid Dataset Structure

Missing required folders or files raise a DatasetStructureError:

python
# Import path assumed; adjust to wherever cvpal exposes its exceptions
from cvpal.preprocessing import DatasetStructureError

try:
    preprocessing.validate_dataset("invalid_dataset")
except DatasetStructureError as e:
    print(f"Dataset structure error: {e}")

Label Format Error

Invalid label files raise a LabelFormatError:

python
# Import path assumed; adjust to wherever cvpal exposes its exceptions
from cvpal.preprocessing import LabelFormatError

try:
    preprocessing.replace_labels("dataset", {"old": "new"})
except LabelFormatError as e:
    print(f"Label format error: {e}")
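
Both exceptions can also be caught together when wrapping a multi-step pipeline; as above, the import path is an assumption:

python
# Import path assumed; adjust to wherever cvpal exposes its exceptions
from cvpal.preprocessing import DatasetStructureError, LabelFormatError
from cvpal import preprocessing

try:
    preprocessing.validate_dataset("my_dataset")
    preprocessing.replace_labels("my_dataset", {"person": "pedestrian"})
except (DatasetStructureError, LabelFormatError) as e:
    print(f"Preprocessing failed: {e}")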

Best Practices

Dataset Organization

  • Always follow the required dataset structure (images/, labels/, dataset.yaml)
  • Use descriptive names for your dataset and classes
  • Include comprehensive metadata in the YAML file
  • Create backups before making modifications

Label Management

  • Use consistent label naming across datasets
  • Validate labels before merging datasets
  • Document label mappings for future reference
  • Check for label conflicts when merging

Quality Assurance

  • Always validate datasets before processing
  • Generate reports to understand data distribution
  • Use stratified splitting for balanced datasets
  • Monitor quality metrics during processing