Synthetic Data Generation

dropna() Function

Remove images with no detected objects to maintain dataset quality and training effectiveness.

Overview

The dropna() function removes images that have no detected objects from your dataset, together with their label files. After it runs, every remaining image carries at least one label, so each one contributes annotated examples when you train a model.

Key Benefits

• Remove Empty Images: Clean up dataset automatically
• Improve Quality: Ensure all images have labels
• Optimize Training: Better model performance

Function Signature

python

def dropna(self) -> None:

What It Does

Automatic Cleanup Process

The dropna() function automatically:

• Identifies images with no detected objects
• Removes both image files and corresponding label files
• Updates the dataset structure
• Maintains file organization
• Provides feedback on removed files

text

# Before dropna()
# dataset/
# ├── images/
# │   ├── image001.jpg  # Has objects
# │   ├── image002.jpg  # No objects (empty)
# │   ├── image003.jpg  # Has objects
# │   └── image004.jpg  # No objects (empty)
# └── labels/
#     ├── image001.txt  # Contains labels
#     ├── image002.txt  # Empty file
#     ├── image003.txt  # Contains labels
#     └── image004.txt  # Empty file

detection_dataset.dropna()

# After dropna()
# dataset/
# ├── images/
# │   ├── image001.jpg  # Kept (has objects)
# │   └── image003.jpg  # Kept (has objects)
# └── labels/
#     ├── image001.txt  # Kept (has labels)
#     └── image003.txt  # Kept (has labels)
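The before/after listing above can be reproduced with a small standalone sketch. This is not cvpal's implementation: `drop_empty_pairs`, the `images/` and `labels/` folder names, and the set of image extensions are assumptions chosen to match the illustration, and "empty" is taken to mean a missing or blank label file.

```python
from pathlib import Path

# Assumed image extensions; adjust to whatever your dataset actually contains.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def drop_empty_pairs(dataset_dir):
    """Delete images whose label file is missing or blank.

    Hypothetical stand-in for illustration only; returns the removed image paths.
    """
    root = Path(dataset_dir)
    images_dir, labels_dir = root / "images", root / "labels"
    removed = []
    # sorted() materializes the listing, so deleting files while looping is safe.
    for image_path in sorted(images_dir.iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        label_path = labels_dir / f"{image_path.stem}.txt"
        has_labels = label_path.exists() and label_path.read_text().strip() != ""
        if has_labels:
            continue
        image_path.unlink()          # remove the image file
        if label_path.exists():
            label_path.unlink()      # remove its (blank) label file
        removed.append(image_path)
    print(f"Removed {len(removed)} empty image(s)")
    return removed
```

Run against the four-image layout shown above, this sketch would delete the image002 and image004 pairs and leave image001 and image003 untouched.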

Basic Examples

Simple Cleanup

python

from cvpal.generate import DetectionDataset

# Initialize and generate dataset
detection_dataset = DetectionDataset()
detection_dataset.generate(
    prompt="a cat sitting on a chair",
    num_images=10,
    labels=["cat", "chair"],
    output_type="yolo"
)

# Check for empty images first
empty_images = detection_dataset.isnull()
print(f"Found {len(empty_images)} empty images")

# Remove empty images
detection_dataset.dropna()

# Verify cleanup
remaining_empty = detection_dataset.isnull()
print(f"Remaining empty images: {len(remaining_empty)}")

Quality Control Workflow

python

# Complete quality control workflow
def clean_dataset(detection_dataset):
    # 1. Check current state
    empty_images = detection_dataset.isnull()
    total_images = len(detection_dataset.images)  # Assuming this property exists
    
    print(f"Dataset status:")
    print(f"  Total images: {total_images}")
    print(f"  Empty images: {len(empty_images)}")
    
    if len(empty_images) == 0:
        print("✅ Dataset is already clean!")
        return
    
    # 2. Calculate quality percentage
    quality_percentage = ((total_images - len(empty_images)) / total_images) * 100
    print(f"  Quality: {quality_percentage:.1f}%")
    
    # 3. Clean dataset
    print("Cleaning dataset...")
    detection_dataset.dropna()
    
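    # 4. Verify results (guard against division by zero if every image was removed)
    final_empty = detection_dataset.isnull()
    final_total = len(detection_dataset.images)
    if final_total == 0:
        print("⚠️  No images remain after cleanup")
        return
    final_quality = ((final_total - len(final_empty)) / final_total) * 100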
    
    print(f"After cleanup:")
    print(f"  Total images: {final_total}")
    print(f"  Empty images: {len(final_empty)}")
    print(f"  Quality: {final_quality:.1f}%")
    
    if len(final_empty) == 0:
        print("✅ Dataset cleanup complete!")
    else:
        print("⚠️  Some empty images remain")

# Use the workflow
clean_dataset(detection_dataset)

Advanced Usage

Batch Processing with Cleanup

Generate multiple batches and clean up after each:

python

# Generate multiple batches with cleanup
prompts = [
    "a cat sitting on a chair",
    "a dog running in a park",
    "a person riding a bicycle"
]

for i, prompt in enumerate(prompts):
    print(f"\nGenerating batch {i+1}: {prompt}")
    
    # Generate batch
    detection_dataset.generate(
        prompt=prompt,
        num_images=5,
        labels=["cat", "dog", "person"][i:i+1],
        output_type="yolo",
        overwrite=False
    )
    
    # Check quality
    empty_images = detection_dataset.isnull()
    print(f"  Generated 5 images, {len(empty_images)} empty")
    
    # Clean up if needed
    if len(empty_images) > 1:  # Clean up only when more than one image is empty
        print(f"  Cleaning up {len(empty_images)} empty images...")
        detection_dataset.dropna()
        print("  Cleanup complete")
    
    # Show final stats
    final_empty = detection_dataset.isnull()
    print(f"  Final empty images: {len(final_empty)}")

Selective Cleanup

Run cleanup only when the share of empty images exceeds a threshold:

python

# Threshold-based cleanup with analysis
from pathlib import Path

def selective_cleanup(detection_dataset, threshold=0.1):
    empty_images = detection_dataset.isnull()
    total_images = len(detection_dataset.images)
    
    if len(empty_images) == 0:
        return
    
    empty_percentage = len(empty_images) / total_images
    
    print(f"Empty images: {len(empty_images)} ({empty_percentage:.1%})")
    
    if empty_percentage > threshold:
        print(f"Empty percentage ({empty_percentage:.1%}) exceeds threshold ({threshold:.1%})")
        print("Performing cleanup...")
        
        # Show which images will be removed
        print("Removing images:")
        for img_path in empty_images:
            filename = Path(img_path).name  # uses the OS path separator
            print(f"  - {filename}")
        
        # Perform cleanup
        detection_dataset.dropna()
        
        # Verify results
        remaining_empty = detection_dataset.isnull()
        print(f"Cleanup complete. Remaining empty images: {len(remaining_empty)}")
    else:
        print(f"Empty percentage ({empty_percentage:.1%}) is acceptable")

# Use selective cleanup
selective_cleanup(detection_dataset, threshold=0.15)  # 15% threshold

Safety Considerations

⚠️ Irreversible Operation

Important: dropna() permanently removes files from your dataset.

Recommendation: Always backup your dataset before running dropna(), or use isnull() first to review what will be removed.
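One way to take that backup is to copy the dataset folder to a timestamped location before cleanup. The sketch below uses only the standard library; `backup_dataset`, the `backups` folder name, and the assumption that the dataset lives in a single directory on disk are illustrative and not part of cvpal.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_dataset(dataset_dir, backup_root="backups"):
    """Copy the dataset directory to <backup_root>/<name>-<timestamp>.

    Hypothetical helper for illustration; returns the backup path.
    """
    source = Path(dataset_dir).resolve()
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{source.name}-{stamp}"
    # copytree creates missing parent folders and raises if target already exists.
    shutil.copytree(source, target)
    return target
```

Calling this helper right before detection_dataset.dropna() leaves a full copy to restore from if more files are removed than expected.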

📁 File System Impact

Impact: Both image files (.jpg, .png) and label files (.txt) are removed.

Consideration: Deleting files frees disk space rather than consuming it, but a backup copy needs room of its own. Also make sure no other process is reading or writing the dataset files during cleanup.

🔄 Dataset Consistency

Maintained: File naming consistency and dataset structure are preserved.

Note: Image indices may have gaps after cleanup, but this doesn't affect functionality.

Best Practices

✅ Recommended Workflow

• Always use isnull() first to review
• Set quality thresholds before cleanup
• Backup dataset before major cleanup
• Clean up in batches, not all at once
• Verify results after cleanup

⚠️ Common Mistakes

• Running dropna() without checking first
• Not backing up before cleanup
• Setting unrealistic quality thresholds
• Not verifying cleanup results
• Cleaning up too aggressively

Integration Examples

Complete Dataset Generation Pipeline

python

from cvpal.generate import DetectionDataset


def generate_clean_dataset(prompts, labels_per_prompt, num_images_per_prompt=5):
    """
    Complete pipeline: Generate -> Check -> Clean -> Verify
    """
    detection_dataset = DetectionDataset()
    
    for i, (prompt, labels) in enumerate(zip(prompts, labels_per_prompt)):
        print(f"\n=== Batch {i+1}: {prompt} ===")
        
        # Generate batch
        detection_dataset.generate(
            prompt=prompt,
            num_images=num_images_per_prompt,
            labels=labels,
            output_type="yolo",
            overwrite=False
        )
        
        # Check quality
        empty_images = detection_dataset.isnull()
        print(f"Generated {num_images_per_prompt} images, {len(empty_images)} empty")
        
        # Clean if needed
        if len(empty_images) > 0:
            print(f"Cleaning {len(empty_images)} empty images...")
            detection_dataset.dropna()
        
        # Verify
        final_empty = detection_dataset.isnull()
        print(f"Batch complete. Empty images: {len(final_empty)}")
    
    # Final cleanup
    print("\n=== Final Cleanup ===")
    final_empty = detection_dataset.isnull()
    if len(final_empty) > 0:
        print(f"Final cleanup: removing {len(final_empty)} empty images")
        detection_dataset.dropna()
    
    # Show final results
    detection_dataset.show_samples(num_samples=3)
    print("\n✅ Dataset generation complete!")

# Example usage
prompts = [
    "a cat sitting on a chair",
    "a dog running in a park",
    "a person riding a bicycle"
]

labels_per_prompt = [
    ["cat", "chair"],
    ["dog"],
    ["person", "bicycle"]
]

generate_clean_dataset(prompts, labels_per_prompt, num_images_per_prompt=3)

Troubleshooting

Files Not Removed

Issue: dropna() runs but files remain.

Solutions: Check file permissions, ensure files aren't locked by other processes, or verify the detection dataset state.
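A quick permission check can narrow this down. The sketch below is a standalone helper, not a cvpal function; `find_undeletable` is a hypothetical name. Note that `os.access` is only advisory: it cannot detect files locked by another process, and it always reports success when run as an administrator or root.

```python
import os
from pathlib import Path


def find_undeletable(dataset_dir):
    """List files the current user likely cannot delete (hypothetical helper)."""
    root = Path(dataset_dir)
    blocked = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # On POSIX, deleting a file requires write and search access to its folder.
        parent_ok = os.access(path.parent, os.W_OK | os.X_OK)
        # On Windows, a read-only file attribute blocks deletion.
        file_ok = os.access(path, os.W_OK)
        if not (parent_ok and file_ok):
            blocked.append(path)
    return blocked
```

An empty result does not rule out locked files, so closing image viewers or training jobs that hold the dataset open is still worth trying.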

Dataset Structure Broken

Issue: Dataset structure becomes inconsistent after cleanup.

Solutions: Regenerate from backup, check file paths, or recreate dataset structure manually.
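To see where the structure is inconsistent, it can help to list images without a label file and label files without an image. This sketch assumes the `images/` and `labels/` layout shown earlier and matching file stems; `find_orphans` and the extension set are illustrative names, not part of cvpal.

```python
from pathlib import Path

# Assumed image extensions; adjust to your dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def find_orphans(dataset_dir):
    """Report image/label files that have no counterpart (hypothetical helper)."""
    root = Path(dataset_dir)
    image_stems = {
        p.stem
        for p in (root / "images").iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
    }
    label_stems = {p.stem for p in (root / "labels").glob("*.txt")}
    return {
        "images_without_labels": sorted(image_stems - label_stems),
        "labels_without_images": sorted(label_stems - image_stems),
    }
```

Both lists being empty means every image has a label file and vice versa; any names that appear are the files to restore or remove by hand.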

Too Many Files Removed

Issue: More files removed than expected.

Solutions: Restore from backup, check detection thresholds, or regenerate problematic batches.
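If a backup copy exists, comparing it with the current dataset shows exactly which files the cleanup removed. The helper below is a sketch under that assumption; `removed_since_backup` is a hypothetical name, and both arguments are plain directory paths.

```python
from pathlib import Path


def removed_since_backup(backup_dir, dataset_dir):
    """Return relative paths present in the backup but missing from the dataset."""
    backup, current = Path(backup_dir), Path(dataset_dir)
    before = {p.relative_to(backup) for p in backup.rglob("*") if p.is_file()}
    after = {p.relative_to(current) for p in current.rglob("*") if p.is_file()}
    return sorted(before - after)
```

The returned paths can be reviewed one by one and copied back from the backup if they were removed by mistake.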
