Synthetic Data Generation
dropna() Function
Remove images with no detected objects to maintain dataset quality and training effectiveness.
Overview
The dropna() function removes images that have no detected objects from your dataset. This is essential for maintaining dataset quality and ensuring all images contribute meaningful training data for your models.
Key Benefits
Remove Empty Images
Clean up dataset automatically
Improve Quality
Ensure all images have labels
Optimize Training
Better model performance
Function Signature
def dropna(self) -> None:
What It Does
Automatic Cleanup Process
The dropna() function automatically:
- Identifies images with no detected objects
- Removes both image files and corresponding label files
- Updates the dataset structure
- Maintains file organization
- Provides feedback on removed files
# Before dropna()
# dataset/
# ├── images/
# │   ├── image001.jpg  # Has objects
# │   ├── image002.jpg  # No objects (empty)
# │   ├── image003.jpg  # Has objects
# │   └── image004.jpg  # No objects (empty)
# └── labels/
#     ├── image001.txt  # Contains labels
#     ├── image002.txt  # Empty file
#     ├── image003.txt  # Contains labels
#     └── image004.txt  # Empty file

detection_dataset.dropna()

# After dropna()
# dataset/
# ├── images/
# │   ├── image001.jpg  # Kept (has objects)
# │   └── image003.jpg  # Kept (has objects)
# └── labels/
#     ├── image001.txt  # Kept (has labels)
#     └── image003.txt  # Kept (has labels)
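To make the before/after layout concrete, here is a minimal standalone sketch of the same kind of cleanup on a YOLO-style folder. It is not cvpal's implementation: `find_empty_pairs` and `remove_pairs` are hypothetical helpers, and the sketch assumes an image counts as empty when its label file is missing or contains only whitespace, as the tree above suggests.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed image extensions


def find_empty_pairs(dataset_dir):
    """Return (image_path, label_path) pairs whose label file is missing or blank."""
    root = Path(dataset_dir)
    pairs = []
    for image_path in sorted((root / "images").iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        label_path = root / "labels" / (image_path.stem + ".txt")
        if not label_path.exists() or not label_path.read_text().strip():
            pairs.append((image_path, label_path))
    return pairs


def remove_pairs(pairs):
    """Delete each image and its label file; return the number of pairs removed."""
    for image_path, label_path in pairs:
        image_path.unlink()
        label_path.unlink(missing_ok=True)  # the label may not exist at all
    return len(pairs)
```

Keeping the "find" step separate from the "remove" step lets you print or log the pairs before anything is deleted.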
Basic Examples
Simple Cleanup
from cvpal.generate import DetectionDataset

# Initialize and generate dataset
detection_dataset = DetectionDataset()
detection_dataset.generate(
    prompt="a cat sitting on a chair",
    num_images=10,
    labels=["cat", "chair"],
    output_type="yolo",
)

# Check for empty images first
empty_images = detection_dataset.isnull()
print(f"Found {len(empty_images)} empty images")

# Remove empty images
detection_dataset.dropna()

# Verify cleanup
remaining_empty = detection_dataset.isnull()
print(f"Remaining empty images: {len(remaining_empty)}")
Quality Control Workflow
# Complete quality control workflow
def clean_dataset(detection_dataset):
    # 1. Check current state
    empty_images = detection_dataset.isnull()
    total_images = len(detection_dataset.images)  # Assuming this property exists

    print("Dataset status:")
    print(f" Total images: {total_images}")
    print(f" Empty images: {len(empty_images)}")

    if len(empty_images) == 0:
        print("✅ Dataset is already clean!")
        return

    # 2. Calculate quality percentage
    quality_percentage = ((total_images - len(empty_images)) / total_images) * 100
    print(f" Quality: {quality_percentage:.1f}%")

    # 3. Clean dataset
    print("Cleaning dataset...")
    detection_dataset.dropna()

    # 4. Verify results
    final_empty = detection_dataset.isnull()
    final_total = len(detection_dataset.images)
    # Guard against division by zero when every image was empty and removed
    if final_total:
        final_quality = ((final_total - len(final_empty)) / final_total) * 100
    else:
        final_quality = 0.0

    print("After cleanup:")
    print(f" Total images: {final_total}")
    print(f" Empty images: {len(final_empty)}")
    print(f" Quality: {final_quality:.1f}%")

    if len(final_empty) == 0:
        print("✅ Dataset cleanup complete!")
    else:
        print("⚠️ Some empty images remain")

# Use the workflow
clean_dataset(detection_dataset)
Advanced Usage
Batch Processing with Cleanup
Generate multiple batches and clean up after each:
# Generate multiple batches with cleanup
prompts = [
    "a cat sitting on a chair",
    "a dog running in a park",
    "a person riding a bicycle",
]

for i, prompt in enumerate(prompts):
    print(f"\nGenerating batch {i+1}: {prompt}")

    # Generate batch
    detection_dataset.generate(
        prompt=prompt,
        num_images=5,
        labels=["cat", "dog", "person"][i:i+1],
        output_type="yolo",
        overwrite=False,
    )

    # Check quality
    empty_images = detection_dataset.isnull()
    print(f" Generated 5 images, {len(empty_images)} empty")

    # Clean up if needed
    if len(empty_images) > 1:  # Threshold for cleanup
        print(f" Cleaning up {len(empty_images)} empty images...")
        detection_dataset.dropna()
        print(" Cleanup complete")

# Show final stats
final_empty = detection_dataset.isnull()
print(f" Final empty images: {len(final_empty)}")
Selective Cleanup
Clean up only when the proportion of empty images exceeds a threshold:
# Advanced cleanup with analysis
def selective_cleanup(detection_dataset, threshold=0.1):
    empty_images = detection_dataset.isnull()
    total_images = len(detection_dataset.images)

    if len(empty_images) == 0:
        return

    empty_percentage = len(empty_images) / total_images
    print(f"Empty images: {len(empty_images)} ({empty_percentage:.1%})")

    if empty_percentage > threshold:
        print(f"Empty percentage ({empty_percentage:.1%}) exceeds threshold ({threshold:.1%})")
        print("Performing cleanup...")

        # Show which images will be removed
        print("Removing images:")
        for img_path in empty_images:
            filename = img_path.split('/')[-1]
            print(f" - {filename}")

        # Perform cleanup
        detection_dataset.dropna()

        # Verify results
        remaining_empty = detection_dataset.isnull()
        print(f"Cleanup complete. Remaining empty images: {len(remaining_empty)}")
    else:
        print(f"Empty percentage ({empty_percentage:.1%}) is acceptable")

# Use selective cleanup
selective_cleanup(detection_dataset, threshold=0.15)  # 15% threshold
Safety Considerations
⚠️ Irreversible Operation
Important: dropna() permanently removes files from your dataset.
Recommendation: Always backup your dataset before running dropna(), or use isnull() first to review what will be removed.
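One way to follow this recommendation is to copy the dataset folder before cleanup. The sketch below uses only the Python standard library; `backup_dataset` is a hypothetical helper, and the folder paths you pass in are assumptions, since where your dataset is stored depends on your setup.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_dataset(dataset_dir, backup_root):
    """Copy the whole dataset folder to a timestamped directory and return its path."""
    source = Path(dataset_dir)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{source.name}-{stamp}"
    shutil.copytree(source, target)  # raises FileExistsError if target already exists
    return target
```

Call it right before `detection_dataset.dropna()`, for example `backup_dataset("dataset", "backups")`, and keep the returned path so you know where to restore from.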
File System Impact
Impact: Both image files (.jpg, .png) and label files (.txt) are removed.
Consideration: Make sure no other process is reading or writing the dataset files during cleanup. If you create a backup first, confirm there is enough disk space for the copy.
Dataset Consistency
Maintained: File naming consistency and dataset structure are preserved.
Note: Image indices may have gaps after cleanup, but this doesn't affect functionality.
Best Practices
✅ Recommended Workflow
- Always use isnull() first to review
- Set quality thresholds before cleanup
- Backup dataset before major cleanup
- Clean up in batches, not all at once
- Verify results after cleanup
⚠️ Common Mistakes
- Running dropna() without checking first
- Not backing up before cleanup
- Setting unrealistic quality thresholds
- Not verifying cleanup results
- Cleaning up too aggressively
Integration Examples
Complete Dataset Generation Pipeline
from cvpal.generate import DetectionDataset


def generate_clean_dataset(prompts, labels_per_prompt, num_images_per_prompt=5):
    """Complete pipeline: Generate -> Check -> Clean -> Verify"""
    detection_dataset = DetectionDataset()

    for i, (prompt, labels) in enumerate(zip(prompts, labels_per_prompt)):
        print(f"\n=== Batch {i+1}: {prompt} ===")

        # Generate batch
        detection_dataset.generate(
            prompt=prompt,
            num_images=num_images_per_prompt,
            labels=labels,
            output_type="yolo",
            overwrite=False,
        )

        # Check quality
        empty_images = detection_dataset.isnull()
        print(f"Generated {num_images_per_prompt} images, {len(empty_images)} empty")

        # Clean if needed
        if len(empty_images) > 0:
            print(f"Cleaning {len(empty_images)} empty images...")
            detection_dataset.dropna()

        # Verify
        final_empty = detection_dataset.isnull()
        print(f"Batch complete. Empty images: {len(final_empty)}")

    # Final cleanup
    print("\n=== Final Cleanup ===")
    final_empty = detection_dataset.isnull()
    if len(final_empty) > 0:
        print(f"Final cleanup: removing {len(final_empty)} empty images")
        detection_dataset.dropna()

    # Show final results
    detection_dataset.show_samples(num_samples=3)
    print("\n✅ Dataset generation complete!")


# Example usage
prompts = [
    "a cat sitting on a chair",
    "a dog running in a park",
    "a person riding a bicycle",
]
labels_per_prompt = [
    ["cat", "chair"],
    ["dog"],
    ["person", "bicycle"],
]

generate_clean_dataset(prompts, labels_per_prompt, num_images_per_prompt=3)
Troubleshooting
Files Not Removed
Issue: dropna() runs but files remain.
Solutions: Check file permissions, ensure files aren't locked by other processes, or verify the detection dataset state.
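A quick way to check the permissions part is to test whether the current user can modify the folders that hold the files. This is a rough sketch with a hypothetical helper, `find_undeletable`. It assumes POSIX-style semantics, where deleting a file requires write and execute permission on its parent directory. On Windows, `os.access` is less reliable, and a file locked by another process will not be detected by this check.

```python
import os
from pathlib import Path


def find_undeletable(paths):
    """Return the existing paths whose parent directory the current user cannot modify."""
    blocked = []
    for path in map(Path, paths):
        # Removing a directory entry needs write + execute permission on the directory.
        if path.exists() and not os.access(path.parent, os.W_OK | os.X_OK):
            blocked.append(path)
    return blocked
```

Since the examples on this page treat the result of `isnull()` as a list of image paths, something like `find_undeletable(detection_dataset.isnull())` should point at the files worth inspecting.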
Dataset Structure Broken
Issue: Dataset structure becomes inconsistent after cleanup.
Solutions: Regenerate from backup, check file paths, or recreate dataset structure manually.
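Before recreating anything by hand, it can help to see exactly where the structure is inconsistent. The sketch below lists images without a label file and label files without an image. `find_orphans` is a hypothetical helper, and the `images/` and `labels/` layout with `.txt` labels is assumed from the directory tree shown earlier on this page.

```python
from pathlib import Path


def find_orphans(dataset_dir, image_suffixes=(".jpg", ".jpeg", ".png")):
    """Return (images_without_labels, labels_without_images) as sorted lists of stems."""
    root = Path(dataset_dir)
    image_stems = {
        p.stem for p in (root / "images").iterdir()
        if p.suffix.lower() in image_suffixes
    }
    label_stems = {p.stem for p in (root / "labels").glob("*.txt")}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)
```

If both lists come back empty, every image has a matching label file and the pairing is intact.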
Too Many Files Removed
Issue: More files removed than expected.
Solutions: Restore from backup, check detection thresholds, or regenerate problematic batches.
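If you kept a backup copy of the dataset folder, comparing it against the current folder shows exactly which files were removed, so you can restore only those. This is a minimal sketch with a hypothetical helper, `removed_since_backup`; both folder paths are assumptions about your setup.

```python
from pathlib import Path


def removed_since_backup(backup_dir, dataset_dir):
    """List relative paths of files present in the backup but missing from the dataset."""
    backup, current = Path(backup_dir), Path(dataset_dir)
    backup_files = {p.relative_to(backup) for p in backup.rglob("*") if p.is_file()}
    current_files = {p.relative_to(current) for p in current.rglob("*") if p.is_file()}
    return sorted(p.as_posix() for p in backup_files - current_files)
```

Each returned entry is a path relative to the dataset root, such as `images/image002.jpg`, which you can copy back from the backup individually.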