Synthetic Data Generation

dropna() Function

Remove images with no detected objects to maintain dataset quality and training effectiveness.

Overview

The dropna() function removes images that have no detected objects from your dataset. This is essential for maintaining dataset quality and ensuring all images contribute meaningful training data for your models.

Key Benefits

  • Remove Empty Images: cleans up the dataset automatically
  • Improve Quality: ensures every image has labels
  • Optimize Training: better model performance

Function Signature

python
def dropna(self) -> None:

What It Does

Automatic Cleanup Process

The dropna() function automatically:

  • β€’ Identifies images with no detected objects
  • β€’ Removes both image files and corresponding label files
  • β€’ Updates the dataset structure
  • β€’ Maintains file organization
  • β€’ Provides feedback on removed files

text
# Before dropna()
# dataset/
# ├── images/
# │   ├── image001.jpg   # Has objects
# │   ├── image002.jpg   # No objects (empty)
# │   ├── image003.jpg   # Has objects
# │   └── image004.jpg   # No objects (empty)
# └── labels/
#     ├── image001.txt   # Contains labels
#     ├── image002.txt   # Empty file
#     ├── image003.txt   # Contains labels
#     └── image004.txt   # Empty file

detection_dataset.dropna()

# After dropna()
# dataset/
# ├── images/
# │   ├── image001.jpg   # Kept (has objects)
# │   └── image003.jpg   # Kept (has objects)
# └── labels/
#     ├── image001.txt   # Kept (has labels)
#     └── image003.txt   # Kept (has labels)
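
For reference, the core check behind this cleanup (an annotation file with no entries) can be reproduced with the standard library alone. This is a sketch assuming YOLO-format .txt label files in a single directory; it is independent of cvpal's own implementation:

```python
from pathlib import Path

def find_empty_labels(labels_dir):
    """Return label files that are empty or contain only whitespace."""
    return sorted(
        p for p in Path(labels_dir).glob("*.txt")
        if p.read_text().strip() == ""
    )
```

In the example tree above, this would return image002.txt and image004.txt, the two files dropna() targets.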

Basic Examples

Simple Cleanup

python
from cvpal.generate import DetectionDataset

# Initialize and generate dataset
detection_dataset = DetectionDataset()
detection_dataset.generate(
    prompt="a cat sitting on a chair",
    num_images=10,
    labels=["cat", "chair"],
    output_type="yolo"
)

# Check for empty images first
empty_images = detection_dataset.isnull()
print(f"Found {len(empty_images)} empty images")

# Remove empty images
detection_dataset.dropna()

# Verify cleanup
remaining_empty = detection_dataset.isnull()
print(f"Remaining empty images: {len(remaining_empty)}")

Quality Control Workflow

python
# Complete quality control workflow
def clean_dataset(detection_dataset):
    # 1. Check current state
    empty_images = detection_dataset.isnull()
    total_images = len(detection_dataset.images)  # Assuming this property exists
    print("Dataset status:")
    print(f"  Total images: {total_images}")
    print(f"  Empty images: {len(empty_images)}")

    if len(empty_images) == 0:
        print("✅ Dataset is already clean!")
        return

    # 2. Calculate quality percentage
    quality_percentage = ((total_images - len(empty_images)) / total_images) * 100
    print(f"  Quality: {quality_percentage:.1f}%")

    # 3. Clean dataset
    print("Cleaning dataset...")
    detection_dataset.dropna()

    # 4. Verify results (guard against a dataset emptied by cleanup)
    final_empty = detection_dataset.isnull()
    final_total = len(detection_dataset.images)
    final_quality = ((final_total - len(final_empty)) / final_total) * 100 if final_total else 0.0
    print("After cleanup:")
    print(f"  Total images: {final_total}")
    print(f"  Empty images: {len(final_empty)}")
    print(f"  Quality: {final_quality:.1f}%")

    if len(final_empty) == 0:
        print("✅ Dataset cleanup complete!")
    else:
        print("⚠️ Some empty images remain")

# Use the workflow
clean_dataset(detection_dataset)

Advanced Usage

Batch Processing with Cleanup

Generate multiple batches and clean up after each:

python
# Generate multiple batches with cleanup
prompts = [
    "a cat sitting on a chair",
    "a dog running in a park",
    "a person riding a bicycle"
]

for i, prompt in enumerate(prompts):
    print(f"\nGenerating batch {i + 1}: {prompt}")

    # Generate batch
    detection_dataset.generate(
        prompt=prompt,
        num_images=5,
        labels=["cat", "dog", "person"][i:i + 1],
        output_type="yolo",
        overwrite=False
    )

    # Check quality
    empty_images = detection_dataset.isnull()
    print(f"  Generated 5 images, {len(empty_images)} empty")

    # Clean up if needed
    if len(empty_images) > 1:  # Threshold for cleanup
        print(f"  Cleaning up {len(empty_images)} empty images...")
        detection_dataset.dropna()
        print("  Cleanup complete")

# Show final stats
final_empty = detection_dataset.isnull()
print(f"Final empty images: {len(final_empty)}")

Selective Cleanup

Clean up specific types of empty images:

python
# Advanced cleanup with analysis
def selective_cleanup(detection_dataset, threshold=0.1):
    empty_images = detection_dataset.isnull()
    total_images = len(detection_dataset.images)

    if len(empty_images) == 0:
        return

    empty_percentage = len(empty_images) / total_images
    print(f"Empty images: {len(empty_images)} ({empty_percentage:.1%})")

    if empty_percentage > threshold:
        print(f"Empty percentage ({empty_percentage:.1%}) exceeds threshold ({threshold:.1%})")
        print("Performing cleanup...")

        # Show which images will be removed
        print("Removing images:")
        for img_path in empty_images:
            filename = img_path.split('/')[-1]
            print(f"  - {filename}")

        # Perform cleanup
        detection_dataset.dropna()

        # Verify results
        remaining_empty = detection_dataset.isnull()
        print(f"Cleanup complete. Remaining empty images: {len(remaining_empty)}")
    else:
        print(f"Empty percentage ({empty_percentage:.1%}) is acceptable")

# Use selective cleanup
selective_cleanup(detection_dataset, threshold=0.15)  # 15% threshold

Safety Considerations

⚠️ Irreversible Operation

Important: dropna() permanently removes files from your dataset.

Recommendation: Always back up your dataset before running dropna(), or use isnull() first to review what will be removed.
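
Before an irreversible cleanup, a timestamped copy of the dataset tree is enough. This standard-library sketch assumes the dataset directory path is whatever your DetectionDataset writes to; it is not part of the cvpal API:

```python
import shutil
from datetime import datetime
from pathlib import Path

def backup_dataset(dataset_dir):
    """Copy the whole dataset tree to a timestamped sibling directory."""
    src = Path(dataset_dir)
    dst = src.with_name(f"{src.name}_backup_{datetime.now():%Y%m%d_%H%M%S}")
    shutil.copytree(src, dst)  # fails if the destination already exists
    return dst
```

Call `backup_dataset("dataset")` immediately before `detection_dataset.dropna()`; if the cleanup removes more than expected, the backup can simply be copied back.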

πŸ“ File System Impact

Impact: Both image files (.jpg, .png) and label files (.txt) are removed.

Consideration: Ensure you have sufficient disk space and that the operation won't affect other processes.

🔄 Dataset Consistency

Maintained: File naming consistency and dataset structure are preserved.

Note: Image indices may have gaps after cleanup, but this doesn't affect functionality.
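
One way to confirm the structure stayed consistent after a cleanup is to check that images and labels still pair one-to-one by file stem. This sketch assumes the images/ and labels/ layout shown earlier; it is not part of the cvpal API:

```python
from pathlib import Path

def find_orphans(dataset_dir):
    """Return (image stems without a label file, label stems without an image)."""
    root = Path(dataset_dir)
    image_stems = {p.stem for p in (root / "images").iterdir() if p.is_file()}
    label_stems = {p.stem for p in (root / "labels").glob("*.txt")}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)
```

After a successful dropna(), both lists should be empty even if the remaining image indices have gaps.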

Best Practices

✅ Recommended Workflow

  • Always use isnull() first to review
  • Set quality thresholds before cleanup
  • Back up the dataset before major cleanup
  • Clean up in batches, not all at once
  • Verify results after cleanup

⚠️ Common Mistakes

  • Running dropna() without checking first
  • Not backing up before cleanup
  • Setting unrealistic quality thresholds
  • Not verifying cleanup results
  • Cleaning up too aggressively

Integration Examples

Complete Dataset Generation Pipeline

python
def generate_clean_dataset(prompts, labels_per_prompt, num_images_per_prompt=5):
    """
    Complete pipeline: Generate -> Check -> Clean -> Verify
    """
    detection_dataset = DetectionDataset()

    for i, (prompt, labels) in enumerate(zip(prompts, labels_per_prompt)):
        print(f"\n=== Batch {i + 1}: {prompt} ===")

        # Generate batch
        detection_dataset.generate(
            prompt=prompt,
            num_images=num_images_per_prompt,
            labels=labels,
            output_type="yolo",
            overwrite=False
        )

        # Check quality
        empty_images = detection_dataset.isnull()
        print(f"Generated {num_images_per_prompt} images, {len(empty_images)} empty")

        # Clean if needed
        if len(empty_images) > 0:
            print(f"Cleaning {len(empty_images)} empty images...")
            detection_dataset.dropna()

        # Verify
        final_empty = detection_dataset.isnull()
        print(f"Batch complete. Empty images: {len(final_empty)}")

    # Final cleanup
    print("\n=== Final Cleanup ===")
    final_empty = detection_dataset.isnull()
    if len(final_empty) > 0:
        print(f"Final cleanup: removing {len(final_empty)} empty images")
        detection_dataset.dropna()

    # Show final results
    detection_dataset.show_samples(num_samples=3)
    print("\n✅ Dataset generation complete!")

# Example usage
prompts = [
    "a cat sitting on a chair",
    "a dog running in a park",
    "a person riding a bicycle"
]
labels_per_prompt = [
    ["cat", "chair"],
    ["dog"],
    ["person", "bicycle"]
]
generate_clean_dataset(prompts, labels_per_prompt, num_images_per_prompt=3)

Troubleshooting

Files Not Removed

Issue: dropna() runs but files remain.

Solutions: Check file permissions, ensure files aren't locked by other processes, or verify the detection dataset state.
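
To rule out the permissions case quickly, you can test write access on the paths that should have been removed. A sketch, assuming the paths returned by isnull() are file paths; note that deleting a file also requires write access to its parent directory:

```python
import os

def unwritable_files(paths):
    """Return existing paths that likely cannot be deleted: no write access
    on the file itself or on its parent directory."""
    blocked = []
    for p in paths:
        parent = os.path.dirname(p) or "."
        if os.path.exists(p) and not (os.access(p, os.W_OK) and os.access(parent, os.W_OK)):
            blocked.append(p)
    return blocked
```

Any path this reports is one dropna() would have skipped or failed on; fix permissions (or stop the process holding the file) and rerun the cleanup.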

Dataset Structure Broken

Issue: Dataset structure becomes inconsistent after cleanup.

Solutions: Regenerate from backup, check file paths, or recreate dataset structure manually.

Too Many Files Removed

Issue: More files removed than expected.

Solutions: Restore from backup, check detection thresholds, or regenerate problematic batches.