Synthetic Data Generation
isnull() Function
Identify images with no detected objects for quality control and dataset cleaning.
Overview
The isnull() function identifies images in your dataset that have no detected objects. This is crucial for maintaining dataset quality and ensuring all images contribute meaningful training data.
Use Cases
- Quality Control: Identify problematic images
- Dataset Analysis: Analyze dataset completeness
- Preprocessing: Prepare for the dropna() function
Function Signature
def isnull(self) -> List[str]:
Return Value
List[str] - Image Paths
Returns a list of image file paths that have no detected objects. These images can be removed using the dropna() function.
    # Example return value
    empty_images = detection_dataset.isnull()
    print(empty_images)
    # Output: [
    #     "/path/to/dataset/images/image001.jpg",
    #     "/path/to/dataset/images/image005.jpg",
    #     "/path/to/dataset/images/image012.jpg"
    # ]
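To make "no detected objects" concrete, here is a minimal sketch of one way such a check could work on a YOLO-style export. It is not cvpal's implementation. It assumes a layout in which each image in an images directory is annotated by a same-named .txt file in a labels directory, and it treats a missing or blank label file as "empty". The function name find_unlabeled_images and the extension set are hypothetical.

```python
from pathlib import Path
from typing import List

# Assumption: only these extensions count as images.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def find_unlabeled_images(images_dir: str, labels_dir: str) -> List[str]:
    """Return sorted paths of images whose YOLO label file is missing or blank."""
    empty = []
    for image_path in sorted(Path(images_dir).iterdir()):
        if image_path.suffix.lower() not in IMAGE_EXTENSIONS:
            continue
        # Assumed YOLO convention: images/foo.jpg is annotated by labels/foo.txt,
        # with one line per detected object.
        label_path = Path(labels_dir) / f"{image_path.stem}.txt"
        if not label_path.is_file() or not label_path.read_text().strip():
            empty.append(str(image_path))
    return empty
```

The result has the same shape as the return value of isnull(): a list of image paths as strings.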
Basic Examples
Check for Empty Images
    from cvpal.generate import DetectionDataset

    # Initialize and generate dataset
    detection_dataset = DetectionDataset()
    detection_dataset.generate(
        prompt="a cat sitting on a chair",
        num_images=10,
        labels=["cat", "chair"],
        output_type="yolo"
    )

    # Check for empty images
    empty_images = detection_dataset.isnull()
    print(f"Found {len(empty_images)} images with no detections")

    if len(empty_images) > 0:
        print("Empty images:")
        for img_path in empty_images:
            print(f"  - {img_path}")
Quality Assessment
    # Assess dataset quality
    empty_images = detection_dataset.isnull()
    total_images = len(detection_dataset.images)  # Assuming this property exists

    if len(empty_images) > 0:
        empty_percentage = (len(empty_images) / total_images) * 100
        print(f"Dataset quality: {100 - empty_percentage:.1f}%")
        print(f"Empty images: {len(empty_images)} ({empty_percentage:.1f}%)")

        if empty_percentage > 20:
            print("⚠️ High percentage of empty images - consider improving prompts")
        elif empty_percentage > 10:
            print("⚠️ Moderate percentage of empty images")
        else:
            print("✅ Good dataset quality")
    else:
        print("✅ All images have detected objects!")
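The same thresholds can be factored into a small pure function, which makes them easy to test and reuse without a dataset object. This is a sketch only: summarize_empty_ratio is a hypothetical helper, not part of cvpal, and the 10% and 20% cut-offs are simply the ones used in the example above.

```python
from typing import Tuple


def summarize_empty_ratio(num_empty: int, num_total: int) -> Tuple[float, str]:
    """Return (empty percentage, verdict) for a dataset of num_total images."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    if not 0 <= num_empty <= num_total:
        raise ValueError("num_empty must be between 0 and num_total")

    empty_percentage = 100.0 * num_empty / num_total
    # Thresholds mirror the example above: >20% is high, >10% is moderate.
    if empty_percentage > 20:
        verdict = "high"
    elif empty_percentage > 10:
        verdict = "moderate"
    else:
        verdict = "good"
    return empty_percentage, verdict
```

A call such as summarize_empty_ratio(len(empty_images), total_images) would then replace the inline if/elif chain.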
Advanced Usage
Batch Quality Control
Check each generation batch for quality issues:
    # Check multiple generation batches
    prompts = [
        "a cat sitting on a chair",
        "a dog running in a park",
        "a person riding a bicycle"
    ]

    for i, prompt in enumerate(prompts):
        detection_dataset.generate(
            prompt=prompt,
            num_images=5,
            labels=["cat", "dog", "person"][i:i+1],
            output_type="yolo",
            overwrite=False
        )

        # Check quality after each batch
        empty_images = detection_dataset.isnull()
        print(f"Batch {i+1}: {len(empty_images)} empty images")

        if len(empty_images) > 2:  # Threshold for cleanup
            print(f"  Cleaning batch {i+1}...")
            detection_dataset.dropna()
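If isnull() reports empty images across the whole accumulated dataset when overwrite=False is used (an assumption; the behavior is not stated here), the per-batch count printed above would include empties left over from earlier batches. One way to attribute empties to the batch that produced them is to track which paths have already been seen. The helper new_empty_images below is hypothetical and works on plain lists of paths.

```python
from typing import Iterable, List, Set


def new_empty_images(current_empty: Iterable[str], already_seen: Set[str]) -> List[str]:
    """Return paths not reported in earlier batches and remember them in already_seen."""
    # dict.fromkeys removes duplicates while preserving order.
    fresh = [path for path in dict.fromkeys(current_empty) if path not in already_seen]
    already_seen.update(fresh)
    return fresh
```

Inside the loop, calling new_empty_images(detection_dataset.isnull(), seen) with a set created once before the loop would give the empties introduced by the current batch only.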
Detailed Analysis
Analyze empty images by filename index to see whether they are clustered or scattered:
    # Analyze empty images in detail
    empty_images = detection_dataset.isnull()

    if len(empty_images) > 0:
        print("Analysis of empty images:")

        # Group by filename patterns (if using systematic naming)
        for img_path in empty_images:
            filename = img_path.split('/')[-1]
            print(f"  - {filename}")

        # Check if empty images are clustered
        empty_indices = []
        for img_path in empty_images:
            # Extract image number from filename
            filename = img_path.split('/')[-1]
            if 'image' in filename:
                try:
                    index = int(filename.split('image')[1].split('.')[0])
                    empty_indices.append(index)
                except ValueError:
                    # Filename does not follow the imageNNN pattern; skip it
                    pass

        if empty_indices:
            empty_indices.sort()
            print(f"Empty image indices: {empty_indices}")

            # Check for patterns
            if len(empty_indices) > 1:
                gaps = [empty_indices[i+1] - empty_indices[i] for i in range(len(empty_indices)-1)]
                if max(gaps) > 5:
                    print("Empty images are scattered - likely prompt issues")
                else:
                    print("Empty images are clustered - possible generation batch issue")
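The index extraction and gap check above can also be written as two small functions, using a regular expression instead of chained split() calls. This is a sketch under the same assumption as the example, namely that files are named like image001.jpg; extract_index and describe_distribution are hypothetical names, and the default gap of 5 is taken from the example.

```python
import os
import re
from typing import List, Optional

# Assumption: filenames contain "image" followed by a number, e.g. image012.jpg.
_INDEX_PATTERN = re.compile(r"image(\d+)")


def extract_index(image_path: str) -> Optional[int]:
    """Return the numeric index in the filename, or None if there is none."""
    match = _INDEX_PATTERN.search(os.path.basename(image_path))
    return int(match.group(1)) if match else None


def describe_distribution(image_paths: List[str], max_gap: int = 5) -> str:
    """Classify empty images as 'clustered' or 'scattered' by index gaps."""
    indices = sorted(
        index for index in map(extract_index, image_paths) if index is not None
    )
    if len(indices) < 2:
        return "insufficient data"
    gaps = [later - earlier for earlier, later in zip(indices, indices[1:])]
    # Any gap larger than max_gap counts as scattered, as in the example above.
    return "scattered" if max(gaps) > max_gap else "clustered"
```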
Integration with dropna()
Complete Quality Control Workflow
    # Complete workflow: Check and clean dataset
    def quality_control_workflow(detection_dataset):
        # 1. Check for empty images
        empty_images = detection_dataset.isnull()

        if len(empty_images) == 0:
            print("✅ No empty images found - dataset is clean!")
            return

        print(f"Found {len(empty_images)} empty images")

        # 2. Show empty images for review
        print("Empty images:")
        for img_path in empty_images:
            print(f"  - {img_path}")

        # 3. Ask for confirmation (in real usage)
        # user_input = input("Remove empty images? (y/n): ")
        # if user_input.lower() == 'y':

        # 4. Remove empty images
        print("Removing empty images...")
        detection_dataset.dropna()

        # 5. Verify cleanup
        remaining_empty = detection_dataset.isnull()
        print(f"Remaining empty images: {len(remaining_empty)}")

        if len(remaining_empty) == 0:
            print("✅ Dataset cleanup complete!")
        else:
            print("⚠️ Some empty images remain")

    # Use the workflow
    quality_control_workflow(detection_dataset)
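For the review step, one option is to copy the flagged images into a separate folder before calling dropna(), so they can still be inspected after removal. The sketch below uses only the standard library; copy_for_review and the review directory are hypothetical and not part of cvpal.

```python
import shutil
from pathlib import Path
from typing import List


def copy_for_review(image_paths: List[str], review_dir: str) -> List[str]:
    """Copy each existing image into review_dir and return the new paths."""
    target = Path(review_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for image_path in image_paths:
        source = Path(image_path)
        if not source.is_file():
            continue  # Skip paths that no longer exist on disk
        destination = target / source.name
        shutil.copy2(source, destination)  # copy2 also preserves timestamps
        copied.append(str(destination))
    return copied
```

Files are copied by name only, so two images with the same filename in different directories would overwrite each other in the review folder.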
Common Scenarios
High Empty Image Count
Symptom: Many images have no detected objects.
Causes: Vague prompts, objects too small, detection threshold too high, or model limitations.
Solutions: Improve prompts, adjust detection settings, or use different generation parameters.
Clustered Empty Images
Symptom: Empty images appear in consecutive batches.
Causes: Specific prompt issues, generation batch problems, or model instability.
Solutions: Review specific prompts, regenerate problematic batches, or adjust generation parameters.
Scattered Empty Images
Symptom: Empty images are randomly distributed.
Causes: General prompt quality issues, detection model limitations, or random generation failures.
Solutions: Improve overall prompt quality, adjust detection thresholds, or increase generation attempts.
Best Practices
✅ Recommended Workflow
- Check isnull() after each generation batch
- Set quality thresholds (e.g., <10% empty)
- Review empty images before removal
- Document quality metrics
- Use dropna() for cleanup
⚠️ Common Mistakes
- Not checking for empty images
- Removing images without review
- Ignoring quality patterns
- Not documenting issues
- Setting unrealistic thresholds
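Documenting quality metrics, as recommended above, can be as simple as writing a small JSON report next to the dataset after each check. The following is a minimal sketch; write_quality_report and the report fields are hypothetical choices, not a cvpal feature.

```python
import json
from datetime import datetime, timezone
from typing import List


def write_quality_report(empty_images: List[str], total_images: int, report_path: str) -> dict:
    """Write a JSON summary of an isnull() check and return it as a dict."""
    if total_images > 0:
        empty_percentage = round(100.0 * len(empty_images) / total_images, 1)
    else:
        empty_percentage = 0.0  # Nothing generated yet

    report = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "total_images": total_images,
        "empty_images": len(empty_images),
        "empty_percentage": empty_percentage,
        "empty_image_paths": sorted(empty_images),
    }
    with open(report_path, "w", encoding="utf-8") as handle:
        json.dump(report, handle, indent=2)
    return report
```

Keeping one report per generation run makes it possible to compare empty-image rates across prompts and settings over time.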