Synthetic Data Generation

isnull() Function

Identify images with no detected objects for quality control and dataset cleaning.

Overview

The isnull() function returns the images in your dataset that have no detected objects. Checking for them helps maintain dataset quality and ensures every image contributes meaningful training data.

Use Cases

Quality Control

Identify problematic images

Dataset Analysis

Analyze dataset completeness

Preprocessing

List the images to remove before calling dropna()

Function Signature

python

def isnull(self) -> List[str]:

Return Value

List[str] - Image Paths

Returns a list of file paths for the images that have no detected objects. These images can be removed with the dropna() function.

python

# Example return value
empty_images = detection_dataset.isnull()
print(empty_images)
# Output: [
#     "/path/to/dataset/images/image001.jpg",
#     "/path/to/dataset/images/image005.jpg",
#     "/path/to/dataset/images/image012.jpg"
# ]
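To make "no detected objects" concrete, here is a minimal sketch of how such images could be found on disk for YOLO-style output. It is not cvpal's implementation, which this page does not show. It assumes the common YOLO layout in which each image has a label file with the same stem and a .txt extension, and it treats a missing or blank label file as "no detections". The names find_unlabeled_images and IMAGE_SUFFIXES and the two directory arguments are hypothetical.

```python
from pathlib import Path
from typing import List

# Assumption: only these extensions count as images.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def find_unlabeled_images(images_dir: str, labels_dir: str) -> List[str]:
    """Return sorted image paths whose YOLO label file is missing or blank."""
    empty = []
    for image_path in sorted(Path(images_dir).iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip non-image files
        # Assumed convention: labels/<stem>.txt pairs with images/<stem>.<ext>
        label_path = Path(labels_dir) / f"{image_path.stem}.txt"
        if not label_path.exists() or not label_path.read_text().strip():
            empty.append(str(image_path))
    return empty
```

A helper like this can serve as an independent cross-check of what isnull() reports, provided the dataset really uses the assumed layout.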

Basic Examples

Check for Empty Images

python

from cvpal.generate import DetectionDataset

# Initialize and generate dataset
detection_dataset = DetectionDataset()
detection_dataset.generate(
    prompt="a cat sitting on a chair",
    num_images=10,
    labels=["cat", "chair"],
    output_type="yolo"
)

# Check for empty images
empty_images = detection_dataset.isnull()
print(f"Found {len(empty_images)} images with no detections")

if len(empty_images) > 0:
    print("Empty images:")
    for img_path in empty_images:
        print(f"  - {img_path}")

Quality Assessment

python

# Assess dataset quality
empty_images = detection_dataset.isnull()
total_images = len(detection_dataset.images)  # Assuming this property exists

if len(empty_images) > 0:
    empty_percentage = (len(empty_images) / total_images) * 100
    print(f"Dataset quality: {100 - empty_percentage:.1f}%")
    print(f"Empty images: {len(empty_images)} ({empty_percentage:.1f}%)")
    
    if empty_percentage > 20:
        print("⚠️  High percentage of empty images - consider improving prompts")
    elif empty_percentage > 10:
        print("⚠️  Moderate percentage of empty images")
    else:
        print("✅ Good dataset quality")
else:
    print("✅ All images have detected objects!")

Advanced Usage

Batch Quality Control

Check each generation batch for quality issues as it is added to the dataset:

python

# Check multiple generation batches, pairing each prompt with its label
batches = [
    ("a cat sitting on a chair", ["cat"]),
    ("a dog running in a park", ["dog"]),
    ("a person riding a bicycle", ["person"]),
]

for i, (prompt, labels) in enumerate(batches):
    detection_dataset.generate(
        prompt=prompt,
        num_images=5,
        labels=labels,
        output_type="yolo",
        overwrite=False
    )
    
    # Check quality after each batch
    empty_images = detection_dataset.isnull()
    print(f"Batch {i+1}: {len(empty_images)} empty images")
    
    if len(empty_images) > 2:  # Threshold for cleanup
        print(f"  Cleaning batch {i+1}...")
        detection_dataset.dropna()
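The loop above calls isnull() on the same dataset object after every batch. If the dataset accumulates images across batches when overwrite=False (an assumption, since this page does not say), the reported count can include empty images left over from earlier batches. A small sketch of isolating the newly empty images with a set difference follows. The helper new_empty_images is hypothetical and works on plain lists of paths, so it does not depend on cvpal.

```python
from typing import Iterable, List


def new_empty_images(before: Iterable[str], after: Iterable[str]) -> List[str]:
    """Return the paths in `after` that were not already in `before`, sorted."""
    return sorted(set(after) - set(before))


# Illustrative paths only: the result of isnull() before and after one batch
before_batch = ["/data/images/image001.jpg"]
after_batch = [
    "/data/images/image001.jpg",
    "/data/images/image007.jpg",
    "/data/images/image006.jpg",
]
print(new_empty_images(before_batch, after_batch))
# Output: ['/data/images/image006.jpg', '/data/images/image007.jpg']
```

In the batch loop, this would mean storing the result of isnull() before each generate() call and comparing it with the result afterwards.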

Detailed Analysis

Analyze empty images by filename to see whether they cluster or are scattered:

python

import os
import re

# Analyze empty images in detail
empty_images = detection_dataset.isnull()

if len(empty_images) > 0:
    print("Analysis of empty images:")
    
    # List filenames and collect image numbers
    # (assumes systematic names such as image012.jpg)
    empty_indices = []
    for img_path in empty_images:
        filename = os.path.basename(img_path)
        print(f"  - {filename}")
        match = re.search(r"image(\d+)", filename)
        if match:
            empty_indices.append(int(match.group(1)))
    
    if empty_indices:
        empty_indices.sort()
        print(f"Empty image indices: {empty_indices}")
        
        # Check for patterns: a large gap between neighbors suggests scattering
        if len(empty_indices) > 1:
            gaps = [b - a for a, b in zip(empty_indices, empty_indices[1:])]
            if max(gaps) > 5:
                print("Empty images are scattered - likely prompt issues")
            else:
                print("Empty images are clustered - possible generation batch issue")

Integration with dropna()

Complete Quality Control Workflow

python

# Complete workflow: Check and clean dataset
def quality_control_workflow(detection_dataset):
    # 1. Check for empty images
    empty_images = detection_dataset.isnull()
    
    if len(empty_images) == 0:
        print("✅ No empty images found - dataset is clean!")
        return
    
    print(f"Found {len(empty_images)} empty images")
    
    # 2. Show empty images for review
    print("Empty images:")
    for img_path in empty_images:
        print(f"  - {img_path}")
    
    # 3. Ask for confirmation (in real usage)
    # user_input = input("Remove empty images? (y/n): ")
    # if user_input.lower() == 'y':
    
    # 4. Remove empty images
    print("Removing empty images...")
    detection_dataset.dropna()
    
    # 5. Verify cleanup
    remaining_empty = detection_dataset.isnull()
    print(f"Remaining empty images: {len(remaining_empty)}")
    
    if len(remaining_empty) == 0:
        print("✅ Dataset cleanup complete!")
    else:
        print("⚠️  Some empty images remain")

# Use the workflow
quality_control_workflow(detection_dataset)
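Step 3 of the workflow leaves the confirmation prompt commented out. One way to make it testable is to take the prompt function as a parameter, as in the sketch below. The helper confirm_removal and its ask parameter are hypothetical and not part of cvpal. By default it uses the built-in input().

```python
from typing import Callable, List


def confirm_removal(
    empty_images: List[str],
    ask: Callable[[str], str] = input,
) -> bool:
    """Return True only if there is something to remove and the user says yes."""
    if not empty_images:
        return False  # nothing to remove, so do not prompt at all
    answer = ask(f"Remove {len(empty_images)} empty images? (y/n): ")
    return answer.strip().lower() in {"y", "yes"}
```

Inside quality_control_workflow, the call to detection_dataset.dropna() could then be guarded by confirm_removal(empty_images), so nothing is removed without an explicit "y".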

Common Scenarios

High Empty Image Count

Symptom: Many images have no detected objects.

Causes: Vague prompts, objects too small, detection threshold too high, or model limitations.

Solutions: Improve prompts, adjust detection settings, or use different generation parameters.

Clustered Empty Images

Symptom: Empty images appear consecutively, concentrated in one or a few batches.

Causes: Specific prompt issues, generation batch problems, or model instability.

Solutions: Review specific prompts, regenerate problematic batches, or adjust generation parameters.

Scattered Empty Images

Symptom: Empty images are randomly distributed.

Causes: General prompt quality issues, detection model limitations, or random generation failures.

Solutions: Improve overall prompt quality, adjust detection thresholds, or increase generation attempts.
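The "increase generation attempts" suggestion can be sketched as a bounded top-up loop: drop the empty images, generate the same number of replacements, and repeat up to a fixed number of rounds. The helper top_up_dataset is hypothetical, not a cvpal function. It takes three callables, so it assumes nothing about the library beyond the generate(), isnull(), and dropna() calls shown earlier.

```python
from typing import Callable, List


def top_up_dataset(
    generate_batch: Callable[[int], None],
    find_empty: Callable[[], List[str]],
    drop_empty: Callable[[], None],
    max_rounds: int = 3,
) -> int:
    """Replace empty images with fresh ones; return the number of rounds used."""
    for round_number in range(max_rounds):
        empty = find_empty()
        if not empty:
            return round_number  # dataset is clean, stop early
        drop_empty()
        generate_batch(len(empty))  # regenerate as many images as were dropped
    return max_rounds
```

With a real dataset, generate_batch might wrap detection_dataset.generate(...) with num_images set to the requested count and overwrite=False, while find_empty and drop_empty would be detection_dataset.isnull and detection_dataset.dropna. When the function returns max_rounds, the last batch has not been checked, so call isnull() once more before relying on the result.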

Best Practices

✅ Recommended Workflow

• Check isnull() after each generation batch
• Set quality thresholds (e.g., <10% empty)
• Review empty images before removal
• Document quality metrics
• Use dropna() for cleanup
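The threshold and documentation steps above can be combined into a small report, sketched below. The function quality_report and its field names are hypothetical. The 10% default mirrors the example threshold in the list, and the image total is passed in explicitly because this page does not confirm how cvpal exposes it.

```python
import json
from typing import Dict, List


def quality_report(
    empty_images: List[str],
    total_images: int,
    max_empty_pct: float = 10.0,
) -> Dict[str, object]:
    """Summarize the empty-image rate and whether it is under the threshold."""
    if total_images <= 0:
        raise ValueError("total_images must be positive")
    empty_pct = 100.0 * len(empty_images) / total_images
    return {
        "total_images": total_images,
        "empty_images": len(empty_images),
        "empty_pct": round(empty_pct, 1),
        "passes_threshold": empty_pct < max_empty_pct,
    }


# Illustrative numbers: 2 empty images out of 25 is 8%, under the 10% limit
report = quality_report(["a.jpg", "b.jpg"], total_images=25)
print(json.dumps(report, indent=2))
```

Writing the JSON string to a file next to the dataset after each batch gives a simple record of quality metrics over time.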

⚠️ Common Mistakes

• Not checking for empty images
• Removing images without review
• Ignoring quality patterns
• Not documenting issues
• Setting unrealistic thresholds
