Dataset Structure

Learn how to organize your computer vision datasets for optimal compatibility with cvPal.

Overview

cvPal supports multiple dataset formats and structures. The recommended structure separates images from labels and stores metadata in a YAML configuration file. This layout keeps datasets compatible with common formats such as YOLO and COCO, as well as custom formats.

Basic Dataset Structure

The most common and recommended structure for cvPal datasets:

text
dataset/
├── images/
│   ├── train/
│   │   ├── image001.jpg
│   │   ├── image002.jpg
│   │   └── ...
│   ├── test/
│   │   ├── image101.jpg
│   │   ├── image102.jpg
│   │   └── ...
│   └── valid/
│       ├── image201.jpg
│       ├── image202.jpg
│       └── ...
├── labels/
│   ├── train/
│   │   ├── image001.txt
│   │   ├── image002.txt
│   │   └── ...
│   ├── test/
│   │   ├── image101.txt
│   │   ├── image102.txt
│   │   └── ...
│   └── valid/
│       ├── image201.txt
│       ├── image202.txt
│       └── ...
└── data.yaml

📁 Images Folder

Contains all image files organized by split (train/test/valid). Supports JPG, PNG, and other common formats.

🏷️ Labels Folder

Contains the corresponding label files in TXT or JSON format. Each label file shares its base name with the image it annotates (for example, image001.txt for image001.jpg).

⚙️ data.yaml

Configuration file containing dataset metadata, class names, and paths to the training, validation, and test sets.
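The one-to-one pairing between images and labels can be checked before loading a dataset. The following is a minimal sketch, not part of cvPal: the helper name `find_unpaired`, the extension set, and the assumption that labels are `.txt` files are all illustrative.

```python
from pathlib import Path

# Assumed set of image extensions; extend it to match your data.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp"}
SPLITS = ("train", "valid", "test")


def find_unpaired(root):
    """Return (images_without_labels, labels_without_images) as lists of 'split/stem'."""
    root = Path(root)
    missing_labels, missing_images = [], []
    for split in SPLITS:
        img_dir = root / "images" / split
        lbl_dir = root / "labels" / split
        images, labels = set(), set()
        if img_dir.is_dir():
            images = {p.stem for p in img_dir.iterdir() if p.suffix.lower() in IMAGE_EXTS}
        if lbl_dir.is_dir():
            labels = {p.stem for p in lbl_dir.glob("*.txt")}
        # Compare base names: image001.jpg pairs with image001.txt.
        missing_labels += [f"{split}/{stem}" for stem in sorted(images - labels)]
        missing_images += [f"{split}/{stem}" for stem in sorted(labels - images)]
    return missing_labels, missing_images
```

Calling `find_unpaired("/path/to/your/dataset")` returns two lists; both are empty when every image has a label and every label has an image. Splits that do not exist on disk are skipped silently.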

YAML Configuration File

The data.yaml file contains essential metadata about your dataset:

Example data.yaml

yaml
# Dataset configuration
names:
  - cat
  - dog
  - bird
nc: 3  # number of classes
# Dataset paths
train: images/train
val: images/valid
test: images/test
# Optional: Additional metadata
roboflow:
  license: Private
  project: animal-detection
  url: https://universe.roboflow.com/your-project
  version: 1
  workspace: your-workspace
# Optional: Dataset info
info:
  description: "Animal detection dataset with cats, dogs, and birds"
  version: "1.0"
  created: "2024-01-01"
  author: "Your Name"

Required Fields

  • names - List of class names
  • nc - Number of classes (should equal the number of entries in names)
  • train - Path to training images
  • val - Path to validation images

Optional Fields

  • test - Path to test images
  • roboflow - Roboflow metadata
  • info - Additional dataset info
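The field lists above can be turned into a quick sanity check. This is a sketch under assumptions: `validate_config` is a hypothetical helper, not a cvPal function, and it works on a plain dictionary such as the one a YAML parser (for example PyYAML's `yaml.safe_load`) would return for data.yaml.

```python
REQUIRED_KEYS = ("names", "nc", "train", "val")


def validate_config(cfg):
    """Return a list of problems; an empty list means the config passed."""
    problems = [f"missing required field: {key}" for key in REQUIRED_KEYS if key not in cfg]
    names, nc = cfg.get("names"), cfg.get("nc")
    # The class count should agree with the number of class names.
    if isinstance(names, list) and isinstance(nc, int) and nc != len(names):
        problems.append(f"nc is {nc} but names lists {len(names)} classes")
    return problems


# Same values as the example data.yaml, written as a dictionary.
config = {
    "names": ["cat", "dog", "bird"],
    "nc": 3,
    "train": "images/train",
    "val": "images/valid",
    "test": "images/test",
}
print(validate_config(config))  # prints []
```

Returning a list of messages rather than raising on the first problem lets you report every issue in one pass.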

Label Formats

cvPal supports multiple label formats. Choose the one that best fits your workflow:

TXT Format (YOLO)

Each line represents one object: class_id x_center y_center width height

text
# image001.txt
0 0.5 0.3 0.2 0.4 # cat, horizontally centered in the upper half
1 0.7 0.6 0.15 0.3 # dog at bottom-right
# image002.txt
2 0.2 0.8 0.1 0.2 # bird at bottom-left

Note: All coordinates are normalized to the 0-1 range: x_center and width are divided by the image width, y_center and height by the image height.
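To make the normalization concrete, the sketch below parses one label line and converts it back to pixel corner coordinates. The function names are illustrative, not cvPal API, and the 640x480 image size is an assumption chosen for the example.

```python
def parse_yolo_line(line):
    """Parse 'class_id x_center y_center width height', ignoring a trailing '#' comment."""
    fields = line.split("#", 1)[0].split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    class_id = int(fields[0])
    x_c, y_c, w, h = (float(v) for v in fields[1:])
    return class_id, x_c, y_c, w, h


def yolo_to_pixel_corners(box, img_w, img_h):
    """Convert normalized (x_center, y_center, width, height) to pixel (x_min, y_min, x_max, y_max)."""
    x_c, y_c, w, h = box
    return (
        (x_c - w / 2) * img_w,
        (y_c - h / 2) * img_h,
        (x_c + w / 2) * img_w,
        (y_c + h / 2) * img_h,
    )


class_id, *box = parse_yolo_line("0 0.5 0.3 0.2 0.4")
# Assumed image size of 640x480 pixels.
corners = yolo_to_pixel_corners(box, img_w=640, img_h=480)
print(class_id, corners)
```

For this line the corners come out at roughly (256, 48, 384, 240), up to floating-point rounding.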

JSON Format (COCO)

Structured format with detailed annotations and metadata:

json
{
  "images": [
    {
      "id": 1,
      "file_name": "image001.jpg",
      "width": 640,
      "height": 480
    }
  ],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 1,
      "bbox": [100, 50, 200, 150],
      "area": 30000,
      "iscrowd": 0
    }
  ],
  "categories": [
    {
      "id": 1,
      "name": "cat",
      "supercategory": "animal"
    }
  ]
}
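COCO stores each bbox as [x_min, y_min, width, height] in pixels, while the TXT format uses normalized center coordinates. The sketch below shows one way to convert between them using the values from the example above. The helper names are hypothetical, and mapping category ids to zero-based class ids in ascending id order is an assumption, not a documented cvPal behavior.

```python
def coco_bbox_to_yolo(bbox, img_w, img_h):
    """Convert COCO [x_min, y_min, width, height] in pixels to normalized (x_center, y_center, width, height)."""
    x_min, y_min, w, h = bbox
    return ((x_min + w / 2) / img_w, (y_min + h / 2) / img_h, w / img_w, h / img_h)


def category_index(categories):
    """Map COCO category ids to zero-based class ids, ordered by category id (assumed convention)."""
    ordered = sorted(categories, key=lambda c: c["id"])
    return {c["id"]: i for i, c in enumerate(ordered)}


image = {"id": 1, "file_name": "image001.jpg", "width": 640, "height": 480}
annotation = {"image_id": 1, "category_id": 1, "bbox": [100, 50, 200, 150]}
categories = [{"id": 1, "name": "cat", "supercategory": "animal"}]

class_id = category_index(categories)[annotation["category_id"]]
x_c, y_c, w, h = coco_bbox_to_yolo(annotation["bbox"], image["width"], image["height"])
print(f"{class_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}")
# prints: 0 0.312500 0.260417 0.312500 0.312500
```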

Alternative Structures

cvPal also supports other common dataset organizations:

Flat Structure

All images and labels in single directories:

text
dataset/
├── images/
│   ├── image001.jpg
│   ├── image002.jpg
│   └── ...
├── labels/
│   ├── image001.txt
│   ├── image002.txt
│   └── ...
└── data.yaml

Paired Structure

Images and labels in the same directory:

text
dataset/
├── image001.jpg
├── image001.txt
├── image002.jpg
├── image002.txt
└── data.yaml
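If you handle datasets from several sources, it can help to detect which of these layouts a directory follows. The following is a rough heuristic sketch, not how cvPal itself detects layouts; `detect_layout` and its return values are hypothetical.

```python
from pathlib import Path

SPLITS = ("train", "valid", "test")


def detect_layout(root):
    """Guess the dataset layout: 'split', 'flat', 'paired', or 'unknown'."""
    root = Path(root)
    images, labels = root / "images", root / "labels"
    if images.is_dir() and labels.is_dir():
        # Split layout has train/valid/test subfolders under images/.
        if any((images / split).is_dir() for split in SPLITS):
            return "split"
        return "flat"
    # Paired layout keeps label files next to the images in the root.
    if any(root.glob("*.txt")):
        return "paired"
    return "unknown"
```

The check only looks at directory names and the presence of `.txt` files, so it will not notice JSON labels or a partially populated dataset.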

Best Practices

✅ Do

  • Use consistent naming conventions
  • Keep images and labels synchronized
  • Include comprehensive YAML metadata
  • Validate label coordinates (0-1 range)
  • Use meaningful class names
  • Organize by train/test/valid splits

❌ Don't

  • Mix different label formats
  • Use absolute pixel coordinates in TXT
  • Skip the data.yaml file
  • Use spaces in file names
  • Have mismatched image/label pairs
  • Forget to update class counts
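Several of these checks are easy to automate. The sketch below validates the contents of a single TXT label file against the coordinate range and the class count; `check_label_text` is a hypothetical helper, not part of cvPal.

```python
def check_label_text(text, num_classes):
    """Return a list of problems found in the contents of one TXT label file."""
    problems = []
    for n, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 5:
            problems.append(f"line {n}: expected 5 fields, got {len(fields)}")
            continue
        try:
            class_id = int(fields[0])
            coords = [float(v) for v in fields[1:]]
        except ValueError:
            problems.append(f"line {n}: non-numeric value")
            continue
        if not 0 <= class_id < num_classes:
            problems.append(f"line {n}: class id {class_id} outside 0..{num_classes - 1}")
        # Values above 1 usually mean absolute pixel coordinates were written.
        if any(not 0.0 <= v <= 1.0 for v in coords):
            problems.append(f"line {n}: coordinate outside the 0-1 range")
    return problems


sample = "0 0.5 0.3 0.2 0.4\n3 0.7 0.6 0.15 0.3\n1 320 240 100 80\n"
for problem in check_label_text(sample, num_classes=3):
    print(problem)
```

With three classes, the second sample line is flagged for its class id and the third for pixel coordinates.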

Quick Start

Using cvPal with Your Dataset

python
from cvpal.preprocessing import ImagesDetection

# Load your dataset
cp = ImagesDetection()
cp.read_data("/path/to/your/dataset", data_type="txt")

# Generate a report
cp.report()

# Merge with other datasets
cp.merge_datasets([
    "/path/to/dataset1",
    "/path/to/dataset2"
])