This article introduces several datasets I have worked with.

KITTI

The KITTI dataset, jointly created by the Karlsruhe Institute of Technology in Germany and the Toyota Technological Institute in the United States, is a large-scale dataset for autonomous driving scenarios.
The KITTI data collection platform comprises 2 grayscale cameras, 2 color cameras, a Velodyne 3D LiDAR, 4 optical lenses, and 1 GPS navigation system.
In practice, the left color camera and the LiDAR are the most commonly used sensors.

Data Format

| #Values | Name | Description |
| --- | --- | --- |
| 1 | type | Describes the type of object: 'Car', 'Van', 'Truck', 'Pedestrian', 'Person_sitting', 'Cyclist', 'Tram', 'Misc' or 'DontCare' |
| 1 | truncated | Float from 0 (non-truncated) to 1 (truncated), where truncated refers to the object leaving image boundaries |
| 1 | occluded | Integer (0,1,2,3) indicating occlusion state: 0 = fully visible, 1 = partly occluded, 2 = largely occluded, 3 = unknown |
| 1 | alpha | Observation angle of object, ranging [-pi..pi] |
| 4 | bbox | 2D bounding box of object in the image (0-based index): contains left, top, right, bottom pixel coordinates |
| 3 | dimensions | 3D object dimensions: height, width, length (in meters) |
| 3 | location | 3D object location x, y, z in camera coordinates (in meters) |
| 1 | rotation_y | Rotation ry around Y-axis in camera coordinates [-pi..pi] |
| 1 | score | Only for results: Float, indicating confidence in detection, needed for p/r curves, higher is better |
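
To make the layout concrete, here is a minimal sketch of parsing one line of a label file according to the table above (the function name is my own, not part of the official devkit):

```python
def parse_kitti_label_line(line):
    """Parse one line of a KITTI object label into a dict (see table above)."""
    f = line.split()
    return {
        'type': f[0],
        'truncated': float(f[1]),
        'occluded': int(f[2]),
        'alpha': float(f[3]),
        'bbox': [float(v) for v in f[4:8]],         # left, top, right, bottom
        'dimensions': [float(v) for v in f[8:11]],  # height, width, length (m)
        'location': [float(v) for v in f[11:14]],   # x, y, z in camera coords (m)
        'rotation_y': float(f[14]),
        'score': float(f[15]) if len(f) > 15 else None,  # results files only
    }
```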

KITTI 2D Detection

Only the following two downloads are needed:

  • Download left color images of object data set (12 GB)
  • Download training labels of object data set (5 MB)

Training Notes

Classes: 'Car', 'Van', 'Truck', 'Pedestrian', 'Person_sitting', 'Cyclist', 'Tram', 'Misc', and 'DontCare'.
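
When training a 2D detector, one usually keeps only a subset of these classes and skips 'DontCare' regions. A minimal sketch; which classes to keep is an assumed choice, not something prescribed by KITTI:

```python
# Map the classes we train on to contiguous IDs; everything else
# (including 'DontCare') is skipped.
KEEP_CLASSES = {'Car': 0, 'Pedestrian': 1, 'Cyclist': 2}  # assumed choice

def filter_kitti_labels(label_path):
    """Yield (class_id, (left, top, right, bottom)) for kept classes."""
    with open(label_path) as f:
        for line in f:
            fields = line.split()
            if fields[0] not in KEEP_CLASSES:
                continue
            left, top, right, bottom = map(float, fields[4:8])
            yield KEEP_CLASSES[fields[0]], (left, top, right, bottom)
```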

CrowdHuman

The CrowdHuman dataset, released by Megvii, is a dataset for human detection; most of its images come from Google image search. Each image contains roughly 23 people on average, with all kinds of occlusion. Every human instance is annotated with a head bounding box, a visible-region bounding box, and a full-body bounding box.

Dataset Preparation

CrowdHuman official website
Personal mirror link
The dataset comprises three train archives, one val archive, one test archive, and two annotation files in odgt format.

Data Processing

The official annotations come in the odgt format, which training frameworks generally do not consume directly. For example, to train with YOLO, the annotations first have to be converted.
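
For reference, each line of an .odgt file is a self-contained JSON record. A simplified record looks roughly like this (the values are illustrative; the fields follow the official annotation description, where hbox, vbox and fbox are the head, visible-region and full-body boxes, each as [x, y, w, h]):

```
{"ID": "example_image_id",
 "gtboxes": [
   {"tag": "person",
    "hbox": [123, 129, 63, 64],
    "vbox": [108, 115, 233, 220],
    "fbox": [108, 115, 233, 411],
    "extra": {"box_id": 0, "occ": 1}}
 ]}
```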
The following script converts the odgt annotations to YOLO format:

```python
import os
import json
from PIL import Image


def load_func(fpath):
    """Load and parse an ODGT file (one JSON record per line)."""
    assert os.path.exists(fpath), f"File not found: {fpath}"
    with open(fpath, 'r') as fid:
        lines = fid.readlines()
    records = [json.loads(line.strip('\n')) for line in lines]
    return records


def convert_crowdhuman_odgt_to_txt(odgt_path, output_dir):
    """
    Convert CrowdHuman ODGT annotations to YOLO-format TXT files.
    Converts box coordinates from [x, y, w, h] to [x_center, y_center, w, h]
    and normalizes them by the image size.
    """
    # Create the output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)

    # Create classes.txt
    with open(os.path.join(output_dir, 'classes.txt'), 'w') as f:
        f.write('person\n')

    # Load ODGT annotations
    bbox_records = load_func(odgt_path)

    # Process each record
    for record in bbox_records:
        # Get image ID
        image_id = record['ID']
        txt_filename = f"{image_id}.txt"
        txt_path = os.path.join(output_dir, txt_filename)

        # Get image size -- adjust this path if your layout differs
        img_path = os.path.join(os.path.dirname(odgt_path), "images",
                                f"{image_id}.jpg")  # assumes an images/ folder next to the ODGT file
        img = Image.open(img_path)
        img_width, img_height = img.size

        with open(txt_path, 'w') as f:
            for bbox in record['gtboxes']:
                # Skip 'mask' labels and boxes flagged as ignore
                if bbox['tag'] == 'mask' or bbox.get('extra', {}).get('ignore', 0) == 1:
                    continue

                # Full-body box coordinates [x, y, w, h]
                x, y, w, h = bbox['fbox']

                # Compute the box center
                x_center = x + w / 2
                y_center = y + h / 2

                # Normalize the coordinates
                x_center /= img_width
                y_center /= img_height
                w /= img_width
                h /= img_height

                # Write in YOLO format: class_id x_center y_center width height
                bbox_str = f"0 {x_center} {y_center} {w} {h}"
                f.write(f"{bbox_str}\n")


def main():
    # The two paths you need to adapt
    odgt_path = './data/val/annotation_val.odgt'  # Replace with your ODGT file path
    output_dir = './data/val/labels'              # Replace with the desired output directory

    try:
        convert_crowdhuman_odgt_to_txt(odgt_path, output_dir)
        print(f"Conversion finished. Label files saved to: {output_dir}")
        print("\nFormat of each txt file:")
        print("class_id x_center y_center width height")
        print("where class_id=0 means the 'person' class")
        print("Note: coordinates are in YOLO format (normalized center + width/height)")
    except Exception as e:
        print(f"Error during conversion: {str(e)}")


if __name__ == "__main__":
    main()
```

You need to set the dataset paths in the main function; and if the images do not live in an images/ folder next to the odgt file, adjust convert_crowdhuman_odgt_to_txt accordingly. (PS: this is where I realized the dataset really was crawled from the web; many images even carry watermarks, and their sizes are all over the place.)
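
A quick way to sanity-check the conversion is to draw the converted boxes back onto an image. A minimal sketch using PIL (the function name and paths below are mine, purely illustrative):

```python
from PIL import Image, ImageDraw

def preview_yolo_label(img_path, txt_path, out_path):
    """Draw YOLO-format boxes back onto the image as a sanity check."""
    img = Image.open(img_path)
    draw = ImageDraw.Draw(img)
    w_img, h_img = img.size
    with open(txt_path) as f:
        for line in f:
            _, xc, yc, w, h = map(float, line.split())
            # De-normalize from (center, size) back to pixel corners
            left, top = (xc - w / 2) * w_img, (yc - h / 2) * h_img
            right, bottom = (xc + w / 2) * w_img, (yc + h / 2) * h_img
            draw.rectangle([left, top, right, bottom], outline='red', width=2)
    img.save(out_path)

preview_yolo_label('./data/val/images/example.jpg',
                   './data/val/labels/example.txt',
                   './preview.jpg')
```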

nuScenes

nuScenes is a public large-scale dataset for autonomous driving, comprising 1000 driving scenes collected in Boston and Singapore. The cameras run at 12 Hz, while the LiDAR runs at 20 Hz.

Dataset Structure

```
- v1.0-mini
    - maps
        four maps (.jpg)
    - samples
        - CAM_BACK
        - CAM_BACK_LEFT
        - LIDAR_TOP
        - RADAR_BACK_LEFT
        ... (the sensors' data)
    - sweeps
        same layout as 'samples'  # transitional or intermediate (non-key) frames
    - v1.0-mini
        labels (the annotation and metadata tables, .json)
```
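
If you work with the official nuscenes-devkit, the mini split laid out above can be loaded directly. A minimal sketch, assuming the devkit is installed (pip install nuscenes-devkit) and that dataroot points at the folder containing maps/, samples/, sweeps/ and v1.0-mini/:

```python
from nuscenes.nuscenes import NuScenes

# dataroot is the directory that holds maps/, samples/, sweeps/, v1.0-mini/
nusc = NuScenes(version='v1.0-mini', dataroot='./data/nuscenes', verbose=True)

nusc.list_scenes()  # prints the scenes in the mini split

# Key frames live under samples/; each sample links one record per sensor
sample = nusc.sample[0]
cam_front = nusc.get('sample_data', sample['data']['CAM_FRONT'])
print(cam_front['filename'])  # e.g. a .jpg under samples/CAM_FRONT/
```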