* We release annotations for the train and val set, but not for the test set.
* We release sensor data for train, val and test set.
* Users apply their method on the test set and submit their results to our evaluation server, which returns the metrics listed below.
* We do not use strata (cf. easy / medium / hard in KITTI). Instead, we filter annotations and predictions beyond class-specific distances.
* Every submission has to provide information on the method and any external / map data used. We encourage publishing code, but do not make it a requirement.
* Top leaderboard entries and their papers will be manually reviewed.
* The maximum time window of past sensor data that may be used is 0.5s.
## Results format
We define a standardized detection results format to allow users to submit results to our evaluation server.
Users need to create a single JSON file for the evaluation set, zip the file and upload it to our evaluation server.
The submission JSON includes a dictionary that maps each sample_token to a list of `sample_result` entries.
```
submission {
    sample_token <str>: [sample_result] -- Maps each sample_token to a list of sample_results.
}
```
For the result box we create a new database table called `sample_result`.
This allows for processing of results and annotations using the same tools.
A `sample_result` is defined as follows:
```
sample_result {
    "sample_token": <str>       -- Foreign key. Identifies the sample/keyframe for which objects are detected.
    "translation": <float> [3]  -- Estimated bounding box location in m in the global frame: center_x, center_y, center_z.
    ...
}
```
## Evaluation metrics

Below we define the metrics for the nuScenes detection task.
Our final score is a weighted sum of mean Average Precision (mAP) and several True Positive (TP) metrics.
### Preprocessing
Before running the evaluation code, the following pre-processing is done on the data (the distance filter is sketched after this list):
* All boxes (gt and prediction) are filtered on class-specific max-distance.
* All bicycle and motorcycle boxes (gt and prediction) that fall inside a bike rack are removed. The reason is that we do not annotate bikes inside bike racks.
* All boxes (gt) without any lidar or radar points in them are removed. The reason is that we cannot guarantee that they are actually visible in the frame. We do not filter the estimated boxes here.
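
As an illustration of the first step, here is a minimal sketch of a class-specific distance filter. The per-class maximum distances and the helper name are hypothetical; the actual values are part of the evaluation configuration.

```
import math

# Hypothetical per-class maximum distances in meters (not the official values).
MAX_DIST = {"car": 50.0, "pedestrian": 40.0, "barrier": 30.0}

def within_range(box_center_xy, ego_xy, class_name, max_dist=MAX_DIST):
    """Keep a box only if its 2D center lies within the class-specific distance of the ego pose."""
    dx = box_center_xy[0] - ego_xy[0]
    dy = box_center_xy[1] - ego_xy[1]
    return math.hypot(dx, dy) <= max_dist[class_name]

# Example: a pedestrian box 45 m from the ego vehicle is filtered out.
print(within_range((45.0, 0.0), (0.0, 0.0), "pedestrian"))  # False
```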
### Average Precision metric
* **mean Average Precision (mAP)**:
We use the well-known Average Precision metric as in KITTI,
but define a match by considering the 2D center distance on the ground plane rather than intersection over union based affinities.
Specifically, we match predictions with the ground truth objects that have the smallest center-distance up to a certain threshold.
For a given match threshold we calculate average precision (AP) by integrating the recall vs precision curve for
recalls and precisions > 0.1. We thus exclude operating points with recall or precision < 0.1 from the calculation.
We finally average over match thresholds of {0.5, 1, 2, 4} meters and compute the mean across classes (a sketch of the AP calculation is shown below).
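
A minimal sketch of this AP calculation is shown below, assuming precision values sampled on a uniform recall grid. It mirrors the description above (drop operating points with recall or precision below 0.1, then normalize so a perfect curve scores 1), not necessarily the exact devkit implementation.

```
import numpy as np

def average_precision(recalls, precisions, min_recall=0.1, min_precision=0.1):
    """Integrate the precision/recall curve, excluding low-recall / low-precision operating points."""
    keep = recalls > min_recall
    # Subtracting and clipping removes the contribution of precisions below the threshold.
    clipped = np.clip(precisions[keep] - min_precision, 0.0, None)
    if clipped.size == 0:
        return 0.0
    return float(np.mean(clipped)) / (1.0 - min_precision)  # normalize to [0, 1]

# Example on a synthetic, monotonically decreasing precision curve.
rec = np.linspace(0.0, 1.0, 101)
prec = 1.0 - rec
print(round(average_precision(rec, prec), 3))
```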
### True Positive errors
Here we define metrics for a set of true positives (TP) that measure translation / scale / orientation / velocity and attribute errors.
All true positive metrics use a fixed matching threshold of 2m center distance, and the matching and scoring happen independently per class.
The metric is averaged over the same recall thresholds as for mAP.
If a recall value > 0.1 is not achieved, the TP error for that class is set to 1.
Finally we compute the mean over classes.
* **mean Average Translation Error (mATE)**: For each match we compute the translation error as the Euclidean center distance in 2D in meters.
* **mean Average Scale Error (mASE)**: For each match we compute the scale error as *1 - IOU*, where the 3D IOU is calculated after aligning orientation and translation.
* **mean Average Orientation Error (mAOE)**: For each match we compute the orientation error as the smallest yaw angle difference between prediction and ground truth in radians. Orientation error is evaluated at 360 degrees for all classes except barriers, where it is only evaluated at 180 degrees. Orientation errors for cones are ignored.
* **mean Average Velocity Error (mAVE)**: For each match we compute the absolute velocity error as the L2 norm of the velocity differences in 2D in m/s. Velocity errors for barriers and cones are ignored.
* **mean Average Attribute Error (mAAE)**: For each match we compute the attribute error as *1 - acc*, where acc is the attribute classification accuracy over all the relevant attributes of the ground-truth class. Attribute errors for barriers and cones are ignored.
All errors are >= 0. Note that the translation and velocity errors are unbounded and can take any positive value (two of the per-match errors are sketched below).
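
For illustration, here is a minimal sketch of two of the per-match errors under the definitions above; the helper names are hypothetical and not the devkit API.

```
import numpy as np

def translation_error(gt_center_xy, pred_center_xy):
    """ATE contribution of one match: Euclidean center distance in 2D, in meters."""
    return float(np.linalg.norm(np.asarray(gt_center_xy) - np.asarray(pred_center_xy)))

def orientation_error(gt_yaw, pred_yaw, period=2 * np.pi):
    """AOE contribution of one match: smallest yaw difference in radians.
    Use period=np.pi for barriers (180 degrees); cones are skipped entirely."""
    diff = (pred_yaw - gt_yaw) % period
    return float(min(diff, period - diff))

print(translation_error((1.0, 2.0), (1.5, 2.0)))             # 0.5 m
print(round(orientation_error(0.0, np.pi + 0.1), 3))         # ~3.042 rad at 360 degrees
print(round(orientation_error(0.0, np.pi + 0.1, np.pi), 3))  # ~0.1 rad at 180 degrees
```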
### Weighted sum metric
* **Weighted sum**: We compute the weighted sum of the above metrics: mAP, mATE, mASE, mAOE, mAVE and mAAE.
As a first step we convert the TP errors to TP scores as *x_score = max(1 - x_err, 0.0)*.
We then assign a weight of *5* to mAP and *1* to each of the 5 TP scores and calculate the normalized sum, as sketched below.
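
A minimal sketch of this weighted sum, with hypothetical metric values:

```
def weighted_sum(mean_ap, tp_errors):
    """Convert TP errors to scores, weight mAP by 5 and each TP score by 1, and normalize."""
    tp_scores = [max(1.0 - err, 0.0) for err in tp_errors.values()]
    return (5.0 * mean_ap + sum(tp_scores)) / (5.0 + len(tp_scores))

# Hypothetical submission: mAP of 0.45 and the five TP errors.
errors = {"mATE": 0.3, "mASE": 0.4, "mAOE": 0.5, "mAVE": 1.2, "mAAE": 0.2}
print(round(weighted_sum(0.45, errors), 3))  # 0.485
```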
## Leaderboard & challenge tracks
Compared to other datasets and challenges, nuScenes will have a single leaderboard for the detection task.
For each submission the leaderboard will list method aspects and evaluation metrics.
Method aspects include input modalities (lidar, radar, vision), use of map data and use of external data.
To enable a fair comparison between methods, the user will be able to filter the methods by method aspects.
We define three such filters here.
These filters correspond to the tracks in the nuScenes detection challenge.
Methods will be compared within these tracks and the winners will be decided for each track separately:
* **LIDAR detection track**:
This track allows only lidar sensor data as input.
No external data or map data is allowed. The only exception is that ImageNet may be used for pre-training (initialization).
* **VISION detection track**:
This track allows only camera sensor data (images) as input.
No external data or map data is allowed. The only exception is that ImageNet may be used for pre-training (initialization).
* **OPEN detection track**:
This is where users can go wild.
We allow any combination of sensors, map and external data as long as these are reported.