Welcome to evaluation of CNN design choises performance on ImageNet-2012. Here you can find prototxt's of tested nets and full train logs.
upd.: Here is technical report version of this benchmark
If you use results from this benchmark, please cite
@Article{CaffeNetBench2017,
Title = {Systematic evaluation of convolution neural network advances on the Imagenet },
Author = {Dmytro Mishkin and Nikolay Sergievskiy and Jiri Matas},
Journal = {Computer Vision and Image Understanding },
Year = {2017},
Doi = {https://doi.org/10.1016/j.cviu.2017.05.007},
ISSN = {1077-3142},
Keywords = {CNN},
Url = {http://www.sciencedirect.com/science/article/pii/S1077314217300814}
}
**upd2.: Some of the pretrained models are in Releases section. They are licensed for unrestricted use.
***upd3.: Nice paper on noise sensitiveness: Fine-grained Recognition in the Noisy Wild: Sensitivity Analysis of Convolutional Neural Networks Approaches
The basic architecture is similar to CaffeNet, but has several differences:
- Images are resized to small side = 128 for speed reasons. Therefore pool5 spatial size is 3x3 instead of 6x6.
- fc6 and fc7 layers have 2048 neurons instead of 4096.
- Networks are initialized with LSUV-init (code)
- Because LRN layers add nothing to accuracy (validated here), they were removed for speed reasons in most experiments.
Taking into account Neural Network Training Variations in Speech and Subsequent Performance Evaluation, results can vary from run to run (data order is the same, but random seeds are different). However, I haven`t experienced results difference for several CaffeNet-ReLU training runs.
On-going evaluations with graphs:
- activations
- pooling
- solvers
- lr_policy
- architectures
- First layer parameters
- Conv1 depth
- classfier architectures
- augmentation
- batchnorm
- colorspace
- regularization
- resnets, not yet successfull
- batch size
- dataset size
- Network width
- other mix
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
ReLU | 0.470 | 2.36 | With LRN layers |
ReLU | 0.471 | 2.36 | No LRN, as in rest |
TanH | 0.401 | 2.78 | |
1.73TanH(2x/3) | 0.423 | 2.66 | As recommended in Efficient BackProp, LeCun98 |
ArcSinH | 0.417 | 2.71 | |
VLReLU | 0.469 | 2.40 | y=max(x,x/3) |
RReLU | 0.478 | 2.32 | |
Maxout | 0.482 | 2.30 | sqrt(2) narrower layers, 2 pieces. Same complexity, as for ReLU |
Maxout | 0.517 | 2.12 | same width layers, 2 pieces |
PReLU | 0.485 | 2.29 | |
ELU | 0.488 | 2.28 | alpha=1, as in paper |
ELU | 0.485 | 2.29 | alpha=0.5 |
(ELU+LReLU) / 2 | 0.486 | 2.28 | alpha=1, slope=0.05 |
SELU = Scaled ELU | 0.470 | 2.38 | 1.05070 * ELU(x,alpha = 1.6732) |
FReLU = ReLU + (learned) bias | 0.488 | 2.27 | |
[FELU = ELU + (learned) bias] | 0.489 | 2.28 | |
Shifted Softplus | 0.486 | 2.29 | Shifted BNLL aka softplus, y = log(1 + exp(x)) - log(2). Same as ELU, as expected |
No, with max pooling | 0.389 | 2.93 | No non-linearity |
No, no max pooling | 0.035 | 6.28 | No non-linearity, strided convolution |
APL2 | 0.471 | 2.38 | 2 linear pieces. Unlike other activations, current author`s implementation leads to different parameters for each x,y position of neuron |
APL5 | 0.465 | 2.39 | 5 linear pieces. Unlike other activations, current author`s implementation leads to different parameters for each x,y position of neuron |
ConvReLU,FCMaxout2 | 0.490 | 2.26 | ReLU in convolution, Maxout (sqrt(2) narrower) 2 pieces in FC. Inspired by kaggle and INVESTIGATION OF MAXOUT NETWORKS FOR SPEECH RECOGNITION* |
ConvELU,FCMaxout2 | 0.499 | 2.22 | ELU in convolution, Maxout (sqrt(2) narrower) 2 pieces in FC. |
The above analyses show that the bottom layers seem to waste a large portion of the additional parametrisation (figure 2 (a,e)) thus could be replaced, for example, by smaller ReLU layers. Similarly, maxout units in higher layers seem to use piecewise-linear components in a more active way suggesting the use of larger pools._
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
MaxPool | 0.471 | 2.36 | |
Stochastic | 0.438 | 2.54 | Underfitting, may be try without Dropout |
Stochastic, no dropout | 0.429 | 2.96 | Stoch pool does not prevent overfitting without dropout :(. Good start,bad finish |
AvgPool | 0.435 | 2.56 | |
Max+AvgPool | 0.483 | 2.29 | Element-wise sum |
NoPool | 0.472 | 2.35 | Strided conv2,conv3,conv4 |
General | - | - | Depends on arch, click for details |
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
MaxPool 3x3/2 | 0.471 | 2.36 | default alexnet |
MaxPool 2x2/2 | 0.484 | 2.29 | Leads to larger feature map, Pool5=4x4 instead of 3x3 |
MaxPool 3x3/2 pad1 | 0.488 | 2.25 | Leads to even larger feature map, Pool5=5x5 instead of 3x3 |
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
Default ReLU | 0.470 | 2.36 | fc6 = conv 3x3x2048 -> fc7 2048 -> 1000 fc8 |
Conv5-fc6=2048C3_2048C1_clf_avg | 0.494 | 2.34 | no pool5 -> fc6 = conv 3x3x2048 -> fc7=conv 1x1x2048 -> fc8 as 1x1 conv -> ave_pool. |
Pool5-fc6=2048C3_2048C1_avg_clf | 0.489 | 2.28 | no pool5 -> fc6 = conv 3x3x2048 -> fc7=conv 1x1x2048 -> ave_pool -> fc8 |
SPP2-FC-FC | 0.471 | 2.36 | pool5 = SPP with 2 levels (2x2 and 1x1) -> FC6 -> FC7 |
SPP3-FC-FC | 0.483 | 2.30 | pool5 = SPP with 3 levels (3x3 and 2x2 and 1x1) -> FC6 -> FC7 |
fc6=512C3_1024C3_1536C1 | 0.482 | 2.52 | pool5 zero pad -> fc6 = conv 3x3x512 -> fc7=conv 3x3x1024 -> 1x1x1536 -> fc8 as 1x1 conv -> ave_pool. |
fc6=512C3_1024C3_1536C1_drop | 0.491 | 2.29 | pool5 zero pad -> fc6 = conv 3x3x512 -> fc7=conv 3x3x1024 -> drop 0.3 -> 1x1x1536 -> drop 0.5-> fc8 as 1x1 conv -> ave_pool. |
Default ReLU, 4096 | 0.497 | 2.24 | fc6 = conv 3x3x4096 -> fc7 4096 -> 1000 fc8 == original caffenet |
pool5pad following nets mistakenly were trained with ELU non-linearity instead of default ReLU
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
Default ELU | 0.488 | 2.28 | fc6 = conv 3x3x2048 -> fc7 2048 -> 1000 fc8 |
pool5pad_fc6ave | 0.481 | 2.32 | pool5 zero pad -> fc6 = conv 3x3x2048 -> AvePool -> as usual |
pool5pad_fc6ave_fc7as1x1fc8ave | 0.511 | 2.21 | pool5 zero pad -> fc6 = conv 3x3x2048 -> fc7 as 1x1 conv -> fc8 as 1x1 conv -> ave_pool. |
pool5pad_fc6ave_fc7as1x1avefc8 | 0.508 | 2.22 | pool5 zero pad -> fc6 = conv 3x3x2048 -> fc7 as 1x1 conv -> ave_pool -> fc8 |
pool5pad_fc6ave_fc7as1x1_avemax_fc8 | 0.509 | 2.19 | pool5 zero pad -> fc6 = conv 3x3x2048 -> fc7 as 1x1 conv -> fc8 as 1x1 conv -> ave_pool + max_pool. |
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
Default, 128_K11_S4 | 0.471 | 2.36 | Input size =128x128px, conv1 = 11x11x96, stride = 4 |
224_K11_S8 | 0.453 | 2.45 | Input size =256x256px, conv1 = 11x11x96, stride = 8. Not finished yet |
160_K11_S5 | 0.470 | 2.35 | Input size =160x160px, conv1 = 11x11x96, stride = 5 |
96_K7_S3 | 0.459 | 2.43 | Input size =96x96px, conv1 = 7x7x96, stride = 3 |
64_K5_S2 | 0.445 | 2.50 | Input size =64x64px, conv1 = 5x5x96, stride = 2 |
32_K3_S1 | 0.390 | 2.84 | Input size =32x32px, conv1 = 3x3x96, stride = 1 |
4x slower, 227_K11_S4 | 0.565 | 1.87 | Input size = 227x227px, conv1 = 11x11x96, stride = 4, Not finished yet |
For example, for using activations in image retrieval.
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
pool5pad_fc6ave_fc7as1x1fc8ave | 0.508 | 2.22 | Baseline. pool5 zero pad -> fc6 = conv 3x3x2048 -> fc7 as 1x1 conv -> ave_pool -> fc8 as 1x1 conv. |
pool5pad_fc6ave_fc7as1x1=512_fc8ave | 0.489 | 2.30 | fc7 as 1x1 conv = 512 |
pool5pad_fc6ave_fc7as1x1_bottleneck=512_fc8ave | 0.490 | 2.28 | fc7 as 1x1 conv = 2048 then fc7a = 512 |
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
SGD with momentum | 0.471 | 2.36 | |
Nesterov | 0.473 | 2.34 | |
RMSProp | 0.327 | 3.20 | rms_decay=0.9, delta=1.0 |
RMSProp | 0.453 | 2.45 | rms_decay=0.9, delta=1.0, base_lr: 0.045, stepsize=10K. gamma=0.94 (from here) |
RMSProp | 0.451 | 2.43 | rms_decay=0.9, delta=1.0, base_lr: 0.1, stepsize=10K. gamma=0.94 |
RMSProp | 0.472 | 2.36 | rms_decay=0.9, delta=1.0, base_lr: 0.1, stepsize=5K. gamma=0.94 |
RMSProp | 0.486 | 2.28 | rms_decay=0.9, delta=1.0, lr=0.1, linear lr_policy |
SGD with momentum, linear | 0.493 | 2.24 | linear lr_policy |
Not converge at all:
ADAM: lr=0.001 m=0.9 m2=0.999 delta=1e-8 lr=0.001 m=0.95 m2=0.999 delta=1e-8 lr=0.001 m=0.95 m2=0.999 delta=1e-7 lr=0.01 m=0.9 m2=0.999 delta=1e-8 lr=0.01 m=0.9 m2=0.999 delta=1e-7 lr=0.01 m=0.9 m2=0.999 delta=1e-9 lr=0.01 m=0.9 m2=0.99 delta=1e-8 lr=0.01 m=0.9 m2=0.999 delta=1e-8 lr=0.01 m=0.95 m2=0.999 delta=1e-9
AdaDelta: delta: 1e-5
RMSProp, lr=0.01, rms_decay=0.5 lr=0.01, rms_decay=0.9 lr=0.01, rms_decay=0.95 lr=0.01, rms_decay=0.98 lr=0.001, rms_decay=0.9 lr=0.001, rms_decay=0.98
Converge, but much worse that SGD: Adagrad, lr=0.01, lr=0.02 AdaDelta: delta: 1e-6, delta: 1e-7, delta: 1e-8 RMSProp, lr=0.01, rms_decay=0.99
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
Step 100K | 0.471 | 2.36 | Default caffenet solver, max_iter=320K |
Poly lr, p=0.5, sqrt | 0.483 | 2.29 | bvlc_quick_googlenet_solver, All the way worse than "step", leading at finish |
Poly lr, p=2.0, sqr | 0.483 | 2.299 | |
Poly lr, p=1.0, linear | 0.493 | 2.24 | |
Poly lr, p=1.0, linear | 0.466 | 2.39 | max_iter=160K |
Exp, 0.035 | 0.441 | 2.53 | max_iter=160K, stepsize=2K, gamma=0.915, same as in base_dereyly |
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
Step 100K | 0.527 | 2.09 | Default caffenet solver, max_iter=320K |
Poly lr, p=1.0, linear | 0.496 | 2.24 | max_iter=105K, |
Poly lr, p=1.0, start_lr=0.02 | 0.505 | 2.21 | max_iter=105K |
Exp, 0.035 | 0.506 | 2.19 | max_iter=160K, stepsize=2K, gamma=0.915, same as in base_dereyly |
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
default | 0.471 | 2.36 | weight_decay=0.0005, L2, fc-dropout=0.5 |
wd=0.0001 | 0.450 | 2.48 | weight_decay=0.0001, L2, fc-dropout=0.5 |
wd=0.00001 | 0.450 | 2.48 | weight_decay=0.00001, L2, fc-dropout=0.5 |
wd=0.00001_L1 | 0.453 | 2.45 | weight_decay=0.00001, L1, fc-dropout=0.5 |
drop=0.3 | 0.497 | 2.25 | weight_decay=0.0005, L2, fc-dropout=0.3 |
drop=0.2 | 0.494 | 2.28 | weight_decay=0.0005, L2, fc-dropout=0.2 |
drop=0.1 | 0.473 | 2.45 | weight_decay=0.0005, L2, fc-dropout=0.1. Same acc, as in 0.5, but bigger logloss |
Hypothesis about "same effective neurons = same performance" looks unvalidated
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
fc6,fc7=2048, dropout=0.5 | 0.471 | 2.36 | default |
fc6,fc7=2048, dropout=0.3 | 0.497 | 2.25 | best for fc6,fc7=2048. (1-0.3)*2048=1433 neurons work each time |
fc6,fc7=4096, dropout=0.65 | 0.465 | 2.38 | (1-0.65)*4096=1433 neurons work each time |
fc6,fc7=6144, dropout=0.77 | 0.447 | 2.48 | (1-0.77)*6144=1433 neurons work each time |
fc6,fc7=4096, dropout=0.5 | 0.497 | 2.24 | |
fc6,fc7=1433, dropout=0 | 0.456 | 2.52 |
CaffeNet only
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
CaffeNet256 | 0.565 | 1.87 | Reference BVLC model, LSUV init |
CaffeNet128 | 0.470 | 2.36 | Pool5 = 3x3 |
CaffeNet128_4096 | 0.497 | 2.24 | Pool5 = 3x3, fc6-fc7=4096 |
CaffeNet128All | 0.530 | 2.05 | All improvements without caffenet arch change: ELU + SPP + color_trans3-10-3 + Nesterov+ (AVE+MAX) Pool + linear lr_policy |
+ 0.06 | Gain over vanilla caffenet128. "Sum of gains" = 0.018 + 0.013 + 0.015 + 0.003 + 0.013 + 0.023 = 0.085 | ||
SqueezeNet128 | 0.530 | 2.08 | Reference solver, but linear lr_policy and batch_size=256 (320K iters). WITHOUT tricks like ELU, SPP, AVE+MAX, etc. |
SqueezeNet128 | 0.547 | 2.08 | New SqueezeNet solver. WITHOUT tricks like ELU, SPP, AVE+MAX, etc. |
SqueezeNet224 | 0.592 | 1.80 | New SqueezeNet solver. WITHOUT tricks like ELU, SPP, AVE+MAX, etc., 2 GPU |
CaffeNet256All | 0.613 | 1.64 | All improvements without caffenet arch change: ELU + SPP + color_trans3-10-3 + Nesterov+ (AVE+MAX) Pool + linear lr_policy |
CaffeNet128, no pad | 0.411 | 2.70 | No padding, but conv1 stride=2 instead of 4 to keep size of pool5 the same |
CaffeNet128, dropout in conv | 0.426 | 2.60 | Dropout before pool2=0.1, after conv3 = 0.1, after conv4 = 0.2 |
CaffeNet128SPP | 0.483 | 2.30 | SPP= 3x3 + 2x2 + 1x1 |
DarkNet128BN | 0.502 | 2.25 | 16C3->MP2->32C3->MP2->64C3->MP2->128C3->MP2->256C3->MP2->512C3->MP2->1024C3->1000CLF.BN |
+ PreLU + base_lr=0.035, exp lr_policy, 160K iters | |||
NiN128 | 0.519 | 2.15 | Step lr_policy. Be carefull to not use dropout on maxpool in-place |
Others
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
DarkNetBN | 0.502 | 2.25 | 16C3->MP2->32C3->MP2->64C3->MP2->128C3->MP2->256C3->MP2->512C3->MP2->1024C3->1000CLF.BN |
HeNet2x2 | 0.561 | 1.88 | No SPP, Pool5 = 3x3, VLReLU, J' from paper |
HeNet3x1 | 0.560 | 1.88 | No SPP, Pool5 = 3x3, VLReLU, J' from paper, 2x2->3x1 |
GoogLeNet128 | 0.619 | 1.61 | linear lr_policy, batch_size=256. obviously slower than caffenet |
[GoogLeNet128_BN_lim0606][https://github.com/lim0606/caffe-googlenet-bn] | 0.645 | 1.54 | BN before ReLU + scale bias, linear LR, batch_size = 128, base_lr = 0.005, 640K iter, LSUV init.!!!! 5x5 replaced by two 3x3, no in-place |
GoogLeNet128Res | 0.634 | 1.56 | linear lr_policy, batch_size=256. Resudial connections between inception block. No BN |
GoogLeNet128Res_color | 0.638 | 1.52 | linear lr_policy, batch_size=256. Resudial connections between inception block. No BN. + color_trans3-10-3 |
googlenet_loss2_clf | 0.571 | 1.80 | from net above, aux classifier after inception_4d |
googlenet_loss1_clf | 0.520 | 2.06 | from net above, aux classifier after inception_4a |
fitnet1_elu | 0.333 | 3.21 | |
VGGNet16_128 | 0.651 | 1.46 | Surprisingly much better that GoogLeNet128, even with step-based solver. |
VGGNet16_128_All | 0.682 | 1.47 | ELU (a=0.5. a=1 leads to divergence :( ), avg+max pool, color conversion, linear lr_policy |
ResNet attempts are moved to ResNets.md
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
ResNet-50ELU-2xThinner | 0.616 | 1.63 | Without BN, ELU, dropout=0.2 before classifier. 2x thinner, than in paper. Quite fast. No large overfitting (unlike upper table) |
GoogLeNet-128 | 0.619 | 1.61 | For reference. linear lr_policy, batch_size=256. |
GoogLeNet128Res | 0.634 | 1.56 | linear lr_policy, batch_size=256. Resudial connections between inception block. No BN |
VggLikeResNet-50-ELU-RoR-var | 0.626 | 1.59 | Step LR policy, max_iter = 200K, no BN, 4x thinner than VGG, Residual on residual . |
VggLikeResNet-50-ELU | 0.632 | 1.57 | Step LR policy, max_iter = 200K, no BN, 4x thinner than VGG. More RoR . |
VggLikeResNet-50-ELU-RoR 1x5 | 0.628 | 1.58 | Step LR policy, max_iter = 200K, no BN, 4x thinner than VGG. 1x5 layers |
VggLikeResNet-50-ELU-RoR 1x3 | 0.631 | 1.58 | Step LR policy, max_iter = 200K, no BN, 4x thinner than VGG . |
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
Default | 0.471 | 2.36 | Random flip, random crop 128x128 from 144xN, N > 144 |
Drop 0.1 | 0.306 | 3.56 | + Input dropout 10%. not finished, 186K iters result |
Multiscale | 0.462 | 2.40 | Random flip, random crop 128x128 from ( 144xN, - 50%, 188xN - 20%, 256xN - 20%, 130xN - 10%) |
5 deg rot | 0.448 | 2.47 | Random rotation to [0..5] degrees. |
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
RGB | 0.471 | 2.36 | default, no changes. Input = 0.04 * (Img - [104, 117,124]) |
RGB_by_BN | 0.469 | 2.38 | Input = BatchNorm(Img) |
CLAHE | 0.467 | 2.38 | RGB -> LAB -> CLAHE(L)->RGB->BatchNorm(RGB) |
HISTEQ | 0.448 | 2.48 | RGB -> HiestEq |
YCrCb | 0.458 | 2.42 | RGB->YCrCb->BatchNorm(YCrCb) |
HSV | 0.451 | 2.46 | RGB->HSV->BatchNorm(HSV) |
Lab | - | - | Doesn`t leave 6.90 loss after 1.5K iters |
RGB->10->3 TanH | 0.463 | 2.40 | RGB -> conv1x1x10 tanh -> conv1x1x3 tanh |
RGB->10->3 VlReLU | 0.485 | 2.28 | RGB -> conv1x1x10 vlrelu -> conv1x1x3 vlrelu |
RGB->10->3 Maxout | 0.488 | 2.26 | RGB -> conv1x1x10 maxout(2) -> conv1x1x3 maxout(2) |
RGB->16->3 VlReLU | 0.483 | 2.30 | RGB -> conv1x1x16 vlrelu -> conv1x1x3 vlrelu |
RGB->3->3 VlReLU | 0.480 | 2.32 | RGB -> conv1x1x3 vlrelu -> conv1x1x3 vlrelu |
RGB->10->3 VlReLU->sum(RGB) | 0.482 | 2.30 | RGB -> conv1x1x10 vlrelu -> conv1x1x3 -> sum(RGB) ->vlrelu |
RGB and log(RGB)->10->3 VlReLU | 0.482 | 2.29 | RGB and log (RGB) -> conv1x1x10 vlrelu -> conv1x1x3 vlrelu |
RGB and log(RGB) and log (256-RGB)->10->3 VlReLU | 0.484 | 2.29 | RGB and log (RGB) and log (256 - RGB) -> conv1x1x10 vlrelu -> conv1x1x3 vlrelu |
NN-Scale | 0.467 | 2.38 | Nearest neightbor instead of linear interpolation for rescale. Faster, but worse :( |
concat_rgb_each_pool | 0.441 | 2.51 | Concat avepoolRGB with each pool |
OpenCV RGB2Gray | 0.413 | 2.70 | RGB->Grayscale Gray = 0.299 R + 0.587 G + 0.114 B |
Learned RGB2Gray | 0.419 | 2.66 | RGB->conv1x1x1. Gray = -1.779 *R + 6.511 * G + 1.493 *B + 3.279 |
BN-paper, caffe-PR Note, that results are obtained without mentioned in paper y=kx+b additional layer.
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
Before | 0.474 | 2.35 | As in paper |
Before + scale&bias layer | 0.478 | 2.33 | As in paper |
After | 0.499 | 2.21 | |
After + scale&bias layer | 0.493 | 2.24 |
So in all next experiments, BN is put after non-linearity
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
ReLU | 0.499 | 2.21 | |
RReLU | 0.500 | 2.20 | |
PReLU | 0.503 | 2.19 | |
ELU | 0.498 | 2.23 | |
Maxout | 0.487 | 2.28 | |
Sigmoid | 0.475 | 2.35 | |
TanH | 0.448 | 2.50 | |
No | 0.384 | 2.96 |
ReLU non-linearity, fc6 and fc7 layer only
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
Dropout = 0.5 | 0.499 | 2.21 | |
Dropout = 0.2 | 0.527 | 2.09 | |
Dropout = 0 | 0.513 | 2.19 |
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
Caffenet | 0.471 | 2.36 | |
Caffenet BN Before + scale&bias layer LSUV | 0.478 | 2.33 | |
Caffenet BN Before + scale&bias layer Ortho | 0.482 | 2.31 | |
Caffenet BN After LSUV | 0.499 | 2.21 | |
Caffenet BN After Ortho | 0.500 | 2.20 |
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
GoogLeNet128 | 0.619 | 1.61 | |
GoogLeNet BN Before + scale&bias layer LSUV | 0.603 | 1.68 | |
GoogLeNet BN Before + scale&bias layer Ortho | 0.607 | 1.67 | |
GoogLeNet BN After LSUV | 0.596 | 1.70 | |
GoogLeNet BN After Ortho | 0.584 | 1.77 | |
[GoogLeNet128_BN_lim0606][https://github.com/lim0606/caffe-googlenet-bn] | 0.645 | 1.54 | BN before ReLU + scale bias, linear LR, batch_size = 128, base_lr = 0.005, 640K iter, LSUV init, 5x5 replaced with 3x3 + 3x3. 3x3 replaced with 3x1+1x3 |
Tanh results are moved [here] (https://github.com/ducha-aiki/caffenet-benchmark/blob/master/BatchSize.md)
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
BS=1024, 4xlr | 0.465 | 2.38 | lr=0.04, 80K iters |
BS=1024 | 0.419 | 2.65 | lr=0.01, 80K iters |
BS=512, 2xlr | 0.469 | 2.37 | lr=0.02, 160K iters |
BS=512 | 0.455 | 2.46 | lr=0.01, 160K iters |
BS=256, default | 0.471 | 2.36 | lr=0.01, 320K iters |
BS=128 | 0.472 | 2.35 | lr=0.01, 640K iters |
BS=128, 1/2 lr | 0.470 | 2.36 | lr=0.005, 640K iters |
BS=64 | 0.471 | 2.34 | lr=0.01, 1280K iters |
BS=64, 1/4 lr | 0.475 | 2.34 | lr=0.0025, 1280K iters |
BS=32 | 0.463 | 2.40 | lr=0.01, 2560K iter |
BS=32, 1/8 lr | 0.470 | 2.37 | lr=0.00125, 2560K iter |
BS=1, 1/256 lr | 0.474 | 2.35 | lr=3.9063e-05, 81920K iter. Online training |
So general recommendation: too big batch_sizes leads to a bit inferior results, but in general batch_size should be selected based computation speed. If learning rate is adjusted, than no practial differenc e between different batch sizes.
Base net is caffenet+BN+ReLU+drop=0.2 There difference in filters (main, 5x5 -> 3x3 + 3x3 or 1x5+5x1) and solver.
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
Base | 0.527 | 2.09 | |
Base_dereyly_lr, noBN, ReLU | 0.441 | 2.53 | max_iter=160K, stepsize=2K, gamma=0.915, but default caffenet |
Base_dereyly 5x1, noBN, ReLU | 0.474 | 2.31 | 5x5->1x5+5x1 |
Base_dereyly_PReLU | 0.550 | 1.93 | BN, PreLU + base_lr=0.035, exp lr_policy, 160K iters, 5x5->3x3+3x3 |
Base_dereyly 3x1 | 0.553 | 1.92 | PreLU + base_lr=0.035, exp lr_policy, 160K iters, 5x5->1x3+1x3+3x1+1x3 |
Base_dereyly 3x1 scale aug | 0.530 | 2.04 | Same as previous, img: 128 crop from (128...300)px image, test resize to 144, crop 128 |
Base_dereyly 3x1 scale aug | 0.512 | 2.17 | Same as previous, img: 128 crop from (128...300)px image, test resize to (128+300)/2, crop 128 |
Base_dereyly 3x1->5x1 | 0.546 | 1.97* | PreLU + base_lr=0.035, exp lr_policy, 160K iters, 5x5->1x5+1x5+5x1+1x5 |
Base_dereyly 3x1,halfBN | 0.544 | 1.95 | PreLU + base_lr=0.035, exp lr_policy, 160K iters,5x5->1x3+1x3+3x1+1x3, BN only for pool and fc6 |
Base_dereyly 5x1 | 0.540 | 2.00 | PreLU + base_lr=0.035, exp lr_policy, 160K iters, 5x5->1x5+5x1 |
DarkNetBN | 0.502 | 2.25 | 16C3->MP2->32C3->MP2->64C3->MP2->128C3->MP2->256C3->MP2->512C3->MP2->1024C3->1000CLF.BN |
+ PreLU + base_lr=0.035, exp lr_policy, 160K iters |
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
VGG-Like | 0.521 | 2.14 | 1st layer = 7x7 stride 2, unlike VGG. All other layer = 1/2 VGG width |
VGG-LikeRes | 0.576 | 1.83 | with residual connections, no BN |
VGG-LikeResDrop | 0.568 | 1.91 | with residual connections, no BN , dropout in conv |
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
4sqrt(2)x wider | 0.565 | 1.96 | Start overfitting |
4x wider | 0.563 | 1.92 | Still no overfitting %) |
2sqrt(2)x wider | 0.552 | 1.94 | |
2 wider | 0.533 | 2.04 | |
sqrt(2) wider | 0.506 | 2.17 | |
Default | 0.471 | 2.36 | |
sqrt(2)x narrower | 0.460 | 2.41 | |
2x narrower | 0.416 | 2.68 | |
2sqrt(2)x narrower | 0.340 | 3.11 | no group conv |
2sqrt(2)x narrower | 0.318 | 3.25 | |
4x narrower | 0.256 | 3.33 |
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
Default, 1.2M images | 0.471 | 2.36 | |
800K images | 0.438 | 2.54 | |
600K images | 0.425 | 2.63 | |
400K images | 0.393 | 2.92 | |
200K images | 0.305 | 4.04 |
Or why input var=1 for LSUV is so important
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
800K images | 0.438 | 2.54 | |
600K images | 0.425 | 2.63 | |
600K images, no scale | 0.379 | 2.92 | |
400K images | 0.393 | 2.92 | |
400K images, no scale | 0.357 | 3.10 | |
200K images | 0.305 | 4.04 | |
200K images, no scale | 0.277 | 4.06 |
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
64x64 | 0.309 | 3.34 | |
96x96 | 0.414 | 2.69 | |
128x128 | 0.471 | 2.36 | |
180x180 | 0.521 | 2.10 | |
224x224 | 0.565 | 1.87 | |
300x300 | 0.559 | 2.03 | In progress, results for 115K |
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
Default, clean labels | 0.471 | 2.36 | |
5% incorrect labels | 0.458 | 2.45 | |
10% incorrect labels | 0.447 | 2.58 | |
15% incorrect labels | 0.437 | 2.69 | |
50% incorrect labels | 0.347 | 3.44 |
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
Default, no 1x1 or 3x3 | 0.471 | 2.36 | conv1 -> pool1 |
+ 1x1x96 NiN | 0.490 | 2.24 | conv1 -> 96C1 -> pool1 |
+ 3x (1x1x96 NiN) | 0.509 | 2.10 | conv1 -> 3x(96C1) -> pool1 |
+ 5x (1x1x96 NiN) | 0.514 | 2.11 | conv1 -> 5x(96C1) -> pool1 |
+ 7x (1x1x96 NiN) | 0.514 | 2.11 | conv1 -> 7x(96C1) -> pool1 |
+ 9x (1x1x96 NiN) | 0.516 | 2.10 | conv1 -> 9x(96C1) -> pool1 |
+ 9x (1x1x96 NiN)R | 0.509 | 2.13 | conv1 -> Residual9x(96C1) -> pool1. 276k iters |
+ 1x (3x3x96 NiN) | 0.500 | 2.19 | conv1 -> 1x(96C3) -> pool1 |
+ 3x (3x3x96 NiN) | 0.538 | 1.99 | conv1 -> 1x(96C3) -> pool1 |
+ 5x (3x3x96 NiN) | 0.551 | 1.91 | conv1 -> 1x(96C3) -> pool1 |
ReLU non-linearity, fc6 and fc7 layer only
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
Default | 0.471 | 2.36 | bias lr_rate = 2x weights lr_rate |
1x | 0.470 | 2.37 | bias lr_rate = 1x weights lr_rate |
5x | 0.472 | 2.35 | bias lr_rate = 5x weights lr_rate |
NoBias | 0.445 | 2.50 | Biases initialized with zeros, lr_rate = 0 |
The PRs with test are welcomed
P.S. Logs are merged from lots of "save-resume", because were trained at nights, so plot "Anything vs. seconds" will give weird results.