An AlexNet finetuning report in TensorFlow
Yes, I still play with ancient NNs
OMG, how can the implementations of padding and LRN differ between TensorFlow and Caffe?
In short, code is here
The de facto standard in domain adaptation is finetuning AlexNet on Office-31. (Update 2019: this is no longer the case. Report Resnet-tp results on Office and VisDA instead…) A trivial baseline is to finetune AlexNet on source-only data and see how well it performs on the target domain. On the Amazon2Webcam task, the reported baseline ranges from 60% to 64%, depending on the choice of learning rate schedule, dropout, and trainable layers. Almost all models in domain adaptation are published in Caffe, but I decided to run the baselines for my thesis in TensorFlow. This seems a pretty simple task by any means, but it turned out to kill hours of my life figuring out how to do it properly in TensorFlow. (Of course there are some Caffe2Tensorflow traps; otherwise why would all DA researchers stick to Caffe?)
The pre-trained AlexNet model was originally trained in Caffe. Apparently nobody wants to train it from scratch on another platform, cuz you don’t want to burn GPU on an AlexNet when you have ResNet. So people convert the Caffe model to TensorFlow/PyTorch/… and there are tools like Caffe-Tensorflow that help you. So, you get the converted model and then check the necessary layers. Hmmmm, AlexNet, easy: Conv, Pool, FC, LRN, Dropout. All shipped with TF. I copied the parameters from the Caffe prototxt file, wrote 50 lines of code, ran it on the shiny DGX, pretty confident that everything would be fine. Oooops, the result is only 54%, 6-10% lower than the reported numbers.
Previously, when I was trying to reproduce ResNet50 results, I heard some rumors from Ms. GitHub saying that padding in Caffe is different from TensorFlow, but I never investigated further. Immediately I thought that might be the cause. After reading this post with lovely illustrations, I was pretty sure that everything was under control and the difference in padding must be the source of evil.
Padding difference
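In TensorFlow, conv2d with padding="SAME" works out the amount of padding by itself from the input size, kernel size, and stride, and when the total is odd the extra pixel goes to the right and bottom sides. In Caffe, you specify how many pixels to pad along height and width, and the same amount is added on both sides. In AlexNet’s original implementation, pad is 0, 2, 1, 1, 1 for the five conv layers respectively, which means that using padding="VALID" for Conv1 in TF is equivalent to Caffe. For the other layers, the way to be sure you get exactly Caffe’s padding, without trusting "SAME" to agree, is to pad explicitly. StackOverflow tells me of the magical existence of tf.pad: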
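import tensorflow as tf
pad = 2  # Caffe's pad for conv2
# NHWC layout: zero-pad only the height and width axes, then convolve without padding
x = tf.pad(x, [[0, 0], [pad, pad], [pad, pad], [0, 0]], "CONSTANT")
conv2 = tf.nn.conv2d(x, filters, strides, padding="VALID")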
Problem solved. Wait... what? Caffe-tensorflow already deals with that? OK, you don’t need to add the above lines…
But what else might be the hidden criminal then? I started reading issues in Ms. GitHub’s similar repos (while watching the World Cup). This one caught my attention:
Basically, the author forgot why he used an alpha in the LRN layer that differs from the number in the original paper. (Sure, he copied it from somewhere.)
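For the curious, here is a minimal sketch of where the two padding conventions agree and where they don't. It is plain Python arithmetic, not the converter's code. The function names are made up for illustration, the "SAME" rule is the one TensorFlow documents, and the ResNet-50 first-layer numbers (7x7 kernel, stride 2, Caffe pad 3, 224 input) are my assumption about the usual setup.

```python
import math


def tf_same_padding(in_size, kernel, stride):
    """(pad_before, pad_after) along one axis for TensorFlow's "SAME" mode."""
    out_size = math.ceil(in_size / stride)
    total = max((out_size - 1) * stride + kernel - in_size, 0)
    before = total // 2  # an odd leftover pixel ends up after (bottom/right)
    return before, total - before


def caffe_conv_output(in_size, kernel, stride, pad):
    """Output size of a Caffe convolution with `pad` pixels on both sides."""
    return (in_size + 2 * pad - kernel) // stride + 1


# AlexNet conv2 and conv3 are stride 1, so "SAME" comes out symmetric.
alexnet = {
    "conv2": tf_same_padding(27, kernel=5, stride=1),  # Caffe pad = 2
    "conv3": tf_same_padding(13, kernel=3, stride=1),  # Caffe pad = 1
}

# Assumed ResNet-50 conv1: the total padding is odd, so "SAME" is lopsided
# while Caffe pads 3 pixels on every side.
resnet_conv1 = tf_same_padding(224, kernel=7, stride=2)

print(alexnet)
print(resnet_conv1)
print(caffe_conv_output(227, kernel=11, stride=4, pad=0))  # AlexNet conv1
```

So for the stride-1 AlexNet layers the two conventions happen to coincide, and the mismatch only bites when the total padding is odd, as in the assumed ResNet case.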
LRN difference
Though Local Response Normalization is now considered outdated, it still means something to AlexNet users (maybe only AlexNet users). This blog saved my life by explaining why alpha and depth_radius are defined differently in TensorFlow and Caffe. In short:
local_size_in_caffe = 2*depth_radius_in_tf + 1
alpha_in_caffe = local_size_in_caffe * alpha_in_tf
So, with Caffe’s local_size = 5 and alpha = 1e-4 for AlexNet, the magic numbers you need for TensorFlow are
radius = 2; alpha = 2e-05; beta = 0.75; bias = 1.0
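A quick way to convince yourself of the conversion is to write both conventions out by hand. The sketch below is pure Python over a single list of channel activations, not the TF or Caffe kernels. The function names and the sample activations are made up, and I am assuming Caffe's default k = 1 and the across-channel mode.

```python
def caffe_lrn_to_tf(local_size, alpha):
    """Convert Caffe's (local_size, alpha) to TensorFlow's (depth_radius, alpha)."""
    return (local_size - 1) // 2, alpha / local_size


def lrn_caffe(x, local_size, alpha, beta, k=1.0):
    """Caffe convention: alpha is divided by local_size before scaling the sum."""
    r = (local_size - 1) // 2
    out = []
    for i, v in enumerate(x):
        window = x[max(0, i - r): i + r + 1]  # neighbouring channels
        sq_sum = sum(w * w for w in window)
        out.append(v / (k + (alpha / local_size) * sq_sum) ** beta)
    return out


def lrn_tf(x, depth_radius, bias, alpha, beta):
    """TensorFlow convention: alpha multiplies the raw sum of squares."""
    out = []
    for i, v in enumerate(x):
        window = x[max(0, i - depth_radius): i + depth_radius + 1]
        sq_sum = sum(w * w for w in window)
        out.append(v / (bias + alpha * sq_sum) ** beta)
    return out


channels = [0.5, -1.0, 2.0, 3.0, -0.25, 1.5, 0.0]  # made-up activations
radius, alpha_tf = caffe_lrn_to_tf(local_size=5, alpha=1e-4)

caffe_out = lrn_caffe(channels, local_size=5, alpha=1e-4, beta=0.75)
tf_out = lrn_tf(channels, depth_radius=radius, bias=1.0, alpha=alpha_tf, beta=0.75)
max_diff = max(abs(a - b) for a, b in zip(caffe_out, tf_out))
print(radius, alpha_tf, max_diff)
```

If you pass Caffe's alpha straight into the TF-style function instead, the normalization is five times too strong, which is the kind of silent mismatch that costs accuracy.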
Preprocessing
You will need to resize to 256 with cv2.resize and then take a random crop of 227. I found that following the exact pipeline boosts the performance. I mean exact: use cv2. Even tf.image could hurt performance.
LR setting
I am not sure whether this helps, but I followed the exact LR settings of the GRL paper, which means learning rate multipliers of 1x for pretrained kernels, 2x for pretrained biases, 10x for the new fc(31) kernel, and 20x for the fc(31) bias.
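The crop half of that pipeline is simple enough to sketch without any library. This is only an illustration under my own assumptions: the nested list stands in for the 256x256 array that cv2.resize would return, and random_crop is a hypothetical helper, not code from the repo.

```python
import random


def random_crop(image, crop_size, rng=random):
    """Cut a random crop_size x crop_size window out of an H x W nested list."""
    height, width = len(image), len(image[0])
    if crop_size > height or crop_size > width:
        raise ValueError("crop is larger than the image")
    top = rng.randint(0, height - crop_size)  # randint includes both ends
    left = rng.randint(0, width - crop_size)
    return [row[left:left + crop_size] for row in image[top:top + crop_size]]


# Stand-in for a resized image: each "pixel" records its own (row, col).
resized = [[(r, c) for c in range(256)] for r in range(256)]
crop = random_crop(resized, 227, random.Random(0))
print(len(crop), len(crop[0]), crop[0][0])
```

With a 256 input and a 227 crop, the top-left corner can land anywhere from 0 to 29 along each axis.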
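One way to keep those multipliers straight is a small lookup keyed on the variable name. This is a sketch, not the code from my repo: the "layer/weights" and "layer/biases" naming, the name fc8 for the new 31-way layer, and the base learning rate are all placeholders I made up for illustration.

```python
def lr_multiplier(var_name, new_layer="fc8"):
    """Learning rate multiplier for a variable named '<layer>/weights' or '<layer>/biases'."""
    is_bias = var_name.endswith("/biases")
    is_new = var_name.startswith(new_layer + "/")
    if is_new:
        return 20.0 if is_bias else 10.0  # freshly initialised classifier
    return 2.0 if is_bias else 1.0        # pretrained layers


base_lr = 0.001  # placeholder value, not taken from the post
names = [
    "conv1/weights", "conv1/biases",
    "fc7/weights", "fc7/biases",
    "fc8/weights", "fc8/biases",
]
effective_lr = {name: base_lr * lr_multiplier(name) for name in names}
for name, lr in effective_lr.items():
    print(name, lr)
```

In practice you would then scale each variable's gradient (or build one optimizer per group) with these factors. How you do that depends on your training loop.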
Now you can reproduce the 60% baseline easily and move one step further to DANN, DRCN, or whatever new trick…
Oops, more hours of my life wasted, more useless tricks learnt.
My implementation: GitHub
Let Keras wrap all platforms so that we can reproduce any result from any DL toolbox flawlessly and happily. (Nahh, TF is the best)
Update 2019
I am using PyTorch now
Looking forward to your implementation