Based on my experience, and not just for ImageNet: if you have enough data, it is better to train your network from scratch. There are several reasons for this.
First of all, I don't know whether you have had this experience, but I've trained complicated CNNs with over 25 million parameters. After reaching 95% accuracy, after convergence, I increased the learning rate slightly to search for another possible local minimum. I still don't have an explanation, but whenever I did so, my accuracy dropped significantly every time, and it never recovered, even after more than thousands of epochs.
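A minimal sketch of the experiment described above, with an illustrative toy model and made-up learning rates (not the actual 25-million-parameter network): train to convergence, then restart the optimizer with a larger learning rate and see what happens to the loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Tiny stand-in model and synthetic data, purely for illustration.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
loss_fn = nn.CrossEntropyLoss()

def train(lr, epochs):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

converged = train(lr=0.01, epochs=200)  # initial convergence
perturbed = train(lr=0.5, epochs=50)    # bump the learning rate afterwards
print(converged, perturbed)
```

Whether the larger rate escapes to a better minimum or simply destroys the converged solution is exactly the open question raised above; this sketch only shows how to set up the comparison.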
The other problem is that whenever you use transfer learning, your training data should satisfy two conditions. First, the distribution of the data that your pre-trained model was trained on should be like the data you are going to face at test time, or at least not vary too much from it. Second, the amount of training data for transfer learning should be large enough that you don't overfit the model. If you have only a small number of labeled training examples, and a typical pre-trained model like ResNet has millions of parameters, you have to watch for overfitting by choosing appropriate evaluation metrics and good test data that represent the distribution of the real population.
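To illustrate the point about choosing appropriate metrics, here is a small, self-contained sketch with made-up labels and predictions: on an imbalanced test set, plain accuracy can look good while the model ignores the minority class entirely, which balanced accuracy exposes.

```python
# Made-up test set: 90 negatives, 10 positives.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100  # a degenerate model that always predicts class 0

plain_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Per-class recall, then average: punishes ignoring the minority class.
recall0 = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0) / 90
recall1 = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1) / 10
balanced_acc = (recall0 + recall1) / 2

print(plain_acc, balanced_acc)  # 0.9 0.5
```

A 90% plain accuracy hides the fact that the positive class is never detected; the balanced score of 0.5 (chance level) makes the failure visible.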
The next problem is that you cannot remove layers with confidence to reduce the number of parameters. The number of layers, as you yourself know, is a hyper-parameter, and there is no consensus on how to choose it. If you remove convolutional layers from the beginning of the network, again based on experience, you won't get good learning, because by the nature of the architecture those layers find low-level features. Furthermore, if you remove the first layers you will have problems with your dense layers, because the number of trainable parameters changes. Densely connected layers and deep convolutional layers can be good candidates for removal, but it may take time to find how many layers and neurons to drop in order not to overfit.
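The point about dense layers can be made concrete with a small sketch (the architecture below is illustrative, not any particular pre-trained network): removing a conv/pool block changes the flattened feature size, so the first dense layer's weight matrix no longer fits.

```python
import torch
import torch.nn as nn

def conv_stack(n_convs):
    # Build n_convs blocks of conv -> ReLU -> max-pool (illustrative sizes).
    layers, in_ch = [], 3
    for _ in range(n_convs):
        layers += [nn.Conv2d(in_ch, 8, 3), nn.ReLU(), nn.MaxPool2d(2)]
        in_ch = 8
    return nn.Sequential(*layers)

x = torch.randn(1, 3, 64, 64)
full = conv_stack(3)(x).flatten(1).shape[1]     # features after 3 blocks
reduced = conv_stack(2)(x).flatten(1).shape[1]  # features after 2 blocks
print(full, reduced)  # 288 1568
```

The dense layer trained against 288 inputs cannot be reused once the conv stack produces 1568, which is why trimming layers forces you to retrain (or at least rebuild) the classifier head.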
If you don't have enough data and a pre-trained model already exists, there are things that can help. I divide my answer into two parts for this situation:
The pre-trained model does not have common labels for most of its classes
If this is the case, you can discard all the fully connected layers and replace them with new ones. Basically, all the fully connected layers do is classify the features that the network has found, in order to reduce the error rate. Regarding the convolutional layers, you have to consider two major points:
Pooling layers try to summarize the information in a local neighbourhood and build a higher-level representation of their inputs. Suppose the inputs of a pooling layer contain a nose, eyes, eyebrows and so on; the pooling layer somehow attempts to check whether they exist in a neighbourhood or not. Consequently, convolutional layers after pooling layers usually keep information which may be irrelevant to your task. There is a downside to this interpretation: the information may be distributed among different activation maps, as Jason Yosinski et al. have investigated in Understanding neural networks through deep visualization, where you can read: "One of the most interesting conclusions so far has been that representations on some layers seem to be surprisingly local. Instead of finding distributed representations on all layers, we see, for example, detectors for text, flowers, fruit, and faces on conv4 and conv5. These conclusions can be drawn either from the live visualization or the optimized images (or, best, by using both in concert) and suggest several directions for future research" and "These visualizations suggest that further study into the exact nature of learned representations—whether they are local to a single channel or distributed across several ...". A partial solution can be to keep the first layers, which find low-level features that are usually shared among different data distributions, and remove the deeper convolutional layers, which find higher abstractions.
As stated, the main problem with convolutional layers is that the information they find may be distributed among different activation maps. Consequently, you cannot be sure whether removing a layer will give better performance or not.
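A minimal sketch of the two ideas above (the tiny network, layer sizes, and class counts are all illustrative stand-ins for a real pre-trained model): keep the early convolutional layers, drop the deeper ones, and attach a fresh fully connected classifier for the new label set.

```python
import torch
import torch.nn as nn

pretrained = nn.Sequential(                      # stand-in "pre-trained" net
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),    # low-level features: keep
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),   # low-level features: keep
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),  # higher abstraction: drop
    nn.Flatten(), nn.Linear(32 * 32 * 32, 1000), # old head (1000 classes)
)

base = pretrained[:4]          # keep only the first two conv blocks
for p in base.parameters():    # freeze the transferred low-level features
    p.requires_grad = False

new_model = nn.Sequential(     # fresh classifier head for, say, 5 new classes
    base, nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 5)
)

x = torch.randn(2, 3, 32, 32)
print(new_model(x).shape)  # torch.Size([2, 5])
```

Whether freezing the transferred layers (as here) or fine-tuning them works better depends on how close the new data distribution is to the original one, per the conditions discussed earlier.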
The pre-trained model has some common classes with your labels
If this is the case, you can visualize the activations using the techniques described here. They have shown that although human face is not a label in ImageNet, some internal activation maps are activated by faces. This observation holds for other labels too; for instance, networks for deciding whether a scene contains a car are usually sensitive to roads and trees. The image below shows which parts of the outputs get activated for which parts of the images. This can help you when you don't have enough data and have to use transfer learning.
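The visualization tools linked above are much richer, but the raw material they work with can be grabbed with a forward hook; a small sketch (the tiny model and layer choice are illustrative) of capturing an intermediate layer's activation maps in PyTorch:

```python
import torch
import torch.nn as nn

# Illustrative stand-in for a pre-trained CNN.
model = nn.Sequential(nn.Conv2d(3, 4, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(4, 8, 3, padding=1), nn.ReLU())

maps = {}
def save_maps(name):
    # Return a hook that stores the layer's output under the given name.
    def hook(module, inputs, output):
        maps[name] = output.detach()
    return hook

model[2].register_forward_hook(save_maps("conv2"))  # watch the second conv
model(torch.randn(1, 3, 16, 16))
print(maps["conv2"].shape)  # torch.Size([1, 8, 16, 16])
```

Each of the 8 channels in the captured tensor is one activation map; inspecting which spatial locations light up for which inputs is what reveals the face- or road-sensitive units described above.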

Based on the answer here: "The standard classification setting is an input distribution $p(X)$ and a label distribution $p(Y|X)$. Domain adaptation: when $p(X)$ changes between training and test. Transfer learning: when $p(Y|X)$ changes between training and test. In other words, in DA the input distribution changes but the labels remain the same; in TL, the input distributions stay the same, but the labels change." Consequently, domain adaptation problems can also be considered for the mentioned solutions.