Knowledge Distillation (KD) is a pivotal technique in model optimization: the knowledge captured by a large, highly accurate "teacher" model is transferred to a smaller "student" model. The student learns from the rich, nuanced representations of the teacher, which improves its performance on tasks such as classification and regression. By distilling knowledge from the teacher, the student can reach comparable accuracy with far fewer parameters, making it well suited to deployment in resource-constrained environments, with faster inference and lower computational cost. KD also adapts readily to different tasks and datasets, which makes it a versatile tool for improving both the efficiency and the effectiveness of machine learning models.
The basic idea was introduced in a seminal 2015 paper by Hinton, Vinyals, and Dean entitled "Distilling the Knowledge in a Neural Network." The idea is to look at the vector of logits at the final fully connected layer of each network and match them as closely as possible, i.e., the probability distributions induced at the last layer are matched between the two networks. Typically, the student network is trained with a loss function that combines the standard classification loss on the true labels with a second term measuring the distance between the two output distributions (for example, the KL divergence). The student network can thereby leverage the good output representations produced by the teacher network. Incidentally, this paper was published at the NIPS Deep Learning and Representation Learning Workshop and has yet to have a conference or journal version!
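To make this concrete, here is a minimal PyTorch-style sketch of such a combined loss. The temperature and weighting values are illustrative assumptions, not prescriptions from the paper.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    """Vanilla knowledge distillation loss: cross-entropy on the true labels
    plus KL divergence between temperature-softened output distributions."""
    # Standard classification loss against the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    # Soften both distributions with a temperature > 1 to expose the small
    # logits ("dark knowledge") of the teacher.
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    # KL divergence between the two output distributions, rescaled by T^2
    # (as in Hinton et al.) so gradient magnitudes remain comparable.
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
    # Weighted combination of the two terms; alpha is a tuning knob.
    return alpha * ce + (1.0 - alpha) * kl
```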
Once we see KD as transferring knowledge in the form of good representations, one immediately starts thinking about transferring knowledge from representations in other parts of the teacher network, e.g., at intermediate layers. Here one assumes that the two networks share a similar overall architecture, so that corresponding blocks can be matched (even though the blocks may differ in the number and structure of their layers). Typically, the student network has fewer layers within a corresponding block, and those layers may be much thinner. This problem was addressed in a natural way by the paper "FitNets: Hints for Thin Deep Nets" by Romero et al. (with Bengio among the authors), which appeared at ICLR 2015, soon after the Hinton et al. paper. They introduced a regressor in the student network that maps the output of the last layer of a student block to the dimensions of the teacher's corresponding layer, so that the two representations can be compared directly.
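A minimal sketch of this idea, assuming a simple 1×1 convolutional regressor and already-matching spatial sizes (the names HintRegressor and hint_loss are hypothetical; the original paper allows a more general convolutional regressor):

```python
import torch.nn as nn
import torch.nn.functional as F

class HintRegressor(nn.Module):
    """Maps a thin student feature map to the teacher's channel width so the
    two intermediate representations can be compared directly."""
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # A 1x1 convolution is one simple choice of regressor.
        self.proj = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat):
        return self.proj(student_feat)

def hint_loss(student_feat, teacher_feat, regressor):
    """L2 distance between the regressed student features and the teacher's
    intermediate ("hint") features at the corresponding block.
    Assumes the spatial dimensions already match."""
    return F.mse_loss(regressor(student_feat), teacher_feat)
```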
Further thought suggests that one could transfer knowledge in ways other than directly matching representations. For example, a paper by Tung and Mori at ICCV 2019 took a similarity-based approach. The basic idea is that similar inputs should produce similar activations. To make this precise, consider a batch of b inputs and look at the activations they produce at the end of corresponding blocks of the two networks. Since the sizes of the intermediate layers may differ, we cannot compare the representations directly; one workaround is the FitNets regressor discussed above. Tung and Mori took a different, elegant approach: they formed the two b × b similarity matrices of the activations produced by the two networks on the batch. These two b × b matrices can then be compared with a standard matrix norm such as the Frobenius norm. On several benchmarks, they demonstrated that this way of implementing KD outperforms the FitNets approach.
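A rough sketch of this similarity-preserving loss, following the description above (the function name is hypothetical; the row-wise normalization and the 1/b² scaling follow Tung and Mori's formulation):

```python
import torch.nn.functional as F

def similarity_preserving_loss(student_feat, teacher_feat):
    """Compare the b x b pairwise-similarity matrices induced by a batch,
    so the widths of the two layers never need to match."""
    b = student_feat.size(0)
    # Flatten each sample's activations into a single vector.
    s = student_feat.reshape(b, -1)
    t = teacher_feat.reshape(b, -1)
    # b x b similarity (Gram) matrices, row-normalized.
    g_s = F.normalize(s @ s.t(), p=2, dim=1)
    g_t = F.normalize(t @ t.t(), p=2, dim=1)
    # Squared Frobenius norm of the difference, scaled by 1 / b^2.
    return ((g_s - g_t) ** 2).sum() / (b * b)
```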
Knowledge Distillation does more than improve accuracy: it can also confer properties such as adversarial robustness, invariance, and better transfer learning. A recent study by Ojha et al., presented at NeurIPS 2023 under the title "What Knowledge Gets Distilled in Knowledge Distillation?", showed that these advantageous properties of the teacher network can indeed be transferred through the distillation process. This highlights the potential of KD to enhance the overall robustness and generalization of neural networks, making it a valuable tool for improving performance across a wide range of tasks.
Knowledge distillation is also a crucial tool for model compression, offering an effective way to transfer knowledge from a large, complex model to a compact version that has undergone pruning and quantization. By leveraging KD, the compressed model retains valuable knowledge from its larger predecessor, improving efficiency and performance without sacrificing accuracy. This makes it possible to strike a balance between model size and computational resources and to unlock the full potential of compressed models in various applications.
One recent example of innovative compression techniques is the work by Chong Yu et al. presented at CVPR 2023. In their paper, "Boost Vision Transformer with GPU-Friendly Sparsity and Quantization," the researchers compressed the Swin Transformer model to run efficiently on Nvidia A100 and Orin hardware, achieving notable performance gains. Their ablation studies demonstrated the crucial role of Knowledge Distillation in achieving these gains, highlighting the importance of the technique in the compression process. This work showcases the impact of advanced compression methods in optimizing model efficiency and performance.
One of the experiments conducted at Embedl involved compressing the MobileNetV2 model for ARM CPUs. Initially, the model ran at 5.2 fps with an accuracy of 72%. Pruning and quantization increased the frame rate to 32 fps, but the accuracy dropped to 61%. To recover accuracy, knowledge distillation was applied, boosting it to 63%. This improvement was achieved with vanilla KD, without any additional modifications.
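Purely as an illustration (the actual Embedl pipeline is not shown here), vanilla KD in this setting amounts to fine-tuning the compressed student with a combined loss like the kd_loss sketch above, using the original full-size model as a frozen teacher; the function and variable names below are hypothetical.

```python
import torch

def distillation_finetune(student, teacher, loader, optimizer, device="cpu"):
    """Fine-tune a pruned (and, with quantization-aware training, quantized)
    student with vanilla KD; `kd_loss` is the combined loss sketched earlier."""
    teacher.eval()   # the teacher is frozen and only provides soft targets
    student.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        with torch.no_grad():
            teacher_logits = teacher(images)
        student_logits = student(images)
        loss = kd_loss(student_logits, teacher_logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```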
Knowledge distillation is a versatile technique and a crucial part of Embedl's toolkit for optimizing models. By distilling the essential information from a large, complex teacher model into a smaller, streamlined student, the student can achieve similar performance with far fewer computational resources, enabling faster and more efficient inference. This makes KD a powerful approach for improving model performance and scalability across a wide range of deep learning applications.
Do you want to learn more about Knowledge distillation?
Watch our latest webinar here: