This study presents a comparative analysis of six convolutional neural network (CNN) architectures for food image classification: EfficientNet B0, VGG16, ResNet50, YOLOv5-cls, YOLOv8-cls, and a custom-designed model, CNN-Z. The dataset contains 11 food categories, and several data augmentation techniques (Gaussian noise, random erasing, color adjustment, rotation, and contrast variation) were applied to improve model generalization. YOLOv8-cls achieved the highest classification accuracy (99.64%), followed by CNN-Z (96.12%) and YOLOv5-cls (95.42%), whereas ResNet50 reached a lower accuracy of 86.11%. t-SNE visualizations of the intermediate-layer and top-layer feature representations were used to examine the features learned by each model. In addition, a zero-shot learning experiment with the CLIP model was performed to evaluate generalization to unseen food categories. Overall, the results indicate that EfficientNet B0 and YOLOv8-cls offer a strong balance between accuracy and computational efficiency. The findings offer guidance for selecting CNN architectures and designing data augmentation strategies for food image classification tasks.
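As an illustration of three of the augmentations named in the abstract (contrast variation, Gaussian noise, and random erasing), the following is a minimal sketch and not the pipeline used in the study. It assumes a grayscale image stored as a list of rows with pixel values in [0, 1]. The function names, the noise level `sigma`, the contrast range, and the 2 x 2 erase size are hypothetical choices made only for this example.

```python
import random


def add_gaussian_noise(img, sigma=0.05, rng=random):
    """Add zero-mean Gaussian noise to every pixel, then clip to [0, 1]."""
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
            for row in img]


def random_erase(img, erase_h, erase_w, fill=0.0, rng=random):
    """Overwrite one randomly placed erase_h x erase_w rectangle with `fill`."""
    h, w = len(img), len(img[0])
    top = rng.randint(0, h - erase_h)    # randint is inclusive on both ends
    left = rng.randint(0, w - erase_w)
    out = [row[:] for row in img]        # copy so the input is not modified
    for r in range(top, top + erase_h):
        for c in range(left, left + erase_w):
            out[r][c] = fill
    return out


def adjust_contrast(img, factor):
    """Scale each pixel's distance from the image mean by `factor`, then clip."""
    flat = [p for row in img for p in row]
    mean = sum(flat) / len(flat)
    return [[min(1.0, max(0.0, mean + factor * (p - mean))) for p in row]
            for row in img]


def augment(img, rng):
    """Apply the three augmentations in sequence (assumed order and ranges)."""
    img = adjust_contrast(img, rng.uniform(0.8, 1.2))
    img = add_gaussian_noise(img, sigma=0.05, rng=rng)
    return random_erase(img, 2, 2, rng=rng)


if __name__ == "__main__":
    rng = random.Random(0)               # fixed seed for repeatable output
    image = [[0.5] * 8 for _ in range(8)]
    augmented = augment(image, rng)
    print(len(augmented), len(augmented[0]))
```

In practice such transforms are usually applied on the fly during training, so each epoch sees a different perturbed copy of every image; color adjustment and rotation would be added in the same way.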
@article{y14102025ijsea14101027,
  title   = "A Comparative Study of CNN Architectures for Food Image Classification with Data Augmentation and Zero-Shot Analysis",
  journal = "International Journal of Science and Engineering Applications (IJSEA)",
  volume  = "14",
  number  = "10",
  pages   = "165--169",
  year    = "2025",
  author  = "Yan Zhu"}