Breaking new ground in artificial intelligence: How multimodal large language models are reshaping age and gender estimation

Written By Adarsh Shankar Jha

The rapid growth of multimodal large language models (MLLMs) has been remarkable, particularly those that combine language and vision modalities (LVMs). Their rise is attributed to high accuracy, strong generalization, and reasoning ability, and these models handle unexpected tasks outside their original training domain well. MLLMs are reshaping several fields and prompting a re-evaluation of specialized models. Their rapid development is driving interest in applying them to computer vision tasks such as object segmentation and in integrating them into complex pipelines such as instruction-based image processing.

While models such as ShareGPT4V are useful for tasks such as data annotation, their high cost limits their practicality in production. Specialized models such as MiVOLO offer a cost-effective alternative. The paper compares the best general-purpose MLLMs with specialized models such as MiVOLO to assess whether the former can replace the latter. The results show significant differences in computational cost and speed for some tasks, such as labeling new data or filtering old datasets.

The SaluteDevices research team presented MiVOLOv2, a model that outperforms not only earlier specialized models, such as those based on CNNs like ResNet34 and GoogLeNet, but also the first version of MiVOLO. This second version, the state-of-the-art model for gender and age estimation, is evaluated with mean absolute error (MAE) for age estimation, accuracy for gender prediction, and cumulative score at 5 (CS@5) for age estimation, where CS@5 is the share of samples whose predicted age falls within 5 years of the true age. The team also conducted experiments comparing the best general-purpose MLLMs with specialized models, benchmarking SOTA MLLMs such as LLaVA 1.5, LLaVA-NeXT, ShareGPT4V, and ChatGPT4V.
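To make the three metrics concrete, here is a minimal Python sketch of how MAE, CS@5, and gender accuracy are commonly computed. It follows the standard definitions rather than the authors' evaluation code, and the function names and sample values are illustrative assumptions.

```python
from typing import Sequence


def mean_absolute_error(true_ages: Sequence[float], predicted_ages: Sequence[float]) -> float:
    """Average absolute difference, in years, between true and predicted ages."""
    errors = [abs(t - p) for t, p in zip(true_ages, predicted_ages)]
    return sum(errors) / len(errors)


def cumulative_score(true_ages: Sequence[float], predicted_ages: Sequence[float],
                     threshold: float = 5.0) -> float:
    """CS@threshold: percentage of samples whose absolute age error is <= threshold years."""
    hits = sum(1 for t, p in zip(true_ages, predicted_ages) if abs(t - p) <= threshold)
    return 100.0 * hits / len(true_ages)


def accuracy(true_labels: Sequence[str], predicted_labels: Sequence[str]) -> float:
    """Percentage of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return 100.0 * correct / len(true_labels)


# Made-up values for illustration only; they are not results from the paper.
true_ages = [25, 40, 31, 60]
predicted_ages = [27, 33, 31, 66]
true_genders = ["male", "female", "female", "male"]
predicted_genders = ["male", "female", "male", "male"]

print("MAE:", mean_absolute_error(true_ages, predicted_ages))
print("CS@5:", cumulative_score(true_ages, predicted_ages))
print("Gender accuracy:", accuracy(true_genders, predicted_genders))
```

Lower MAE is better, while higher CS@5 and accuracy are better, so the two age metrics complement each other: MAE summarizes the typical error, and CS@5 shows how often the prediction is close enough to be useful.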

MiVOLO uses face and body crops for its predictions, while the other models predict from a text prompt and a body-crop image. It uses a transformer to estimate age and gender from these inputs. Furthermore, the authors fine-tune an MLLM for gender and age estimation and compare it with a specialized model. The authors explore the potential of multimodal ChatGPT (ChatGPT4V), evaluating its adequacy in predicting facial features and performing face recognition tasks. With zero training, the model outperformed a specialized age recognition model, but performed less effectively in gender classification.

For MiVOLOv2, the training dataset is expanded by 40% compared with the data used for MiVOLO and now contains 807,694 samples: 390,730 male and 416,964 female. Most of the added images were selected because MiVOLOv1 made significant mistakes on them. They come mainly from production pipelines and open-source data such as LAION-5B. Of the two candidate evaluation datasets, LAGENDA is chosen over IMDB. This minimizes the risk that MLLMs give correct answers not through age and gender estimation but because of their familiarity with famous people, famous movies, and so on. Despite the lack of ground truths, LAGENDA carries less of this risk, and on it MiVOLOv2 outperforms every general-purpose MLLM in age estimation. Among open-source alternatives, LLaVA-NeXT 34B leads in this area, which makes specialized versions of LLaVA a more efficient option.
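Selecting images on which the earlier model made significant mistakes is a form of hard-example mining. The sketch below shows one way such a filter could look. The `Prediction` record, the `select_hard_examples` helper, and the 10-year error threshold are hypothetical choices for illustration, since the article does not say how "significant" was defined.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prediction:
    """One labeled sample with the earlier model's output (hypothetical schema)."""
    image_id: str
    true_age: float
    predicted_age: float
    true_gender: str
    predicted_gender: str


def select_hard_examples(predictions: List[Prediction],
                         age_error_threshold: float = 10.0) -> List[str]:
    """Return the ids of samples the earlier model got badly wrong.

    A sample counts as hard if its absolute age error exceeds the threshold
    or its gender prediction is wrong. The threshold is an assumed value.
    """
    hard_ids = []
    for p in predictions:
        age_miss = abs(p.true_age - p.predicted_age) > age_error_threshold
        gender_miss = p.true_gender != p.predicted_gender
        if age_miss or gender_miss:
            hard_ids.append(p.image_id)
    return hard_ids


# Made-up candidate pool for illustration only.
candidates = [
    Prediction("a", 30, 32, "female", "female"),  # small error, kept out
    Prediction("b", 55, 38, "male", "male"),      # large age error
    Prediction("c", 22, 24, "female", "male"),    # wrong gender
]
hard_example_ids = select_hard_examples(candidates)
print(hard_example_ids)
```

Adding such samples to the training set concentrates new data where the previous version was weakest, which fits the article's description of how the dataset was expanded.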

In conclusion, this work aimed to evaluate the effectiveness of MiVOLOv2 compared with MLLMs on age and gender estimation tasks. MiVOLOv2 outperforms all general-purpose MLLMs in age estimation and performs well at processing images of people. The results prompted a comprehensive evaluation of the capabilities of neural networks, including LLaVA and ShareGPT4V.


Sajjad Ansari is a senior at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI, with an emphasis on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible way.

