Template-Type: ReDIF-Paper 1.0
Author-Name: Alexey Chernikov
Author-Email: alexeynchernikov@gmail.com
Author-Workplace-Name: SoDa Labs & Department of Econometrics and Business Statistics, Monash University
Author-Name: Klaus Ackermann
Author-Email: klaus.ackermann@monash.edu
Author-Workplace-Name: SoDa Labs & Department of Econometrics and Business Statistics, Monash University
Author-Name: Caitlin Brown
Author-Email: caitlin.brown@ecn.ulaval.ca
Author-Workplace-Name: Department of Economics, Université Laval
Author-Name: Denni Tommasi
Author-Email: denni.tommasi@unibo.it
Author-Workplace-Name: Department of Economics, University of Bologna
Title: Leveraging Computer Vision and Visual LLMs for Cost-Effective and Consistent Street Food Safety Assessment in Kolkata India
Abstract: Ensuring street food safety in developing countries is crucial due to the high prevalence of foodborne illnesses. Traditional food safety assessments face challenges such as resource constraints, logistical issues, and subjective biases shaped by surveyors' personal lived experiences, particularly when interacting with local communities. For instance, a local food safety inspector may inadvertently overrate the quality of infrastructure due to prior familiarity or past purchases, thereby compromising objective assessment. This subjectivity highlights the necessity for technologies that reduce human biases and enhance the accuracy of survey data across various domains. This paper proposes a novel approach based on a combination of Computer Vision and a lightweight Visual Large Language Model (VLLM) to automate the detection and analysis of critical food safety infrastructure in street food vendor environments during a field experiment in Kolkata, India.
The system utilises a three-stage object extraction pipeline that processes video to identify, extract, and select unique representations of critical elements such as hand-washing stations, dishwashing areas, garbage bins, and water tanks. These four infrastructure items are crucial for maintaining safe food practices, irrespective of the specific methods employed by the vendors. A VLLM then analyses the extracted representations to assess compliance with food safety standards. Notably, over half of the pipeline can be processed using a user's smartphone, significantly reducing government server workload. By leveraging this decentralised approach, the proposed system decreases the analysis cost by many orders of magnitude compared to alternatives like ChatGPT or Claude 3.5. Additionally, processing data on local government servers provides better privacy and security than cloud platforms, addressing critical ethical considerations. This automated approach significantly improves efficiency, consistency, and scalability, providing a robust solution to enhance public health outcomes in developing regions.
Creation-Date: 2025-03
File-URL: http://soda-wps.s3-website-ap-southeast-2.amazonaws.com/RePEc/ajr/sodwps/2025-02.pdf
File-Format: Application/pdf
Number: 2025-02
Classification-JEL: C83, I18, O33, Q18, O12
Keywords: Food Safety, Visual Language Models, Survey Accuracy, Field Assessments, Bias Reduction
Handle: RePEc:ajr:sodwps:2025-02