What if you could understand your surroundings just by taking a picture?
The speakers have built an intelligent camera app that helps people with visual impairments understand their surroundings. The application takes an image from the camera and attempts to answer questions about it. Simply put, it is a Visual Question Answering (VQA) app.
The deep learning application is built on a transformer-based model called the Vision-and-Language Transformer (ViLT), which is computationally fast and efficient, answering users' questions within a fraction of a second. The application also integrates speech-to-text and text-to-speech capabilities to enhance accessibility.
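As a flavour of what this looks like in code, here is a minimal VQA inference sketch using the publicly available ViLT checkpoint on the Hugging Face Hub; the speakers' actual checkpoint, preprocessing, and image source may differ.

```python
# Minimal ViLT VQA sketch with Hugging Face Transformers.
# The "dandelin/vilt-b32-finetuned-vqa" checkpoint is an assumption for
# illustration; the app described in the talk may use a different model.
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

image = Image.open("street_scene.jpg")              # e.g. a frame from the camera
question = "Is there a crosswalk in front of me?"   # question spoken by the user

# Encode the image-question pair and run a single forward pass.
inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits
answer = model.config.id2label[logits.argmax(-1).item()]
print(answer)
```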
This talk will mainly cover the following:

* What is the Vision-and-Language Transformer (ViLT) model?
* Advantages of ViLT over traditional vision-language pre-trained models
* Best practices for modularizing the application into different services (a minimal sketch follows this list)
* Steps to deploy this deep learning application on the cloud (GCP)
* How off-the-shelf Python libraries made it easy to implement and deploy complex models such as ViLT
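To make the modularization point concrete, the sketch below wraps the model in its own small web service so the camera UI, speech components, and model can be deployed independently (for example, as separate containers on a service such as GCP Cloud Run). The framework choice, endpoint name, and checkpoint are assumptions for illustration, not the speakers' actual code.

```python
# Hypothetical VQA microservice sketch (FastAPI); names and routes are illustrative.
import io

from fastapi import FastAPI, File, Form, UploadFile
from PIL import Image
from transformers import ViltForQuestionAnswering, ViltProcessor

app = FastAPI(title="vqa-service")

MODEL_ID = "dandelin/vilt-b32-finetuned-vqa"   # assumed public checkpoint
processor = ViltProcessor.from_pretrained(MODEL_ID)
model = ViltForQuestionAnswering.from_pretrained(MODEL_ID)


@app.post("/answer")
async def answer(image: UploadFile = File(...), question: str = Form(...)):
    """Accept an image and a question, return the predicted answer."""
    pil_image = Image.open(io.BytesIO(await image.read())).convert("RGB")
    inputs = processor(pil_image, question, return_tensors="pt")
    logits = model(**inputs).logits
    return {"answer": model.config.id2label[logits.argmax(-1).item()]}
```

Keeping the model behind an HTTP boundary like this lets the speech-to-text front end and the text-to-speech output evolve or scale separately from the GPU/CPU-bound inference service.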
The entire codebase is open source, and the talk will walk through the steps to build your own visual question answering application.