Creating a digital shopping assistant with computer vision

Photo by Fancycrave on Unsplash

We combined the technologies that power voice assistants with image recognition algorithms to build an assistant specialized for a particular shop. The assistant is therefore capable of answering queries regarding products and inventory.

Furthermore, we have added vision capabilities to the assistant to go beyond the traditional realm of voice assistants and give ‘eyes’ to the system. This allows the device to recognize products that the user holds in their hand, so that the customer can ask about a product without having to know its name or product ID.

The system is primarily intended to be used by shoppers within the store. These shoppers are not expected to have any technical knowledge, and it is also difficult to provide these shoppers with training. Therefore the system should be extremely user friendly, and essentially simulate the experience of interacting with a human assistant.

The system uses many methods to simulate this behaviour. For instance the system is designed to understand user queries spoken in natural language, so that the user does not have to engage with a computer or any touch screen device, or have to remember any specific commands.

Furthermore, the need for a customer to specifically know the product id of a certain item is eliminated by equipping the system with cameras capable of identifying the items held in the hand of the user.

In terms of hardware, the device consists of a Raspberry Pi with a connected camera and speakers. User queries are taken in using a smartphone, and the answer is given to the user through the speakers.


This demo system is specifically designed to work with a clothing store, and answer queries regarding size, color and price of clothing items.

The user interacts with the system through Google’s Dialogflow API, which processes voice queries in natural language and extracts user intents. The API can be integrated into any mobile application or webpage, or into bots built on existing services such as Google Assistant or Facebook Messenger. For convenience, this system was implemented as a bot on Google Assistant. The bot can be activated via Google Assistant on mobile devices with the phrase “Talk to shopping agent”. Since Google Assistant itself can be activated using “Okay Google”, no separate wake word detection was necessary.

Once activated, the user can ask a query, while holding up the product he is referring to in his right hand. For instance assume that he is in a shirt store, and wants to know if a certain shirt is available in size “small”. He can hold up the shirt in front of the device, and ask “Can I get this in small?”

Once the query is submitted, it is processed by the Dialogflow API. The Dialogflow agent has been trained to recognize the things that the user might request information on, such as size, color and price. These are called user intents. Example training phrases such as “Can I get this in small?” and “Is this available in small?” are used to train the agent on each user intent. Intents such as size and color each define a parameter of the same name. These parameters are linked to objects called entities, which can take different pre-defined values. For instance, the size entity can take the values extra small, small, medium, large and extra large.

Once a Dialogflow agent identifies the user request, it can either provide an answer by itself or request further information using a webhook call. This process is called fulfillment. The webhook is a simple event-driven HTTP POST message. In our case, since the information must be processed further, a webhook is used to send the user intent as a POST message to a server. In the example scenario above, the message contains the intent “getSize” and the parameter “small”. Since Dialogflow is a cloud service, it can only post to a web server that is reachable at a public address. It therefore cannot send a POST message directly to the device (the Raspberry Pi), which has no public IP address. Instead, the POST request is sent to a Node.js server hosted on the Heroku platform.
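As an illustration of what the receiving server has to pull out of that POST message, here is a minimal sketch in Python. It assumes the JSON layout of a Dialogflow v2 webhook request, where the intent name sits under queryResult.intent.displayName and the parameters under queryResult.parameters. The function name extract_intent and the parameter key "size" are made up for this example, and the server in this project was written in Node.js, so this is not the original code.

```python
from typing import Any, Dict, Tuple


def extract_intent(body: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Return (intent name, parameters) from a Dialogflow v2 webhook body."""
    query_result = body.get("queryResult", {})
    intent_name = query_result.get("intent", {}).get("displayName", "")
    parameters = query_result.get("parameters", {})
    return intent_name, parameters


# Trimmed, hypothetical body for the query "Can I get this in small?"
example_body = {
    "queryResult": {
        "queryText": "Can I get this in small?",
        "intent": {"displayName": "getSize"},
        "parameters": {"size": "small"},
    }
}

intent, params = extract_intent(example_body)
print(intent, params)
```

Missing fields fall back to an empty string and an empty dictionary, so a malformed request does not crash the handler.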

A separate Flask server running on the Raspberry Pi is connected to the Heroku server by polling: it constantly checks whether the Heroku server has received a POST message from the Dialogflow agent. Once a request is received, it retrieves the user intent and the parameter from the Heroku server. It also runs a Python script that takes a picture with the camera. This picture is immediately sent to the deep learning server for analysis. The deep learning process is executed on a separate server in order to reduce the response time, since it is difficult to perform heavy calculations on the limited hardware of the Raspberry Pi.
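The polling step can be sketched roughly as follows. This is only an illustration: the fetch argument stands in for whatever HTTP call the device makes to the intermediary server, and the interval, attempt limit and request fields are assumptions, not values from the project.

```python
import time
from typing import Callable, Optional


def poll_for_request(
    fetch: Callable[[], Optional[dict]],
    interval_s: float = 0.5,
    max_attempts: int = 20,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call fetch() repeatedly until it returns a request or attempts run out."""
    for _ in range(max_attempts):
        request = fetch()
        if request is not None:
            return request
        sleep(interval_s)
    return None


# Stand-in for the HTTP check: nothing pending twice, then a request appears.
_responses = iter([None, None, {"intent": "getSize", "size": "small"}])
pending = poll_for_request(lambda: next(_responses), sleep=lambda s: None)
print(pending)
```

Passing fetch and sleep in as arguments keeps the loop testable without a network connection. A shorter interval lowers latency at the cost of more requests to the intermediary server.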

The deep learning server has a neural pipeline that consists of two separate neural networks.

The first is an object detection network. The primary advantage of using an object detection network is that it provides not only the labels of the products but also the coordinates of their bounding boxes, which can be used to calculate the center point of each product.

A convolutional neural network object detection classifier is trained on images of products in the store together with a labeled map. The labeled map contains the bounding box coordinates of each product in the image. The TensorFlow Faster-RCNN-Inception-V2 model is used to create the object detection classifier. The classifier runs on the image taken by the Raspberry Pi camera and returns the bounding box coordinates and product ID of each detected product to the deep learning server. The system is currently trained on three clothing items. The classification probability threshold had to be manually tuned for each clothing item to achieve sufficient accuracy.
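One way to read the per-item tuning of the classification probability is as a separate cutoff for each product, below which a detection is discarded. The sketch below illustrates that reading. The product IDs, threshold values and the Detection type are hypothetical, since the article does not give the real ones.

```python
from typing import Dict, List, NamedTuple, Tuple


class Detection(NamedTuple):
    product_id: str
    score: float                            # classification probability in [0, 1]
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


# Hypothetical per-item cutoffs, tuned by hand.
THRESHOLDS: Dict[str, float] = {"shirt_a": 0.80, "shirt_b": 0.60, "jeans_c": 0.70}
DEFAULT_THRESHOLD = 0.75


def keep_confident(detections: List[Detection]) -> List[Detection]:
    """Drop detections whose score is below the cutoff for their product."""
    return [
        d for d in detections
        if d.score >= THRESHOLDS.get(d.product_id, DEFAULT_THRESHOLD)
    ]


detections = [
    Detection("shirt_a", 0.72, (100, 200, 300, 500)),  # below 0.80, dropped
    Detection("shirt_b", 0.65, (500, 150, 700, 450)),  # above 0.60, kept
]
kept = keep_confident(detections)
print([d.product_id for d in kept])
```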

The second neural network is a Human Pose detection neural network. A pretrained network developed using Keras on TensorFlow is used for pose detection. The purpose of running the pose detection algorithm is to identify the position of the user’s right hand (specifically the right wrist). By comparing the position of the right hand of the user with the centers of the bounding boxes of the identified products, the product closest to the right hand of the user can be identified.
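The comparison between the right wrist and the bounding box centers can be sketched as a nearest-point search. The code below is a minimal illustration that uses Euclidean distance in pixel coordinates. The box format, product IDs and function names are assumptions, and for brevity it keeps one box per product ID.

```python
import math
from typing import Dict, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Point = Tuple[float, float]              # (x, y)


def box_center(box: Box) -> Point:
    """Return the center point of a bounding box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def product_in_hand(boxes: Dict[str, Box], right_wrist: Point) -> Optional[str]:
    """Return the product whose box center is closest to the right wrist."""
    if not boxes:
        return None
    return min(
        boxes,
        key=lambda pid: math.dist(box_center(boxes[pid]), right_wrist),
    )


boxes = {
    "shirt_a": (100, 200, 300, 500),  # center (200, 350)
    "shirt_b": (500, 150, 700, 450),  # center (600, 300)
}
print(product_in_hand(boxes, right_wrist=(520, 320)))
```

With these example values the wrist is about 82 pixels from the center of shirt_b and about 321 pixels from the center of shirt_a, so shirt_b is selected.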

Once the product in the hand of the user is determined, the ID of the product is sent back to the Flask server running on the Raspberry Pi. The Flask server connects to the store database and runs a query to find the answer to the user’s question. In the above scenario, the system would query the inventory for the recognized product in the size “small” and determine whether the item is in stock. We implemented the database using the MongoDB NoSQL database system, because it allows the database to be hosted on a separate server, thereby allowing the store’s database system to be a completely separate entity from the system.

Once the Flask server identifies the answer to the query, it uses an on-device text-to-speech library to output a spoken answer via the speakers connected to the Raspberry Pi. In this case the answer would be either “small size is available” or “Sorry, small size is not available”.
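The lookup and the answer it produces can be sketched as follows. To keep the example self-contained, a plain Python list stands in for the MongoDB collection. With a real database the same lookup would be a filter on product ID and size passed to the driver’s find call. The field names and records are hypothetical, and the returned sentence would then be handed to the text-to-speech library.

```python
from typing import Any, Dict, List, Optional

# In-memory stand-in for the store's inventory collection.
# Field names and values are hypothetical.
INVENTORY: List[Dict[str, Any]] = [
    {"product_id": "shirt_a", "size": "small", "stock": 3},
    {"product_id": "shirt_a", "size": "medium", "stock": 0},
]


def find_item(product_id: str, size: str) -> Optional[Dict[str, Any]]:
    """Return the first inventory record matching product and size, if any."""
    for record in INVENTORY:
        if record["product_id"] == product_id and record["size"] == size:
            return record
    return None


def size_answer(product_id: str, size: str) -> str:
    """Build the sentence the device would speak for a size query."""
    record = find_item(product_id, size)
    if record is not None and record["stock"] > 0:
        return f"{size} size is available"
    return f"Sorry, {size} size is not available"


print(size_answer("shirt_a", "small"))
print(size_answer("shirt_a", "medium"))
```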

System architecture

The primary modules of the system are:

1. Dialogflow agent

This agent performs voice recognition and natural language processing to identify user intents.

2. Heroku Node.js server

This server acts as an intermediary between the Dialogflow agent and the server running on the Raspberry Pi, since they cannot communicate with each other directly.

3. Raspberry Pi Flask server

This server handles a variety of tasks. It polls the Heroku server to identify if a request is present. It uses the webcam module to take a user photo and uploads the photo to the deep learning backend server. Upon receiving the result, it queries the database to get the answer to the user’s query, and uses a text to speech module to announce the answer.

4. Deep learning backend server

This server feeds each submitted image to two separate neural networks to identify the product ID of the product that is closest to the right hand of the user.

5. MongoDB database

This database contains information regarding the inventory of the store.


During implementation of the system, we learnt that some of our assumptions regarding the underlying technology stack were false, and we had to make significant changes to the system.

Image classification vs. object detection

Although the original plan was to use a simple image classification network, we quickly ran into a problem. When simulating a real shopping assistant, the assistant device would have a wide field of view, which means that the camera would see not only the item in the hand of the user but also other products in the background. It might even mistake the clothing of the user for a product. We therefore concluded that a method was needed to identify each of the products in the view of the camera, and then isolate the item in the hand of the user. To identify all of the products in view, we decided to use an object detection network, which draws a bounding box around each object and thereby provides the coordinates of each object in the image.

Dialogflow response time restriction

The webhook fulfillment system used in Dialogflow is designed to send a POST request to a server, allow the server to process the information and return the results to Dialogflow, so that the agent can reply to the user. In other words, the phrase “Small size is available” should be spoken by the Google Assistant bot on the user’s phone. Unfortunately, Dialogflow enforces a five-second timeout on responses to the POST messages it sends out. Therefore the entire process described in the previous section has to happen in under five seconds. Even though we tried various image classification models to improve the response time of the neural network, and even migrated the deep learning process to a separate local server hosted on a laptop with a dedicated graphics card, we were unable to bring the processing time under five seconds.

Therefore we were forced to discard the reply system offered by Dialogflow, and instead use a separate text-to-speech service on the Raspberry Pi to provide a spoken response via speakers connected to the device rather than through the user’s phone.
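To see which stage consumes the five-second budget, each step of the pipeline can be timed separately. The following is a minimal sketch of such a measurement, not code from the project. The stage names are placeholders and the lambda functions stand in for the real camera, network and database calls.

```python
import time
from typing import Callable, Dict, List, Tuple

DEADLINE_S = 5.0  # webhook response timeout described above


def time_stages(stages: List[Tuple[str, Callable[[], None]]]) -> Dict[str, float]:
    """Run each stage in order and record its wall-clock duration in seconds."""
    timings: Dict[str, float] = {}
    for name, run in stages:
        start = time.perf_counter()
        run()
        timings[name] = time.perf_counter() - start
    return timings


def within_deadline(timings: Dict[str, float], deadline_s: float = DEADLINE_S) -> bool:
    """Return True if the stages together finish before the deadline."""
    return sum(timings.values()) < deadline_s


# Placeholder stages; real ones would capture, upload, run inference and query.
stages = [
    ("capture_image", lambda: None),
    ("deep_learning", lambda: None),
    ("database_query", lambda: None),
]
timings = time_stages(stages)
print(sorted(timings, key=timings.get, reverse=True))  # slowest stage first
print(within_deadline(timings))
```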

Results and Conclusions

The system identified the correct product over 90% of the time under good lighting conditions. Even small changes in lighting resulted in considerably lower performance. The system was able to identify the product in the right hand even when the user was holding a different product in the left hand or wearing a different product.

In conclusion, the system has the potential to disrupt the way brick-and-mortar shops interact with customers, by replacing human assistants with digital assistants. The combination of voice recognition and natural language processing with computer vision allows us to create assistants that can mimic human behaviour (i.e. the assistant can identify what the user is holding in his or her hand, rather than the user having to scan a barcode). Using the Dialogflow API imposed latency restrictions that forced us to redesign part of the system. In the future, hosting the deep learning component on a cloud server with dedicated GPUs could reduce latency. Furthermore, the results can be affected by lighting or by multiple users in the field of view, so a barcode reader should perhaps be included as a backup.

While the system provided satisfactory results when classifying between the different clothing items after tuning the parameters, much better training would be needed to create a system that can classify between thousands of visually similar products. In contrast Dialogflow managed to identify the user’s intent correctly in almost all test scenarios. Thus the system has the potential to be viable if improvements are made.

Creating a digital shopping assistant with computer vision was originally published in Chatbots Life on Medium, where people are continuing the conversation by highlighting and responding to this story.
