Simple Computer Vision Image Creative Analysis using Google Vision API
By Zikry Adjie Nugraha · Aug 2022
Create your first computer vision project using label detection, object detection, face expression detection, text detection, and dominant color detection
Computer vision can be used to extract useful information from images and videos. It allows computers to see and understand what information can be gleaned from visual inputs: after receiving an image, a system can gather valuable information from it and determine the next step to take.
The Google Vision API is a Google Cloud service that applies computer vision to extract valuable information from images. Even as a beginner, you can use this service to gain meaningful insights from an image. The following image shows how the Google Vision API works.

The image above shows what the Google Vision API is capable of. It recognizes the facial expression, text, and dominant colors in the ad image: the facial expression detection clearly captures the person’s joyful expression, the text detection picks up the words “LEARN MORE,” and the color detection returns the top 10 dominant colors within the image.
By utilizing these capabilities, we can gain a lot of insight from an image. For example, suppose we want to know which factors in an ad image cause a customer to click and view our ads; the Google Vision API can help us discover them.
This article focuses on how to extract these factors from an image and what insights we can gain from them. We will not use an ad image example because it cannot be published due to company confidentiality. Instead, we will use product images available for data analysis in a Kaggle dataset.
The images for this project come from the stylish product image dataset on Kaggle. Because the dataset contains a large number of product images from an e-commerce site, we will only use a small portion of them in our creative analysis. The dataset’s license allows you to copy, modify, distribute, and perform work on it.
Before we begin, we must first configure the Vision API service in Google Cloud. Official step-by-step instructions can be found here, but to make things easier, we’ll walk through the setup from Google Cloud step by step.
(Note: you must configure the Google Cloud Vision API from your own Google Cloud account; this tutorial does not provide the file containing the confidential Google Cloud keys.)
Step 1: Log in to Google Cloud Project and, from the Home page, select “Go to APIs overview”.

Step 2: Select “ENABLE APIS AND SERVICES,” then search for and enable Cloud Vision API.


Step 3: Go to Credentials, then click “CREATE CREDENTIALS” and then “Service Account”.

Step 4: Enter your service account information (you can skip the optional parts) and click Done.

Step 5: Navigate to the service account you created. Go to KEYS, then “ADD KEY” and “Create new key”.


Step 6: Choose the JSON key type, then download the JSON file and place it in the Python script’s working directory.

Before we begin computer vision modeling, we must first install the required libraries. The first library we’ll install is google-cloud-vision, which provides the detection features; we can use it once we have enabled access to the Google Cloud Vision API.
The next library is webcolors, which comes in handy when we need to convert the hex color codes returned by color detection into the closest color names we know.
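As a rough sketch, the installation could look like the commands below (package names as published on PyPI; scipy is an assumption here, included for the KDTree used later):

```
pip install google-cloud-vision webcolors scipy pandas
```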
After installing the necessary libraries, we will import them into our script. We import vision from the google.cloud package for the detection features, while IPython, io, and pandas are used for data preprocessing and display.
webcolors is used to convert hex color codes into color names that we are familiar with, and KDTree is used to find the closest match among the CSS3 named colors. A KDTree indexes a set of k-dimensional points so that any point’s nearest neighbors can be found quickly.
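A minimal import block matching the libraries described above might look like this (assuming the KDTree comes from scipy.spatial, which is a common choice):

```python
import io
import os

import pandas as pd
from IPython.display import display  # preview images and dataframes in a notebook
from google.cloud import vision      # Google Cloud Vision API client
from scipy.spatial import KDTree     # nearest-neighbour search over RGB points
import webcolors                     # hex <-> CSS3 colour-name utilities
```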
After placing the JSON key file in our working directory, we point the Google Cloud credentials at it and create the Vision API client in our Python script.
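A minimal sketch of this step (the key file name below is a placeholder for the JSON file you downloaded in Step 6):

```python
import os
from google.cloud import vision

# Point the client at the downloaded service-account key.
# "service-account-key.json" is a placeholder; use your own file name.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "service-account-key.json"

# A single client instance can be reused for every annotation request below.
client = vision.ImageAnnotatorClient()
```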
Label detection identifies general labels in an image. LabelAnnotation can recognize general objects, locations, activities, products, and other things within an image. The code below shows how we extract the label information from the stylish dataset’s images.
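The full code is in the GitHub repository linked at the end; a minimal sketch of this step, reusing the `client` created above and a hypothetical image path, might look like this:

```python
import io
from google.cloud import vision

# "product_image.jpg" is a placeholder path for one of the dataset images.
with io.open("product_image.jpg", "rb") as image_file:
    content = image_file.read()
image = vision.Image(content=content)

# Label detection: general labels such as objects, activities, and products.
response = client.label_detection(image=image)
for label in response.label_annotations:
    print(f"{label.description}: {label.score:.0%}")
```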
From this image, we can see the Google Vision API has detected several general labels, such as:
- Face expression (Smile)
- The human body (Face, Joint, Skin, Arm, Shoulder, Leg, Human body, Sleeve)
- Object (Shoe)
Although the Vision API has identified many labels, some general objects have been misidentified or missed entirely. It mistook the sandal for a shoe, and it failed to recognize the clothing, leafy plant, mug, and chair in the image above.
Object detection locates individual objects in the image. Unlike label detection, object detection emphasizes the confidence level of each detection. LocalizedObjectAnnotation scans for multiple objects within an image and returns each object’s position as a rectangular bounding box.
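A hedged sketch of this step, reusing the `client` and `image` objects from the label detection example:

```python
# Localized object detection: each result carries a name, a confidence score,
# and a normalized bounding polygon.
response = client.object_localization(image=image)
for obj in response.localized_object_annotations:
    print(f"{obj.name}: {obj.score:.0%}")
    for vertex in obj.bounding_poly.normalized_vertices:
        print(f"  ({vertex.x:.2f}, {vertex.y:.2f})")
```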
From this image, we can see the Google Vision API has detected several objects as follows:
- Sunglasses (Confidence: 90%)
- Necklace 1 (Confidence: 83%)
- Necklace 2 (Confidence: 77%)
- Miniskirt (Confidence: 76%)
- Shirt (Confidence: 75%)
- Clothing (Confidence: 70%)
- Necklace 3 (Confidence: 51%)
In the image above, most of the detected objects are items of clothing. The clearly visible objects, such as the sunglasses, necklace 1, necklace 2, the miniskirt, the shirt, and the clothing, were identified with high confidence. Necklace 3 has the lowest confidence because the Vision API considers the item in the bottom right corner to be a necklace as well; since that object looks more like a bracelet than a necklace, its confidence level is lower than the others.
Face detection finds human faces and their emotional expressions in the image. FaceAnnotation locates the position of each human face within an image and, while doing so, also estimates the likelihood of various emotional expressions.
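A minimal sketch of the face detection step, again reusing the `client` and `image` objects from above:

```python
from google.cloud import vision

# Face detection: each face carries likelihood ratings for several emotions.
likelihood = vision.Likelihood  # enum from VERY_UNLIKELY up to VERY_LIKELY

response = client.face_detection(image=image)
for face in response.face_annotations:
    print("Joy:", likelihood(face.joy_likelihood).name)
    print("Sorrow:", likelihood(face.sorrow_likelihood).name)
    print("Anger:", likelihood(face.anger_likelihood).name)
    print("Surprise:", likelihood(face.surprise_likelihood).name)
```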
From the image above, we can see the Google Vision API has detected the following expression likelihoods on the human face:
- Joy: VERY_LIKELY
- Sorrow: VERY_UNLIKELY
- Anger: VERY_UNLIKELY
- Surprise: VERY_UNLIKELY
We can see from the image above that the expression is a smile, and the Vision API recognizes it as a joyful expression. The expressions of sorrow, anger, and surprise do not match the picture because the person does not show those emotions. As a result, joy is scored as very likely, while the others are scored as very unlikely.
TextAnnotation can be used to detect and extract text from images. The extracted text includes individual words and sentences along with their rectangular bounds.
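A minimal sketch of the text detection step, reusing the same `client` and `image`:

```python
# Text detection (OCR): the first annotation is the full text block,
# the remaining annotations are individual words with their bounding polygons.
response = client.text_detection(image=image)
texts = response.text_annotations
if texts:
    print("Full text:", texts[0].description)
    for word in texts[1:]:
        print(word.description)
```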
From this image, we can see the Google Vision API has detected various pieces of text, such as:
- 文化
- THISIS
- WHAT
- awesome
- LOOKS
- LIKE
For some reason, the Vision API has identified what appears to be the Japanese word 文化. This can happen when the API inadvertently reads the pattern of the rattan carving behind the clothing as characters and interprets it as a Japanese word.
The results show that it detects both capitalized and non-capitalized words. It also returned the word “THISIS,” which should be “THIS IS.” This illustrates a limitation of the Vision API: it reads “THIS IS” as “THISIS” when the spacing between the words is too narrow.
Dominant color detection is one of the features of the image properties annotation. It returns the top ten dominant colors within an image along with the fraction of pixels each color covers.
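A minimal sketch of this step, reusing the same `client` and `image`:

```python
# Image properties: dominant colours with their pixel fractions.
response = client.image_properties(image=image)
colors = response.image_properties_annotation.dominant_colors.colors
for color in colors:
    rgb = color.color
    hex_code = "#{:02X}{:02X}{:02X}".format(int(rgb.red), int(rgb.green), int(rgb.blue))
    print(f"{hex_code} covers {color.pixel_fraction:.1%} of the image")
```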
The Google Vision API has detected the top ten colors in hex format, as shown in the image above. To obtain a human-readable color name, we must convert each hex code into the nearest CSS3 color name. We then use a KDTree to find the closest match among the CSS3 named colors.
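The conversion function referenced below is in the GitHub repository; a minimal sketch, assuming the KDTree comes from scipy and using the CSS3 name table from webcolors, could look like this:

```python
from scipy.spatial import KDTree
import webcolors

def closest_css3_name(hex_code):
    """Return the CSS3 colour name nearest to the given hex code in RGB space."""
    # webcolors.CSS3_HEX_TO_NAMES maps hex codes to CSS3 names in webcolors 1.x;
    # newer releases expose the same data through webcolors.names("css3").
    names, rgb_points = [], []
    for css_hex, css_name in webcolors.CSS3_HEX_TO_NAMES.items():
        names.append(css_name)
        rgb_points.append(webcolors.hex_to_rgb(css_hex))
    # The KDTree finds the nearest named colour to the queried RGB point.
    _, index = KDTree(rgb_points).query(webcolors.hex_to_rgb(hex_code))
    return names[index]

print(closest_css3_name("#A41B24"))  # -> 'firebrick'
```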
We used the hex color A41B24, the second most dominant color, as an example. Using the function above, we found that the closest CSS3 color is firebrick, which matches the reddish color of the sneaker in the image above.
In the creative analysis above, we have performed computer vision modeling using label, object, face expression, text, and dominant color detection. Even so, each detection annotation still has notable limitations:
- Label detection: it can detect many general objects and facial expressions in a picture, but some objects may be misidentified (in our analysis, it misidentified a sandal as a shoe).
- Object detection: misidentification can also occur here, but we can anticipate it by looking at the confidence level of each detection (in our analysis, it misidentified a bracelet as another necklace).
- Face expression detection: the image clearly shows the boy’s joyful expression. However, if you use an image whose expression falls outside the supported categories, every expression likelihood will come back as very unlikely because the model cannot determine which expression it is.
- Text detection: text can be extracted using the text annotation, but unwanted text may be included in the results (in our analysis, it detected the Japanese word “文化” even though there are no Japanese words in the picture).
- Dominant color detection: it can detect multiple dominant colors in an image, but it only returns them in RGB or hex format. To obtain color names we are familiar with, a function that converts hex codes into the nearest color names must be added.
If you want more details on the code used in this analysis, you can check my GitHub repository.
(Note: the Google Cloud keys are not included in the repository; you will have to create them yourself using the steps above.)
GitHub: https://github.com/nugrahazikry