Integrating Google Cloud Vision
| Description | Integrating Google Cloud Vision into Pega using native Data Pages, Connect-REST, and declarative properties |
| Version as of | 8.x |
| Capability/Industry Area | Data Integration |
Google Cloud Vision
What if Pega could interpret images and take appropriate action? Google Cloud Vision (GCV) is a powerful API that integrates well with our platform and dramatically enhances its capabilities with computer vision. If you're new to the party, computer vision (CV) can be defined as "an interdisciplinary scientific field that deals with how computers can be made to gain high-level understanding from digital images or videos" (Wikipedia). But instead of requiring you to train your own model, GCV comes with pre-trained models that offer a wide variety of features.
Use case examples
Here are some examples where GCV can help:
- Face detection: locate faces along with likely expressions such as joy or anger
- Landmarks: detect names and geo coordinates for certain landmarks
- Logos: locate individual logos
- Labels: provide a generalized description of what is shown in the image
- Text detection and document text detection: detect text via optical character recognition (OCR), including its hierarchy (pages, blocks, paragraphs, words) and language
- Objects: detect individual objects in an image
Before you begin
You will need an account for Google Cloud Services. While the first 1,000 calls per month are free, you are required to set up at least one active payment method. Prices are in the range of $1.50 per feature per 1,000 requests; as an example from Google's pricing page, 5,000 images with a single feature would cost you about $6 per month. Once your account is set up, create a new project and enable the Google Cloud Vision API. Finally, note that all requests need to be authenticated, so you will need either an API key or an OAuth token. This guide uses the former.
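Outside of Pega, you can verify your API key against the documented `images:annotate` endpoint before wiring anything up. A minimal Python sketch (the key value is a placeholder, and no network call is made until `urlopen` is invoked):

```python
import json
from urllib import request

VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"

def build_annotate_request(api_key: str, payload: dict) -> request.Request:
    """Build an authenticated POST request for the Vision annotate endpoint.
    The API key travels as the `key` query string parameter, mirroring the
    parameter Pega exposes on the generated Connect-REST rule."""
    url = f"{VISION_ENDPOINT}?key={api_key}"
    return request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder key for illustration only; call request.urlopen(req) to send.
req = build_annotate_request("YOUR_API_KEY", {"requests": []})
print(req.full_url)  # https://vision.googleapis.com/v1/images:annotate?key=YOUR_API_KEY
```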
Here are some more useful links and tools to help you get started:
- Postman is a cross-platform app allowing you to test REST APIs. You can use this to create and verify requests and the associated JSON payloads.
- Make a Vision API request shows you what request and response JSONs look like, the appropriate endpoints, and how to enable different features.
- Try it! is useful if you just want to test what the API can do for your image. You can use this in addition to your demo to show features that you didn't cover; the way polygons are drawn around detected objects or faces is quite useful as well.
Process/Steps to achieve objective
Using the Wizard
Pega works declaratively. This means no manual web service calls in an activity and no hand-written Connect-REST logic - just a data page that is referenced in a case. As soon as this page is needed, Pega automatically issues the request. Since App Studio's wizard can't work with a JSON payload yet (as of 8.3), you need to use Dev Studio. As you provide the following URL, Pega automatically exposes the only query string parameter (your API key - do not share this with anyone). Also, make sure to create the Content-Type header, which we will later set to application/json:
The next step is straightforward. GCV supports POST only, so check this method:
In the data model step, select Add a REST response and provide the following:
- Your API key
- Content-Type should be application/json
- Paste the sample request below (JSON)
- Click Run.
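The sample request pasted in the step above could look like the following. The original article's exact sample isn't reproduced here; this is a minimal payload following Google's documented annotate schema, with the base64 content elided as a placeholder and WEB_DETECTION as the feature (matching what the wizard picks up later):

```python
# Minimal Vision annotate payload: one image, one feature.
# "<BASE64_ENCODED_IMAGE>" is a placeholder for the real base64 string.
sample_request = {
    "requests": [
        {
            "image": {"content": "<BASE64_ENCODED_IMAGE>"},
            "features": [
                {"type": "WEB_DETECTION", "maxResults": 10}
            ],
        }
    ]
}
```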
The result should look like the following screenshot. Select Submit and Next.
On the final screen, just accept defaults and select Create. Pega will then automatically create the data type, data page, and the appropriate request and response data transforms for you (and plenty of other records).
While Pega prepares almost everything for us, there is one thing you should change. By default, GCV takes a base64-encoded image, essentially transforming a binary file into text. This text is then embedded in the request JSON.
Since you're likely to send a different image to GCV each time, this should become a parameter. In Dev Studio, head over to your Data types and open the data page just created by the wizard (you may need to select your data type for it to show up first):
Then add a parameter named imageBase64. While you're at it, make both parameters required; if you want, you can also hard-code your API key here, which is handy for demos.
On the definitions tab, open the request data transform. Note that parameters are passed to this transform automatically (use current parameter page is checked). In addition, the wizard only picked up the first feature of the array - if you used the example above, this should be WEB_DETECTION. Change two things:
- Use Param.imageBase64 as source for content
- Duplicate the features node to enable as many features in parallel as required.
And here is the data transform after the changes. Feel free to experiment with the settings - for example, you could expose maxResults or the requested feature as additional parameters.
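Duplicating the features node in the data transform corresponds to adding more entries to the `features` array in the resulting JSON. As a sketch (feature types taken from Google's documented Feature.Type enum; the base64 value is a placeholder):

```python
image_base64 = "<BASE64_ENCODED_IMAGE>"  # placeholder for the real content

# One entry per duplicated features node; GCV runs them in parallel
# against the same image.
features = [
    {"type": "WEB_DETECTION", "maxResults": 10},
    {"type": "LABEL_DETECTION", "maxResults": 10},
    {"type": "FACE_DETECTION", "maxResults": 10},
    {"type": "TEXT_DETECTION"},
]

annotate_payload = {
    "requests": [
        {"image": {"content": image_base64}, "features": features}
    ]
}
```

Exposing `maxResults` or the feature type as data page parameters, as suggested above, would simply mean substituting values into this structure.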
Note that you might need to do the same with the response data transform. For example, fullTextAnnotation isn't present if you used the sample request provided above.
After making the above changes, make sure to test-drive your data page; you can use this page for converting an image to base64. If everything goes well, the results should look similar to the following figure:
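If you prefer to do the conversion locally instead of via an online tool, a few lines of Python suffice (the file name in the usage comment is hypothetical):

```python
import base64

def image_to_base64(path: str) -> str:
    """Read a binary image file and return its base64 text representation,
    suitable as the value of the data page's imageBase64 parameter."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

# encoded = image_to_base64("photo.jpg")  # hypothetical file name
```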
Creating a demo case type
Next, you want to use your data page in a sample case type. The lifecycle consists of two simple steps - uploading an image followed by presenting the results in a view:
The recommended data model is explained below:
- Image - this field will contain the image uploaded by the user.
- Image Base64 - this field will hold a textual representation of the uploaded image (i.e. a base64-encoded string). Images contain a lot of data, so make sure to select an appropriate max length (I usually go for 2147483647, the maximum positive value of a 32-bit signed integer).
- Image Base64 Inline - this field is purely optional and used to render the image when presenting results. Again, this uses a max length of 2147483647.
- Image MimeType - again, purely optional and only used for rendering an image.
- Response - a data reference to your data type and data page. Note that our parameter imageBase64 is linked to the Image Base64 field of our case.
- .ImageBase64 is set to pyAttachmentPage.pyAttachStream, which already contains the base64 encoded data of the image uploaded by the user. Note that this isn't ideal because users may upload additional images or remove attachments, but it should suffice for demos.
- The optional field .ImageMimeType is set to pyAttachmentPage.pyAttachMimeType.
- The second optional field .ImageBase64Inline is set to "data:"+pyAttachmentPage.pyAttachMimeType+";base64,"+pyAttachmentPage.pyAttachStream. This so-called data URL will later be used to display the image in an icon control.
Rendering the image in a section
While the attachment data type is ideal for uploading an image and showing a thumbnail, having a full-sized image might be preferable. This is where ImageBase64Inline comes into play: just add an image/icon control to your section, and configure it as follows.
Displaying table data in a section
Since most features are returned as arrays (0..n faces, labels, web entities), tables are a good way to display results. Keep in mind that GCV also supports batch processing of multiple images in a single request - in that case you would see multiple responses. Our demo works with a single image at a time, hence hard-coding the first element in the list:
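In plain code terms, "hard-coding the first element" means indexing `responses[0]` and iterating over one of its annotation arrays. A sketch using a hypothetical response excerpt (field names follow the documented Vision API schema):

```python
# Hypothetical response excerpt: responses[] holds one entry per image,
# each with annotation arrays such as labelAnnotations[].
response = {
    "responses": [
        {
            "labelAnnotations": [
                {"description": "Coffee", "score": 0.97},
                {"description": "Cafe", "score": 0.91},
            ]
        }
    ]
}

# Single-image demo: take the first response, like the Pega table source does.
first = response["responses"][0]
labels = [(a["description"], a["score"]) for a in first.get("labelAnnotations", [])]
print(labels)  # [('Coffee', 0.97), ('Cafe', 0.91)]
```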
Here are some examples that make use of GCV and a section (view) in Pega:
The figure below shows how face detection works for a crowd of people. In addition to a confidence score, GCV provides information about possible emotions such as joy or surprise, features such as headwear, and properties of the photograph itself such as underexposure or blurriness. You can use this in your demos to detect whether a headshot provided by the customer has low quality (underexposed, blurred) or violates certain regulations (e.g. rules against smiling or wearing headwear).
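Such a headshot check can be expressed as a few rules over the face annotation fields. A sketch with hypothetical photo rules (the likelihood field names and enum values come from the documented Vision API schema):

```python
# Likelihood values that trigger a rejection under our hypothetical rules.
REJECT = {"LIKELY", "VERY_LIKELY"}

def headshot_issues(face):
    """Return the list of rule violations for one faceAnnotations entry."""
    checks = {
        "joyLikelihood": "smiling",
        "headwearLikelihood": "wearing headwear",
        "underExposedLikelihood": "underexposed",
        "blurredLikelihood": "blurred",
    }
    return [label for field, label in checks.items()
            if face.get(field) in REJECT]

face = {"joyLikelihood": "VERY_LIKELY", "blurredLikelihood": "UNLIKELY"}
print(headshot_issues(face))  # ['smiling']
```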
GCV also provides solid logo detection. Think of a roadside assistance case: the customer takes a picture of the car, and GCV automatically picks up the make (and in some cases even the model).
Text Annotations, Labels and Web Entities
The figure below shows several possible use cases. Notice how GCV correctly classifies the image as a California driver license. You can use that in your demos, for example in customer onboarding. While this isn't due diligence with regard to ID verification, it can at least sort out images that are unlikely to be an acceptable form of identification. Another aspect is shown below: GCV can read text via optical character recognition (OCR). This doesn't just give you individual words, but also metadata such as hierarchical information - pages, blocks, paragraphs - and the most likely language.
Web Entities are also a kind of label, but they compare your image to similar images on the web. This comes in different flavors - the figure below shows matching entities for the photograph (in this case Cafe, Coffee, Restaurant, Bakery). Web Entities can also detect visually similar images and where they were found - you could use this to detect license or IP violations.
There is more!
As laid out earlier, take a look at all available features. GCV can do much more - for example, it can detect whether a photo is likely to contain adult content, depict violence, and more.