Overview

Amazon Comprehend is a natural language processing (NLP) service provided by Amazon Web Services (AWS) that uses machine learning to uncover insights and relationships in text. Comprehend provides a number of features useful to businesses and users working with unstructured text data.

In this post I am going to examine the following features of Comprehend:

• Sentiment analysis – aims to define the emotion of a passage of text
• Entity extraction – categorises text into entities like person and quantity
• Personally identifiable information (PII) – identifies personal information like credit card details

I will then demonstrate a potential use case of Comprehend: automating data governance procedures. In the demo we will use Comprehend to identify credit card details in an email that must be redacted before the email is stored. The demo uses Python together with Comprehend, Lambda and S3 from AWS.

Sentiment Analysis

Sentiment analysis aims to classify the emotion conveyed in a passage of text. A typical use case is a business that wants to sort customer feedback into positive and negative. One simple approach to sentiment analysis is a bag-of-words model. This model passes over each word in the text and looks it up in what can be thought of as a sentiment dictionary, in which each word has an associated sentiment value. For example, the word excellent carries a high sentiment value whereas the word terrible has a low one. Adding these values up gives the overall sentiment of the text. NLP using machine learning moves beyond considering individual words in isolation: it takes into account the structure and context of sentences to better understand the text.
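To make the bag-of-words idea concrete, here is a minimal sketch in Python. The lexicon, its values and the function names are my own illustrative assumptions; this is not how Comprehend works internally.

```python
# Hypothetical sentiment dictionary: the words and values are illustrative only.
SENTIMENT_LEXICON = {
    "excellent": 2.0,
    "good": 1.0,
    "bad": -1.0,
    "terrible": -2.0,
}


def bag_of_words_sentiment(text: str) -> float:
    """Sum the lexicon value of every word; unknown words contribute 0."""
    words = (w.strip(".,!?;:\"'").lower() for w in text.split())
    return sum(SENTIMENT_LEXICON.get(w, 0.0) for w in words)


def label(score: float) -> str:
    """Map a summed score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(label(bag_of_words_sentiment("The food was excellent!")))
print(label(bag_of_words_sentiment("The film was not good.")))
```

The second sentence is scored as positive because the model only sees the word good and ignores the negation. This is exactly the kind of context that a machine learning approach is meant to capture.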

Now let us look at an example. If we launch the Comprehend service within the AWS management console, we can simply paste in the text we wish to analyse. Let us take the first paragraph of Roger Ebert’s review of the movie E.T. and paste that into Comprehend. After clicking on the Sentiment tab, we see the result in the following screenshot.

Comprehend returns a confidence score for each sentiment category (positive, negative, neutral and mixed). Here it correctly identifies the review as positive.
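The same analysis can be run from Python with boto3 instead of the console. The sketch below assumes AWS credentials and a region are already configured. The helper names are mine; detect_sentiment and the Sentiment and SentimentScore response keys come from the boto3 Comprehend client.

```python
def detect_sentiment(text, language_code="en", client=None):
    """Call Comprehend's DetectSentiment API (needs AWS credentials and a region)."""
    if client is None:
        import boto3  # imported lazily so summarise_sentiment works without AWS

        client = boto3.client("comprehend")
    return client.detect_sentiment(Text=text, LanguageCode=language_code)


def summarise_sentiment(response):
    """Return the overall label and the confidence score reported for it."""
    overall = response["Sentiment"]        # e.g. "POSITIVE"
    scores = response["SentimentScore"]    # keys: Positive, Negative, Neutral, Mixed
    return overall, scores[overall.capitalize()]


# Usage, once credentials are configured:
#   response = detect_sentiment("Paste the review text here")
#   print(summarise_sentiment(response))
```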

Entities

Entity detection in Comprehend lets you identify people, locations, organisations, quantities and dates in a passage of text. The following screenshot shows how this works when analysing some text about Wikipedia.

As can be seen in the example above, Comprehend categorises each entity and returns a confidence rating with it. For quantity entities, Comprehend extracts not only the number but also the words that give that number context: not simply "6.2 million" but "6.2 million articles". The names of persons are also extracted, which can be useful, for example, when personal information needs to be anonymised. Comprehend can also be extended with custom entities such as specific reference numbers or product SKUs. For example, if a company has a specific reference number system for a line of its products, it can add these to Comprehend, which will then recognise them under a products category when they appear in text. This lets the service be tailored to individual requirements beyond general problems.
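Through boto3, the equivalent call is detect_entities, which returns a list of entities with a Type, Text, Score and character offsets. The sketch below groups such a response by entity type. The sample response is hand-written to show the shape only; its scores are made up and are not real Comprehend output.

```python
from collections import defaultdict


def group_entities(response, min_score=0.5):
    """Group entity texts by type, keeping only entities at or above min_score."""
    grouped = defaultdict(list)
    for entity in response["Entities"]:
        if entity["Score"] >= min_score:
            grouped[entity["Type"]].append(entity["Text"])
    return dict(grouped)


# A real response would come from:
#   boto3.client("comprehend").detect_entities(Text=text, LanguageCode="en")
# Illustrative response for the text "Wikipedia has 6.2 million articles".
sample_response = {
    "Entities": [
        {"Type": "ORGANIZATION", "Text": "Wikipedia", "Score": 0.99,
         "BeginOffset": 0, "EndOffset": 9},
        {"Type": "QUANTITY", "Text": "6.2 million articles", "Score": 0.98,
         "BeginOffset": 14, "EndOffset": 34},
    ]
}

print(group_entities(sample_response))
```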

PII

Similar to entity recognition is the ability of Comprehend to recognise personally identifiable information (PII). Comprehend recognises financial, personal, technical security, national and date categories. From the example text passage AWS provides, we can see that it recognises financial and personal information.
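From Python, PII detection is exposed as detect_pii_entities. Unlike entity detection, each PII entity in the response carries only a type, a score and the begin and end offsets, so the matching text has to be sliced out of the original string. The sketch below shows this with a hypothetical sentence and a hand-built response whose scores are invented.

```python
def pii_spans(text, response, min_score=0.5):
    """Return (type, matched_text) pairs for each PII entity at or above min_score."""
    spans = []
    for entity in response["Entities"]:
        if entity["Score"] >= min_score:
            matched = text[entity["BeginOffset"]:entity["EndOffset"]]
            spans.append((entity["Type"], matched))
    return spans


# Hypothetical text; a real response would come from
#   boto3.client("comprehend").detect_pii_entities(Text=text, LanguageCode="en")
text = "Hello Jane Doe, your card 1111-0000-1111-0008 has been charged."
name, card = "Jane Doe", "1111-0000-1111-0008"
sample_response = {
    "Entities": [
        {"Type": "NAME", "Score": 0.99,
         "BeginOffset": text.index(name), "EndOffset": text.index(name) + len(name)},
        {"Type": "CREDIT_DEBIT_NUMBER", "Score": 0.99,
         "BeginOffset": text.index(card), "EndOffset": text.index(card) + len(card)},
    ]
}

print(pii_spans(text, sample_response))
```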

We will explore the use of PII in the following demo.

Demo

In this demo, we imagine a scenario in which a customer service team has sent a customer an email containing the example text above, which can also be found here, and needs to keep a record of it. Under the company's data governance regulations, the team is not allowed to keep any credit card details in storage. The demo shows how we can use Comprehend to automate the redaction of the credit card number from the text. Using triggers in Lambda, we can make sure that the redaction function runs each time a new unredacted file appears.

The demo uses the following services from AWS:
• Comprehend – to identify the credit card number in the text
• S3 – to store the unredacted and redacted text
• Lambda – to execute our Python code each time a new text file appears in the S3 bucket. The code calls Comprehend to find where the credit card number is positioned in the text, redacts the number and saves the result to another S3 bucket.

1) To begin, we will create two S3 buckets: one that holds the unredacted emails and one that holds the redacted emails. If you are unfamiliar with S3 and creating buckets, please see this page. Create the buckets with the default settings in the setup wizard that opens after clicking Create bucket. You should then have the following two S3 buckets.
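If you prefer to script this step rather than use the console, the buckets can also be created with boto3. This is an optional sketch, not part of the original walkthrough. The helper names and the region are my assumptions, and because S3 bucket names are globally unique you will most likely need your own names in place of emails-unredacted and emails-redacted.

```python
def create_bucket_kwargs(name, region):
    """Build the arguments for create_bucket; us-east-1 must omit the location."""
    kwargs = {"Bucket": name}
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_demo_buckets(names, region, client=None):
    """Create each bucket in the given region (needs AWS credentials)."""
    if client is None:
        import boto3  # imported lazily so create_bucket_kwargs works without AWS

        client = boto3.client("s3", region_name=region)
    for name in names:
        client.create_bucket(**create_bucket_kwargs(name, region))


# Usage, with your own globally unique names and region:
#   create_demo_buckets(["emails-unredacted", "emails-redacted"], "eu-west-1")
```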

2) Now we need to create a new Lambda function. If this is your first time using Lambda, take a look at the documentation here to gain some familiarity with it. I have called my function comprehend_redact and chosen Python 3.7 as the runtime, as shown in the following screenshot.

3) To automate the redaction process, we want our Lambda function to execute each time a new file is created in our S3 bucket. To do so, click on Add trigger in the designer window in Lambda.

We specify the following so that after an object create event occurs in our emails-unredacted S3 bucket, our code in Lambda will run.
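When the trigger fires, Lambda passes our function an event describing the new object. The sketch below shows how the bucket name and object key can be read from that event. The sample event is trimmed to the fields used here, and the file name in it is hypothetical.

```python
from urllib.parse import unquote_plus


def objects_from_event(event):
    """Return (bucket, key) pairs for every record in an S3 notification event."""
    pairs = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Object keys arrive URL-encoded, e.g. spaces are sent as "+".
        key = unquote_plus(s3_info["object"]["key"])
        pairs.append((bucket, key))
    return pairs


# Trimmed, hand-written example of an object-created event.
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "emails-unredacted"},
                "object": {"key": "customer+email.txt"},
            },
        }
    ]
}

print(objects_from_event(sample_event))
```

Reading the key from the event rather than hard-coding a file name lets the same function handle any file dropped into the bucket.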

4) Before we add our Python code to the Lambda function, we need to give Lambda permission to access S3 and Comprehend. Click on the Permissions tab next to Configuration, positioned below the name of your Lambda function, then click on the role name, which should look something like the following.

It will take you to the IAM Management Console. Roles define what services your Lambda function can access. Note that you can use pre-existing roles and it is not necessary to define a new role for each Lambda function you create. For our demo, you will need to click Attach policies and search for the following policies that I have already added, as seen below.

Once you have added these policies to your role, Comprehend and S3 can be accessed from your Lambda function.
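If you would rather grant only what this demo needs than attach broad managed policies, a narrower policy document could look like the sketch below. This is an alternative I am suggesting, not the set of policies shown in the screenshot. The function name is hypothetical, and the actions listed are the ones the demo relies on: reading from one bucket, writing to the other and calling PII detection.

```python
import json


def least_privilege_policy(source_bucket, target_bucket):
    """Build an IAM policy document limited to the calls the demo makes."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "comprehend:DetectPiiEntities",
                "Resource": "*",
            },
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{source_bucket}/*",
            },
            {
                "Effect": "Allow",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{target_bucket}/*",
            },
        ],
    }


policy = least_privilege_policy("emails-unredacted", "emails-redacted")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM console as an inline policy on the Lambda role.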

5) Let us now inspect the Python code that Lambda executes. We will be using boto3, the SDK that allows us to use AWS services from Python; see https://boto3.amazonaws.com/v1/documentation/api/latest/index.html.

Lines 1-5: The default version of boto3 in the Lambda runtime is outdated and does not include the Comprehend function that we need. These lines install the latest version of boto3, which does.

Line 7: We import boto3.

Line 10: In Lambda, our code must be within the lambda_handler for it to execute. The rest of our code is within this function.

Line 11: Create the S3 resource.

Line 12: Create the Comprehend client.

Line 13: Pull in the text from the file email.txt, currently in S3 object form.

Line 14: Decode the contents of the S3 object as UTF-8 to obtain a string that we can pass into the Comprehend function.

Line 16: Here we check we have at least version 1.16.2 of boto3 before we can use the PII function in Comprehend.

Line 18: Here we call the detect_pii_entities function, supplying the text we want to analyse and the language of the text. The response is a Python dictionary that gives, for each PII entity found, its type together with the index at which it begins and the index at which it ends.

Lines 23-26: Here we loop through the entities in the response from the Comprehend API. If an entity's type is a credit card number, we record its beginning and ending positions in lists.

Line 28: We build a new string in Python from three parts: 1) all the text up to the start of the number; 2) a run of # symbols in place of the credit card number, whose length is the end position minus the beginning position; 3) the rest of the text from the end position of the number to the end of the text.

Line 29: Print the new text with the redacted number to check.

Line 31: We then send the new text file to the emails-redacted S3 bucket.
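The original listing is not reproduced here, so the following is a sketch that follows the walkthrough above rather than the exact code; its line numbers will not match the ones quoted. It assumes the bucket names emails-unredacted and emails-redacted and the file email.txt from this demo. It leaves out the boto3 installation and version check on the assumption that the boto3 in your runtime already provides detect_pii_entities. It also redacts each span directly instead of first collecting the positions in lists.

```python
def redact_credit_cards(text, pii_entities, mask="#"):
    """Replace every credit card span with mask characters of the same length."""
    redacted = text
    for entity in pii_entities:
        if entity["Type"] == "CREDIT_DEBIT_NUMBER":
            begin, end = entity["BeginOffset"], entity["EndOffset"]
            # Text before the number + mask + text after the number.
            redacted = redacted[:begin] + mask * (end - begin) + redacted[end:]
    return redacted


def lambda_handler(event, context):
    import boto3  # imported here so redact_credit_cards can be tested without AWS

    s3 = boto3.resource("s3")
    comprehend = boto3.client("comprehend")

    # Read email.txt from the unredacted bucket and decode its bytes to a string.
    body = s3.Object("emails-unredacted", "email.txt").get()["Body"].read()
    text = body.decode("utf-8")

    # Ask Comprehend where the PII sits in the text.
    response = comprehend.detect_pii_entities(Text=text, LanguageCode="en")
    redacted = redact_credit_cards(text, response["Entities"])
    print(redacted)  # appears in the function's logs as a quick check

    # Save the redacted text to the second bucket.
    s3.Object("emails-redacted", "email.txt").put(Body=redacted.encode("utf-8"))
    return {"statusCode": 200}
```

Because the mask has the same length as the number it replaces, the offsets of any later entities stay valid, so several card numbers in one email can be redacted in a single pass.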


Remove any of the default code that exists in Lambda and replace it with the code above. Then click Deploy.

That is all we need. Now once we add a text file to the emails-unredacted S3 bucket, Lambda will execute our code and replace the credit card number with # symbols. Then it will send the redacted version to the emails-redacted S3 bucket, as demonstrated by the following screenshots.

The redacted file contains the following text:

Conclusion

Comprehend is a quick and easy-to-use service for getting started with NLP. Given the vast amounts of unstructured text data that AWS has at its disposal to train machine learning models, it should come as no surprise that it can create a successful NLP service. Machine learning and artificial intelligence are beginning to create more opportunities in how we gather insights from text data and how we automate procedures around it. In the demo we showed one possible use case of NLP with respect to data governance. As data governance becomes an increasingly important requirement for businesses, automated functions that can redact personal information may become more prominent.