Automated Metadata Extraction: Part 1

I hate filling in forms. Really, I do. Imagine if you had to fill in a form each time you uploaded a document into SharePoint Online? You do? I feel for you.

For some time I’ve been wanting to look at how my organisation can use tools such as AI Builder to help auto-generate metadata for types of documents we store and manage in SharePoint Online.

At the time of writing, AI Builder is about 12 months old following preview, and it’s been aligned to the Power Platform. The solution includes a Form Processing component, which can be used to create and train a model to extract named properties from your documents.

I wanted users to be able to search, sort and filter on key pieces of information contained in our statements of work (contracts). The logical solution was to create columns to store this information, given the content is hosted in SharePoint Online. From there, I could convert these columns into managed properties, for use in search refiners and as referenceable values in other solutions.

This is part one of a two-part series designed to walk you through the steps to develop a no-code solution to analyse documents and automatically extract and set associated metadata. (you can find part two here). Our solution will use the following building blocks:

SharePoint Online
AI Builder
Power Automate

From a licensing perspective, you’ll need Power Automate P2 (Premium) to re-create my solution and an E1 licence or better to configure the SharePoint Online components.

This article (part one) covers the AI Builder component. Part 2 covers the SharePoint Online / Power Automate component.

Create Your Model

You can access AI Builder via PowerApps or Power Automate. Here you can select a pre-fabricated model and train it. The models you train and publish are stored centrally and visible to all your Power Platform solutions so it doesn’t matter how you access AI Builder.

We’re going to use the Form Processing model and train it to extract information from our statement of work documents

AI Builder - Model Selection

Some important things to note:

At the time of writing, AI builder can only analyse content in pdf, png, jpg and jpeg format (don’t worry, our solution will handle conversion from native Office document formats).
Start with 5 examples of the document you want the solution to analyse. Ensure they contain examples of the metadata you want to extract. This is because the field selection process targets a single document and you don’t get to choose which one if you start with many documents.
For a reliable model, upload at least 5 other training documents after field selection. Manual effort is needed to teach AI Builder how to analyse each document. Put these in the same location so you can bulk upload them in one action.
It’s important to introduce some variety here (in my case, there were several key variations of our statement of work I wanted AI Builder to recognise). A set of completely different, unrelated documents will skew your model.

Upload

First, convert any training documents to pdf format if they are in a native Office document format.

You’ll be prompted to upload your documents. Remember you can add more training documents later if you like.

AI Builder - Upload Documents

Once you’ve uploaded your documents, hit that Analyse button to give AI Builder an initial look.

AI Builder - Sample Documents

Review

Now you’ll have the opportunity to identify the metadata you want to extract from similar documents. AI Builder may take the initiative and create some for you. In any case, keep what you want. You can come back after saving the model to repeat this process.

AI Builder will present an example document to you for review. Simply highlight the content you wish the model to extract from the document and create an associated field name. As mentioned earlier, you can to make sure you can find all your fields in this document.

AI Builder - Field Selection

When you’re done with the document, hit that Confirm Fields button and you’re ready to review the rest of your payload.

This review is both manual and sequential, one document at a time. Your progress is tracked as you go. You’ll need to identify and assign values to each field you created.

AI Builder - Analysis

(In cases where a field does not occur in one of your documents, or its value is not set, then you can select the Field not in document option).

Note that if you add more documents to the model, you’ll need to run through this process for each new document you added. You won’t have to repeat the process for documents already reviewed unless you add, modify or remove a selected field.

At the end of the review process, you’ll be able to click the Done button. On clicking Next you’ll be presented with a model summary.

Training

The final step before testing is to Train your model. This will save it and the training will happen in the background. Keep an eye on the Status. It will read Trained when training is complete. You’re now ready to test your model before publishing it.

Testing

So here I have my model that I have lovingly called SOW_Processing . I needed 20 documents to fully train my solution. There are 12 selected fields my model will extract during analysis. Your own end state will of course vary.

AI Builder - Model Summary

Listing the field names here is handy, as we’ll need a Content Type to represent a statement of work and Columns to hold this metadata in our SharePoint Online solution. We’ll need a blueprint for the SharePoint Online setup. Let’s go with this:

Statement of Work - Class View

Here we’re testing the reliability of the model. It’s not fit for purpose if, for example, it can’t identify correct values for the selected fields in a statement of work presented to it.

The Quick test allows us to see how well a model can identify the correct metadata in documents we present to it for analysis. Make sure you use new documents (not used to train the model) for testing.

Upload a document and have the model analyse it. On scanning the document, you’ll find parts of it highlighted, indicating that the model has found a value for one of it’s selected fields.

AI Builder - Quick Test

The Confidence score is important. This represents the degree of certainty the model has that the value for the selected field is correct.

Some important things to note:

Training is an iterative process. You’re supposed to re-train your model over time, introducing new variations to it.
Set a minimum confidence threshold for each selected field as part of your acceptance criteria for the model.
Scores between 80 and 100 are good. You cannot guarantee a confidence score of 100 even if you train your model extensively, since new variations can always be presented to it.
Scores below 80 introduce risk and below 50 are indicative of an unreliable model. An unreliable model is, in my opinion, not fit for purpose (since you can’t trust the data). The solution here is to train your model using your test documents (by uploading them to the model and re-training it). This will teach your model to recognise such variations in future tests.
You can use the confidence score in applications (such as Flows or Apps) as part of your validation logic! (more on that in my next article).

Publication

So, let’s assume you’ve run some Quick tests, and your model is reliably identifying values for selected fields with a confidence score that is within the threshold set in your acceptance criteria. Great! You can now publish it by selecting Publish.

Your model is ready for use.

AI Builder - My Models

Now that we’ve created an AI Builder model to extract selected fields from statement of work documents, we can develop a Power Automate solution to extract and set the metadata for these documents when they get uploaded to SharePoint Online!

Head on over to part two where we’ll cover the setup and configuration of the workflow (Power Automate) component of the solution.