At what point does the cost of increasing the training dataset size outweigh any benefit to a model’s predictive performance?

This article aims to provide an understanding of the relationship between accuracy and training data so that together we can answer that very question.

Training most machine learning (ML) models, Artificial Neural Networks in particular, often requires tens of thousands of training samples. In supervised learning, a data scientist must also meticulously label that data.

When the amount of training data is small, an ML model does not generalise well. This is usually because the model is overfitting. In a sense, we are just creating a system that memorises every training input.

Unfortunately, training data is hard to find and is often limited by budgetary or time restrictions. In other cases, training data is simply not accessible. It is therefore important to analyse the costs and benefits of finding new data.

In this article we will:

- Show a power-law relationship between the accuracy and training dataset size of three different models
- Explain the relationship in terms of model overfitting
- Describe how this relationship can be used to decide if getting more training samples is the best strategy for increasing accuracy

## Set-up

We will be investigating the effect increasing the training dataset size has on the prediction accuracy of three ML models with varying complexity:

- A custom shallow Artificial Neural Network (ANN)
- A Convolutional Neural Network (CNN) built with TensorFlow
- A Support Vector Machine (SVM) Algorithm

We will train each model to classify handwritten digits (the “hello world” of computer vision). To keep the experiment consistent, we will use the same dataset across all three models: the MNIST dataset.

The number of epochs, learning rate, and regularisation parameter all impact the accuracy of the model. To control this impact (and make sure training completes in a reasonable amount of time) we will adjust each one accordingly as the training sample size increases.
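The overall shape of the experiment can be sketched in a few lines. This is only an illustration under stated assumptions, not the code from the repository: it uses scikit-learn's small bundled 8x8 digits dataset as a stand-in for MNIST, trains only an SVM with default hyperparameters (no per-size tuning), and the helper name `accuracy_for_size` and the list of sample sizes are made up for this example.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumption: scikit-learn's bundled 8x8 digits dataset stands in for MNIST.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)


def accuracy_for_size(n_samples):
    """Train an SVM on the first n_samples training rows, score on the fixed test set."""
    model = SVC()  # default hyperparameters; this sketch does not adjust them per size
    model.fit(X_train[:n_samples], y_train[:n_samples])
    return model.score(X_test, y_test)


# Hypothetical sample sizes, chosen to fit inside the small training split.
sample_sizes = [50, 100, 200, 500, 1000]
accuracies = [accuracy_for_size(n) for n in sample_sizes]

for n, acc in zip(sample_sizes, accuracies):
    print(f"{n:>5} samples -> test accuracy {acc:.3f}")
```

Keeping the test set fixed while only the training subset grows means any change in accuracy can be attributed to the training sample size.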

All the code used for this blog post can be found in this GitHub repository.

## Investigation

### Accuracy vs Training Sample Size

Figure 1 below shows that accuracy improvement slows down as we increase the training sample size beyond 500 training samples: a sharp initial increase in prediction accuracy is followed by a rapid flattening of the curve.

*Figure 1: Model Accuracy vs Training Sample Size*

Notably, this behaviour is present in all three models, which suggests that it is not inherent to a particular ML architecture.

As a first pass, we can fit a hyperbolic (reciprocal) curve to our data. This satisfies the following requirements for behaviour:

- Sharp increase in accuracy for smaller training sample sizes
- Asymptotic behaviour as the model reaches its maximum accuracy

Figure 2 plots the same data points as Figure 1, but this time a hyperbolic curve of best fit has been fitted for each model using the curve_fit() function in SciPy’s optimize module.

*Figure 2: Best Fit Reciprocal Curve*
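A reciprocal fit of this kind might look like the sketch below. The exact functional form is an assumption (a maximum accuracy plus a term that decays as 1/x), and the data points are synthetic values generated for this example, not the measurements behind the figures.

```python
import numpy as np
from scipy.optimize import curve_fit


def reciprocal(x, a_max, c):
    """Hyperbolic curve: approaches a_max as x grows (c is negative when rising)."""
    return a_max + c / x


# Synthetic readings: a known curve plus a little noise (made-up values).
rng = np.random.default_rng(0)
sizes = np.array([50, 100, 250, 500, 1000, 2500, 5000, 10000], dtype=float)
true_a_max, true_c = 0.95, -20.0
accuracies = reciprocal(sizes, true_a_max, true_c) + rng.normal(0, 0.005, sizes.size)

# curve_fit returns the best-fit parameters and their covariance matrix.
params, covariance = curve_fit(reciprocal, sizes, accuracies, p0=(0.9, -10.0))
fit_a_max, fit_c = params

print(f"estimated maximum accuracy: {fit_a_max:.3f}")
print(f"estimated reciprocal coefficient: {fit_c:.2f}")
```

The fitted `a_max` is the asymptote the curve flattens towards, which is the quantity of interest when deciding whether more data is worth collecting.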

The hyperbolic curve fits the patterns we are seeing quite well. Now let’s see if we can break down what is going on conceptually to better describe the trade-off between accuracy and training dataset size.

### Conceptual Breakdown – Model Training as a Treasure Hunt

At the beginning of this article, I briefly mentioned that one of the benefits of increasing the training dataset sample size is a reduction in model overfitting. That is, more data helps create a model that predicts previously unseen circumstances based on some underlying truth in the data, rather than one that simply “memorises” every detail of the training dataset.

In essence, by increasing the number of training samples we make it difficult for the model to learn from random noise in the data. At the same time, we are providing more opportunity for the model to learn general underlying patterns.

To help visualise this, imagine an endless treasure hunt, where following a sequence of clues leads you closer to the position of some priceless treasure. We will assume that each clue is guaranteed to bring you closer to the treasure.

If we start our hunt with only a handful of clues, we might begin by reading each clue and walking directly to the next one. Eventually, we would have wandered to the clue that leads us to the exact location of the treasure.

However, if we had started with more clues this process could get tiring very quickly. Another way to approach the hunt would be to guess the approximate location of the treasure, based on the position of each clue rather than reading and walking to each clue. In this case, we lose information about the exact location of each clue (and thus the final treasure) but we can be sure we are heading in the right direction.

Additionally, in this scenario the more clues we have, the better our guess at “the right direction” will be. However, the proportion of information we learn with each new clue is less and less; our guess from 5 clues might not be much different from our guess with 7 clues. Soon we will not get any closer to the treasure unless we start reading the clues.

One factor that will affect how quickly we find the direction to the treasure is our ability to infer the right direction from clues. If we were excellent at learning from underlying patterns, we might have a good guess of the direction of the treasure with only 3 clues and adding more clues might not be much help.

Similarly, the maximum accuracy we can reach is limited by how well we can guess the location of the treasure.

Here we are describing the increase of the data size (number of “clues”) solely as a method for reducing overfitting. The key here is that when overfitting is low, our results will approach a maximum accuracy, but the amount by which the accuracy improves will depend on the number of training samples and on the model’s “learning ability”.

If the above were the case, we would expect the following to hold:

- Model complexity (a model’s “ability” to ignore noise) has some bearing on the rate at which the slowdown of accuracy improvement occurs
- A model will asymptotically approach some maximum accuracy based on its complexity
- Other methods for reducing overfitting should display a similar accuracy to training dataset size trade-off

### Power to the Curve

Using the thought pattern described above, we can form a more robust estimation for our curve of best fit, the power law:

$$
\text{Accuracy} = A_{max} + c_1 x^{c_2}
$$

Where:

- $x$ is the number of training samples
- $A_{max}$ is the maximum accuracy possible (as a function of model architecture and hyperparameters)
- $c_1$, $c_2$ are constants that define the exact shape (slope) of the curve, also dependent on the model. Note that $c_2$ will always be a negative real number.

Defining a curve in this way addresses the points made in our heuristic explanation above, namely the dependence of the curve’s shape on model complexity and an asymptotic approach to maximum accuracy.
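To see how this form captures diminishing returns, it helps to differentiate it with respect to the number of training samples. The step below assumes that accuracy approaches $A_{max}$ from below, which implies $c_1 < 0$; that sign is an inference from the shape of the curve rather than something measured here.

$$
\frac{d\,\text{Accuracy}}{dx} = c_1 c_2 \, x^{c_2 - 1}
$$

With $c_1 < 0$ and $c_2 < 0$ the product $c_1 c_2$ is positive, so accuracy keeps increasing, but because the exponent $c_2 - 1$ is below $-1$ the gain from each extra sample shrinks towards zero as $x$ grows. Setting $c_2 = -1$ gives $A_{max} + c_1 / x$, a curve of the reciprocal form.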

Fitting a curve of this form to the accuracy data indeed provides a very good approximation. It should be no surprise, as this curve is a generalisation of our initial guess (hyperbolic).

*Figure 3: Best Fit Power Law Curve*
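A power-law fit can be sketched with the same SciPy function. The data here is synthetic and noise-free, generated from made-up parameters purely so the fit can be checked against known values; the bounds and starting guess are also assumptions for this example, not settings taken from the original experiment.

```python
import numpy as np
from scipy.optimize import curve_fit


def power_law(x, a_max, c1, c2):
    """Accuracy = a_max + c1 * x**c2; with c2 < 0 the curve flattens towards a_max."""
    return a_max + c1 * np.power(x, c2)


# Synthetic, noise-free readings from made-up parameters (not the article's data).
sizes = np.array([50, 100, 250, 500, 1000, 2500, 5000, 10000], dtype=float)
accuracies = power_law(sizes, 0.97, -1.5, -0.5)

# Assumed bounds: a_max must be a valid accuracy and c2 must stay negative.
lower = [0.0, -np.inf, -np.inf]
upper = [1.0, np.inf, 0.0]
(a_max, c1, c2), _ = curve_fit(
    power_law, sizes, accuracies, p0=(0.9, -1.0, -1.0), bounds=(lower, upper)
)

print(f"a_max={a_max:.3f}, c1={c1:.3f}, c2={c2:.3f}")
```

On real, noisy measurements the three parameters tend to be strongly correlated, so it is worth inspecting the covariance matrix that `curve_fit` returns before trusting the estimated asymptote.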

We can validate our conceptual understanding by plotting model accuracy against a different method for reducing overfitting. Figure 4 shows the effect that altering L2 regularisation has on the accuracy of our shallow ANN.

*Figure 4: Model Accuracy vs Regularisation Parameter*

Although there is some variation in the data, plotting the same power-law curve sufficiently models the effect. This suggests that reducing overfitting is indeed playing a key role in the accuracy to training dataset size trade-off.

## Conclusion

There are two main insights from this investigation:

- A mathematical power law relationship describes how accuracy grows as training dataset size increases
- The accuracy vs training dataset size curve can be explained by the reduction of model overfitting. In general, model overfitting is reduced as the training dataset size increases.

One last thing to note is that more data will almost always increase the accuracy of a model. However, that does not necessarily mean that spending resources to increase the training dataset size is the best way to affect the model’s predictive performance.

By plotting this power law relationship with the currently available data, you can begin to predict how much value more training samples will bring.
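Once the curve has been fitted, this prediction is simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the parameter values are invented rather than taken from the models above, and it assumes both `c1` and `c2` are negative so that accuracy rises towards `a_max`.

```python
def predicted_accuracy(n_samples, a_max, c1, c2):
    """Evaluate the fitted power law at a given training sample size."""
    return a_max + c1 * n_samples ** c2


def accuracy_gain(n_current, n_new, a_max, c1, c2):
    """Predicted change in accuracy when growing the dataset from n_current to n_new."""
    return predicted_accuracy(n_new, a_max, c1, c2) - predicted_accuracy(
        n_current, a_max, c1, c2
    )


def samples_needed(target, a_max, c1, c2):
    """Invert the power law: sample size at which the curve reaches the target accuracy."""
    if c1 >= 0 or c2 >= 0:
        raise ValueError("this sketch assumes c1 < 0 and c2 < 0")
    if target >= a_max:
        raise ValueError("target must be below the asymptote a_max")
    return ((target - a_max) / c1) ** (1.0 / c2)


# Hypothetical fitted parameters, for illustration only.
a_max, c1, c2 = 0.97, -1.5, -0.5

gain = accuracy_gain(10_000, 20_000, a_max, c1, c2)
needed = samples_needed(0.96, a_max, c1, c2)
print(f"doubling 10,000 samples adds about {gain:.4f} accuracy")
print(f"reaching 0.96 accuracy needs about {needed:,.0f} samples")
```

With these invented parameters, doubling the dataset from 10,000 to 20,000 samples buys less than half a percentage point of accuracy, which is the kind of figure that can be weighed directly against the cost of collecting and labelling the extra data.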

Next time you or your organisation begin to tackle a problem with ML, it’s important to keep the cost-benefit of collecting data in mind. Shifting from stating “we always need more data” to asking “do we really need more data?” will allow you to deliver your model quickly and control costs wisely.