Some people think that "validation set" and "testing set" are similar, interchangeable terms. They are not. The two sets play very different roles in building and deploying a machine learning model.
Before going into that, recall that we usually split the dataset into two parts: a training set and a testing set. The training set is used to train the model. After this step, we have a trained model that can be used to predict results for new, unseen data.
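As a rough illustration of this split, here is a minimal sketch in plain Python. The helper name `split_train_test`, the 80/20 ratio, and the stand-in dataset of 100 integers are all assumptions made for the example, not something the article prescribes.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists.

    Hypothetical helper for illustration; the 20% default is an assumption.
    """
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled rows become the testing set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


dataset = list(range(100))  # stand-in for 100 labelled examples
train_set, test_set = split_train_test(dataset)
print(len(train_set), len(test_set))
```

Shuffling before splitting matters when the rows are stored in some order (for example, sorted by label), because otherwise the testing set may not resemble the training set.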
With the training set covered, we will look at the remaining two data portions here: the validation set and the testing set.
Why Is a Testing Set Needed?
Once training is complete, it is tempting to put the model straight into a live / production setting where new data comes in every day. Is that a good idea? No, because so far you don’t know how your model will behave on new data. Before real data arrives in a live setting, we need at least an estimate of how the model will perform there.
This is why we have a testing set: data that the model has never seen during training. The trained model is evaluated on the testing set to measure its accuracy, or any other metric of choice. This evaluation is like a rehearsal for a live or production setting.
This way, we can be reasonably confident that the model produces results good enough for a live setting. Once the results are satisfactory, the model can be deployed to production.
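The rehearsal described above might look like the following sketch. It assumes scikit-learn is installed and uses its bundled iris dataset, a decision tree classifier, and accuracy as the metric. These are choices made for this example; the article itself does not name a library, dataset, or split ratio.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed example data: the iris dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows as the testing set (the ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Train on the training set only.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Evaluate on data the model has never seen during training.
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {test_accuracy:.3f}")
```

Whether a given score counts as "satisfactory" depends on the application, so the threshold for deployment is left out of the sketch.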
So far, we have seen the purpose of a testing set. Next, let’s see why we need a validation set.
Why Is a Validation Set Needed?
We often talk about building, training, and validating a model. What exactly do we mean by the word “validation”?
Let’s say you bake a cake. You can frost it with either vanilla or chocolate buttercream. As you have already gone to the trouble of baking, why not try the cake with both buttercreams and pick the better one? This is exactly what a validation process does.
Most models come in many variations, controlled by settings chosen before training (often called hyperparameters). You can think of this as customizing the model.
Consider a decision tree model. Let’s say you can build the decision tree model with depth 4 or depth 5. This is comparable to the vanilla / chocolate buttercream example we just saw.
During the validation process, you build both variants of the decision tree model: one with depth 4 and another with depth 5. Then you check which one gives the better performance score and pick that model. This process is called validation. To do it, you need some additional data that was not used for training, on which to compute the validation score.
So, just before training, we set aside a small portion of the training set. This portion is used as the validation set.
In other words, validation means finding the model settings (such as tree depth) that yield the best results on data held out from training.
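The depth 4 versus depth 5 comparison could be sketched as follows. As before, scikit-learn, the iris dataset, the split ratios, and accuracy as the score are assumptions made for illustration. The testing set is carved off first and deliberately left untouched while the two variants are compared.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Step 1: reserve the testing set; it plays no part in choosing the depth.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Step 2: set aside part of the remaining training data as the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=0, stratify=y_train_full
)

# Step 3: train one variant per candidate depth and score each on the validation set.
val_scores = {}
for depth in (4, 5):
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=0)
    candidate.fit(X_train, y_train)
    val_scores[depth] = accuracy_score(y_val, candidate.predict(X_val))

# Step 4: keep the depth with the higher validation score (ties go to the first candidate).
best_depth = max(val_scores, key=val_scores.get)
print(val_scores, "-> chosen depth:", best_depth)
```

On a dataset this small the two depths may well score the same, so the point of the sketch is the procedure rather than the particular winner.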
To summarize, the validation set is unseen data: the model has not seen it during training. It is used during validation to customize the model. As many customization options are possible, the validation set helps pick the best customized model.
The testing set is also unseen data. The model has seen it neither during training nor during validation. It acts as a rehearsal to find out whether the performance score is good enough for a live production setting.