Cross-Validation in Machine Learning: a way to improve the estimate of our result rather than just guessing it.
In machine learning, by default, we use an accuracy score to evaluate our model. But it might give one result in one run and a different result in another. For instance, you may have noticed that if we take a random state value of, say, 0, it gives one output, and the next time, if we use a random state value of 42, it gives a slightly different value. So why does the accuracy change when we use the same algorithm again and again, only with different random state values?
When I change the random state, even though the algorithm stays the same, the data gets shuffled differently. Let me make that a little clearer. Say you first took 80% of the data for the training set and 20% for the test set with a random state of 0, and the next time you took the same 80/20 split but with a random state of 42. In this case, the data gets shuffled differently, so the first training set and the second training set are not the same, and it is obvious that our result will deviate from the earlier one.
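Here is a minimal sketch of that effect (scikit-learn, the Iris dataset, and the decision tree model are my illustrative choices here, not fixed requirements): same algorithm, same 80/20 ratio, only the random state of the split changes.

```python
# Illustrative sketch: how a different random_state for the split
# can change the measured accuracy of the very same algorithm.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

for seed in (0, 42):
    # Same 80/20 ratio, but a different shuffle of the rows.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"random_state={seed}: accuracy={acc:.3f}")
```

The two printed accuracies will generally differ, because each seed hands the model a different training set.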
So, in order to solve this problem, we have the concept of cross-validation in Machine Learning.
In Machine Learning, estimating the parameters of a model is called “training the algorithm”, and evaluating the model is called “testing the algorithm”.
One must be very careful here: reusing the same data for both training and testing is bad practice, because a model must be tested on data it has not seen before.
To know more about the steps for implementing a Machine Learning model, you may read here: https://sththapa999.medium.com/minimum-steps-for-implementing-a-machine-learning-algorithm-f7b91a84cd7
Let’s continue our talk. I was saying that there must be two mutually exclusive datasets, one for training and one for testing, so that the model can make predictions based on the patterns it found during training.
But how can we be sure that taking 80% of the data for the training set and the remaining 20% for the test set will give us an accurate result? What if we used the first 20% of the data for testing, or the middle 20%? Sounds interesting, right?
Ok, instead of thinking too much about which block of data is best for testing or training, cross-validation uses all of the blocks, one at a time, and aggregates the results at the end by taking the mean of the values.
For example, if we take K=5, it is called five-fold cross-validation: the test data is chosen five times, first the first 20% of the data, then the next 20%, and so on until the last 20%.
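Here is a minimal sketch of five-fold cross-validation, assuming scikit-learn (the Iris dataset and decision tree model below are again illustrative choices):

```python
# Illustrative sketch: five-fold cross-validation in scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# cv=5 splits the data into five blocks; each block serves once as
# the test set while the other four blocks are used for training.
scores = cross_val_score(model, X, y, cv=5)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```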
I hope this is clear up to this point. When we take the mean of the five folds’ scores, we get a more reliable result. So cross-validation turns our guessed accuracy into an estimated accuracy, which is a real improvement, and it is always advisable to use the cross-validation technique for measuring the accuracy of our model.