- To test the model, we need to split the data into a training set (70%) and a testing set (30%). But the splitting doesn't stop there: we split further so we can tune more parameters (see the sketch after the bullet points below).
(Confusing? Don't worry, I've got the bullet points.)
- Used to tune the model's hyperparameters and evaluate its performance during training.
- Helps select the best model when many models are trained on the same training dataset.
- Helps prevent overfitting, since performance is checked on data the model never trained on.
- Hyperparameters are settings such as the learning rate or the number of layers in a neural network.
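Here is a minimal sketch of what this looks like in practice, assuming scikit-learn and a synthetic dataset: a 70/15/15 train/dev/test split, with the dev set used to pick one hyperparameter. The regularization strength `C` here is just a stand-in for the learning rate or layer count mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data: any feature matrix X and label vector y would do.
X, y = make_classification(n_samples=1000, random_state=42)

# First split off 70% for training, leaving 30% for dev + test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, random_state=42)

# Split the remaining 30% in half: 15% dev, 15% test.
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=42)

# Use the dev set to choose a hyperparameter (C is illustrative only).
best_C, best_score = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = model.score(X_dev, y_dev)
    if score > best_score:
        best_C, best_score = C, score

# Only the final, dev-selected model ever touches the test set.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print("dev-selected C:", best_C, "test accuracy:", final_model.score(X_test, y_test))
```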
The problem with the dev set
The dev set is usually small, around 15% of the total data, whereas the training set is huge. If a model doesn't perform well on the dev set but was performing very well on the training set, that doesn't necessarily mean the model is bad. It's like training an athlete for a marathon but picking the winner based on a sprint. Therefore, multiple validation sets are used and the results are averaged out.
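A rough sketch of that averaging idea, assuming scikit-learn's k-fold cross-validation; the 5 folds and the logistic regression model are arbitrary choices, not something fixed by the notes:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)
model = LogisticRegression(max_iter=1000)

# Each of the 5 folds takes a turn as the validation set,
# while the remaining folds are used for training.
scores = cross_val_score(model, X, y, cv=5)
print("per-fold accuracy:", scores)
print("averaged accuracy:", scores.mean())
```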