Once your data is cleaned up, the next step is usually to split it into a training set and a testing set, so that you can train a model on one part and evaluate it on data it has not seen. In this post we will see how to split your dataset into a training set and a testing set.
How to import train_test_split
Scikit-learn has a handy function called “train_test_split” which helps us here. It lives in the module sklearn.model_selection, so that is the module you import it from.
Below is the syntax to import train_test_split.
from sklearn.model_selection import train_test_split
How to use train_test_split?
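Below is the general syntax to use train_test_split. Note that test_size and random_state must be passed as keyword arguments; anything passed positionally is treated as another array to split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)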
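(1) You pass the X and y values, also called the features and the target, into this function.
(2) test_size, when given as a float, takes a value between 0 and 1 and is the fraction of samples that goes into the test set, so X and y are each split in the ratio (1-test_size) : test_size. It can also be an integer, in which case it is the absolute number of test samples. After the split, you get 4 values: the first two are the training and testing features, and the last two are the corresponding training and testing labels.
For example, if test_size = 0.2, you will get 80% of the data for training and 20% of the data for testing. The syntax in this case will be: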
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
(3) You can assign any integer to random_state. Fixing it makes sure you get the same split every time you rerun your code; if you leave it out, the shuffle, and therefore the split, can differ from run to run.
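To make the three points above concrete, here is a minimal sketch on a small made-up dataset. The arrays X and y (10 samples, 2 features each) are assumptions for illustration only and are not part of the original post; the call itself is the same one shown above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples with 2 features each, one label per sample.
# Row i of X is [2*i, 2*i + 1] and its label is i.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 20% of 10 samples go to the test set, the remaining 80% to the training set.
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
print(y_train.shape, y_test.shape)  # (8,) (2,)

# Calling again with the same random_state reproduces the same split.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.2, random_state=0
)
same_split = np.array_equal(X_test, X_test2) and np.array_equal(y_test, y_test2)
print(same_split)  # True

# Features and labels stay aligned after the split:
# each test row [2*i, 2*i + 1] still sits next to its label i.
rows_match_labels = np.array_equal(X_test[:, 0] // 2, y_test)
print(rows_match_labels)  # True
```

The last check is worth keeping in mind: X and y are shuffled with the same indices, so each row of features stays paired with its own label.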
How to turn shuffling on or off in train_test_split?
There is a parameter called “shuffle” which you can set in train_test_split. It takes the boolean values True or False, and it is set to True by default. So, if you want your data to be shuffled before it is split, you don’t have to set this parameter at all.
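In some special cases, like time series data, you don’t want to disturb the pattern or ordering which naturally exists in the dataset. In such a case, you can turn off shuffling by setting “shuffle=False”. The first (1-test_size) portion of the data then becomes the training set and the remaining portion becomes the testing set. Since nothing is shuffled, random_state has no effect here and can be left out.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)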
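As a quick check of what shuffle=False does, the sketch below uses a made-up, time-ordered dataset where sample i stands for time step i. The data is an assumption for illustration only; the point is that the original order is kept and the test set is simply the tail of the data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical time-ordered data: sample i was observed at time step i.
X = np.arange(10).reshape(10, 1)
y = np.arange(10)

# No shuffling: the first 80% is used for training, the last 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

print(y_train)  # [0 1 2 3 4 5 6 7]
print(y_test)   # [8 9]
```

With the default shuffle=True, the same call would typically mix early and late time steps across both sets, which is what you want to avoid when the model should only be tested on data that comes after its training data.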