Here is a set of 25 machine learning and data science interview questions. It includes actual questions from interviews I attended as well as practice questions I prepared myself. The answers are true to the best of my knowledge. If you find anything wrong with them, please let me know in the comments. I hope you find it useful, and good luck with your interview!

## 1. Name some methods to avoid overfitting

• Cross validation
• Early stopping
• Pruning
• Regularization

## 2. What are the methods to handle missing values in a dataset?

• Some machine learning algorithms handle missing values on their own
• Imputation using mean/median values
• Imputation using most frequent or zero/constant values
• Imputation using KNN
• Imputation using Deep Learning
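
The simpler strategies in the list above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: the function name `impute`, its `strategy` names and the use of `None` to mark a missing value are all assumptions made for the example.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean", fill_value=0):
    """Replace None entries in a column using a simple strategy."""
    # Only the observed (non-missing) values are used to compute the fill.
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "most_frequent":
        fill = mode(observed)
    elif strategy == "constant":
        fill = fill_value
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

column = [10, None, 20, 20, None, 50]
print(impute(column, "mean"))
print(impute(column, "median"))
print(impute(column, "constant", fill_value=0))
```

The median and the most frequent value are less affected by outliers than the mean, which is why the choice of strategy matters for skewed columns.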

## 3. What is curse of dimensionality?

The difficulty of a problem grows exponentially as the number of input features increases. For every additional column in the dataset, the number of data points required to properly explain the data trend increases exponentially.

You can think of it like this: if there is a single column and 10 data points are enough to explain it, you will need about 100 data points for two columns, about 1000 for three columns, and so on. The numbers are approximate. The takeaway is that every additional column requires an exponential increase in dataset size to properly capture the trend or pattern in the data.
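
The arithmetic behind this rule of thumb is easy to check. The sketch below assumes, as in the example above, that about 10 points are needed per feature axis; `points_needed` is a hypothetical helper name.

```python
def points_needed(num_features, points_per_axis=10):
    """Points required to keep the same density along every feature axis."""
    return points_per_axis ** num_features

for d in (1, 2, 3, 10):
    print(d, points_needed(d))
```

With 10 features the same density would already require 10,000,000,000 points, which is why dimensionality reduction and feature selection matter.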

## 4. How much training data is needed for machine learning?

The amount of data needed depends on the complexity of the problem we are trying to solve and the complexity of our machine learning algorithm. If you already have some data, you can plot learning curves (validation error against training set size) to get an idea of the ideal dataset size. In general, non-linear models require more data than linear models.

You might need thousands of data points for simple problems. For average modeling problems you might need tens of thousands of data points. For hard problems that require deep learning, you might need millions of data points or more.

## 5. Small k value vs large k value in k-Fold cross validation

If k is too large for a small dataset, each fold contains only a handful of data points, so the score on each fold is a noisy estimate. For large datasets, a large k is not an issue, as each fold still has enough data to capture the patterns. So the ideal choice is a small k for small datasets, and either a small or a large k for large datasets.

During k-fold cross validation, k models are created and trained, one per fold. If k is large, many models need to be trained, which takes a lot of computational time and resources. Therefore, a large k has to be chosen with careful consideration.
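
To make the trade-off concrete, here is a minimal sketch of how the folds are formed. It assumes the data is already shuffled and splits indices in order; `k_fold_indices` is a hypothetical helper, not a library function.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

for train, validation in k_fold_indices(10, 5):
    print("train:", train, "validation:", validation)
```

With 10 samples and k = 5, each validation fold holds only 2 points, and one model has to be trained per fold.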

## 6. What is the difference between Bias and Variance?

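Bias is the error that comes from the simplifying assumptions a model makes: the difference between the model's average prediction and the true value. High bias means the model is too simple to capture the pattern in the data, so it has a large error even on the training set (underfitting). Low bias is good. High bias is bad.

Variance is how much the model's predictions change when it is trained on a different sample of the data. It typically shows up as a gap between performance on the training set and performance on the testing set. High variance means the model did a good job on the training set but did not generalize well to the testing set (overfitting). Low variance is good. High variance is bad.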

## 7. What is Kernel Trick?

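The kernel trick lets a linear classifier solve a non-linear problem. Conceptually, the data is mapped to a higher dimensional space where it becomes linearly separable. The trick is that this mapping is never computed explicitly: a kernel function returns the inner product of two points in the higher dimensional space directly from the original inputs.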
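
A small worked example may help. The sketch below uses the degree-2 polynomial kernel on 2D inputs, chosen only because its higher dimensional feature map is short enough to write out; `phi`, `dot` and `poly_kernel` are names made up for the illustration.

```python
import math

def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel for 2D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(x, y):
    """k(x, y) = (x . y)^2, computed without ever building phi(x)."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(y))   # inner product in the 3D feature space
implicit = poly_kernel(x, y)     # same value from the original 2D inputs
print(explicit, implicit)        # both are 121 up to floating point rounding
```

The kernel gives the same inner product as the explicit mapping, but it only ever touches the original 2D vectors.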

## 8. What is Vanishing Gradients?

When training a neural network, we use backpropagation to adjust the weight matrices. It involves calculating gradients.

If the derivatives of the activation functions are small numbers (the derivative of the sigmoid, for example, is at most 0.25), the gradient at each layer is a tiny number. Backpropagation multiplies these tiny numbers together layer after layer, so by the time it reaches the first layer, the result is so tiny that it contributes almost nothing to learning. This shrinking of gradients with each passing layer during backpropagation is called the vanishing gradients problem.
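
The repeated multiplication can be illustrated with the sigmoid activation, whose derivative never exceeds 0.25. This is a simplified sketch: it ignores the weight matrices and assumes every layer sees the same pre-activation value `z`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def gradient_scale(num_layers, z=0.0):
    """Product of sigmoid derivatives across layers, ignoring the weights."""
    return sigmoid_derivative(z) ** num_layers

for layers in (1, 5, 10, 20):
    print(layers, gradient_scale(layers))
```

Even in the best case for the sigmoid (z = 0, derivative 0.25), the factor after 10 layers is already below one millionth.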

## 9. Name some methods to avoid Vanishing Gradients

• Batch normalization
• Using ReLU activation function
• Residual networks

## 10. What is Exploding Gradients?

Exploding gradients are the opposite problem: the gradient value grows uncontrollably due to repeated multiplication. You can use a method called gradient clipping to keep the gradients in a healthy range. At each step, the gradients are checked against a threshold. If they exceed it, they are scaled down to stay in range. Clipping this way helps you avoid exploding gradients.
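
One common variant is clipping by norm, sketched below for a flat list of gradient values. The function name `clip_by_norm` and the threshold `max_norm` are assumptions for the example; deep learning frameworks ship their own versions of this.

```python
import math

def clip_by_norm(gradients, max_norm):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)
    # Scaling every component by the same factor keeps the direction.
    scale = max_norm / norm
    return [g * scale for g in gradients]

clipped = clip_by_norm([30.0, 40.0], max_norm=5.0)
print(clipped)  # the norm drops from 50 to 5, the direction is unchanged
```

Gradients that are already within the threshold pass through untouched.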

## 11. What are the types of Gradient Descent?

• Stochastic
• Batch
• Mini-batch

## 12. Stochastic vs Batch vs Mini-batch Gradient Descent

In the stochastic method, the error is calculated and the model is updated for each training example (also called the online method).

In the batch method, the error is calculated for each training example, but the model is updated only once all examples have been evaluated. In other words, there is one model update per epoch.

In the mini-batch method, the training examples are split into small batches. The error is averaged over each batch, and the model is updated once per batch. This is a compromise between the other two: updates are less noisy than in the stochastic method and more frequent than in the batch method.
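
All three methods can be expressed with one training loop in which only the batch size changes. The sketch below fits a one-parameter model `y = w * x` on a tiny noise-free dataset; the function `train`, the learning rate and the epoch count are illustrative choices, not recommendations.

```python
import random

def train(xs, ys, batch_size, lr=0.01, epochs=200, seed=0):
    """Fit y = w * x with gradient descent on mean squared error."""
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of (w*x - y)^2 with respect to w, averaged over the batch.
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad  # one model update per batch
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2x

w_stochastic = train(xs, ys, batch_size=1)        # update after every example
w_mini_batch = train(xs, ys, batch_size=2)        # update after every 2 examples
w_batch = train(xs, ys, batch_size=len(xs))       # one update per epoch
print(w_stochastic, w_mini_batch, w_batch)
```

On this toy problem all three settings recover a weight close to 2. They differ in how many updates they make per epoch: 4, 2 and 1 respectively.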

## 13. What is Backpropagation Through Time – BPTT?

Backpropagation through time is ordinary backpropagation applied to a recurrent network that has been unrolled over its time steps. The difference is that previous time steps need to be considered because the system has memory: the same weights are used at every time step, so their gradients are summed across all time steps.

## 14. LSTM vs RNN

RNNs have feedback loops in the recurrent layer, which pass the information learnt in the previous step on to the next step. But due to the vanishing gradients problem, information learnt in much older steps keeps decaying and does not contribute much to future steps.

LSTM is designed specifically to address this problem. LSTM networks include a memory cell that can retain information for longer periods of time. This is made possible by special gates: the input gate, the output gate and the forget gate. These gates control what information is written to the memory, what is read from it, and when and by how much information is forgotten. This ability to retain information for longer periods is the major advantage of LSTM over vanilla RNNs.

## 15. Using small Learning Rate vs large Learning Rate

A small learning rate makes the training process slow. A large learning rate can cause the model to bounce around the minimum and not converge. You have to find the right balance.

Another good way is to decay the learning rate during training. You can decay it linearly, by subtracting a fixed amount from the learning rate every epoch. Or you can decay it in steps, by multiplying the learning rate by a fraction such as 0.5 or 0.1 every few epochs, which shrinks it exponentially over time.
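
Two simple schedules are sketched below: one lowers the rate by the same amount every epoch, the other multiplies it by a fixed factor every few epochs. The function names and default values (`drop_factor=0.5`, `epochs_per_drop=10`) are assumptions for the illustration.

```python
def linear_decay(initial_lr, epoch, total_epochs, final_lr=0.0):
    """Move from initial_lr to final_lr by the same amount every epoch."""
    fraction = min(epoch / total_epochs, 1.0)
    return initial_lr + fraction * (final_lr - initial_lr)

def step_decay(initial_lr, epoch, drop_factor=0.5, epochs_per_drop=10):
    """Multiply the rate by drop_factor once every epochs_per_drop epochs."""
    return initial_lr * drop_factor ** (epoch // epochs_per_drop)

for epoch in (0, 10, 25, 50):
    print(epoch, linear_decay(0.1, epoch, total_epochs=100), step_decay(0.1, epoch))
```

Multiplying by a constant factor makes the rate fall off geometrically, so it drops quickly at first and ever more slowly later.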

## 16. What effect does Batch Size have on training?

Smaller mini batches often converge in fewer epochs than larger ones, because the model is updated more often. Each update is computed from only a few training examples, so the gradient is noisy. This noise helps in a good way: it lets the optimizer bounce around and not get stuck in local minima. Convergence to the global minimum is not guaranteed, but the result is usually a good solution.

Larger mini batches compute faster per epoch thanks to parallel execution on the GPU. But they use more memory, and out of memory errors can happen when the batch size is too large. Larger mini batches also tend to generalize less well, since the lack of gradient noise makes them more prone to overfitting.

## 17. How many epochs / iterations will you train your model?

For this, you have to keep checking the validation error. You can train your model as long as the validation error keeps decreasing.

You can also use a technique called early stopping to decide when to stop training. In early stopping, you set a number n (often called patience). The validation loss is monitored, and if it does not decrease for n consecutive epochs, training is stopped.
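
The stopping rule can be sketched on a list of recorded validation losses. In a real training loop the check would run after every epoch; the function name `early_stopping_epoch` and the loss values are made up for the example.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # The loss kept improving often enough, so training ran to the end.
    return len(val_losses) - 1

losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
stop = early_stopping_epoch(losses, patience=3)
print(stop)  # → 5
```

Note that training stops at epoch 5 and never sees the improvement at epoch 6. A patience that is too small can stop too early, so in practice the best weights seen so far are usually saved and restored.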

## 18. How many hidden units to use in a deep neural network?

The hidden units give the model “capacity” to learn the function. Complex functions need more learning capacity. If the model has many more hidden units than necessary, it has too much learning capacity. It will try to memorize the training set and tend to overfit. You can keep an eye on the training loss and validation loss. If you think your model overfits, experiment again with fewer hidden units.

As a rough starting point, having more hidden units than inputs often gives good results, as long as there are not so many that the model overfits.

## 19. What is word embedding?

Embeddings are used in deep neural networks for text data. Words or phrases in a vocabulary are mapped to dense vectors of numerical values. These vectors are called embeddings.

This technique reduces the dimensionality of text data compared to sparse representations such as one-hot encoding. Embeddings also capture relationships between words: words with similar meanings end up with vectors that are close together.
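
As a toy illustration, the embedding table below maps three words to 3-dimensional vectors and compares them with cosine similarity. The vectors are hand-picked and hypothetical; real embeddings are learnt from data and usually have tens to hundreds of dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["king"], embeddings["queen"]))
print(cosine_similarity(embeddings["king"], embeddings["apple"]))
```

In this toy table, "king" is much closer to "queen" than to "apple", which is the kind of relationship a trained embedding is expected to capture.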

## 20. What is cross entropy?

Entropy is a measure of randomness or uncertainty. Cross entropy is a loss function used in classification problems. It measures how far the predicted probabilities deviate from the true labels.

Cross entropy is the negative logarithm of the probability the model assigns to the true class, summed (or averaged) over the examples.

As with any loss function, lower is better. If cross entropy is low, the model's predictions are close to the true labels. If it is high, the model is assigning low probability to the correct classes.
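
Here is a minimal sketch of the calculation for integer class labels, averaging over the examples. The function name `cross_entropy` and the small `eps` guard against `log(0)` are choices made for this example.

```python
import math

def cross_entropy(true_labels, predicted_probs, eps=1e-12):
    """Average negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        # probs[label] is the predicted probability of the correct class.
        total += -math.log(max(probs[label], eps))
    return total / len(true_labels)

labels = [0, 1]
confident = [[0.9, 0.1], [0.2, 0.8]]   # high probability on the true classes
unsure = [[0.5, 0.5], [0.5, 0.5]]      # no better than a coin flip

print(cross_entropy(labels, confident))
print(cross_entropy(labels, unsure))
```

The confident and correct predictions get a lower loss than the coin-flip predictions, whose loss is log(2), about 0.693.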

## 21. MLP vs CNN

An MLP uses fully connected layers: every node in a layer is connected to all nodes of the previous layer. Each connection has its own weight, so an MLP has a large number of parameters even if it is a simple network.

A CNN uses sparsely connected layers: a node is connected to only a subset of nodes from the previous layer. CNNs also share parameters, since the same filter weights are reused at every position. So a CNN has far fewer parameters.

An input fed to an MLP has to be converted to a vector (1D). To pass an image, the pixel values need to be arranged serially as a vector first, which discards the 2D arrangement of the pixels. A CNN can take the 2D matrix as is, so spatial information is preserved.
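
The difference in parameter count is easy to compute. The sketch below compares one fully connected layer with one convolutional layer on a 28x28 grayscale image; the layer sizes are arbitrary choices for illustration.

```python
def dense_params(num_inputs, num_units):
    """One weight per input per unit, plus one bias per unit."""
    return num_inputs * num_units + num_units

def conv2d_params(in_channels, num_filters, kernel_size):
    """Each filter has kernel_size x kernel_size x in_channels weights plus a bias."""
    return (kernel_size * kernel_size * in_channels + 1) * num_filters

mlp = dense_params(28 * 28, 64)   # flattened 28x28 image into 64 hidden units
cnn = conv2d_params(1, 64, 3)     # 64 filters of size 3x3 on one input channel
print(mlp, cnn)                   # → 50240 640
```

The convolutional layer's count does not depend on the image size at all, because the same filters slide over every position.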

## 22. Name some CNN hyperparameters.

• Size of the window (kernel size)
• Stride
• Number of filters (number of output channels)
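
Kernel size and stride together determine the spatial size of a convolutional layer's output. The sketch below uses the standard formula and also includes padding, which is not in the list above and is added here as an assumption (default 0, meaning no padding).

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output width (or height) of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(28, kernel_size=3))                # → 26
print(conv_output_size(28, kernel_size=3, stride=2))      # → 13
print(conv_output_size(28, kernel_size=3, padding=1))     # → 28
```

A larger stride shrinks the output, while a padding of 1 with a 3x3 kernel keeps the output the same size as the input.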

## 23. How is distance measured in various clustering techniques?

• k-Means – Distance from the centroid to each point
• Single link hierarchical – Distance between closest points in two clusters
• Complete link hierarchical – Distance between farthest points in two clusters
• Average link hierarchical – Average of the distances between every point in one cluster and every point in the other cluster
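
The three hierarchical linkage rules differ only in how they summarize the pairwise distances between two clusters. A minimal sketch, assuming Euclidean distance and Python 3.8+ for `math.dist`:

```python
import math
from itertools import product

def pairwise_distances(cluster_a, cluster_b):
    """Euclidean distance between every point in a and every point in b."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]

def single_link(a, b):
    return min(pairwise_distances(a, b))    # closest pair

def complete_link(a, b):
    return max(pairwise_distances(a, b))    # farthest pair

def average_link(a, b):
    distances = pairwise_distances(a, b)
    return sum(distances) / len(distances)  # mean over all pairs

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
print(single_link(a, b), complete_link(a, b), average_link(a, b))  # → 2.0 5.0 3.5
```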

## 24. How to determine the value of k in k Nearest Neighbors (kNN)?

A smaller value of k means you trust only the very few closest neighbors. If any of them happens to be an outlier, it could drag the result far away. So don’t choose a very small k value.

A larger k averages over more neighbors, so the result is less sensitive to noise. But if k is too large, distant points from other classes start to influence the vote, and prediction gets somewhat more expensive.

A rule of thumb many people follow is to choose k equal to the square root of the number of training examples. In practice, it is better to validate a few candidate values of k, for example with cross validation.

It is good to choose an odd k. The algorithm takes a vote among the neighbors, so with two classes an odd number of neighbors avoids a tie.
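
The voting step and the square root rule of thumb can be sketched as follows. The dataset is made up, `knn_predict` is a hypothetical helper, and forcing k to be odd with `| 1` is just one simple way to apply the advice above.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of query by majority vote among its k nearest points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
labels = ["a", "a", "a", "b", "b", "b"]

# Square root rule of thumb, with the lowest bit set so that k is odd.
k = math.isqrt(len(points)) | 1
print(k)                                         # → 3
print(knn_predict(points, labels, (1, 1), k))    # → a
print(knn_predict(points, labels, (5, 4), k))    # → b
```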

## 25. What is Box Cox Transformation?

Many tools and techniques assume your data follows a normal distribution, and in most cases real data does not. You can use the Box Cox transformation to transform non-normal input data so that it closely resembles a normal distribution.

In general, a transformation applies the same operation to every data point. If you add a common value to all the input data or multiply it by a common value, it is a linear transformation. A linear transformation shifts or rescales the data but does not change the shape of its distribution.

If you instead raise all the input values to a power, it is a non-linear transformation, and the shape of the output distribution changes. Box Cox is a power transformation with a parameter “lambda”: each input value x becomes (x^lambda - 1) / lambda, or log(x) when lambda is 0. It requires all input values to be positive. Values of lambda, typically from -5 to 5, are tested to find the one whose transformed output is closest to a normal distribution.

The Box Cox transformation does not guarantee a normal distribution. It only gives the closest result that a power transformation can achieve.
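
The search for lambda can be sketched as a grid search over the range -5 to 5. This example scores each candidate with the usual normal log-likelihood of the transformed data, which is one common criterion; the function names, the grid of 101 candidates and the sample data are assumptions for illustration.

```python
import math

def box_cox(values, lam):
    """Box Cox transform; requires strictly positive inputs."""
    if any(v <= 0 for v in values):
        raise ValueError("Box Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

def log_likelihood(values, lam):
    """Log-likelihood of lam, assuming the transformed data is normal."""
    transformed = box_cox(values, lam)
    n = len(values)
    mean = sum(transformed) / n
    variance = sum((t - mean) ** 2 for t in transformed) / n
    return -0.5 * n * math.log(variance) + (lam - 1) * sum(math.log(v) for v in values)

def best_lambda(values, low=-5.0, high=5.0, steps=101):
    """Try evenly spaced lambdas in [low, high] and keep the best scoring one."""
    candidates = [low + i * (high - low) / (steps - 1) for i in range(steps)]
    return max(candidates, key=lambda lam: log_likelihood(values, lam))

data = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]  # right-skewed sample, made up
lam = best_lambda(data)
print(lam, box_cox(data, lam))
```

The grid search returns the best candidate even when no lambda makes the data truly normal, which matches the caveat above.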