Data science blends algorithms, tools, and machine learning principles to discover hidden patterns in raw data. Test your foundational knowledge with these 25 MCQs.

## 1. Which of the following is a supervised learning method?

### Answer:

### Explanation:

Supervised learning involves training on labeled data. Regression is a supervised method where the outcome variable is predicted based on one or more predictor variables.

## 2. In statistics, what does a Type I error represent?

### Answer:

### Explanation:

A Type I error, or false positive, occurs when we reject the null hypothesis when it is actually true.
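As an illustration, the following sketch simulates many experiments in which the null hypothesis is true by construction and counts how often a two-sided z-test rejects it anyway. The function name, sample size, and trial count are assumptions chosen for this example; the observed rejection rate should land near the chosen significance level.

```python
import random
from statistics import NormalDist, mean

def type_i_error_rate(alpha=0.05, n=30, trials=2000, seed=0):
    """Estimate how often a two-sided z-test rejects a TRUE null hypothesis."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    rejections = 0
    for _ in range(trials):
        # The null hypothesis (mean = 0, sigma = 1) is true by construction.
        sample = [rng.gauss(0, 1) for _ in range(n)]
        z = mean(sample) * n ** 0.5  # (x_bar - 0) / (sigma / sqrt(n)) with sigma = 1
        if abs(z) > z_crit:
            rejections += 1  # rejecting a true null: a Type I error
    return rejections / trials

rate = type_i_error_rate()
print(f"Estimated Type I error rate: {rate:.3f}")
```

Lowering `alpha` makes false positives rarer, at the cost of making real effects harder to detect.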

## 3. Which technique is used to find the optimal number of clusters in k-means?

### Answer:

### Explanation:

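The Elbow Method runs k-means for a range of values of k and plots the within-cluster sum of squared distances (inertia) against k. Inertia tends to fall as k grows, so the "elbow" of the curve, the point beyond which increasing k yields only small improvements, is taken as a reasonable choice of k.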
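A minimal sketch of the idea, assuming 1-D data and a deliberately simple k-means with farthest-first initialization (both are simplifications made for this example, not a production implementation). The synthetic data has three well-separated groups, so the inertia should drop sharply up to k = 3 and flatten afterwards.

```python
import random

def kmeans_inertia(points, k, iters=50):
    """Run a basic 1-D k-means and return the within-cluster sum of squares."""
    # Farthest-first initialization keeps this sketch deterministic.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(abs(p - c) for c in centers)))
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return sum(min((p - c) ** 2 for c in centers) for p in points)

random.seed(0)
data = [random.gauss(mu, 0.5) for mu in (0, 10, 20) for _ in range(50)]
inertias = {k: kmeans_inertia(data, k) for k in range(1, 7)}
for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting `inertias` against k would show the bend at k = 3 for this dataset; on real data the elbow is often less distinct and remains a judgment call.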

## 4. Which of the following is not a part of the "Five Number Summary"?

### Answer:

### Explanation:

The "Five Number Summary" consists of the Minimum, First Quartile, Median, Third Quartile, and Maximum. It does not include the Mean.
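The summary can be computed with the standard library, as in this short sketch. The helper name `five_number_summary` is an assumption for this example, and the "inclusive" quartile method is only one of several conventions, so other tools may report slightly different quartiles.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(summary)
```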

## 5. The primary purpose of PCA (Principal Component Analysis) is:

### Answer:

### Explanation:

PCA is a technique for simplifying high-dimensional data while retaining its main trends and patterns. It transforms the original variables into a new set of variables, the principal components, which are mutually orthogonal and ordered so that each captures as much of the remaining variance in the data as possible.
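To make the transformation concrete, here is a sketch of PCA on 2-D data in plain Python. It uses the closed-form eigen-decomposition of a 2x2 covariance matrix, which is an assumption made to keep the example self-contained; real projects would normally rely on a linear algebra library. The function name and the toy data are hypothetical.

```python
import math

def pca_2d(points):
    """Return (eigenvalues, eigenvectors) of the 2x2 covariance matrix, largest first."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Sample covariance matrix [[a, b], [b, c]] of the centered data.
    a = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    c = sum((y - my) ** 2 for _, y in points) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = [half_trace + spread, half_trace - spread]
    if b == 0:
        # Variables are already uncorrelated: the axes are the components.
        eigenvectors = [(1.0, 0.0), (0.0, 1.0)] if a >= c else [(0.0, 1.0), (1.0, 0.0)]
    else:
        eigenvectors = []
        for lam in eigenvalues:
            norm = math.hypot(b, lam - a)
            eigenvectors.append((b / norm, (lam - a) / norm))
    return eigenvalues, eigenvectors

# Toy data lying close to the line y = 2x.
points = [(x, 2 * x + (0.1 if x % 2 else -0.1)) for x in range(10)]
eigenvalues, eigenvectors = pca_2d(points)
explained = eigenvalues[0] / sum(eigenvalues)
print(f"First component explains {explained:.4f} of the variance")
```

Because the points lie almost on a line, the first component alone captures nearly all the variance, which is why dropping the second one loses little information.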

## 6. In which of these is the Central Limit Theorem not valid?

### Answer:

### Explanation:

The Central Limit Theorem (CLT) states that the suitably normalized sum (or average) of many independent, identically distributed random variables with finite variance approaches a normal distribution, irrespective of the shape of the original distribution. It does not hold for distributions without finite variance, such as the Cauchy distribution. For heavily skewed data the theorem still applies, but a larger sample size may be needed before the normal approximation becomes accurate.
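A small simulation can illustrate the theorem. The sketch below, with sample size and trial count chosen arbitrarily for this example, draws repeated samples from a strongly skewed exponential distribution and checks that the sample means cluster around the true mean with a spread close to sigma / sqrt(n).

```python
import random
from statistics import mean, stdev

rng = random.Random(42)
n, trials = 30, 2000

# Exponential(1) is right-skewed, with mean 1 and standard deviation 1.
sample_means = [mean(rng.expovariate(1.0) for _ in range(n)) for _ in range(trials)]

center = mean(sample_means)
spread = stdev(sample_means)
print(f"mean of sample means: {center:.3f} (theory: 1.000)")
print(f"std of sample means:  {spread:.3f} (theory: {1 / n ** 0.5:.3f})")
```

A histogram of `sample_means` would look roughly bell-shaped even though the underlying distribution is not.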

## 7. Which algorithm can be used for both classification and regression tasks?

### Answer:

### Explanation:

Support Vector Machines can be employed for both classification (categorizing data into classes) and regression (predicting a continuous value).

## 8. What is the purpose of the train-test split?

### Answer:

### Explanation:

The primary purpose of a train-test split is to evaluate the model's performance on unseen data. By training on one subset and testing on another, it simulates how the model would perform in real-world scenarios.
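A minimal sketch of such a split, assuming the data fits in a Python list; the function name, the 80/20 ratio, and the fixed seed are choices made for this example.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # shuffle so the split is not order-dependent
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(list(range(100)))
print(len(train), len(test))
```

The model is fit on `train` only; `test` is held back until evaluation so that the reported score reflects data the model has not seen.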

## 9. Overfitting in a model indicates:

### Answer:

### Explanation:

Overfitting happens when a model learns the training data too closely, including its noise and outliers, which makes it perform poorly on new, unseen data.

## 10. In the context of a decision tree, what is "Entropy"?

### Answer:

### Explanation:

In decision trees, entropy measures the impurity or disorder of a set of labels. A homogeneous set has an entropy of 0, while a binary set with a 50-50 split of classes has an entropy of 1 (using a base-2 logarithm).
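The two boundary cases can be checked with a short sketch of the Shannon entropy formula, H = -sum(p * log2(p)) over the class proportions p. The function name and the labels are assumptions for this example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(-(count / total) * log2(count / total)
               for count in Counter(labels).values())

pure = entropy(["yes", "yes", "yes", "yes"])
mixed = entropy(["yes", "no", "yes", "no"])
print(pure, mixed)
```

A decision tree picks the split that reduces this quantity the most across the resulting child nodes.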

## 11. Which method helps in avoiding overfitting in neural networks?

### Answer:

### Explanation:

Dropout is a regularization technique in which randomly selected neurons are ignored (their outputs set to zero) during training. This helps prevent overfitting because the network cannot rely too heavily on any single neuron and must learn redundant, more robust representations.
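The mechanism can be sketched on a plain list of activations. This example assumes the common "inverted dropout" variant, which rescales the surviving activations during training so that nothing needs to change at inference time; the function name and parameters are hypothetical.

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Zero each activation with probability p during training (inverted dropout)."""
    if not training or p == 0:
        return list(activations)  # inference: pass everything through unchanged
    keep = 1.0 - p
    # Survivors are scaled by 1 / keep so the expected activation stays the same.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
train_out = dropout([1.0] * 1000, p=0.5, rng=rng)
eval_out = dropout([1.0, 2.0, 3.0], p=0.5, training=False)
print(train_out[:8], eval_out)
```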

## 12. Which of the following is not a metric used for evaluating regression models?

### Answer:

### Explanation:

Accuracy is typically used for classification tasks. For regression models, metrics such as MSE and RMSE measure the size of the difference between predicted and actual values, while R-squared measures the proportion of variance in the target that the model explains.
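The three regression metrics can be computed directly from their definitions, as in this sketch. The function name and the toy values are assumptions for this example.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, R-squared) for paired actual and predicted values."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    mse = ss_res / n
    return mse, sqrt(mse), 1 - ss_res / ss_tot

mse, rmse, r2 = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])
print(mse, round(rmse, 4), r2)
```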

## 13. In time series analysis, what does "stationarity" mean?

### Answer:

### Explanation:

A time series is said to be stationary if it has constant statistical properties over time, i.e., mean, variance, and autocorrelation structure do not change over time.

## 14. In k-fold cross-validation:

### Answer:

### Explanation:

In k-fold cross-validation, the data is split into k subsets. One subset is used as the test set, and the other k-1 subsets are combined and used as the training set. This process is repeated k times, each time with a different test set.
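The procedure can be sketched as a generator of index pairs. The function name is an assumption for this example, and the sketch splits indices in order without shuffling, which real workflows often add.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample appears in exactly one test fold, and the k evaluation scores are typically averaged into a single estimate.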

## 15. Which of the following is a non-parametric machine learning algorithm?

### Answer:

### Explanation:

Non-parametric models do not assume a fixed functional form with a predetermined number of parameters; their complexity can grow with the data. Decision Trees require no assumptions about the distribution of the data or the form of the relationships between variables, making them non-parametric.

## 16. Which method is used to capture the linear relationship between the dependent and independent variables in a dataset?

### Answer:

### Explanation:

Regression analysis, and linear regression in particular, models the relationship between a dependent variable and one or more independent variables by fitting a straight line (or hyperplane) to the data.
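For a single predictor, the least-squares line has a closed-form solution: the slope is the covariance of x and y divided by the variance of x. The sketch below applies it to toy data; the function name and values are assumptions for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```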

## 17. Which of the following is not a characteristic of Big Data?

### Answer:

### Explanation:

The commonly cited characteristics of Big Data are Volume, Variety, and Velocity, with Veracity and Value sometimes added to the list. Validation is not one of the recognized Vs of Big Data.

## 18. Which of the following techniques can be used for outlier detection?

### Answer:

### Explanation:

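The IQR (interquartile range) is a measure of statistical spread, equal to the third quartile minus the first quartile. A value is commonly flagged as an outlier if it lies more than 1.5 times the IQR above the third quartile or more than 1.5 times the IQR below the first quartile.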
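The rule translates into a few lines of code. This sketch assumes the "inclusive" quartile convention from the standard library and the conventional multiplier of 1.5; the function name and data are hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, factor=1.5):
    """Return the values lying beyond factor * IQR from the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]

outliers = iqr_outliers([10, 12, 11, 13, 12, 11, 95])
print(outliers)
```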

## 19. What is the purpose of A/B testing?

### Answer:

### Explanation:

A/B testing, also known as split testing, is a method of comparing two versions of a webpage or app against each other to determine which one performs better.
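One common way to decide which version "performs better" is a two-proportion z-test on conversion rates. The sketch below is an illustration under that assumption; the function name and the visitor and conversion counts are made up for the example, and other tests may suit other metrics.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate under "no difference"
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical experiment: version A converts 200 of 1000, version B 250 of 1000.
z, p_value = two_proportion_z_test(200, 1000, 250, 1000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the difference in conversion rates is unlikely to be due to chance alone.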

## 20. In a ROC curve, if Area Under Curve (AUC) is 0.5, it implies:

### Answer:

### Explanation:

In the context of the ROC curve, an AUC of 0.5 means that the model has no discriminative power and is no better than random guessing. An AUC of 1 indicates a perfect model, while an AUC of 0 indicates a model whose rankings are perfectly inverted: it scores every negative example above every positive one.
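AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by comparing every positive-negative pair, which is simple but quadratic in the number of examples; the function name and the toy scores are assumptions for this example.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrong order counts 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
perfect = roc_auc(labels, [0.1, 0.2, 0.8, 0.9])
inverted = roc_auc(labels, [0.9, 0.8, 0.2, 0.1])
uninformative = roc_auc(labels, [0.5, 0.5, 0.5, 0.5])
print(perfect, inverted, uninformative)
```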

## 21. Principal Component Analysis (PCA) is primarily used for:

### Answer:

### Explanation:

PCA is a technique used to emphasize variation and bring out strong patterns in a dataset. It's often used to make data easy to explore and visualize by reducing its number of dimensions, without much loss of information.

## 22. Which of the following is a disadvantage of decision trees?

### Answer:

### Explanation:

Decision trees can easily become too complex and fit the noise in the data, leading to overfitting. This is especially true for trees that are very deep.

## 23. Which of the following is not a supervised learning algorithm?

### Answer:

### Explanation:

The Apriori algorithm is an unsupervised method used for association rule learning. It mines frequent itemsets from transactional databases and derives association rules from them, without any labeled target variable.

## 24. In the context of machine learning, bias:

### Answer:

### Explanation:

In machine learning, bias refers to the error due to the assumptions made by a model to make a prediction. High bias can cause a model to miss relevant relations between features and target outputs (underfitting).

## 25. Which of the following statements about overfitting is false?

### Answer:

### Explanation:

Overfitting is not desirable. It happens when a model learns not just the underlying patterns but also the noise in the training data. As a result, it performs poorly on new, unseen data.