Data Science MCQ

Data Science blends algorithms, tools, and machine learning principles to uncover hidden patterns in raw data. Test your foundational knowledge with these 25 MCQs.

1. Which of the following is a supervised learning method?

a) Association
b) Clustering
c) Regression
d) Dimensionality Reduction

Answer:

c) Regression

Explanation:

Supervised learning involves training on labeled data. Regression is a supervised method where the outcome variable is predicted based on one or more predictor variables.

2. In statistics, what does a Type I error represent?

a) False Positive
b) False Negative
c) True Positive
d) True Negative

Answer:

a) False Positive

Explanation:

A Type I error, or false positive, occurs when we reject the null hypothesis when it is actually true.

3. Which technique is used to find the optimal number of clusters in k-means?

a) Confusion Matrix
b) ROC Curve
c) Elbow Method
d) Lift

Answer:

c) Elbow Method

Explanation:

The Elbow Method involves running the k-means clustering algorithm for a range of values of k, then plotting the sum of squared distances. The "elbow" of the curve represents the optimal value for k.
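As an illustration, here is a minimal numpy sketch (a toy Lloyd's k-means with a simple deterministic initialization, not a production implementation):

```python
import numpy as np

def kmeans_inertia(X, k, n_iter=20):
    """Toy Lloyd's k-means; returns the sum of squared distances to centers."""
    # Deterministic init: spread initial centers along the first coordinate.
    order = np.argsort(X[:, 0])
    centers = X[order[np.linspace(0, len(X) - 1, k).astype(int)]].astype(float)
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return float(((X - centers[labels]) ** 2).sum())

# Two well-separated blobs: the inertia drops sharply from k=1 to k=2
# and then flattens out -- the "elbow" is at k=2.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
inertias = [kmeans_inertia(X, k) for k in range(1, 5)]
```

In a real project you would plot `inertias` against `k` and read the elbow off the curve.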

4. Which of the following is not a part of the "Five Number Summary"?

a) Mean
b) Median
c) Minimum
d) Third Quartile

Answer:

a) Mean

Explanation:

The "Five Number Summary" consists of the Minimum, First Quartile, Median, Third Quartile, and Maximum. It does not include the Mean.
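The summary is easy to compute with numpy's percentile functions (the sample values below are just illustrative):

```python
import numpy as np

data = np.array([7, 15, 36, 39, 40, 41])
five_num = {
    "min": np.min(data),
    "q1": np.percentile(data, 25),
    "median": np.median(data),
    "q3": np.percentile(data, 75),
    "max": np.max(data),
}
# The mean (about 29.7 here) is a separate statistic and is not part
# of the five number summary.
```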

5. The primary purpose of PCA (Principal Component Analysis) is:

a) Classification
b) Clustering
c) Dimensionality Reduction
d) Regression

Answer:

c) Dimensionality Reduction

Explanation:

PCA is a technique to simplify the complexity in high-dimensional data while retaining trends and patterns. It does this by transforming the original variables into a new set of variables, the principal components, which are orthogonal, and which reflect the maximum variance in the data.
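As a sketch, PCA can be done by hand with numpy: center the data, then take the top eigenvectors of the covariance matrix as the principal directions (the toy 3-D data here is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated 3-D data that really lives near a 2-D plane.
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.5, 0.2], [0.3, 1.0, 0.1]])
X = latent @ mixing + rng.normal(scale=0.05, size=(200, 3))

# PCA by hand: center, then eigendecompose the covariance matrix.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
components = eigvecs[:, ::-1][:, :2]     # top-2 orthogonal directions
reduced = Xc @ components                # 200 x 2 low-dimensional view
explained = eigvals[::-1][:2].sum() / eigvals.sum()
```

Because the data is nearly planar, two components capture almost all of the variance.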

6. In which of the following cases does the Central Limit Theorem not apply reliably?

a) Sample size is small (< 30) and data is heavily skewed
b) Sample size is large (>= 30)
c) Data is from a normal distribution
d) Data is from a binomial distribution

Answer:

a) Sample size is small (< 30) and data is heavily skewed

Explanation:

The Central Limit Theorem (CLT) states that the distribution of the sum (or average) of many independent, identically distributed random variables approaches a normal distribution, irrespective of the shape of the original distribution. However, for heavily skewed data, a larger sample size might be needed to see the approximation to normality.
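A quick simulation illustrates the point: means of samples drawn from a heavily skewed exponential distribution are already close to normal at n = 40:

```python
import numpy as np

rng = np.random.default_rng(1)
# Exponential(1) is heavily skewed (skewness 2), with mean 1 and sd 1.
sample_means = rng.exponential(1.0, size=(10_000, 40)).mean(axis=1)

# By the CLT, these means are approximately Normal(1, 1/sqrt(40)):
# centered near 1, with most of the original skewness gone.
z = (sample_means - sample_means.mean()) / sample_means.std()
skewness = (z ** 3).mean()  # roughly 2 / sqrt(40), i.e. about 0.3
```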

7. Which algorithm can be used for both classification and regression tasks?

a) Support Vector Machine
b) K-Means Clustering
c) Apriori
d) Hierarchical Clustering

Answer:

a) Support Vector Machine

Explanation:

Support Vector Machines can be employed for both classification (categorizing data into classes) and regression (predicting a continuous value).

8. What is the purpose of the train-test split?

a) Reduce computational power
b) Train on one set and test on an unseen set
c) Increase the accuracy of the model
d) Deal with missing data

Answer:

b) Train on one set and test on an unseen set

Explanation:

The primary purpose of a train-test split is to evaluate the model's performance on unseen data. By training on one subset and testing on another, it simulates how the model would perform in real-world scenarios.
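In scikit-learn this is handled by `train_test_split`; a minimal numpy equivalent (an illustrative sketch, not the library implementation) looks like this:

```python
import numpy as np

def train_test_split(X, y, test_size=0.25, seed=0):
    """Shuffle the indices, then hold out a `test_size` fraction for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_size)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X = np.arange(100).reshape(50, 2)
y = np.arange(50)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```

The key property is that the two sets are disjoint: no test row is ever seen during training.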

9. Overfitting in a model indicates:

a) The model performs poorly on both training and test data
b) The model performs well on training data but poorly on test data
c) The model is under-parameterized
d) The model has too little data to learn from

Answer:

b) The model performs well on training data but poorly on test data

Explanation:

Overfitting happens when a model learns the training data too closely, including its noise and outliers, which makes it perform poorly on new, unseen data.

10. In the context of a decision tree, what is "Entropy"?

a) A metric to measure the purity of a split
b) The depth of the tree
c) The number of decision nodes in the tree
d) The number of outcomes of a decision node

Answer:

a) A metric to measure the purity of a split

Explanation:

In decision trees, entropy is a measure of impurity or disorder in a set of labels. A homogeneous set has an entropy of 0, while a set with a 50-50 split of two classes has an entropy of 1 (for binary classification).
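Those two boundary cases are easy to verify with a small helper (an illustrative sketch):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of the class labels in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

pure = entropy([0, 0, 0, 0])    # homogeneous node
mixed = entropy([0, 0, 1, 1])   # 50-50 split
skewed = entropy([0, 0, 0, 1])  # mostly one class: between 0 and 1
```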

11. Which method helps in avoiding overfitting in neural networks?

a) Pruning
b) Bagging
c) Boosting
d) Dropout

Answer:

d) Dropout

Explanation:

Dropout is a regularization technique for neural networks in which randomly selected neurons are ignored during training. This helps prevent overfitting because the network cannot rely too heavily on any single neuron or co-adapted group of neurons.
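A minimal numpy sketch of "inverted" dropout (the variant used by most frameworks, which rescales surviving activations during training so inference needs no change):

```python
import numpy as np

def dropout(activations, rate=0.5, rng=None, training=True):
    """Zero out a random `rate` fraction of units during training and scale
    the survivors by 1/(1-rate) so the expected activation is unchanged.
    At inference time, pass the input through untouched."""
    if not training:
        return activations
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

h = np.ones((4, 8))
out = dropout(h, rate=0.5, rng=np.random.default_rng(0))
```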

12. Which of the following is not a metric used for evaluating regression models?

a) Mean Squared Error (MSE)
b) Root Mean Squared Error (RMSE)
c) Accuracy
d) R-squared

Answer:

c) Accuracy

Explanation:

Accuracy is typically used for classification tasks. For regression models, metrics like MSE, RMSE, and R-squared are more appropriate as they measure the difference between predicted and actual values.
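These three regression metrics are straightforward to compute from predictions (the numbers below are illustrative):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, and R-squared from actual vs. predicted values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mse = ((y_true - y_pred) ** 2).mean()
    rmse = np.sqrt(mse)
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    return float(mse), float(rmse), float(r2)

mse, rmse, r2 = regression_metrics([3, 5, 7, 9], [2.8, 5.1, 7.2, 8.9])
```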

13. In time series analysis, what does "stationarity" mean?

a) The series has a constant mean and variance over time
b) The series has no trend
c) The series has a regular seasonality pattern
d) The series can be predicted with high accuracy

Answer:

a) The series has a constant mean and variance over time

Explanation:

A time series is said to be stationary if it has constant statistical properties over time, i.e., mean, variance, and autocorrelation structure do not change over time.
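A crude illustration on toy data: compare the first- and second-half means of a white-noise series against a trending one. The stationary series drifts very little; the trending one does not:

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(500)
stationary = rng.normal(0, 1, 500)           # white noise: constant mean and variance
trending = 0.02 * t + rng.normal(0, 1, 500)  # linear trend: the mean drifts upward

# Crude check: compare the means of the first and second halves.
stationary_drift = abs(stationary[:250].mean() - stationary[250:].mean())
trend_drift = abs(trending[:250].mean() - trending[250:].mean())
```

Differencing (e.g. `np.diff`) is a standard way to remove such a trend before modeling.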

14. In k-fold cross-validation:

a) The data is divided into k sets and the model is trained k times using a different set as the test set each time.
b) The data is divided into k sets and the model is trained once using k-1 sets combined.
c) The data is divided into 2 sets, one for training and one for testing.
d) The model is trained k times on the same data.

Answer:

a) The data is divided into k sets and the model is trained k times using a different set as the test set each time.

Explanation:

In k-fold cross-validation, the data is split into k subsets. One subset is used as the test set, and the other k-1 subsets are combined and used as the training set. This process is repeated k times, each time with a different test set.
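The index bookkeeping can be sketched in a few lines (scikit-learn's `KFold` does this for you):

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(10, 5))  # 5 splits of 8 train / 2 test indices
```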

15. Which of the following is a non-parametric machine learning algorithm?

a) Linear Regression
b) Logistic Regression
c) Decision Trees
d) Naive Bayes Classifier

Answer:

c) Decision Trees

Explanation:

Non-parametric models make no fixed assumptions about the functional form of the relationship between inputs and outputs. Decision Trees require no assumptions about the distribution of the data or the relationships between variables, which makes them non-parametric.

16. Which method is used to capture the linear relationship between the dependent and independent variables in a dataset?

a) Clustering
b) Association
c) Regression
d) Classification

Answer:

c) Regression

Explanation:

Regression analysis models the relationship between dependent and independent variables; linear regression in particular captures a straight-line (linear) relationship.

17. Which of the following is not a characteristic of Big Data?

a) Volume
b) Variety
c) Validation
d) Velocity

Answer:

c) Validation

Explanation:

The commonly cited characteristics of Big Data are Volume, Variety, and Velocity, with Veracity and Value sometimes added to the list. Validation is not one of the recognized Vs of Big Data.

18. Which of the following techniques can be used for outlier detection?

a) Linear Regression
b) IQR (Interquartile Range)
c) R-squared
d) Precision

Answer:

b) IQR (Interquartile Range)

Explanation:

IQR is a measure of statistical spread. Outliers are commonly flagged as values more than 1.5 × IQR above the third quartile or more than 1.5 × IQR below the first quartile.
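The rule is a one-liner with numpy (toy sample below; the value 102 sits far outside the fences):

```python
import numpy as np

def iqr_outliers(data):
    """Return values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    data = np.asarray(data, float)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return data[(data < lower) | (data > upper)]

outliers = iqr_outliers([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])
```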

19. What is the purpose of A/B testing?

a) Comparing two or more machine learning models
b) Testing the hypothesis for population means
c) Comparing the performance of two versions of a webpage or product to see which one performs better
d) Training a model on set A and testing on set B

Answer:

c) Comparing the performance of two versions of a webpage or product to see which one performs better

Explanation:

A/B testing, also known as split testing, is a method of comparing two versions of a webpage or app against each other to determine which one performs better.

20. In a ROC curve, if Area Under Curve (AUC) is 0.5, it implies:

a) The model has a perfect performance
b) The model's performance is better than random guessing
c) The model's performance is worse than random guessing
d) The model's performance is equivalent to random guessing

Answer:

d) The model's performance is equivalent to random guessing

Explanation:

In the context of the ROC curve, an AUC of 0.5 means the model has no discriminative power and is as good as random guessing. An AUC of 1 indicates a perfect model, while an AUC of 0 indicates a model whose predictions are perfectly inverted.
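AUC has a handy probabilistic reading: it is the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counting half. A brute-force sketch on illustrative scores:

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC as P(random positive outranks random negative), ties counting
    half, computed by brute-force pairwise comparison."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

y = [0, 0, 1, 1]
perfect = auc_score(y, [0.1, 0.2, 0.8, 0.9])      # positives always outrank
random_like = auc_score(y, [0.5, 0.5, 0.5, 0.5])  # all ties: random guessing
inverted = auc_score(y, [0.9, 0.8, 0.2, 0.1])     # negatives always outrank
```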

21. Principal Component Analysis (PCA) is primarily used for:

a) Classification of data
b) Regression analysis
c) Dimensionality reduction
d) Clustering of data

Answer:

c) Dimensionality reduction

Explanation:

PCA is a technique used to emphasize variation and bring out strong patterns in a dataset. It's often used to make data easy to explore and visualize by reducing its number of dimensions, without much loss of information.

22. Which of the following is a disadvantage of decision trees?

a) They cannot handle linear data.
b) They are prone to overfitting.
c) They cannot be used for regression.
d) They can only handle categorical data.

Answer:

b) They are prone to overfitting.

Explanation:

Decision trees can easily become too complex and fit the noise in the data, leading to overfitting. This is especially true for trees that are very deep.

23. Which of the following is not a supervised learning algorithm?

a) k-Nearest Neighbors
b) Random Forest
c) Apriori
d) Support Vector Machines

Answer:

c) Apriori

Explanation:

The Apriori algorithm is used for association rule learning, a type of unsupervised machine learning. It's used for frequent itemset mining and association rule learning over transactional databases.

24. In the context of machine learning, bias:

a) Is the error due to overly complex models.
b) Refers to a model's predictions being systematically off.
c) Is always undesirable and must be minimized.
d) Means that a model performs differently based on the input data.

Answer:

b) Refers to a model's predictions being systematically off.

Explanation:

In machine learning, bias refers to the error due to the assumptions made by a model to make a prediction. High bias can cause a model to miss relevant relations between features and target outputs (underfitting).

25. Which of the following statements about overfitting is false?

a) Overfitting occurs when a model is too complex.
b) Overfitting can result in high accuracy on the training data.
c) Overfitting is desirable as it captures all patterns in the data.
d) Regularization techniques can be used to prevent overfitting.

Answer:

c) Overfitting is desirable as it captures all patterns in the data.

Explanation:

Overfitting is not desirable. It happens when a model learns not just the underlying patterns but also the noise in the training data. As a result, it performs poorly on new, unseen data.

