# Statistics interview questions and answers

## Freshers / Beginner level questions & answers

### Ques 1. What is the difference between population and sample?

Population refers to the entire group under study, while a sample is a subset of the population.

**Example:**

Population: All students in a university. Sample: 100 students randomly selected from the university.

### Ques 2. What is p-value in statistics?

The p-value is the probability of obtaining results as extreme as or more extreme than the observed results under the assumption that the null hypothesis is true.

**Example:**

If the p-value is 0.05, results at least as extreme as those observed would occur 5% of the time if the null hypothesis were true. It is not the probability that the null hypothesis is true.
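
As a minimal sketch of this definition, the snippet below computes an exact two-sided p-value for a hypothetical experiment (9 heads in 10 coin flips) under the null hypothesis of a fair coin. The function name and the data are illustrative assumptions, not part of the original answer.

```python
from math import comb


def binomial_two_sided_p(heads: int, flips: int) -> float:
    """Exact two-sided p-value under the null hypothesis of a fair coin."""
    expected = flips / 2
    observed_gap = abs(heads - expected)
    # Count every outcome at least as far from the expected count as the
    # observed one; under a fair coin all 2**flips sequences are equally likely.
    extreme_sequences = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - expected) >= observed_gap
    )
    return extreme_sequences / 2 ** flips


p_value = binomial_two_sided_p(heads=9, flips=10)
print(round(p_value, 4))  # 0.0215
```

Here the outcomes 0, 1, 9, and 10 heads count as "at least as extreme", giving 22 of 1024 equally likely sequences.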

### Ques 3. Define standard deviation.

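Standard deviation is a measure of the amount of variation or dispersion in a set of values. It is the square root of the variance (the average squared deviation from the mean), so it describes the typical distance of a data point from the mean in the data's own units.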

**Example:**

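For a set of test scores: 70, 72, 75, 78, 80. The mean is 75; the population standard deviation is approximately 3.69, and the sample standard deviation (dividing by n - 1) is approximately 4.12.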
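
The two conventions (dividing by n or by n - 1) can be checked for these five scores with Python's standard `statistics` module; this is a quick verification sketch rather than a prescribed method.

```python
import statistics

scores = [70, 72, 75, 78, 80]

# pstdev treats the data as the whole population (divides by n).
population_sd = statistics.pstdev(scores)
# stdev treats the data as a sample (divides by n - 1).
sample_sd = statistics.stdev(scores)

print(round(population_sd, 2), round(sample_sd, 2))  # 3.69 4.12
```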

### Ques 4. What is the difference between a population parameter and a sample statistic?

A population parameter is a numerical value that describes a characteristic of an entire population. A sample statistic is a numerical value that describes a characteristic of a sample drawn from that population.

**Example:**

Population parameter: Average income of all households. Sample statistic: Average income of a sample of 100 households.

### Ques 5. Define the term 'confidence level' in the context of confidence intervals.

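The confidence level is the long-run proportion of confidence intervals, built by the same procedure from repeated samples, that contain the true population parameter. It is not the probability that one particular calculated interval contains the parameter. Common confidence levels are 90%, 95%, and 99%.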

**Example:**

A 99% confidence interval for the average weight of a certain species of bird is 20 to 25 grams.
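
The repeated-sampling meaning of a confidence level can be illustrated by simulation. The sketch below assumes normally distributed data with a known standard deviation (so a z-interval applies); the population values `mu` and `sigma`, the sample size, and the function name are all hypothetical.

```python
import random
from statistics import NormalDist, fmean


def coverage(level=0.95, n=30, trials=2000, mu=50.0, sigma=10.0, seed=0):
    """Fraction of simulated z-intervals that contain the true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(trials):
        sample_mean = fmean(rng.gauss(mu, sigma) for _ in range(n))
        # The interval sample_mean +/- half_width covers mu exactly when
        # the sample mean lies within half_width of mu.
        if abs(sample_mean - mu) <= half_width:
            hits += 1
    return hits / trials


print(coverage())  # should land close to 0.95
```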

### Ques 6. What is a histogram, and how is it used in statistics?

A histogram is a graphical representation of the distribution of a dataset. It displays the frequencies of different ranges or bins of the data, providing insights into its shape and central tendency.

**Example:**

Creating a histogram to visualize the distribution of exam scores in a class.

### Ques 7. What is the purpose of a box plot (box-and-whisker plot)?

A box plot is used to display the distribution and spread of a dataset. It shows the median, quartiles, and potential outliers in a compact visual format.

**Example:**

Comparing the distribution of salaries in different departments using box plots.

## Intermediate / 1 to 5 years experienced level questions & answers

### Ques 8. Explain the central limit theorem.

The central limit theorem states that the distribution of the suitably standardized sum or average of a large number of independent, identically distributed random variables with finite variance approaches a normal distribution, regardless of the shape of the original distribution.

**Example:**

If you repeatedly roll a fair six-sided die a fixed number of times (say 50) and record the average of each batch, the distribution of those averages will be approximately normal.
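
A small simulation can make the die example concrete. The batch size (50 rolls) and the number of batches (5000) are arbitrary choices for illustration; theory predicts the averages center on 3.5 with a standard deviation of about sqrt(35/12) / sqrt(50) ≈ 0.24.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(42)
rolls_per_average = 50

# Each entry is the average of one batch of 50 fair die rolls.
averages = [
    fmean(rng.randint(1, 6) for _ in range(rolls_per_average))
    for _ in range(5000)
]

center = fmean(averages)
spread = pstdev(averages)
# For a normal distribution, roughly 68% of values fall within one
# standard deviation of the mean.
within_one_sd = sum(abs(a - center) <= spread for a in averages) / len(averages)

print(round(center, 2), round(spread, 2), round(within_one_sd, 2))
```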

### Ques 9. What is regression analysis?

Regression analysis is a statistical technique used to model the relationship between a dependent variable and one or more independent variables by fitting a linear or nonlinear equation to the observed data.

**Example:**

Predicting house prices based on factors like square footage, number of bedrooms, and location.

### Ques 10. What is the purpose of hypothesis testing?

Hypothesis testing is used to make inferences about a population based on a sample of data. It involves comparing observed data with the results that would be expected if a specific null hypothesis were true.

**Example:**

Testing whether a new drug has a significant effect by comparing the outcomes of treated and untreated groups.

### Ques 11. Differentiate between correlation and causation.

Correlation implies a relationship between two variables, but it does not imply causation. Causation indicates that one variable directly influences the other.

**Example:**

There is a correlation between ice cream sales and drowning incidents, but one doesn't cause the other; both are influenced by warm weather.

### Ques 12. What is a confidence interval?

A confidence interval is a range of values that is likely to contain the true unknown parameter, with a certain level of confidence based on the sample data.

**Example:**

A 95% confidence interval for the average height of a population is 160 to 170 cm.

### Ques 13. Define type I and type II errors.

Type I error occurs when a true null hypothesis is incorrectly rejected. Type II error occurs when a false null hypothesis is not rejected.

**Example:**

Type I: Concluding a new drug is effective when it is not. Type II: Concluding a new drug is not effective when it is.

### Ques 14. What is ANOVA?

Analysis of Variance (ANOVA) is a statistical method used to determine if there are any statistically significant differences between the means of three or more independent groups.

**Example:**

Comparing the average scores of students in three different teaching methods.
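
The F statistic behind one-way ANOVA is the ratio of between-group to within-group variability. The sketch below computes it by hand for three hypothetical groups of scores; the function name and data are assumptions for illustration, and the p-value lookup from the F distribution is left out.

```python
from statistics import fmean


def one_way_anova_f(*groups):
    """F statistic: mean square between groups / mean square within groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = fmean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - fmean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


method_a = [80, 85, 90]
method_b = [70, 75, 80]
method_c = [60, 65, 70]
f_stat = one_way_anova_f(method_a, method_b, method_c)
print(f_stat)  # 12.0
```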

### Ques 15. Explain the concept of p-hacking.

P-hacking refers to the manipulation of statistical analyses, methods, or data to produce statistically significant results, often by testing multiple hypotheses until one reaches significance.

**Example:**

Conducting multiple tests on the same data until a significant result is found and then reporting only that result.
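
Why repeated testing inflates false positives can be shown with one line of arithmetic. Assuming the tests are independent and every null hypothesis is true, the chance that at least one test is "significant" at level alpha grows quickly; the function name below is illustrative.

```python
def chance_of_any_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """P(at least one p < alpha) across independent tests of true nulls."""
    return 1 - (1 - alpha) ** num_tests


for num_tests in (1, 5, 20):
    print(num_tests, round(chance_of_any_false_positive(num_tests), 3))
# 1 0.05
# 5 0.226
# 20 0.642
```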

### Ques 16. What is the difference between correlation and covariance?

Correlation is a standardized measure of the strength and direction of the linear relationship between two variables. Covariance measures the extent to which two variables change together, but it is not standardized.

**Example:**

Correlation coefficient ranges from -1 to 1; covariance can take any value.
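
The practical effect of standardization is that changing units rescales covariance but leaves correlation unchanged. The following sketch uses hypothetical height and weight data and hand-written helper functions to show this.

```python
from statistics import fmean


def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mx, my = fmean(xs), fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / (covariance(xs, xs) * covariance(ys, ys)) ** 0.5


heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]   # hypothetical data
weights_kg = [52, 58, 63, 72, 80]
heights_cm = [h * 100 for h in heights_m]

cov_m = covariance(heights_m, weights_kg)
cov_cm = covariance(heights_cm, weights_kg)   # 100 times larger
corr_m = correlation(heights_m, weights_kg)
corr_cm = correlation(heights_cm, weights_kg)  # same as corr_m

print(round(cov_m, 2), round(cov_cm, 2), round(corr_m, 3), round(corr_cm, 3))
```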

### Ques 17. Define multicollinearity in regression analysis.

Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, making it difficult to identify the individual effect of each variable on the dependent variable.

**Example:**

In a regression predicting house prices, if square footage and number of bedrooms are strongly correlated, multicollinearity may occur.

### Ques 18. What is a Q-Q plot used for?

A Q-Q plot (Quantile-Quantile plot) is used to assess whether a dataset follows a particular theoretical distribution, like the normal distribution. It compares the quantiles of the observed data to the quantiles of the expected distribution.

**Example:**

Checking if a set of exam scores follows a normal distribution using a Q-Q plot.

### Ques 19. Explain the term 'power' in statistics.

Power is the probability that a statistical test will correctly reject a false null hypothesis. It is the ability of a test to detect an effect, given that the effect truly exists.

**Example:**

A study with a larger sample size generally has higher power to detect a true effect.
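
The link between sample size and power can be sketched for the simplest case: a one-sided z-test for a mean with known standard deviation. The effect size, `sigma`, and the function name are assumptions chosen for illustration.

```python
from statistics import NormalDist


def z_test_power(effect: float, sigma: float, n: int, alpha: float = 0.05) -> float:
    """Power of a one-sided z-test when the true mean shift equals `effect`."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    # Under the alternative, the test statistic is shifted by effect*sqrt(n)/sigma.
    return 1 - NormalDist().cdf(z_crit - effect * n ** 0.5 / sigma)


for n in (10, 30, 100):
    print(n, round(z_test_power(effect=0.5, sigma=2.0, n=n), 3))
```

With no true effect (`effect=0`), the function returns alpha, the Type I error rate.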

### Ques 20. What is the purpose of a chi-squared test?

A chi-squared test is used to determine if there is a significant association between two categorical variables. It compares the observed frequencies with the expected frequencies.

**Example:**

Testing if there is a significant association between gender and voting preference.
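
The comparison of observed and expected frequencies can be written out directly. This sketch computes the chi-squared statistic for a hypothetical 2x2 table of counts; the function name and the counts are illustrative.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat


# Rows: two groups; columns: two voting preferences (hypothetical counts).
observed_counts = [[30, 20], [20, 30]]
chi2 = chi_squared_statistic(observed_counts)
print(chi2)  # 4.0
```

With 1 degree of freedom, the 5% critical value is about 3.84, so a statistic of 4.0 would be significant at that level.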

### Ques 21. Explain the difference between a one-tailed and a two-tailed test.

In a one-tailed test, the critical region is on one side of the distribution (either the right or left). In a two-tailed test, the critical region is on both sides.

**Example:**

One-tailed test: Does a new drug increase performance? Two-tailed test: Does a new drug have any effect on performance?

### Ques 22. What is the coefficient of determination (R-squared) in regression analysis?

The coefficient of determination, denoted as R-squared, measures the proportion of the variance in the dependent variable that is predictable from the independent variables. For a least-squares model with an intercept, it ranges from 0 to 1.

**Example:**

If R-squared is 0.75, 75% of the variance in the dependent variable is explained by the independent variables.
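
As a worked sketch, R-squared can be computed as 1 minus the ratio of residual to total sum of squares. The example fits a simple least-squares line to five hypothetical points; the function name and data are assumptions.

```python
from statistics import fmean


def r_squared(xs, ys):
    """R-squared of a simple least-squares line fitted to (xs, ys)."""
    mx, my = fmean(xs), fmean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r2 = r_squared(xs, ys)
print(round(r2, 2))  # 0.6
```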

### Ques 23. Define skewness in statistics.

Skewness measures the asymmetry of the probability distribution of a real-valued random variable. A negative skewness indicates a distribution that is skewed to the left, and a positive skewness indicates a distribution that is skewed to the right.

**Example:**

A dataset with a long tail to the right has positive skewness.
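
One common way to quantify this is the moment-based skewness coefficient (third central moment divided by the cubed standard deviation). The sketch below uses that definition on hypothetical data; other estimators with small-sample corrections exist.

```python
from statistics import fmean


def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 using population moments."""
    mean = fmean(values)
    n = len(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]       # long tail to the right
left_tailed = [-v for v in right_tailed]        # mirror image
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed) > 0, skewness(left_tailed) < 0)  # True True
```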

### Ques 24. What is the purpose of a t-test?

A t-test is used to determine if there is a significant difference between the means of two groups. It is often applied when the sample size is small and the population standard deviation is unknown.

**Example:**

Comparing the average scores of two groups of students who were taught using different methods.
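
A sketch of the computation for two independent groups follows. It uses Welch's variant of the t-test, which does not assume equal variances (a choice made here, not stated in the answer above), and stops at the test statistic and degrees of freedom because the standard library has no t-distribution. The scores are hypothetical.

```python
from statistics import fmean, variance


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom."""
    va = variance(a) / len(a)   # squared standard error of mean(a)
    vb = variance(b) / len(b)
    t = (fmean(a) - fmean(b)) / (va + vb) ** 0.5
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


method_1 = [85, 88, 90, 92, 95]
method_2 = [78, 80, 83, 85, 84]
t_stat, dof = welch_t(method_1, method_2)
print(round(t_stat, 2), round(dof, 2))
```

The resulting statistic would then be compared against a t distribution with `dof` degrees of freedom to obtain a p-value.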

### Ques 25. Define outlier in the context of statistical analysis.

An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. It may indicate a data entry error, measurement error, or a rare event.

**Example:**

In a dataset of exam scores, a score of 120 when others range from 50 to 100 may be an outlier.
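
One widely used screening rule flags values more than 1.5 interquartile ranges beyond the quartiles. The sketch below applies it to hypothetical scores; the multiplier 1.5 is a convention, and quartile definitions vary slightly between tools.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


scores = [68, 70, 72, 74, 75, 76, 78, 80, 120]
print(iqr_outliers(scores))  # [120]
```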

## Experienced / Expert level questions & answers

### Ques 26. Explain the difference between Type I and Type II censoring in survival analysis.

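Both are forms of right censoring and differ in what ends the study. Type I censoring occurs when the study stops at a predetermined time; any subject that has not experienced the event by then is censored at that time. Type II censoring occurs when the study stops once a predetermined number of events has been observed; the remaining subjects are censored at the time of the last observed event.

**Example:**

In a study tracking time to failure of light bulbs, ending the test after a fixed 1,000 hours while some bulbs are still working is Type I censoring. Ending the test as soon as the 50th bulb fails is Type II censoring.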
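
Under the standard definitions (Type I: stop at a fixed time; Type II: stop after a fixed number of failures), both schemes can be sketched on the same hypothetical bulb lifetimes. Each record is `(recorded_time, event_observed)`; the function names and lifetimes are illustrative.

```python
def type_i_censor(failure_times, study_end):
    """Stop at a fixed time; units still running are right-censored."""
    return [(min(t, study_end), t <= study_end) for t in failure_times]


def type_ii_censor(failure_times, num_failures):
    """Stop at the num_failures-th failure; the rest are right-censored."""
    stop_time = sorted(failure_times)[num_failures - 1]
    return [(min(t, stop_time), t <= stop_time) for t in failure_times]


bulb_hours = [120, 340, 560, 800, 950, 1300]  # hypothetical true lifetimes

print(type_i_censor(bulb_hours, study_end=900))
print(type_ii_censor(bulb_hours, num_failures=3))
```

In the first case the number of observed failures is random and the stopping time is fixed; in the second it is the other way around.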

### Ques 27. What is the Mann-Whitney U test used for?

The Mann-Whitney U test is a nonparametric, rank-based test used to assess whether there is a difference between two independent groups in terms of an ordinal or continuous dependent variable.

**Example:**

Comparing the distributions of test scores between two different teaching methods when assumptions for a t-test are not met.
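
The U statistic itself is simple to compute: it counts, over all cross-group pairs, how often a value from one group exceeds a value from the other (ties count one half). The sketch below shows only the statistic on hypothetical scores, not the p-value.

```python
def mann_whitney_u(a, b):
    """U statistic for sample a: number of (x, y) pairs with x > y, ties = 0.5."""
    return sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in a
        for y in b
    )


group_a = [12, 15, 18, 20]
group_b = [10, 11, 14, 16]

u_a = mann_whitney_u(group_a, group_b)
u_b = mann_whitney_u(group_b, group_a)
print(u_a, u_b)  # 13.0 3.0
```

The two statistics always sum to the number of pairs (here 4 x 4 = 16); a U far from half that total suggests the groups differ.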

### Ques 28. Define Simpson's Paradox.

Simpson's Paradox occurs when a trend appears in different groups of data but disappears or reverses when these groups are combined. It highlights the importance of considering confounding variables in statistical analysis.

**Example:**

A treatment is effective in both men and women, but when the data is combined, it seems ineffective due to differences in baseline characteristics.
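
The reversal is easiest to see with numbers. The counts below are illustrative (they are not from the original answer): the treated arm has the higher recovery rate within each subgroup, yet the lower rate once the subgroups are pooled, because the arms are spread very unevenly across subgroups.

```python
# (recoveries, patients) per subgroup and arm; illustrative counts only.
counts = {
    "men": {"treated": (81, 87), "untreated": (234, 270)},
    "women": {"treated": (192, 263), "untreated": (55, 80)},
}


def rate(recovered, total):
    return recovered / total


def pooled_rate(arm):
    """Recovery rate for one arm after combining all subgroups."""
    recovered = sum(counts[g][arm][0] for g in counts)
    total = sum(counts[g][arm][1] for g in counts)
    return rate(recovered, total)


for group, arms in counts.items():
    print(group, {arm: round(rate(*c), 3) for arm, c in arms.items()})
print("pooled", round(pooled_rate("treated"), 3), round(pooled_rate("untreated"), 3))
```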

### Ques 29. What is the purpose of the Akaike Information Criterion (AIC) in model selection?

The Akaike Information Criterion (AIC) is used for model selection, comparing the goodness of fit of different models fitted to the same data. It penalizes models for having more parameters, encouraging the selection of simpler models that still explain the data well. Among the candidates, the model with the lowest AIC is preferred.

**Example:**

Choosing between linear and quadratic regression models based on AIC values.
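
AIC is defined as 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. The sketch below applies that formula to two hypothetical fitted models; the log-likelihood values and parameter counts are made up for illustration.

```python
def aic(log_likelihood: float, num_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 * ln(L)."""
    return 2 * num_params - 2 * log_likelihood


# Hypothetical maximized log-likelihoods for two models of the same data.
candidates = {
    "linear": aic(log_likelihood=-120.0, num_params=3),
    "quadratic": aic(log_likelihood=-119.5, num_params=4),
}

best = min(candidates, key=candidates.get)
print(candidates, best)
```

In this made-up case the quadratic model fits slightly better, but not by enough to justify its extra parameter, so the linear model has the lower AIC.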

### Ques 30. Explain the concept of bootstrapping in statistics.

Bootstrapping is a resampling technique where multiple random samples are drawn with replacement from the observed data. It is used to estimate the sampling distribution of a statistic and to calculate confidence intervals.

**Example:**

Creating multiple bootstrap samples from a dataset to estimate the uncertainty around the mean.
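
A minimal sketch of a percentile bootstrap confidence interval for the mean is shown below. The data, the number of resamples, and the function name are assumptions; the percentile method is only one of several bootstrap interval constructions.

```python
import random
from statistics import fmean


def bootstrap_mean_ci(data, level=0.95, resamples=5000, seed=0):
    """Percentile bootstrap confidence interval for the mean of `data`."""
    rng = random.Random(seed)
    # Resample with replacement, same size as the original data each time.
    means = sorted(
        fmean(rng.choices(data, k=len(data))) for _ in range(resamples)
    )
    lower_index = int((1 - level) / 2 * resamples)
    upper_index = int((1 + level) / 2 * resamples) - 1
    return means[lower_index], means[upper_index]


observations = [12, 15, 9, 20, 14, 17, 11, 16, 13, 18]  # hypothetical data
lower, upper = bootstrap_mean_ci(observations)
print(round(lower, 2), round(upper, 2))
```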
