- Introduction
- Point Estimation
- Interval Estimation
- Hypothesis Testing
- A/B Testing
- Inference for Relationship
Q. What is statistical inference?
Answer
Statistical inference is the process of inferring something about a population based on what is measured in a sample.
Q. What is point estimation in statistical inference?
Answer
In point estimation, we estimate an unknown parameter using a single number that is calculated from the sample data.
Q. What do you mean by interval estimation?
Answer
In interval estimation, we estimate an unknown parameter using an interval of values that is likely to contain the true value of that parameter and we also state how confident we are that this interval indeed captures the true value of the parameter.
Q. What do we do in hypothesis testing?
Answer
In hypothesis testing, we have some claim about the population and we check whether or not the data obtained from the sample provide evidence against this claim.
Q. A blurb on a box of brand X light bulbs claimed that the mean lifetime of each lightbulb is 750 hours. A random sample of 36 light bulbs was tested in a laboratory, and it was found that their average lifetime is 745 hours. Which form of statistical inference should you use to evaluate whether the data provide enough evidence against the advertised mean lifetime on the box?
- Point Estimation
- Interval Estimation
- Hypothesis Testing
Answer
Hypothesis Testing
Here, we are assessing whether the data provide enough evidence against the claim that the mean lifetime is 750 hours.
Q. A recent poll asked a random sample of 1,100 U.S. adults whether or not they support gay marriage. Based on the results of the poll, the pollsters estimated that the proportion of all U.S. adults who support gay marriage is 0.61. Which form of statistical inference should you use to evaluate this conclusion?
- Point Estimation
- Interval Estimation
- Hypothesis Testing
Answer
Point Estimation
Here, we are using the data to estimate the proportion of all U.S. adults who support gay marriage by a single number (0.61).
Q. Based on data collected from a random sample of 1,200 college freshmen, researchers are 95% confident that the mean number of sleep hours of all college freshmen is between 6 hours and 7.5 hours. Which form of statistical inference should you use to evaluate this conclusion?
- Point Estimation
- Interval Estimation
- Hypothesis Testing
Answer
Interval Estimation
Here, we are estimating the mean number of daily sleep hours of college freshmen by an interval of values (6 to 7.5 hours).
Q. How does the type of variable of interest(categorical/quantitative) determine the type of population parameter we need to infer?
Answer
It depends on the type of variable of interest:
- Categorical variable - In this case the population parameter that we need to infer about is the population proportion ($p$) associated with that variable.
- Quantitative variable - In this case the population parameter that we need to infer about is the population mean ($\mu$) associated with that variable.
Q. Which of the following statements is true in the context of the sample mean?
- Both $\bar{X}$ and $\mu$ are random variables.
- Only $\bar{X}$ is a random variable.
- Both are constant values.
- Only $\mu$ is a random variable.
Answer
Only $\bar{X}$ is a random variable.
Q. A study on exercise habits used a random sample of
The study found the following:
- $818$ of the females in the sample exercise on a regular basis.
- $924$ of the males in the sample exercise on a regular basis.
- The average time that the $1742$ students who exercise on a regular basis ($818 + 924$) spend exercising per week is $4.2$ hours.

- What is the point estimate for the proportion of all female college students who exercise on a regular basis?
- What is the point estimate for the proportion of all college students who exercise on a regular basis?
- Which of the following has a point estimate of $4.2$?
  - The mean time that all college students who exercise on a regular basis spend exercising per week
  - The mean time that all college students spend exercising per week
  - The percentage of all college students who exercise on a regular basis
Answer
Let
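- Since $f = 1220$ and the number of females who exercise on a regular basis is $818$, the point estimate for the proportion of all female college students who exercise on a regular basis is $\hat{p} = \frac{818}{1220} \approx 0.67$.
- Point estimate for the proportion of all college students who exercise on a regular basis ($\hat{p}$):
- The mean time that all college students who exercise on a regular basis spend exercising per week has a point estimate of $4.2$ hours.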
Q. Under what conditions are point estimates truly unbiased estimates of the population parameter?
Answer
The sample should be random and the study design should not be flawed.
Q. A researcher wanted to estimate µ, the mean number of hours that students at a large state university spend exercising per week. The researcher collects data from a sample of 150 students who leave the university gym following a workout.
Which of the following is true regarding x̄, the average number of hours that the 150 sampled students exercise per week?
- It is an unbiased estimate for $\mu$.
- It is not an unbiased estimate for $\mu$ and probably underestimates $\mu$.
- It is not an unbiased estimate for $\mu$ and probably overestimates $\mu$.
Answer
It is not an unbiased estimate for µ and probably overestimates µ. The sample was not a random sample of 150 students from the entire student body: students who leave the university gym following a workout are likely to exercise on a regular basis and therefore tend to exercise more, on average, than students in general.
Q. A study estimated that the mean number of children per family in the United States is 1.3. This point estimate would be unbiased and most accurate if it were based on which of the following?
- A random sample of $10,000$ U.S. families with children from the state of Utah
- A random sample of $500$ U.S. families with children
- A random sample of $5,000$ U.S. families with children
- A random sample of $1,000$ U.S. families
Answer
A random sample of $1,000$ U.S. families
The estimate is based on a random sample (and is therefore unbiased) and is also based on a large sample, which makes it more accurate.
Q. What is the limitation of point estimation?
Answer
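Point estimation is simple and intuitive, but it has a drawback. When we estimate $\mu$ by the sample mean $\bar{x}$, we are almost certain to make some error, and the point estimate alone says nothing about how large that error is likely to be.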
Q. How does interval estimation overcome limitation of point estimation?
Answer
Interval estimation enhances point estimation by supplying information about the size of the error attached to the estimate.
Q. Suppose a random sample of size n is taken from a normal population of values for a quantitative variable whose mean (
Answer
Q. How should we interpret the
Answer
Q. The IQ level of students at a particular university has an unknown mean,
Answer
Given:
- A $90\%$ confidence interval for $\mu$ is $\bar{x} \pm 1.645\frac{\sigma}{\sqrt{n}} = (112.5, 117.5)$
- A $95\%$ confidence interval for $\mu$ is $\bar{x} \pm 2\frac{\sigma}{\sqrt{n}} = (112, 118)$
- A $99\%$ confidence interval for $\mu$ is $\bar{x} \pm 2.576\frac{\sigma}{\sqrt{n}} = (111, 119)$
Q. Explain the trade-off between the level of the confidence and the precision with which the parameter is estimated?
Answer
The price we have to pay for a higher level of confidence is that the unknown population mean is estimated with less precision, i.e. with a wider confidence interval.
Q. Write the general structure of the confidence intervals.
Answer
General form:
$$\bar{x} \pm m$$
Where
- $\bar{x}$ is the sample mean, the point estimator for the unknown population mean ($\mu$)
- $m = z^* \cdot \frac{\sigma}{\sqrt{n}}$ is the margin of error. It represents the maximum estimation error for a given level of confidence.
So we can also write the general form as follows:
$$\bar{x} \pm z^* \cdot \frac{\sigma}{\sqrt{n}}$$
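As a rough illustration of the general form $\bar{x} \pm z^* \cdot \frac{\sigma}{\sqrt{n}}$, the sketch below computes the interval for three confidence levels. The function name `z_confidence_interval` and the numbers (sample mean 115, $\sigma = 15$, $n = 100$) are hypothetical, chosen only for illustration. It uses the exact normal quantile, so the $95\%$ multiplier is about $1.96$ rather than the rounded value $2$.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Return (lower, upper) for the population mean when sigma is known."""
    # z* leaves (1 - confidence) / 2 in each tail of the standard normal.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Hypothetical inputs: sample mean 115, sigma 15, n = 100.
for level in (0.90, 0.95, 0.99):
    low, high = z_confidence_interval(115, 15, 100, level)
    print(f"{level:.0%} CI: ({low:.2f}, {high:.2f}), width {high - low:.2f}")
```

The printed widths grow with the confidence level, which is the confidence versus precision trade-off in numbers.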
Q. Explain margin of error in interval estimation. What value does it encode?
Answer
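Expression of margin of error ($m$):
$$m = z^* \cdot \frac{\sigma}{\sqrt{n}}$$
Where $z^*$ is the confidence multiplier, $\sigma$ is the population standard deviation and $n$ is the sample size.
The margin of error ($m$) determines the width (or precision) of the confidence interval.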
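Q. How can we reduce the margin of error?
Answer
By taking a larger sample: for a fixed confidence level, the margin of error shrinks in proportion to $\frac{1}{\sqrt{n}}$.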
Q. Find the general expression for the required
Answer
We have,
On solving for
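Assuming the question asks for the sample size needed to reach a desired margin of error, rearranging $m = z^* \cdot \frac{\sigma}{\sqrt{n}}$ gives $n = \left(\frac{z^* \sigma}{m}\right)^2$, rounded up to the next whole number. The sketch below is one way to compute it; `required_sample_size` and the example inputs are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(sigma, margin, confidence=0.95):
    """Smallest n with z* * sigma / sqrt(n) <= margin (sigma assumed known)."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Round up: a fractional sample size must become the next whole number.
    return ceil((z_star * sigma / margin) ** 2)


# Hypothetical inputs: sigma = 15, desired margin of error 3, 95% confidence.
print(required_sample_size(sigma=15, margin=3, confidence=0.95))
```

Because $n$ grows with $\frac{1}{m^2}$, halving the margin of error roughly quadruples the required sample size.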
Q. Suppose that based on a random sample, a
Answer
0.2
Let
Confidence interval can be given by
On comparing the actual interval
and,
On solving for
Q. IQ scores are known to vary normally with a standard deviation of
Answer
374
We have,
On putting
Q. In which case is it not safe to use a confidence interval developed using the CLT?
- Variable varies normally and sample size is small ($n < 30$)
- Variable varies normally and sample size is large
- Variable does not vary normally and sample size is small
- Variable does not vary normally and sample size is large
Answer
- Variable does not vary normally and sample size is small
In this case we can use non-parametric methods.
Q. How should we calculate confidence interval when the population standard deviation
Answer
We can replace population standard deviation(
The confidence interval for the population mean (
Note that the quantity
Q. True/False For large values of
Answer
True
Q. Write the expression of confidence interval when variable of interest is categorical?
Answer
The confidence interval for the population proportion
Q. A poll asked a random sample of
- Based on the poll's results, estimate p, the proportion of all U.S. adults who believe the use of marijuana should be legalized, with a 95% confidence interval.
- Give an interpretation of the margin of error in context.
- Do the results of this poll give evidence that the majority of U.S. adults believe that the use of marijuana should be legalized?
Answer
- The sample proportion $\hat{p}$ is $\frac{560}{1000} = 0.56$ and therefore the $95\%$ confidence interval for $p$ is:
$$\hat{p} \pm 2\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} = 0.56 \pm 2\sqrt{\frac{0.56 \times 0.44}{1000}} \approx 0.56 \pm 0.03 = (0.53, 0.59)$$
So we are $95\%$ confident that the proportion of all U.S. adults who believe that the use of marijuana should be legalized is between $0.53$ and $0.59$.
- The margin of error is $0.03$, i.e. $3\%$. With $95\%$ certainty, the sample proportion we got ($56\%$) is within $3\%$ of (or no more than $3\%$ away from) the proportion of all U.S. adults who believe that the use of marijuana should be legalized.
- Yes. All of the values in our $95\%$ confidence interval for $p$, $(.53, .59)$, which represents the set of plausible values for $p$, lie above $.5$, which provides evidence (at the $95\%$ confidence level) that the majority of U.S. adults believe that the use of marijuana should be legalized.
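The interval in this answer can be checked with a few lines of code. This is a minimal sketch, assuming the poll had $560$ supporters out of $1000$ respondents (consistent with $\hat{p} = 0.56$) and using the rounded multiplier $z^* = 2$ for $95\%$ confidence; the function name is hypothetical.

```python
from math import sqrt


def proportion_confidence_interval(successes, n, z_star=2.0):
    """Return (p_hat, margin, (lower, upper)) for a population proportion."""
    p_hat = successes / n
    # Standard error of p_hat, estimated from the sample itself.
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, margin, (p_hat - margin, p_hat + margin)


p_hat, margin, (low, high) = proportion_confidence_interval(560, 1000)
print(f"p_hat = {p_hat:.2f}, margin = {margin:.3f}, CI = ({low:.2f}, {high:.2f})")
print("Entire interval above 0.5:", low > 0.5)
```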
Q. Under what condition we can use to construct CI in case of estimating
Answer
Q. Suppose that you take
Answer
Given:
On substituting all the values, we get:
So, the
Q. Suppose that we examine
- Given a random newborn puppy, its weight has a $95\%$ chance of being between $0.9$ and $1.1$ pounds.
- If we examine another $100$ newborn puppies, their mean has a $95\%$ chance of being in that interval.
- We're $95\%$ confident that this interval captured the true mean weight.
Answer
Let's look at the other statements:
Q. Suppose we have a random variable X supported on
Answer
Q. The weight of newborn puppies is roughly symmetric with a mean of 1 pound and a standard deviation of 0.12. Your favorite newborn puppy weighs 1.1 pounds.
1. Calculate your puppy's z-score (standard score).
2. How much does your newborn puppy have to weigh to be in the top 10% in terms of weight?
3. Suppose the weight of newborn puppies followed a skewed distribution. Would it still make sense to calculate z-scores?
Answer
Given:
- $$z = \frac{x_i - \mu}{\sigma} = \frac{1.1 - 1}{0.12} \approx 0.83$$
- Let $x$ be the weight needed to be in the top $10\%$. For this condition, $z \geq z_{0.9}$. Using a lookup table we get $z \geq 1.28$.
We can write the expression for the z-score:
$$\frac{x - 1}{0.12} \geq 1.28$$
On solving for $x$:
$$x \geq 1 + 1.28 \times 0.12 \approx 1.15$$
So, the puppy's weight should be at least about $1.15$ pounds.
- If the distribution of newborn puppy weights is skewed rather than normal, z-scores can no longer be translated into percentiles with the normal table, so it does not make sense to base decisions on them.
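The first two parts can be reproduced with Python's standard library. This is a small sketch that assumes the weights are normally distributed (the question only says "roughly symmetric"); variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 1.0, 0.12          # mean and standard deviation from the question
weights = NormalDist(mu, sigma)

# Part 1: z-score of a 1.1 pound puppy.
z = (1.1 - mu) / sigma

# Part 2: the top 10% starts at the 90th percentile of the distribution.
cutoff = weights.inv_cdf(0.90)
z_cutoff = (cutoff - mu) / sigma

print(f"z-score of 1.1 lb puppy: {z:.2f}")
print(f"90th percentile: {cutoff:.3f} lb (z = {z_cutoff:.2f})")
```

Note that the 90th percentile corresponds to a z-score of about $1.28$; the more familiar $1.645$ is the 95th percentile.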
Q. When should you use a Z-Test instead of a T-Test?
Answer
Q. Define statistical hypothesis testing. Explain in detail how it works.
Answer
Statistical hypothesis testing is the process of assessing the evidence provided by the data in favour of or against some claim about the population.
Here is how the process of statistical hypothesis testing works:
- We have two claims about what is going on in the population. Let's call them for now claim 1 and claim 2.
- We choose a sample, collect relevant data and summarize them.
- We figure out how likely it is to observe data like the data we got, had claim 1 been true.
- Based on what we found in the previous step, we make our decision:
  - If we find that if claim 1 were true it would be extremely unlikely to observe the data that we observed, then we have strong evidence against claim 1, and we reject it in favor of claim 2.
  - If we find that if claim 1 were true observing the data that we observed is not very unlikely, then we do not have enough evidence against claim 1, and therefore we cannot reject it in favor of claim 2.
Q. Define
Answer
The p-value is the probability of getting data like those observed (or even more extreme) when the null hypothesis $H_0$ is true.
By "extreme" we mean extreme in the direction of the alternative hypothesis.
Specifically, for the z-test for the population proportion:
- If the alternative hypothesis is $H_a : p < p_0$ (less than), then "extreme" means small, and the p-value is the probability of observing a test statistic as small as that observed or smaller if the null hypothesis is true.
- If the alternative hypothesis is $H_a : p > p_0$ (greater than), then "extreme" means large, and the p-value is the probability of observing a test statistic as large as that observed or larger if the null hypothesis is true.
- If the alternative hypothesis is $H_a : p \neq p_0$ (different from), then "extreme" means extreme in either direction, small or large (i.e., large in magnitude), and the p-value is the probability of observing a test statistic as large in magnitude as that observed or larger if the null hypothesis is true.
Q. State the steps involved in hypothesis testing for the population proportion.
Answer
- State the appropriate null and alternative hypotheses, $H_0$ and $H_a$.
- Obtain the data from a sample and:
  - Check whether the data satisfy the following conditions:
    - Random sample
    - $n \cdot p_0 \geq 10$, $n \cdot (1 - p_0) \geq 10$
  - Calculate the sample proportion $\hat{p}$ and the test statistic $z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$.
- Find the p-value of the test:
  - for $H_a : p < p_0$: $P(Z \leq z)$
  - for $H_a : p > p_0$: $P(Z \geq z)$
  - for $H_a : p \neq p_0$: $2P(Z \geq |z|)$
- Based on the p-value, decide whether or not the results are significant, and draw your conclusions in context.
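The steps above can be sketched as a small function. This is an illustrative implementation of the z-test for a population proportion, not a reference one: the function name, the `alternative` labels and the example counts ($116$ successes in $200$ trials against $p_0 = 0.5$) are all hypothetical. The two-sided p-value counts both tails, hence the factor of $2$.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(successes, n, p0, alternative="two-sided"):
    """Return (z, p_value) for H0: p = p0 using the normal approximation."""
    # Condition check from the steps above (random sampling must be ensured separately).
    if n * p0 < 10 or n * (1 - p0) < 10:
        raise ValueError("need n*p0 >= 10 and n*(1-p0) >= 10")
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    std_normal = NormalDist()
    if alternative == "less":
        p_value = std_normal.cdf(z)
    elif alternative == "greater":
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "two-sided":
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")
    return z, p_value


# Hypothetical data: 116 successes out of 200, testing H0: p = 0.5.
z_stat, p_val = one_proportion_z_test(116, 200, 0.5)
print(f"z = {z_stat:.2f}, p-value = {p_val:.4f}")
```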
Q. Explain significance level of the test and its use in hypothesis testing.
Answer
The significance level of the test is usually denoted by the Greek letter $\alpha$. It is the cutoff against which the p-value is compared:
- If the p-value $< \alpha$ (usually $.05$), then the data we got are considered to be "rare (or surprising) enough" when $H_0$ is true, and we say that the data provide significant evidence against $H_0$, so we reject $H_0$ and accept $H_a$.
- If the p-value $> \alpha$ (usually $.05$), then our data are not considered to be "surprising enough" when $H_0$ is true, and we say that our data do not provide enough evidence to reject $H_0$ (or, equivalently, that the data do not provide enough evidence to accept $H_a$).
Q. In hypothesis testing when do we conclude the statistical significance of the result?
Answer
It depends on how the p-value compares with the significance level $\alpha$:
- The results are statistically significant when the p-value $< \alpha$.
- The results are not statistically significant when the p-value $> \alpha$.
Q. There are rumors that students in a certain liberal arts college are more inclined to use drugs than U.S. college students in general. Suppose that in a simple random sample of
Answer
- State the hypothesis:
- Calculate test-statistic
$z$ :
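- Calculate the p-value. Using software we can get:
$$p\text{-value} = 0.035$$
- Make a conclusion based on the p-value and $\alpha$:
For the default value of $\alpha = 0.05$, the p-value ($0.035$) is smaller than $\alpha$, so we reject $H_0$ at the $5\%$ significance level.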
Q. Write the general form that can be taken by null hypothesis
Answer
General form of null hypothesis:
The alternative hypothesis takes one of the following three forms (depending on the context):
Q. How is the null hypothesis related to the confidence interval? Explain it with an example.
Answer
Suppose we want to test $H_0 : \mu = \mu_0$ against the two-sided alternative $H_a : \mu \neq \mu_0$ using a confidence interval for $\mu$:
- If $\mu_0$ falls outside the confidence interval, reject $H_0$.
- If $\mu_0$ falls inside the confidence interval, do not reject $H_0$.
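As a small example of this rule, the sketch below reuses the $95\%$ interval $(112, 118)$ for the mean IQ that appears earlier in these notes. The helper `reject_null` and the two candidate values of $\mu_0$ are hypothetical. The decision corresponds to a two-sided test at $\alpha = 0.05$, because the interval has $95\%$ confidence.

```python
def reject_null(mu_0, interval):
    """Reject H0: mu = mu_0 when mu_0 is not a plausible value in the interval."""
    low, high = interval
    return not (low <= mu_0 <= high)


ci_95 = (112, 118)  # 95% confidence interval for the mean IQ

print(reject_null(120, ci_95))  # True: 120 lies outside the interval
print(reject_null(115, ci_95))  # False: 115 lies inside the interval
```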
Q. Explain the difference between z-distribution and t-distribution.
Answer
Q. When is the t-distribution useful?
Answer
Q. What is Type I and Type II error in case of hypothesis testing?
Answer
- Type I Error - The null hypothesis is true but we reject it.
- Type II Error - The null hypothesis is false but we fail to reject it.
Q. What do you mean by test statistic in hypothesis testing?
Answer
The test statistic captures the essence of the test. The larger the test statistic (in magnitude), the further the data are from the null hypothesis $H_0$.
Q. Tossing a coin fifteen times resulted in 10 heads and 5 tails. How would you analyze whether a coin is fair?
Answer
Given:
Let's state the hypotheses and then, based on the evidence we have from the observations, assess them using a test statistic.
- Null Hypothesis ($H_0$) - Coin is fair, i.e. $p(head) = p(tail) = \frac{1}{2}$
- Alternate Hypothesis ($H_A$) - Coin is not fair, i.e. $p(head) \neq \frac{1}{2}$
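Since the distribution is binomial in this case with $n = 15$ and, under $H_0$, $p = q = \frac{1}{2}$.
We can find the test statistic (z-score) using the following approaches:
- Under Normal distribution approximation
- Exact binomial distribution calculation
Under Normal distribution approximation:
The binomial distribution can be approximated by a normal distribution if the following condition is satisfied:
- $np > 5$ and $nq > 5$
In this case we have $np = nq = 15 \times 0.5 = 7.5 > 5$, so the approximation is reasonable.
Test statistic under $H_0$:
- Mean ($\mu$): $\mu = np = 7.5$
- Standard deviation ($\sigma$): $\sigma = \sqrt{npq} = \sqrt{3.75} \approx 1.94$
$$z = \frac{x - \mu}{\sigma} = \frac{10 - 7.5}{1.94} \approx 1.29$$
Now with significance level ($\alpha$) $= 0.05$, the two-sided critical value is $z_{\alpha/2} = 1.96$.
Since $|z| \approx 1.29 < 1.96$, we fail to reject $H_0$.
So, the coin seems fair from the given observations.
Exact binomial distribution calculation:
We can do a two-sided test that rejects $H_0$ when the number of heads is too far from $7.5$ in either direction.
For observed value $x = 10$, the outcomes at least as extreme are $X \geq 10$ and $X \leq 5$.
We have,
$$P(X \geq 10) = \sum_{k=10}^{15} \binom{15}{k} \left(\frac{1}{2}\right)^{15} = \frac{4944}{32768} \approx 0.151$$
and, by symmetry,
$$P(X \leq 5) \approx 0.151$$
Putting these values together to get the p-value:
$$p\text{-value} = P(X \geq 10) + P(X \leq 5) \approx 0.302$$
Since the p-value $> \alpha = 0.05$, we again fail to reject $H_0$.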
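Both approaches to the coin problem can be checked numerically. The sketch below assumes $\alpha = 0.05$ and uses no continuity correction for the normal approximation; the exact test sums the probabilities of all outcomes that are no more likely than the observed one.

```python
from math import comb, sqrt
from statistics import NormalDist

n, heads, p = 15, 10, 0.5   # 15 tosses, 10 heads, fair coin under H0
alpha = 0.05                # assumed significance level

# Normal approximation: z-score of the observed number of heads.
mu = n * p
sigma = sqrt(n * p * (1 - p))
z = (heads - mu) / sigma
p_normal = 2 * (1 - NormalDist().cdf(abs(z)))


def binom_pmf(k):
    """Probability of exactly k heads in n tosses under H0."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Exact two-sided p-value: outcomes at most as likely as the observed one.
observed = binom_pmf(heads)
p_exact = sum(binom_pmf(k) for k in range(n + 1)
              if binom_pmf(k) <= observed * (1 + 1e-9))

print(f"z = {z:.2f}, normal-approximation p-value = {p_normal:.3f}")
print(f"exact binomial p-value = {p_exact:.3f}")
print("Reject H0:", p_exact < alpha)
```

The two p-values differ noticeably because $n = 15$ is small, but both are well above $0.05$, so neither approach rejects the hypothesis that the coin is fair.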
Q. Statistical significance.
- How do you assess the statistical significance of a pattern whether it is a meaningful pattern or just by chance?
- What’s the distribution of p-values?
- Recently, a lot of scientists started a war against statistical significance. What do we need to keep in mind when using p-value and statistical significance?
Answer
- We can assess the statistical significance of a pattern using hypothesis testing. We can conduct hypothesis testing by following the steps below:
  - Formulate the null hypothesis ($H_0$) and alternate hypothesis ($H_A$) carefully.
  - Collect data and calculate a test statistic that measures the strength of the observed pattern.
  - Use this test statistic to calculate a p-value, which represents the probability of obtaining such results (or more extreme) if the null hypothesis were true.
  - If the p-value is very small (typically less than a chosen significance level, often 0.05), you reject the null hypothesis and conclude that the pattern is statistically significant.
  - Otherwise, you fail to reject the null hypothesis, suggesting the pattern may be due to chance.
- The distribution of the p-value depends on which hypothesis is true:
  - Under the null hypothesis ($H_0$), p-values are uniformly distributed on $[0, 1]$ (for a continuous test statistic).
  - Under the alternate hypothesis ($H_a$), the distribution of p-values is right-skewed, i.e. concentrated near $0$.
- There are some limitations of p-values and statistical significance, so they should be used with caution:
  - P-values only provide information about the probability of obtaining results under the null hypothesis.
  - They do not tell you the magnitude of an effect or its practical significance.
  - A small p-value doesn't prove that a result is practically significant or that it has any real-world importance.
  - Statistical significance should be considered in the context of the study design, sample size, and the relevance of the result to the research question.
  - Always consider effect sizes, confidence intervals, and the domain-specific context of the research in addition to p-values.
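The claim about the distribution of p-values can be illustrated by simulation. This is a rough sketch under assumed settings: a two-sided z-test of $H_0: \mu = 0$ with known $\sigma = 1$, samples of size $30$, and a hypothetical true mean of $0.5$ for the alternative. Under $H_0$ about $5\%$ of p-values should fall below $0.05$ (uniform distribution); under the alternative a much larger share does.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def simulated_p_values(true_mean, n_experiments=5000, n=30, seed=0):
    """Two-sided z-test p-values for H0: mu = 0 with known sigma = 1."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    p_values = []
    for _ in range(n_experiments):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        z = fmean(sample) / (1.0 / sqrt(n))
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values


def share_below(p_values, cutoff=0.05):
    """Fraction of simulated p-values smaller than the cutoff."""
    return sum(p < cutoff for p in p_values) / len(p_values)


null_share = share_below(simulated_p_values(true_mean=0.0))
alt_share = share_below(simulated_p_values(true_mean=0.5))
print(f"share of p < 0.05 under H0: {null_share:.3f}")
print(f"share of p < 0.05 under Ha: {alt_share:.3f}")
```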
Q. Variable correlation.
- What happens to a regression model if two of their supposedly independent variables are strongly correlated?
- How do we test for independence between two categorical variables?
- How do we test for independence between two continuous variables?
Answer
- If predictors of a regression model are highly correlated, we might have the following issues:
  - Interpretation becomes harder: it is difficult to isolate the impact of an individual feature on the final outcome.
  - If two variables are highly correlated, it makes sense to drop one of them, because redundant predictors increase the model's complexity and may cause overfitting.
- We can use the Chi-squared test of independence to assess the relationship between two categorical variables.
Here are the steps to conduct the test:
Step 1: Formulate Hypotheses:
- Null Hypothesis (H0): There is no association between the two categorical variables (they are independent).
- Alternative Hypothesis (H1): There is an association between the two categorical variables.
Step 2: Create a Contingency Table:
Construct a contingency table (also known as a cross-tabulation or a two-way table) that summarizes the frequencies or counts of the combinations of the two categorical variables.
Example:

| | Category A | Category B | Total |
|---|---|---|---|
| Group 1 | n11 | n12 | n1. |
| Group 2 | n21 | n22 | n2. |
| Total | n.1 | n.2 | N |
Step 3: Calculate Expected Frequencies:
Compute the expected frequencies for each cell in the contingency table under the assumption of independence. The formula for the expected frequency in a cell (e.g., n11) is:
$$E_{11} = \frac{n_{1.} \times n_{.1}}{N}$$
Repeat this calculation for all cells in the table.
Step 4: Calculate the Chi-squared Statistic:
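Calculate the Chi-squared ($\chi^2$) statistic using the formula:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
Where:
- $\chi^2$ is the Chi-squared statistic.
- $O$ is the observed frequency in each cell.
- $E$ is the expected frequency in each cell.
- The summation ($\sum$) is taken over all cells.
Step 5: Determine Degrees of Freedom:
Determine the degrees of freedom ($df$) as $df = (r - 1)(c - 1)$, where $r$ is the number of rows and $c$ is the number of columns of the contingency table.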
Step 6: Calculate the p-value:
Using the Chi-squared statistic and the degrees of freedom, we can calculate the p-value associated with the Chi-squared distribution. This p-value represents the probability of observing a Chi-squared statistic as extreme as the one computed if the variables were independent.
Step 7: Make a Decision:
Compare the calculated p-value to a significance level (e.g., $\alpha = 0.05$) to decide whether to reject the null hypothesis:
- If the p-value is less than $\alpha$, we can reject the null hypothesis and conclude that there is a statistically significant association between the two categorical variables.
- If the p-value is greater than or equal to $\alpha$, we fail to reject the null hypothesis, indicating no significant association.
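The Chi-squared steps above can be sketched for the simplest case, a $2 \times 2$ table. This is an illustrative implementation only: the function name and the counts are hypothetical, no continuity correction is applied, and the p-value formula used here is valid only for one degree of freedom, where $P(\chi^2_1 > x) = \operatorname{erfc}\left(\sqrt{x/2}\right)$.

```python
from math import erfc, sqrt


def chi_squared_2x2(table):
    """Return (statistic, p_value) for a 2x2 table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / N.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    # df = (2 - 1) * (2 - 1) = 1, so the upper tail has a closed form.
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: rows are groups, columns are categories.
stat, p = chi_squared_2x2([[30, 20], [20, 30]])
print(f"chi-squared = {stat:.2f}, p-value = {p:.4f}")
```

For larger tables the degrees of freedom exceed one and the p-value needs the general Chi-squared distribution, which the standard library does not provide.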
- To test for independence between two continuous variables, we can use a correlation coefficient, such as the Pearson correlation coefficient ($r$) or the Spearman rank correlation coefficient ($\rho$).
  - Pearson Correlation ($r$): The Pearson correlation coefficient is used to measure linear relationships. It ranges from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship), with 0 indicating no linear relationship.
    Expression of $r$:
    $$r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2 \sum_{i}(y_i - \bar{y})^2}}$$
  - Spearman Rank Correlation ($\rho$): The Spearman rank correlation coefficient is a non-parametric measure of monotonic (not necessarily linear) relationships. It is the Pearson correlation computed on the ranks of the data.
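To make the difference between the two coefficients concrete, the sketch below computes both by hand on a small hypothetical dataset where $y = x^2$: the relationship is perfectly monotonic but not linear, so Spearman's $\rho$ equals $1$ while Pearson's $r$ is slightly below $1$. The helper names are illustrative, and only the coefficients are computed, not their p-values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


xs = [1, 2, 3, 4, 5]
ys = [x**2 for x in xs]  # monotonic but non-linear
print(f"Pearson r = {pearson(xs, ys):.3f}, Spearman rho = {spearman(xs, ys):.3f}")
```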
Q. What is the difference between parametric and non-parametric tests in machine learning?
Answer
Q. A/B testing is a method of comparing two versions of a solution against each other to determine which one performs better. What are some of the pros and cons of A/B testing?
Answer
A/B testing is a powerful technique used in a variety of fields including web design, marketing, product development, and more, to make data-driven decisions.
Pros:
- Empirical Evidence - A/B testing provides concrete, quantitative data on how two versions of a solution perform against each other. It mitigates the errors of naive assumptions or gut feelings.
- Risk Mitigation - It allows testing changes on a small portion of users before rolling them out to everyone, reducing the risk of implementing a change that could negatively affect the user experience.
- Data-driven decision making
Cons:
- Time-consuming setup - Setting up A/B tests can be time consuming, especially if you're testing minor changes.
- Limited by sample size - To get statistically significant results, a large sample size is needed.
- Can be misleading - If not properly designed (e.g., not randomizing the sample properly), tests can produce misleading results. Misinterpretation of results can lead to incorrect conclusions.
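One common way to analyze the outcome of an A/B test on conversion rates is a two-proportion z-test. The notes above do not prescribe a specific test, so the sketch below is just one reasonable choice; the function name and the visitor and conversion counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test of H0: both versions have the same conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Under H0 the best estimate of the common rate pools both groups.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: 5000 visitors per version,
# 200 conversions for version A and 250 for version B.
z_ab, p_ab = two_proportion_z_test(200, 5000, 250, 5000)
print(f"z = {z_ab:.2f}, p-value = {p_ab:.4f}")
```

With smaller groups the same difference in conversion rates would give a larger p-value, which is the "limited by sample size" drawback in practice.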
Q. You want to test which of the two ad placements on your website is better. How many visitors and/or how many times each ad is clicked do we need so that we can be
Answer
Q. Your company runs a social network whose revenue comes from showing ads in newsfeeds. To double revenue, your coworker suggests that you should just double the number of ads shown. Is that a good idea? How do you find out?
Answer
Q. Imagine that you have the prices of 10,000 stocks over the last 24-month period and you only have the price at the end of each month, which means you have 24 price points for each stock. After calculating the correlations of
- What’s the probability that this happens by chance?
- How to avoid this kind of accidental pattern?
Answer
Q. How are sufficient statistics and the Information Bottleneck Principle used in machine learning?
Answer
Sufficient statistics and the Information Bottleneck (IB) Principle are fundamental concepts in statistics and information theory that have found important applications in machine learning.
Sufficient Statistics
A sufficient statistic is a statistic that captures all the information about a parameter of interest in a dataset, without needing to access the entire dataset again.
Use in Machine Learning:
- Parameter Estimation: Sufficient statistics are used in parameter estimation for probabilistic models. For example, in estimating the parameters of a Gaussian distribution, the sample mean and sample variance are jointly sufficient statistics for the distribution's mean and variance. This reduces computational complexity and storage requirements.
- Model Simplification: By identifying sufficient statistics, machine learning practitioners can simplify models and reduce the dimensionality of data, focusing only on the parts of the data that are informative for the task at hand.
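A small sketch of the Gaussian example: the three running totals $n$, $\sum x$ and $\sum x^2$ are enough to recover the maximum-likelihood estimates of the mean and variance, so the raw data never need to be stored. The class name `GaussianSummary` and the data are hypothetical, and this simple formula can lose precision for very large or badly scaled data.

```python
class GaussianSummary:
    """Keeps only the sufficient statistics of a Gaussian sample."""

    def __init__(self):
        self.n = 0
        self.total = 0.0      # running sum of x
        self.total_sq = 0.0   # running sum of x^2

    def update(self, x):
        """Fold one observation into the summary and then forget it."""
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def mean(self):
        return self.total / self.n

    def variance(self):
        """Maximum-likelihood variance estimate (divides by n, not n - 1)."""
        m = self.mean()
        return self.total_sq / self.n - m * m


summary = GaussianSummary()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # hypothetical stream
    summary.update(value)

print(f"n = {summary.n}, mean = {summary.mean():.2f}, variance = {summary.variance():.2f}")
```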
Information Bottleneck Principle
The Information Bottleneck (IB) Principle is a method for finding the relevant information in a random variable $X$ about another variable $Y$. It seeks to compress $X$ into a compact representation $T$ that preserves as much information about $Y$ as possible. The principle balances the trade-off between compression (minimizing the mutual information $I(T;X)$) and prediction accuracy (maximizing the mutual information $I(T;Y)$).
Use in Machine Learning:
- Feature Selection and Dimensionality Reduction: The IB principle can guide the selection of features that are most informative of the target variable, effectively reducing the dimensionality of the input data while retaining its predictive power.
- Deep Learning Architectures: In deep neural networks, the layers can be viewed as performing successive information bottleneck operations, progressively compressing the input data into representations that are increasingly informative about the output. This perspective helps in understanding and designing neural network architectures.
- Clustering and Data Compression: The IB principle is used in clustering and data compression algorithms to find compact representations of the data that preserve relevant information for specific tasks.
- Regularization and Generalization: By focusing on the information that is relevant for predicting the target variable, the IB principle can help in designing regularization techniques that improve the generalization of machine learning models.