2.1 Descriptive vs Inferential Statistics
Use descriptive statistics to summarize and graph the data for a group that you choose. This process allows you to understand that specific set of observations.
Descriptive statistics frequently use the following statistical measures to describe groups:
- Central tendency: where the center of the data lies, e.g. the mean, median, or mode.
- Dispersion/Variability: how far from the center the data extend, e.g. the range, interquartile range, or standard deviation.
- Skewness: tells you whether the distribution of values is symmetric or skewed.
Inferential statistics takes data from a sample and makes inferences about the larger population from which the sample was drawn. Because the goal is to draw conclusions from a sample and generalize them to a population, we need confidence that our sample accurately reflects the population. The most common inferential methods are:
- Hypothesis tests
- Confidence intervals (CIs)
- Regression analysis
A population has parameters such as the mean, median, mode, and standard deviation.
A sample is a random subset of the population. A sample doesn’t have parameters; it has statistics.
As we’ve seen, you take a sample to estimate the parameters of the whole population. However, a single sample will not always give you an accurate estimate of the population’s true parameters.
Instead of taking a single sample, what if we take several samples from the population and calculate the mean of each? We end up with many estimates of the mean, which we can plot on a chart.
This is called the sampling distribution of the sample mean.
The Central Limit Theorem gives us the following certainties:
- The sampling distribution will be normal or close to normal;
- The mean of the sampling distribution will equal the mean of the population;
- The standard error of the sampling distribution equals the population standard deviation divided by √n, where n is the number of values in each sample.
Central Limit Theorem
According to the theorem, sample statistics—in particular sample means—obtained (theoretically) under repeated random sampling follow a normal distribution centered over the true population mean and with a standard deviation equal to the population standard deviation divided by the square root of the size of the (repeated) samples.
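The theorem is easy to verify by simulation. The sketch below (assuming NumPy; the exponential population and the sample sizes are arbitrary choices) draws many samples from a heavily skewed population and shows that the sample means cluster around the population mean with spread σ/√n:

```python
import numpy as np

rng = np.random.default_rng(0)

# Skewed parent population: exponential with mean 2 (its sd is also 2).
population_mean, population_sd = 2.0, 2.0
n = 50            # size of each repeated sample
num_samples = 10_000

# Draw many samples and record each sample's mean.
sample_means = rng.exponential(scale=2.0, size=(num_samples, n)).mean(axis=1)

# CLT: the sampling distribution centers on the population mean,
# with standard deviation (standard error) sigma / sqrt(n).
print(sample_means.mean())       # close to 2.0
print(sample_means.std())        # close to 2 / sqrt(50) ≈ 0.283
```

Plotting `sample_means` as a histogram shows a near-normal bell shape even though the parent population is strongly right-skewed.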
Standard Normal (z) distribution: Equally important, the area of any region under a normal distribution can be obtained by transforming the original data to have a mean of 0 and a standard deviation of 1:

z = (x − μ) / σ

where x is the point of interest under the original distribution, μ is the population mean, and σ is the population standard deviation.
For hypothesis testing and confidence interval construction on a sample mean, σ is replaced by the standard deviation of the sampling distribution, σ/√n (called the standard error).
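As a worked example of the transformation (the μ = 100, σ = 15 scale and the sample values are hypothetical choices):

```python
import math

# z-transform: z = (x - mu) / sigma maps a value from any normal
# distribution onto the standard normal (mean 0, sd 1).
mu, sigma = 100.0, 15.0       # hypothetical population parameters
x = 130.0
z = (x - mu) / sigma
print(z)                      # 2.0

# For inference about a sample mean, replace sigma with the
# standard error sigma / sqrt(n).
n = 36
standard_error = sigma / math.sqrt(n)
z_mean = (105.0 - mu) / standard_error
print(standard_error, z_mean)  # 2.5, 2.0
```

Note how the same distance from μ produces a much larger z when measured against the standard error, because averaging n values shrinks the spread.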
Following are different ways to state Central Limit Theorem in one statement:
- The Central Limit Theorem states that the sampling distribution of the mean of any independent, random variable will be normal or nearly normal if the sample size is large enough.
- In simple terms, if we had a large population divided into samples, then the mean of all the samples from the population will be almost equal to the mean of the entire population.
- The central limit theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger no matter what the shape of the population distribution.
Q. How do you check whether a distribution is normal?
There are three general ways to determine if a distribution is normal.
- The first way is visually checking with a histogram.
- A more accurate way of checking this is by calculating the skewness of the distribution.
- The third way is to conduct formal tests to check for normality — some common tests include the Kolmogorov-Smirnov test (K-S) and Shapiro-Wilk (S-W) test. Essentially, these tests compare a set of data against a normal distribution with the same mean and standard deviation of your sample.
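The second and third checks can be sketched with SciPy (the simulated datasets and sample sizes are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_data = rng.normal(loc=0.0, scale=1.0, size=500)
skewed_data = rng.exponential(scale=1.0, size=500)

# Skewness: near 0 for symmetric data, clearly positive for exponential data.
skew_normal = stats.skew(normal_data)
skew_skewed = stats.skew(skewed_data)
print(skew_normal, skew_skewed)

# Shapiro-Wilk: H0 is "the data come from a normal distribution",
# so a small p-value is evidence against normality.
_, p_normal = stats.shapiro(normal_data)
_, p_skewed = stats.shapiro(skewed_data)
print(p_normal)   # typically large for truly normal data
print(p_skewed)   # tiny: normality is rejected for the exponential sample
```

A histogram of each dataset (the first, visual check) would show the same story: a bell shape for `normal_data` and a long right tail for `skewed_data`.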
Standard deviation of binomial: σ = √(n·p·(1 − p)).
A binomial distribution can be approximated by a normal distribution if n·p̂ > 5 and n·(1 − p̂) > 5.
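This rule of thumb is a one-line check; `binomial_normal_ok` is a hypothetical helper name:

```python
def binomial_normal_ok(n, p_hat, threshold=5):
    """Rule of thumb: the binomial is roughly normal when both the
    expected successes n*p_hat and expected failures n*(1 - p_hat)
    exceed the threshold."""
    return n * p_hat > threshold and n * (1 - p_hat) > threshold

print(binomial_normal_ok(100, 0.5))   # True: 50 expected successes, 50 failures
print(binomial_normal_ok(20, 0.1))    # False: only 2 expected successes
```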
Read (30 mins): https://www.youtube.com/watch?v=VK-rnA3-41c
- Hypothesis: A premise or claim that we want to test/investigate.
- Null hypothesis H0: the currently established/accepted value of a parameter.
- Alternative hypothesis Ha: also called the research hypothesis; it carries the claim to be tested.
- H0 and Ha are mathematical opposites.
- Level of confidence C: how confident we are in our decision, e.g. 95% or 99%.
- Level of significance: α = 1 − C.
- For example, with a level of confidence of 95%, C = 0.95 and α = 0.05.
- P-value: the probability of obtaining a sample “more extreme” than the one observed in your data, assuming H0 is true.
- In statistical hypothesis testing, Type I error is the rejection of a true null hypothesis (also known as a “false positive” finding or conclusion; example: “an innocent person is convicted”)
α = P(reject H0 | H0 true)
- Type II error is the non-rejection of a false null hypothesis (also known as a “false negative” finding or conclusion; example: “a guilty person is not convicted”)
β = P(fail to reject H0 | H0 false)
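To make the p-value definition concrete, here is a sketch of a two-sided one-sample z-test (all numbers are hypothetical; assumes SciPy). The p-value is the probability, under H0, of a z statistic at least as extreme as the one observed:

```python
import math
from scipy import stats

# H0: mu = 50. Observed sample mean 52.5 from n = 40 values,
# with known population sd 8 (all numbers hypothetical).
mu0, sigma, n, xbar = 50.0, 8.0, 40, 52.5

# Standardize the observed sample mean using the standard error.
z = (xbar - mu0) / (sigma / math.sqrt(n))

# Two-sided p-value: probability of a result "at least this extreme"
# in either tail, assuming H0 is true.
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
print(round(z, 3), round(p_value, 4))   # z ≈ 1.98, p ≈ 0.048
```

Since the p-value falls just below α = 0.05, this hypothetical sample would (narrowly) lead us to reject H0 at the 95% confidence level.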
Read (10 mins): https://www.abtasty.com/blog/type-1-and-type-2-errors/
- Power of Hypothesis
- Power = P(reject H0 | H0 false)
- = 1 − P(fail to reject H0 | H0 false), where the second term is β, the Type II error probability
- = P(not making a Type II error) = 1 − β
- There are 3 parameters that can affect the power of a test:
- Your sample size (n)
- The significance level of your test (α)
- The “true” value of your tested parameter.
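The effect of these parameters can be seen in a Monte Carlo sketch (assuming NumPy and SciPy; the normal population, effect size, and trial count are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def simulated_power(n, alpha, true_mu, mu0=0.0, sigma=1.0, trials=5000):
    """Monte Carlo estimate of power: the fraction of samples, drawn
    under the true mean, in which H0: mu = mu0 is rejected by a
    two-sided z-test at significance level alpha."""
    samples = rng.normal(loc=true_mu, scale=sigma, size=(trials, n))
    z = (samples.mean(axis=1) - mu0) / (sigma / np.sqrt(n))
    critical = stats.norm.ppf(1 - alpha / 2)
    return np.mean(np.abs(z) > critical)

# Power rises with sample size (holding alpha and the true effect fixed).
power_small = simulated_power(n=20, alpha=0.05, true_mu=0.5)
power_large = simulated_power(n=80, alpha=0.05, true_mu=0.5)
print(power_small, power_large)   # the larger sample gives much higher power
```

Re-running with a larger `alpha` or a larger `true_mu` would likewise raise the estimated power, matching the three parameters listed above.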
Hypothesis Testing with Examples
- The probability of making a Type I error is α; the probability of making a Type II error is β.
- The power of the test is the probability of not committing a Type II error, which is equal to 1 − β.
When we want to carry out inferences on one proportion (build a confidence interval or do a significance test), the accuracy of our methods depends on a few conditions. Before doing the actual computations of the interval or test, it’s important to check whether these conditions have been met; otherwise the calculations and conclusions that follow aren’t valid.
The conditions we need for inference on one proportion are:
- Random: The data needs to come from a random sample or randomized experiment.
- Normal: The sampling distribution of p̂ needs to be approximately normal — we need at least 10 expected successes and 10 expected failures.
- Independent: Individual observations need to be independent. If sampling without replacement, our sample size shouldn’t be more than 10% of the population.
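The Normal and Independent conditions for one proportion are easy to check mechanically; `proportion_inference_conditions` below is a hypothetical helper (the Random condition has to be judged from the study design, not computed):

```python
def proportion_inference_conditions(n, p_hat, population_size):
    """Check the Normal and Independent conditions for inference
    on one proportion."""
    normal_ok = n * p_hat >= 10 and n * (1 - p_hat) >= 10
    independent_ok = n <= 0.10 * population_size
    return normal_ok, independent_ok

# Hypothetical survey: 40 successes out of n = 120, drawn without
# replacement from a population of 5,000.
conditions = proportion_inference_conditions(120, 40 / 120, 5_000)
print(conditions)   # (True, True): 40 successes, 80 failures, 120 <= 500
```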
The conditions we need for inference on a mean are:
- Random: A random sample or randomized experiment should be used to obtain the data.
- Normal: The sampling distribution of x̄ (the sample mean) needs to be approximately normal. This is true if our parent population is normal or if our sample is reasonably large (n ≥ 30).
- Independent: Individual observations need to be independent. If sampling without replacement, our sample size shouldn’t be more than 10% of the population.
Sample standard deviation: the sample variance (computed with n − 1 in the denominator) is an unbiased estimate of the population variance, and this holds for any population distribution. But taking the square root of the sample variance makes the sample standard deviation a biased estimate of the true population standard deviation, because the square root is a non-linear function (and the size of the bias now depends on how the population is actually distributed).
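A simulation makes the bias visible (a sketch assuming NumPy; the normal population and the small sample size n = 5 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
true_sd = 1.0
n = 5

# Repeatedly draw small normal samples and average each estimator
# over many repetitions to approximate its expected value.
samples = rng.normal(loc=0.0, scale=true_sd, size=(200_000, n))
variances = samples.var(axis=1, ddof=1)   # n - 1 in the denominator
sds = np.sqrt(variances)

print(variances.mean())  # close to 1.0: the sample variance is unbiased
print(sds.mean())        # noticeably below 1.0: the sample sd is biased low
```

The average sample variance sits at the true value, while the average sample standard deviation falls short of it, exactly because the square root is applied after averaging rather than before.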
Click-Through Rate (CTR): number of clicks / number of pageviews.
CTR is generally used when we want to measure the usability of a site.
Click-Through Probability: number of unique visitors who click / number of unique visitors to the page.
Click-through probability is generally used when we want to measure impact, e.g. going from the first level of a site to the second.
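The two metrics differ only in what they count; the helper names below are hypothetical and the traffic numbers are made up:

```python
def click_through_rate(clicks, pageviews):
    """CTR counts every click and every pageview, so one visitor
    clicking many times inflates the metric."""
    return clicks / pageviews

def click_through_probability(unique_clickers, unique_visitors):
    """Counts each visitor at most once, so repeated clicks by the
    same visitor do not inflate it."""
    return unique_clickers / unique_visitors

# Hypothetical day of traffic: 1,000 pageviews with 150 clicks in total,
# but only 90 of 800 unique visitors clicked at least once.
ctr = click_through_rate(150, 1_000)
ctp = click_through_probability(90, 800)
print(ctr, ctp)   # 0.15, 0.1125
```

The gap between the two numbers (repeat clickers drive CTR above the per-visitor probability) is why the probability version is preferred for measuring impact.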
2.2 Data Exploration
Read (20mins): https://www.saedsayad.com/data_exploration.htm