Statistics

2.1 Descriptive vs Inferential Statistics

Descriptive Statistics

Use descriptive statistics to summarize and graph the data for a group that you choose. This process allows you to understand that specific set of observations.

Descriptive statistics frequently use the following statistical measures to describe groups:

  • Central tendency: Use mean, median or mode.
  • Dispersion/Variability: How far out from the center do the data extend, e.g. range, variance or standard deviation.
  • Skewness: The measure tells you whether the distribution of values is symmetric or skewed.
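As a rough illustration of the three measures above, the sketch below summarizes a small made-up data set with Python's standard library. The data values and the `skewness` helper are assumptions for illustration only; the helper uses the simple population formula (mean of cubed z-scores), which is one of several conventions.

```python
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical data: nine typical values plus one large outlier.
data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 24]

summary = {
    "mean": statistics.fmean(data),      # pulled upward by the outlier
    "median": statistics.median(data),   # barely affected by the outlier
    "range": max(data) - min(data),
    "stdev": statistics.stdev(data),     # sample standard deviation (n - 1)
    "skewness": skewness(data),          # positive: longer right tail
}
print(summary)
```

The gap between the mean and the median is a quick hint that the distribution is skewed, which the positive skewness value confirms.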

Inferential Statistics

Inferential statistics takes data from a sample and makes inferences about the larger population from which the sample was drawn. Because the goal of inferential statistics is to draw conclusions from a sample and generalize them to a population, we need to have confidence that our sample accurately reflects the population. Common inferential methods include:

  • Hypothesis tests
  • Confidence intervals (CIs)
  • Regression analysis

Read (10 mins): https://statisticsbyjim.com/basics/descriptive-inferential-statistics/

2.2 Terms

Population

The population is the entire group you want to draw conclusions about. It is described by parameters such as the mean, median, mode, standard deviation, etc.

Sample

It is a random subset of the population. A sample does not have parameters; it has statistics, which are used to estimate the population’s parameters.

Sampling Distribution:

As we’ve seen, you take a sample to estimate the parameters of the whole population. However, a single sample will not always give you an accurate estimate of the population’s real parameters.

Instead of taking a single sample, what about if we take several samples from our population? For each sample, we’ll calculate the mean. So, in the end, we’ll have several values of mean estimation and then we can plot them on a chart.

This will be called the sampling distribution of the sample mean.

Thanks to the Central Limit Theorem, the following properties hold:

  • The sampling distribution will be normal or close to normal, provided the sample size is large enough;
  • The mean of the sampling distribution will be equal to the population’s mean;
  • The standard error of your sampling distribution is directly linked to the standard deviation of the original population: $SE = \sigma / \sqrt{n}$, where $\sigma$ is the population standard deviation and n is the number of values you take for each sample.
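The repeated-sampling idea can be tried out with a small simulation. The sketch below is only illustrative: the exponential population, the sample size of 50, and the number of samples are all assumptions. It compares the spread of the sample means with the population standard deviation divided by the square root of n.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical skewed population: exponential with mean 10.
population = [random.expovariate(1 / 10) for _ in range(100_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

n = 50               # values taken for each sample
num_samples = 2_000  # how many samples we draw

# Draw many samples (with replacement) and record each sample's mean.
sample_means = [
    statistics.fmean(random.choices(population, k=n))
    for _ in range(num_samples)
]

observed_center = statistics.fmean(sample_means)
observed_se = statistics.stdev(sample_means)
predicted_se = pop_sd / n ** 0.5

print(f"population mean {pop_mean:.2f}, mean of sample means {observed_center:.2f}")
print(f"predicted SE {predicted_se:.3f}, observed SE {observed_se:.3f}")
```

Even though the population is strongly skewed, the mean of the sample means should land close to the population mean, and the observed spread should be close to the predicted standard error.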

Central Limit Theorem

According to the theorem, sample statistics—in particular sample means—obtained (theoretically) under repeated random sampling follow a normal distribution centered over the true population mean and with a standard deviation equal to the population standard deviation divided by the square root of the size of the (repeated) samples.

Standard Normal (z) distribution: Equally important, area of regions under any normal distribution can be obtained by transforming the original data to have a mean of 0 and standard deviation of 1 as follows:

$$z = \frac{x - \mu}{\sigma}$$

where x is the point of interest under the original distribution, $\mu$ is the population mean, and $\sigma$ is the population standard deviation.

For hypothesis testing and confidence interval construction, the denominator should be the standard deviation of the sampling distribution (called the standard error), which is computed as $\sigma / \sqrt{n}$.
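A minimal sketch of the z transformation, assuming the usual form z = (x − μ) / σ and using `statistics.NormalDist` from the standard library to look up the area under the standard normal curve. The numbers (mean 100, standard deviation 15, sample size 36) are made up for illustration.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Standardize x against a distribution with mean mu and std dev sigma."""
    return (x - mu) / sigma

# Single observation: how unusual is x = 130 when mu = 100 and sigma = 15?
z = z_score(130, 100, 15)
area_below = NormalDist().cdf(z)  # share of the distribution below x

# Sample mean: the denominator becomes the standard error sigma / sqrt(n).
n = 36
standard_error = 15 / n ** 0.5
z_mean = z_score(104, 100, standard_error)

print(z, round(area_below, 4), z_mean)
```

The same function serves both cases; only the denominator changes, which is why a modest shift in a sample mean can be far more surprising than the same shift in a single observation.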

The following are different ways to state the Central Limit Theorem in one statement:

  • The Central Limit Theorem states that the sampling distribution of the mean of any independent, random variable will be normal or nearly normal if the sample size is large enough.
  • In simple terms, if we draw many samples from a large population, then the mean of all the sample means will be almost equal to the mean of the entire population.
  • The central limit theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger no matter what the shape of the population distribution.

Q. How do you check whether a distribution is normal?

There are three general ways to determine if a distribution is normal.

  • The first way is visually checking with a histogram.
  • A more quantitative way of checking this is by calculating the skewness of the distribution; a value close to 0 indicates symmetry, which is necessary but not sufficient for normality.
  • The third way is to conduct formal tests to check for normality — some common tests include the Kolmogorov-Smirnov test (K-S) and Shapiro-Wilk (S-W) test. Essentially, these tests compare a set of data against a normal distribution with the same mean and standard deviation of your sample.
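The second and third checks can be sketched with the standard library alone. The code below is an illustration under assumptions: the `skewness` and `ks_distance` helpers and the simulated data are hypothetical, and `ks_distance` only computes the Kolmogorov-Smirnov distance (the largest gap between the empirical CDF and a normal CDF with the sample's mean and standard deviation). It does not produce a p-value; for a full test, a statistics package such as SciPy (`scipy.stats.shapiro`, `scipy.stats.kstest`) would be the usual route.

```python
import random
from statistics import NormalDist, fmean, pstdev

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    m, s = fmean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def ks_distance(values):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    fitted = NormalDist(fmean(values), pstdev(values))
    ordered = sorted(values)
    n = len(ordered)
    gap = 0.0
    for i, v in enumerate(ordered):
        cdf = fitted.cdf(v)
        # The empirical CDF jumps from i/n to (i+1)/n at v; check both sides.
        gap = max(gap, abs(cdf - i / n), abs((i + 1) / n - cdf))
    return gap

random.seed(1)
normal_like = [random.gauss(0, 1) for _ in range(2_000)]   # should look normal
skewed = [random.expovariate(1.0) for _ in range(2_000)]   # clearly not normal

for name, data in [("normal_like", normal_like), ("skewed", skewed)]:
    print(name, round(skewness(data), 3), round(ks_distance(data), 3))
```

Both numbers should be near zero for the normal-like data and clearly larger for the skewed data.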

Standard Error

Read (3 mins): https://www.students4bestevidence.net/blog/2018/09/26/a-beginners-guide-to-standard-deviation-and-standard-error/

Standard deviation of binomial

A binomial distribution can be approximated by a normal distribution if $N\hat{p} > 5$ and $N(1-\hat{p}) > 5$.
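A small sketch of this rule of thumb, together with the standard error of a sample proportion, sqrt(p̂(1 − p̂) / N). The function names and the click counts are assumptions chosen for illustration.

```python
def normal_approx_ok(n, p_hat, threshold=5):
    """Rule of thumb: both expected counts must exceed the threshold."""
    return n * p_hat > threshold and n * (1 - p_hat) > threshold

def proportion_standard_error(n, p_hat):
    """Standard error of a sample proportion under the binomial model."""
    return (p_hat * (1 - p_hat) / n) ** 0.5

# Hypothetical experiment: 100 clicks out of 1000 pageviews.
n, clicks = 1000, 100
p_hat = clicks / n

ok = normal_approx_ok(n, p_hat)
se = proportion_standard_error(n, p_hat)
margin = 1.96 * se  # half-width of an approximate 95% interval

print(ok, round(se, 4), round(margin, 4))
```

With only 20 pageviews and the same rate, `normal_approx_ok(20, 0.1)` is `False`, so the normal approximation should not be trusted there.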

Read (10 mins): https://classroom.udacity.com/courses/ud257/lessons/4018018619/concepts/40043986940923

Hypothesis Testing

Read (30 mins): https://www.youtube.com/watch?v=VK-rnA3-41c

  • Hypothesis: A premise or claim that we want to test/investigate.
  • Null hypothesis H0: Currently established/accepted value of a parameter.
  • Alternate hypothesis Ha: Also called research hypothesis. Involves the claim to be tested.
  • H0 and Ha are mathematical opposites.
  • Level of Confidence (C): e.g. 95% or 99%. How confident are we in our decision?
  • Level of Significance: α = 1-C
    • So, if LOC = 95%
    • then C = 0.95 and α = 0.05
  • P-value: Probability of obtaining a sample at least as extreme as the one observed in your data, assuming H0 is true.

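  • Type I error is the rejection of a true null hypothesis (also known as a “false positive” finding or conclusion; example: “an innocent person is convicted”)

α = P (reject H0 | H0 true)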

  • Type II error is the non-rejection of a false null hypothesis (also known as a “false negative” finding or conclusion; example: “a guilty person is not convicted”)

β = P (fail to reject H0 | H0 false)
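To see how the p-value and α interact, here is a minimal sketch of a two-sided one-sample z-test. It assumes the population standard deviation is known, and all the numbers are hypothetical; the function name is an illustration, not a library API.

```python
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided z-test of H0: mu = mu0, assuming sigma is known."""
    standard_error = sigma / n ** 0.5
    z = (sample_mean - mu0) / standard_error
    # Probability of a result at least this extreme in either tail under H0.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha

# Hypothetical study: H0 says mu = 50; a sample of 64 has mean 52.5, sigma = 10.
z, p_value, reject = one_sample_z_test(sample_mean=52.5, mu0=50, sigma=10, n=64)
print(round(z, 2), round(p_value, 4), reject)
```

With α = 0.05 the null hypothesis is rejected here, while the same data with `alpha=0.01` would not lead to rejection: a stricter α lowers the Type I error rate at the cost of more Type II errors.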

Read (10 mins): https://www.abtasty.com/blog/type-1-and-type-2-errors/

  • Power of a Hypothesis Test
    • Power = P (rejecting H0 | H0 False)
    • = 1 – P (not rejecting H0 | H0 False), i.e. 1 – β, since the second term is the definition of a Type 2 error
    • = P (not making type 2 error)
  • There are 3 parameters that can affect the power of a test:
    • Your sample size (n)
    • The significance level of your test (α)
    • The “true” value of your tested parameter, i.e. how far it lies from the value stated in H0
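The effect of these three parameters can be explored with a short calculation. The sketch below assumes an upper-tailed one-sample z-test with known σ; the function name and the numbers are hypothetical.

```python
from statistics import NormalDist

def power_one_sided_z(mu0, mu_true, sigma, n, alpha=0.05):
    """Power of the upper-tailed z-test of H0: mu = mu0 vs Ha: mu > mu0."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)          # rejection threshold
    shift = (mu_true - mu0) / (sigma / n ** 0.5)    # true effect in SE units
    return 1 - std_normal.cdf(z_crit - shift)

base = power_one_sided_z(mu0=50, mu_true=52, sigma=10, n=25)
more_data = power_one_sided_z(mu0=50, mu_true=52, sigma=10, n=100)
looser_alpha = power_one_sided_z(mu0=50, mu_true=52, sigma=10, n=25, alpha=0.10)
bigger_effect = power_one_sided_z(mu0=50, mu_true=55, sigma=10, n=25)

print(round(base, 3), round(more_data, 3),
      round(looser_alpha, 3), round(bigger_effect, 3))
```

Each of the last three calls changes one parameter relative to `base`, and each one raises the power.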

Hypothesis Testing with Examples

Read (20 mins): https://opentextbc.ca/introbusinessstatopenstax/chapter/full-hypothesis-test-examples/

  1. The probability of making a type 1 error is “α”; the probability of a type 2 error is “β”.
  2. https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/error-probabilities-and-power/v/introduction-to-power-in-significance-tests?modal=1
  3. Beta is tied to the power of the test: power is the probability of not committing a type 2 error, which is equal to 1-β.

PROPORTIONS

Read (15mins): https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/tests-about-population-proportion/a/conditions-inference-one-proportion

When we want to carry out inferences on one proportion (build a confidence interval or do a significance test), the accuracy of our methods depends on a few conditions. Before doing the actual computations of the interval or test, it’s important to check whether or not these conditions have been met; otherwise the calculations and conclusions that follow aren’t actually valid.

The conditions we need for inference on one proportion are:

  • Random: The data needs to come from a random sample or randomized experiment.
  • Normal: The sampling distribution of $\hat{p}$ needs to be approximately normal; this needs at least 10 expected successes and 10 expected failures.
  • Independent: Individual observations need to be independent. If sampling without replacement, our sample size shouldn’t be more than 10% of the population.
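These checks are easy to encode. The helper below is a hypothetical sketch: the function name, its arguments, and the convention that `population_size=None` means the 10% rule does not apply are all assumptions. The random condition cannot be verified from numbers alone, so it is passed in as a flag.

```python
def check_one_proportion_conditions(n, p, population_size=None, is_random=True):
    """Return condition name -> bool for inference on one proportion.

    p is the hypothesized proportion for a test, or the sample proportion
    for a confidence interval.
    """
    return {
        "random": is_random,
        # At least 10 expected successes and 10 expected failures.
        "normal": n * p >= 10 and n * (1 - p) >= 10,
        # 10% rule when sampling without replacement from a finite population.
        "independent": population_size is None or n <= 0.10 * population_size,
    }

good = check_one_proportion_conditions(n=50, p=0.3, population_size=10_000)
too_rare = check_one_proportion_conditions(n=200, p=0.04)
too_large = check_one_proportion_conditions(n=200, p=0.5, population_size=1_000)

print(good, too_rare, too_large, sep="\n")
```

In the second case only 8 successes are expected, so the normal condition fails; in the third the sample is 20% of the population, so the independence condition fails.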

MEAN

Read (15mins): https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/tests-about-population-mean/a/reference-conditions-inference-one-mean

The conditions we need for inference on a mean are:

  • Random: A random sample or randomized experiment should be used to obtain the data.
  • Normal: The sampling distribution of $\bar{x}$ (the sample mean) needs to be approximately normal. This is true if our parent population is normal or if our sample is reasonably large (n≥30).
  • Independent: Individual observations need to be independent. If sampling without replacement, our sample size shouldn’t be more than 10% of the population.

Sample standard deviation: The sample variance is an unbiased estimate of the population variance (using n-1; this works for any probability distribution of our population). But when we take the square root of the sample variance, the sample standard deviation becomes a biased estimate of the true population standard deviation, because the square root is a non-linear function (and the size of the bias now depends on how the population is actually distributed).
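A quick simulation can make this bias visible. The sketch below assumes a normal population with standard deviation 2 and very small samples (n = 5), where the effect is easiest to see; these choices are illustrative only.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable

sigma = 2.0      # true population standard deviation (variance = 4)
n = 5            # small samples make the bias easier to see
trials = 20_000

variances, stdevs = [], []
for _ in range(trials):
    sample = [random.gauss(0, sigma) for _ in range(n)]
    variances.append(statistics.variance(sample))  # divides by n - 1
    stdevs.append(statistics.stdev(sample))        # square root of the above

avg_variance = statistics.fmean(variances)  # should be close to 4
avg_stdev = statistics.fmean(stdevs)        # should fall below 2

print(round(avg_variance, 3), round(avg_stdev, 3))
```

The average sample variance sits near the true variance, while the average sample standard deviation falls short of the true value, which is the bias described above.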

AB Testing:

Click Through Rate (CTR): Number of clicks / Number of pageviews

Generally, CTR is used when we want to check the usability of a site.

Click Through Probability (CTR – Probability): Number of unique visitors who click / Number of unique visitors to page

Generally, it is used when we want to measure impact, e.g. going from the first level of a site to the second.
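The difference between the two metrics shows up as soon as a visitor clicks or views more than once. The sketch below computes both from a tiny made-up event log; the log format (a list of `(visitor_id, action)` pairs) and the function name are assumptions.

```python
def click_metrics(events):
    """events: list of (visitor_id, action) with action "view" or "click"."""
    pageviews = sum(1 for _, action in events if action == "view")
    clicks = sum(1 for _, action in events if action == "click")
    visitors = {vid for vid, action in events if action == "view"}
    clickers = {vid for vid, action in events if action == "click"}
    return {
        "click_through_rate": clicks / pageviews,
        # Each visitor counts at most once, however often they click.
        "click_through_probability": len(clickers & visitors) / len(visitors),
    }

# Hypothetical log: A double-clicks, B views twice without clicking, C clicks once.
events = [
    ("A", "view"), ("A", "click"), ("A", "click"),
    ("B", "view"), ("B", "view"),
    ("C", "view"), ("C", "click"),
]

metrics = click_metrics(events)
print(metrics)
```

Here the rate is 3 clicks over 4 pageviews, while the probability is 2 clicking visitors out of 3, so the impatient double click inflates the rate but not the probability.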

2.2  Data Exploration

Read (20mins): https://www.saedsayad.com/data_exploration.htm

2.2.1  Univariate

2.2.1.1  Categorical

a.  Count%
b.  Pie Chart, Bar Chart

2.2.1.2  Numerical

a.  Min, Max, Mean, Median, Mode
b.  Range, IQR, Variance, Standard Deviation, Coefficient of Variation
c.  Skewness, Kurtosis
d.  Histogram, Box plot

2.2.2  Bivariate

2.2.2.1  Categorical & Categorical

a.  Chi-square test
b.  Bar Chart, 2-y axis plot
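As a sketch of the chi-square test for two categorical variables, the code below computes the Pearson statistic and degrees of freedom for a contingency table by hand. The table values are made up, and the comparison uses the standard 5% critical value for 1 degree of freedom (3.841) instead of computing a p-value.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical 2x2 table: rows = group, columns = outcome (yes / no).
table = [[30, 20],
         [20, 30]]

stat, dof = chi_square_statistic(table)
significant_at_5_percent = stat > 3.841  # critical value for dof = 1

print(stat, dof, significant_at_5_percent)
```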

2.2.2.2  Numerical & Numerical

a.  Correlation
b.  Scatter plot
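For two numerical variables, the Pearson correlation coefficient can be computed directly from its definition. The helper and the data below are hypothetical, a minimal sketch without missing-value handling.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return covariation / (spread_x * spread_y)

# Hypothetical data: hours studied vs. quiz score.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

r = pearson_r(hours, score)
print(round(r, 4))
```

A value near +1 or -1 indicates a strong linear relationship, and a value near 0 indicates little linear relationship; a scatter plot remains the best way to spot non-linear patterns that the coefficient misses.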

2.2.2.3  Numerical & Categorical

a.  Z test, t test, ANOVA

b.  Bar & Line Chart, 2-y axis plot
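For a numerical variable split by a categorical one, a one-way ANOVA compares the variation between group means with the variation inside the groups. The sketch below computes only the F statistic; the function name and the three groups are made up, and turning F into a p-value would need an F distribution from a statistics package.

```python
from statistics import fmean

def one_way_anova_f(groups):
    """F statistic for one-way ANOVA; groups is a list of lists of numbers."""
    all_values = [v for g in groups for v in g]
    grand_mean = fmean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the values around their own group mean.
    ss_within = sum((v - fmean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical measurements for three categories.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]

f_stat = one_way_anova_f(groups)
print(round(f_stat, 3))
```

A large F means the group means differ by more than the within-group noise would suggest; identical group means give F = 0.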

2.2.3  Multivariate

2.2.3.1  Cluster Analysis

2.2.3.2  Principal Component Analysis

2.2.3.3  Correspondence Analysis

© 2021 KeyToDataScience