Statistical Sampling

## Statistical Sampling – Overview

Statistical Sampling Terminology

In this part we discuss Sampling Theory and related topics.

## Statistical Sampling

Statistical Sampling is the process of making a set of observations randomly from a population distribution.

A Population is a collection of all data points under study. For example, if we are studying the annual incomes of all people in India, then the population under study would consist of data points representing the income of each and every person in India.

A Sample is a part of a population.

Samples, being smaller in size than their population, are easier to study. If we want to draw some conclusions about a population, we can do so by studying a suitable sample of the population.

Statistical Sampling Benefits

1. The population may be too large to be studied entirely. Testing the entire population may be impossible.
2. A study of a sample is usually cheaper than a study of the population.
3. Sampling usually gives information quicker than a census, so timely decisions can be taken.
4. Sampling involves less work than a census, so the chances of error while processing the data are lower in a sample survey.
5. In destructive testing, sampling is the only available course.

## Statistical Sampling Methods

Different Statistical Sampling Methods are adopted for selection of Samples.

• Random Sampling

Random Sampling is used where every element of the population has a chance of being included in the sample.

• Judgemental Sampling

In Judgemental Sampling, the sample is selected according to the judgement of the investigators or experts. There is a certain degree of subjectivity in the selection. Because of this, statistical theory cannot be used to evaluate the results of Judgemental Sampling.

Ex. To study people living in rented accommodation in Delhi, the investigator may select as a sample, some selected tenants in Delhi, exercising adequate  care to make the sample representative of the population.

## Statistical Random Sampling

Statistical Random Sampling is used where every element of the population has a chance of being included in the sample.

Simple Random Sampling

In Simple Random Sampling, each possible sample has an equal chance of being selected, and  each item in the entire population also has an equal chance of being selected.

Ex. To do a survey of customers, a retailer may pick the sale bills of his customers at random.

Stratified Sampling

Stratified Sampling is generally used in heterogeneous population.

• The population is first subdivided into several parts (or small groups) called strata, according to some relevant characteristics, so that each stratum is more or less homogeneous.
• Each stratum is called a sub-population. A small sample is selected from each stratum at random.
• All the sub-samples combined together form the Stratified Sample, representing the population properly.

Ex. A retailer may divide his customers into strata such as TV buyers, stereo buyers and VCR buyers. Random sampling would then be done within each stratum.
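The three steps above can be sketched in Python. This is only an illustration: the function name `stratified_sample`, the customer records and the sample size per stratum are all assumptions made for the example, not part of the original text.

```python
import random

def stratified_sample(population, stratum_of, per_stratum, seed=None):
    """Group items into strata, then draw a simple random sample from each."""
    rng = random.Random(seed)
    strata = {}
    for item in population:
        strata.setdefault(stratum_of(item), []).append(item)
    sample = []
    for name in sorted(strata):
        members = strata[name]
        k = min(per_stratum, len(members))  # never ask for more than exists
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical customer records: (customer id, product bought)
customers = [(i, ("TV", "Stereo", "VCR")[i % 3]) for i in range(30)]
sample = stratified_sample(customers, lambda c: c[1], per_stratum=4, seed=1)
print(len(sample))  # 12: four customers from each of the three strata
```

Drawing the same number from every stratum is the simplest choice; sampling in proportion to stratum size is an equally common design.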

Systematic Sampling

In Systematic Sampling, each element has an equal chance of being chosen, but each sample does not have the same chance of being chosen.

In Systematic Sampling, the first element of the population is randomly chosen, but thereafter the elements are chosen according to a systematic plan.

Ex. To do a survey of customers, a retailer may sample the sale bills of his customers as follows.

He would randomly select one bill (say bill number 8, or customer ledger number 4).

Thereafter he would select subsequent bills (or customer ledgers) according to a systematic plan. Say, he would select every sixth bill (or customer), so that the selected bills are bill numbers 8, 14, 20 and so on, or customer ledger numbers 4, 10, 16 and so on.

Systematic Sampling requires less time and lower costs than Simple Random Sampling, but the chances of error are higher in Systematic Sampling.
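A minimal Python sketch of the retailer's plan is given below. The helper name `systematic_sample` and the list of 30 bill numbers are assumptions for illustration. When no start is supplied, the sketch picks a random start within the first interval, which is one common convention.

```python
import random

def systematic_sample(items, step, start=None, seed=None):
    """Take the item at index `start`, then every `step`-th item after it.

    If `start` is omitted, a random index in the first interval is used.
    """
    if start is None:
        start = random.Random(seed).randrange(step)
    return items[start::step]

bills = list(range(1, 31))  # hypothetical bill numbers 1..30
picked = systematic_sample(bills, step=6, start=7)  # index 7 holds bill number 8
print(picked)  # [8, 14, 20, 26]
```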

Cluster Sampling

In Cluster Sampling, the population is divided into clusters or groups, a random sample of clusters is selected, and every element within the selected clusters is studied.

In case of Stratified Sampling, the elements of each stratum are homogeneous. In Cluster Sampling, the elements of each cluster are not homogeneous. Each cluster is representative of the population.

Ex. The retailer may divide the city of Delhi into several clusters as per Region (North, South etc.) and classify customers according to clusters.

Then he would consider every item within randomly selected clusters. There can be large variation in buying pattern within each cluster.

## Statistical Sampling Errors

Statistical Sampling Error occurs when an analyst selects a sample that does not represent the entire population.

Use of statistics can often lead to wrong conclusions or wrong estimates, either because the entire population is not studied (sampling errors) or for other reasons (non-sampling errors).

Sampling Error

Samples are used to draw conclusions about the population. The sample mean may not be equal to the population mean. Sampling Error arises from the difference between the sample mean and the population mean.

Non-Sampling Errors

Non-Sampling errors are caused by deficiencies in the collection and analysis of data. Non-Sampling errors may occur both in a sample survey and in a census.

Causes of Non Sampling Errors

• Procedural Bias : Procedural bias is the distortion of the representativeness of the data due to the procedure adopted in collecting the data.

Ex. If the retailer excludes all customers making purchases under Rs.3,000, a Procedural Bias may creep in.

• Biased Observation : Observations may not accurately reflect the characteristics of the population being studied.

Ex. The retailer may exclude important information like the quantity and type of equipment bought, and concentrate only on the bill amount. A buyer of a number of low value items would then be treated on the same footing as a buyer of a single high value item. This may be unjustified, as the two purchasers are likely to have distinctly different needs.

• Non-Response Bias : Absence of response can lead to Non-Response Bias.

Ex. A retailer may ask customers for their suggestions for better products and services. Some customers may not be able to give an instant reply. Exclusion of their response may cause non response bias.

## Parameter and Statistic

Statistic and Parameter represent the characteristics of a Sample and a Population respectively.

A parameter is a statistical measure related to the population and is based on population, whereas a statistic is a statistical measure which relates to the sample and is based on sample data.

Statistic & Parameter Notations : To differentiate the measures of parameter and statistic, Greek or capital letters are used for the notations of parameters, whereas Roman letters are used for statistics.

• Parameter Notations : Greek or capital letters are normally used for the notations of parameters (like Population size = N, Population mean = $\displaystyle \mu$, Population S.D. = $\displaystyle \sigma$)
• Statistic Notations : Roman letters are normally used for the notations of statistics (like Sample size = n, Sample mean = $\displaystyle \overline{X}$, Sample S.D. = S)

Notations used to denote population parameters and sample statistics

Ex : The population of a town is 2,00,000. The statistical measures based on data of all these persons will be parameters. If a sample of 25,000 persons is taken and various statistical measures such as mean, standard deviation, correlation, etc. are computed, they will be statistics.

## Statistical Sampling Distribution

A Sampling Distribution is the probability distribution of a statistic (such as the sample mean) computed over all possible samples of a given size drawn from a population.

Sampling theory is the study of relationships between a population and samples drawn from the population, applicable to random samples only. The population parameters (population mean, population standard deviation, population proportion, etc.) may be determined through sample statistics like sample mean, sample standard deviation, sample proportion, etc.

Sampling Fluctuation

As two or more samples drawn from a population are not the same, the value of a statistic varies from sample to sample, but the parameter always remains constant (since all the units in a population remain the same). A parameter has no fluctuation.

The variation in the value of a statistic is called Sampling Fluctuation.
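Sampling fluctuation is easy to see in a small simulation. The sketch below is illustrative only: the population (10,000 values drawn from a normal distribution), the sample size of 25 and the seed are all assumed for the example.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population of 10,000 values
population = [rng.gauss(50, 10) for _ in range(10_000)]
mu = statistics.fmean(population)  # the parameter: one fixed number

# Five different random samples give five different values of the statistic
sample_means = [statistics.fmean(rng.sample(population, 25)) for _ in range(5)]

print(f"population mean: {mu:.2f}")
for m in sample_means:
    print(f"sample mean:     {m:.2f}")
```

The population mean is printed once and never changes, while each sample mean lands somewhere near it but not exactly on it.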

## Statistical Standard Error

Statistical Standard Error is the measure of the variability arising from sampling error due to chance.

The standard error of a statistic (standard error of sample mean, or sample standard deviation or sample proportion of defectives, etc.) is the standard deviation of the sampling distribution of the statistic.

Standard error is used as a tool in tests of hypotheses or tests of significance. It gives an idea about the reliability and precision of a sample. It helps to find out confidence limits within which the parameters are expected to lie.

If a statistic $\displaystyle \theta$ is used to estimate the parameter, then precision of $\displaystyle \theta$ = 1 / (Standard Error of $\displaystyle \theta$)

Uses of Standard Error

1. Standard Error is used to test whether the difference between the sample statistic and the population parameter is significant or is due to sampling fluctuations.
2. Standard Error is used to find the precision of the sample estimate of a population parameter.
3. It is used to find the interval estimate of a population parameter.
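The definition of standard error as the standard deviation of a sampling distribution can be checked by brute force on a tiny population. The sketch below assumes a hypothetical population of four values and samples of size 2 drawn with replacement, and compares the result with the well-known formula σ/√n for the standard error of the sample mean.

```python
import itertools
import math
import statistics

population = [2, 4, 6, 8]  # hypothetical tiny population
n = 2                      # sample size
sigma = statistics.pstdev(population)  # population standard deviation

# Every ordered sample of size n drawn with replacement (4**2 = 16 samples)
means = [statistics.fmean(s) for s in itertools.product(population, repeat=n)]

se_enumerated = statistics.pstdev(means)  # S.D. of the sampling distribution
se_formula = sigma / math.sqrt(n)         # sigma / sqrt(n)

print(round(se_enumerated, 4), round(se_formula, 4))
```

The two numbers agree because sampling with replacement makes the draws independent. For sampling without replacement from a small population, a finite population correction would be needed.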

## Statistical Hypothesis Testing

Statistical Hypothesis Testing is a method used to determine whether there is enough evidence in sample data to draw conclusions about a population.

One objective of sampling theory is Hypothesis Testing. Test of Hypothesis or Test of Significance is the procedure to decide whether a hypothesis is true or not.

Hypothesis testing involves making an assumption about the population parameter, gathering sample data and determining the sample statistic.

To test the validity of the hypothesis, the difference between the hypothesized value and the actual value of the sample statistic is determined. The hypothesis is rejected if the difference is large, and is not rejected if the difference is small.

To decide about the population on the basis of sample information, assumptions or guesses are made about the population parameters involved (called a statistical hypothesis, which may or may not be true).

• Null Hypothesis : A test of hypothesis always starts with the assumption of a Null Hypothesis. The null hypothesis asserts that there is no significant difference between the statistic and the population parameter, and that any observed difference is merely due to chance (fluctuations in sampling from the same population).

Null hypothesis is usually denoted by the symbol H0. The “no difference” attitude on the part of a statistician before drawing any sample is the basis of null hypothesis.

• Alternative Hypothesis : The Alternative Hypothesis is denoted by the symbol H1. The two hypotheses H0 and H1 are such that if one is true, the other is false and vice versa.

Ex. If we have to test whether the population mean $\displaystyle \mu$ has a specified value $\displaystyle {{\mu }_{0}}$, then (1) the Null Hypothesis is H0 : $\displaystyle \mu ={{\mu }_{0}}$ and (2) the Alternative Hypothesis may be (i) H1 : $\displaystyle \mu \ne {{\mu }_{0}}$ ($\displaystyle \mu$ > $\displaystyle {{\mu }_{0}}$ or $\displaystyle \mu$ < $\displaystyle {{\mu }_{0}}$), or (ii) H1 : $\displaystyle \mu$ > $\displaystyle {{\mu }_{0}}$, or (iii) H1 : $\displaystyle \mu$ < $\displaystyle {{\mu }_{0}}$.

These Alternative Hypotheses correspond to two-tailed (left and right sided), right-tailed and left-tailed tests respectively.

Level of Significance

The main object of hypothesis testing is to make judgment about the difference between the sample statistic and a hypothesized population parameter.

After setting up the Null and Alternative Hypotheses, the next step is to decide the criterion to be applied for acceptance or rejection of the null hypothesis. This criterion is the level of significance: the probability of rejecting the null hypothesis when it is true.

Ex. When we choose a 5% level of significance in a test procedure, there are about 5 cases in 100 in which we would reject the hypothesis when it should be accepted. That is, we are about 95% confident that we have made the right decision.

Similarly, if we choose a 1% level of significance in testing a hypothesis, then there is only 1 case in 100 in which we would reject the hypothesis when it should be accepted.

## Statistical Hypothesis Tests

A Statistical Hypothesis Test is a type of statistical analysis in which an assumption about a population parameter is put to the test.

• Two-tailed Test : Two-tailed test is a test of a statistical hypothesis, where the region of rejection is on both sides of the sampling distribution.

In such a case the critical region lies in both the right and left tails of the sampling distribution of the test statistic.

Ex. Suppose the null hypothesis states that the mean is equal to 10. The alternative hypothesis would be that the mean is less than 10 or greater than 10. The region of rejection would consist of a range of numbers located on both sides of sampling distribution; that is, the region of rejection would consist partly of numbers that were less than 10 and partly of numbers that were greater than 10.

• One-tailed Test : One-tailed test is a test of a statistical hypothesis, where the region of rejection is on only one side of the sampling distribution.

Alternative Hypothesis : Alternative Hypothesis H1 given by H1 : $\displaystyle \mu$ < $\displaystyle {{\mu }_{0}}$ (Left-tailed) or H1: $\displaystyle \mu$ > $\displaystyle {{\mu }_{0}}$ (Right-tailed)

• In a right-tailed test, the critical region lies entirely in the right tail of the sampling distribution of the sample statistic, with area equal to the level of significance $\displaystyle \alpha$.
• In a left-tailed test, the critical region lies entirely in the left tail of the sampling distribution of the sample statistic, with area equal to the level of significance $\displaystyle \alpha$.

Ex. Suppose the null hypothesis states that the mean is less than or equal to 10. The alternative hypothesis would be that the mean is greater than 10. The region of rejection would consist of a range of numbers located on the right side of sampling distribution; that is, a set of numbers greater than 10.

The type of test to be applied depends on the nature of the Alternative Hypothesis. We apply a one-tailed or two-tailed test according as the Alternative Hypothesis is one-tailed or two-tailed.

Critical Values of Z

For large samples (n > 30), the sampling distributions of many statistics are approximately normal. In such cases, we can use tabulated critical values of Z to formulate decision rules.
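Critical values of Z can also be computed directly from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `z_critical` is an assumption made for this example.

```python
from statistics import NormalDist

def z_critical(alpha, two_tailed=True):
    """Critical value of Z for a given level of significance `alpha`."""
    # A two-tailed test splits alpha equally between the two tails
    tail = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail)

for alpha in (0.10, 0.05, 0.01):
    print(
        f"alpha={alpha:.2f}  "
        f"two-tailed=±{z_critical(alpha):.3f}  "
        f"one-tailed={z_critical(alpha, two_tailed=False):.3f}"
    )
```

For a left-tailed test the critical value is the negative of the one-tailed value printed here.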

## Z Score

Z-Score is a statistical measurement of a score’s relationship to the mean in a group of scores.

A z-score (or standard score) indicates how many standard deviations an element is from the mean.

z = $\displaystyle \frac{{\left( {X-\mu } \right)}}{\sigma }$, where,  z is the z-score, X is the value of the element, μ is the population mean, and σ is the standard deviation.
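The formula translates directly into code. In the sketch below the function name `z_score` and the example figures (a value of 70 in a population with mean 60 and standard deviation 5) are hypothetical.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations by which x lies above (+) or below (-) mu."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma

# Hypothetical example: a value of 70 when the mean is 60 and the S.D. is 5
print(z_score(70, 60, 5))  # 2.0
```

A positive z-score means the element is above the mean, a negative one means it is below, and zero means it equals the mean.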

Interpretation of Z-Score

## Statistical Estimation

Statistical Estimation is the process of inferring the value of a population parameter from sample data using statistical procedures.

Types of Estimate

• Point Estimate : If an estimate of a population parameter is given by a single value, then the estimate is called point estimate of the parameter.
• Interval Estimate : If an estimate of a population parameter is given by two distinct numbers between which the parameters may be considered to lie, then the estimate is called an interval estimate of the parameter.

As the value of a point estimate fluctuates from sample to sample, interval estimates are preferable to point estimates. Also, interval estimates indicate the accuracy (or precision) of an estimate.

A statement of the error of an estimate is called its reliability.

Qualities of an efficient estimator

1. An estimator must be an unbiased estimator of the parameter.
2. Efficiency of an estimator refers to the size of the standard error of the estimator.
3. A statistic must be consistent. That is, as the sample size increases, the statistic must get closer to the parameter.
4. The sufficiency of an estimator refers to the usage of the sample information by the statistic. Sample mean is more sufficient than the sample median.

Ex. 970 apples are taken at random from a basket and 97 are found to be bad. Estimate the percentage of bad apples in the basket and assign the limits in percentage.

Proportion of bad apples in the sample = $\displaystyle \frac{{97}}{{970}}$ = 0.1. So, p = 0.1, q = 1 – 0.1 = 0.9.

SE : $\displaystyle \sigma =\sqrt{\frac{pq}{n}}=\sqrt{\frac{0.1\times 0.9}{970}}$ = .0096 = .96%

Expected Limits = .1 ± 3 x .0096 = .1± .0288 = .0712 to .1288 = 7.12% to 12.88%
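The same arithmetic can be reproduced in a few lines of Python. This is only a check of the worked example; the variable names are chosen for illustration. Note that the unrounded standard error gives limits of about 7.11% to 12.89%, marginally different from the figures above, which use the SE rounded to .0096.

```python
import math

bad, n = 97, 970
p = bad / n                # sample proportion of bad apples
q = 1 - p
se = math.sqrt(p * q / n)  # standard error of the proportion

# Limits taken at 3 standard errors on either side, as in the example
lower, upper = p - 3 * se, p + 3 * se
print(f"SE = {se:.4f}")
print(f"limits = {lower:.2%} to {upper:.2%}")
```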

## Statistical Point Estimate

Statistical Point Estimate refers to the approximate value of a population parameter obtained from a random sample of the population.

A point estimate of a population parameter is a single value of a statistic. For example, the sample mean is a point estimate of the population mean. Similarly, the sample proportion p is a point estimate of the population proportion P.

Point estimators are rough estimates of the population parameters. Besides, if the sample is not representative of the population, the point estimators may be way off the mark.

Ex. To estimate the average salary drawn by individuals in Delhi, we would have to study a representative sample of such individuals. If our sample included a disproportionately large number of high salary earners, then our point estimate would also be on the high side. The accuracy of the point estimate will usually improve as the sample gets larger and larger.

## Statistical Interval Estimates

A Statistical Interval Estimate is defined by a range between two numbers, within which a population parameter is said to lie.

For example, a < $\displaystyle \mu$ < b is an interval estimate of the population mean $\displaystyle \mu$. It indicates that the population mean is greater than a but less than b.

Point estimators are only a rough guide. Interval estimate describes a range of values within which a population parameter is likely to lie.

## Statistical Confidence Intervals

Statistical Confidence Interval is used to express the precision and uncertainty associated with a particular sampling method.

A confidence interval consists of three parts : 1. Confidence level, 2. Statistic, 3. Margin of error.

Confidence level describes the uncertainty of a sampling method. The statistic and the margin of error define an interval estimate that describes the precision of the method. The interval estimate of a confidence interval is defined by the sample statistic ± margin of error.

Ex. Suppose we compute an interval estimate of a population parameter and describe it as a 95% confidence interval. This means that if we used the same sampling method to select different samples and compute different interval estimates, the true population parameter would fall within the range defined by the sample statistic ± margin of error 95% of the time.

Confidence intervals are preferred to point estimates, because confidence intervals indicate the precision of the estimate and also the uncertainty of the estimate.

Ex. A random sample of size 11 is selected from a symmetrical population with a unique mode. The sample mean and standard deviation are 200 and 30 respectively. Find the 90% confidence interval in which the population mean $\displaystyle \mu$ will lie.

Here $\displaystyle \overline{X}$ = 200, s=30, n=11, Degrees of freedom = n – 1 = 11-1 = 10

From the table we see that for 10 degrees of freedom, the area in both tails combined is 0.10 or 10%, when t = 1.812.

Hence, area under the curve between $\displaystyle \overline{X}$ – t$\displaystyle \left( {\frac{s}{{\sqrt{n}}}} \right)$ and $\displaystyle \overline{X}$ + t$\displaystyle \left( {\frac{s}{{\sqrt{n}}}} \right)$ is 90% when t = 1.812.

So, the 90% confidence interval is 200 – (1.812 x 30 / $\displaystyle \sqrt{{11}}$) to 200 + (1.812 x 30 / $\displaystyle \sqrt{{11}}$), i.e. 200 – 16.39 to 200 + 16.39.

Hence, we are 90% confident that the population mean lies in the interval 183.61 to 216.39.

Note: However, it does not mean “P(183.61 < $\displaystyle \mu$ < 216.39) = 0.9”.

If enough samples are taken and for each sample, the confidence interval is computed, then 90% of these intervals would contain $\displaystyle \mu$.
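The interval in this example can be verified with a short Python sketch. The function name `t_confidence_interval` is an assumption; the critical value 1.812 is the table value quoted above and is passed in rather than computed.

```python
import math

def t_confidence_interval(mean, s, n, t_crit):
    """Interval mean ± t_crit * s / sqrt(n)."""
    margin = t_crit * s / math.sqrt(n)
    return mean - margin, mean + margin

low, high = t_confidence_interval(mean=200, s=30, n=11, t_crit=1.812)
print(round(low, 2), round(high, 2))  # 183.61 216.39
```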

## Statistical Confidence Level

Statistical Confidence Level indicates the probability part of a Confidence Interval.

The confidence level describes the likelihood that a particular sampling method will produce a confidence interval that includes the true population parameter.

Ex. Confidence Intervals are computed for each sample from all possible samples collected  from a given population. Some confidence intervals would include the true population parameter; others would not. A 95% confidence level means that 95% of the intervals contain the true population parameter; a 90% confidence level means that 90% of the intervals contain the population parameter; and so on.

## Margin of Statistical Error

Margin of Statistical Error refers to the range of values in a confidence interval, above and below the sample statistic.

Ex. The local newspaper conducts an election survey and reports that the independent candidate will receive 30% of the vote. The newspaper states that the survey had a 5% margin of error and a confidence level of 95%.

This indicates that the newspaper is 95% confident that the independent candidate will receive between 25% and 35% of the vote.

Many public opinion surveys report interval estimates, but not confidence intervals. They provide the margin of error, but not the confidence level. To clearly interpret survey results, both these values should be known.

## Statistical Test of Significance

Statistical Tests of significance are used to estimate the probability that a relationship observed in the data occurred only by chance, the probability that the variables are really unrelated in the population. They are used to filter out unpromising hypotheses.

For small samples, tests of significance are commonly based on a theoretical sampling distribution known as Student’s t-distribution. The ‘t’ in t-distribution refers to the ratio of the difference between the sample mean and the population mean to the standard error of the sample mean:

$\displaystyle t=\frac{\left| \overline{X}-\mu \right|}{S}\times \sqrt{n}$, where $\displaystyle \overline{X}$ = Mean of sample, $\displaystyle \mu$ = Actual or Hypothetical Mean of the Population, S = best possible estimate of standard deviation of population, n = size of sample.

S is computed as $\displaystyle S=\sqrt{\frac{\sum{\left( X-\overline{X} \right)^{2}}}{n-1}}$, or $\displaystyle S=\sqrt{\frac{\sum{d^{2}}}{n-1}}$, where d = X – $\displaystyle \overline{X}$ is the deviation from the sample mean.

Tests of significance in small samples are based on the assumption that the parent population, from which the sample has been drawn, possesses the feature of normality.

Even if the parent population is not exactly normal, tests of significance can properly be applied in small samples. If the parent population is markedly skew (U or J shaped), however, these tests cannot be applied with much confidence.

There are 3 types of Test of Significance :

1. Student’s t-distribution : t-test (proposed by William Sealy Gosset)
2. Fisher’s Z-distribution : Z-test
3. F-distribution : F-test

t-Distribution

t-distribution is normally used to test the significance of various results obtained from small samples.

Properties of t-Distribution:

1. The variable t ranges from minus infinity (–∞) to plus infinity (+∞), just as a normal variable does.
2. The t-Distribution has greater variability than the normal distribution. As n gets larger, the t-Distribution approaches the normal distribution.
3. Like the standard normal distribution, the t-Distribution is also symmetrical and mono-peaked.
4. The shape of the t-curve varies with the degrees of freedom, as can be seen from the critical values in the t-table at the 5% or 1% level of significance.

## Statistical Null Hypothesis

Statistical Null Hypothesis usually refers to a general statement or default position, that there is no relationship between two measured phenomena, or no difference among groups.

Null hypothesis is first formulated such that the population mean ($\displaystyle \mu$) is equal to the given value of mean (say $\displaystyle {{\mu }_{0}}$), i.e., H0 : $\displaystyle \mu ={{\mu }_{0}}$

This hypothesis implies that (a) there is no significant difference between the sample mean and the population mean, or (b) the random sample has been drawn from the normal population with mean $\displaystyle {{\mu }_{0}}$.

1. If deviations are taken from an assumed mean, the formula will be

$\displaystyle S=\sqrt{\frac{\sum{d_{x}^{2}}-\frac{\left( \sum{d_{x}} \right)^{2}}{n}}{n-1}}$, where dx = deviations from assumed mean

2. If the standard deviation of the sample is given, the formula of t would be modified as
$\displaystyle t=\frac{\left| \overline{X}-\mu \right|}{s}\times \sqrt{n-1}$, where s = standard deviation of the sample (computed with divisor n).

Table value of t or Critical value: This value is read from the t-table at a certain level of significance and for (n – 1) degrees of freedom, as required by the problem.

• If the calculated value of t is less than its tabulated or critical value, the null hypothesis is accepted, that is, there is no significant difference between the sample mean and the population mean.
• If the calculated value of t is greater than its table value, the difference is considered significant, and the hypothesis is rejected.

## Statistical Null Hypothesis – Problems

Ex. A die was thrown 9,000 times and of these 3,220 yielded a 5 or 6. Is this consistent with the hypothesis that the die was unbiased?

Let us consider the hypothesis that the die is unbiased. On the basis of this hypothesis, the probability of getting 5 or 6 = 2/6 or 1/3. So, p = 1/3, q = 1 – 1/3 = 2/3, n = 9000, and the expected number of such throws is np = 3000. Standard Error = √(npq) = √(9000 × 1/3 × 2/3) = √2000 = 44.72

Test Statistic : Z = Difference / S.E. = (3220 – 3000) / 44.72 = 4.92

As the value of Z is more than 3, the null hypothesis is rejected and it can be concluded that the die is biased.
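The calculation for the die can be checked in Python. The variable names are illustrative, and the cut-off of 3 standard errors is the rule used in the example above.

```python
import math

n, observed = 9000, 3220
p = 1 / 3                    # P(5 or 6) for an unbiased die
q = 1 - p

expected = n * p             # expected count of 5s and 6s
se = math.sqrt(n * p * q)    # standard error of the count
z = (observed - expected) / se

print(f"SE = {se:.2f}, Z = {z:.2f}")  # SE = 44.72, Z = 4.92
print("reject H0" if abs(z) > 3 else "accept H0")
```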

Ex. In a sample of 100, 54 males and 46 females are found. Ascertain whether the observed proportions are consistent with the hypothesis that males and females are in equal proportion.

Let us consider the hypothesis that males and females are in equal proportion.

Here, p = 1/2 (i.e. 0.5), q = 1 – (1/2) = 1/2 (i.e. 0.5), n = 100, SE = √[(pq) / n] = √[(.5 x .5) / 100] = .5/10 = .05

Actual proportion = 54/100 = .54. Test Statistic = (.54 – .50) / .05 = .8

Since the difference of proportions is less than 3 S.E., the null hypothesis is accepted: the observed proportions are consistent with males and females being in equal proportion.
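The same check works in terms of proportions rather than counts. The sketch below mirrors the figures of this example; the variable names are assumptions.

```python
import math

n, males = 100, 54
p0 = 0.5                       # hypothesised proportion of males
q0 = 1 - p0

se = math.sqrt(p0 * q0 / n)    # standard error of the sample proportion
p_hat = males / n              # observed proportion
z = (p_hat - p0) / se

print(f"SE = {se:.2f}, test statistic = {z:.1f}")  # SE = 0.05, test statistic = 0.8
print("reject H0" if abs(z) > 3 else "accept H0")
```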

## Statistical Test of Significance – Problems

Ex. The heights of ten children selected at random from a given colony had a mean of 63.5 cm and a variance of 6.25 cm². Test, at the 5% level of significance, the hypothesis that the children of the given colony are on average less than 65 cm tall. (The value of t for 9 d.f. at the 5% level of significance is 2.262.)

Here, n = 10, $\displaystyle \overline{X}$ = 63.5, Variance = 6.25 (s = 2.5), $\displaystyle \mu$ = 65

Null Hypothesis: The average height of the children is 65 cm, i.e., H0 : $\displaystyle \mu$ = 65

Alternative Hypothesis H1 : $\displaystyle \mu$ < 65

t = (|$\displaystyle \overline{X}$ – $\displaystyle \mu$| / s) × √(n – 1) = (|63.5 – 65| / 2.5) × √(10 – 1) = (1.5 x 3) / 2.5 = 1.8

Critical Value = 2.262 (as given)

Since the calculated value of t (1.8) is less than its critical value (2.262), the null hypothesis that the average height of the children is 65 cm is accepted. On this basis the alternative hypothesis (H1), that the average height is less than 65 cm, is rejected.
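A short Python sketch of this calculation is given below. The function name `t_from_sample_sd` is hypothetical; it implements the modified formula with √(n – 1), which applies when the given standard deviation is that of the sample. The critical value 2.262 is the one supplied in the problem.

```python
import math

def t_from_sample_sd(sample_mean, mu0, s, n):
    """t = |mean - mu0| / s * sqrt(n - 1), for a sample S.D. with divisor n."""
    return abs(sample_mean - mu0) / s * math.sqrt(n - 1)

s = math.sqrt(6.25)  # variance 6.25 gives s = 2.5
t = t_from_sample_sd(sample_mean=63.5, mu0=65, s=s, n=10)
critical = 2.262     # table value given in the problem

print(f"t = {t:.2f}")  # t = 1.80
print("reject H0" if t > critical else "accept H0")
```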

Six boys are selected at random from a college and their marks in English found to be 63, 63, 64, 66, 60 and 68 out of 100. In the light of these marks discuss the general observation that the mean marks in English in the college were 67.

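Calculation of Sample Mean and S.D.

Null Hypothesis: The mean marks in English in the college are 67, i.e., H0 : $\displaystyle \mu$ = 67

$\displaystyle \overline{X}$ = (63 + 63 + 64 + 66 + 60 + 68) / 6 = 384 / 6 = 64. The deviations from the mean are –1, –1, 0, 2, –4 and 4, so $\displaystyle \sum{d^{2}}$ = 1 + 1 + 0 + 4 + 16 + 16 = 38 and S = √(38 / 5) = 2.757

t = (|64 – 67| / 2.757) × √6 = 2.665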

Critical Value: n= 6, so d.f. = 6 – 1 = 5. The critical value of t at 5% level of significance and for 5 d.f. is 2.571.

Since the calculated value of t (2.665) is more than its critical value (2.571), the null hypothesis is rejected. It means that the mean marks in English in the college were not 67.
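The figures for this problem can be reproduced from the six marks with the Python standard library. In this sketch `statistics.stdev` uses the divisor n – 1, so it corresponds to S in the t formula with √n. The critical value 2.571 is the table value quoted above.

```python
import math
import statistics

marks = [63, 63, 64, 66, 60, 68]
mu0 = 67                       # hypothesised mean marks
n = len(marks)

mean = statistics.fmean(marks)
S = statistics.stdev(marks)    # divisor n - 1
t = abs(mean - mu0) / S * math.sqrt(n)
critical = 2.571               # t-table value for 5 d.f. at the 5% level

print(f"mean = {mean}, S = {S:.3f}, t = {t:.3f}")
print("reject H0" if t > critical else "accept H0")
```

The computed t is about 2.666, which the worked solution reports as 2.665.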

Fixing Limits of Population Mean: On the basis of the mean of a random sample taken from a normally distributed population, find the limits of the population mean at the 95% or 99% confidence level.

Limits of Confidence level at 95% = $\displaystyle \overline{X}\pm \frac{S}{\sqrt{n}}\times {{t}_{0.05}}$

Limits of Confidence level at 99% = $\displaystyle \overline{X}\pm \frac{S}{\sqrt{n}}\times {{t}_{0.01}}$