Statistical Correlation and Regression

Correlation & Regression

In this part, we discuss Correlation & Regression, which describe the relationship between variables and the statistical process of estimating such relationships, covering:

  • Correlation
  • Measures of Correlation
  • Correlation Co-efficient
  • Rank Correlation
  • Regression Analysis

Statistical Correlation

Statistical Correlation refers to the relationship between two or more related variables in a statistical experiment. Two variables are said to be correlated if a change in the value of one variable brings about a change in the value of the other (e.g. Height & Weight, Price & Demand, Income & Savings).

Types of Correlation

  • Positive or Negative Correlation: By direction of movement, correlation may be classified as positive or negative.
  • Positive Correlation: When the values of the two variables move in the same direction, the correlation is said to be positive.
  • Negative Correlation: If the values of the variables move in opposite directions, so that an increase (or decrease) in one variable is accompanied by a decrease (or increase) in the other, the correlation is said to be negative.
  • Simple, Partial or Multiple Correlation: By the number of variables involved, correlation may be classified as simple, partial or multiple.
  • Simple Correlation: In simple correlation, there are only two variables.
  • Multiple Correlation: In multiple correlation, the relationship among three or more variables is studied.
  • Partial Correlation: In partial correlation, more than two variables are involved, but correlation is studied between only two of them, the other variables being assumed constant.
  • Linear or Non-linear Correlation: According to the nature of change in the ratio of the variables, correlation may be classified as linear or non-linear.
  • Linear Correlation: If the ratio of change between the variables is uniform, the correlation between them is linear.
  • Non-linear Correlation: The correlation is said to be non-linear (curvilinear) if, corresponding to a unit change in the value of one variable, the other variable changes at a fluctuating rate.

Measures of Statistical Correlation

Statistical correlation (i.e. the degree of relationship between the variables) may be measured by various methods, such as: 1. Scatter Diagram Method, 2. Karl Pearson's Coefficient of Correlation, 3. Spearman's Rank Coefficient of Correlation, 4. Coefficient of Concurrent Deviations.

Degree of Correlation

The following chart shows the relative degree of correlation

Degree of correlation                      Positive        Negative
Perfect correlation                        1               -1
Very high degree of correlation            0.9 or more     -0.9 or more
Sufficiently high degree of correlation    0.75 to 0.9     -0.75 to -0.9
Moderate degree of correlation             0.6 to 0.75     -0.6 to -0.75
Only the possibility of correlation        0.3 to 0.6      -0.3 to -0.6
Possibly no correlation                    less than 0.3   more than -0.3
No correlation                             0               0

Scatter Diagram Method

Scatter Diagram is a tool for analyzing relationships between two variables. One variable is plotted on the horizontal axis and the other on the vertical axis. The pattern of the plotted points can graphically show relationship patterns.

Positive correlation

– For each pair of x and y values, a dot (or point) is plotted; the number of dots thus equals the number of observations.

– If the plotted dots (or points) show some trend, either upward or downward, the two variables (x and y) are said to be correlated (otherwise not correlated).

– The relationship is expressed through the value of 'r', which must lie between -1 and +1.

Measure of Relationship

-If the trend of the points is upward, moving from the lower left-hand corner to the upper right-hand corner, the correlation is positive (r > 0, reaching r = +1 when the points lie exactly on a rising straight line; r is the coefficient of correlation) (Fig a)

-If the movement is the reverse, i.e. the dots run from the upper left-hand corner to the lower right-hand corner, the correlation is negative (r < 0, with r = -1 for a perfectly falling straight line) (Fig b)

-If no trend is observed, it indicates the absence of correlation (r = 0) (Fig c)

Karl Pearson’s method of Co-efficient of Correlation

Karl Pearson's method is used to compute the coefficient of correlation; the extent of the correlation is expressed through algebraic formulae, in numerical terms.

Pearson's coefficient of correlation is represented by 'r', which lies between -1 and +1.

Assumptions

Pearson's Coefficient of Correlation is based upon some assumptions: 1. A large number of independent causes are operating in each series, producing a normal distribution. 2. The forces so operating are related in a causal way. 3. The relationship between the two series is linear. 'r' can be computed in various ways, depending upon the choice of the user.

The Table shows the values of Co-efficient of Correlation revealing the respective Degree of Correlation

Co-efficient of Correlation
Result                       Degree of Correlation
±1                           Perfect correlation
±0.90 or more                Very high degree of correlation
≥ ±0.75 and < ±0.90          Fairly high degree of correlation
≥ ±0.50 and < ±0.75          Moderate degree of correlation
≥ ±0.25 and < ±0.50          Low degree of correlation
less than ±0.25              Very low degree of correlation
0                            No correlation

Co-efficient of Correlation Computation : Direct method

The direct method is used where the given values of the variables are small, or where all the values can be reduced to a small size by a change of scale or origin.

Formula of coefficient of correlation : Direct Method

\displaystyle r=\frac{{N\sum{{xy}}-\sum{x}\sum{y}}}{{\sqrt{{N\sum{{{{x}^{2}}}}-{{{\left( {\sum{x}} \right)}}^{2}}.}}\sqrt{{N\sum{{{{y}^{2}}}}-{{{\left( {\sum{y}} \right)}}^{2}}}}}}

Where, \displaystyle \sum{{}}x = sum of the values of variable x, \displaystyle \sum{{}}y = sum of the values of variable y, \displaystyle \sum{{}}xy = sum of the products of variables x and y, \displaystyle \sum{{}}x² = sum of the squares of the values of variable x, \displaystyle \sum{{}}y² = sum of the squares of the values of variable y, and N = number of observations.

Ex: Compute co-efficient of correlation for following data, using Pearson’s direct method.

Marks in English:    1  2  3  4  5
Marks in Statistics: 6  7  8  9  10

The following table shows the computation of the values and their sums

Marks in English (X)   X²    Marks in Statistics (Y)   Y²     XY
1                      1     6                         36     6
2                      4     7                         49     14
3                      9     8                         64     24
4                      16    9                         81     36
5                      25    10                        100    50
∑X = 15   ∑X² = 55   ∑Y = 40   ∑Y² = 330   ∑XY = 130

\displaystyle r=\frac{{N\sum{{xy}}-\sum{x}\sum{y}}}{{\sqrt{{N\sum{{{{x}^{2}}}}-{{{\left( {\sum{x}} \right)}}^{2}}.}}\sqrt{{N\sum{{{{y}^{2}}}}-{{{\left( {\sum{y}} \right)}}^{2}}}}}}= \displaystyle \frac{{5(130)-15\times 40}}{{\sqrt{{5\times 55-{{{(15)}}^{2}}}}.\sqrt{{5\times 330-{{{(40)}}^{2}}}}}}= \displaystyle \frac{{650-600}}{{\sqrt{{275-225}}.\sqrt{{1650-1600}}}}=\frac{{50}}{{\sqrt{{50\times 50}}}}=\frac{{50}}{{50}}=1

The coefficient of correlation being +1 shows that the correlation between the two variables is perfectly positive.
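The direct-method computation above can be checked with a short Python sketch (an illustrative addition, not part of the original worked example):

```python
from math import sqrt

def pearson_r(x, y):
    """Karl Pearson's coefficient of correlation, direct method."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))   # sum of products xy
    sx2 = sum(a * a for a in x)              # sum of squared x values
    sy2 = sum(b * b for b in y)              # sum of squared y values
    num = n * sxy - sx * sy
    den = sqrt(n * sx2 - sx ** 2) * sqrt(n * sy2 - sy ** 2)
    return num / den

english = [1, 2, 3, 4, 5]
statistics = [6, 7, 8, 9, 10]
print(round(pearson_r(english, statistics), 6))  # 1.0
```

Rounding is used because the square roots introduce tiny floating-point error even when r is exactly 1.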

Co-efficient of Correlation Computation : Assumed Mean

The assumed-mean (short-cut) method is preferred when it is not possible to get the arithmetic averages of both variables in whole or round numbers. Under this method, the deviations of the values of each variable are taken from an assumed average.

Formula of coefficient of correlation : Using Assumed Mean

\displaystyle r=\frac{{N\sum{{dxdy}}-\left[ {\left( {\sum{{dx}}} \right)\times \left( {\sum{{dy}}} \right)} \right]}}{{\sqrt{{N\sum{{d{{x}^{2}}}}-{{{\left( {\sum{{dx}}} \right)}}^{2}}}}.\sqrt{{N\sum{{d{{y}^{2}}}}-{{{\left( {\sum{{dy}}} \right)}}^{2}}}}}}

Where, dx = deviation of x from its assumed mean (i.e. x − assumed mean of the x series), dy = deviation of y from its assumed mean (i.e. y − assumed mean of the y series), \displaystyle \sum{{}}dx = sum of deviations of the x series from its assumed mean, \displaystyle \sum{{}}dy = sum of deviations of the y series from its assumed mean, \displaystyle \sum{{}}dx² = sum of squares of the deviations of the x series from its assumed mean, \displaystyle \sum{{}}dy² = sum of squares of the deviations of the y series from its assumed mean, \displaystyle \sum{{}}dxdy = sum of the products of the deviations of the x and y series from their assumed means.

Ex. Compute Karl Pearson's coefficient of correlation by the short-cut method, taking 79 and 132 as the assumed averages for the Rainfall and Rice production variables respectively.

Rainfall:    61   68   79   59   69   96   89   78
Rice Prodn:  108  123  136  107  112  156  137  125

The following table shows the computation of the values and their sums

Rainfall   dx (dev. from 79)   dx²    Rice production   dy (dev. from 132)   dy²    dx dy
61         -18                 324    108               -24                  576    432
68         -11                 121    123               -9                   81     99
79         0                   0      136               4                    16     0
59         -20                 400    107               -25                  625    500
69         -10                 100    112               -20                  400    200
96         17                  289    156               24                   576    408
89         10                  100    137               5                    25     50
78         -1                  1      125               -7                   49     7
Total (N = 8):   ∑dx = -33   ∑dx² = 1335   ∑dy = -52   ∑dy² = 2348   ∑dx dy = 1696

\displaystyle r=\frac{{N\sum{{dxdy}}-\left[ {\left( {\sum{{dx}}} \right)\times \left( {\sum{{dy}}} \right)} \right]}}{{\sqrt{{N\sum{{d{{x}^{2}}}}-{{{\left( {\sum{{dx}}} \right)}}^{2}}}}.\sqrt{{N\sum{{d{{y}^{2}}}}-{{{\left( {\sum{{dy}}} \right)}}^{2}}}}}}=\displaystyle \frac{{8\times 1696-\left( {-33\times -52} \right)}}{{\sqrt{{\left( {8\times 1335} \right)-{{{\left( {-33} \right)}}^{2}}}}.\sqrt{{\left( {8\times 2348} \right)-{{{\left( {-52} \right)}}^{2}}}}}}\displaystyle =\frac{{13568-1716}}{{\sqrt{{\left( {10680-1089} \right)\times \left( {18784-2704} \right)}}}}=\frac{{11852}}{{\sqrt{{9591\times 16080}}}}\displaystyle =\frac{{11852}}{{12418.66}}=0.95

As r = 0.95, there is a very high degree of positive correlation between the two variables.
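The short-cut computation can be reproduced in Python (an illustrative sketch using the example's data and assumed means):

```python
from math import sqrt

def pearson_r_shortcut(x, y, ax, ay):
    """Pearson's r from deviations about assumed means ax and ay (short-cut method)."""
    n = len(x)
    dx = [v - ax for v in x]                 # deviations of x from assumed mean
    dy = [v - ay for v in y]                 # deviations of y from assumed mean
    num = n * sum(a * b for a, b in zip(dx, dy)) - sum(dx) * sum(dy)
    den = (sqrt(n * sum(a * a for a in dx) - sum(dx) ** 2)
           * sqrt(n * sum(b * b for b in dy) - sum(dy) ** 2))
    return num / den

rainfall = [61, 68, 79, 59, 69, 96, 89, 78]
rice = [108, 123, 136, 107, 112, 156, 137, 125]
print(round(pearson_r_shortcut(rainfall, rice, 79, 132), 2))  # 0.95
```

Any assumed means give the same r; 79 and 132 merely keep the intermediate numbers small.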

Spearman’s Rank Correlation

Spearman's Rank Correlation is a nonparametric measure of statistical dependence between two variables. It enables us to identify whether two variables are related by a monotonic function (i.e. when one increases, so does the other, or vice versa).

Co-efficient of Rank correlation computation process:

  • Assign ranks to the items of the two series (if they are not given)
  • Find the differences of the ranks (d)
  • Square these differences (d²)

Formula of Spearman’s Rank Correlation

\displaystyle r=1-\frac{{6\left( {\sum{{{{d}^{2}}}}} \right)}}{{{{n}^{3}}-n}}

, where n = number of pairs of observations.

The value of this coefficient ranges between +1 and -1. If r = +1, there is complete agreement in the order of the ranks, in the same direction. If r = -1, there is complete agreement in the order of the ranks, but in opposite directions.

If the difference of ranks in each pair is zero, then from the above formula we get r = 1.

Ex. Ten students in a voice contest are ranked by three judges in the following order:

1st Judge:  1  6  5  10  3   2   4  9   7  8
2nd Judge:  3  5  8  4   7   10  2  1   6  9
3rd Judge:  6  4  9  8   1   2   3  10  5  7

Use the method of rank-correlation to judge which pair of judges have the nearest approach to common liking in voice.

Ranks given by            Differences (d)          Squares of differences (d²)
1st   2nd   3rd           (i)    (ii)   (iii)      (i)    (ii)   (iii)
1     3     6             -2     -3     -5         4      9      25
6     5     4             1      1      2          1      1      4
5     8     9             -3     -1     -4         9      1      16
10    4     8             6      -4     2          36     16     4
3     7     1             -4     6      2          16     36     4
2     10    2             -8     8      0          64     64     0
4     2     3             2      -1     1          4      1      1
9     1     10            8      -9     -1         64     81     1
7     6     5             1      1      2          1      1      4
8     9     7             -1     2      1          1      4      1
Total (∑d²)                                        200    214    60

[(i) = 1st − 2nd, (ii) = 2nd − 3rd, (iii) = 1st − 3rd]

The rank correlations are computed as follows:

r12 (1st & 2nd Judges) = \displaystyle 1-\frac{{6\left( {\sum{{{{d}^{2}}}}} \right)}}{{{{n}^{3}}-n}}=1-\frac{{6\times 200}}{{{{{10}}^{3}}-10}}=1-\frac{{1200}}{{990}}=1-1.212=-0.212
r23 (2nd & 3rd Judges) = \displaystyle 1-\frac{{6\times 214}}{{{{{10}}^{3}}-10}}=1-\frac{{1284}}{{990}}=1-1.297=-0.297
r13 (1st & 3rd Judges) = \displaystyle 1-\frac{{6\times 60}}{{{{{10}}^{3}}-10}}=1-\frac{{360}}{{990}}=1-0.364=+0.636

Since r13 is the only positive coefficient, the 1st and 3rd judges have the nearest approach to a common liking in voice.
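The three coefficients can be verified with a small Python sketch of the rank-correlation formula (illustrative, not from the original text):

```python
def rank_r(rank_a, rank_b):
    """Spearman's rank correlation: r = 1 - 6*sum(d^2)/(n^3 - n)."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n ** 3 - n)

j1 = [1, 6, 5, 10, 3, 2, 4, 9, 7, 8]
j2 = [3, 5, 8, 4, 7, 10, 2, 1, 6, 9]
j3 = [6, 4, 9, 8, 1, 2, 3, 10, 5, 7]
print(round(rank_r(j1, j2), 3))  # -0.212
print(round(rank_r(j2, j3), 3))  # -0.297
print(round(rank_r(j1, j3), 3))  # 0.636
```

The largest (and only positive) value, r13, confirms that judges 1 and 3 agree most closely.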

Rank Correlation – Problems

Rank correlation computation where actual ranks are not given.

Ex. Compute the rank coefficient of correlation for the marks obtained by 8 students in Mathematics and History papers.

Marks in Mathematics:  15  20  28  12  40  60  20  80
Marks in History:      40  30  50  30  20  10  30  60

The Table shows the computation details

Marks in Mathematics (X)   Rank   Marks in History (Y)   Rank   Difference d   d²
15                         2      40                     6      -4             16
20                         3.5    30                     4      -0.5           0.25
28                         5      50                     7      -2             4
12                         1      30                     4      -3             9
40                         6      20                     2      4              16
60                         7      10                     1      6              36
20                         3.5    30                     4      -0.5           0.25
80                         8      60                     8      0              0
Total                                                                          81.50

For equal (tied) ranks, some adjustment in the above formula is required.

Add \displaystyle \frac{1}{{12}}\left( {{{m}^{3}}-m} \right) to \displaystyle \sum{{}}d² for each group of tied items, where m = number of items whose ranks are common.

Here, the mark 20 is repeated 2 times in the X-series, i.e. m = 2 in the X-series (correction (2³ − 2)/12 = 0.5), and 30 is repeated 3 times in the Y-series, so m = 3 in the Y-series (correction (3³ − 3)/12 = 2).

r = 1 – [6 × (81.5 + 0.5 + 2)] / 504 = 1 – [(6 × 84)/504] = 1 – (\displaystyle \frac{{504}}{{504}}) = 1 – 1 = 0

The value of the coefficient being zero indicates that there is no correlation between the two series.
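The whole tied-ranks procedure (average ranks plus the (m³ − m)/12 correction) can be sketched in Python (an illustrative addition):

```python
def average_ranks(values):
    """Ascending ranks; tied values share the average of the ranks they occupy."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in values]

def tie_correction(values):
    """Sum of (m^3 - m)/12 over each group of m tied values."""
    return sum((values.count(v) ** 3 - values.count(v)) / 12 for v in set(values))

def spearman_tied(x, y):
    """Spearman's r with the tie adjustment added to sum(d^2)."""
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    d2 += tie_correction(x) + tie_correction(y)
    return 1 - 6 * d2 / (n ** 3 - n)

maths = [15, 20, 28, 12, 40, 60, 20, 80]
history = [40, 30, 50, 30, 20, 10, 30, 60]
print(spearman_tied(maths, history))  # 0.0
```

For this data the corrections are 0.5 (the two 20s) and 2 (the three 30s), giving 1 − 6×84/504 = 0 exactly.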

Rank Correlation – Problems

Compute rank correlation coefficient between the following two series X and Y

X:  68  64  70  60  54  67  76  63
Y:  87  71  63  78  84  58  50  40

Computation Table

Rank of X   Rank of Y   d (rank of X − rank of Y)   d²
3           1           2                           4
5           4           1                           1
2           5           -3                          9
7           3           4                           16
8           2           6                           36
4           6           -2                          4
1           7           -6                          36
6           8           -2                          4
n = 8                                               ∑d² = 110

Rank correlation coefficient

\displaystyle r=1-\frac{{6\sum{{{{d}^{2}}}}}}{{n\left( {{{n}^{2}}-1} \right)}}=1-\frac{{6\times 110}}{{8\times 63}}=1-\frac{{660}}{{504}}=1-1.31=-0.31

As r = -0.31, the correlation between X and Y is negative.

Concurrent Deviations

Concurrent Deviation is a very simple, rough-and-ready method of finding correlation, used when only the direction of change of the two variables, not their magnitude, is relevant.

The concurrent deviations method involves attaching a positive sign to an x-value (except the first) if it is more than the previous value, and a negative sign if it is less than the previous value. The same is done for the y-series. The deviation in an x-value and the corresponding y-value are said to be concurrent if both deviations have the same sign.

Denoting the number of concurrent deviations by c and the total number of deviations by m (which is one less than the number of pairs of x and y values), the coefficient of concurrent deviations is given by

\displaystyle {{r}_{c}}=\pm \sqrt{{\pm \frac{{\left( {2c-m} \right)}}{m}}}

If (2c-m) >0, then we take the positive sign both inside and outside the radical sign.

If (2c-m) <0, we consider the negative sign both inside and outside the radical sign.

Like Pearson’s correlation coefficient and Spearman’s rank correlation coefficient, the coefficient of concurrent deviations also lies between – 1 and 1, both inclusive.

Ex. Find the coefficient of concurrent deviations from the following data.

Year:    2000  2001  2002  2003  2004  2005  2006  2007
Price:   24    27    29    22    34    37    38    41
Demand:  34    33    34    29    28    27    25    22

Computation of Coefficient of Concurrent Deviations

Year   Price   Sign of dev. from prev. (a)   Demand   Sign of dev. from prev. (b)   Product of deviations (ab)
2000   24                                    34
2001   27      +                             33       -                             -
2002   29      +                             34       +                             +
2003   22      -                             29       -                             +
2004   34      +                             28       -                             -
2005   37      +                             27       -                             -
2006   38      +                             25       -                             -
2007   41      +                             22       -                             -

Here, m = number of pairs of deviations = 7, and c = number of positive signs in the product-of-deviations column = number of concurrent deviations = 2

\displaystyle {{r}_{c}}=\pm \sqrt{{\pm \frac{{\left( {2c-m} \right)}}{m}}}\displaystyle =\pm \sqrt{{\pm \frac{{\left( {4-7} \right)}}{7}}}\displaystyle =-\sqrt{{\frac{3}{7}}}=-0.65

[Since \displaystyle \frac{{\left( {2c-m} \right)}}{m}=-\frac{3}{7} is negative, we take the negative sign both inside and outside the radical sign]

Thus there is a negative correlation between price and demand.
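The sign-counting procedure translates directly into Python (an illustrative sketch; it assumes no two consecutive values are equal, so each deviation is strictly + or −):

```python
from math import sqrt

def concurrent_deviation_r(x, y):
    """Coefficient of concurrent deviations from the signs of successive changes."""
    m = len(x) - 1  # number of pairs of deviations
    # a deviation pair is concurrent when both series move in the same direction
    c = sum((x[i] > x[i - 1]) == (y[i] > y[i - 1]) for i in range(1, len(x)))
    inner = (2 * c - m) / m
    # the sign outside the radical follows the sign of (2c - m)/m
    return sqrt(inner) if inner >= 0 else -sqrt(-inner)

price = [24, 27, 29, 22, 34, 37, 38, 41]
demand = [34, 33, 34, 29, 28, 27, 25, 22]
print(round(concurrent_deviation_r(price, demand), 2))  # -0.65
```

Here c = 2 (years 2002 and 2003) and m = 7, reproducing the hand computation above.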

Regression Analysis

Regression Analysis is a statistical process for estimating the relationships among variables.

Regression Analysis Types

  • Simple and Multiple: classified by the number of variables whose relationship is described
  • Simple: finds the relationship between 2 variables only
  • Multiple: finds the relationship among more than two variables
  • Linear and Non-linear: classified by the shape obtained when the values are plotted on a graph
  • Linear: a straight line depicts a linear relationship
  • Non-linear: a curved line depicts a non-linear relationship
  • Total and Partial: classified by how the effect of multiple variables on one another is studied
  • Total: studies the effect of all the important variables on one another
  • Partial: studies the effect of one or two important, relevant variables, keeping the others constant

Regression Analysis Methods

  • Graphical Method: the observed data (x, y values) are plotted as points, and the regression line is drawn through them
  • Algebraic Method: linear equations are developed from the observed data
  • Normal Equation Method: the line of best fit (e.g. Y on X) is obtained from simple linear algebraic equations
  • Deviation from Actual Means: the two regression equations are developed in a modified form from the deviations of the values from their respective means

Simple and Multiple Regression Analysis

Simple Regression Analysis

A simple regression analysis is one which is confined to only two variables (e.g. Price and Demand). The value of one variable is estimated on the basis of the value of the other variable.

The variable whose values are estimated is called dependent, regressed or explained variable and the variable used as the basis of finding the value of the other variable is called the independent, regressing or explanatory variable.

The functional relationship between two variables X & Y can be expressed as

Y= f(X).

Ex: If the expenditure on sales promotion can have some effect on the volume of sales, then sales promotion will be the independent variable and sales will be the dependent variable. Here Sales is denoted by Y and Sales Promotion is denoted by X

Multiple Regression Analysis

The relationship is made among more than two related variables at a time say, X,Y, Z (like Sales, Price and income of the people).

In such analysis, the value of one variable is estimated on the basis of the other remaining variables. One variable is made dependent and the other variables independent.

The functional relationship is expressed as

Y = f(X,Z)      or    X = f(Y,Z)   or Z = f(X,Y)

Linear and Non- linear Regression Analysis

Regression Analysis may also be classified as Linear and Non- linear Regression Analysis.

Linear Regression Analysis

A linear regression analysis is one, which gives rise to a straight line when the data relating to the two variables are plotted on a graph paper.

In simplest term, The linear relationship is mathematically represented by the equation of a straight line

Y = a + bX

A model is linear when each term is either a constant or the product of a parameter and a predictor variable. A linear equation is constructed by adding the results for each term.

Expressed by basic form:

Response = constant + parameter * predictor + … + parameter * predictor

Y = bo + b1X1 + b2X2 + … + bkXk

If two variables have linear relationship with each other, a change in the value of the independent variable by one unit causes a constant change in the values of the dependent variable.

Linear regression analysis enables to study the average change in the value of the dependent variable for any given value of the independent variable.

The linear relationship is preferred due its simplicity and better prediction.
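The constant-change property described above can be shown in a tiny sketch (the intercept 5 and slope 2 are arbitrary, illustrative numbers):

```python
def predict(x, a=5.0, b=2.0):
    """Linear model Y = a + bX with illustrative constants a and b."""
    return a + b * x

# every unit increase in X changes Y by exactly b, whatever the starting X
print(predict(11) - predict(10))    # 2.0
print(predict(101) - predict(100))  # 2.0
```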

Non-linear Regression Analysis  

While a linear equation has one basic form, nonlinear equations can take many different forms. If the equation doesn’t meet the criteria for a linear equation, it’s nonlinear. Unlike linear regression, these functions can have more than one parameter per predictor variable.

A non-linear regression analysis graphically depicts a curved line when the data relating to the variables are plotted on a graph paper. The regression will be a function involving terms of higher order, like Y = X², Y = X³, etc.

Total and Partial Regression Analysis

Regression Analysis may also be classified as Total and Partial Regression Analysis.

Total Regression Analysis

 A total regression analysis is made to study the effect of all Important variables on one another.

Ex. When the effects of sales promotion expenditure, individual income, and the price of the goods on the volume of sales are all measured together, it is a case of total regression analysis.

Regression equation takes the following forms like that of a multiple regression analysis:

S = f(A, I,P), X = f(Y,Z,P) etc.

Total regression analysis is usually made in the field of business and economics, where the values of a variable are affected by a multiplicity of causes.

Partial Regression Analysis

Where there is a multiplicity of variables, the effect of all the important variables on one another is considered in total regression analysis, while partial regression analysis studies the effect of one or two relevant variables on another variable (keeping the other variables constant).

The equation of such a regression takes the following form

Y = f(X but not of Z and P);

S = f(sales promotion but not of price and individual income).

Graphical Method of Regression Analysis

Regression Analysis may be graphically represented through a scatter diagram, drawn by plotting every observation by a dot. The dependent variables are shown on y-axis and independent variables on x-axis.

The dots are connected to draw regression lines, depicting the best mean value of one variable corresponding to the given values of the other.

The line of best fit in the  scatter diagram is used to summarise the data.

Ex. Using the scatter diagram method draw the two regression lines associated with the following data both separately and jointly:

X:  80  100  120  80  40  100  140  100  110
Y:  60  60   100  70  60  80   100  80   70

Algebraic Method of Regression Analysis

Regression Analysis may be algebraically represented through Normal Equation Method.

Normal Equation Method

The line of best fit for Y on X (i.e. the regression line of Y on X) is obtained by finding the values of Y for any two (preferably the extreme) values of X through the linear equation Y = a + bX,

Where, a and b are two constants, whose values are found by solving the two normal equations \displaystyle {\sum{{}}}Y = Na + b\displaystyle {\sum{{}}}X and \displaystyle {\sum{{}}}XY = a\displaystyle {\sum{{}}}X + b\displaystyle {\sum{{}}}X², where X and Y represent the given values of the X and Y variables respectively.

Line of the best fit for X on Y (i.e. the regression line of X on Y) through the linear equation X = a + bY

where, the values of the two constants a and b are determined by solving the two normal equations \displaystyle {\sum{{}}}X = Na + b\displaystyle {\sum{{}}}Y and \displaystyle {\sum{{}}}XY = a\displaystyle {\sum{{}}}Y + b\displaystyle {\sum{{}}}Y²

Ex. Find the regression equations of x on y and of y on x for the following two series X and Y

X:  16  21  26  23  28  24  17  22  21
Y:  33  38  50  39  52  47  35  43  41

Computation Table

x     y     x²     y²      xy
16    33    256    1089    528
21    38    441    1444    798
26    50    676    2500    1300
23    39    529    1521    897
28    52    784    2704    1456
24    47    576    2209    1128
17    35    289    1225    595
22    43    484    1849    946
21    41    441    1681    861
∑x = 198   ∑y = 378   ∑x² = 4476   ∑y² = 16222   ∑xy = 8509

Regression equation of x on y : (x = a + by)

\displaystyle {\sum{{}}}x = Na + b\displaystyle {\sum{{}}}y     … (i)

\displaystyle {\sum{{}}}xy = a\displaystyle {\sum{{}}}y + b\displaystyle {\sum{{}}}y2  … (ii)

Putting the values in (i), we get

198 = 9a + 378b  … (iii)

Putting the values in (ii), we get,

8509 = 378a + 16222b  … (iv)

So,  74844 = 3402a + 142884b … (v) [multiplying (iii) by 378]

and, 76581 = 3402a + 145998b … (vi) [multiplying (iv) by 9]

So, 1737 = 3114b … (vii) [(vi) – (v)], or b = \displaystyle \frac{{1737}}{{3114}} = .56

Putting the value of b in (i), we get 198 = 9a + 378 x(.56), or 198 = 9a + 211.68

Or  9a = -13.68, or a= \displaystyle -\frac{{13.68}}{9}= -1.52

Regression equation  of x on y : (x = a + by)

or x = -1.52 + .56y, or x = .56y -1.52

Regression equation of y on x: (y = a + bx)

\displaystyle {\sum{{}}}y = Na + b\displaystyle {\sum{{}}}x   … (i)

\displaystyle {\sum{{}}}xy = a\displaystyle {\sum{{}}}x + b\displaystyle {\sum{{}}}x2 … (ii)

Putting the values in (i), we get,

378 = 9a + 198b  … (iii)

Putting the values in (ii), we get,

8509 = 198a + 4476b … (iv)

So,  74844 = 1782a + 39204b  .. (v) [(iii) x 198]

and 76581 = 1782a + 40284b … (vi) [ (iv) x 9]

So, 1737 = 1080b [(vi) – (v)], or b= \displaystyle \frac{{1737}}{{1080}} =1.61

Putting the value of b in (iii), we get 378 = 9a + (198 x 1.61), or 9a = 378 – 318.78, or 9a = 59.22, or a = 6.58

Regression equation of y on x: y = a + bx, or y = 1.61x + 6.58
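Solving the pair of normal equations reduces to the closed form b = (N∑xy − ∑ind∑dep)/(N∑ind² − (∑ind)²) and a = (∑dep − b∑ind)/N, which a short Python sketch can verify (illustrative; the intercepts it yields differ slightly from the hand-worked a values, because the example rounds b to two decimals before solving for a):

```python
def fit_line(dep, ind):
    """Solve the two normal equations for the line dep = a + b*ind."""
    n = len(dep)
    b = ((n * sum(u * v for u, v in zip(ind, dep)) - sum(ind) * sum(dep))
         / (n * sum(u * u for u in ind) - sum(ind) ** 2))
    a = (sum(dep) - b * sum(ind)) / n
    return a, b

x = [16, 21, 26, 23, 28, 24, 17, 22, 21]
y = [33, 38, 50, 39, 52, 47, 35, 43, 41]

a_xy, b_xy = fit_line(x, y)  # regression of x on y
a_yx, b_yx = fit_line(y, x)  # regression of y on x
print(round(b_xy, 2), round(b_yx, 2))  # 0.56 1.61
```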

Deviation from Actual Means

Deviation from Actual Means is computed using two regression equations (X on Y and Y on X), developed in a modified form from the deviation figures of the two variables from their respective actual Means, rather than their actual values.

Regression equation of X on Y : X = \displaystyle \overline{X} + bxy ( Y –\displaystyle \overline{Y} ) or X – \displaystyle \overline{X} = bxy ( Y – \displaystyle \overline{Y} )

Regression equation of Y on X : Y = \displaystyle \overline{Y}+ byx ( X – \displaystyle \overline{X} ) or Y – \displaystyle \overline{Y}= byx ( X – \displaystyle \overline{X} )

where, X= given value of variable,  Y= given value of variable, \displaystyle \overline{X} = arithmetic average of variable X,  \displaystyle \overline{Y} = arithmetic average of variable Y, r is correlation co-efficient

bxy = regression coefficient of X on Y = r\displaystyle \sigma x/\displaystyle \sigma y; byx = regression coefficient of Y on X = r\displaystyle \sigma y/\displaystyle \sigma x

Ex. : Regression Analysis – Deviation from Actual Means

Using the method of deviations from the actual means, find: 1. the two regression equations, 2. the correlation coefficient, 3. the most probable value of Y when X = 30

X:  25  28  35  32  31  36  29  38  34  32
Y:  43  46  49  41  36  32  31  30  33  39

Computation Table

X     Y     x (X − 32)   y (Y − 38)   x²    y²     xy
25    43    -7           5            49    25     -35
28    46    -4           8            16    64     -32
35    49    3            11           9     121    33
32    41    0            3            0     9      0
31    36    -1           -2           1     4      2
36    32    4            -6           16    36     -24
29    31    -3           -7           9     49     21
38    30    6            -8           36    64     -48
34    33    2            -5           4     25     -10
32    39    0            1            0     1      0
∑X = 320   ∑Y = 380   ∑x = 0   ∑y = 0   ∑x² = 140   ∑y² = 398   ∑xy = -93

Regression equation of X on Y

\displaystyle X=\overline{X}+r\displaystyle \sigma x/\displaystyle \sigma y\displaystyle \left( {Y-\overline{Y}} \right)

Putting the values, we get the value of \displaystyle \overline{X} & \displaystyle \overline{Y} as follows

\displaystyle \overline{X}=\frac{{\sum{X}}}{N}=\frac{{320}}{{10}}=32 \displaystyle \overline{Y}=\frac{{\sum{Y}}}{N}=\frac{{380}}{{10}}=38

Putting the values, we get the value of \displaystyle \sigma x & \displaystyle \sigma y, as follows

\displaystyle \sigma x=\displaystyle \sqrt{{\frac{{\sum{{{{x}^{2}}}}}}{N}}}=\sqrt{{\frac{{140}}{{10}}}}=3.74 (approx.)

\displaystyle \sigma y=\displaystyle \sqrt{{\frac{{\sum{{{{y}^{2}}}}}}{N}}}=\sqrt{{\frac{{398}}{{10}}}}=6.31 (approx.)

Putting the values, we get the value of r, as follows

r=\displaystyle {\sum{{xy}}}/N\displaystyle \sigma x\displaystyle \sigma y\displaystyle =\frac{{-93}}{{10\times 3.74\times 6.31}}=\frac{{-93}}{{235.99}}=-0.394

Putting the respective values, we get  the Regression equation of X on Y, as :

X = 32 + (-0.394) x \displaystyle \frac{{3.74}}{{6.31}} x (Y – 38) = 32 + [-0.2337 x (Y – 38)]

= 32 + 8.8806 – 0.2337Y = 40.8806 – 0.2337Y

So, the Regression equation of X on Y is : X= 40.8806 – 0.2337Y

Regression equation of Y on X

Y=\displaystyle \overline{Y}+r\displaystyle \sigma y/\displaystyle \sigma x\displaystyle \left( {X-\overline{X}} \right)

\displaystyle Y=38+\left( {-0.394} \right)\times \frac{{6.31}}{{3.74}}\left( {X-32} \right)

Or, Y = 38 – 0.6643 (X – 32) = 38 + 21.2576 – 0.6643X = 59.2576 – 0.6643X

So, the Regression equation of Y on X is Y = 59.2576 – 0.6643X
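Both regression coefficients can be reproduced in a few lines, using the identities bxy = ∑xy/∑y² and byx = ∑xy/∑x² on deviations about the actual means (an illustrative sketch):

```python
def regression_coefficients(x, y):
    """b_xy and b_yx from deviations about the actual means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [v - mx for v in x]
    dy = [v - my for v in y]
    sxy = sum(a * b for a, b in zip(dx, dy))
    # b_xy = sum(xy)/sum(y^2),  b_yx = sum(xy)/sum(x^2)
    return sxy / sum(b * b for b in dy), sxy / sum(a * a for a in dx)

X = [25, 28, 35, 32, 31, 36, 29, 38, 34, 32]
Y = [43, 46, 49, 41, 36, 32, 31, 30, 33, 39]
b_xy, b_yx = regression_coefficients(X, Y)
print(round(b_xy, 4), round(b_yx, 4))  # -0.2337 -0.6643
# most probable Y when X = 30: Y = 38 + b_yx * (30 - 32)
print(round(38 + b_yx * (30 - 32), 2))  # 39.33
```

This also answers part 3 of the exercise: the most probable value of Y at X = 30 is about 39.33.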

Coefficient of Regression

Coefficient of Regression determines the value by which one variable increases for a unit increase in other variable.

Coefficient of regression of X on Y = bxy = r\displaystyle \sigma x/\displaystyle \sigma y

Coefficient of regression of Y on X = byx = r\displaystyle \sigma y/\displaystyle \sigma x

Where, r = coefficient of correlation, \displaystyle \sigma x = standard deviation of series x, \displaystyle \sigma y = standard deviation of series y

The co-efficient of Regression is also given by

\displaystyle {{b}_{{xy}}}=\frac{{N\sum{{XY-\sum{{X.\sum{Y}}}}}}}{{N\sum{{{{Y}^{2}}-{{{\left( {\sum{Y}} \right)}}^{2}}}}}} \displaystyle {{b}_{{yx}}}=\frac{{N\sum{{XY-\sum{{X.\sum{Y}}}}}}}{{N\sum{{{{X}^{2}}-{{{\left( {\sum{X}} \right)}}^{2}}}}}}

Where, X = the given value of X variable, Y = the given value of Y variable, N = number of pairs of observation. All other factors carry the same meanings as given above

Ex. Find Coefficient of Regression of  X on Y and of Y on X

Sales Promotion Exp. X (Thousands):  11  8  9  5  8  9  20
Sales Y (Lacs):                      10  8  6  5  9  7  11

Computation Details

X     Y     x (X − 10)   y (Y − 8)   x²    y²    xy
11    10    1            2           1     4     2
8     8     -2           0           4     0     0
9     6     -1           -2          1     4     2
5     5     -5           -3          25    9     15
8     9     -2           1           4     1     -2
9     7     -1           -1          1     1     1
20    11    10           3           100   9     30
∑X = 70   ∑Y = 56   ∑x = 0   ∑y = 0   ∑x² = 136   ∑y² = 28   ∑xy = 48

So,\displaystyle \overline{X}=\frac{{\sum{X}}}{N}=\frac{{70}}{7}=10, \displaystyle \overline{Y}=\frac{{\sum{Y}}}{N}=\frac{{56}}{7}=8

Regression Co-efficient  of X on Y = bxy = \displaystyle \frac{{\sum{{xy}}}}{{\sum{{{{y}^{2}}}}}}=\frac{{48}}{{28}} = 1.71

Regression Co-efficient  of Y on X = byx = \displaystyle \frac{{\sum{{xy}}}}{{\sum{{{{x}^{2}}}}}}=\frac{{48}}{{136}}=.353
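The two coefficients can be checked with the same deviation identities (an illustrative sketch; 10 and 8 are the actual means here, since ∑x = ∑y = 0 in the table above):

```python
promo = [11, 8, 9, 5, 8, 9, 20]  # sales promotion expenditure X
sales = [10, 8, 6, 5, 9, 7, 11]  # sales Y

dx = [v - 10 for v in promo]     # deviations from the mean of X
dy = [v - 8 for v in sales]      # deviations from the mean of Y
sxy = sum(a * b for a, b in zip(dx, dy))

b_xy = sxy / sum(b * b for b in dy)  # 48/28
b_yx = sxy / sum(a * a for a in dx)  # 48/136
print(round(b_xy, 2), round(b_yx, 3))  # 1.71 0.353
```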
