# How to test linear relationship between variables

### Regression Slope Test

In the Module Notes we covered Steps 1 - 3 of regression and correlation analysis for the simple linear regression model. In those steps, we learned about the form of the model. Use linear regression or correlation when you want to know whether one variable is associated with another. This lesson describes how to conduct a hypothesis test to determine whether there is a significant linear relationship between an independent variable X and a dependent variable Y.

Next, we find the total variation by finding the difference between the actual value of Y and the average value of Y for each observation in the data set. The computer program squares this difference for the first observation and for all of the other observations and sums them up; the value of this squared variation can be seen in Worksheet 2. Since we use the regression model to compute the estimate, some refer to this standard error as the Standard Error of the Regression Model.

### Correlation and Regression

The interpretation is similar to the interpretation of the standard deviation of an observation and the standard error of the mean, as we learned in Module 1. So, if we predict a value for external hours, the actual value could fall anywhere within a range of roughly one standard error on either side of that prediction. This is really important: many times in regression analysis, people make a prediction and go with it without ever looking at the standard error.

This measure of practical utility gives us an indication of how reliable the regression model will be. The tough part is that neither I nor any text can give you a good benchmark; it is a management call how much error is acceptable. Obviously, there will be error, since not every observation in a sample of data falls on the regression line. For the example above, for an actual value of Y at the high end of the range, the error percent is 6 percent.

For the average value of Y (in hours), the error percent is 8 percent. For planning purposes, this range of error may be tolerable.

For precise prediction purposes, being off by up to 13 percent may not be tolerable. Oftentimes, we can use the standard error as a comparison tool. Let's say we ran another model with a different independent variable and got a different standard error: it would be much better to have an error of 45 on a prediction than a larger error. The point is, without the standard error as a measure of the average error of the prediction, we would not be able to compare models.
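The error-percent comparison described above can be sketched as follows. The standard errors and predicted value below are hypothetical illustrations, not the worksheet's actual figures:

```python
# Hedged sketch: comparing two regression models by their standard errors.
# All numeric values here are hypothetical, not taken from Worksheet 2.

def error_percent(std_error: float, y_value: float) -> float:
    """Express the standard error as a percentage of a Y value."""
    return 100.0 * std_error / y_value

# Suppose model A has a standard error of 45 hours and model B of 90 hours
# (hypothetical), and we predict Y = 750 hours.
model_a_pct = error_percent(45.0, 750.0)   # 6.0 percent
model_b_pct = error_percent(90.0, 750.0)   # 12.0 percent

# The model with the smaller standard error gives tighter predictions.
better = "A" if model_a_pct < model_b_pct else "B"
```

The comparison only works because both standard errors are in the same units as Y; that is what makes the standard error a useful cross-model yardstick.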

To compute the standard error of the estimate, the computer program first finds the error, which is also called the residual, for each observation in the data set.

The error is the difference between the actual value of Y and the predicted value of Y, or Y - ŷ. To illustrate, for the first observation the error is the actual value of Y (external hours) minus the predicted value of Y. In a similar manner, the errors are computed for each observation, then squared, then summed to get the Sum of Squares Error (SSE).

SSE is a measure of the unexplained variation in the regression; it is the variation around the regression line. To get the standard error of the estimate, the computer program divides SSE by the sample size minus 2, to adjust for the degrees of freedom in simple regression, and then takes the square root. Regression models that have lower standard errors and higher R²'s have greater practical utility compared to models with higher standard errors and lower R²'s.
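The residual, SSE, and standard-error computation described above can be sketched in a few lines. The data and fitted values below are hypothetical illustrations:

```python
import math

# Hedged sketch of how a program computes the standard error of the estimate
# for simple regression. Data and fitted values are hypothetical.

def regression_error_stats(y_actual, y_predicted):
    """Return (SSE, standard error of the estimate)."""
    # Residual (error) for each observation: Y - y-hat
    residuals = [ya - yp for ya, yp in zip(y_actual, y_predicted)]
    sse = sum(e ** 2 for e in residuals)   # square and sum: Sum of Squares Error
    n = len(y_actual)
    s = math.sqrt(sse / (n - 2))           # divide by n - 2 (df), take square root
    return sse, s

y = [10.0, 12.0, 15.0, 19.0, 24.0]         # hypothetical actual Y values
y_hat = [11.0, 12.5, 15.5, 18.5, 22.5]     # hypothetical fitted values
sse, s = regression_error_stats(y, y_hat)
```

Note the `n - 2` divisor: simple regression estimates two parameters (intercept and slope), so two degrees of freedom are used up.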

However, these are judgment calls rather than precise statistical standards. The important thing is that analysts have an ethical obligation to report the standard error and R² values to their audiences. Did you note that low standard errors are associated with high R²'s, and vice versa? This is simply because, in regression models where the data are tightly grouped around the regression line, there is little error and X has high predictive value: movements in X result in predictable movements in Y.

This can also be explained by the equation for R², Equation 2: lower SSE results in lower standard errors and higher R².
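The connection can be sketched with the conventional textbook definitions (the original's Equation 2 is not reproduced in this excerpt, so these are the standard forms):

```latex
SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2, \qquad
SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2, \qquad
R^2 = 1 - \frac{SSE}{SST}, \qquad
s = \sqrt{\frac{SSE}{n - 2}}
```

Holding total variation SST fixed, a lower SSE simultaneously raises R² and lowers the standard error s, which is why the two measures move together.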

### Statistics review 7: Correlation and regression

Test the Statistical Utility of the Regression Model

There are two inferential methods of testing the statistical utility of the regression model: the F test and the t test. The parameter of interest in determining whether a regression is statistically significant or useful is the slope.

Testing a Hypothesis for a Population Slope

The five-step process for testing a hypothesis for a population slope is identical to that for testing a hypothesis for a population mean; we just change the parameter from the mean to the slope.

State Null and Alternative Hypotheses

The null and alternative hypotheses in regression are H₀: β₁ = 0 versus Hₐ: β₁ ≠ 0. Under the null hypothesis, the regression line is horizontal, meaning that Y does not change when X changes. In that case, we say that the regression model is not statistically useful.

On the other hand, if we reject the null hypothesis in favor of the alternative, then we are really saying that changes in X result in predictable changes in Y, be they positive or negative. The relationship between X and Y has statistical utility, or the regression model is statistically useful.

Determine and Compute the Test Statistic

There are two test statistics for testing a regression model. The first is the F statistic, which is the ratio of the average (mean) variation attributed to regression to the average (mean) variation attributed to error, or residual.

Look in the row titled "Regression" and the column titled "F" in Worksheet 2 to find the F value. The farther this value is from 1, the more significant the regression model. The F statistic is then used to test the regression model. The other test statistic is the t statistic. Look at the row labeled "Assets" in Worksheet 2; this row gives us information about the slope. The first value is the slope estimate itself, and the next value is the Standard Error of the Slope (not the standard error of the model, but of the slope).

In fact, what causes the standard error of the model (or estimate) is that the slope itself has error, or variability. The next value is the "t Stat" for the slope. This is a very large value: the slope, though small in absolute terms, is many standard errors away from zero. The t statistic is thus used to test a regression slope. The p-value for the t statistic is in the column titled "P-value", in the row titled "Assets".
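The slope, its standard error, and the resulting "t Stat" that a spreadsheet reports can be sketched as below. The data are hypothetical, not the "Assets" worksheet's values:

```python
import math

# Hedged sketch: computing the slope, its standard error, and the t statistic
# for simple regression, as a spreadsheet's coefficient row would show them.
# The x and y data below are hypothetical.

def slope_t_statistic(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                                   # slope estimate
    b0 = y_bar - b1 * x_bar                          # intercept estimate
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))                     # std error of the estimate
    se_b1 = s / math.sqrt(sxx)                       # std error of the slope
    return b1, se_b1, b1 / se_b1                     # t = slope / its std error

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b1, se_b1, t = slope_t_statistic(x, y)
```

A large |t| means the slope estimate is many of its own standard errors away from zero, which is exactly the evidence against H₀: β₁ = 0.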

Correlation Analysis

In correlation analysis, we estimate a sample correlation coefficient, more specifically the Pearson product-moment correlation coefficient. The correlation between two variables can be positive (i.e., higher levels of one variable are associated with higher levels of the other) or negative (i.e., higher levels of one variable are associated with lower levels of the other). The sign of the correlation coefficient indicates the direction of the association, and the magnitude of the correlation coefficient indicates the strength of the association.

A correlation close to zero suggests no linear association between two continuous variables. You may say that the correlation coefficient is a measure of the "strength of association", but if you think about it, isn't the slope a better measure of association? We use risk ratios and odds ratios to quantify the strength of association, i.e., the magnitude of the effect. The analogous quantity in correlation analysis is the slope, i.e., the expected change in Y per unit change in X. And r (or perhaps better, R-squared) is a measure of how much of the variability in the dependent variable can be accounted for by differences in the independent variable.

The analogous measure for a dichotomous variable and a dichotomous outcome would be the attributable proportion, i.e., the proportion of the outcome that can be attributed to the exposure. In any case, it is always important to evaluate the data carefully before computing a correlation coefficient.
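The Pearson product-moment coefficient discussed above can be sketched directly from its definition. The inputs here are toy data, purely for illustration:

```python
import math

# Hedged sketch of the Pearson product-moment correlation coefficient,
# computed from its definition. The example data are hypothetical.

def pearson_r(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)      # always lies between -1 and +1

# A perfectly linear increasing relationship gives r = 1;
# a perfectly linear decreasing relationship gives r = -1.
r_pos = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
r_neg = pearson_r([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Note that r is unitless, which is why it measures strength but not the size of the effect; the slope carries the units.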

Graphical displays are particularly useful to explore associations between variables. The figure below shows four hypothetical scenarios in which one continuous variable is plotted along the X-axis and the other along the Y-axis.

Scenario 3 might depict the lack of association (r ≈ 0) between the extent of media exposure in adolescence and the age at which adolescents initiate sexual activity.

Example: Correlation of Gestational Age and Birth Weight

A small study is conducted involving 17 infants to investigate the association between gestational age at birth, measured in weeks, and birth weight, measured in grams.

We wish to estimate the association between gestational age and infant birth weight. We introduce the technique here and expand on its uses in subsequent modules.

Simple Linear Regression

Simple linear regression is a technique appropriate for understanding the association between one independent (predictor) variable and one continuous dependent (outcome) variable.

In regression analysis, the dependent variable is denoted Y and the independent variable is denoted X.

When there is a single continuous dependent variable and a single independent variable, the analysis is called a simple linear regression analysis.

This analysis assumes that there is a linear association between the two variables. If a different relationship is hypothesized, such as a curvilinear or exponential relationship, alternative regression analyses are performed.

The figure below is a scatter diagram illustrating the relationship between BMI and total cholesterol. Each point represents the observed (x, y) pair, in this case BMI and the corresponding total cholesterol measured in each participant.

Note that the independent variable (BMI) is on the horizontal axis and the dependent variable (Total Serum Cholesterol) is on the vertical axis.
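Fitting the least-squares line through such a scatter of (x, y) points can be sketched as below. The BMI and cholesterol values are hypothetical, not the study's data:

```python
# Hedged sketch: fitting a simple least-squares line y-hat = b0 + b1 * x,
# as one would for BMI (x) versus total cholesterol (y).
# The data below are hypothetical illustrations.

def fit_line(x, y):
    """Return (intercept b0, slope b1) of the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / \
         sum((a - x_bar) ** 2 for a in x)
    b0 = y_bar - b1 * x_bar          # line passes through (x-bar, y-bar)
    return b0, b1

bmi = [22.0, 25.0, 28.0, 31.0]       # hypothetical BMI values
chol = [180.0, 195.0, 210.0, 225.0]  # hypothetical total cholesterol (mg/dL)
b0, b1 = fit_line(bmi, chol)
predicted = b0 + b1 * 27.0           # predicted cholesterol at BMI 27
```

The fitted slope b1 is interpreted as the expected change in cholesterol per one-unit increase in BMI, which ties the scatter diagram back to the association being estimated.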