Did you know that simple regression analysis can be used for many purposes in business? Forecasting future opportunities and risks is one of its major business applications. In addition, companies use linear regression models to optimize their business processes by reducing massive amounts of raw data to actionable information.
What is simple regression analysis
Simple regression analysis is a statistical tool for quantifying the relationship between a single independent variable and a single dependent variable, based on past observations. In plain terms, a simple linear regression analysis can show how a change in the operating hours of an organization’s production machine (the independent variable) results in a change in the organization’s electricity cost (the dependent variable).
The Simple Linear Regression Model
The simple linear regression model is expressed by the same equation as the simple regression formula:
y = β0 + β1X + ε
The simple linear regression model describes the relationship between one independent variable and one dependent variable. The model is called simple because it contains just a single independent variable. When there is more than one independent variable, it becomes a multiple linear regression model.
In the simple linear regression model, y is the study or dependent variable and X is the explanatory or independent variable. The terms β0 and β1 are the parameters of the model: β0 is the intercept term and β1 is the slope parameter. Together, these parameters are known as the regression coefficients.
The term ‘ε’ is the unobservable error that accounts for the failure of the data to lie exactly on the straight line. It represents the difference between the observed and the true realization of ‘y’.
Several factors contribute to these differences: some variables may be qualitative, the observations carry inherent randomness, and the variables omitted from the model have effects of their own. It is therefore assumed that ε is an independent and identically distributed random variable with mean zero and constant variance σ². Later, it will also be assumed that ε is normally distributed.
The independent variable in the linear regression model is treated as controlled by the experimenter. This is why it is regarded as non-stochastic, whereas y is regarded as a random variable with:
E(y) = β0 + β1X
and
Var(y) = σ².
In some cases, X can itself be a random variable. In these situations, rather than the mean and variance of y, we consider the conditional mean of y given X = x,
E(y|x) = β0 + β1x
and the conditional variance of y given X = x,
Var(y|x) = σ².
Hence, the simple regression model is completely specified when the values of β0, β1 and σ² are known. In practice, the parameters β0, β1 and σ² are generally not known, and ε is unobserved. Determining the statistical model y = β0 + β1X + ε therefore comes down to estimating β0, β1 and σ². To estimate these unknown parameters, n pairs of observations (xᵢ, yᵢ) (i = 1, …, n) on (X, y) are collected.
Different estimation methods can be used to obtain estimates of the parameters. The most popular are least squares estimation and maximum likelihood estimation.
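As a rough illustration of least squares estimation, the sketch below simulates data from the model and computes the estimates with the standard closed-form formulas, then compares them with R's built-in lm(). The sample size, the "true" parameter values, and all variable names are assumptions chosen for this example only.

```r
# Simulate data from y = beta0 + beta1 * x + error.
# All numeric values below are illustrative assumptions, not real data.
set.seed(42)
n     <- 200
beta0 <- 2.0
beta1 <- 0.5
sigma <- 1.0
x <- runif(n, min = 0, max = 10)
y <- beta0 + beta1 * x + rnorm(n, mean = 0, sd = sigma)

# Closed-form least squares estimates of the slope and intercept.
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Estimate of the error variance, using n - 2 degrees of freedom.
resid_ls   <- y - (b0 + b1 * x)
sigma2_hat <- sum(resid_ls^2) / (n - 2)

# lm() fits the same model by least squares.
fit <- lm(y ~ x)

print(c(intercept = b0, slope = b1, sigma2 = sigma2_hat))
print(coef(fit))
```

When the errors are normally distributed, the maximum likelihood estimates of β0 and β1 coincide with the least squares estimates, so the same numbers serve for both methods here.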
How to Perform a Simple Regression Analysis
The most common way people perform a simple regression analysis is by using statistical programs to enable fast analysis of the data.
Performing the simple linear regression in R
R is a statistical program that can be used to carry out a simple linear regression analysis. It is widely used, powerful, and free. Here’s how it works.
First, load the income.data dataset into your R environment. Then run the command below to create a linear model that describes the relationship between happiness and income.
R code for simple linear regression
income.happiness.lm <- lm(happiness ~ income, data = income.data)
This code takes the collected data (“data = income.data”) and estimates the effect that the independent variable “income” has on the dependent variable “happiness” by fitting a linear model with the lm() function.
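If you do not have the income.data dataset at hand, the command can still be tried on a simulated stand-in. The sketch below is only an assumption about what such a dataset might look like: the column names match the formula above, but the values are randomly generated and are not the real data.

```r
# Hypothetical stand-in for income.data: simulated values, NOT the real dataset.
# Income is assumed to be in units of $10,000 and happiness on a 1-10 scale.
set.seed(1)
n_obs     <- 500
income    <- runif(n_obs, min = 1.5, max = 7.5)
happiness <- 0.2 + 0.71 * income + rnorm(n_obs, mean = 0, sd = 0.7)
income.data <- data.frame(income = income, happiness = happiness)

# Fit the linear model exactly as in the command above.
income.happiness.lm <- lm(happiness ~ income, data = income.data)
print(coef(income.happiness.lm))
```

Because the data are simulated, the fitted coefficients will be close to, but not identical to, the values used to generate them.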
How to interpret the results
To view the outcome of the model, you can make use of the “summary()” function in R:
summary(income.happiness.lm)
This function takes the most important parameters from the linear model and places them in a table.
The result table first repeats the formula that was used to generate the results (‘Call’). It then summarizes the model residuals (‘Residuals’), which gives insight into how well the model fits the original data.
Next comes the ‘Coefficients’ table. The first row gives the estimate of the y-intercept, while the second row gives the regression coefficient of the model.
The first row of the table is labeled “(Intercept)”. This is the y-intercept of the regression equation, with a value of 0.20. You can plug this into your regression equation if you want to predict values of happiness across the range of income that you have analyzed:
happiness = 0.20 + 0.71 * income ± 0.018
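As a small sketch of how this equation can be used for prediction, the function below plugs income values into it. The coefficients 0.20 and 0.71 are the ones quoted in the text; the function name and the example incomes are assumptions made for illustration, with income measured in units of $10,000.

```r
# Coefficients as quoted in the text.
b0 <- 0.20   # intercept
b1 <- 0.71   # slope for income

# Hypothetical helper: predicted happiness for a given income
# (income in units of $10,000).
predict_happiness <- function(income) {
  b0 + b1 * income
}

# Example incomes of $20,000, $40,000 and $60,000.
example_incomes <- c(2, 4, 6)
predicted <- predict_happiness(example_incomes)
print(predicted)
```

Predictions of this kind are only meaningful within the range of incomes that was actually analyzed.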
The next row in the ‘Coefficients’ table is income. This row explains the estimated effect of income on reported happiness.
The “Estimate” column is the estimated effect, also called the regression coefficient (not to be confused with R², which measures how much of the variance the model explains). The number in the table (0.713) tells us that for every one-unit increase in income (where one unit of income equals $10,000), there is a corresponding 0.71-unit increase in reported happiness (where happiness is measured on a scale of 1 to 10).
The “Std. Error” column gives the standard error of the estimate. This number shows how much variation there is in our estimate of the relationship between income and happiness.
The test statistic is displayed in the “t value” column. Unless you specify otherwise, the test statistic used in linear regression is the t-value from a two-sided t-test. The larger the absolute value of the test statistic, the less likely it is that our results occurred by chance.
The “Pr(>|t|)” column gives the p-value. This figure shows the probability of observing an effect of income on happiness at least as large as the estimated one if the null hypothesis of no effect were true.
Since the p-value is very low (p < 0.001), we can reject the null hypothesis and conclude that income has a statistically significant effect on happiness.
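The columns of the ‘Coefficients’ table are related to one another, and this can be checked directly. The sketch below uses R's built-in cars dataset (stopping distance versus speed) as a stand-in, since it ships with R; the variable names are chosen for this example only.

```r
# Fit a simple linear regression on the built-in 'cars' dataset
# (used here only as a convenient stand-in).
fit <- lm(dist ~ speed, data = cars)

# coef(summary(fit)) returns the 'Coefficients' table as a matrix.
coef_table <- coef(summary(fit))
print(coef_table)

# The t value is the estimate divided by its standard error.
est     <- coef_table["speed", "Estimate"]
se      <- coef_table["speed", "Std. Error"]
t_value <- est / se

# The two-sided p-value comes from the t distribution
# with the model's residual degrees of freedom.
p_value <- 2 * pt(abs(t_value), df = df.residual(fit), lower.tail = FALSE)

print(c(t_value = t_value, p_value = p_value))
```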
The last three lines of the model summary are statistics about the model as a whole. The most important thing to note here is the model’s p-value. It is significant here (p < 0.001), meaning that the model fits the observed data significantly better than a model with no predictor.
Presentation of results
When reporting the results, include the p-value, the standard error of the estimate, and the estimated effect (that is, the regression coefficient). You should also interpret your numbers so that readers can clearly see what the regression coefficient means.
Result
There was a significant relationship (p < 0.001) between income and happiness (regression coefficient = 0.71 ± 0.018), with a 0.71-unit increase in reported happiness for every $10,000 increase in income.
In addition, it is good practice to include a graph with your results. For a simple linear regression, plot the observations with the independent variable on the x-axis and the dependent variable on the y-axis, then add the regression line and the regression equation.
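One possible way to draw such a graph in base R is sketched below. It uses the built-in cars dataset as a stand-in and writes the plot to a temporary PDF file so that it also runs in non-interactive sessions; the file location, labels, and colors are assumptions for illustration.

```r
# Fit the model on the built-in 'cars' dataset (a stand-in for your own data).
fit <- lm(dist ~ speed, data = cars)

# Write the plot to a temporary PDF; in an interactive session you can
# drop the pdf() and dev.off() calls to draw on screen instead.
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)

# Scatter plot of the observations.
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)",
     main = "Simple linear regression")

# Add the fitted regression line.
abline(fit, col = "blue", lwd = 2)

# Add the regression equation as a label in the top-left corner.
eq_label <- sprintf("y = %.2f + %.2f x", coef(fit)[1], coef(fit)[2])
legend("topleft", legend = eq_label, bty = "n")

invisible(dev.off())
```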
Simple linear regression formula
The formula for a simple linear regression is
y = β0 + β1X + ε
Key Parts of Simple Regression Analysis
R²
This is a measure of association. It represents the percentage of the variance in the values of Y that can be explained by knowing the value of X. R² ranges from a minimum of 0.0 (where none of the variance is explained) to a maximum of 1.0 (where all of the variance is explained).
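To make the definition concrete, the sketch below computes R² by hand as one minus the ratio of the unexplained variation to the total variation, and compares it with the value reported by summary(). The built-in cars dataset and the variable names are stand-ins chosen for this example.

```r
# Fit a simple linear regression on the built-in 'cars' dataset.
fit <- lm(dist ~ speed, data = cars)

# Residual (unexplained) and total sums of squares.
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

# R-squared: the share of the variance in Y explained by X.
r2_manual <- 1 - ss_res / ss_tot

# In simple linear regression, R-squared equals the squared
# correlation between X and Y.
r2_cor <- cor(cars$speed, cars$dist)^2

print(c(manual = r2_manual, from_cor = r2_cor,
        from_summary = summary(fit)$r.squared))
```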
S.e.b
This refers to the standard error of the estimated value of b. A t-test for the statistical significance of the coefficient is carried out by dividing the value of b by its standard error. As a rule of thumb, a t-value greater than 2.0 in absolute value is usually statistically significant, but you should consult a t-table to be sure.
If the t-value indicates that the b coefficient is statistically significant, the independent variable X should be retained in the regression equation, because it has a statistically significant relationship with the dependent variable Y. If the relationship is not statistically significant, the b coefficient cannot be distinguished from zero, statistically speaking.
F
This is a test of the statistical significance of the entire regression equation. It is computed by dividing the explained variance by the unexplained variance. As a rule of thumb, an F-value greater than 4.0 is usually statistically significant, but you should consult an F-table to be sure. If F is significant, the regression equation helps us understand the relationship between X and Y.
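The F statistic can be reproduced by hand, as sketched below with the built-in cars dataset as a stand-in. Each sum of squares is divided by its degrees of freedom (1 for the regression, n - 2 for the residuals) before taking the ratio; the variable names are assumptions for this example.

```r
# Fit a simple linear regression on the built-in 'cars' dataset.
fit <- lm(dist ~ speed, data = cars)
n   <- nrow(cars)

# Explained (regression) and unexplained (residual) sums of squares.
ss_reg <- sum((fitted(fit) - mean(cars$dist))^2)
ss_res <- sum(residuals(fit)^2)

# F = explained mean square / unexplained mean square.
f_value <- (ss_reg / 1) / (ss_res / (n - 2))

# p-value from the F distribution with 1 and n - 2 degrees of freedom.
f_p_value <- pf(f_value, df1 = 1, df2 = n - 2, lower.tail = FALSE)

# With a single predictor, F equals the square of the slope's t value.
t_value <- coef(summary(fit))["speed", "t value"]

print(c(F = f_value, t_squared = t_value^2, p = f_p_value))
```

Instead of looking up an F-table, the p-value from pf() can be compared directly with the chosen significance level.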
Assumptions of simple linear regression
- Homogeneity of variance: this is also known as homoscedasticity. It means that the size of the error in our prediction does not change significantly across the values of the independent variable.
- Independence of observations: the observations in the dataset were collected using statistically valid sampling methods, and there are no hidden relationships among observations.
- Normality: the errors, and therefore the values of the dependent variable around the regression line, follow a normal distribution.
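Some of these assumptions can be checked informally from the residuals of a fitted model. The sketch below is one possible approach, using the built-in cars dataset as a stand-in: a residuals-versus-fitted plot for homogeneity of variance, and a normal Q-Q plot plus a Shapiro-Wilk test for normality. Independence usually has to be judged from how the data were collected rather than from a plot.

```r
# Fit a simple linear regression on the built-in 'cars' dataset.
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)

# Write diagnostic plots to a temporary PDF so the code also runs
# non-interactively; drop pdf()/dev.off() to draw on screen.
diag_file <- tempfile(fileext = ".pdf")
pdf(diag_file)

# Homogeneity of variance: the spread of the residuals should look
# roughly constant across the fitted values.
plot(fitted(fit), res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Normality: points should fall close to the reference line.
qqnorm(res)
qqline(res)

invisible(dev.off())

# Shapiro-Wilk test: a small p-value suggests departure from normality.
sw <- shapiro.test(res)
print(sw)
```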
Simple Linear Regression Examples
Here is a scenario that illustrates how simple regression analysis can be applied.
Let us assume the average speed is 75 mph when 2 highway patrols are deployed, and 35 mph when 10 highway patrols are deployed. The question is: what is the average speed of cars on the freeway when 5 highway patrols are deployed?
The slope is the change in speed divided by the change in patrols, (35 - 75) / (10 - 2) = -5, and the intercept follows from either point, 75 - (-5)(2) = 85. Using our simple regression formula, we therefore obtain the equation Y = 85 + (-5)X, where:
Y = the average speed of cars on the highway
A = 85, the average speed when X = 0
B = (-5), the impact on Y of each extra patrol car deployed
X = the number of patrols deployed
Therefore, the average speed of cars on the highway when there are zero highway patrols operating (X = 0) will be 85 mph. For every extra highway patrol car working, the average speed drops by 5 mph. Hence, for 5 patrol cars (X = 5), we have Y = 85 + (-5)(5) = 85 – 25 = 60 mph.
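The same calculation can be reproduced in R. The sketch below enters the two data points from the example, fits a line through them with lm(), and predicts the speed for 5 patrols; the data frame and column names are assumptions chosen for illustration.

```r
# The two observations from the example above.
obs <- data.frame(
  patrols = c(2, 10),   # number of highway patrols deployed
  speed   = c(75, 35)   # average speed in mph
)

# Slope and intercept by hand.
slope     <- (obs$speed[2] - obs$speed[1]) / (obs$patrols[2] - obs$patrols[1])
intercept <- obs$speed[1] - slope * obs$patrols[1]

# The same line fitted with lm(); with only two points the fit is exact.
fit <- lm(speed ~ patrols, data = obs)

# Predicted average speed when 5 patrols are deployed.
speed_at_5 <- predict(fit, newdata = data.frame(patrols = 5))

print(c(intercept = intercept, slope = slope))
print(unname(speed_at_5))
```

With only two observations the line passes through both points exactly, so there is no residual variation and no way to judge uncertainty; a real analysis would need many more observations.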
Limits of Simple Linear Regression
Even the best data does not give perfection. Simple linear regression analysis is widely used in research to describe the relationship between variables. However, since correlation does not imply causation, a relationship between two variables does not mean that one causes the other. In fact, a regression line that describes the data points well may not reflect a cause-and-effect relationship at all.
A simple regression analysis will tell you whether a relationship exists between the variables. Further statistical analysis and research are needed to determine exactly what the relationship is, and whether one variable causes the other.
Final Thoughts
In all, today’s businesses should consider simple regression analysis if they need a tool that supports management decisions and helps identify errors in judgment. With proper analysis, the large amounts of data that businesses have accumulated over time can yield valuable insights.