What type of regression to use

Linear models are the oldest type of regression; they were designed so that statisticians could do the calculations by hand. However, ordinary least squares (OLS) has several weaknesses, including sensitivity to both outliers and multicollinearity, and it is prone to overfitting. To address these problems, statisticians have developed several advanced variants of linear regression.

Nonlinear regression also requires a continuous dependent variable, but it provides greater flexibility to fit curves than linear regression.

However, nonlinear models use an iterative algorithm rather than solving the problem directly with matrix equations, as the linear approach does. What this means for you is that you need to worry about which algorithm to use, specifying good starting values, and the possibility of either not converging on a solution or converging on a local minimum rather than the global minimum of the sum of squared errors (SSE). Most nonlinear models have one continuous independent variable, but it is possible to have more than one.
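Here is a minimal sketch of what iterative nonlinear fitting looks like in practice, using SciPy's curve_fit. The exponential-decay model and the starting values are illustrative assumptions, not a prescription:

```python
# Fitting a nonlinear model with an iterative algorithm (SciPy).
# The model form and starting values (p0) are hypothetical examples.
import numpy as np
from scipy.optimize import curve_fit

def exp_decay(x, a, b, c):
    # Assumed nonlinear model: y = a * exp(-b * x) + c
    return a * np.exp(-b * x) + c

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = exp_decay(x, 5.0, 0.8, 2.0) + rng.normal(0, 0.2, x.size)  # simulated data

# Good starting values matter: poor ones can leave the optimizer stuck
# in a local minimum or prevent convergence entirely.
params, cov = curve_fit(exp_decay, x, y, p0=[4.0, 1.0, 1.0])
print(params)  # estimated a, b, c
```

If curve_fit fails to converge, trying different starting values or a different model form is usually the first remedy.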

When you have one independent variable, you can graph the results using a fitted line plot. My advice is to fit a model using linear regression first and then determine whether the linear model provides an adequate fit by checking the residual plots.

I always recommend that you try OLS first because it is easier to perform and interpret. Linear and nonlinear regression both handle continuous dependent variables; the next several types handle categorical ones. A categorical variable has values that you can put into a countable number of distinct groups based on a characteristic.

Logistic regression transforms the dependent variable and then uses Maximum Likelihood Estimation, rather than least squares, to estimate the parameters.

Logistic regression describes the relationship between a set of independent variables and a categorical dependent variable. Choose the type of logistic model based on the type of categorical dependent variable you have.

Use binary logistic regression to understand how changes in the independent variables are associated with changes in the probability of an event occurring. This type of model requires a binary dependent variable. A binary variable has only two possible values, such as pass and fail. Example: Political scientists assess the odds of the incumbent U.S. President winning reelection based on stock market performance.
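Here is a minimal sketch of a binary logistic model in Python using statsmodels. The pass/fail data and the hours-studied predictor are simulated assumptions for illustration:

```python
# Binary logistic regression with statsmodels on simulated pass/fail data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
hours = rng.uniform(0, 10, 200)            # hypothetical independent variable
p = 1 / (1 + np.exp(-(hours - 5)))         # true probability of passing
passed = rng.binomial(1, p)                # binary dependent variable (0/1)

X = sm.add_constant(hours)
model = sm.Logit(passed, X).fit(disp=0)
print(model.summary())
print(np.exp(model.params))  # coefficients are log-odds; exponentiate for odds ratios
```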

Ordinal logistic regression models the relationship between a set of predictors and an ordinal response variable. An ordinal response has at least three groups that have a natural order, such as hot, medium, and cold. Example: Market analysts want to determine which variables influence the decision to buy large, medium, or small popcorn at the movie theater.

Nominal logistic regression models the relationship between a set of independent variables and a nominal dependent variable. A nominal variable has at least three groups that do not have a natural order, such as scratch, dent, and tear.

Example: A quality analyst studies the variables that affect the odds of each type of product defect: scratches, dents, and tears.
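Here is a minimal sketch of a nominal (multinomial) logistic model for the defect example, using statsmodels' MNLogit. The line-speed predictor and the data are simulated assumptions:

```python
# Nominal (multinomial) logistic regression with statsmodels.
# Defect codes and the line-speed predictor are hypothetical.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
line_speed = rng.uniform(1, 5, 300)        # hypothetical predictor
defect = rng.integers(0, 3, 300)           # 0=scratch, 1=dent, 2=tear

X = sm.add_constant(line_speed)
model = sm.MNLogit(defect, X).fit(disp=0)
print(model.summary())  # coefficients are relative to the first (reference) category
```

For a truly ordinal response, statsmodels also offers OrderedModel, which respects the category ordering.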

If your dependent variable is a count of items, events, results, or activities, you might need to use a different type of regression model. Counts are nonnegative integers (0, 1, 2, etc.). Count data with higher means tend to be approximately normally distributed, and you can often use OLS. However, count data with smaller means can be skewed, and linear regression might have a hard time fitting these data. For these cases, there are several types of models you can use. Count data frequently follow the Poisson distribution, which makes Poisson regression a good possibility.

Poisson variables are a count of something over a constant amount of time, area, or another consistent length of observation. With a Poisson variable, you can calculate and assess a rate of occurrence. A classic example of a Poisson dataset comes from Ladislaus Bortkiewicz, a Russian economist, who analyzed annual deaths caused by horse kicks in the Prussian Army from 1875 to 1894.

Use Poisson regression to model how changes in the independent variables are associated with changes in the counts.

Poisson models are similar to logistic models because they use Maximum Likelihood Estimation and transform the dependent variable using the natural log. For example, you could model homicides per month.
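Here is a minimal sketch of Poisson regression with statsmodels. The homicides-per-month counts and the unemployment predictor are simulated assumptions:

```python
# Poisson regression via a GLM with a log link (statsmodels).
# The monthly homicide counts and predictor are simulated.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
unemployment = rng.uniform(3, 12, 120)     # hypothetical predictor
mu = np.exp(0.5 + 0.1 * unemployment)      # log(mean count) is linear in the predictor
homicides = rng.poisson(mu)                # count dependent variable

X = sm.add_constant(unemployment)
model = sm.GLM(homicides, X, family=sm.families.Poisson()).fit()
print(model.summary())  # exponentiated coefficients are rate ratios
```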

Regression techniques are among the most popular statistical techniques used for predictive modeling and data mining tasks. On average, analytics professionals know only two or three types of regression that are commonly used in the real world.

They are linear and logistic regression. But the fact is that there are more than 10 types of regression algorithms designed for various types of analysis. Each type has its own significance, and every analyst must know which form of regression to use depending on the type of data and its distribution.

What is Regression Analysis?

Let's take a simple example: suppose your manager asks you to predict annual sales. There can be hundreds of factors (drivers) that affect sales. In this case, sales is your dependent variable, and the factors affecting sales are independent variables. Regression analysis would help you to solve this problem. In simple words, regression analysis is used to model the relationship between a dependent variable and one or more independent variables. It helps us to answer the following questions:
- Which of the drivers have a significant impact on sales?
- Which is the most important driver of sales?
- How do the drivers interact with each other?
- What would the annual sales be next year?
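Here is a minimal sketch of the sales example as an OLS regression in statsmodels. The two drivers (ad_spend and price) and all the data are hypothetical:

```python
# OLS regression for the hypothetical annual-sales example.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "ad_spend": rng.uniform(10, 100, 50),   # hypothetical driver
    "price": rng.uniform(5, 20, 50),        # hypothetical driver
})
df["sales"] = 200 + 3 * df["ad_spend"] - 8 * df["price"] + rng.normal(0, 25, 50)

model = smf.ols("sales ~ ad_spend + price", data=df).fit()
print(model.summary())  # p-values show which drivers matter; coefficients show impact
```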

Missing Data
By default, a case that is missing a value on any variable in the analysis is excluded from the regression. But that case could be included in another regression, as long as it was not missing values on any of the variables included in that analysis.

You can change this option so that your regression analysis does not exclude cases that are missing data for any variable included in the regression, but then you might have a different number of cases for each variable.

Outliers
You also need to check your data for outliers, i.e., values that are extreme relative to the rest of the data.

If you feel that the cases that produced the outliers are not part of the same "population" as the other cases, then you might just want to delete those cases.

Alternatively, you might want to count those extreme values as "missing" but retain the case for its other variables. Or you could retain the outlier but reduce how extreme it is; specifically, you might want to recode the value so that it is the highest or lowest non-outlier value.
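Here is a minimal sketch of that recode-to-the-nearest-non-outlier idea (often called winsorizing). The 1st/99th percentile cutoffs are one common but arbitrary choice:

```python
# Recoding outliers to the nearest non-outlier value (winsorizing).
import numpy as np

rng = np.random.default_rng(10)
x = np.append(rng.normal(50, 5, 98), [150.0, -40.0])  # two extreme values added

low, high = np.percentile(x, [1, 99])  # cutoffs are an assumption, not a rule
x_recoded = np.clip(x, low, high)      # extremes pulled in to the cutoffs
print(x.min(), x.max(), "->", x_recoded.min(), x_recoded.max())
```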

Normality
You also want to check that your data are normally distributed. To do this, you can construct histograms and "look" at the data to see their distribution. Often the histogram will include a line that depicts what the shape would look like if the distribution were truly normal, and you can "eyeball" how much the actual distribution deviates from this line.

Such a histogram can show, for example, that age is normally distributed. You can also construct a normal probability plot. In this plot, the actual scores are ranked and sorted, and an expected normal value is computed and compared with an actual normal value for each case.

The expected normal value is the position a case with that rank would hold in a normal distribution; the actual normal value is the position it holds in the actual distribution. Basically, you would like to see your actual values lining up along the diagonal that goes from lower left to upper right. Such a plot can likewise show that age is normally distributed.
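Here is a minimal sketch of both checks, a histogram plus a normal probability (Q-Q) plot, for a simulated age variable:

```python
# Histogram and normal probability (Q-Q) plot for a simulated "age" variable.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
age = rng.normal(45, 12, 200)  # simulated, roughly normal data

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(age, bins=20)                       # eyeball the shape
axes[0].set_title("Histogram of age")
stats.probplot(age, dist="norm", plot=axes[1])   # points should hug the diagonal
plt.show()
```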

You can also test for normality within the regression analysis by looking at a plot of the "residuals," which will be explained in more detail in a later section. If the data are normally distributed, then residuals should be normally distributed around each predicted DV score. If the data and the residuals are normally distributed, the residuals scatterplot will show the majority of residuals at the center of the plot for each value of the predicted score, with some residuals trailing off symmetrically from the center.

You might want to do the residual plot before graphing each variable separately, because if the residuals plot looks good, then you don't need to do the separate plots. For example, in a residual plot of a regression where age of patient and time in months since diagnosis are used to predict breast tumor size, the data might not be perfectly normally distributed, in that the residuals above the zero line appear slightly more spread out than those below the zero line. Nevertheless, such data can still appear fairly normally distributed.

In addition to a graphic examination of the data, you can also statistically examine the data's normality. Specifically, statistical programs such as SPSS will calculate the skewness and kurtosis for each variable; an extreme value for either one would tell you that the data are not normally distributed. If any variable is not normally distributed, then you will probably want to transform it (which will be discussed in a later section). Checking for outliers will also help with the normality problem.
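Here is a minimal sketch of that numeric check in Python; SciPy's skewness and (excess) kurtosis are both near 0 for normal data. The variables are simulated:

```python
# Numeric normality check: skewness and excess kurtosis (near 0 if normal).
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
age = rng.normal(45, 12, 200)           # roughly normal
income = rng.lognormal(10, 0.8, 200)    # deliberately right-skewed

for name, values in [("age", age), ("income", income)]:
    print(name, "skew:", round(stats.skew(values), 2),
          "kurtosis:", round(stats.kurtosis(values), 2))
```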

Linearity
Regression analysis also has an assumption of linearity. Linearity means that there is a straight-line relationship between the IVs and the DV. This assumption is important because regression analysis only tests for a linear relationship between the IVs and the DV.

Any nonlinear relationship between the IV and DV is ignored. You can test for linearity between an IV and the DV by looking at a bivariate scatterplot, i.e., a graph with the IV on one axis and the DV on the other. If the two variables are linearly related, the scatterplot will be oval.

In a bivariate scatterplot of friends and happiness, you might see that friends is linearly related to happiness: the more friends you have, the greater your level of happiness. However, you could also imagine a curvilinear relationship between friends and happiness, such that happiness increases with the number of friends up to a point.

Beyond that point, however, happiness declines with a larger number of friends. You can also test for linearity by using the residual plots described previously. This is because if the IVs and DV are linearly related, then the relationship between the residuals and the predicted DV scores will be linear. Nonlinearity is demonstrated when most of the residuals are above the zero line on the plot at some predicted values, and below the zero line at other predicted values.

In other words, the overall shape of the plot will be curved, instead of rectangular. For example, a residuals plot produced when happiness was predicted from number of friends and age might show clearly that the data are not linear. A second residuals plot from the same kind of regression, again predicting happiness from friends and age, can look quite different.

In that second case, the data are linear. If your data are not linear, then you can usually make them linear by transforming the IVs or the DV so that there is a linear relationship between them. Sometimes transforming one variable won't work; the IV and DV are just not linearly related. If there is a curvilinear relationship between the DV and IV, you might want to dichotomize the IV, because a dichotomous variable can only have a linear relationship with another variable (if it has any relationship at all).

Alternatively, if there is a curvilinear relationship between the IV and the DV, then you might need to include the square of the IV in the regression (this is also known as a quadratic regression).
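Here is a minimal sketch of adding a squared term, using the friends-and-happiness example; the data are simulated to rise and then decline:

```python
# Quadratic regression: adding the square of the IV to capture curvature.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
df = pd.DataFrame({"friends": rng.integers(0, 30, 200).astype(float)})
df["happiness"] = (10 * df["friends"] - 0.4 * df["friends"] ** 2
                   + rng.normal(0, 10, 200))  # peaks, then declines

# I(friends**2) adds the quadratic term inside the formula.
model = smf.ols("happiness ~ friends + I(friends**2)", data=df).fit()
print(model.params)  # a negative squared-term coefficient indicates the downturn
```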

The failure of linearity in regression will not invalidate your analysis so much as weaken it; the linear regression coefficient cannot fully capture the extent of a curvilinear relationship. If there is both a curvilinear and a linear relationship between the IV and DV, then the regression will at least capture the linear relationship.

Homoscedasticity
The assumption of homoscedasticity is that the residuals are approximately equal for all predicted DV scores. Another way of thinking of this is that the variability in scores for your IVs is the same at all values of the DV. You can check homoscedasticity by looking at the same residuals plot talked about in the linearity and normality sections.

Data are homoscedastic if the residuals plot is the same width for all values of the predicted DV. Heteroscedasticity is usually shown by a cluster of points that grows wider as the values for the predicted DV get larger.

Alternatively, you can check for homoscedasticity by looking at a scatterplot between each IV and the DV. As with the residuals plot, you want the cluster of points to be approximately the same width all over. A residuals plot that is rectangular, with a concentration of points along the center, shows data that meet the assumptions of homoscedasticity, linearity, and normality.
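Here is a minimal sketch of checking this on a fitted model: a residuals-vs-predicted plot plus the Breusch-Pagan test, a numeric check that goes beyond what's described above. The data are simulated to be heteroscedastic:

```python
# Residuals-vs-fitted plot and Breusch-Pagan test for heteroscedasticity.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
import matplotlib.pyplot as plt

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, 200)
y = 2 + 3 * x + rng.normal(0, 1 + 0.5 * x, 200)  # spread grows with x

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

plt.scatter(fit.fittedvalues, fit.resid)  # want a band of roughly constant width
plt.axhline(0)
plt.xlabel("Predicted DV")
plt.ylabel("Residual")
plt.show()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print("Breusch-Pagan p-value:", lm_pvalue)  # small p suggests heteroscedasticity
```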

Heteroscedasticity may occur when some variables are skewed and others are not; thus, checking that your data are normally distributed should cut down on the problem of heteroscedasticity. Like the assumption of linearity, violation of the assumption of homoscedasticity does not invalidate your regression so much as weaken it.

Multicollinearity and Singularity
Multicollinearity is a condition in which the IVs are very highly correlated.

Multicollinearity and singularity can be caused by high bivariate correlations (usually .90 or above). High bivariate correlations are easy to spot by simply running correlations among your IVs.
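Here is a minimal sketch of both checks: a correlation matrix among the IVs, plus variance inflation factors (VIFs), a diagnostic not described above. The nearly duplicated x2 variable is a deliberate assumption to trigger the problem:

```python
# Spotting multicollinearity: IV correlations and variance inflation factors.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(9)
x1 = rng.normal(0, 1, 200)
x2 = x1 + rng.normal(0, 0.1, 200)  # nearly a copy of x1 (multicollinear)
x3 = rng.normal(0, 1, 200)
ivs = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

print(ivs.corr())  # flag pairs with correlations near +/- .90

X = sm.add_constant(ivs)
for i, name in enumerate(X.columns[1:], start=1):  # skip the constant
    print(name, variance_inflation_factor(X.values, i))
# A common rule of thumb treats VIFs above 5-10 as problematic.
```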


