Notes About Regression
What is Regression ?
Regression is a machine learning method classified as supervised learning. Regression is used to predict numerical data and to analyze how changes in the value of an independent variable affect the value of a dependent variable. In general, regression predicts the dependent variable (Y, or target), which is numerical, based on one or more independent variables (X, or features) using mathematical functions.
Evaluation Metrics For Regression
1. Mean Absolute Error (MAE)
MAE provides an overview of how far, on average, the predicted values are from the actual values, without considering the direction of the errors (positive or negative). Therefore, the difference between the predicted and actual value is always counted as a positive value. Generally, a lower MAE value indicates a model that is more accurate in predicting the actual value.
Formula : (1/n) * Σ|y_actual - y_predict|
n : the number of data or samples.
y_actual : the actual value of the target variable.
y_predict : the predicted value from the model.
Advantage :
- Easy to interpret and more resistant to outliers.
Disadvantage :
- Does not give more weight to large errors than to small errors.
- Does not indicate the direction of the difference between the predicted and actual values, so it does not show whether predictions tend to be overestimated or underestimated.
Overestimate : When the predicted value is higher than the actual value.
Underestimate : When the predicted value is lower than the actual value.
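As a rough sketch of the formula above, MAE can be computed by hand with NumPy or with scikit-learn's mean_absolute_error; the arrays below are made-up values used only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up example values for illustration only
y_actual = np.array([3.0, 5.0, 7.5, 10.0])
y_predict = np.array([2.5, 5.5, 7.0, 11.0])

# MAE = (1/n) * sum(|y_actual - y_predict|)
mae_manual = np.mean(np.abs(y_actual - y_predict))
mae_sklearn = mean_absolute_error(y_actual, y_predict)
print(mae_manual, mae_sklearn)  # both give the same result
```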
2. Mean Squared Error (MSE)
MSE provides an overview of the average squared error between the predicted and actual values. Generally, an MSE close to 0 indicates that the predictions are, on average, close to the actual values.
Formula : (1/n) * Σ(y_actual - y_predict)²
n : the number of data or samples.
y_actual : the actual value of the target variable.
y_predict : the predicted value from the model.
Advantage :
- Puts more focus on reducing large errors, because the errors are squared.
Disadvantage :
- Difficult to interpret, because the error is in squared units of the target.
- Sensitive to outliers.
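A minimal sketch of the MSE formula, again on made-up values, computed both by hand and with scikit-learn's mean_squared_error.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up example values for illustration only
y_actual = np.array([3.0, 5.0, 7.5, 10.0])
y_predict = np.array([2.5, 5.5, 7.0, 11.0])

# MSE = (1/n) * sum((y_actual - y_predict)^2)
mse_manual = np.mean((y_actual - y_predict) ** 2)
mse_sklearn = mean_squared_error(y_actual, y_predict)
print(mse_manual, mse_sklearn)
```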
3. R-Squared
R-Squared provides an overview of how much of the variability in the target data can be explained by the model. Generally, an R-Squared value close to 1 indicates that the model explains most of the variability in the data.
Formula : 1 - (SSR/SST)
- SSR (Sum of Squared Residuals) : the sum of the squared differences between the predicted and actual values, Σ(y_actual - y_pred)².
- SST (Total Sum of Squares) : the sum of the squared differences between the actual values and the mean of the actual values, Σ(y_actual - y_mean)².
Advantage :
- Provides information about how well the model explains the variability of the data.
- Easy to use for comparing different models.
Disadvantage :
- Does not provide specific information about prediction error.
- The value can increase even if the added features are irrelevant.
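A small sketch of the R-Squared formula, computing SSR and SST by hand and comparing the result with scikit-learn's r2_score; the values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up example values for illustration only
y_actual = np.array([3.0, 5.0, 7.5, 10.0])
y_predict = np.array([2.5, 5.5, 7.0, 11.0])

# R^2 = 1 - SSR/SST
ssr = np.sum((y_actual - y_predict) ** 2)        # sum of squared residuals
sst = np.sum((y_actual - y_actual.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ssr / sst
r2_sklearn = r2_score(y_actual, y_predict)
print(r2_manual, r2_sklearn)
```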
4. Adjusted R-Squared
Adjusted R-Squared provides an overview of how much of the variability in the target data can be explained by the model, taking into account the number of features used. Adjusted R-Squared takes the R-Squared calculation and adjusts it for the number of features in the model. If the value is close to 1, it indicates that the model explains nearly all the variability in the data using the right features.
Formula : 1 - [(1 - R-Squared) * (n - 1) / (n - k - 1)]
n : the number of data or samples.
k : the number of features used in the model.
Advantage :
- Helps in overcoming the problem of overfitting.
- Gives a more accurate estimate of how the model will perform on new data.
Disadvantage :
- The value can still increase when an added feature improves the fit only slightly, so it does not completely rule out weak features.
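Adjusted R-Squared is not built into scikit-learn, but it can be derived from r2_score using the formula above; the sample values and the assumed number of features (k) below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up example values for illustration only
y_actual = np.array([3.0, 5.0, 7.5, 10.0, 12.0])
y_predict = np.array([2.5, 5.5, 7.0, 11.0, 11.5])

n = len(y_actual)  # number of samples
k = 2              # assumed number of features used by the model
r2 = r2_score(y_actual, y_predict)

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(adjusted_r2)
```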
5. Root Mean Squared Error (RMSE)
RMSE provides an overview of the average prediction error of the model, calculated as the square root of the mean of the squared differences between the predicted and actual values. Generally, an RMSE close to 0 indicates that the model's prediction error relative to the actual values is relatively small.
Formula : √[(1/n) * Σ(y_actual - y_pred)²]
n : the number of data or samples.
y_actual : the actual value of the target variable.
y_predict : the predicted value from the model.
Advantage :
- Puts more focus on reducing large errors.
- Easy to interpret because the result has the same scale as the target variable.
Disadvantage :
- Sensitive to outliers.
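A minimal sketch of RMSE as the square root of MSE, on the same kind of made-up values as above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up example values for illustration only
y_actual = np.array([3.0, 5.0, 7.5, 10.0])
y_predict = np.array([2.5, 5.5, 7.0, 11.0])

# RMSE = sqrt((1/n) * sum((y_actual - y_predict)^2))
rmse = np.sqrt(mean_squared_error(y_actual, y_predict))
print(rmse)
```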
6. Root Mean Absolute Error (RMAE)
RMAE provides an overview of the average prediction error of the model in absolute terms, calculated as the square root of the mean of the absolute differences between the predicted and actual values. Generally, an RMAE close to 0 indicates that the model's prediction error relative to the actual values is relatively small.
Formula : √[(1/n) * Σ|y_actual - y_pred|]
n : the number of data or samples.
y_actual : the actual value of the target variable.
y_predict : the predicted value from the model.
Advantage :
- More resistant to outliers.
Disadvantage :
- It is difficult to compare between models and datasets that have different scales.
- Does not give more weight to major errors.
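RMAE is not a standard scikit-learn metric, so the sketch below simply applies the formula above with NumPy on made-up values.

```python
import numpy as np

# Made-up example values for illustration only
y_actual = np.array([3.0, 5.0, 7.5, 10.0])
y_predict = np.array([2.5, 5.5, 7.0, 11.0])

# RMAE = sqrt((1/n) * sum(|y_actual - y_predict|))
rmae = np.sqrt(np.mean(np.abs(y_actual - y_predict)))
print(rmae)
```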
7. Mean Absolute Percentage Error (MAPE)
MAPE provides an overview of the average absolute error between the predicted and actual values, expressed as a percentage.
Formula : (1/n) * Σ|(y_actual - y_pred) / y_actual| * 100%
n : the number of data or samples.
y_actual : the actual value of the target variable.
y_predict : the predicted value from the model.
Advantage :
- Describes the prediction error in percentage terms.
Disadvantage :
- Cannot be calculated if any actual value is 0 or near 0.
- Susceptible to outlier values, because the error is calculated as a percentage.
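A small sketch of MAPE on made-up values, computed by hand and, assuming a recent scikit-learn version, with mean_absolute_percentage_error (which returns a fraction rather than a percentage).

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Made-up example values for illustration only
y_actual = np.array([3.0, 5.0, 7.5, 10.0])
y_predict = np.array([2.5, 5.5, 7.0, 11.0])

# MAPE = (1/n) * sum(|(y_actual - y_predict) / y_actual|) * 100%
mape_manual = np.mean(np.abs((y_actual - y_predict) / y_actual)) * 100
mape_sklearn = mean_absolute_percentage_error(y_actual, y_predict) * 100  # sklearn returns a fraction
print(mape_manual, mape_sklearn)
```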
Regression Models
1. Linear Regression (Linear)
Linear regression is a statistical method used to model a linear relationship between the dependent variable (Y / target) and one or more independent variables (X / feature). Linear Regression is divided into two types, namely Simple and Multiple.
Simple Linear : 1 independent variable.
Formula : Y = β0 + β1*X + ε
Multiple Linear : more than 1 independent variable.
Formula : Y = β0 + β1*X1 + β2*X2 + … + βn*Xn + ε
Y : dependent variable (target).
X1-Xn : independent variable (feature).
β0 : intercept. The predicted value of Y when all independent variables (X) are zero.
Formula -> Ȳ - (β1*X̄1) - (β2*X̄2) - … - (βn*X̄n)
β1 : coefficient (slope). The extent to which changes in the independent variable (X) affect the dependent variable (Y).
Formula -> Σ((X - X̄)*(Y - Ȳ)) / Σ((X - X̄)²)
ε : residual (error). Formula -> (Y - Y_predict)
X̄ : average independent variable (feature).
Ȳ : average dependent variable (target).
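As a minimal sketch of fitting a multiple linear regression and reading off β0 and the coefficients, the example below uses scikit-learn's LinearRegression on randomly generated data; the true coefficients baked into the data are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Randomly generated data for illustration only
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))  # two features (X1, X2)
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = LinearRegression()
model.fit(X, y)

print("Intercept (beta0):", model.intercept_)
print("Coefficients (beta1, beta2):", model.coef_)
print("Prediction for X1=1, X2=2:", model.predict([[1.0, 2.0]]))
```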
Linear Regression Assumptions :
1. Linearity : There is a linear relationship between the independent variable (X) and the dependent variable (Y). Linearity can be checked with a scatter plot.
2. Normality : The residuals (ε) follow a normal distribution. Normality can be checked with the Shapiro-Wilk or Lilliefors test.
3. No Multicollinearity : There is no linear relationship between the independent variables. Multicollinearity can be checked with the Variance Inflation Factor (VIF) or a partial F-test (see the sketch after this list).
4. Homoscedasticity & Unbiased : The residual variance (ε) is constant for all values of X. Homoscedasticity can be checked with a residual plot.
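A rough sketch of checking the normality and multicollinearity assumptions, assuming scipy, pandas, and statsmodels are available; it reuses the same randomly generated data as the previous sketch.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import shapiro
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression

# Same randomly generated data as in the previous sketch
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Normality of residuals: a p-value > 0.05 suggests the residuals look normal
stat, p_value = shapiro(residuals)
print("Shapiro-Wilk p-value:", p_value)

# Multicollinearity: VIF values above roughly 5-10 are usually a warning sign
X_const = sm.add_constant(pd.DataFrame(X, columns=["X1", "X2"]))
for i, col in enumerate(X_const.columns):
    if col != "const":
        print(col, "VIF:", variance_inflation_factor(X_const.values, i))
```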
Linear Regression Optimization :
- Transform the dependent variable (y) if it is not normally distributed, for example with np.log2.
- Eliminate features (X) that have no effect on the dependent variable, for example with Backward Elimination based on p-values.
- Scale the features if their scales are unequal, for example with MinMaxScaler, StandardScaler, or RobustScaler.
- Hyperparameter tuning, for example with Grid Search or Random Search, as sketched below.
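A minimal sketch of combining scaling and hyperparameter tuning in one pipeline. Ridge is used here only as a stand-in, because plain LinearRegression has almost no hyperparameters to tune; the data and the alpha grid are made up for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Made-up data for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 + X @ np.array([1.0, -0.5, 0.0]) + rng.normal(scale=0.3, size=100)

# Scaling + model in one pipeline, then tuning the regularization strength
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("reg", Ridge()),
])
param_grid = {"reg__alpha": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_mean_absolute_error")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```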
Advantage :
- Easy to implement & interpret.
- Shows how much influence each independent variable has on the dependent variable.
Disadvantage :
- Several assumptions must be met, and the model is sensitive to outliers.
Other Regression Models (Linear) :
- Ridge Regression : Uses L2 regularization to control model complexity. In Ridge, the squared penalty of the regression coefficients is added to the objective function, resulting in a more stable solution and reduced multicollinearity effects.
- Lasso Regression : Uses L1 regularization. In Lasso, the absolute penalty of the regression coefficients is added to the objective function. The advantage is its ability to perform feature selection, which produces a zero coefficient for insignificant variables, resulting in a simpler and more interpretable model.
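A small sketch contrasting Ridge (L2) and Lasso (L1) on made-up data in which only the first two features matter; note how Lasso tends to push the coefficients of the irrelevant features toward exactly zero.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Made-up data: only the first two features really matter
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 4.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: shrinks coefficients
lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty: can set coefficients to exactly 0

print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)  # irrelevant features tend toward 0
```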
2. Logistic Regression (Non-Linear)
Logistic Regression is a statistical method used to model and predict the probability of an event. There are 3 types of Logistic Regression, namely :
- Binary Logistic Regression : Only has 2 labels.
- Multinomial Logistic Regression : 3 or more labels.
- Ordinal Logistic Regression : 3 or more ordered (ordinal) labels.
Formula Binary Logistic Regression :
1. Log-Odds = β0 + β1*X1 + β2*X2 + … + βn*Xn
2. Odds = exp(Log-Odds)
3. P(Y=1) = Odds / (1 + Odds)
4. P(Y=0) = 1 - P(Y=1)
Odds : the ratio of the probability of success to the probability of failure.
β1 : coefficient (slope).
β0 : intercept (interception).
X1-Xn : independent variable (feature).
Y : dependent variable (target), 0 / 1.
Binary Logistic Regression has something called the Odds-Ratio. The Odds-Ratio is used to interpret the results of the analysis and indicates how much greater the tendency of a success event is in one condition compared to another. Formula : exp(β).
Explanation of the Odds-Ratio :
- If the Odds-Ratio is greater than 1, then there is a tendency to increase the probability of success as the predictor variable increases.
- If the Odds-Ratio is less than 1, then there is a decreasing trend in the probability of success as the predictor variable increases.
- If the Odds-Ratio is equal to 1, then the predictor variable has no effect on the probability of success.
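A minimal sketch of fitting a Binary Logistic Regression with scikit-learn and computing the Odds-Ratio as exp(β); the binary data is randomly generated for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up binary classification data for illustration only
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 2))
y = (1.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression()
model.fit(X, y)

# Odds-Ratio per feature = exp(coefficient)
odds_ratio = np.exp(model.coef_[0])
print("Coefficients:", model.coef_[0])
print("Odds-Ratios:", odds_ratio)

# Predicted probabilities P(Y=0) and P(Y=1) for one new sample
print(model.predict_proba([[0.5, -1.0]]))
```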
Logistic Regression Assumptions :
1. Linearity : There is a linear relationship between the independent variables (X) and the log-odds of the outcome. Linearity can be checked with a scatter plot.
2. No Multicollinearity : There is no linear relationship between the independent variables. Multicollinearity can be checked with the Variance Inflation Factor (VIF) or a partial F-test.
3. Homoscedasticity & Unbiased : The residual variance (ε) is constant for all values of X. Homoscedasticity can be checked with a residual plot.
4. No outliers.
Logistic Regression Optimization :
- Use the Maximum Likelihood Estimation (MLE) method to obtain coefficient estimates that maximize the likelihood function.
- Feature selection, for example with SelectKBest.
- Avoid overfitting (good performance on training data but poor performance on test data).
- Hyperparameter tuning, for example with Grid Search or Random Search.
Decision Boundary : helps separate the predicted probabilities into the positive and negative classes.
Evaluation Metrics : Precision, Recall, Accuracy, F1-score, and others.
Advantage :
- Easy to interpret and suitable for both discrete and continuous predictor (X) variables.
Disadvantage :
- Assumptions must be met, especially linearity, and the model is sensitive to outliers.
Other Regression Models (Non-Linear) :
- Polynomial Regression : Models the relationship between the independent and dependent variables as a polynomial function of degree two or higher.
- Exponential Regression: Used when the relationship between the independent and dependent variables follows an exponential pattern.
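A small sketch of Polynomial Regression of degree 2, built by combining PolynomialFeatures with LinearRegression on made-up quadratic data.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Made-up data with a quadratic pattern, for illustration only
rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(100, 1))
y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.3, size=100)

# Degree-2 polynomial regression: expand X into [X, X^2], then fit a linear model
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[1.5]]))
```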
3. Regression (Non-Parametric)
Non-Parametric Regression is a regression method that does not assume a linear form or specific function parameters on the relationship between the independent variables and the dependent variable. As a result, Non-Parametric Regression is capable of capturing more complex and unstructured patterns in the data.
Examples Non-Parametric model : Decision Tree, K-Nearest Neighbors (KNN) Regression, Support Vector Regression (SVR).
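A rough sketch of the three non-parametric examples fitted on the same made-up non-linear data; the hyperparameters chosen here are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Made-up non-linear data for illustration only
rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

models = {
    "Decision Tree": DecisionTreeRegressor(max_depth=4),
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "SVR": SVR(kernel="rbf"),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, "prediction at X=2.0:", model.predict([[2.0]]))
```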
4. Regression (Ensemble)
Ensemble regression methods are techniques that combine different regression models to enhance overall prediction performance. There are several popular ensemble regression methods, including :
- Stacking : A method that utilizes the predicted outcomes of each model (base learner) as features and combines them using a meta learner.
Important Notes for Stacking : prevent overfitting, use different models & parameters, consider the number of models, balance the data, and watch for significant outliers.
- Bagging (Bootstrap Aggregating) : Different regression models are trained using bootstrap samples of the same size from the original dataset. These models then make predictions on the test data, which they have not seen before. The final prediction is obtained by either majority voting or averaging the predictions of all individual models.
Example Model : Random Forest Regression.
Important Notes for Bagging : use different models & parameters, consider the number of models, keep the models independent, prevent overfitting, and watch for significant outliers.
- Boosting : An iterative process where weak learner models are trained sequentially, with each model attempting to correct the mistakes made by the previous model. The final prediction is obtained by combining the predictions of all individual models, typically weighted based on their performance.
Example Model : AdaBoost & GradientBoosting
Important Notes for Boosting : prevent overfitting, watch for significant outliers, and tune the parameters. A small sketch comparing the three ensemble approaches is shown below.
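A minimal sketch comparing bagging, boosting, and stacking regressors from scikit-learn on made-up data; the base learners and the meta learner chosen here are arbitrary examples.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Made-up data for illustration only
rng = np.random.default_rng(9)
X = rng.normal(size=(300, 4))
y = X[:, 0] ** 2 + 2.0 * X[:, 1] + rng.normal(scale=0.3, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Bagging (Random Forest)": RandomForestRegressor(n_estimators=100, random_state=0),
    "Boosting (Gradient Boosting)": GradientBoostingRegressor(random_state=0),
    "Stacking": StackingRegressor(
        estimators=[("knn", KNeighborsRegressor()), ("rf", RandomForestRegressor(random_state=0))],
        final_estimator=Ridge(),  # meta learner
    ),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```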
Thank you for taking the time to read this article. I hope this article is easy to understand, and I apologize for any errors in the writing. See you.