Build Your First Machine Learning Model: Multiple Linear Regression From Scratch (Complete Guide 2025)

What You Will Learn
Understanding Multiple Linear Regression
The Student Performance Dataset
Building Your Model Step by Step
Validating Your Results
Key Takeaways

What You Will Learn

In this tutorial, you will discover how to build a multiple linear regression model completely from scratch. By the end, you will understand not just how to use machine learning, but how it actually works under the hood. This knowledge will set you apart from developers who only know how to import libraries.

Why Build From Scratch?

Building models from scratch provides three key advantages:

Deep Understanding: You learn the mathematics behind the algorithms
Better Debugging: When things go wrong, you know exactly where to look
Interview Readiness: Top companies often ask candidates to implement algorithms from scratch

Let me guide you through building a model that predicts student performance with remarkable accuracy.

Understanding Multiple Linear Regression

Before diving into code, let us establish a solid foundation. Multiple linear regression finds the best-fitting line (or hyperplane) through multi-dimensional data. Think of it as drawing the most accurate trend line possible through a cloud of data points.

The mathematical formula at the heart of our model is:

y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ

Where:

y represents what we want to predict (student performance)
β values are coefficients showing how much each factor matters
x values are our input features (study hours, previous scores, etc.)

The Normal Equation: Your Secret Weapon

Instead of using iterative methods like gradient descent, we will use the normal equation for an elegant, one-shot solution:

β = (X^T × X)^(-1) × X^T × y

This equation finds the optimal coefficients instantly. No loops. No iterations. Just pure mathematical precision.

The Student Performance Dataset

Our journey begins with real data from Kaggle containing information about 1,000 students. Each row tells a story about a student's habits and their resulting academic performance.

The dataset includes six columns:

Hours Studied: Daily study time (1-10 hours)
Previous Scores: Historical performance (0-100)
Extracurricular Activities: Yes/No participation
Sleep Hours: Nightly rest (4-10 hours)
Sample Question Papers Practiced: Exam preparation (0-10)
Performance Index: Our target variable (0-100)

You can access the complete implementation and dataset here: Kaggle Notebook - Multi Linear Regression From Scratch

Building Your Model Step by Step

Step 1: Data Exploration and Feature Selection

The first rule of machine learning is to understand your data before touching any algorithms. Let us start by examining which factors actually influence student performance.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Load the dataset
data_df = pd.read_csv("Student_Performance.csv")

# Convert categorical data to numerical
data_df['Extracurricular Activities'] = data_df['Extracurricular Activities'].map({'Yes':1, 'No':0})

# Calculate correlations with our target variable
corr_m = data_df.corr()
print(corr_m['Performance Index'])

Correlation heatmap showing relationships between all variables in the student performance dataset

The correlation analysis reveals fascinating insights:

Previous Scores (0.92): Nearly perfect correlation - past performance strongly predicts future success
Hours Studied (0.37): Moderate impact - effort matters, but it is not everything
Extracurricular Activities (0.02): Surprisingly minimal impact
Sleep Hours (0.05): Low correlation, but we will keep it to capture potential hidden patterns

Based on this analysis, we select three features for our model: Previous Scores, Hours Studied, and Sleep Hours.

Step 2: Data Preprocessing - The Foundation of Success

Clean, well-prepared data is essential for any machine learning model. Our preprocessing function handles three critical tasks:

def data_processing(data_df, features_columns, target_column, train_size=0.8, random_state=42):
    # Select only the features we need
    desired_data = data_df[features_columns + [target_column]].copy()
    
    # Shuffle to ensure random distribution
    shuffle_df = desired_data.sample(frac=1, random_state=random_state).reset_index(drop=True)
    
    # Split into training and testing sets
    train_sz = int(train_size * len(desired_data))
    train_set = shuffle_df[:train_sz]
    test_set = shuffle_df[train_sz:]
    
    # Separate features and target
    X_train = train_set[features_columns]
    y_train = train_set[target_column]
    X_test = test_set[features_columns]
    y_test = test_set[target_column]
    
    # Standardization - bringing all features to the same scale
    train_mean = X_train.mean()
    train_std = X_train.std()
    X_train_std = (X_train - train_mean) / train_std
    X_test_std = (X_test - train_mean) / train_std
    
    return X_train_std, y_train, X_test_std, y_test

Why Standardization Matters: Without standardization, features with larger scales (like Previous Scores: 0-100) would dominate features with smaller scales (like Hours Studied: 1-10). Standardization ensures every feature competes fairly.

Step 3: Implementing the Learning Algorithm

Now comes the exciting part - implementing the actual learning algorithm. This is where mathematical theory transforms into working code:

def train_and_eval_model(X_train, y_train, X_test, y_test, plot_residuals=True):
    # Add intercept column (for the β₀ term)
    n_train = X_train.shape[0]
    n_test = X_test.shape[0]
    X_train_intercept = np.column_stack([np.ones(n_train), X_train])
    X_test_intercept = np.column_stack([np.ones(n_test), X_test])
    
    # Apply the normal equation
    X_T = X_train_intercept.T
    try:
        # This single line finds the optimal coefficients
        beta = np.linalg.solve(X_T @ X_train_intercept, X_T @ y_train)
    except np.linalg.LinAlgError:
        # Fallback for singular matrices
        beta = np.linalg.lstsq(X_train_intercept, y_train, rcond=None)[0]
    
    # Generate predictions
    y_train_pred = X_train_intercept @ beta
    y_test_pred = X_test_intercept @ beta
    
    # Calculate performance metrics
    # ... (metrics calculation code continues)

The beauty of this approach lies in its elegance. While gradient descent would require hundreds of iterations, the normal equation solves everything in one computation.

Step 4: Model Evaluation - Trust but Verify

Building a model is only half the battle. We must rigorously validate its performance:

# Calculate R-squared - how much variance we explain
ss_total = np.sum((y_true - np.mean(y_true)) ** 2)
ss_residual = np.sum((y_true - y_pred) ** 2)
r2 = 1 - (ss_residual / ss_total)

# Calculate RMSE - average prediction error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

Model validation results showing R² Score and RMSE performance metrics

Our model achieves impressive results:

R² Score: 0.988 (explaining 98.8% of variance)
RMSE: 2.14 points (on a 0-100 scale)
No Overfitting: Test performance matches training performance

Validating Your Results

Understanding Residual Plots

Residual plots are like health checkups for your model. Each plot tells a different story:

Residuals vs Fitted Values: Random scatter confirms no systematic bias
Q-Q Plot: Points following the line confirm normal distribution
Histogram: Bell curve shape validates our assumptions
Scale-Location: Horizontal band shows consistent accuracy

Running the Complete Model

Here is how everything comes together:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# First lets import the data
# Note: You will need to change this file path to the location of the dataset on your computer.
data_file = "~/path_to_your_file"

# create a pandas data frame
data_df = pd.read_csv(data_file)

# Lets print the dataset to see how it is structured.
print(data_df.head())

#Remapping "Extracurricular Activities"
data_df['Extracurricular Activities'] = data_df['Extracurricular Activities'].map({'Yes':1, 'No':0})

#Lets separate features from target
features = data_df.iloc[:, :-1]
target   = data_df.iloc[:, -1]

#Lets give each column names
features_names = features.columns.to_list()
target_name    = data_df.columns[-1]

#Time to check correlation
corr_m = data_df.corr()

#lets print values
print(corr_m['Performance Index'])

#lets create a figure for a different visualization
plt.figure(figsize=(8, 6))

# Create heatmap
sns.heatmap(corr_m, annot=True, cmap='coolwarm', center=0, fmt='.2f', square=True, linewidths=0.5, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix of Student Performance Features', fontsize=16)
plt.tight_layout()
plt.show()

#lets create a function to preprocess the dataset by normalizing it and splitting into 80% train and 20% test.
def data_processing(data_df, features_columns, target_column, train_size=0.8, random_state=42):

    #make copy of dataframe with the desired features and target
    desired_data = data_df[features_columns + [target_column]].copy()

    #shuffle dataset
    shuffle_df = desired_data.sample(frac = 1, random_state=random_state).reset_index(drop=True)

    #train_set
    train_sz  = int(train_size * len(desired_data))
    train_set = shuffle_df[:train_sz]

    #test_set
    test_set = shuffle_df[train_sz:]

    X_train = train_set[features_columns]
    y_train = train_set[target_column]
    X_test  = test_set[features_columns]
    y_test  = test_set[target_column]

    #apply standardization formula to features
    train_mean  = X_train.mean()
    train_std   = X_train.std()
    X_train_std = (X_train - train_mean) / train_std
    X_test_std  = (X_test - train_mean) / train_std

    return X_train_std, y_train, X_test_std, y_test

def train_and_eval_model(X_train, y_train, X_test, y_test, plot_residuals=True, plot_fit=True):
    # Create a feature matrix with intercept column
    n_train = X_train.shape[0]
    n_test  = X_test.shape[0]
    X_train_intercept = np.column_stack([np.ones(n_train), X_train])
    X_test_intercept  = np.column_stack([np.ones(n_test), X_test])

    # Solve normal equations beta = (X^T * X)^(-1) * X^T * y
    # Help to find the optimal coefficients in one direct computation
    X_T = X_train_intercept.T
    try:
        beta = np.linalg.solve(X_T @ X_train_intercept, X_T @ y_train)
    except np.linalg.LinAlgError:
        beta = np.linalg.lstsq(X_train_intercept, y_train, rcond=None)[0]

    # Generate predictions
    y_train_pred = X_train_intercept @ beta
    y_test_pred = X_test_intercept @ beta

    # Calculate residuals (actual - predicted)
    train_residuals = y_train - y_train_pred
    test_residuals = y_test - y_test_pred

    #Create Linear regression plot
    if plot_fit:
        create_prediction_vs_actual_plot(y_train, y_train_pred, y_test, y_test_pred)

    # Create residual plots
    if plot_residuals:
        create_residual_plots(y_train, y_train_pred, train_residuals,
                              y_test, y_test_pred, test_residuals)

    # Calculate performance metrics
    def compute_metrics(y_true, y_pred, dataset_name):
        # Mean squared error formula
        mse = np.mean((y_true - y_pred) ** 2)
        # Root mean squared error formula
        rmse = np.sqrt(mse)

        # R-squared: proportion of variance explained (1.0 = perfect, 0.0 = no better than mean)
        ss_total = np.sum((y_true - np.mean(y_true)) ** 2)
        ss_residual = np.sum((y_true - y_pred) ** 2)
        r2 = 1 - (ss_residual / ss_total) if ss_total > 0 else 0

        return {
            f'{dataset_name}_mse': mse,
            f'{dataset_name}_rmse': rmse,
            f'{dataset_name}_r2': r2,
            f'{dataset_name}_samples': len(y_true)
        }

    # Compute metrics for both training and test sets
    train_metrics = compute_metrics(y_train, y_train_pred, 'train')
    test_metrics = compute_metrics(y_test, y_test_pred, 'test')

    # Display the metrics
    display_metrics_pandas(train_metrics, test_metrics)

    return {'coefficients': beta, 'train_metrics': train_metrics, 'test_metrics': test_metrics, 'predictions': { 'train': y_train_pred, 'test': y_test_pred},
        'residuals': {'train': train_residuals, 'test': test_residuals}}

def display_metrics_pandas(train_metrics, test_metrics):

    metrics_data = {
        'Training': [
            train_metrics['train_mse'],
            train_metrics['train_rmse'],
            train_metrics['train_r2'],
            train_metrics['train_samples']
        ],
        'Test': [
            test_metrics['test_mse'],
            test_metrics['test_rmse'],
            test_metrics['test_r2'],
            test_metrics['test_samples']
        ]
    }

    df = pd.DataFrame(metrics_data,
                      index=['MSE', 'RMSE', 'R²', 'Samples'])
    for col in df.columns:
        df.loc[['MSE', 'RMSE', 'R²'], col] = df.loc[['MSE', 'RMSE', 'R²'], col].round(4)

    print("\n" + "="*40)
    print("MODEL PERFORMANCE METRICS")
    print("="*40)
    print(df.to_string())
    print("="*40 + "\n")

def create_residual_plots(y_train, y_train_pred, train_residuals, y_test, y_test_pred, test_residuals):

    # Create a figure with 4 subplots
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    fig.suptitle('Residual Analysis for Model Validation', fontsize=16)

    # 1. Residuals vs Fitted Values
    ax1 = axes[0, 0]
    ax1.scatter(y_train_pred, train_residuals, alpha=0.5, label='Training', s=10)
    ax1.scatter(y_test_pred, test_residuals, alpha=0.5, label='Test', s=10)
    ax1.axhline(y=0, color='red', linestyle='--', linewidth=2)
    ax1.set_xlabel('Predicted Performance Index')
    ax1.set_ylabel('Residuals (Actual - Predicted)')
    ax1.set_title('Residuals vs Fitted Values')
    ax1.legend()
    ax1.grid(True, alpha=0.3)

    # 2. Q-Q Plot for Normality Check
    ax2 = axes[0, 1]
    all_residuals = np.concatenate([train_residuals, test_residuals])
    stats.probplot(all_residuals, dist="norm", plot=ax2)
    ax2.set_title('Q-Q Plot')
    ax2.grid(True, alpha=0.3)

    # 3. Histogram of Residuals
    ax3 = axes[1, 0]
    ax3.hist(train_residuals, bins=30, alpha=0.5, label='Training', density=True)
    ax3.hist(test_residuals, bins=30, alpha=0.5, label='Test', density=True)

    #normal distribution curve
    residual_std = np.std(all_residuals)
    residual_mean = np.mean(all_residuals)
    x = np.linspace(residual_mean - 4*residual_std, residual_mean + 4*residual_std, 100)
    ax3.plot(x, stats.norm.pdf(x, residual_mean, residual_std), 'r-',
             linewidth=2, label='Normal Distribution')
    ax3.set_xlabel('Residuals')
    ax3.set_ylabel('Density')
    ax3.set_title('Distribution of Residuals')
    ax3.legend()
    ax3.grid(True, alpha=0.3)

    # 4. Scale-Location Plot
    ax4 = axes[1, 1]
    train_std_residuals = train_residuals / np.std(train_residuals)
    test_std_residuals = test_residuals / np.std(test_residuals)
    ax4.scatter(y_train_pred, np.sqrt(np.abs(train_std_residuals)),
                alpha=0.5, label='Training', s=10)
    ax4.scatter(y_test_pred, np.sqrt(np.abs(test_std_residuals)),
                alpha=0.5, label='Test', s=10)
    ax4.set_xlabel('Predicted Performance Index')
    ax4.set_ylabel('Standardized Residuals')
    ax4.set_title('Scale-Location Plot')
    ax4.legend()
    ax4.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

def create_prediction_vs_actual_plot(y_train, y_train_pred, y_test, y_test_pred):

    plt.figure(figsize=(8, 6))

    # Scatter plot for training and test data
    plt.scatter(y_train, y_train_pred, alpha=0.5, label='Training Data', s=15)
    plt.scatter(y_test, y_test_pred, alpha=0.7, label='Test Data', s=15)

    # Line of perfect prediction (y=x)
    min_val = min(min(y_train), min(y_test))
    max_val = max(max(y_train), max(y_test))
    plt.plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2, label='Perfect Prediction')

    plt.title('Predicted vs. Actual Performance Index')
    plt.xlabel('Actual Performance Index')
    plt.ylabel('Predicted Performance Index')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

[X_train, y_train, X_test, y_test] = data_processing(data_df,features_columns=['Previous Scores', 'Hours Studied', 'Sleep Hours'], target_column='Performance Index')
results = train_and_eval_model(X_train, y_train, X_test, y_test,plot_residuals=True, plot_fit=True)

Key Takeaways

What You Have Accomplished

By building this model from scratch, you have:

Mastered the Mathematics: You understand the normal equation and how it finds optimal solutions
Implemented Core Algorithms: Your code handles data preprocessing, model training, and evaluation
Validated Thoroughly: You know how to check if a model is truly working
Achieved Excellence: 98.8% accuracy proves the power of understanding fundamentals

Next Steps in Your Journey

Experiment with Features: Try different feature combinations and observe how performance changes
Compare with Scikit-learn: Implement the same model using scikit-learn and verify you get identical results
Scale to New Problems: Apply these techniques to other regression problems
Learn Gradient Descent: Implement the iterative approach and understand when each method is preferred

Resources for Continued Learning

Full Implementation: Access the complete code on Kaggle
Dataset: Student Performance Dataset
Further Reading: Linear Algebra for Machine Learning, Understanding Statistical Learning

Final Thoughts

Building machine learning models from scratch transforms you from a library user into a true practitioner. You now possess knowledge that many developers lack - the ability to understand, debug, and optimize at the deepest level.

Remember, every expert started exactly where you are now. The difference is they kept building, kept learning, and never stopped questioning how things work.

Your next challenge awaits. Start building.

Build Your First Machine Learning Model: Multiple Linear Regression From Scratch (Complete Guide 2025)

Table of Contents

What You Will Learn

Why Build From Scratch?

Understanding Multiple Linear Regression

The Normal Equation: Your Secret Weapon

The Student Performance Dataset

Building Your Model Step by Step

Step 1: Data Exploration and Feature Selection

Step 2: Data Preprocessing - The Foundation of Success

Step 3: Implementing the Learning Algorithm

Step 4: Model Evaluation - Trust but Verify

Validating Your Results

Understanding Residual Plots

Running the Complete Model

Key Takeaways

What You Have Accomplished

Next Steps in Your Journey

Resources for Continued Learning

Final Thoughts

Tags:

Table of Contents

What You Will Learn

Why Build From Scratch?

Understanding Multiple Linear Regression

The Normal Equation: Your Secret Weapon

The Student Performance Dataset

Building Your Model Step by Step

Step 1: Data Exploration and Feature Selection

Step 2: Data Preprocessing - The Foundation of Success

Step 3: Implementing the Learning Algorithm

Step 4: Model Evaluation - Trust but Verify

Validating Your Results

Understanding Residual Plots

Running the Complete Model

Key Takeaways

What You Have Accomplished

Next Steps in Your Journey

Resources for Continued Learning

Final Thoughts

Tags:

Share this post: