Introduction to Regression

June 08, 2025

What is a Regression Problem?

A regression problem is a type of supervised machine learning task where the goal is to predict a continuous value based on input features.

📌 Examples of Regression Problems:

Predicting house prices based on size, location, and number of rooms.
Estimating the temperature for tomorrow based on weather conditions.
Predicting student scores based on study hours.

In all these examples, the output is a number (not a category), so we use regression models.

Key Concepts

Term	Meaning
Features (X)	Input variables (e.g., hours studied)
Target (y)	The value we want to predict (e.g., score)
Model	A function that learns the relationship between X and y
Training	Feeding the model with known X and y values to learn the pattern
Prediction	Using the model to estimate unknown y for a given X

🐍 Python Example: Simple Linear Regression

We'll use pandas to load the data, scikit-learn to fit the model, and matplotlib to plot the result (the plot is optional).

🔧 Sample CSV File

Assume we have a CSV file named student_scores.csv with the following contents:

Hours,Score
2.5,21
5.1,47
3.2,27
8.5,75
3.5,30
1.5,20
9.2,88

This file has:

Input feature: Hours studied
Output/Target: Score
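To follow along without creating the file by hand, the snippet below is a minimal sketch that writes the same seven rows to student_scores.csv. It assumes the file should live in the current working directory; adjust the path if your script runs from somewhere else.

```python
import pandas as pd

# The same seven rows listed above, typed in as a dictionary of columns.
rows = {
    "Hours": [2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2],
    "Score": [21, 47, 27, 75, 30, 20, 88],
}
df = pd.DataFrame(rows)

# index=False keeps only the Hours and Score columns in the file.
df.to_csv("student_scores.csv", index=False)

# Read the file back to confirm it has 7 rows and 2 columns.
check = pd.read_csv("student_scores.csv")
print(check.shape)  # (7, 2)
```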

Python Code

import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt  # optional for plot

# Step 1: Read the CSV file
data = pd.read_csv('student_scores.csv')

# Step 2: Separate the input (X) and output (y)
X = data[['Hours']]  # 2D array
y = data['Score']    # 1D array

# Step 3: Create the Linear Regression model
model = LinearRegression()

# Step 4: Train the model using the data
model.fit(X, y)

# Step 5: Make a prediction (e.g., for 6.5 hours of study)
predicted_score = model.predict([[6.5]])
print(f"Predicted Score for 6.5 hours of study: {predicted_score[0]:.2f}")

# Optional: Plot the data and regression line
plt.scatter(X, y, color='blue')              # actual data points
plt.plot(X, model.predict(X), color='red')   # regression line
plt.xlabel("Hours Studied")
plt.ylabel("Score")
plt.title("Study Hours vs Score")
plt.show()

Output

Predicted Score for 6.5 hours of study: 59.58

In the plot:

Blue dots = actual data points
Red line = fitted regression line (the model's predictions)
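The red line is fully described by two numbers, a slope and an intercept. As a rough sketch, the snippet below refits the same model and prints both using scikit-learn's coef_ and intercept_ attributes. The data is typed inline here (an assumption made so the snippet runs without the CSV file), and the names slope, intercept, and manual are only illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same sample data as student_scores.csv, typed inline for convenience.
data = pd.DataFrame({
    "Hours": [2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2],
    "Score": [21, 47, 27, 75, 30, 20, 88],
})
X = data[["Hours"]]
y = data["Score"]

model = LinearRegression().fit(X, y)

slope = model.coef_[0]        # change in score per extra hour studied
intercept = model.intercept_  # predicted score at 0 hours
print(f"Score = {intercept:.2f} + {slope:.2f} * Hours")  # Score = 0.52 + 9.09 * Hours

# Plugging 6.5 hours into the line gives the same value as model.predict.
manual = intercept + slope * 6.5
print(f"{manual:.2f}")  # 59.58
```

Read this way, each additional hour of study adds roughly 9 points to the predicted score for this small sample.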

💡 Notes:

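X = data[['Hours']] uses double square brackets so that pandas returns a DataFrame (2D) rather than a Series (1D); scikit-learn expects features as a 2D array of shape (n_samples, n_features).
model.fit(X, y) tells the model to learn the best-fit line, i.e. the slope and intercept that best match the training data.
model.predict([[6.5]]) returns an array containing the predicted score for 6.5 hours; the nested list is one row with one feature. Because the model was fitted on a DataFrame, recent scikit-learn versions may warn that the input has no feature names; passing pd.DataFrame({'Hours': [6.5]}) instead avoids the warning.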
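For a single feature, "best-fit" in LinearRegression means ordinary least squares: the line that minimises the sum of squared differences between the actual and predicted scores. The sketch below works that out by hand with NumPy, using the standard closed-form formulas for simple linear regression. NumPy is not used elsewhere in this post, the data is typed inline rather than read from the CSV, and the variable names are illustrative.

```python
import numpy as np

# Same sample data as student_scores.csv.
hours = np.array([2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2])
scores = np.array([21, 47, 27, 75, 30, 20, 88], dtype=float)

x_mean = hours.mean()
y_mean = scores.mean()

# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
slope = np.sum((hours - x_mean) * (scores - y_mean)) / np.sum((hours - x_mean) ** 2)

# The least-squares line always passes through (mean_x, mean_y).
intercept = y_mean - slope * x_mean

predicted = intercept + slope * 6.5
print(f"Predicted Score for 6.5 hours of study: {predicted:.2f}")  # 59.58
```

The result matches the value printed by model.predict, which is what you would expect when both approaches solve the same least-squares problem.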

Python for Artificial Intelligence MNCST319 KTU BTech CS Minor 2024 - Dr Binu V P