Feature Scaling in Data Science
Feature scaling is a vital preprocessing step in data science, ensuring that all features contribute fairly to machine learning models. This module explains normalization and standardization in depth, with formulas, examples, and guidance on when to use each technique.

Feature scaling is a crucial data preprocessing step in machine learning and data science. Many algorithms rely on distance metrics or gradient-based optimization. If the features of a dataset exist on different scales, the algorithm may become biased toward those with larger magnitudes, leading to poor performance or unstable training.
Consider an example with two features: Age (0–100) and Income (0–100,000). If we use a distance-based algorithm such as k-nearest neighbors (kNN), the differences in income values will dominate the distance computation, while the contribution of age becomes negligible. Similarly, gradient-based algorithms like logistic regression or neural networks will converge more slowly if features are on vastly different scales.
To resolve this, we apply feature scaling. This transforms features into comparable ranges or distributions, ensuring that each contributes appropriately to the model. The two most common methods are Normalization and Standardization.
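As a quick sketch of the age/income example (assuming NumPy is available; the two people below are hypothetical), the raw Euclidean distance used by a method like kNN is dominated by income, while scaled features contribute comparably:

```python
import numpy as np

# Two hypothetical people described by [age, income]
a = np.array([25, 40_000])
b = np.array([60, 41_000])

# Raw Euclidean distance: a 1,000 income gap swamps a 35-year age gap
print(np.linalg.norm(a - b))                 # ≈ 1000.6

# Min-max scaling with the ranges quoted above (age 0-100, income 0-100,000)
a_scaled = np.array([25 / 100, 40_000 / 100_000])
b_scaled = np.array([60 / 100, 41_000 / 100_000])

# After scaling, the age difference drives the distance
print(np.linalg.norm(a_scaled - b_scaled))   # ≈ 0.35
```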
1. Normalization (Min–Max Scaling)
Definition:
Normalization is a technique that rescales the values of a feature into a fixed range, typically between 0 and 1. This makes the data easier to compare across features, especially when they have very different scales.
Formula:
𝑥' = (𝑥 − 𝑥min) / (𝑥max − 𝑥min)
Where:
- 𝑥 = original value
- 𝑥min = minimum value of the feature
- 𝑥max = maximum value of the feature
- 𝑥' = normalized value (between 0 and 1)
Example:
Let's consider the following brief example:
- Income = [30,000, 50,000, 70,000, 90,000, 110,000]
- Minimum (𝑥min) = 30,000
- Maximum (𝑥max) = 110,000
- For 50,000, for instance: 𝑥' = (50,000 − 30,000) / (110,000 − 30,000) = 0.25
Normalized values = [0, 0.25, 0.5, 0.75, 1].
Example in Python:
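The following is a minimal sketch using scikit-learn's MinMaxScaler with the income values from the example above (it assumes NumPy and scikit-learn are installed):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Income values from the example above, shaped as a single-feature column
income = np.array([[30_000], [50_000], [70_000], [90_000], [110_000]])

scaler = MinMaxScaler()                  # default feature_range is (0, 1)
normalized = scaler.fit_transform(income)

print(normalized.ravel().tolist())
```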
Output:
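```
[0.0, 0.25, 0.5, 0.75, 1.0]
```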
When to Use:
- Distance-based algorithms: k-Nearest Neighbors (kNN), clustering (e.g., k-Means), where distance is influenced by scale.
- Neural networks: Helps stabilize gradient descent by keeping input values bounded within a range.
Limitations:
- Sensitive to outliers: Extreme values will stretch the range, making most values very small.
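A quick sketch of this sensitivity, using the same incomes plus one hypothetical extreme value:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Same incomes as before, but with one extreme outlier
income = np.array([[30_000], [50_000], [70_000], [90_000], [1_000_000]])
scaled = MinMaxScaler().fit_transform(income)

# The ordinary incomes are squeezed into a narrow band near 0
print([round(float(v), 3) for v in scaled.ravel()])   # [0.0, 0.021, 0.041, 0.062, 1.0]
```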
2. Standardization (Z-Score Scaling)
Definition:
Standardization transforms data such that the resulting distribution has a mean of 0 and a standard deviation of 1. This process is also known as Z-score scaling. Unlike normalization, which rescales values to a fixed range (e.g., [0, 1]), standardization does not limit values to a specific interval. Instead, it centers the data and accounts for variability, making different features comparable.
Formula:
z = (x − μ) / σ
Where:
- x = original feature value
- μ = mean of the feature
- σ = standard deviation of the feature
- z = standardized value
Example:
Suppose we have the following income values:
- Income = [30,000, 50,000, 70,000, 90,000, 110,000]
- Compute the mean: μ = 70,000
- Compute the population standard deviation (the convention used by scikit-learn's StandardScaler): σ = √((40,000² + 20,000² + 0² + 20,000² + 40,000²) / 5) ≈ 28,284
- Apply the formula:
- For 30,000: z = (30,000 − 70,000) / 28,284 ≈ −1.41
- For 50,000: z = (50,000 − 70,000) / 28,284 ≈ −0.71
- For 70,000: z = (70,000 − 70,000) / 28,284 = 0
- For 90,000: z ≈ 0.71
- For 110,000: z ≈ 1.41
So, the standardized values are approximately: [−1.41, −0.71, 0, 0.71, 1.41]
Example in Python:
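The following is a minimal sketch using scikit-learn's StandardScaler, which centers each feature at 0 and divides by its population standard deviation (it assumes NumPy and scikit-learn are installed):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Income values from the example above, shaped as a single-feature column
income = np.array([[30_000], [50_000], [70_000], [90_000], [110_000]])

scaler = StandardScaler()                # mean 0, unit variance
standardized = scaler.fit_transform(income)

print([round(float(v), 2) for v in standardized.ravel()])
```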
Output:
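```
[-1.41, -0.71, 0.0, 0.71, 1.41]
```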
When to Use:
Standardization is useful when:
- The algorithm assumes (or works best with) roughly normally distributed data, for example:
  - Linear Regression
  - Logistic Regression
  - Principal Component Analysis (PCA)
  - Support Vector Machines (SVMs)
- Features are measured in different units and need to be made comparable.
Advantages:
- Not restricted to a fixed range, unlike normalization.
- Less sensitive to outliers because it centers around the mean and considers spread.
- Often the default choice when scaling is required for machine learning algorithms.
3. Normalization vs. Standardization
The following table outlines the key differences between normalization and standardization:
| Aspect | Normalization | Standardization |
|---|---|---|
| Range | Fixed (e.g., 0–1) | No fixed range |
| Based On | Minimum & Maximum values | Mean & Standard Deviation |
| Best For | Distance-based methods (kNN, clustering, neural nets) | Gaussian-based methods (regression, PCA, SVM) |
| Outlier Sensitivity | High | Moderate |
| Interpretability | Values always bounded | Values centered at 0, variance = 1 |
4. Practical Considerations
In data science, it is useful to follow these practices:
- Try both methods: In practice, data scientists often experiment with both normalization and standardization to see which works best, as performance may vary by dataset.
- Pipelines: Split the data first, then fit the scaler on the training set only (e.g., inside a scikit-learn Pipeline) and reuse it to transform the test set; fitting the scaler on the full dataset leaks test-set information into training. See the sketch after this list.
- Scaling with categorical data: Only apply scaling to numeric features; categorical features need separate preprocessing (e.g., one-hot encoding).
- Inverse transforms: Libraries like scikit-learn allow reversing scaling for interpretability after modeling.
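The following is a minimal sketch of the pipeline, leakage, and inverse-transform points above (the toy data, step names, and k value are illustrative assumptions, not part of the module):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two numeric features on very different scales (age, income)
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 100, 200), rng.uniform(0, 100_000, 200)])
y = (X[:, 1] > 50_000).astype(int)

# Split first, then let the pipeline fit the scaler on the training data only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
model.fit(X_train, y_train)                  # scaler statistics come from X_train only
print("Test accuracy:", model.score(X_test, y_test))

# Inverse transform: recover the original units for interpretability
scaler = model.named_steps["scale"]
print(scaler.inverse_transform(scaler.transform(X_test[:1])))
```

Because the scaler is fitted inside the pipeline after the split, its statistics come from the training data only, which is exactly what the data-leakage point above requires.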
5. Summary
Feature scaling is essential to prevent larger-magnitude features from dominating smaller ones.
- Normalization rescales features to a bounded range (0–1) — ideal for distance-based algorithms.
- Standardization centers features at 0 with unit variance — ideal for algorithms assuming normal distributions.
Choosing between them depends on the algorithm and data distribution, but both are indispensable tools in preprocessing.