Math Relearning - 23 Regression and Prediction

lucy

2026-04-04

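Math Relearning

Math, Math for Programming

This is a sub-page of Stage 4 of the Math Relearning Roadmap.

Regression and Prediction: Drawing a Line Through the Data

Intuition: regression is "finding the pattern"

Imagine scattering a handful of beans across a scatter plot.

Regression finds a line (or curve) that passes as close as possible to all the beans.

Once you have that line, you can predict y for any new x.

Backend development analogy:

The QPS data from the past 30 days are the scattered points.

Fit a trend line, and you can estimate how many servers you will need next week.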

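Simple linear regression: y = ax + b

Model

$y = ax + b + \epsilon$

a = slope (when x increases by 1, y changes by a on average)

b = intercept (the predicted value of y at x = 0)

ε = error (random fluctuation the model cannot capture)

Ordinary least squares (OLS)

Goal: find the a and b that minimize the sum of squared vertical distances from the data points to the line

$\min_{a,b} \sum_{i=1}^{n} (y_i - ax_i - b)^2$

Intuition: make the total squared deviation between predicted and actual values as small as possible

Closed-form solution

$a = \frac{n\sum x_iy_i - \sum x_i \sum y_i}{n\sum x_i^2 - (\sum x_i)^2}$

$b = \bar{y} - a\bar{x}$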
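
For readers who want to see where these two formulas come from, here is a short derivation sketch. It uses only the objective defined above; the name $S(a,b)$ for the objective is introduced here for convenience and is not part of the original notes.

$S(a,b) = \sum_{i=1}^{n} (y_i - ax_i - b)^2$

$\frac{\partial S}{\partial b} = -2\sum_{i=1}^{n} (y_i - ax_i - b) = 0 \;\Rightarrow\; b = \bar{y} - a\bar{x}$

$\frac{\partial S}{\partial a} = -2\sum_{i=1}^{n} x_i(y_i - ax_i - b) = 0 \;\Rightarrow\; \sum x_iy_i = a\sum x_i^2 + b\sum x_i$

Substituting $b = \frac{1}{n}\left(\sum y_i - a\sum x_i\right)$ into the second equation and multiplying by $n$:

$n\sum x_iy_i - \sum x_i \sum y_i = a\left(n\sum x_i^2 - (\sum x_i)^2\right)$

Dividing by the bracket on the right (nonzero as long as the $x_i$ are not all equal) gives the expression for $a$.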

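No need to memorize these; understanding them is enough. In Python the fit is a one-liner.

Python: from-scratch implementation plus the sklearn version: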

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Simulated data: QPS vs. server CPU usage
np.random.seed(42)
qps = np.random.uniform(100, 1000, 50)
cpu_usage = 10 + 0.07 * qps + np.random.normal(0, 5, 50)

# === Method 1: closed-form formulas ===
n = len(qps)
a_manual = (n * np.sum(qps * cpu_usage) - np.sum(qps) * np.sum(cpu_usage)) / \
           (n * np.sum(qps**2) - np.sum(qps)**2)
b_manual = np.mean(cpu_usage) - a_manual * np.mean(qps)
print(f"Manual:  y = {a_manual:.4f}x + {b_manual:.4f}")

# === Method 2: sklearn ===
# sklearn expects a 2D feature matrix of shape (n_samples, n_features)
model = LinearRegression()
model.fit(qps.reshape(-1, 1), cpu_usage)
print(f"sklearn: y = {model.coef_[0]:.4f}x + {model.intercept_:.4f}")

# Visualization
plt.figure(figsize=(10, 6))
plt.scatter(qps, cpu_usage, alpha=0.6, label='Observed data')
x_line = np.linspace(50, 1050, 100)
y_line = model.predict(x_line.reshape(-1, 1))
plt.plot(x_line, y_line, 'r-', linewidth=2, label=f'Regression line: y={model.coef_[0]:.3f}x+{model.intercept_:.1f}')
plt.xlabel('QPS')
plt.ylabel('CPU usage (%)')
plt.title('QPS vs CPU usage: linear regression')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Prediction: what is the CPU usage at QPS=800?
# predict() also needs a 2D input, so a scalar must be wrapped as [[800]]
pred = model.predict(np.array([[800]]))
print(f"\nQPS=800 -> predicted CPU: {pred[0]:.1f}%")

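R² (coefficient of determination): how good is the model?

Definition

$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$

$SS_{res}$ = residual sum of squares (variation the model does not explain)

$SS_{tot}$ = total sum of squares (total variation in the data)

Intuition

R² = the fraction of the variation in the data that the model explains

R² = 1: perfect fit (unrealistic for real data)

R² = 0: the model does no better than always predicting the mean $\bar{y}$

R² = 0.85: the model explains 85% of the variation, which is decent

Caveats

A high R² is not necessarily good (the model may be overfitting)

A low R² is not necessarily bad (some data are inherently noisy)

Adding features never lowers R² on the training set (which is why multiple regression uses Adjusted R²)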
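
The last caveat can be checked numerically. The sketch below is illustrative only: the helper names `r2_ols` and `adjusted_r2` are made up for this example, the data are synthetic, and the adjustment uses the common textbook formula $1 - (1 - R^2)\frac{n-1}{n-p-1}$ for n samples and p features, which the notes above mention but do not spell out.

```python
import numpy as np

def r2_ols(X, y):
    """Training R² of an OLS fit with intercept, solved by least squares."""
    A = np.column_stack([np.ones(len(X)), X])      # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

def adjusted_r2(r2, n, p):
    """Adjusted R² for n samples and p features (intercept not counted)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(0)
n = 100
x_useful = rng.uniform(0, 10, n)
y = 2 * x_useful + 3 + rng.normal(0, 4, n)
noise_features = rng.normal(size=(n, 10))          # unrelated to y by construction

X_small = x_useful.reshape(-1, 1)
X_big = np.column_stack([x_useful, noise_features])

r2_small = r2_ols(X_small, y)
r2_big = r2_ols(X_big, y)
print(f"1 feature:   R² = {r2_small:.4f}, adjusted = {adjusted_r2(r2_small, n, 1):.4f}")
print(f"11 features: R² = {r2_big:.4f}, adjusted = {adjusted_r2(r2_big, n, 11):.4f}")
```

Training R² can only stay the same or rise when columns are added, because the smaller model is a special case of the larger one. The adjusted value charges a price for each extra feature, so it is always below the raw R² whenever p ≥ 1 and R² < 1.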

Computing R² in Python:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

np.random.seed(42)
X = np.random.uniform(0, 10, 100).reshape(-1, 1)

# Compare three scenarios, each fitted with a straight line
scenarios = {
    "Strong linear (low noise)": 2*X.ravel() + 3 + np.random.normal(0, 1, 100),
    "Weak linear (high noise)": 2*X.ravel() + 3 + np.random.normal(0, 8, 100),
    "Nonlinear": np.sin(X.ravel()) * 10 + np.random.normal(0, 1, 100),
}

for name, y in scenarios.items():
    model = LinearRegression().fit(X, y)
    r2 = r2_score(y, model.predict(X))
    print(f"{name:30s} -> R² = {r2:.4f}")

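Multiple linear regression: using several features at once

Model

$y = a_1x_1 + a_2x_2 + \ldots + a_px_p + b$

Here p is the number of features; n remains the number of samples.

Matrix form: $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$, where $\mathbf{X}$ includes a column of ones so that the intercept b is part of $\boldsymbol{\beta}$

Solution: $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ (assuming $\mathbf{X}^T\mathbf{X}$ is invertible)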
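
As a quick sanity check of the matrix solution, the following sketch solves the normal equations directly with NumPy and compares the result with `np.linalg.lstsq`. The data and the coefficient values in `true_beta` are invented for illustration. It uses `np.linalg.solve` rather than forming the inverse explicitly, which is the usual numerical advice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X_raw = rng.uniform(0, 10, size=(n, 3))            # three synthetic features
true_beta = np.array([5.0, 0.5, -1.2, 2.0])        # intercept first (made-up values)

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(n), X_raw])
y = X @ true_beta + rng.normal(0, 0.5, n)

# Normal equations: (X^T X) beta = X^T y
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference: general least-squares solver (more robust when X^T X is ill-conditioned)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equations:", np.round(beta_normal, 3))
print("lstsq:           ", np.round(beta_lstsq, 3))
```

Both routes should agree to numerical precision on well-conditioned data like this. When features are nearly collinear, the normal-equation route loses accuracy first, which is one reason libraries tend to rely on decomposition-based solvers.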

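Intuition

CPU usage depends not only on QPS but also on the number of concurrent connections, memory usage, and so on

Multiple regression takes several factors into account at the same time

Feature importance

After standardizing the features, a larger absolute coefficient means a larger influence

Multiple regression in Python: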

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
n = 200

# Simulated server monitoring data
qps = np.random.uniform(100, 1000, n)
connections = np.random.uniform(50, 500, n)
memory_mb = np.random.uniform(1000, 8000, n)

# True relationship: CPU = 5 + 0.05*QPS + 0.03*connections + 0.002*memory + noise
cpu = 5 + 0.05*qps + 0.03*connections + 0.002*memory_mb + np.random.normal(0, 3, n)

X = np.column_stack([qps, connections, memory_mb])
feature_names = ['QPS', 'Concurrent connections', 'Memory (MB)']

# Fit on the raw features
model = LinearRegression().fit(X, cpu)
print("Multiple linear regression coefficients:")
for name, coef in zip(feature_names, model.coef_):
    print(f"  {name}: {coef:.5f}")
print(f"  Intercept: {model.intercept_:.4f}")
print(f"  R²: {model.score(X, cpu):.4f}")

# Standardize the features, then compare coefficient magnitudes
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model_scaled = LinearRegression().fit(X_scaled, cpu)

print("\nFeature importance after standardization (larger absolute value = more important):")
importance = sorted(zip(feature_names, np.abs(model_scaled.coef_)),
                    key=lambda x: x[1], reverse=True)
for name, imp in importance:
    print(f"  {name}: {imp:.4f}")

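Overfitting vs. underfitting

Intuition

Underfitting:

The model is too simple and fails to capture the pattern in the data

Analogy: fitting a straight line to clearly curved data

Symptom: poor performance on both the training set and the test set

Overfitting:

The model is too complex and memorizes the noise as well

Analogy: fitting 5 points with a degree-20 polynomial. It passes through every point exactly but oscillates wildly in between

Symptom: very good on the training set, very poor on the test set

Key difference:

Underfitting = failed to learn the pattern

Overfitting = learned what it should not have (the noise)

Bias-variance tradeoff

Bias: systematic gap between the model's average prediction and the true value → underfitting

Variance: how sensitive the model is to the particular training set → overfitting

Expected squared error = bias² + variance + irreducible error

Goal: find the best balance between bias and variance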
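
The decomposition can be estimated by simulation: draw many training sets from the same process, fit the same model class to each, and look at how the predictions behave on a fixed grid. The sketch below assumes a true function sin(x) with Gaussian noise; the helper name `bias_variance` and all of its parameters are choices made for this example.

```python
import numpy as np

def bias_variance(degree, n_repeats=200, n_train=20, noise=0.2, seed=0):
    """Estimate bias² and variance of a polynomial fit to y = sin(x) + noise."""
    rng = np.random.default_rng(seed)
    x_grid = np.linspace(0.5, 5.5, 50)             # fixed evaluation points
    f_true = np.sin(x_grid)
    preds = np.empty((n_repeats, x_grid.size))
    for r in range(n_repeats):
        # A fresh training set on every repeat
        x = rng.uniform(0, 6, n_train)
        y = np.sin(x) + rng.normal(0, noise, n_train)
        coefs = np.polyfit(x, y, degree)
        preds[r] = np.polyval(coefs, x_grid)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - f_true) ** 2)   # average prediction vs. truth
    variance = np.mean(preds.var(axis=0))          # spread across training sets
    return bias_sq, variance

results = {d: bias_variance(d) for d in (1, 4, 9)}
for d, (b2, var) in results.items():
    print(f"degree={d}: bias² = {b2:.4f}, variance = {var:.4f}")
```

Under these assumptions the straight line should show the largest bias, and the degree-9 polynomial should show a much larger variance than the line, which is the tradeoff described above.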

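Detecting overfitting

Training performance far better than test performance → overfitting

Cross-validation: split the data into k folds and use each fold in turn for validation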
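
The first check, comparing training and test scores, takes only a few lines. This is a minimal sketch on synthetic sin(x) data; the degrees 1, 4 and 12 and the 70/30 split are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.uniform(0, 6, 60)
y = np.sin(x) + rng.normal(0, 0.2, 60)

# Hold out 30% of the points; the model never sees them during fitting
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.3, random_state=0)

gaps = {}
for degree in (1, 4, 12):
    coefs = np.polyfit(x_tr, y_tr, degree)
    r2_train = r2_score(y_tr, np.polyval(coefs, x_tr))
    r2_test = r2_score(y_te, np.polyval(coefs, x_te))
    gaps[degree] = (r2_train, r2_test)
    print(f"degree={degree:2d}: train R² = {r2_train:.3f}, test R² = {r2_test:.3f}")
```

A low score on both sets points to underfitting. A training score well above the test score is the overfitting signature described above.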

Visualizing overfitting in Python:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# True relationship: y = sin(x) + noise
np.random.seed(42)
X_train = np.sort(np.random.uniform(0, 6, 15)).reshape(-1, 1)
y_train = np.sin(X_train.ravel()) + np.random.normal(0, 0.2, 15)

X_test = np.linspace(0, 6, 200).reshape(-1, 1)
y_true = np.sin(X_test.ravel())

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
degrees = [1, 4, 15]
titles = ['Underfit (degree=1)', 'About right (degree=4)', 'Overfit (degree=15)']

for ax, degree, title in zip(axes, degrees, titles):
    # Expand x into polynomial features, then fit an ordinary linear model
    poly = PolynomialFeatures(degree)
    X_poly_train = poly.fit_transform(X_train)
    X_poly_test = poly.transform(X_test)

    model = LinearRegression().fit(X_poly_train, y_train)
    y_pred_train = model.predict(X_poly_train)
    y_pred_test = model.predict(X_poly_test)

    # R² on the training points only; it rises with degree even when the fit gets worse
    r2_train = r2_score(y_train, y_pred_train)

    ax.scatter(X_train, y_train, color='blue', s=50, zorder=5, label='Training data')
    ax.plot(X_test, y_true, 'g--', alpha=0.5, label='True function')
    ax.plot(X_test, y_pred_test, 'r-', linewidth=2, label=f'Fit (train R²={r2_train:.3f})')
    ax.set_title(title)
    ax.set_ylim(-2, 2)
    ax.legend(fontsize=8)
    ax.grid(True, alpha=0.3)

plt.suptitle('Underfit vs. about right vs. overfit', fontsize=14)
plt.tight_layout()
plt.show()

Cross-validation code:

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

np.random.seed(42)
X = np.sort(np.random.uniform(0, 6, 50)).reshape(-1, 1)
y = np.sin(X.ravel()) + np.random.normal(0, 0.2, 50)

# X is sorted, so unshuffled folds would be contiguous x-ranges and every
# validation fold would test extrapolation; shuffle the folds instead
cv = KFold(n_splits=5, shuffle=True, random_state=42)

print("Polynomial degree vs. cross-validated R²:")
print("-" * 40)
best_degree, best_score = 0, -np.inf

for degree in range(1, 16):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=cv, scoring='r2')
    mean_score = scores.mean()

    if mean_score > best_score:
        best_score = mean_score
        best_degree = degree

    bar = '█' * int(max(0, mean_score) * 30)
    print(f"  degree={degree:2d}: R²={mean_score:+.4f} {bar}")

print(f"\nBest polynomial degree: {best_degree} (CV R²={best_score:.4f})")

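Regularization: putting "shackles" on the model

Why regularize?

When a model overfits, its coefficients tend to be very large (it contorts itself to fit the data)

Regularization = adding a penalty term to the loss function so the coefficients cannot grow too large

L2 regularization: Ridge regression

$\min \sum(y_i - \hat{y}_i)^2 + \alpha \sum a_j^2$

Effect: all coefficients are shrunk toward zero, but generally do not become exactly 0

Intuition: let every feature take part, but do not let any single one dominate

L1 regularization: Lasso

$\min \sum(y_i - \hat{y}_i)^2 + \alpha \sum |a_j|$

Effect: some coefficients are shrunk to exactly 0 (automatic feature selection)

Intuition: keep the few most important features and drop the rest

ElasticNet

A mix of the L1 and L2 penalties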
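
To see how the three penalties treat coefficients differently, the sketch below fits scikit-learn's `Ridge`, `Lasso` and `ElasticNet` to the same synthetic data and counts exact zeros. The data (20 features, only the first 3 relevant) and the `alpha` and `l1_ratio` values are assumptions chosen for illustration, not tuned settings. Note that scikit-learn scales the squared-error term slightly differently from the formulas above, so `alpha` values are not directly comparable across the three classes.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
true_coef = np.zeros(20)
true_coef[:3] = [3.0, -2.0, 1.5]                   # only 3 features matter (made-up values)
y = X @ true_coef + rng.normal(0, 0.5, 200)

models = {
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    # l1_ratio=0.5 splits the penalty evenly between L1 and L2
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5),
}

zero_counts = {}
for name, model in models.items():
    model.fit(X, y)
    zero_counts[name] = int(np.sum(model.coef_ == 0))
    print(f"{name:10s}: {zero_counts[name]:2d} of 20 coefficients are exactly 0")
```

Ridge is expected to leave every coefficient nonzero, while Lasso should zero out most of the 17 irrelevant ones. ElasticNet typically lands in between, depending on `l1_ratio`.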

Comparing regularization methods in Python:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

np.random.seed(42)
X = np.sort(np.random.uniform(0, 6, 30)).reshape(-1, 1)
y = np.sin(X.ravel()) + np.random.normal(0, 0.3, 30)

degree = 10  # deliberately high-degree polynomial

models = {
    "Linear regression (no penalty)": make_pipeline(PolynomialFeatures(degree), LinearRegression()),
    "Ridge (L2, α=0.1)": make_pipeline(PolynomialFeatures(degree), Ridge(alpha=0.1)),
    "Ridge (L2, α=1.0)": make_pipeline(PolynomialFeatures(degree), Ridge(alpha=1.0)),
    "Lasso (L1, α=0.01)": make_pipeline(PolynomialFeatures(degree), Lasso(alpha=0.01)),
    "Lasso (L1, α=0.1)": make_pipeline(PolynomialFeatures(degree), Lasso(alpha=0.1)),
}

# X is sorted, so shuffle the folds to avoid validating on contiguous x-ranges
cv = KFold(n_splits=5, shuffle=True, random_state=42)

print(f"Polynomial degree = {degree}")
print("-" * 55)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring='r2')
    print(f"{name:30s} -> CV R² = {scores.mean():.4f} ± {scores.std():.4f}")

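Applied scenarios

Big data

Trend forecasting (DAU/MAU):

Fit a linear or polynomial regression to historical data

Forecast the number of users over the next N days

Feature importance analysis:

Which factors affect a metric the most?

Standardized coefficients give a feature importance ranking

Code example: DAU trend forecast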

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

np.random.seed(42)

# Simulated 90 days of DAU (growth trend + noise)
days = np.arange(1, 91).reshape(-1, 1)
dau = 10000 + 50 * days.ravel() + np.random.normal(0, 500, 90)

# Fit
model = LinearRegression().fit(days, dau)

# Forecast the next 30 days
future_days = np.arange(91, 121).reshape(-1, 1)
future_dau = model.predict(future_days)

plt.figure(figsize=(12, 6))
plt.scatter(days, dau, alpha=0.5, s=20, label='Historical DAU')
all_days = np.arange(1, 121).reshape(-1, 1)
plt.plot(all_days, model.predict(all_days), 'r-', linewidth=2, label='Trend line')
plt.axvline(x=90, color='gray', linestyle='--', alpha=0.5, label='Today')
# The ±10% band is a visual rule of thumb, not a statistical prediction interval
plt.fill_between(future_days.ravel(), future_dau*0.9, future_dau*1.1,
                 alpha=0.2, color='orange', label='±10% band (illustrative)')
plt.xlabel('Day')
plt.ylabel('DAU')
plt.title(f'DAU trend forecast (about {model.coef_[0]:.0f} per day)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"Trend: about {model.coef_[0]:.0f} more users per day")
print(f"R² = {model.score(days, dau):.4f}")
# predict() needs a 2D input, so wrap the scalar day as [[120]]
print(f"Forecast DAU on day 120: {model.predict(np.array([[120]]))[0]:.0f}")

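Security

Anomaly score regression:

Features (login frequency, IP changes, device fingerprint, etc.) → risk score

Use multiple regression to model which factors drive risk the most

Vulnerability count trends:

Trend in the number of new CVEs per month

Forecast the security team's future workload

Code example: security risk scoring model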

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error

np.random.seed(42)
n = 500

# Features
login_frequency = np.random.exponential(5, n)       # logins per day
ip_change_count = np.random.poisson(2, n)           # IP changes within a week
failed_attempts = np.random.poisson(1, n)           # failed login attempts
access_time_var = np.random.exponential(3, n)       # variance of access time

# True risk score (weighted combination), clipped to [0, 10]
risk_score = (0.1 * login_frequency + 0.3 * ip_change_count +
              0.4 * failed_attempts + 0.2 * access_time_var +
              np.random.normal(0, 0.5, n))
risk_score = np.clip(risk_score, 0, 10)

X = np.column_stack([login_frequency, ip_change_count,
                     failed_attempts, access_time_var])
feature_names = ['Login frequency', 'IP changes', 'Failed attempts', 'Access time variance']

X_train, X_test, y_train, y_test = train_test_split(X, risk_score,
                                                    test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print("=== Security risk scoring model ===")
print(f"R² = {r2_score(y_test, y_pred):.4f}")
print(f"MAE = {mean_absolute_error(y_test, y_pred):.4f}")
print("\nFeature weights:")
for name, coef in sorted(zip(feature_names, model.coef_),
                         key=lambda x: abs(x[1]), reverse=True):
    print(f"  {name}: {coef:.4f}")

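Backend

Capacity forecasting:

Historical QPS → future demand

Scale out ahead of time based on the trend

Performance regression detection:

Monitor the response time trend after each release

If the slope changes significantly → performance regression

Code example: performance regression detection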

import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

np.random.seed(42)

# Response times for the 30 days before the release (stable)
before = 200 + np.random.normal(0, 10, 30)
# Response times for the 10 days after the release (possibly worse)
after = 215 + np.random.normal(0, 12, 10)

days_before = np.arange(1, 31)
days_after = np.arange(31, 41)

# Method 1: t-test for a shift in the mean
# alternative='less' tests whether mean(before) < mean(after)
t_stat, p_value = stats.ttest_ind(before, after, alternative='less')
print("=== Performance regression detection ===")
print(f"Mean before release: {before.mean():.1f} ms")
print(f"Mean after release:  {after.mean():.1f} ms")
print(f"t-test: t={t_stat:.3f}, p={p_value:.6f}")
print(f"Conclusion: {'Performance regression! Response time rose significantly' if p_value < 0.05 else 'Normal fluctuation'}")

# Method 2: linear regression to compare the trend before and after

# Trend before the release
model_before = LinearRegression().fit(days_before.reshape(-1,1), before)
slope_before = model_before.coef_[0]

# Trend after the release
model_after = LinearRegression().fit(days_after.reshape(-1,1), after)
slope_after = model_after.coef_[0]

print(f"\nSlope before release: {slope_before:.3f} ms/day")
print(f"Slope after release:  {slope_after:.3f} ms/day")
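
Printing two slopes does not yet say whether a slope is significant. One possible way to attach a significance test is `scipy.stats.linregress`, which reports a p-value for the null hypothesis that the slope is zero. The sketch below uses two synthetic series (one flat, one drifting upward by an assumed 2 ms/day); the 0.05 threshold is a conventional choice, not something prescribed by these notes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
days = np.arange(1, 21)

# Two synthetic response-time series (ms): one flat, one drifting upward
series = {
    "stable": 200 + rng.normal(0, 5, 20),
    "drifting": 200 + 2.0 * days + rng.normal(0, 5, 20),
}

fits = {}
for name, values in series.items():
    fit = stats.linregress(days, values)           # slope, intercept, pvalue, stderr, ...
    fits[name] = fit
    verdict = "significant trend" if fit.pvalue < 0.05 else "no significant trend"
    print(f"{name:9s}: slope = {fit.slope:+.3f} ± {fit.stderr:.3f} ms/day, "
          f"p = {fit.pvalue:.4f} -> {verdict}")
```

With only a handful of days after a release the test has little power, so a non-significant slope is weak evidence that nothing changed.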

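Model evaluation best practices

Train / test split

Never evaluate a model on the data it was trained on

Common ratio: 80% training / 20% test

Time series data: use the past to predict the future; do not shuffle randomly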
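
For the time series case, scikit-learn provides `TimeSeriesSplit`, which always trains on earlier samples and validates on later ones. The sketch below applies it to synthetic DAU-like data; the data and the choice of 4 splits are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
days = np.arange(1, 91).reshape(-1, 1)
dau = 10000 + 50 * days.ravel() + rng.normal(0, 500, 90)

# Each split trains on an expanding window of the past and tests on the next block
splits = list(TimeSeriesSplit(n_splits=4).split(days))

for train_idx, test_idx in splits:
    model = LinearRegression().fit(days[train_idx], dau[train_idx])
    pred = model.predict(days[test_idx])
    mae = np.mean(np.abs(dau[test_idx] - pred))
    print(f"train days 1-{train_idx.max() + 1:2d} -> "
          f"test days {test_idx.min() + 1}-{test_idx.max() + 1}: MAE = {mae:.0f}")
```

Because every test block lies strictly after its training window, the score reflects genuine forecasting rather than interpolation between shuffled neighbors.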

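Cross-validation

k-fold: split the data into k parts and use each part in turn for validation

Gives a more stable estimate and reduces the role of luck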
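
The k-fold procedure is simple enough to write by hand, which helps show what library helpers such as `cross_val_score` do internally. This is a minimal sketch: the generator name `k_fold_indices` is made up for this example, and the data are a synthetic straight line with noise.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 2 * x + 3 + rng.normal(0, 1, 50)

fold_mse = []
for train_idx, val_idx in k_fold_indices(len(x), k=5):
    a, b = np.polyfit(x[train_idx], y[train_idx], 1)   # fit on k-1 folds
    pred = a * x[val_idx] + b                          # validate on the held-out fold
    fold_mse.append(np.mean((y[val_idx] - pred) ** 2))

print("per-fold MSE:", np.round(fold_mse, 3))
print(f"mean MSE = {np.mean(fold_mse):.3f} ± {np.std(fold_mse):.3f}")
```

Every sample is used for validation exactly once, and the spread across folds gives a rough sense of how much a single train/test split could mislead.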

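Common evaluation metrics

R²: the closer to 1, the better

MAE (mean absolute error): the average size of the prediction error

RMSE (root mean squared error): more sensitive to large errors

MAPE (mean absolute percentage error): easier to read as a percentage, but undefined when a true value is 0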
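
For reference, the three error metrics can be written out in the notation used earlier ($y_i$ for true values, $\hat{y}_i$ for predictions, n samples). These are the standard textbook definitions:

$MAE = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$

$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$

$MAPE = \frac{100\%}{n}\sum_{i=1}^{n} \left|\frac{y_i - \hat{y}_i}{y_i}\right|$

Squaring before averaging is what makes RMSE weight large errors more heavily than MAE, and the division by $y_i$ is why MAPE breaks down when a true value is zero or close to it.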

Evaluation metrics in Python:

import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

def evaluate_model(y_true, y_pred, name="Model"):
    """Report R², MAE, RMSE and MAPE for a regression model."""
    r2 = r2_score(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    # MAPE divides by y_true, so it is only meaningful when no true value is 0
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100

    print(f"=== {name} evaluation ===")
    print(f"  R²   = {r2:.4f}  (explains {r2*100:.1f}% of the variation)")
    print(f"  MAE  = {mae:.4f}  (average error {mae:.1f})")
    print(f"  RMSE = {rmse:.4f}  (sensitive to large errors)")
    print(f"  MAPE = {mape:.2f}% (percentage error)")
    return r2, mae, rmse, mape

# Usage example
np.random.seed(42)
y_true = np.random.normal(100, 15, 50)
y_pred_good = y_true + np.random.normal(0, 3, 50)
y_pred_bad = y_true + np.random.normal(0, 15, 50)

evaluate_model(y_true, y_pred_good, "Good model")
print()
evaluate_model(y_true, y_pred_bad, "Bad model")

Exercises

Exercise 1: simple linear regression

Given the data: x = [1,2,3,4,5], y = [2.1, 3.9, 6.2, 7.8, 10.1]

(a) Find a and b by least squares

(b) Predict y at x=6

(c) Compute R²

Reference answer:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

model = LinearRegression().fit(X, y)
print(f"(a) y = {model.coef_[0]:.4f}x + {model.intercept_:.4f}")
# predict() needs a 2D input, so wrap the scalar as [[6]]
print(f"(b) x=6 -> y = {model.predict(np.array([[6]]))[0]:.4f}")
print(f"(c) R² = {model.score(X, y):.4f}")

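Exercise 2: diagnosing overfitting

Your model scores R²=0.99 on the training set and R²=0.45 on the test set

(a) What is the problem?

(b) List 3 ways to fix it

Reference answer:

(a) Overfitting: the model has memorized the noise in the training data

(b) Remedies (any three):

Reduce model complexity (lower the polynomial degree, use fewer features)
Add regularization (Ridge / Lasso)
Collect more training data
Use cross-validation to choose the best hyperparameters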

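Exercise 3: feature selection in multiple regression

You use 5 features to predict server response time; after Lasso regression, 2 of the coefficients become 0

(a) What does this mean?

(b) Which features is it more reasonable to keep, and why?

Reference answer:

(a) Lasso "considers" the features with zero coefficients unimportant for the prediction and drops them automatically

(b) Keep the 3 features with nonzero coefficients, because Lasso's L1 penalty performs feature selection by design

Note, however, that if two features are highly correlated, Lasso may keep only one of them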

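Exercise 4: capacity forecasting

QPS over the past 7 days: [500, 520, 550, 540, 580, 600, 610]

(a) Fit a linear regression

(b) Predict the QPS on day 14

(c) What are the risks of extrapolating linearly?

Reference answer: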

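import numpy as np
from sklearn.linear_model import LinearRegression

days = np.array([1,2,3,4,5,6,7]).reshape(-1, 1)
qps = np.array([500, 520, 550, 540, 580, 600, 610])

model = LinearRegression().fit(days, qps)
print(f"(a) QPS = {model.coef_[0]:.1f} * day + {model.intercept_:.1f}")
# predict() needs a 2D input, so wrap the scalar as [[14]]
print(f"(b) Day 14 forecast: {model.predict(np.array([[14]]))[0]:.0f} QPS")
print(f"R² = {model.score(days, qps):.4f}")
print()
print("(c) Risks of linear extrapolation:")
print("  - Growth may be nonlinear (e.g. exponential growth or saturation)")
print("  - There may be periodicity (weekdays vs. weekends)")
print("  - Sudden events cannot be predicted (sales campaigns, outages)")
print("  - The further you extrapolate, the less reliable it is (trust it only near the data range)")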

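Exercise 5: putting it together, choosing a regularizer

Scenario: you have 100 features for predicting user churn probability and suspect that many of them are useless

(a) Should you use Ridge or Lasso, and why?

(b) What if you want to keep every feature but still guard against overfitting?

Reference answer:

(a) Use Lasso (L1), because it can push the coefficients of useless features to 0 and so performs feature selection

(b) Use Ridge (L2), which does not eliminate features and only shrinks their coefficients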

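from sklearn.linear_model import LassoCV, RidgeCV
import numpy as np

np.random.seed(42)
X = np.random.randn(200, 100)
# Only the first 5 features are actually useful
true_coef = np.zeros(100)
true_coef[:5] = [3, -2, 1.5, -1, 0.8]
y = X @ true_coef + np.random.normal(0, 1, 200)

# (a) LassoCV picks alpha automatically by cross-validation
lasso = LassoCV(cv=5, random_state=42).fit(X, y)
non_zero = np.sum(lasso.coef_ != 0)
print(f"Lasso selected {non_zero} features (truth: 5)")
print(f"Best α = {lasso.alpha_:.4f}")

# Which features were selected
selected = np.where(lasso.coef_ != 0)[0]
print(f"Selected feature indices: {selected}")

# (b) RidgeCV keeps every feature and only shrinks the coefficients
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
print(f"Ridge keeps {np.sum(ridge.coef_ != 0)} features (best α = {ridge.alpha_:.4f})")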

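Section summary

The core idea of regression: approximate the "pattern" in the data with a mathematical function

Simple linear regression y=ax+b is the starting point; multiple linear regression handles several factors

R² measures how good a model is, but watch out for overfitting

Overfitting vs. underfitting is one of the most central problems in machine learning

Regularization (L1/L2) is the basic weapon against overfitting

Regression is the foundation for the machine learning material later in the Math Relearning Roadmap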
