数学重学 - 11 集合论与数据操作

这是 数学重学路线图 阶段二的子页面

📌 已有详细笔记:集合论基础(含Java代码和SQL对照),本页补充 Python 视角和大数据场景


一、集合运算复习:5大核心操作

1.1 直觉比喻

集合就是一堆不重复的东西放在一个袋子里

交集∩ → 两个班都报名的同学

并集∪ → 两个班所有报名的同学(去重)

差集A\B → 报了A班但没报B班的同学

补集 → 全校减去报名的同学

对称差△ → 只报了一个班的同学(报A没报B + 报B没报A)

1.2 公式总结

A ∩ B = { x | x ∈ A 且 x ∈ B }

A ∪ B = { x | x ∈ A 或 x ∈ B }

A \ B = { x | x ∈ A 且 x ∉ B }

A△B = (A \ B) ∪ (B \ A) = (A ∪ B) \ (A ∩ B)

补集:A’ = U \ A (U是全集)

1.3 重要性质

交换律:A ∩ B = B ∩ A,A ∪ B = B ∪ A

结合律:(A ∩ B) ∩ C = A ∩ (B ∩ C)

分配律:A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)

德摩根律:(A ∪ B)’ = A’ ∩ B’,(A ∩ B)’ = A’ ∪ B’

安全场景记法:不在(黑名单1 或 黑名单2)中 = 不在黑名单1中 且 不在黑名单2中

1.4 容斥原理

|A ∪ B| = |A| + |B| - |A ∩ B|
|A ∪ B ∪ C| = |A| + |B| + |C| - |A∩B| - |A∩C| - |B∩C| + |A∩B∩C|

场景:统计受多种告警影响的独立IP数


二、Python set 操作全览

2.1 基础操作

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
# 创建集合
a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7}

# 交集:两个集合都有的元素
print(a & b) # {3, 4, 5}
print(a.intersection(b)) # 同上

# 并集:两个集合所有元素(去重)
print(a | b) # {1, 2, 3, 4, 5, 6, 7}
print(a.union(b)) # 同上

# 差集:在a中但不在b中
print(a - b) # {1, 2}
print(a.difference(b)) # 同上

# 对称差:只在一个集合中的元素
print(a ^ b) # {1, 2, 6, 7}
print(a.symmetric_difference(b)) # 同上

# 子集判断
print({1, 2}.issubset(a)) # True
print(a.issuperset({1, 2})) # True
print(a.isdisjoint({8, 9})) # True(无交集)

2.2 set comprehension 与 frozenset

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
# 集合推导式:快速过滤+转换
nums = [1, 2, 2, 3, 3, 3, 4, 5]
unique_evens = {x for x in nums if x % 2 == 0}
print(unique_evens) # {2, 4}

# frozenset:不可变集合,可以作为dict的key或放入另一个set
fs = frozenset([1, 2, 3])
# fs.add(4) # 报错!不可变

# 实际用途:用frozenset做缓存key
cache = {}
key = frozenset({"user_id": 100, "role": "admin"}.items())
cache[key] = "cached_result"

# 多集合操作
sets = [{1,2,3}, {2,3,4}, {3,4,5}]
all_intersection = set.intersection(*sets) # {3}
all_union = set.union(*sets) # {1,2,3,4,5}

2.3 性能对比:set vs list

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
import time

big_list = list(range(1_000_000))
big_set = set(range(1_000_000))

# list查找:O(n)
start = time.time()
999_999 in big_list
print(f"list查找: {time.time()-start:.6f}s")

# set查找:O(1)
start = time.time()
999_999 in big_set
print(f"set查找: {time.time()-start:.6f}s")

# 结论:需要频繁 in 判断时,把list转set能快几百倍

三、SQL 与集合运算的映射

3.1 对照表

集合运算 SQL实现 说明
A ∩ B INNER JOIN / INTERSECT 两表都有的记录
A ∪ B FULL OUTER JOIN / UNION 两表所有记录
A \ B LEFT JOIN WHERE b.id IS NULL / EXCEPT 在A不在B
A△B FULL OUTER JOIN WHERE … IS NULL 只在一边的记录

3.2 SQL 示例

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
# 用 Python 模拟 SQL 集合操作的思路

# 表A的用户ID
table_a = {"user_1", "user_2", "user_3", "user_4"}
# 表B的用户ID
table_b = {"user_3", "user_4", "user_5", "user_6"}

# INNER JOIN → 交集
inner = table_a & table_b # {"user_3", "user_4"}

# FULL OUTER JOIN → 并集
full_outer = table_a | table_b # 全部6个用户

# LEFT JOIN WHERE NULL → 差集
left_only = table_a - table_b # {"user_1", "user_2"}

# EXCEPT 等价写法(SQL标准)
# SELECT user_id FROM A EXCEPT SELECT user_id FROM B

3.3 关键记忆

UNION 自带去重 = 集合并集的天然特性

UNION ALL 不去重 = 多重集合(multiset)的并

INTERSECT 和 EXCEPT 是SQL标准但MySQL不直接支持(8.0+支持INTERSECT/EXCEPT)


四、大数据应用场景

4.1 用户交集分析

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
import pandas as pd

# 场景:找出同时使用A产品和B产品的用户
df_a = pd.DataFrame({"user_id": [1,2,3,4,5], "product": "A"})
df_b = pd.DataFrame({"user_id": [3,4,5,6,7], "product": "B"})

# 方法1:merge(等价INNER JOIN)
both = pd.merge(df_a, df_b, on="user_id", how="inner")
print(f"同时使用AB的用户: {both['user_id'].tolist()}") # [3, 4, 5]

# 方法2:用set操作
users_a = set(df_a["user_id"])
users_b = set(df_b["user_id"])
print(f"交集: {users_a & users_b}") # {3, 4, 5}
print(f"只用A: {users_a - users_b}") # {1, 2}
print(f"只用B: {users_b - users_a}") # {6, 7}
print(f"至少用一个: {users_a | users_b}") # {1,2,3,4,5,6,7}

4.2 ETL 数据去重

1
2
3
4
5
6
7
8
9
10
11
12
13
14
# 场景:增量数据导入,只导入新数据
existing_ids = {1001, 1002, 1003, 1004} # 已存在的
incoming_ids = {1003, 1004, 1005, 1006} # 新来的

new_only = incoming_ids - existing_ids # {1005, 1006}
print(f"需要新插入: {new_only}")

need_update = incoming_ids & existing_ids # {1003, 1004}
print(f"可能需要更新: {need_update}")

# 大数据场景用 Bloom Filter 近似判断(空间更省)
# pip install pybloom-live
# from pybloom_live import BloomFilter
# bf = BloomFilter(capacity=10_000_000, error_rate=0.01)

4.3 漏斗分析

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# 场景:用户行为漏斗 → 浏览 → 加购 → 下单 → 支付
step1_view = {"u1","u2","u3","u4","u5","u6","u7","u8","u9","u10"}
step2_cart = {"u1","u2","u3","u5","u7","u8"}
step3_order = {"u1","u3","u5","u8"}
step4_pay = {"u1","u5"}

steps = [step1_view, step2_cart, step3_order, step4_pay]
labels = ["浏览", "加购", "下单", "支付"]

for i, (label, s) in enumerate(zip(labels, steps)):
rate = len(s) / len(step1_view) * 100
if i > 0:
conv = len(s) / len(steps[i-1]) * 100
lost = steps[i-1] - s
print(f"{label}: {len(s)}人 (总转化{rate:.0f}%, 环比{conv:.0f}%, 流失用户: {lost})")
else:
print(f"{label}: {len(s)}人 (100%)")

4.4 pandas merge 详解

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
import pandas as pd

left = pd.DataFrame({"key": ["a","b","c"], "val_l": [1,2,3]})
right = pd.DataFrame({"key": ["b","c","d"], "val_r": [4,5,6]})

# inner → 交集
print(pd.merge(left, right, on="key", how="inner"))
# key val_l val_r
# b 2 4
# c 3 5

# outer → 并集(缺失填NaN)
print(pd.merge(left, right, on="key", how="outer"))

# left → 保留左表全部
print(pd.merge(left, right, on="key", how="left"))

# 差集:left中没有匹配right的行
merged = pd.merge(left, right, on="key", how="left", indicator=True)
left_only = merged[merged["_merge"] == "left_only"]
print(f"只在左表: {left_only['key'].tolist()}") # ['a']

五、安全应用场景

5.1 IP 黑白名单运算

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
# 场景:安全运营中的IP名单管理

# 各来源的恶意IP
threat_intel_ips = {"10.1.1.1", "10.1.1.2", "192.168.1.100", "172.16.0.5"}
ids_alert_ips = {"10.1.1.2", "10.1.1.3", "192.168.1.100", "172.16.0.8"}
honeypot_ips = {"10.1.1.1", "10.1.1.3", "172.16.0.5", "172.16.0.9"}

# 多源确认的高危IP(至少2个来源)→ 用交集
from itertools import combinations

sources = [threat_intel_ips, ids_alert_ips, honeypot_ips]
multi_confirmed = set()
for s1, s2 in combinations(sources, 2):
multi_confirmed |= (s1 & s2)
print(f"多源确认的高危IP: {multi_confirmed}")
# {'10.1.1.1', '10.1.1.2', '10.1.1.3', '192.168.1.100', '172.16.0.5'}

# 汇总所有可疑IP(并集)
all_suspicious = threat_intel_ips | ids_alert_ips | honeypot_ips

# 白名单排除
whitelist = {"10.1.1.1", "192.168.1.100"} # 已确认是正常IP
final_block = all_suspicious - whitelist
print(f"最终封禁列表: {final_block}")

5.2 资产交叉比对

1
2
3
4
5
6
7
8
9
10
11
12
# 场景:发现未纳管的"影子资产"

cmdb_assets = {"srv-001", "srv-002", "srv-003", "srv-004"} # CMDB登记的
scan_found = {"srv-002", "srv-003", "srv-005", "srv-006"} # 扫描发现的

shadow_assets = scan_found - cmdb_assets # 影子资产
offline_assets = cmdb_assets - scan_found # 可能下线的
healthy = cmdb_assets & scan_found # 正常在管的

print(f"影子资产(需纳管): {shadow_assets}") # {'srv-005', 'srv-006'}
print(f"疑似下线(需确认): {offline_assets}") # {'srv-001', 'srv-004'}
print(f"正常在管: {healthy}") # {'srv-002', 'srv-003'}

5.3 告警去重

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
# 场景:多个安全设备产生重复告警

alert_set = set()

def dedup_alert(src_ip, dst_ip, alert_type):
"""基于三元组去重"""
key = frozenset({("src", src_ip), ("dst", dst_ip), ("type", alert_type)})
if key in alert_set:
return False # 重复
alert_set.add(key)
return True # 新告警

print(dedup_alert("10.0.0.1", "10.0.0.2", "port_scan")) # True
print(dedup_alert("10.0.0.1", "10.0.0.2", "port_scan")) # False(重复)
print(dedup_alert("10.0.0.1", "10.0.0.2", "brute_force")) # True(不同类型)

六、后端应用场景

6.1 缓存与数据库一致性检查

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# 场景:检查Redis缓存和数据库数据是否一致

db_keys = {"product:1001", "product:1002", "product:1003", "product:1004"}
cache_keys = {"product:1002", "product:1003", "product:1005"}

# 缓存中有但DB中没有 → 脏数据/过期缓存
stale_cache = cache_keys - db_keys
print(f"需清理的缓存: {stale_cache}") # {'product:1005'}

# DB中有但缓存中没有 → 缓存未命中(可能需要预热)
cache_miss = db_keys - cache_keys
print(f"未缓存的数据: {cache_miss}") # {'product:1001', 'product:1004'}

# 两边都有 → 需进一步比对值是否一致
need_verify = db_keys & cache_keys
print(f"需校验一致性: {need_verify}") # {'product:1002', 'product:1003'}

6.2 权限集合运算

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
# 场景:RBAC权限系统

role_admin = {"read", "write", "delete", "manage_users", "view_logs"}
role_editor = {"read", "write", "view_logs"}
role_viewer = {"read", "view_logs"}

user_roles = [role_editor, role_viewer] # 用户同时拥有两个角色

# 用户的最终权限 = 所有角色权限的并集
user_perms = set.union(*user_roles)
print(f"用户权限: {user_perms}") # {'read', 'write', 'view_logs'}

# 检查是否有某权限
required = {"write", "delete"}
has_all = required.issubset(user_perms) # False(缺delete)
has_any = not required.isdisjoint(user_perms) # True(有write)

# 缺少哪些权限
missing = required - user_perms
print(f"缺少权限: {missing}") # {'delete'}

6.3 标签系统

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
# 场景:文章标签的集合运算

article_1_tags = {"python", "backend", "security"}
article_2_tags = {"python", "data", "machine-learning"}
article_3_tags = {"security", "backend", "devops"}

# 找同时包含 python 和 security 的文章
target = {"python", "security"}
articles = {
"art1": article_1_tags,
"art2": article_2_tags,
"art3": article_3_tags,
}

result = [name for name, tags in articles.items() if target.issubset(tags)]
print(f"匹配的文章: {result}") # ['art1']

# 推荐相似文章:用 Jaccard 相似度
def jaccard(set_a, set_b):
if not set_a | set_b:
return 0
return len(set_a & set_b) / len(set_a | set_b)

print(f"art1 vs art2: {jaccard(article_1_tags, article_2_tags):.2f}") # 0.20
print(f"art1 vs art3: {jaccard(article_1_tags, article_3_tags):.2f}") # 0.50

七、练习题

练习1:集合运算基础

已知 A = {1,2,3,4,5},B = {3,4,5,6,7},C = {5,6,7,8,9}

求:A∩B∩C,A∪B∪C,A△B,(A∪B)\C

答案:

1
2
3
4
5
6
7
8
A = {1,2,3,4,5}
B = {3,4,5,6,7}
C = {5,6,7,8,9}

print(f"A∩B∩C = {A & B & C}") # {5}
print(f"A∪B∪C = {A | B | C}") # {1,2,3,4,5,6,7,8,9}
print(f"A△B = {A ^ B}") # {1,2,6,7}
print(f"(A∪B)\\C = {(A|B) - C}") # {1,2,3,4}

练习2:容斥原理

某公司200人,会Python的120人,会SQL的100人,两者都会的50人

问:至少会一种的多少人?两个都不会的多少人?

答案:

|A∪B| = |A| + |B| - |A∩B| = 120 + 100 - 50 = 170人

两个都不会:200 - 170 = 30人

练习3:安全场景实战

你有三个威胁情报源的恶意IP列表:

source_a = {“1.1.1.1”, “2.2.2.2”, “3.3.3.3”}

source_b = {“2.2.2.2”, “3.3.3.3”, “4.4.4.4”}

source_c = {“3.3.3.3”, “4.4.4.4”, “5.5.5.5”}

白名单 = {“4.4.4.4”}

需求:找出至少被2个来源标记且不在白名单中的IP

答案:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
from itertools import combinations

source_a = {"1.1.1.1", "2.2.2.2", "3.3.3.3"}
source_b = {"2.2.2.2", "3.3.3.3", "4.4.4.4"}
source_c = {"3.3.3.3", "4.4.4.4", "5.5.5.5"}
whitelist = {"4.4.4.4"}

sources = [source_a, source_b, source_c]
multi = set()
for s1, s2 in combinations(sources, 2):
multi |= (s1 & s2)

result = multi - whitelist
print(f"需封禁: {result}") # {'2.2.2.2', '3.3.3.3'}

练习4:缓存一致性

db_ids = {101, 102, 103, 104, 105}

cache_ids = {103, 104, 105, 106, 107}

请分析:哪些缓存是脏数据?哪些数据没被缓存?哪些需要校验一致性?

答案:

1
2
3
4
5
6
7
8
9
db_ids = {101, 102, 103, 104, 105}
cache_ids = {103, 104, 105, 106, 107}

dirty_cache = cache_ids - db_ids # {106, 107}
not_cached = db_ids - cache_ids # {101, 102}
need_check = db_ids & cache_ids # {103, 104, 105}
print(f"脏缓存: {dirty_cache}")
print(f"未缓存: {not_cached}")
print(f"需校验: {need_check}")

练习5:pandas merge

用 pandas 实现:找出只在表A中出现但不在表B中出现的用户

答案:

1
2
3
4
5
6
7
8
import pandas as pd

df_a = pd.DataFrame({"user_id": [1,2,3,4,5]})
df_b = pd.DataFrame({"user_id": [3,4,5,6,7]})

merged = pd.merge(df_a, df_b, on="user_id", how="left", indicator=True)
only_a = merged[merged["_merge"] == "left_only"]["user_id"].tolist()
print(f"只在A中: {only_a}") # [1, 2]

八、本页小结

集合运算的核心就5个:交∩ 并∪ 差\ 补’ 对称差△

Python 的 set 用 & | - ^ 操作符,O(1)查找是性能利器

SQL 的 JOIN 本质就是集合运算,INNER=交,OUTER=并,LEFT+NULL=差

安全场景:IP名单运算、资产比对、告警去重 → 全是集合操作

后端场景:缓存一致性、权限判断、标签匹配 → 集合思维无处不在

下一篇:12-递归与数学归纳法


上一章 目录 下一章
10-布尔代数与逻辑 数学重学路线图 12-递归与数学归纳法