这是 数学重学路线图 阶段二的子页面
📌 已有详细笔记:集合论基础(含Java代码和SQL对照),本页补充 Python 视角和大数据场景
一、集合运算复习:5大核心操作
1.1 直觉比喻
集合就是一堆不重复的东西放在一个袋子里
交集∩ → 两个班都报名的同学
并集∪ → 两个班所有报名的同学(去重)
差集A\B → 报了A班但没报B班的同学
补集 → 全校减去报名的同学
对称差△ → 只报了一个班的同学(报A没报B + 报B没报A)
1.2 公式总结
A ∩ B = { x | x ∈ A 且 x ∈ B }
A ∪ B = { x | x ∈ A 或 x ∈ B }
A \ B = { x | x ∈ A 且 x ∉ B }
A△B = (A \ B) ∪ (B \ A) = (A ∪ B) \ (A ∩ B)
补集:A’ = U \ A (U是全集)
1.3 重要性质
交换律:A ∩ B = B ∩ A,A ∪ B = B ∪ A
结合律:(A ∩ B) ∩ C = A ∩ (B ∩ C)
分配律:A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
德摩根律:(A ∪ B)’ = A’ ∩ B’,(A ∩ B)’ = A’ ∪ B’
安全场景记法:不在(黑名单1 或 黑名单2)中 = 不在黑名单1中 且 不在黑名单2中
1.4 容斥原理
|A ∪ B| = |A| + |B| - |A ∩ B|
|A ∪ B ∪ C| = |A| + |B| + |C| - |A∩B| - |A∩C| - |B∩C| + |A∩B∩C|
场景:统计受多种告警影响的独立IP数
二、Python set 操作全览
2.1 基础操作
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
| a = {1, 2, 3, 4, 5} b = {3, 4, 5, 6, 7}
print(a & b) print(a.intersection(b))
print(a | b) print(a.union(b))
print(a - b) print(a.difference(b))
print(a ^ b) print(a.symmetric_difference(b))
print({1, 2}.issubset(a)) print(a.issuperset({1, 2})) print(a.isdisjoint({8, 9}))
|
2.2 set comprehension 与 frozenset
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
| nums = [1, 2, 2, 3, 3, 3, 4, 5] unique_evens = {x for x in nums if x % 2 == 0} print(unique_evens)
fs = frozenset([1, 2, 3])
cache = {} key = frozenset({"user_id": 100, "role": "admin"}.items()) cache[key] = "cached_result"
sets = [{1,2,3}, {2,3,4}, {3,4,5}] all_intersection = set.intersection(*sets) all_union = set.union(*sets)
|
2.3 性能对比:set vs list
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
| import time
big_list = list(range(1_000_000)) big_set = set(range(1_000_000))
start = time.time() 999_999 in big_list print(f"list查找: {time.time()-start:.6f}s")
start = time.time() 999_999 in big_set print(f"set查找: {time.time()-start:.6f}s")
|
三、SQL 与集合运算的映射
3.1 对照表
| 集合运算 |
SQL实现 |
说明 |
| A ∩ B |
INNER JOIN / INTERSECT |
两表都有的记录 |
| A ∪ B |
FULL OUTER JOIN / UNION |
两表所有记录 |
| A \ B |
LEFT JOIN WHERE b.id IS NULL / EXCEPT |
在A不在B |
| A△B |
FULL OUTER JOIN WHERE … IS NULL |
只在一边的记录 |
3.2 SQL 示例
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
|
table_a = {"user_1", "user_2", "user_3", "user_4"}
table_b = {"user_3", "user_4", "user_5", "user_6"}
inner = table_a & table_b
full_outer = table_a | table_b
left_only = table_a - table_b
|
3.3 关键记忆
UNION 自带去重 = 集合并集的天然特性
UNION ALL 不去重 = 多重集合(multiset)的并
INTERSECT 和 EXCEPT 是SQL标准但MySQL不直接支持(8.0+支持INTERSECT/EXCEPT)
四、大数据应用场景
4.1 用户交集分析
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
| import pandas as pd
df_a = pd.DataFrame({"user_id": [1,2,3,4,5], "product": "A"}) df_b = pd.DataFrame({"user_id": [3,4,5,6,7], "product": "B"})
both = pd.merge(df_a, df_b, on="user_id", how="inner") print(f"同时使用AB的用户: {both['user_id'].tolist()}")
users_a = set(df_a["user_id"]) users_b = set(df_b["user_id"]) print(f"交集: {users_a & users_b}") print(f"只用A: {users_a - users_b}") print(f"只用B: {users_b - users_a}") print(f"至少用一个: {users_a | users_b}")
|
4.2 ETL 数据去重
1 2 3 4 5 6 7 8 9 10 11 12 13 14
| existing_ids = {1001, 1002, 1003, 1004} incoming_ids = {1003, 1004, 1005, 1006}
new_only = incoming_ids - existing_ids print(f"需要新插入: {new_only}")
need_update = incoming_ids & existing_ids print(f"可能需要更新: {need_update}")
|
4.3 漏斗分析
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
| step1_view = {"u1","u2","u3","u4","u5","u6","u7","u8","u9","u10"} step2_cart = {"u1","u2","u3","u5","u7","u8"} step3_order = {"u1","u3","u5","u8"} step4_pay = {"u1","u5"}
steps = [step1_view, step2_cart, step3_order, step4_pay] labels = ["浏览", "加购", "下单", "支付"]
for i, (label, s) in enumerate(zip(labels, steps)): rate = len(s) / len(step1_view) * 100 if i > 0: conv = len(s) / len(steps[i-1]) * 100 lost = steps[i-1] - s print(f"{label}: {len(s)}人 (总转化{rate:.0f}%, 环比{conv:.0f}%, 流失用户: {lost})") else: print(f"{label}: {len(s)}人 (100%)")
|
4.4 pandas merge 详解
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
| import pandas as pd
left = pd.DataFrame({"key": ["a","b","c"], "val_l": [1,2,3]}) right = pd.DataFrame({"key": ["b","c","d"], "val_r": [4,5,6]})
print(pd.merge(left, right, on="key", how="inner"))
print(pd.merge(left, right, on="key", how="outer"))
print(pd.merge(left, right, on="key", how="left"))
merged = pd.merge(left, right, on="key", how="left", indicator=True) left_only = merged[merged["_merge"] == "left_only"] print(f"只在左表: {left_only['key'].tolist()}")
|
五、安全应用场景
5.1 IP 黑白名单运算
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
|
threat_intel_ips = {"10.1.1.1", "10.1.1.2", "192.168.1.100", "172.16.0.5"} ids_alert_ips = {"10.1.1.2", "10.1.1.3", "192.168.1.100", "172.16.0.8"} honeypot_ips = {"10.1.1.1", "10.1.1.3", "172.16.0.5", "172.16.0.9"}
from itertools import combinations
sources = [threat_intel_ips, ids_alert_ips, honeypot_ips] multi_confirmed = set() for s1, s2 in combinations(sources, 2): multi_confirmed |= (s1 & s2) print(f"多源确认的高危IP: {multi_confirmed}")
all_suspicious = threat_intel_ips | ids_alert_ips | honeypot_ips
whitelist = {"10.1.1.1", "192.168.1.100"} final_block = all_suspicious - whitelist print(f"最终封禁列表: {final_block}")
|
5.2 资产交叉比对
1 2 3 4 5 6 7 8 9 10 11 12
|
cmdb_assets = {"srv-001", "srv-002", "srv-003", "srv-004"} scan_found = {"srv-002", "srv-003", "srv-005", "srv-006"}
shadow_assets = scan_found - cmdb_assets offline_assets = cmdb_assets - scan_found healthy = cmdb_assets & scan_found
print(f"影子资产(需纳管): {shadow_assets}") print(f"疑似下线(需确认): {offline_assets}") print(f"正常在管: {healthy}")
|
5.3 告警去重
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
|
alert_set = set()
def dedup_alert(src_ip, dst_ip, alert_type): """基于三元组去重""" key = frozenset({("src", src_ip), ("dst", dst_ip), ("type", alert_type)}) if key in alert_set: return False alert_set.add(key) return True
print(dedup_alert("10.0.0.1", "10.0.0.2", "port_scan")) print(dedup_alert("10.0.0.1", "10.0.0.2", "port_scan")) print(dedup_alert("10.0.0.1", "10.0.0.2", "brute_force"))
|
六、后端应用场景
6.1 缓存与数据库一致性检查
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
|
db_keys = {"product:1001", "product:1002", "product:1003", "product:1004"} cache_keys = {"product:1002", "product:1003", "product:1005"}
stale_cache = cache_keys - db_keys print(f"需清理的缓存: {stale_cache}")
cache_miss = db_keys - cache_keys print(f"未缓存的数据: {cache_miss}")
need_verify = db_keys & cache_keys print(f"需校验一致性: {need_verify}")
|
6.2 权限集合运算
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
|
role_admin = {"read", "write", "delete", "manage_users", "view_logs"} role_editor = {"read", "write", "view_logs"} role_viewer = {"read", "view_logs"}
user_roles = [role_editor, role_viewer]
user_perms = set.union(*user_roles) print(f"用户权限: {user_perms}")
required = {"write", "delete"} has_all = required.issubset(user_perms) has_any = not required.isdisjoint(user_perms)
missing = required - user_perms print(f"缺少权限: {missing}")
|
6.3 标签系统
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
|
article_1_tags = {"python", "backend", "security"} article_2_tags = {"python", "data", "machine-learning"} article_3_tags = {"security", "backend", "devops"}
target = {"python", "security"} articles = { "art1": article_1_tags, "art2": article_2_tags, "art3": article_3_tags, }
result = [name for name, tags in articles.items() if target.issubset(tags)] print(f"匹配的文章: {result}")
def jaccard(set_a, set_b): if not set_a | set_b: return 0 return len(set_a & set_b) / len(set_a | set_b)
print(f"art1 vs art2: {jaccard(article_1_tags, article_2_tags):.2f}") print(f"art1 vs art3: {jaccard(article_1_tags, article_3_tags):.2f}")
|
七、练习题
练习1:集合运算基础
已知 A = {1,2,3,4,5},B = {3,4,5,6,7},C = {5,6,7,8,9}
求:A∩B∩C,A∪B∪C,A△B,(A∪B)\C
答案:
1 2 3 4 5 6 7 8
| A = {1,2,3,4,5} B = {3,4,5,6,7} C = {5,6,7,8,9}
print(f"A∩B∩C = {A & B & C}") print(f"A∪B∪C = {A | B | C}") print(f"A△B = {A ^ B}") print(f"(A∪B)\\C = {(A|B) - C}")
|
练习2:容斥原理
某公司200人,会Python的120人,会SQL的100人,两者都会的50人
问:至少会一种的多少人?两个都不会的多少人?
答案:
|A∪B| = |A| + |B| - |A∩B| = 120 + 100 - 50 = 170人
两个都不会:200 - 170 = 30人
练习3:安全场景实战
你有三个威胁情报源的恶意IP列表:
source_a = {“1.1.1.1”, “2.2.2.2”, “3.3.3.3”}
source_b = {“2.2.2.2”, “3.3.3.3”, “4.4.4.4”}
source_c = {“3.3.3.3”, “4.4.4.4”, “5.5.5.5”}
白名单 = {“4.4.4.4”}
需求:找出至少被2个来源标记且不在白名单中的IP
答案:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
| from itertools import combinations
source_a = {"1.1.1.1", "2.2.2.2", "3.3.3.3"} source_b = {"2.2.2.2", "3.3.3.3", "4.4.4.4"} source_c = {"3.3.3.3", "4.4.4.4", "5.5.5.5"} whitelist = {"4.4.4.4"}
sources = [source_a, source_b, source_c] multi = set() for s1, s2 in combinations(sources, 2): multi |= (s1 & s2)
result = multi - whitelist print(f"需封禁: {result}")
|
练习4:缓存一致性
db_ids = {101, 102, 103, 104, 105}
cache_ids = {103, 104, 105, 106, 107}
请分析:哪些缓存是脏数据?哪些数据没被缓存?哪些需要校验一致性?
答案:
1 2 3 4 5 6 7 8 9
| db_ids = {101, 102, 103, 104, 105} cache_ids = {103, 104, 105, 106, 107}
dirty_cache = cache_ids - db_ids not_cached = db_ids - cache_ids need_check = db_ids & cache_ids print(f"脏缓存: {dirty_cache}") print(f"未缓存: {not_cached}") print(f"需校验: {need_check}")
|
练习5:pandas merge
用 pandas 实现:找出只在表A中出现但不在表B中出现的用户
答案:
1 2 3 4 5 6 7 8
| import pandas as pd
df_a = pd.DataFrame({"user_id": [1,2,3,4,5]}) df_b = pd.DataFrame({"user_id": [3,4,5,6,7]})
merged = pd.merge(df_a, df_b, on="user_id", how="left", indicator=True) only_a = merged[merged["_merge"] == "left_only"]["user_id"].tolist() print(f"只在A中: {only_a}")
|
八、本页小结
集合运算的核心就5个:交∩ 并∪ 差\ 补’ 对称差△
Python 的 set 用 & | - ^ 操作符,O(1)查找是性能利器
SQL 的 JOIN 本质就是集合运算,INNER=交,OUTER=并,LEFT+NULL=差
安全场景:IP名单运算、资产比对、告警去重 → 全是集合操作
后端场景:缓存一致性、权限判断、标签匹配 → 集合思维无处不在
下一篇:12-递归与数学归纳法