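We continue our series of articles on BPF, the general-purpose virtual machine of the Linux kernel. In this installment we look at what types of BPF programs exist and how they are used in the real world of capitalist cash. At the end of the article there are many links, in particular to the two existing books about BPF.
Linux kernel 5.9 defines more than 30 different BPF program types, and I plan to write several articles about some of them, so this article is inevitably an overview and does not go into as much technical detail as the previous ones. Nevertheless, we will try to finally answer the question of why all this is needed and why there is so much noise around BPF.
If you want to know how exactly BPF helps to solve problems such as DDoS protection, server load balancing, implementing the Kubernetes network stack, protecting systems against attacks, efficiently tracing 24x7 production systems, and much more, then welcome.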

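Program types and table of contents
All existing BPF program types are registered in the Linux kernel file include/uapi/linux/bpf.h. In the sections below I have tried to divide them into logical groups (sections marked with an asterisk are background primers):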
BPF .
, - , , BPF_PROG_* .
Commit | Author | Date | Program type
0975 | Alexei Starovoitov | 2014-09-26 | BPF_PROG_TYPE_UNSPEC
ddd8 | Alexei Starovoitov | 2014-12-01 | BPF_PROG_TYPE_SOCKET_FILTER
2541 | Alexei Starovoitov | 2015-03-25 | BPF_PROG_TYPE_KPROBE
96be | Daniel Borkmann | 2015-03-01 | BPF_PROG_TYPE_SCHED_CLS
94ca | Daniel Borkmann | 2015-03-20 | BPF_PROG_TYPE_SCHED_ACT
98b5 | Alexei Starovoitov | 2016-04-06 | BPF_PROG_TYPE_TRACEPOINT
6a77 | Brenden Blanco | 2016-07-19 | BPF_PROG_TYPE_XDP
0515 | Alexei Starovoitov | 2016-09-01 | BPF_PROG_TYPE_PERF_EVENT
0e33 | Daniel Mack | 2016-11-23 | BPF_PROG_TYPE_CGROUP_SKB
6102 | David Ahern | 2016-12-01 | BPF_PROG_TYPE_CGROUP_SOCK
3a0a | Thomas Graf | 2016-11-30 | BPF_PROG_TYPE_LWT_IN
3a0a | Thomas Graf | 2016-11-30 | BPF_PROG_TYPE_LWT_OUT
3a0a | Thomas Graf | 2016-11-30 | BPF_PROG_TYPE_LWT_XMIT
4030 | Lawrence Brakmo | 2017-06-30 | BPF_PROG_TYPE_SOCK_OPS
b005 | John Fastabend | 2017-08-15 | BPF_PROG_TYPE_SK_SKB
ebc6 | Roman Gushchin | 2017-11-05 | BPF_PROG_TYPE_CGROUP_DEVICE
4f73 | John Fastabend | 2018-03-18 | BPF_PROG_TYPE_SK_MSG
c4f6 | Alexei Starovoitov | 2018-03-28 | BPF_PROG_TYPE_RAW_TRACEPOINT
4fba | Andrey Ignatov | 2018-03-30 | BPF_PROG_TYPE_CGROUP_SOCK_ADDR
004d | Mathieu Xhonneux | 2018-05-20 | BPF_PROG_TYPE_LWT_SEG6LOCAL
f436 | Sean Young | 2018-05-27 | BPF_PROG_TYPE_LIRC_MODE2
2dbb | Martin KaFai Lau | 2018-08-08 | BPF_PROG_TYPE_SK_REUSEPORT
d58e | Petar Penkov | 2018-09-14 | BPF_PROG_TYPE_FLOW_DISSECTOR
7b14 | Andrey Ignatov | 2019-02-27 | BPF_PROG_TYPE_CGROUP_SYSCTL
9df1 | Matt Mullins | 2019-04-26 | BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE
0d01 | Stanislav Fomichev | 2019-06-27 | BPF_PROG_TYPE_CGROUP_SOCKOPT
f1b9 | Alexei Starovoitov | 2019-10-30 | BPF_PROG_TYPE_TRACING
27ae | Martin KaFai Lau | 2020-01-08 | BPF_PROG_TYPE_STRUCT_OPS
be87 | Alexei Starovoitov | 2020-01-20 | BPF_PROG_TYPE_EXT
fc61 | KP Singh | 2020-03-29 | BPF_PROG_TYPE_LSM
e9dd | Jakub Sitnicki | 2020-07-17 | BPF_PROG_TYPE_SK_LOOKUP
Linux
1992 «» (- ) ( ). , «LINUX is obsolete» , Linux ( 1992 ) , . «» , , , :
«, linux , , . , , , . ( ) linux . GNU , : , , . Linux , GNU " "»
, — BPF Linux . 2020 Martin KaFai Lau , . — - , , - .
BPF: BPF_PROG_TYPE_STRUCT_OPS. , Daniel Borkman , BPF — , .
, BPF. - , . BPF tcp_congestion_ops, TCP congestion control. — DCTCP CUBIC BPF.
, , BPF, , , (, BPF ) . , , — . . BPF Summit.
BPF
BPF Brendan Gregg, , Linux . bcc, bpftrace, «BPF Performance Tools», , BPF, .. Facebook Netflix, , BPF, 24x7. BPF — BPF .
? . BPF :
- () Linux
- tracepoint
- perf, software hardware
maps, , . , BPF, , , .
( bpftrace, ):
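#!/usr/bin/env bpftrace
#include <linux/skbuff.h>
#include <linux/ip.h>

// Count ICMPv4 echo requests per (source, destination) address pair.
k:icmp_echo {
    // arg0 is the first argument of icmp_echo(): the received sk_buff.
    $skb = (struct sk_buff *) arg0;
    // The IP header sits network_header bytes past the start of the buffer.
    $iphdr = (struct iphdr *) ($skb->head + $skb->network_header);
    @pingstats[ntop($iphdr->saddr), ntop($iphdr->daddr)]++;
}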
, . kprobe icmp_echo, ICMPv4 echo request. , arg0 , — sk_buff, . IP @pingstats. , , IP ! , kprobe, user space, .
BPF, tracing:
BPF_PROG_TYPE_KPROBE: BPF kprobe, kretprobe, uprobe uretprobe. , (.. ), , , .BPF_PROG_TYPE_PERF_EVENT: BPF perf.BPF_PROG_TYPE_TRACEPOINT: BPF tracepoint. , kprobes? , tracepoints — API ( , / tracepoint ) , tracepoints ( ).BPF_PROG_TYPE_RAW_TRACEPOINT: tracepoints . raw tracepoints BPF «» , ,BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE: tracepoints (. )BPF_PROG_TYPE_TRACING: , : tracepoints, , , ( :sudo cat /sys/kernel/debug/error_injection/list), «». , BTF — .
, — Linux BPF, .
Linux Security Modules
Linux (security hooks), , , , .. Linux (Linux Security Modules LSM) SELinux, AppArmor, .., .
, . , API Kernel Runtime Security Instrumentation LSS-NA 2019. BPF, BPF_PROG_TYPE_LSM, , BPF LSM . , , BPF, ..
, , BPF, . , BPF. , LSM . user mode helper, BPF libbpf, libbpf .
KRSI KP Singh, KRSI, BPF Summit.
BPF
Tail calls
, , BPF . , BPF 4096 . , BPF . — . tail calls.
tail calls . , , — . BPF_MAP_TYPE_PROG_ARRAY, BPF ( ):

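The jump is made with the bpf_tail_call helper. For example, bpf_tail_call(ctx, &map, 1), where ctx is the context of the calling program, jumps to the program stored at index 1 of the map. On success this is a long jump and control does not return to the caller. The number of chained tail calls is limited to 32.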
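The dispatch rule can be modelled in plain userspace C. This is only a sketch of the semantics, not kernel or BPF code: `prog_array`, `tail_call`, `struct ctx`, and the three stage functions are hypothetical names invented for the illustration, and the limit of 32 is simply the number quoted in the text.

```c
#include <stddef.h>

#define MAX_TAIL_CALLS  32  /* limit quoted in the text; an assumption of this model */
#define PROG_ARRAY_SIZE 4

/* Hypothetical context shared by all programs in the chain. */
struct ctx {
    int tail_calls;   /* jumps taken so far */
    int visited;      /* bitmask of stages that ran */
    int hops;         /* how many times looper() ran */
};

typedef int (*prog_fn)(struct ctx *);

/* Stand-in for a BPF_MAP_TYPE_PROG_ARRAY: index -> program (NULL = empty). */
static prog_fn prog_array[PROG_ARRAY_SIZE];

/* Model of the jump. Returns 1 and stores the callee's result in *ret when
 * the jump happens; returns 0 when the index is out of range, the slot is
 * empty, or the limit is reached, so the caller simply falls through. */
static int tail_call(struct ctx *c, unsigned int idx, int *ret)
{
    if (idx >= PROG_ARRAY_SIZE || prog_array[idx] == NULL)
        return 0;
    if (c->tail_calls >= MAX_TAIL_CALLS)
        return 0;
    c->tail_calls++;
    *ret = prog_array[idx](c);
    return 1;
}

static int stage_final(struct ctx *c)
{
    c->visited |= 2;
    return 42;
}

static int stage_parse(struct ctx *c)
{
    int ret;

    c->visited |= 1;
    if (tail_call(c, 1, &ret))
        return ret;   /* jump taken: nothing below runs */
    return -1;        /* fall-through path */
}

/* Jumps to slot 2 again and again; only the limit stops it. */
static int looper(struct ctx *c)
{
    int ret;

    c->hops++;
    if (tail_call(c, 2, &ret))
        return ret;
    return c->hops;
}
```

In the model a taken jump never executes the rest of the caller, and an empty slot or an exhausted counter makes the caller continue as if nothing happened, which is why real programs usually place a sensible default return after the call.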
, 5.1 , , . tail calls, .
, , tail calls . , bpf_tail_call, ? — , .
, XDP .
, tail calls — XDP features. XDP, , «» XDP . , , «» . , , , . tail calls, , . , , , - , « » — , , , .
tail calls . BPF, BPF_PROG_TYPE_EXT, . BPF trampoline TRACING , — , .
, , xdp-dispatcher — , XDP . «» -, XDP , - . . Multiple XDP programs on a single interface—status and next steps Toke Høiland-Jørgensen Linux Plumbers 2020.
LIRC: Linux Infrared Remote Control
, BPF BPF_PROG_TYPE_LIRC_MODE2 . lwn Sean Young, , , .
, BPF , / . BPF , , , map. , - , -. , , bpf_rc_keydown :

, , lirc? Sean Young , BPF , , API: IR userspace ( ).
BPF
BPF Berkeley Labs, BPF Linux, , BPF .
«» — XDP Linux — - , / Linux.
Linux:
, BPF , , XDP, Linux. . ( , XDP, , .)
, Linux . , DMA , CPU. , , CPU, RAM.

Linux — top half bottom half. — , , (top half), (bottom half), softirq . , bottom halves, , softirq NET_RX.
softirq , struct sk_buff. sk_buff, socket buffer, — Linux. Linux sk_buff. , : , , -, .., .. , sk_buff .

. , head end , data tail , net_header transport_hdr , , .. data — «» .
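The relationship between the four pointers can be sketched with a small userspace model. It is a simplification under stated assumptions, not the kernel's struct sk_buff: `struct buf` and the `buf_*` functions are hypothetical names that loosely mirror what the kernel helpers skb_reserve, skb_put, skb_push and skb_pull do, and bounds checking is left out for brevity.

```c
#include <stddef.h>

/* head..end is the whole allocation, data..tail is the part in use. */
struct buf {
    unsigned char *head;
    unsigned char *data;
    unsigned char *tail;
    unsigned char *end;
};

static void buf_init(struct buf *b, unsigned char *mem, size_t size)
{
    b->head = b->data = b->tail = mem;
    b->end = mem + size;
}

/* Leave headroom in an empty buffer for headers added later. */
static void buf_reserve(struct buf *b, size_t n)
{
    b->data += n;
    b->tail += n;
}

/* Append n bytes of payload; returns where to write them. */
static unsigned char *buf_put(struct buf *b, size_t n)
{
    unsigned char *old_tail = b->tail;

    b->tail += n;
    return old_tail;
}

/* Prepend an n-byte header into the headroom. */
static unsigned char *buf_push(struct buf *b, size_t n)
{
    b->data -= n;
    return b->data;
}

/* Strip an n-byte header from the front. */
static unsigned char *buf_pull(struct buf *b, size_t n)
{
    b->data += n;
    return b->data;
}

static size_t buf_len(const struct buf *b)
{
    return (size_t)(b->tail - b->data);
}
```

Adding or removing a protocol header only moves `data`; the packet bytes themselves are never copied, which is the point of keeping headroom in front of the payload.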
netif_receive_skb . ? Netfilter wiki:
sk_buff , (ingress qdisc ), , netfiler. , (sk_buff) , — .
, ( ). start, softirq, eBPF XDP , ...
— Express Data Path
sk_buff — , , VLAN .. BPF XDP (Express Data Path) sk_buff.
XDP , , , RAM . struct xdp_md, , , . — — XDP (XDP_DROP), , (XDP_TX), (XDP_REDIRECT), (XDP_PASS):
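A minimal sketch of such a program, assuming a made-up policy (drop ICMPv4, pass everything else): the function names and the policy are illustrative only. The parsing logic is kept in a separate helper that takes plain pointers, so the same file can also be compiled natively to exercise the logic.

```c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>

/* Decide what to do with one frame. data and data_end delimit the packet
 * bytes, as in struct xdp_md. Every header access is preceded by a bounds
 * check, which the verifier requires. */
static inline __attribute__((always_inline)) int
xdp_verdict(void *data, void *data_end)
{
    struct ethhdr *eth = data;
    struct iphdr *ip;

    if ((void *)(eth + 1) > data_end)
        return XDP_DROP;                /* truncated Ethernet header */
    if (eth->h_proto != __constant_htons(ETH_P_IP))
        return XDP_PASS;                /* not IPv4: leave it to the stack */

    ip = (struct iphdr *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_DROP;                /* truncated IPv4 header */

    /* Hypothetical policy for this sketch: drop ICMP, pass the rest. */
    return ip->protocol == IPPROTO_ICMP ? XDP_DROP : XDP_PASS;
}

__attribute__((section("xdp"), used))
int xdp_drop_icmp(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    return xdp_verdict(data, data_end);
}
```

Built with clang -target bpf, an object like this can typically be attached with iproute2, along the lines of ip link set dev <dev> xdp obj <prog.o> sec xdp. On some distributions clang needs an extra include path to find the asm/ headers.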

, / , , , , , MAC , XDP , / .
XDP AF_XDP, , , zero copy. , DPDK, :

AF_XDP: AF_XDP (rx queue), . XDP, . , , , . ( , , , . , UDP 65784 AF_XDP, 13, , , : ethtool -N flow-type udp4 dst-port 65784 action 13.)
, XDP , . , , CPU 0%. Netronome, , .
«» XDP, : DDoS . , Facebook, load balancer katran, XDP, Cloudfare XDP DDoS load balancing, cilium XDP , .. R&D , XDP P4 , , ( NPU — Networking Processing Unit).
XDP , — , . , XDP, XDP Tutorial, , , — kozlyuk .
struct __sk_buff
BPF, . , Linux sk_buff, . len — , network_header — L3, dev — struct net_device , .
, — sk_buff ( XDP, sk_buff ), , BPF sk_buff. , — BPF struct __sk_buff:
struct __sk_buff {
    __u32 len;
    __u32 pkt_type;
    __u32 mark;
    __u32 queue_mapping;
    __u32 protocol;
    __u32 vlan_present;
    ...
};
__sk_buff sk_buff, Verifier , . BPF __sk_buff :
int bpf_prog(struct __sk_buff *ctx)
{
    __u32 len = ctx->len;
    __u32 type = ctx->pkt_type;
    ...
}

, Verifier . , pkt_type, 3, Verifier , .
, / . , , , .
skbuff.c ( -, ):
#include <linux/bpf.h>

__attribute__((section("socket/test")))
int bpf_prog(struct __sk_buff *ctx)
{
    __u32 len = ctx->len;
    __u32 type = ctx->pkt_type;
    return len + type;
}
:
clang -target bpf -O2 skbuff.c -o skbuff.o -c
(, , , ):
mkdir mnt
sudo mount -t bpf none ./mnt
sudo bpftool prog load ./skbuff.o ./mnt/xxx
:
$ llvm-objdump -D ./skbuff.o --section socket/test
0: 61 12 00 00 00 00 00 00 r2 = *(u32 *)(r1 + 0)
1: 61 10 04 00 00 00 00 00 r0 = *(u32 *)(r1 + 4)
2: 0f 20 00 00 00 00 00 00 r0 += r2
3: 95 00 00 00 00 00 00 00 exit
, :
$ sudo bpftool prog dump xlated pinned ./mnt/xxx
0: (61) r2 = *(u32 *)(r1 +104)
1: (71) r0 = *(u8 *)(r1 +120)
2: (54) w0 &= 7
3: (0f) r0 += r2
4: (95) exit
Linux
Linux, , , sk_buff, ingress qdisc. , , , egress qdisc — , , / netfilter.
Qdisc queueing discipline Linux — Traffic Control (TC). egress qdisc — , . , - , .
— classful classless — . — , . , egress qdisc, pfifo_fast, TOS IPv4 IPv6 ( . lartc 9.2):

— qdisc noqueue, , , .
Classful qdiscs . qdiscs. , . . , , : (classifiers) (actions). , , , , . : u32, flower .. : drop ( ), reclassify ( , , , VLAN tag), ..
, qdiscs , . qdiscs , C ? BPF, BPF_PROG_TYPE_SCHED_CLS BPF_PROG_TYPE_SCHED_ACT, , . , qdisc clsact, egress, ingress, BPF BPF_PROG_TYPE_SCHED_CLS direct action. — BPF — actions, .. .
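A direct-action classifier can be very small. The sketch below assumes a made-up policy (drop packets that carry a particular mark); `BLOCKED_MARK` and the function name are hypothetical, while TC_ACT_OK and TC_ACT_SHOT come from linux/pkt_cls.h.

```c
#include <linux/bpf.h>
#include <linux/pkt_cls.h>

#define BLOCKED_MARK 0x2a   /* hypothetical mark value chosen for this sketch */

/* In direct-action mode the return value of the classifier is the action. */
__attribute__((section("classifier"), used))
int tc_drop_marked(struct __sk_buff *skb)
{
    if (skb->mark == BLOCKED_MARK)
        return TC_ACT_SHOT;   /* drop the packet */
    return TC_ACT_OK;         /* let it continue */
}
```

With iproute2 such an object is usually attached in two steps, roughly tc qdisc add dev <dev> clsact followed by tc filter add dev <dev> egress bpf da obj <prog.o> sec classifier, where da selects direct-action mode.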
BPF TC, — BPF Reference Guide Daniel Borkman cilium — CNI kubernetes, Alibaba Google.
BPF
BPF — BPF , . eBPF cBPF, , eBPF cBPF. , BPF BPF_PROG_TYPE_SOCKET_FILTER SO_ATTACH_BPF. , , CAP_SYS_ADMIN.
BPF BPF_PROG_TYPE_SOCKET_FILTER , :
- , BPF, BPF, , (
sk_buff) : , (, ). , RAW , .BPF_PROG_TYPE_SOCKET_FILTEREvil eBPF In-Depth DEFCONF 27. -
AF_PACKETPACKET_FANOUT, . , , DPI. Linux . 2015 fanoutPACKET_FANOUT_DATA, BPF. - 2007
xt_bpfnetfilter. BPF, . 2016 eBPF — eBPFBPF_PROG_TYPE_SOCKET_FILTER. - , , , tun , , , , VM. .
TUNSETFILTEREBPF. - tun BPF . .
TUNSETSTEERINGEBPF. - Kernel Connection Multiplexor TCP datagram (. lwn, kcm). TCP BPF
BPF_PROG_TYPE_SOCKET_FILTER,AF_KCMSIOCKCMATTACH(. , ). - ,
BPF_PROG_TYPE_SOCKET_FILTERSO_ATTACH_REUSEPORT_EBPF, . Perfect locality and three epic SystemTap scripts.
« » flower
__skb_flow_dissect, , Linux flow dissector — - . , , ingress Linux, flower.
, , , . BPF — BPF_PROG_TYPE_FLOW_DISSECTOR, BPF. namespace.
BPF
(cgroups) . , BPF : BPF cgroup. , ( , ) -, . cgroups , , , , - . BPF. , .
BPF_PROG_TYPE_CGROUP_SKB BPF (ingress) (egress) . 1, , 0, . , . , , .. : BPF systemd.
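As an illustration of the return convention (1 lets the packet through, 0 drops it), here is a sketch of an egress filter that only allows IPv4 for the processes in a cgroup. The policy and the function name are assumptions made for the example.

```c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <asm/byteorder.h>

/* Return 1 to allow the packet, 0 to drop it. */
__attribute__((section("cgroup_skb/egress"), used))
int egress_ipv4_only(struct __sk_buff *skb)
{
    /* skb->protocol is in network byte order. */
    if (skb->protocol == __constant_htons(ETH_P_IP))
        return 1;
    return 0;
}
```

Once loaded and pinned, a program of this type can be bound to a cgroup with bpftool, for example bpftool cgroup attach <cgroup path> egress pinned <pinned program>.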
, BPF . , BPF BPF_PROG_TYPE_CGROUP_SOCK , struct sock. sk_bound_dev_if , . bind(2) / .
, , BPF BPF_PROG_TYPE_CGROUP_SOCK_ADDR. bind IP , ( use case : cgroup , , . ). connect, getpeername, getsockname, sendmsg recvmsg. , , , cilium iptables k8s.
BPF_PROG_TYPE_CGROUP_SOCKOPT setsockopt.
BPF_PROG_TYPE_CGROUP_DEVICE cgroupv2 , device cgroupsv1.
BPF_PROG_TYPE_CGROUP_SYSCTL sysctl , , , cgroup .
.
BPF
BPF_PROG_TYPE_SK_SKB . : SOCKMAP, . , , recvmsg, BPF, sk_buff . , Isovalent CNI cilium k8s, Cloudfare, . SOCKMAP — TCP splicing of the future.
BPF_PROG_TYPE_SK_SKB, BPF_PROG_TYPE_SK_MSG , sendmsg sendpage , L7 — , . BPF_PROG_TYPE_SK_SKB, sockmap .
BPF_PROG_TYPE_SK_REUSEPORT , SO_REUSEPORT. BPF, , .
BPF_PROG_TYPE_SK_LOOKUP , . : , IP , , , . namespaces.
, , TCP — BPF_PROG_TYPE_SOCK_OPS. cgroupv2, BPF_PROG_TYPE_CGROUP_SOCKOPT, , .., . , TCP , .
LWT:
, , . . , IPv4- IPv6-, VPN, .
, , . , , , , .

, Linux, : ip link add name ipip0 type ipip... .. 2015 Linux . , , , — .
- BPF_PROG_TYPE_LWT_IN: lwtunnel_input
- BPF_PROG_TYPE_LWT_OUT: lwtunnel_output
- BPF_PROG_TYPE_LWT_XMIT: lwtunnel_xmit
input , output — , xmit — . struct __sk_buff, (BPF_OK), (BPF_DROP), (BPF_REDIRECT) , , (BPF_DROP). , xmit — , .
netlink . - , BPF iproute2, :
ip route add 10.0.0.0/24 encap bpf xmit obj <prog.o> section <section> dev <dev>
<prog.o> — BPF ELF, <section> — .
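A source file for such a <prog.o> might look like the sketch below. The size limit, the function name, and the section name lwt_xmit are assumptions for the example; BPF_OK and BPF_DROP are the return codes defined in linux/bpf.h.

```c
#include <linux/bpf.h>

#define MAX_LEN 1400   /* hypothetical size limit chosen for this sketch */

/* Pass packets up to MAX_LEN bytes on this route, drop larger ones. */
__attribute__((section("lwt_xmit"), used))
int lwt_drop_big(struct __sk_buff *skb)
{
    if (skb->len > MAX_LEN)
        return BPF_DROP;
    return BPF_OK;
}
```

With this file compiled to <prog.o>, the <section> argument of the ip route command above would be lwt_xmit.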
2018 , BPF_PROG_TYPE_LWT_SEG6LOCAL, seg6local, . Using SRv6.
BPF: BPF_PROG_TYPE_UNSPEC. , , / . bpf(2) .
, ! 99% , . , BPF Linux BPF_PROG_TYPE_UNSPEC BPF, , , , tcpdump wireshark Linux, .
,
BPF Linux, , - . BPF , , Linux. , BPF Linux.
(, , ) Linux — BPF kprobes, tracepoints perf events, — libbpf, bcc bpftrace.
2,5
- Brendan Gregg, «BPF Performance Tools». BPF Linux — BCC, . BPF .
- Brendan Gregg, «Systems Performance: Enterprise and the Cloud, 2nd Edition (2020)». «Systems Performance». : BPF, Solaris, . «BPF Performance Tools» «?», «?»
- David Calavera and Lorenzo Fontana, «Linux Observability with BPF». . BPF, , , .
Online-,
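There are many articles and talks about BPF. So we will take advantage of the fact that the aforementioned company Isovalent is trying to lead the hype around BPF: in particular, it recently set up a site with documentation and held the BPF Summit, a small conference about BPF. Fun fact: the participants of that BPF Summit chose a new BPF mascot, a bee, and came up with the catchy name Ebee: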

