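Continuing our series of articles about BPF, the general-purpose virtual machine of the Linux kernel, in this installment we look at which types of BPF programs exist and how they are used in the real world of capitalist cash. The end of the article also collects a number of links, in particular to two existing books about BPF.
Linux kernel 5.9 defines more than 30 different BPF program types, and I plan to write separate articles about some of them, so this article is inevitably an overview and contains less technical detail than the previous ones. Nevertheless, we will try to finally answer the question of why all this is needed and why there is so much noise around BPF.
If you want to know how exactly BPF helps solve problems such as DDoS protection, server load balancing, implementing the Kubernetes network stack, protecting systems from attacks, efficiently tracing 24x7 production systems, and many others, then welcome.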
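Program types and table of contents
All existing BPF program types are registered in the Linux kernel file include/uapi/linux/bpf.h. In the following sections I have tried to divide them into logical groups (sections that serve as technical background are marked with an asterisk):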
BPF .
, - , , BPF_PROG_*
.
Commit | Author | Date | Program type |
0975 | Alexei Starovoitov | 2014-09-26 | BPF_PROG_TYPE_UNSPEC |
ddd8 | Alexei Starovoitov | 2014-12-01 | BPF_PROG_TYPE_SOCKET_FILTER |
2541 | Alexei Starovoitov | 2015-03-25 | BPF_PROG_TYPE_KPROBE |
96be | Daniel Borkmann | 2015-03-01 | BPF_PROG_TYPE_SCHED_CLS |
94ca | Daniel Borkmann | 2015-03-20 | BPF_PROG_TYPE_SCHED_ACT |
98b5 | Alexei Starovoitov | 2016-04-06 | BPF_PROG_TYPE_TRACEPOINT |
6a77 | Brenden Blanco | 2016-07-19 | BPF_PROG_TYPE_XDP |
0515 | Alexei Starovoitov | 2016-09-01 | BPF_PROG_TYPE_PERF_EVENT |
0e33 | Daniel Mack | 2016-11-23 | BPF_PROG_TYPE_CGROUP_SKB |
6102 | David Ahern | 2016-12-01 | BPF_PROG_TYPE_CGROUP_SOCK |
3a0a | Thomas Graf | 2016-11-30 | BPF_PROG_TYPE_LWT_IN |
3a0a | Thomas Graf | 2016-11-30 | BPF_PROG_TYPE_LWT_OUT |
3a0a | Thomas Graf | 2016-11-30 | BPF_PROG_TYPE_LWT_XMIT |
4030 | Lawrence Brakmo | 2017-06-30 | BPF_PROG_TYPE_SOCK_OPS |
b005 | John Fastabend | 2017-08-15 | BPF_PROG_TYPE_SK_SKB |
ebc6 | Roman Gushchin | 2017-11-05 | BPF_PROG_TYPE_CGROUP_DEVICE |
4f73 | John Fastabend | 2018-03-18 | BPF_PROG_TYPE_SK_MSG |
c4f6 | Alexei Starovoitov | 2018-03-28 | BPF_PROG_TYPE_RAW_TRACEPOINT |
4fba | Andrey Ignatov | 2018-03-30 | BPF_PROG_TYPE_CGROUP_SOCK_ADDR |
004d | Mathieu Xhonneux | 2018-05-20 | BPF_PROG_TYPE_LWT_SEG6LOCAL |
f436 | Sean Young | 2018-05-27 | BPF_PROG_TYPE_LIRC_MODE2 |
2dbb | Martin KaFai Lau | 2018-08-08 | BPF_PROG_TYPE_SK_REUSEPORT |
d58e | Petar Penkov | 2018-09-14 | BPF_PROG_TYPE_FLOW_DISSECTOR |
7b14 | Andrey Ignatov | 2019-02-27 | BPF_PROG_TYPE_CGROUP_SYSCTL |
9df1 | Matt Mullins | 2019-04-26 | BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE |
0d01 | Stanislav Fomichev | 2019-06-27 | BPF_PROG_TYPE_CGROUP_SOCKOPT |
f1b9 | Alexei Starovoitov | 2019-10-30 | BPF_PROG_TYPE_TRACING |
27ae | Martin KaFai Lau | 2020-01-08 | BPF_PROG_TYPE_STRUCT_OPS |
be87 | Alexei Starovoitov | 2020-01-20 | BPF_PROG_TYPE_EXT |
fc61 | KP Singh | 2020-03-29 | BPF_PROG_TYPE_LSM |
e9dd | Jakub Sitnicki | 2020-07-17 | BPF_PROG_TYPE_SK_LOOKUP |
Linux
1992 «» (- ) ( ). , «LINUX is obsolete» , Linux ( 1992 ) , . «» , , , :
«, linux , , . , , , . ( ) linux . GNU , : , , . Linux , GNU " "»
, — BPF Linux . 2020 Martin KaFai Lau , . — - , , - .
BPF: BPF_PROG_TYPE_STRUCT_OPS
. , Daniel Borkman , BPF — , .
, BPF. - , . BPF tcp_congestion_ops
, TCP congestion control. — DCTCP CUBIC BPF.
, , BPF, , , (, BPF ) . , , — . . BPF Summit.
BPF
BPF Brendan Gregg, , Linux . bcc, bpftrace, «BPF Performance Tools», , BPF, .. Facebook Netflix, , BPF, 24x7. BPF — BPF .
? . BPF :
- () Linux
- tracepoint
- perf, software hardware
maps, , . , BPF, , , .
( bpftrace
, ):
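#!/usr/bin/env bpftrace
#include <linux/skbuff.h>
#include <linux/ip.h>

// Count ICMPv4 echo requests per (source, destination) address pair.
k:icmp_echo {
    // arg0 is the first argument of icmp_echo: a pointer to the sk_buff.
    $skb = (struct sk_buff *) arg0;
    // The IP header sits at offset network_header from the buffer start.
    $iphdr = (struct iphdr *) ($skb->head + $skb->network_header);
    @pingstats[ntop($iphdr->saddr), ntop($iphdr->daddr)]++;
}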
, . kprobe icmp_echo
, ICMPv4 echo request. , arg0
, — sk_buff
, . IP @pingstats
. , , IP ! , kprobe, user space, .
BPF, tracing:
- BPF_PROG_TYPE_KPROBE: BPF kprobe, kretprobe, uprobe uretprobe. , (.. ), , , .
- BPF_PROG_TYPE_PERF_EVENT: BPF perf.
- BPF_PROG_TYPE_TRACEPOINT: BPF tracepoint. , kprobes? , tracepoints - API ( , / tracepoint ) , tracepoints ( ).
- BPF_PROG_TYPE_RAW_TRACEPOINT: tracepoints . raw tracepoints BPF «» , ,
- BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE: tracepoints (. )
- BPF_PROG_TYPE_TRACING: , : tracepoints, , , ( : sudo cat /sys/kernel/debug/error_injection/list ), «». , BTF - .
, — Linux BPF, .
Linux Security Modules
Linux (security hooks), , , , .. Linux (Linux Security Modules LSM) SELinux, AppArmor, .., .
, . , API Kernel Runtime Security Instrumentation LSS-NA 2019. BPF, BPF_PROG_TYPE_LSM
, , BPF LSM . , , BPF, ..
, , BPF, . , BPF. , LSM . user mode helper, BPF libbpf, libbpf .
KRSI KP Singh, KRSI, BPF Summit.
BPF
Tail calls
, , BPF . , BPF 4096 . , BPF . — . tail calls.
tail calls . , , —
.
BPF_MAP_TYPE_PROG_ARRAY
, BPF ( ):
-
bpf_tail_call
. , ,
bpf_tail_call(ctx, &map, 1)
, ctx
— , . , long jump, . 32, , .
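The tail call mechanism described above can be illustrated with a small userspace model. This is only a sketch, not kernel code: `model_prog_array` stands in for a `BPF_MAP_TYPE_PROG_ARRAY` map, `model_run` for the kernel's dispatch, and every `model_*` name is hypothetical. A program requests a jump by writing a slot index into `next`. The model then runs the target in place of the caller, and the chain is cut off after the limit of 32 quoted in the text. One simplification to keep in mind: when a jump fails here, the caller's return value is used as the result, whereas a real BPF program simply continues executing after `bpf_tail_call`.

```c
#include <stddef.h>

#define MODEL_MAX_TAIL_CALLS 32   /* chain limit quoted in the text */
#define MODEL_PROG_ARRAY_SIZE 4

struct model_ctx {
    int counter;                  /* shared context passed along the chain */
};

/* A "program" returns a value and may request a tail call via *next. */
typedef int (*model_prog_fn)(struct model_ctx *ctx, int *next);

/* Stand-in for a BPF_MAP_TYPE_PROG_ARRAY map: slots hold programs. */
struct model_prog_array {
    model_prog_fn slots[MODEL_PROG_ARRAY_SIZE];
};

/* Run the program in slot `entry` (assumed valid) and follow tail calls. */
int model_run(const struct model_prog_array *map, int entry,
              struct model_ctx *ctx)
{
    int idx = entry;
    int tail_calls = 0;

    for (;;) {
        int next = -1;
        int ret = map->slots[idx](ctx, &next);

        if (next < 0)
            return ret;           /* no tail call requested */
        if (next >= MODEL_PROG_ARRAY_SIZE || map->slots[next] == NULL)
            return ret;           /* empty slot: the jump does not happen */
        if (tail_calls >= MODEL_MAX_TAIL_CALLS)
            return ret;           /* chain too long: the jump is refused */
        tail_calls++;
        idx = next;               /* the target replaces the caller */
    }
}

/* Example programs (hypothetical). */
int model_prog_parse(struct model_ctx *ctx, int *next)
{
    ctx->counter += 1;
    *next = 1;                    /* continue in slot 1 */
    return 0;
}

int model_prog_final(struct model_ctx *ctx, int *next)
{
    (void)next;                   /* end of the chain */
    ctx->counter += 10;
    return ctx->counter;
}

int model_prog_loop(struct model_ctx *ctx, int *next)
{
    ctx->counter += 1;
    *next = 2;                    /* jumps to itself until the limit hits */
    return ctx->counter;
}
```

With `model_prog_parse`, `model_prog_final` and `model_prog_loop` in slots 0, 1 and 2, starting from slot 0 runs two programs and returns the value produced by the second. Starting from slot 2 terminates only because of the chain limit.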
, 5.1 , , . tail calls, .
, , tail calls . , bpf_tail_call
, ? — , .
, XDP .
, tail calls — XDP features. XDP, , «» XDP . , , «» . , , , . tail calls, , . , , , - , « » — , , , .
tail calls . BPF, BPF_PROG_TYPE_EXT
, . BPF trampoline TRACING , — , .
, , xdp-dispatcher — , XDP . «» -, XDP , - . . Multiple XDP programs on a single interface—status and next steps Toke Høiland-Jørgensen Linux Plumbers 2020.
LIRC: Linux Infrared Remote Control
, BPF BPF_PROG_TYPE_LIRC_MODE2
. lwn Sean Young, , , .
, BPF , / . BPF , , , map. , - , -. , , bpf_rc_keydown
:
, , lirc? Sean Young , BPF , , API: IR userspace ( ).
BPF
BPF Berkeley Labs, BPF Linux, , BPF .
«» — XDP Linux — - , / Linux.
Linux:
, BPF , , XDP, Linux. . ( , XDP, , .)
, Linux . , DMA , CPU. , , CPU, RAM.
Linux — top half bottom half. — , , (top half), (bottom half), softirq . , bottom halves, , softirq NET_RX
.
softirq , struct sk_buff
. sk_buff
, socket buffer, — Linux. Linux sk_buff
. , : , , -, .., .. , sk_buff
.
. , head
end
, data
tail
, net_header
transport_hdr
, , .. data
— «» .
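To make the roles of the `head`, `data`, `tail` and `end` pointers more concrete, here is a minimal userspace model. It is a sketch under stated assumptions rather than the kernel's implementation: `struct model_skb` and all `model_skb_*` functions are hypothetical names that only loosely follow the kernel's put, pull and push operations on a socket buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace model only: the four buffer pointers of a socket buffer. */
struct model_skb {
    uint8_t *head;   /* start of the allocated buffer */
    uint8_t *data;   /* start of the bytes the current layer works on */
    uint8_t *tail;   /* end of the packet bytes */
    uint8_t *end;    /* end of the allocated buffer */
};

/* Empty buffer with `headroom` bytes reserved in front (headroom <= size). */
void model_skb_init(struct model_skb *skb, uint8_t *buf, size_t size,
                    size_t headroom)
{
    skb->head = buf;
    skb->data = buf + headroom;
    skb->tail = buf + headroom;
    skb->end = buf + size;
}

size_t model_skb_len(const struct model_skb *skb)
{
    return (size_t)(skb->tail - skb->data);
}

size_t model_skb_headroom(const struct model_skb *skb)
{
    return (size_t)(skb->data - skb->head);
}

/* Append n bytes at the tail; returns the old tail or NULL if no room. */
uint8_t *model_skb_put(struct model_skb *skb, size_t n)
{
    uint8_t *old_tail = skb->tail;

    if ((size_t)(skb->end - skb->tail) < n)
        return NULL;
    skb->tail += n;
    return old_tail;
}

/* Strip an n-byte header: the next layer sees data starting after it. */
uint8_t *model_skb_pull(struct model_skb *skb, size_t n)
{
    if (model_skb_len(skb) < n)
        return NULL;
    skb->data += n;
    return skb->data;
}

/* Prepend an n-byte header using the reserved headroom. */
uint8_t *model_skb_push(struct model_skb *skb, size_t n)
{
    if (model_skb_headroom(skb) < n)
        return NULL;
    skb->data -= n;
    return skb->data;
}
```

The point of the design is that moving between protocol layers only moves the `data` pointer. The packet bytes themselves are not copied.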
netif_receive_skb
. ? Netfilter wiki:
sk_buff
, (ingress qdisc ), , netfilter. , (sk_buff
) , — .
, ( ). start, softirq, eBPF XDP , ...
— Express Data Path
sk_buff
— , , VLAN .. BPF XDP (Express Data Path) sk_buff
.
XDP , , , RAM . struct xdp_md
, , , . — — XDP (XDP_DROP
), , (XDP_TX
), (XDP_REDIRECT
), (XDP_PASS
):
, / , , , , , MAC , XDP , / .
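The decision an XDP program makes can be sketched in plain userspace C. The following is a model, not a loadable BPF program: the `MODEL_XDP_*` values mirror the `XDP_*` action codes from include/uapi/linux/bpf.h, while the function name, the byte-array packet representation and the "drop UDP to one port" policy are assumptions chosen for illustration. Every read is preceded by a length check, since XDP programs are expected to prove that each access stays inside the packet.

```c
#include <stddef.h>
#include <stdint.h>

/* Values mirror the XDP_* action codes from include/uapi/linux/bpf.h. */
enum model_xdp_action {
    MODEL_XDP_ABORTED = 0,
    MODEL_XDP_DROP = 1,
    MODEL_XDP_PASS = 2,
    MODEL_XDP_TX = 3,
    MODEL_XDP_REDIRECT = 4,
};

#define MODEL_ETH_HLEN    14   /* Ethernet header length */
#define MODEL_IP_MIN_HLEN 20   /* IPv4 header without options */
#define MODEL_UDP_HLEN    8
#define MODEL_IPPROTO_UDP 17

/*
 * Hypothetical policy: drop IPv4/UDP packets sent to blocked_port and
 * pass everything else up the stack.
 */
enum model_xdp_action model_xdp_filter(const uint8_t *data,
                                       const uint8_t *data_end,
                                       uint16_t blocked_port)
{
    size_t len = (size_t)(data_end - data);

    /* Ethernet header must be complete; EtherType 0x0800 is IPv4. */
    if (len < MODEL_ETH_HLEN)
        return MODEL_XDP_PASS;
    if (data[12] != 0x08 || data[13] != 0x00)
        return MODEL_XDP_PASS;

    /* Minimal IPv4 header must be complete before reading its fields. */
    if (len < MODEL_ETH_HLEN + MODEL_IP_MIN_HLEN)
        return MODEL_XDP_PASS;
    size_t ihl = (size_t)(data[MODEL_ETH_HLEN] & 0x0f) * 4;
    if (ihl < MODEL_IP_MIN_HLEN)
        return MODEL_XDP_PASS;
    if (data[MODEL_ETH_HLEN + 9] != MODEL_IPPROTO_UDP)
        return MODEL_XDP_PASS;

    /* UDP header must be complete; destination port is bytes 2 and 3. */
    if (len < MODEL_ETH_HLEN + ihl + MODEL_UDP_HLEN)
        return MODEL_XDP_PASS;
    const uint8_t *udp = data + MODEL_ETH_HLEN + ihl;
    uint16_t dport = (uint16_t)((udp[2] << 8) | udp[3]);

    return dport == blocked_port ? MODEL_XDP_DROP : MODEL_XDP_PASS;
}
```

A real XDP program would receive a `struct xdp_md` context instead of two raw pointers and would be compiled for the BPF target. The structure of the checks would look much the same.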
XDP AF_XDP
, , , zero copy. , DPDK, :
AF_XDP
: AF_XDP
(rx queue), . XDP, . , , , . ( , , , . , UDP 65784 AF_XDP
, 13, , , : ethtool -N flow-type udp4 dst-port 65784 action 13
.)
, XDP , . , , CPU 0%. Netronome, , .
«» XDP, : DDoS . , Facebook, load balancer katran, XDP, Cloudflare XDP DDoS load balancing, cilium XDP , .. R&D , XDP P4 , , ( NPU: Network Processing Unit).
XDP , — , . , XDP, XDP Tutorial, , , — kozlyuk .
struct __sk_buff
BPF, . , Linux sk_buff
, . len
— , network_header
— L3, dev
— struct net_device
, .
, — sk_buff
( XDP, sk_buff
), , BPF sk_buff
. , — BPF struct __sk_buff
:
struct __sk_buff {
    __u32 len;
    __u32 pkt_type;
    __u32 mark;
    __u32 queue_mapping;
    __u32 protocol;
    __u32 vlan_present;
    ...
};
__sk_buff
sk_buff
, Verifier , . BPF __sk_buff
:
int bpf_prog(struct __sk_buff *ctx)
{
    __u32 len = ctx->len;
    __u32 type = ctx->pkt_type;
    ...
}
, Verifier . , pkt_type
, 3, Verifier , .
, / . , , , .
skbuff.c
( -, ):
#include <linux/bpf.h>

__attribute__((section("socket/test")))
int bpf_prog(struct __sk_buff *ctx)
{
    __u32 len = ctx->len;
    __u32 type = ctx->pkt_type;
    return len + type;
}
:
clang -target bpf -O2 skbuff.c -o skbuff.o -c
(, , , ):
mkdir mnt
sudo mount -t bpf none ./mnt
sudo bpftool prog load ./skbuff.o ./mnt/xxx
:
$ llvm-objdump -D ./skbuff.o --section socket/test
0: 61 12 00 00 00 00 00 00 r2 = *(u32 *)(r1 + 0)
1: 61 10 04 00 00 00 00 00 r0 = *(u32 *)(r1 + 4)
2: 0f 20 00 00 00 00 00 00 r0 += r2
3: 95 00 00 00 00 00 00 00 exit
, :
$ sudo bpftool prog dump xlated pinned ./mnt/xxx
0: (61) r2 = *(u32 *)(r1 +104)
1: (71) r0 = *(u8 *)(r1 +120)
2: (54) w0 &= 7
3: (0f) r0 += r2
4: (95) exit
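The two listings above differ because the kernel rewrites context accesses: the program reads offsets 0 and 4 of `struct __sk_buff`, while the loaded code reads the real buffer at other offsets and masks the packet type down to three bits (`w0 &= 7`). A userspace sketch of that idea follows. The layout of `struct model_sk_buff`, the offset constants and `model_ctx_read` are all invented for illustration and do not correspond to the real `sk_buff` layout.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's sk_buff; layout is invented. */
struct model_sk_buff {
    uint8_t  other_state[8];  /* fields a program must never touch */
    uint32_t len;             /* packet length */
    uint8_t  flags;           /* packet type in the low 3 bits */
};

/* Offsets visible to the program: first two fields of struct __sk_buff. */
enum {
    MODEL_CTX_LEN = 0,
    MODEL_CTX_PKT_TYPE = 4,
};

/*
 * Translate an access to the program-visible context into an access to
 * the real buffer. Returns 0 and stores the value on success, -1 when
 * the offset is not one the program is allowed to read.
 */
int model_ctx_read(const struct model_sk_buff *skb, unsigned int ctx_off,
                   uint32_t *out)
{
    switch (ctx_off) {
    case MODEL_CTX_LEN:
        *out = skb->len;
        return 0;
    case MODEL_CTX_PKT_TYPE:
        /* Only 3 bits of the byte belong to the packet type. */
        *out = skb->flags & 7u;
        return 0;
    default:
        return -1;
    }
}
```

In the kernel this translation happens once, at load time, by patching instructions. The model performs it on every read only to keep the sketch short.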
Linux
Linux, , , sk_buff
, ingress qdisc. , , , egress qdisc — , , / netfilter.
Qdisc queueing discipline Linux — Traffic Control (TC). egress qdisc — , . , - , .
— classful classless — . — , . , egress qdisc, pfifo_fast
, TOS IPv4 IPv6 ( . lartc 9.2):
— qdisc noqueue
, , , .
Classful qdiscs . qdiscs. , . . , , : (classifiers) (actions). , , , , . : u32
, flower
.. : drop
( ), reclassify
( , , , VLAN tag), ..
, qdiscs , . qdiscs , C ? BPF, BPF_PROG_TYPE_SCHED_CLS
BPF_PROG_TYPE_SCHED_ACT
, , . , qdisc clsact
, egress, ingress, BPF BPF_PROG_TYPE_SCHED_CLS
direct action. — BPF — actions, .. .
BPF TC, — BPF Reference Guide Daniel Borkman cilium
— CNI kubernetes, Alibaba Google.
BPF
BPF — BPF , . eBPF cBPF, , eBPF cBPF. , BPF BPF_PROG_TYPE_SOCKET_FILTER
SO_ATTACH_BPF
. , , CAP_SYS_ADMIN
.
BPF BPF_PROG_TYPE_SOCKET_FILTER
, :
- , BPF, BPF, , (
sk_buff
) : , (, ). , RAW , .BPF_PROG_TYPE_SOCKET_FILTER
Evil eBPF In-Depth DEF CON 27. -
AF_PACKET
PACKET_FANOUT
, . , , DPI. Linux . 2015 fanoutPACKET_FANOUT_DATA
, BPF. - 2007
xt_bpf
netfilter. BPF, . 2016 eBPF — eBPFBPF_PROG_TYPE_SOCKET_FILTER
. - , , , tun , , , , VM. .
TUNSETFILTEREBPF
. - tun BPF . .
TUNSETSTEERINGEBPF
. - Kernel Connection Multiplexor TCP datagram (. lwn, kcm). TCP BPF
BPF_PROG_TYPE_SOCKET_FILTER
,AF_KCM
SIOCKCMATTACH
(. , ). - ,
BPF_PROG_TYPE_SOCKET_FILTER
SO_ATTACH_REUSEPORT_EBPF
, . Perfect locality and three epic SystemTap scripts.
« » flower
__skb_flow_dissect
, , Linux flow dissector — - . , , ingress Linux, flower.
, , , . BPF — BPF_PROG_TYPE_FLOW_DISSECTOR
, BPF. namespace.
BPF
(cgroups) . , BPF : BPF cgroup. , ( , ) -, . cgroups , , , , - . BPF. , .
BPF_PROG_TYPE_CGROUP_SKB
BPF (ingress) (egress) . 1, , 0, . , . , , .. : BPF systemd.
, BPF . , BPF BPF_PROG_TYPE_CGROUP_SOCK
, struct sock
. sk_bound_dev_if
, . bind(2) / .
, , BPF BPF_PROG_TYPE_CGROUP_SOCK_ADDR
. bind
IP , ( use case : cgroup , , . ). connect, getpeername, getsockname, sendmsg recvmsg. , , , cilium iptables k8s.
BPF_PROG_TYPE_CGROUP_SOCKOPT
setsockopt.
BPF_PROG_TYPE_CGROUP_DEVICE
cgroupv2 , device
cgroupsv1.
BPF_PROG_TYPE_CGROUP_SYSCTL
sysctl , , , cgroup .
.
BPF
BPF_PROG_TYPE_SK_SKB
. : SOCKMAP, . , , recvmsg
, BPF, sk_buff
. , Isovalent CNI cilium k8s, Cloudflare, . SOCKMAP - TCP splicing of the future.
BPF_PROG_TYPE_SK_SKB
, BPF_PROG_TYPE_SK_MSG
, sendmsg
sendpage
, L7 — , . BPF_PROG_TYPE_SK_SKB
, sockmap
.
BPF_PROG_TYPE_SK_REUSEPORT
, SO_REUSEPORT
. BPF, , .
BPF_PROG_TYPE_SK_LOOKUP
, . : , IP , , , . namespaces.
, , TCP — BPF_PROG_TYPE_SOCK_OPS
. cgroupv2, BPF_PROG_TYPE_CGROUP_SOCKOPT
, , .., . , TCP , .
LWT:
, , . . , IPv4- IPv6-, VPN, .
, , . , , , , .
, Linux, : ip link add name ipip0 type ipip...
.. 2015 Linux . , , , — .
- BPF_PROG_TYPE_LWT_IN: lwtunnel_input
- BPF_PROG_TYPE_LWT_OUT: lwtunnel_output
- BPF_PROG_TYPE_LWT_XMIT: lwtunnel_xmit
input , output — , xmit — . struct __sk_buff
, (BPF_OK
), (BPF_DROP
), (BPF_REDIRECT
) , , (BPF_DROP
). , xmit — , .
netlink . - , BPF iproute2, :
ip route add 10.0.0.0/24 encap bpf xmit obj <prog.o> section <section> dev <dev>
<prog.o>
— BPF ELF, <section>
— .
2018 , BPF_PROG_TYPE_LWT_SEG6LOCAL
, seg6local
, . Using SRv6.
BPF: BPF_PROG_TYPE_UNSPEC
. , , / . bpf(2) .
, ! 99% , . , BPF Linux BPF_PROG_TYPE_UNSPEC
BPF, , , , tcpdump
wireshark
Linux, .
,
BPF Linux, , - . BPF , , Linux. , BPF Linux.
(, , ) Linux — BPF kprobes, tracepoints perf events, — libbpf
, bcc
bpftrace
.
2,5
- Brendan Gregg, «BPF Performance Tools». BPF Linux — BCC, . BPF .
- Brendan Gregg, «Systems Performance: Enterprise and the Cloud, 2nd Edition (2020)». «Systems Performance». : BPF, Solaris, . «BPF Performance Tools» «?», «?»
- David Calavera and Lorenzo Fontana, «Linux Observability with BPF». . BPF, , , .
Online-,
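There are many articles and talks about BPF. So we will take advantage of the fact that the aforementioned company Isovalent is trying to lead the hype around BPF: in particular, it recently set up a website with documentation and held the BPF Summit, a small conference about BPF. Fun fact: the participants of that BPF Summit chose a new BPF mascot, a bee, and came up with the catchy name Ebee: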