DPDK 18.11 flow director API: code example

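Intel FDIR research summary:

  1. Supported on X722 and X710 NICs.
    Not supported on vmxnet3, e1000, or 82599!
  2. TCP payload match length: 16 bytes
  3. UDP payload match length: 16 bytes
  4. IP payload match length: 16 bytes (only when the packet is not TCP/UDP/SCTP)
  5. Maximum number of FDIR rules: 8K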
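
The 16-byte limit in items 2 to 4 can be checked in software before a rule is handed to the NIC. This is a minimal sketch; the names `FDIR_MAX_PAYLOAD_MATCH` and `fdir_payload_len_ok` are hypothetical and do not exist in DPDK.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical constant: the 16-byte payload match limit from the summary. */
#define FDIR_MAX_PAYLOAD_MATCH 16

/* Hypothetical helper: true if a payload key of `len` bytes fits the limit. */
static bool fdir_payload_len_ok(size_t len)
{
	return len > 0 && len <= FDIR_MAX_PAYLOAD_MATCH;
}
```

The payload keys used later in main.c are 14 bytes long, so `fdir_payload_len_ok(14)` would accept them, while a 20-byte key would be rejected.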

PCTYPE

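A rule chain is one PCTYPE.
The type of a rule chain must be one of the supported PCTYPEs.
The MASK of a rule chain is the set of MASKs of all items on the chain; a specific computation turns it into the PCTYPE MASK.
A PCTYPE can have only one MASK!
A PCTYPE can have multiple SPECs.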

i40e_supported_patterns

All PCTYPEs supported by the X710: search the i40e driver source for i40e_supported_patterns.
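
The idea of checking a rule path against a fixed list of supported paths can be sketched as follows. This is an illustration only: the `enum item` values and the two-entry `supported` table are made up for the example, while the real list is `i40e_supported_patterns` and uses `enum rte_flow_item_type`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical item types; the real enum is rte_flow_item_type. */
enum item { END = 0, ETH, IPV4, IPV6, UDP, TCP, GTP, RAW };

/* A tiny, made-up subset of supported paths, each terminated by END. */
static const enum item pat_eth_ipv4_udp_raw[] = { ETH, IPV4, UDP, RAW, END };
static const enum item pat_eth_ipv6_udp_raw[] = { ETH, IPV6, UDP, RAW, END };
static const enum item *const supported[] = {
	pat_eth_ipv4_udp_raw,
	pat_eth_ipv6_udp_raw,
};

/* Compare two END-terminated item arrays element by element. */
static bool same_path(const enum item *a, const enum item *b)
{
	while (*a == *b) {
		if (*a == END)
			return true;
		a++;
		b++;
	}
	return false;
}

/* True if `path` is identical to one entry of the supported table. */
static bool path_supported(const enum item *path)
{
	for (size_t i = 0; i < sizeof(supported) / sizeof(supported[0]); i++)
		if (same_path(path, supported[i]))
			return true;
	return false;
}
```

With this table, ETH/IPV4/UDP/RAW is accepted, and any path that is not listed (for example one with an extra tunnel item in the middle) is rejected.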

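Application

The example below shows packet attenuation rules for the WXA service.
PCTYPE path: ETH/IPV4/UDP/RAW
Multiple UDP ports can be configured, but the port mask must not change!
Multiple RAW payloads can be configured, but the RAW mask must not change!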

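Thoughts

  1. A rule can only use a PCTYPE that already exists.
    For example, the WXA service uses these rule paths:
    ETH/IPV4/UDP/RAW
    ETH/IPV6/UDP/RAW

    In some regions, however, the rule paths are:
    ETH/IPV4/UDP/GTP/IPV4/UDP/RAW
    ETH/IPV4/UDP/GTP/IPV6/UDP/RAW
    ETH/IPV6/UDP/GTP/IPV4/UDP/RAW
    ETH/IPV6/UDP/GTP/IPV6/UDP/RAW
    so the X710 cannot support packets carried in a GTP tunnel.

  2. One path is one PCTYPE, and a PCTYPE can have only one MASK.
    The PCTYPE MASK is computed from the masks of all ITEMs on the path.

    Example:
    ETH/IPV4/UDP/RAW path MASK = (ETH mask + IPV4 mask + UDP mask + RAW mask)

    Any other rule on the ETH/IPV4/UDP/RAW path must use the same MASK, while the spec may vary freely. Otherwise rule creation fails with an error.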
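
The "one MASK per PCTYPE, many SPECs" constraint can be modeled in a few lines. This is a sketch of the rule as described here, not the driver's implementation; `struct pctype_state`, `pctype_add_rule`, and `MASK_LEN` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MASK_LEN 16 /* illustrative size of the combined path mask */

/* Hypothetical per-PCTYPE state: the single mask shared by all its rules. */
struct pctype_state {
	bool has_mask;
	uint8_t mask[MASK_LEN];
	unsigned int nb_rules;
};

/*
 * Accept a rule if it is the first on this PCTYPE, or if its mask is
 * byte-for-byte identical to the mask already installed. The spec is
 * not examined, so it may differ from rule to rule.
 * Returns 0 on success, -1 on a conflicting mask.
 */
static int pctype_add_rule(struct pctype_state *st,
			   const uint8_t mask[MASK_LEN])
{
	if (st->has_mask && memcmp(st->mask, mask, MASK_LEN) != 0)
		return -1;
	if (!st->has_mask) {
		memcpy(st->mask, mask, MASK_LEN);
		st->has_mask = true;
	}
	st->nb_rules++;
	return 0;
}
```

Adding two rules with the same mask succeeds; a third rule with a different mask returns -1, which mirrors the error reported when a second mask is used on the same path.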

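Matching logic

My guess:
Intel parses and classifies each incoming packet by layer:
link layer (MAC, VLAN, MPLS, PPPoE)
network layer (IPv4, IPv6)
transport layer (TCP, UDP, SCTP)
tunnel layer (GTP, L2TP, GRE, PPP)
Note that Intel does not look at the application layer.

Everything the NIC does not recognize is treated as RAW, i.e. payload.
For example, when an HTTP packet reaches the X710, the NIC cannot identify the HTTP layer; it sees only ETH/IPV4/TCP/RAW, where RAW is the HTTP data.
Likewise, when a DHCP packet reaches the X710, the NIC cannot identify the DHCP layer; it sees only ETH/IPV4/UDP/RAW, where RAW is the DHCP data.

After parsing the received data, the X710 ANDs it with the MASK of the PCTYPE,

then compares the result with the SPEC of each rule on that PCTYPE. A rule whose SPEC matches is a hit, and its ACTION is executed.
Packets that hit no rule get the default ACTION.

This is my guess at why a PCTYPE must have exactly one MASK.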
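
The guessed matching step (AND with the shared MASK, then compare with each SPEC) can be written out as a small software model. It models the guess above, not documented hardware behavior; `struct rule`, `masked_equal`, and `classify` are hypothetical names, and the linear scan is only for clarity.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rule: a spec to compare against and the queue to steer to. */
struct rule {
	const uint8_t *spec;
	uint16_t queue;
};

/* True if (data & mask) equals (spec & mask) for every byte. */
static bool masked_equal(const uint8_t *data, const uint8_t *mask,
			 const uint8_t *spec, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if ((data[i] & mask[i]) != (spec[i] & mask[i]))
			return false;
	return true;
}

/* Return the queue of the first matching rule, or default_queue. */
static uint16_t classify(const uint8_t *data, const uint8_t *mask, size_t len,
			 const struct rule *rules, size_t nb_rules,
			 uint16_t default_queue)
{
	for (size_t i = 0; i < nb_rules; i++)
		if (masked_equal(data, mask, rules[i].spec, len))
			return rules[i].queue;
	return default_queue;
}
```

Because the mask is applied once to the packet data and every rule is compared against that same masked value, all rules of one PCTYPE have to share one mask; only the spec differs per rule.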

Reference documents

X722 and X710 product briefs
intel-x710-product-brief.pdf
ethernet-network-adapter-x722-product-brief.pdf

C620 datasheet
c620-series-chipset-datasheet.pdf

XL710 datasheet
xl710_10_40_controller_datasheet-1140607.pdf

Hash and Flow Director Filters
Intel® Ethernet Controller 700 Series: Hash and Flow Director Filters

Research report
Intel Ethernet Flow Director 研究报告V2.pptx

Code

Makefile

# SPDX-License-Identifier: BSD-3-Clause
# Copyright(c) 2010-2014 Intel Corporation

# binary name
APP = fdir

# all source are stored in SRCS-y
SRCS-y := main.c

# Build using pkg-config variables if possible
ifeq ($(shell pkg-config --exists libdpdk && echo 0),0)

all: static
.PHONY: shared static
static: build/$(APP)-static
	ln -sf $(APP)-static build/$(APP)

PKGCONF ?= pkg-config

PC_FILE := $(shell $(PKGCONF) --path libdpdk 2>/dev/null)
CFLAGS += -O3 $(shell $(PKGCONF) --cflags libdpdk)
LDFLAGS_STATIC = -Wl,-Bstatic $(shell $(PKGCONF) --static --libs libdpdk)

build/$(APP)-static: $(SRCS-y) Makefile $(PC_FILE) | build
	$(CC) $(CFLAGS) $(SRCS-y) -o $@ $(LDFLAGS) $(LDFLAGS_STATIC)

build:
	@mkdir -p $@

.PHONY: clean
clean:
	rm -f build/$(APP) build/$(APP)-static build/$(APP)-shared
	test -d build && rmdir -p build || true

else # Build using legacy build system

ifeq ($(RTE_SDK),)
$(error "Please define RTE_SDK environment variable")
endif

# Default target, can be overridden by command line or environment
RTE_TARGET ?= x86_64-native-linuxapp-gcc

include $(RTE_SDK)/mk/rte.vars.mk

CFLAGS += -O3
CFLAGS += $(WERROR_FLAGS)

include $(RTE_SDK)/mk/rte.extapp.mk
endif

main.c

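// Intel FDIR research summary:
//  1. Supported on X722 and X710 NICs.
//     Not supported on vmxnet3, e1000, or 82599!
//  2. TCP payload match length: 16 bytes
//  3. UDP payload match length: 16 bytes
//  4. IP payload match length: 16 bytes (only when the packet is not TCP/UDP/SCTP)

// A rule chain is one PCTYPE.
// The type of a rule chain must be one of the supported PCTYPEs.
// The MASK of a rule chain is the set of MASKs of all items on the chain;
// a specific computation turns it into the PCTYPE MASK.
// A PCTYPE can have only one MASK!
// A PCTYPE can have multiple SPECs.

// All PCTYPEs supported by the X710: search for i40e_supported_patterns.

// The example below shows packet attenuation rules for the WXA service.
// PCTYPE path: ETH/IPV4/UDP/RAW
//  Multiple UDP ports can be configured, but the port mask must not change!
//  Multiple RAW payloads can be configured, but the RAW mask must not change!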

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <inttypes.h>
#include <sys/types.h>
#include <sys/queue.h>
#include <netinet/in.h>
#include <setjmp.h>
#include <stdarg.h>
#include <ctype.h>
#include <errno.h>
#include <getopt.h>
#include <signal.h>
#include <stdbool.h>
#include <unistd.h> /* write(), used in signal_handler() */

#include <rte_eal.h>
#include <rte_common.h>
#include <rte_malloc.h>
#include <rte_ether.h>
#include <rte_ethdev.h>
#include <rte_mempool.h>
#include <rte_mbuf.h>
#include <rte_net.h>
#include <rte_flow.h>
#include <rte_cycles.h>

static volatile bool force_quit;
static uint16_t port_id;
static uint16_t nr_queues = 3;
struct rte_mempool *mbuf_pool;
struct rte_flow *flow;
uint64_t queue_pkt[16] = {0}; /* packets received per Rx queue */

static inline void
print_ether_addr(const char *what, struct ether_addr *eth_addr)
{
	char buf[ETHER_ADDR_FMT_SIZE];
	ether_format_addr(buf, ETHER_ADDR_FMT_SIZE, eth_addr);
	printf("%s%s", what, buf);
}

static void
main_loop(void)
{
	struct rte_mbuf *mbufs[32];
	struct rte_flow_error error;
	uint16_t nb_rx;
	uint16_t i;
	uint16_t j;

	/* Poll every Rx queue and count packets per queue until a signal arrives. */
	while (!force_quit) {
		for (i = 0; i < nr_queues; i++) {
			nb_rx = rte_eth_rx_burst(port_id, i, mbufs, 32);
			if (nb_rx) {
				for (j = 0; j < nb_rx; j++) {
					struct rte_mbuf *m = mbufs[j];
					//struct ether_hdr *eth_hdr;
					//eth_hdr = rte_pktmbuf_mtod(m, struct ether_hdr *);
					//print_ether_addr("src=", &eth_hdr->s_addr);
					//print_ether_addr(" -> dst=", &eth_hdr->d_addr);
					//printf(" : queue=0x%x\n", (unsigned int)i);
					queue_pkt[i]++;
					rte_pktmbuf_free(m);
				}
			}
		}
	}
	/* Remove all flow rules, then stop and close the port. */
	rte_flow_flush(port_id, &error);
	rte_eth_dev_stop(port_id);
	rte_eth_dev_close(port_id);
}

#define CHECK_INTERVAL 1000 /* 1000 ms between link checks */
#define MAX_REPEAT_TIMES 90 /* 90 s (90 * 1000 ms) in total */
static void
assert_link_status(void)
{
	struct rte_eth_link link;
	uint8_t rep_cnt = MAX_REPEAT_TIMES;

	memset(&link, 0, sizeof(link));
	do {
		rte_eth_link_get(port_id, &link);
		if (link.link_status == ETH_LINK_UP)
			break;
		rte_delay_ms(CHECK_INTERVAL);
	} while (--rep_cnt);

	if (link.link_status == ETH_LINK_DOWN)
		rte_exit(EXIT_FAILURE, ":: error: link is still down\n");
}

static void
init_port(void)
{
	int ret;
	uint16_t i;
	struct rte_eth_conf port_conf = {
		.rxmode = {
			.split_hdr_size = 0,
		},
		.txmode = {
			.offloads =
				DEV_TX_OFFLOAD_VLAN_INSERT |
				DEV_TX_OFFLOAD_IPV4_CKSUM  |
				DEV_TX_OFFLOAD_UDP_CKSUM   |
				DEV_TX_OFFLOAD_TCP_CKSUM   |
				DEV_TX_OFFLOAD_SCTP_CKSUM  |
				DEV_TX_OFFLOAD_TCP_TSO,
		},
		/* Enable flow director in perfect-match mode. */
		.fdir_conf = {
			.mode = RTE_FDIR_MODE_PERFECT,
			.pballoc = RTE_FDIR_PBALLOC_64K,
			.status = RTE_FDIR_REPORT_STATUS,
		},
	};
	struct rte_eth_txconf txq_conf;
	struct rte_eth_rxconf rxq_conf;
	struct rte_eth_dev_info dev_info;

	rte_eth_dev_info_get(port_id, &dev_info);
	/* Keep only the Tx offloads the device actually supports. */
	port_conf.txmode.offloads &= dev_info.tx_offload_capa;
	ret = rte_eth_dev_configure(port_id, nr_queues, nr_queues, &port_conf);
	if (ret < 0) {
		rte_exit(EXIT_FAILURE,
			":: cannot configure device: err=%d, port=%u\n",
			ret, port_id);
	}

	rxq_conf = dev_info.default_rxconf;
	rxq_conf.offloads = port_conf.rxmode.offloads;
	/* only set Rx queues: something we care only so far */
	for (i = 0; i < nr_queues; i++) {
		ret = rte_eth_rx_queue_setup(port_id, i, 512,
				rte_eth_dev_socket_id(port_id),
				&rxq_conf,
				mbuf_pool);
		if (ret < 0) {
			rte_exit(EXIT_FAILURE,
				":: Rx queue setup failed: err=%d, port=%u\n",
				ret, port_id);
		}
	}

	txq_conf = dev_info.default_txconf;
	txq_conf.offloads = port_conf.txmode.offloads;

	for (i = 0; i < nr_queues; i++) {
		ret = rte_eth_tx_queue_setup(port_id, i, 512,
				rte_eth_dev_socket_id(port_id),
				&txq_conf);
		if (ret < 0) {
			rte_exit(EXIT_FAILURE,
				":: Tx queue setup failed: err=%d, port=%u\n",
				ret, port_id);
		}
	}

	rte_eth_promiscuous_enable(port_id);
	ret = rte_eth_dev_start(port_id);
	if (ret < 0) {
		rte_exit(EXIT_FAILURE,
			"rte_eth_dev_start:err=%d, port=%u\n",
			ret, port_id);
	}
	assert_link_status();
	printf(":: initializing port: %d done\n", port_id);
}

static void
signal_handler(int signum)
{
	if (signum == SIGINT || signum == SIGTERM) {
		force_quit = true;
		/* write() takes a buffer pointer, so pass the string "\n", not the char '\n'. */
		write(fileno(stdout), "\n", 1);
	}
}

/*
 * Create one ETH/IPV4/UDP/RAW flow director rule.
 * key/mask: payload bytes to match right after the UDP header, and their mask.
 * len: number of bytes in key and mask.
 */
static int ipv4_udp_raw(struct rte_flow_item_udp *udp_spec, struct rte_flow_item_udp *udp_mask, const uint8_t* key, const uint8_t* mask, int len)
{
	struct rte_flow *flow = NULL;
	struct rte_flow_error error;
	struct rte_flow_attr attr;
	struct rte_flow_item pattern[10];
	struct rte_flow_action action[10];
	struct rte_flow_action_queue queue = { .index = 1 }; /* matched packets go to queue 1 */

	memset(pattern, 0, sizeof(pattern));
	memset(action, 0, sizeof(action));
	memset(&attr, 0, sizeof(struct rte_flow_attr));

	struct rte_flow_item_raw raw_spec = {
		.relative = 1, /* offset is relative to the end of the previous item (UDP) */
		.reserved = 0,
		.offset = 0,
		.limit = 0,
		.length = len,
		.pattern = key,
	};

	struct rte_flow_item_raw raw_mask = {
		.relative = 1,
		.search = 1,
		.reserved = 0x3fffffff,
		.offset = 0xffffffff,
		.limit = 0xffff,
		.length = 0xffff,
		.pattern = mask,
	};

	attr.ingress = 1;

	/* ETH and IPV4 carry no spec/mask: they only define the path. */
	pattern[0].type = RTE_FLOW_ITEM_TYPE_ETH;

	pattern[1].type = RTE_FLOW_ITEM_TYPE_IPV4;

	pattern[2].type = RTE_FLOW_ITEM_TYPE_UDP;
	pattern[2].spec = udp_spec;
	pattern[2].mask = udp_mask;

	pattern[3].type = RTE_FLOW_ITEM_TYPE_RAW;
	pattern[3].spec = &raw_spec;
	pattern[3].mask = &raw_mask;

	/* Explicit terminator (same value the memset above already left there). */
	pattern[4].type = RTE_FLOW_ITEM_TYPE_END;

	action[0].type = RTE_FLOW_ACTION_TYPE_QUEUE;
	action[0].conf = &queue;
	action[1].type = RTE_FLOW_ACTION_TYPE_END;

	flow = rte_flow_create(port_id, &attr, pattern, action, &error);
	if (!flow)
	{
		printf("Flow can't be created %d message: %s\n",
			error.type,
			error.message ? error.message : "(no stated reason)");
		rte_exit(EXIT_FAILURE, "error in creating flow\n");
	}

	printf("create flow director successfully %p\n", flow);
	return 0;
}


int
main(int argc, char **argv)
{
	int ret;
	uint16_t nr_ports;

	ret = rte_eal_init(argc, argv);
	if (ret < 0)
		rte_exit(EXIT_FAILURE, ":: invalid EAL arguments\n");

	force_quit = false;
	signal(SIGINT, signal_handler);
	signal(SIGTERM, signal_handler);

	nr_ports = rte_eth_dev_count_avail();
	if (nr_ports == 0)
		rte_exit(EXIT_FAILURE, ":: no Ethernet ports found\n");
	port_id = 0;
	if (nr_ports != 1) {
		printf(":: warn: %d ports detected, but we use only one: port %u\n",
			nr_ports, port_id);
	}
	mbuf_pool = rte_pktmbuf_pool_create("mbuf_pool", 4096, 128, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
	if (mbuf_pool == NULL)
		rte_exit(EXIT_FAILURE, "Cannot init mbuf pool\n");

	init_port();
	printf("端口初始化完成\n"); /* prints "port initialization complete" */

	//////////////////// SEQ < 64 ///////////////////////////
	/* rte_flow header fields are in network byte order, so convert the port. */
	struct rte_flow_item_udp udp_spec1 = {
		.hdr = {
			.src_port = rte_cpu_to_be_16(16285),
			.dst_port = 0
		},
	};
	struct rte_flow_item_udp udp_mask1 = {
		.hdr = {
			.src_port = 0xFFFF,
			.dst_port = 0x0
		},
	};
	const uint8_t pkt_spec_1[] = {0x97, 0x11, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01};
	const uint8_t pkt_mask_1[] = {0xff, 0xff, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0xff};
	ipv4_udp_raw(&udp_spec1, &udp_mask1, pkt_spec_1, pkt_mask_1, sizeof(pkt_spec_1));

	//////////////////// SEQ 16:1 ///////////////////////////
	/* Same path, same UDP and RAW masks as the first rule; only the specs differ. */
	struct rte_flow_item_udp udp_spec2 = {
		.hdr = {
			.src_port = rte_cpu_to_be_16(80),
			.dst_port = 0
		},
	};
	struct rte_flow_item_udp udp_mask2 = {
		.hdr = {
			.src_port = 0xFFFF,
			.dst_port = 0x0
		},
	};
	const uint8_t pkt_spec_2[] = {0x97, 0x11, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x02};
	const uint8_t pkt_mask_2[] = {0xff, 0xff, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0xff};
	ipv4_udp_raw(&udp_spec2, &udp_mask2, pkt_spec_2, pkt_mask_2, sizeof(pkt_spec_2));

	main_loop();
	/* "pkt累计" = cumulative packets; queue_pkt is uint64_t, so print with PRIu64. */
	/* Only queues 0 to 2 are polled (nr_queues = 3), so queue 3 always reports 0. */
	printf("queue_id %d pkt累计:%" PRIu64 "\n", 0, queue_pkt[0]);
	printf("queue_id %d pkt累计:%" PRIu64 "\n", 1, queue_pkt[1]);
	printf("queue_id %d pkt累计:%" PRIu64 "\n", 2, queue_pkt[2]);
	printf("queue_id %d pkt累计:%" PRIu64 "\n", 3, queue_pkt[3]);
	return 0;
}
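
A note on the UDP port fields: rte_flow header fields such as `hdr.src_port` are matched in network (big-endian) byte order. The standalone sketch below, which uses the POSIX `htons()` rather than any DPDK call, shows which port number ends up in the two header bytes for a given stored value; `port_on_wire` is a hypothetical helper written for this illustration.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/*
 * Return the port number a receiver would read from the two bytes
 * produced by storing `stored` into a 16-bit header field as-is.
 */
static uint16_t port_on_wire(uint16_t stored)
{
	uint8_t bytes[2];

	memcpy(bytes, &stored, sizeof(bytes));
	return (uint16_t)((bytes[0] << 8) | bytes[1]);
}
```

`port_on_wire(htons(16285))` is 16285 on any host. On a little-endian x86 host, `port_on_wire(16285)` is 40255 instead, because 16285 is 0x3F9D and its bytes are stored as 0x9D 0x3F; a spec filled in without the conversion would therefore match port 40255.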

Running

[root@node-yadpi-03 intel_fdir]# ./build/fdir -l 0-3
EAL: Detected 48 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Probing VFIO support...
EAL: PCI device 0000:03:00.0 on NUMA socket 0
EAL: probe driver: 8086:1521 net_e1000_igb
EAL: PCI device 0000:03:00.1 on NUMA socket 0
EAL: probe driver: 8086:1521 net_e1000_igb
EAL: PCI device 0000:03:00.2 on NUMA socket 0
EAL: probe driver: 8086:1521 net_e1000_igb
EAL: PCI device 0000:03:00.3 on NUMA socket 0
EAL: probe driver: 8086:1521 net_e1000_igb
EAL: PCI device 0000:1a:00.0 on NUMA socket 0
EAL: probe driver: 8086:37d0 net_i40e
EAL: PCI device 0000:1a:00.1 on NUMA socket 0
EAL: probe driver: 8086:37d0 net_i40e
EAL: PCI device 0000:1a:00.2 on NUMA socket 0
EAL: probe driver: 8086:37d0 net_i40e
EAL: PCI device 0000:1a:00.3 on NUMA socket 0
EAL: probe driver: 8086:37d0 net_i40e
EAL: PCI device 0000:3c:00.0 on NUMA socket 0
EAL: probe driver: 8086:1580 net_i40e
EAL: PCI device 0000:86:00.0 on NUMA socket 1
EAL: probe driver: 8086:1580 net_i40e
:: warn: 3 ports detected, but we use only one: port 0
:: initializing port: 0 done
端口初始化完成
i40e_flow_set_fdir_flex_pit(): i40e device 0000:1a:00.0 changed global register [0x0026898c]. original: 0x00000000, new: 0x000000a6
create flow director successfully 0x17fb96840
create flow director successfully 0x17fb98280
^Ci40e_flex_payload_reg_set_default(): i40e device 0000:1a:00.0 changed global register [0x0026898c]. original: 0x000000a6, new: 0x00000000
queue_id 0 pkt累计:0
queue_id 1 pkt累计:0
queue_id 2 pkt累计:0
queue_id 3 pkt累计:0
[root@node-yadpi-03 intel_fdir]#