# Environment
OS: CentOS 7.4
GCC: 8.5

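Symptom

Start with the crash as seen in the field: GDB shows nothing but ?? in the backtrace.

Investigation

According to colleagues on site, commenting out the call to acl_match_ipv4() makes the crash go away.

Here is the documentation of the rte_acl_classify() interface.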

/**
* Perform search for a matching ACL rule for each input data buffer.
* Each input data buffer can have up to *categories* matches.
* That implies that results array should be big enough to hold
* (categories * num) elements.
* Also categories parameter should be either one or multiple of
* RTE_ACL_RESULTS_MULTIPLIER and can't be bigger than RTE_ACL_MAX_CATEGORIES.
* If more than one rule is applicable for given input buffer and
* given category, then rule with highest priority will be returned as a match.
* Note, that it is a caller's responsibility to ensure that input parameters
* are valid and point to correct memory locations.
*
* @param ctx
* ACL context to search with.
* @param data
* Array of pointers to input data buffers to perform search.
* Note that all fields in input data buffers supposed to be in network
* byte order (MSB).
* @param results
* Array of search results, *categories* results per each input data buffer.
* @param num
* Number of elements in the input data buffers array.
* @param categories
* Number of maximum possible matches for each input buffer, one possible
* match per category.
* @return
* zero on successful completion.
* -EINVAL for incorrect arguments.
*/
extern int
rte_acl_classify(const struct rte_acl_ctx *ctx,
const uint8_t **data,
uint32_t *results, uint32_t num,
uint32_t categories);


Here is how rte_acl_classify is called in our code:

int acl_match_ipv4(struct acl_context_t *ctx, const char *data)
{
    int result = 0;
    int ret = rte_acl_classify(ctx->acl_ctx_v4, (const uint8_t **)&data, (uint32_t *)&result, 1, RTE_ACL_MAX_CATEGORIES);
    if (ret)
        rte_exit(EXIT_FAILURE, "ERROR rte_acl_classify in acl_match_ipv4\n");
    return result;
}

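Troubleshooting

I was honestly baffled at this point. My supervisor came over and we went through the code together.

In the end this was the only suspicious piece of code; nowhere else had any opportunity to overflow the stack.

So how can rte_acl_classify do this much damage?

Look at num in rte_acl_classify: it is the number of input data buffers. In our case we match one packet per call, so num is 1, which is correct.

What about results? It stores the userdata of the rule that was hit. If a packet hits rules in several categories, the 4-byte userdata of each hit is stored at row pkt_index, column category_index of results.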
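
The row/column description above can be sketched as plain index arithmetic. This is only an illustration of the layout stated in the API comment ("categories results per each input data buffer"); the helper names acl_results_len and acl_result_at are made up for this post and are not part of DPDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of uint32_t slots the results array must provide
 * for num packets and the given number of categories. */
static size_t acl_results_len(uint32_t num, uint32_t categories)
{
    return (size_t)num * categories;
}

/* Result slot for packet pkt_index in category category_index,
 * assuming one row of `categories` slots per packet. */
static uint32_t acl_result_at(const uint32_t *results, uint32_t categories,
                              uint32_t pkt_index, uint32_t category_index)
{
    assert(category_index < categories);
    return results[(size_t)pkt_index * categories + category_index];
}
```

For example, with 2 packets and 4 categories the array needs 8 slots, and the slot for packet 1, category 2 is results[6].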

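Remember how the rules are built: every rule can set a category_mask!
There are up to 16 categories, so category_mask ranges from 0x01 to 0xFFFF, with each bit standing for one category. A rule can be assigned to at most 16 categories, and each category reports only its best (highest-priority) match.
We do not use multi-category matching here: category_mask is 0x01 for every rule.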
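
As a small illustration of the bit-per-category idea, the check below tests whether a rule with a given category_mask takes part in a given category. MAX_CATEGORIES and rule_in_category are names invented for this sketch; the value 16 simply mirrors RTE_ACL_MAX_CATEGORIES as quoted in this post.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_CATEGORIES 16u /* mirrors RTE_ACL_MAX_CATEGORIES as stated in this post */

/* True if bit `category` is set in category_mask,
 * i.e. the rule participates in that category. */
static bool rule_in_category(uint32_t category_mask, uint32_t category)
{
    assert(category < MAX_CATEGORIES);
    return ((category_mask >> category) & 1u) != 0;
}
```

With the mask 0x01 used by all of our rules, only category 0 is ever populated, so asking the classifier for 16 categories buys nothing.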

The rules are built as follows:

Converting external rules to DPDK ACL rules

static int
acl_convert_ipv4_rule(struct acl_rule_t *rule, struct rte_acl_rule *v)
{
    v->field[PROTO_FIELD_IPV4].value.u8 = rule->proto_id;
    v->field[PROTO_FIELD_IPV4].mask_range.u8 = rule->proto_id ? 0xff : 0;

    v->field[SRC_FIELD_IPV4].value.u32 = rule->ipsrc[0];
    v->field[SRC_FIELD_IPV4].mask_range.u32 = rule->ipsrc_prefix;

    v->field[DST_FIELD_IPV4].value.u32 = rule->ipdst[0];
    v->field[DST_FIELD_IPV4].mask_range.u32 = rule->ipdst_prefix;

    v->field[SRCP_FIELD_IPV4].value.u16 = rule->src_port_begin;
    v->field[SRCP_FIELD_IPV4].mask_range.u16 = rule->src_port_end;

    v->field[DSTP_FIELD_IPV4].value.u16 = rule->dst_port_begin;
    v->field[DSTP_FIELD_IPV4].mask_range.u16 = rule->dst_port_end;

    v->data.userdata = rule->rule_id;
    return 0;
}

// Parse the external rules
for (i = 0; i < rule_num; i++)
{
    rule = (struct rte_acl_rule *)(acl_rules + acl_cnt * sizeof(struct acl4_rule));
    if (acl_convert_ipv4_rule(rule_list + i, rule) != 0)
        rte_exit(EXIT_FAILURE, "parse ipv4 rules error\n");

    rule->data.priority = ACL_PRIORITY_DEBIG + acl_cnt;
    rule->data.category_mask = 1; // single category only, no multi-category matching
    acl_cnt++;
}

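Memory layout of results when categories is RTE_ACL_MAX_CATEGORIES (16) and there are N input packets:
size of results = 4 bytes × 16 categories × N packets = 64N bytes

Memory layout of results when categories is 1 and there is a single input packet:
size of results = 4 bytes × 1 category × 1 packet = 4 bytes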

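From this we can work out the crash scenario: with categories set to RTE_ACL_MAX_CATEGORIES (16) and one input packet, rte_acl_classify writes 16 × 4 bytes = 64 bytes into results.
But result in acl_match_ipv4 is a single 4-byte int on the stack, so 60 of those bytes land beyond it. That is more than enough to overwrite the return address of acl_match_ipv4 and crash the program, and it is also why GDB could only print question marks.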
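
The arithmetic can be written down as a tiny check. This is a sketch, not DPDK code: acl_results_bytes and acl_overrun_bytes are hypothetical helper names, and the only assumption taken from the API comment is that the classifier fills num × categories 32-bit slots.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes the classifier is expected to write:
 * one uint32_t per packet per category. */
static size_t acl_results_bytes(uint32_t num, uint32_t categories)
{
    assert(num > 0 && categories > 0);
    return (size_t)num * categories * sizeof(uint32_t);
}

/* Bytes written past the end of a buffer of buf_bytes bytes
 * (0 if the buffer is large enough). */
static size_t acl_overrun_bytes(size_t buf_bytes, uint32_t num, uint32_t categories)
{
    size_t need = acl_results_bytes(num, categories);
    return need > buf_bytes ? need - buf_bytes : 0;
}
```

For the crashing call, acl_overrun_bytes(sizeof(uint32_t), 1, 16) gives 60; with categories set to 1 it gives 0.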

The fix: my rules do not use multiple categories and each call handles a single packet, so passing 1 as the categories argument of rte_acl_classify solves the problem.
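
A minimal sketch of the corrected pattern follows. To keep it compilable without DPDK, classify_stub is a hypothetical stand-in that only imitates the write pattern of rte_acl_classify (num × categories slots); NUM_PKTS, CATEGORIES, and match_one_packet are likewise names invented here. In the real function the call would be rte_acl_classify(ctx->acl_ctx_v4, (const uint8_t **)&data, results, NUM_PKTS, CATEGORIES).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_PKTS   1u
#define CATEGORIES 1u /* the crashing code passed RTE_ACL_MAX_CATEGORIES (16) here */

/* Hypothetical stand-in for rte_acl_classify():
 * fills num * categories result slots, like the real call would. */
static int classify_stub(uint32_t *results, uint32_t num, uint32_t categories)
{
    memset(results, 0, (size_t)num * categories * sizeof(results[0]));
    results[0] = 42; /* pretend the packet hit a rule whose userdata is 42 */
    return 0;
}

static uint32_t match_one_packet(void)
{
    /* The buffer is sized from the same constants that are passed to the call,
     * so the two cannot drift apart. */
    uint32_t results[NUM_PKTS * CATEGORIES];
    int ret = classify_stub(results, NUM_PKTS, CATEGORIES);

    assert(ret == 0);
    return results[0];
}
```

Declaring the result as uint32_t (instead of casting an int) and deriving the array size from the arguments keeps the buffer and the call consistent if multi-category matching is ever enabled later.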