Justin Azoff

Hi! Random things are here :-)

bpf_map_get_next_key pitfalls

Background

eBPF maps are a core component of most XDP programs.

I am working on a library called libflowbypass that uses some code and ideas from Suricata to implement flow cutoff inside the kernel using eBPF and XDP in a reusable way for any network monitoring tool. The core of this is 2 maps (one for v4 and one for v6) whose keys are the 5 tuple (proto, src, sport, dst, dstport) of each flow to be dropped in the kernel. The value for each key stores some per flow metadata like packet and byte counts.

Expiring flows

At some point connections end and map entries become outdated and need to be garbage collected. For TCP flows these entries could be removed when a FIN or RST packet is seen, but this would not work well for UDP flows. Plus, there is no guarantee that the FIN packet is seen. One of the endpoints could drop offline mid connection.

The solution to this is to also track the timestamp of the last packet seen for each flow. Then, periodically iterate through the map and remove any flows whose timestamp is too old.

Iterating through a map

This is where things get tricky. The function used for iterating through a map is

bpf_map_get_next_key(int fd, const void *key, void *next_key)

You pass it the file descriptor of a bpf map, a pointer to a key(that can be NULL or invalid) and a pointer to store the next key.

Here is a program that creates a bpf_map and adds 5 items to it.

long long key, next_key, value;
int fd, i, res;

fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value),
            20 /*entries*/, 0);
if (fd < 0) {
    printf("Failed to create hashmap '%s'!\n", strerror(errno));
    exit(1);
}

for(i=0;i < 5 ; i++) {
    key = 100+i;
    value = 1000+i;
    bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
}

key = -1;
while(bpf_map_get_next_key(fd, &key, &next_key) == 0) {
    printf("Got key %lld ", key);
    res = bpf_map_lookup_elem(fd, &key, &value);
    if(res < 0) {
        printf("No value??\n");
    } else {
        printf("%lld\n", value);
    }
    key=next_key;
}

close(fd);

This should print out all keys 100 through 104, but if we run it we see:

Got key -1 No value??
Got key 102 1002
Got key 101 1001
Got key 100 1000
Got key 104 1004

It output one invalid key and missed one value. The mistake is that the parameters to bpf_map_get_next_key are a bit misleading. After it is called, next_key contains the current key, not the parameter named key. Changing the loop to look like

while(bpf_map_get_next_key(fd, &prev_key, &key) == 0) {
    /* ... */
    prev_key=key;
}

fixes that issue.

Pulling the rug out from under yourself

In many programming languages deleting entries from a map while iterating over it causes undefined behavior. You can do this with bpf maps, but you have to be careful. Let’s update the program to delete all odd values while iterating:

while(bpf_map_get_next_key(fd, &prev_key, &key) == 0) {
    res = bpf_map_lookup_elem(fd, &key, &value);
    if(res < 0) {
        printf("No value??\n");
        continue;
    }
    printf("Got %lld:%lld\n", key, value);
    if(value%2==1) {
        printf("Deleting key %lld\n", key);
        bpf_map_delete_elem(fd, &key);
    }
    prev_key=key;
}

Running this we get something very strange:

Got 102:1002
Got 101:1001
Deleting key 101
Got 102:1002
Got 100:1000
Got 104:1004
Got 103:1003
Deleting key 103
Got 102:1002
Got 100:1000
Got 104:1004

It deleted keys 101 and 103, but other keys showed up multiple times in the iteration. If we were not outputting the ‘Got’ lines it would not have looked like anything was wrong. What’s going on here?

The problem is that when we use bpf_map_delete_elem to delete the current key it is no longer valid on the next call to bpf_map_get_next_key. This causes the iteration to start over again from the beginning. We need to ensure that bpf_map_get_next_key is called before bpf_map_delete_elem.

My solution for this problem in libflowbypass is this function:

int bpf_map_get_next_key_and_delete(int fd, const void *key, void *next_key, int *delete)
{
    int res = bpf_map_get_next_key(fd, key, next_key);
    if (*delete) {
        bpf_map_delete_elem(fd, key);
        *delete = 0;
    }
    return res;
}

After updating the test to use it, everthing works properly:

int delete_previous = 0;
while(bpf_map_get_next_key_and_delete(fd, &prev_key, &key, &delete_previous) == 0) {
    res = bpf_map_lookup_elem(fd, &key, &value);
    if(res < 0) {
        printf("No value??\n");
        continue;
    }
    printf("Got %lld:%lld\n", key, value);
    if(value%2==1) {
        printf("Deleting key %lld\n", key);
        delete_previous = 1;
    }
    prev_key=key;
}