Remove task and cgroup local storage percpu counters #10507
base: bpf-next_base
Conversation

Upstream branch: ec439c3
Upstream branch: 3d60306
Upstream branch: d2749ae
To prepare for changing bpf_local_storage_map_bucket::lock to rqspinlock, convert bpf_selem_unlink_map() to be failable. It still always succeeds and returns 0 for now.

Since some operations that update local storage cannot fail in the middle, open-code bpf_selem_unlink_map() to take b->lock before the operation. There are two such locations:

- bpf_local_storage_alloc()

  The first selem will be unlinked from smap if the cmpxchg of owner_storage_ptr fails, and that unlink must not fail. Therefore, hold b->lock when linking until the allocation completes. Helpers that assume b->lock is held by the caller are introduced: bpf_selem_link_map_nolock() and bpf_selem_unlink_map_nolock().

- bpf_local_storage_update()

  The three-step update process, link_map(new_selem), link_storage(new_selem), and unlink_map(old_selem), must not fail in the middle.

In bpf_selem_unlink(), bpf_selem_unlink_map() and bpf_selem_unlink_storage() should either all succeed or all fail as a whole instead of failing in the middle. So, return if unlink_map() fails.

In bpf_local_storage_destroy(), since it cannot deadlock with itself or with bpf_local_storage_map_free(), with which it might be racing, retry if bpf_selem_unlink_map() fails due to rqspinlock returning an error.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
To prepare for changing bpf_local_storage_map_bucket::lock to rqspinlock, convert bpf_selem_link_map() to failable. It still always succeeds and returns 0 until the change happens. No functional change. __must_check is added to the function declaration locally to make sure all the callers are accounted for during the conversion. Signed-off-by: Amery Hung <ameryhung@gmail.com>
To prepare for changing bpf_local_storage::lock to rqspinlock, open code bpf_selem_unlink_storage() in the only caller, bpf_selem_unlink(), since unlink_map and unlink_storage must be done together after all the necessary locks are acquired. Signed-off-by: Amery Hung <ameryhung@gmail.com>
To prepare changing both bpf_local_storage_map_bucket::lock and bpf_local_storage::lock to rqspinlock, convert bpf_selem_unlink() to failable. It still always succeeds and returns 0 until the change happens. No functional change. For bpf_local_storage_map_free(), WARN_ON() for now as no real error will happen until we switch to rqspinlock. __must_check is added to the function declaration locally to make sure all callers are accounted for during the conversion. Signed-off-by: Amery Hung <ameryhung@gmail.com>
Change bpf_local_storage::lock and bpf_local_storage_map_bucket::lock from raw_spin_lock to rqspinlock. Finally, propagate errors from raw_res_spin_lock_irqsave() to the syscall return or BPF helper return. In bpf_local_storage_destroy(), WARN_ON for now; a later patch will handle this properly. For __bpf_local_storage_map_cache(), instead of handling the error, skip updating the cache. Signed-off-by: Amery Hung <ameryhung@gmail.com>
The percpu counter in task local storage is no longer needed as the
underlying bpf_local_storage can now handle deadlock with the help of
rqspinlock. Remove the percpu counter and related migrate_{disable,
enable}.
Since the percpu counter is removed, merge back bpf_task_storage_get()
and bpf_task_storage_get_recur(). This will allow the bpf syscalls and
helpers to run concurrently on the same CPU, removing the spurious
-EBUSY error. bpf_task_storage_get(..., F_CREATE) will now always
succeed, given enough free memory, unless called recursively.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
The percpu counter in cgroup local storage is no longer needed as the
underlying bpf_local_storage can now handle deadlock with the help of
rqspinlock. Remove the percpu counter and related migrate_{disable,
enable}.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Percpu locks have been removed from cgroup and task local storage. Now that no local storage uses percpu variables as locks to prevent recursion, there is no need to pass them to bpf_local_storage_map_free(). Remove the argument from the function. Signed-off-by: Amery Hung <ameryhung@gmail.com>
A later patch will introduce bpf_selem_unlink_lockless() to handle
rqspinlock errors. bpf_selem_unlink_lockless() will allow an selem
to be partially unlinked from map or local storage. Therefore,
bpf_selem_free() needs to be decoupled from map and local storage
as SDATA(selem)->smap or selem->local_storage may be NULL.
Decoupling from local storage is already done when local storage
migrated from BPF memory allocator to kmalloc_nolock(). This patch
prepares to decouple from map.
Currently, map is still needed in bpf_selem_free() to:
1. Uncharge memory
a. map->ops->map_local_storage_uncharge
b. map->elem_size
2. Infer how memory should be freed
a. map->use_kmalloc_nolock
3. Free special fields
a. map->record
The dependency of 1.a will be addressed by a later patch by returning
the amount of memory to uncharge directly to the owner who calls
bpf_local_storage_destroy().
The dependency of 3.a will be addressed by a later patch by freeing
special fields under b->lock, when the map is still alive.
This patch handles 1.b and 2.a by simply saving the information in
bpf_local_storage_elem.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Introduce bpf_selem_unlink_lockless() to properly handle errors returned from rqspinlock in bpf_local_storage_map_free() and bpf_local_storage_destroy(), where the operation must succeed.

The idea of bpf_selem_unlink_lockless() is to allow an selem to be partially linked and to use a refcount to determine when and who can free the selem. An selem is initially fully linked to a map and a local storage, and therefore selem->link_cnt is set to 2. Under normal circumstances, bpf_selem_unlink_lockless() will be able to grab the locks and unlink an selem from the map and the local storage in sequence, just like bpf_selem_unlink(), and then add it to a local tofree list provided by the caller. However, if any of the lock attempts fails, it will only clear SDATA(selem)->smap or selem->local_storage, depending on the caller, and decrement link_cnt to signal that the corresponding data structure holding a reference to the selem is gone. Then, only when both the map and the local storage are gone can an selem be freed, by the last caller that turns link_cnt to 0.

To make sure bpf_obj_free_fields() is done only once and while the map is still present, it is called when unlinking an selem from b->list under b->lock. To make sure uncharging memory is done only once and while the owner is still present, only unlink the selem from local_storage->list in bpf_local_storage_destroy() and return the amount of memory to uncharge to the caller (i.e., the owner), since the map associated with an selem may already be gone and map->ops->map_local_storage_uncharge can no longer be referenced.

Finally, accesses of selem, SDATA(selem)->smap and selem->local_storage are racy. Callers will protect these fields with RCU.

Co-developed-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com>
…ee, destroy}
Take care of rqspinlock error in bpf_local_storage_{map_free, destroy}()
properly by switching to bpf_selem_unlink_lockless().
Pass reuse_now == false when calling bpf_selem_free_list() since both
callers iterate lists of selem without lock. An selem can only be freed
after an RCU grace period.
Similarly, SDATA(selem)->smap and selem->local_storage need to be
protected by RCU as well since a caller can update these fields
which may also be seen by the other at the same time. Pass reuse_now
== false when calling bpf_local_storage_free(). The local storage map is
already protected as bpf_local_storage_map_free() waits for an RCU grace
period after iterating b->list and before freeing itself.
Co-developed-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Check sk_omem_alloc when the caller of bpf_local_storage_destroy() returns. bpf_local_storage_destroy() now returns the memory to uncharge to the caller instead of uncharging directly. Therefore, in sk_storage_omem_uncharge, check sk_omem_alloc when bpf_sk_storage_free() returns instead of when bpf_local_storage_destroy() returns. Signed-off-by: Amery Hung <ameryhung@gmail.com>
Update the expected result of the selftest, as the recursion rules for task
local storage syscalls and helpers have been relaxed. Now that the percpu
counter is removed, the task local storage helpers bpf_task_storage_get()
and bpf_task_storage_delete() can run on the same CPU at the same time
unless they cause deadlock.
Note that since there is no percpu counter preventing recursion in
task local storage helpers, bpf_trampoline now catches the recursion
of on_update as reported by recursion_misses.
on_enter: tp_btf/sys_enter
on_update: fentry/bpf_local_storage_update
Old behavior:

  on_enter
    bpf_task_storage_get(&map_a)
      bpf_task_storage_trylock succeed
      bpf_local_storage_update(&map_a)
        on_update
          bpf_task_storage_get(&map_a)
            bpf_task_storage_trylock fail
            return NULL
          bpf_task_storage_get(&map_b)
            bpf_task_storage_trylock fail
            return NULL
      create and return map_a::ptr
    map_a::ptr = 200
    bpf_task_storage_get(&map_b)
      bpf_task_storage_trylock succeed
      bpf_local_storage_update(&map_b)
        on_update
          bpf_task_storage_get(&map_a)
            bpf_task_storage_trylock fail
            lockless lookup succeed
            return map_a::ptr
          map_a::ptr += 1 (201)
          bpf_task_storage_delete(&map_a)
            bpf_task_storage_trylock fail
            return -EBUSY
          nr_del_errs++ (1)
          bpf_task_storage_get(&map_b)
            bpf_task_storage_trylock fail
            return NULL
      create and return ptr
    map_b::ptr = 100

New behavior:

  on_enter
    bpf_task_storage_get(&map_a)
      bpf_local_storage_update(&map_a)
        on_update
          bpf_task_storage_get(&map_a)
            on_update::misses++ (1)
            create and return map_a::ptr
          map_a::ptr += 1 (1)
          bpf_task_storage_delete(&map_a)
            return 0
          bpf_task_storage_get(&map_b)
            on_update::misses++ (2)
            create and return map_b::ptr
          map_b::ptr += 1 (1)
      create and return map_a::ptr
    map_a::ptr = 200
    bpf_task_storage_get(&map_b)
      lockless lookup succeed
      return map_b::ptr

Expected result:

                                 Old behavior    New behavior
  map_a::ptr                     201             200
  map_b::ptr                     100             1
  nr_del_err                     1               0
  on_update::recursion_misses    0               2
  on_enter::recursion_misses     0               0
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Upstream branch: f785a31
Adjust the error code we are checking against, as bpf_task_storage_get() now returns -EDEADLK or -ETIMEDOUT when deadlock happens. Signed-off-by: Amery Hung <ameryhung@gmail.com>
Remove a test in test_maps that checks if the updating of the percpu counter in task local storage map is preemption safe as the percpu counter is now removed. Signed-off-by: Amery Hung <ameryhung@gmail.com>
bpf_cgrp_storage_busy has been removed. Use bpf_bprintf_nest_level instead. This percpu variable is also in the bpf subsystem, so if it is removed in the future, BPF CI will catch this type of CI-breaking change. Signed-off-by: Amery Hung <ameryhung@gmail.com>
Pull request for series with
subject: Remove task and cgroup local storage percpu counters
version: 3
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1034704