October 15, 2024

The sched_wakeup and sched_wakeup_new hooks are invoked when a process changes state from ‘sleeping’ to ‘runnable.’ They allow us to identify when a process is ready to run and is waiting for CPU time. During this event, we generate a timestamp and store it in an eBPF hash map using the process ID as the key.

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, MAX_TASK_ENTRIES);
    __uint(key_size, sizeof(u32));
    __uint(value_size, sizeof(u64));
} runq_enqueued SEC(".maps");

SEC("tp_btf/sched_wakeup")
int tp_sched_wakeup(u64 *ctx)

struct task_struct *activity = (void *)ctx[0];
u32 pid = task->pid;
u64 ts = bpf_ktime_get_ns();

bpf_map_update_elem(&runq_enqueued, &pid, &ts, BPF_NOEXIST);
return 0;

Conversely, the sched_switch hook is triggered when the CPU switches between processes. This hook provides pointers to the process currently using the CPU and the process about to take over. We use the upcoming task’s process ID (PID) to fetch the timestamp from the eBPF map. This timestamp represents when the process was enqueued, which we had previously stored. We then calculate the run queue latency by simply subtracting the timestamps.

SEC("tp_btf/sched_switch")
int tp_sched_switch(u64 *ctx)
{
struct task_struct *prev = (struct task_struct *)ctx[1];
struct task_struct *subsequent = (struct task_struct *)ctx[2];
u32 prev_pid = prev->pid;
u32 next_pid = next->pid;

// fetch timestamp of when the following activity was enqueued
u64 *tsp = bpf_map_lookup_elem(&runq_enqueued, &next_pid);
if (tsp == NULL)
return 0; // missed enqueue

// calculate runq latency earlier than deleting the saved timestamp
u64 now = bpf_ktime_get_ns();
u64 runq_lat = now - *tsp;

// delete pid from enqueued map
bpf_map_delete_elem(&runq_enqueued, &next_pid);
....

One of the advantages of eBPF is its ability to provide pointers to the actual kernel data structures representing processes or threads, also known as tasks in kernel terminology. This feature enables access to a wealth of information stored about a process. For our specific use case, we needed the process’s cgroup ID to associate it with a container. However, the cgroup information in the task struct is protected by an RCU (Read Copy Update) lock.

To safely access this RCU-protected information, we can leverage kfuncs in eBPF. kfuncs are kernel functions that can be called from eBPF programs. There are kfuncs available to lock and unlock RCU read-side critical sections. These functions ensure that our eBPF program remains safe and efficient while retrieving the cgroup ID from the task struct.

void bpf_rcu_read_lock(void) __ksym;
void bpf_rcu_read_unlock(void) __ksym;

u64 get_task_cgroup_id(struct task_struct *task)
{
    struct css_set *cgroups;
    u64 cgroup_id;
    bpf_rcu_read_lock();
    cgroups = task->cgroups;
    cgroup_id = cgroups->dfl_cgrp->kn->id;
    bpf_rcu_read_unlock();
    return cgroup_id;
}

Once the data is ready, we must package it and send it to userspace. For this purpose, we chose the eBPF ring buffer. It is efficient, high-performing, and user-friendly. It can handle variable-length data records and allows data reading without requiring extra memory copying or syscalls. However, the sheer number of data points was causing the userspace program to use too much CPU, so we implemented a rate limiter in eBPF to sample the data.

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, RINGBUF_SIZE_BYTES);
} events SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, MAX_TASK_ENTRIES);
    __uint(key_size, sizeof(u64));
    __uint(value_size, sizeof(u64));
} cgroup_id_to_last_event_ts SEC(".maps");

struct runq_event {
    u64 prev_cgroup_id;
    u64 cgroup_id;
    u64 runq_lat;
    u64 ts;
};

SEC("tp_btf/sched_switch")
int tp_sched_switch(u64 *ctx)

// ....
// The earlier code
// ....

u64 prev_cgroup_id = get_task_cgroup_id(prev);
u64 cgroup_id = get_task_cgroup_id(subsequent);

// per-cgroup-id-per-CPU rate-limiting
// to steadiness observability with efficiency overhead
u64 *last_ts =
bpf_map_lookup_elem(&cgroup_id_to_last_event_ts, &cgroup_id);
u64 last_ts_val = last_ts == NULL ? 0 : *last_ts;

// test the speed restrict for the cgroup_id in consideration
// earlier than doing extra work
if (now - last_ts_val < RATE_LIMIT_NS)
// Price restrict exceeded, drop the occasion
return 0;

struct runq_event *occasion;
occasion = bpf_ringbuf_reserve(&occasions, sizeof(*occasion), 0);

if (occasion)
event->prev_cgroup_id = prev_cgroup_id;
event->cgroup_id = cgroup_id;
event->runq_lat = runq_lat;
event->ts = now;
bpf_ringbuf_submit(occasion, 0);
// Replace the final occasion timestamp for the present cgroup_id
bpf_map_update_elem(&cgroup_id_to_last_event_ts, &cgroup_id,
&now, BPF_ANY);

return 0;

Our userspace application, developed in Go, processes events from the ring buffer to emit metrics to our metrics backend, Atlas. Each event includes a run queue latency sample with a cgroup ID, which we associate with containers running on the host. If no such association is found, we categorize it as a system service. When a cgroup ID is associated with a container, we emit a percentile timer Atlas metric (runq.latency) for that container. We also increment a counter metric (sched.switch.out) to monitor preemptions occurring for the container’s processes. Access to the prev_cgroup_id of the preempted process allows us to tag the metric with the cause of the preemption, whether it is due to a process within the same container (or cgroup), a process in another container, or a system service.
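To make that consumption path concrete, here is a minimal Go sketch of such a loop built on the open-source cilium/ebpf library; it is illustrative rather than our production collector. The runqEvent struct mirrors runq_event above, while lookupContainerByCgroupID and emitMetrics are hypothetical placeholders for the container association and Atlas metric emission described in this section.

// Illustrative sketch only; assumes github.com/cilium/ebpf plus hypothetical helpers.
package collector

import (
    "bytes"
    "encoding/binary"
    "errors"
    "log"

    "github.com/cilium/ebpf"
    "github.com/cilium/ebpf/ringbuf"
)

// runqEvent mirrors the C struct runq_event written by the eBPF program.
type runqEvent struct {
    PrevCgroupID uint64
    CgroupID     uint64
    RunqLat      uint64
    Ts           uint64
}

// lookupContainerByCgroupID is a hypothetical helper: it returns the container
// associated with a cgroup ID, or "" when the cgroup belongs to a system service.
func lookupContainerByCgroupID(cgroupID uint64) string { return "" }

// emitMetrics is a hypothetical helper standing in for the Atlas client calls
// that record runq.latency and sched.switch.out for the event.
func emitMetrics(ev runqEvent, container string) {}

// consumeEvents reads runq_event records from the ring buffer map and turns
// them into metrics until the map is closed.
func consumeEvents(eventsMap *ebpf.Map) error {
    rd, err := ringbuf.NewReader(eventsMap)
    if err != nil {
        return err
    }
    defer rd.Close()

    for {
        record, err := rd.Read()
        if errors.Is(err, ringbuf.ErrClosed) {
            return nil
        }
        if err != nil {
            log.Printf("ringbuf read: %v", err)
            continue
        }

        var ev runqEvent
        if err := binary.Read(bytes.NewReader(record.RawSample), binary.LittleEndian, &ev); err != nil {
            log.Printf("decode event: %v", err)
            continue
        }

        // Associate the cgroup ID with a container; an empty result is
        // treated as a system service.
        container := lookupContainerByCgroupID(ev.CgroupID)
        emitMetrics(ev, container)
    }
}

In the real collector, the metric emission would also use prev_cgroup_id to tag sched.switch.out with the preemption cause described above.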

It is important to highlight that both the runq.latency and sched.switch.out metrics are needed to determine whether a container is affected by noisy neighbors, which is the goal we aim to achieve: relying solely on the runq.latency metric can lead to misconceptions. For example, if a container is at or over its cgroup CPU limit, the scheduler will throttle it, resulting in an apparent spike in run queue latency due to delays in the queue. If we were only to consider this metric, we might incorrectly attribute the performance degradation to noisy neighbors when it is actually because the container is hitting its CPU quota. However, simultaneous spikes in both metrics, mainly when the cause is a different container or system process, clearly indicate a noisy neighbor issue.

Below is the runq.latency metric for a server running a single container with ample CPU capacity. The 99th percentile averages 83.4µs (microseconds), serving as our baseline. Although there are some spikes reaching 400µs, the latency remains within acceptable parameters.