Guest post originally published on the ContainIQ blog by Matt Lenhard, co-founder & CTO

In this blog post, we recap the process and methodology we used to build our eBPF-based profiler. We include techniques and examples for both compiled and interpreted languages.

Recently, the team at ContainIQ has been working hard to release an eBPF-based profiler for a number of different languages. Continuous profiling is an important practice because it can help engineering teams spot performance bottlenecks and troubleshoot issues faster.

In building our eBPF-based profiler, we learned a number of new techniques that might be interesting to people who want to implement something similar or who simply want to get started writing eBPF-based programs.

In this post, we’ll recap our methodology and process for building our profiler. The techniques used to profile differ based on the language you’re targeting; this post has a section on techniques for compiled languages and a section on techniques for interpreted languages.

Techniques for Compiled Languages

The process for compiled languages is relatively straightforward and uses some already well-known Linux systems. This section highlights the process we used to support compiled languages.

Using the linux perf_event subsystem, we can attach a software event with the cpu_clock config:


SEC("perf_event")
int cpu_sampler(struct bpf_perf_event_data *ctx) {

}

After attaching our function to the perf event, we can grab the stack and generate a unique stack ID for counting via the bpf_get_stackid helper:


key.user_stack_id = bpf_get_stackid(ctx, &stack_traces, BPF_F_USER_STACK);
key.kernel_stack_id = bpf_get_stackid(ctx, &stack_traces, 0);

We save the stack trace to a stack trace BPFmap:


struct {
  __uint(type, BPF_MAP_TYPE_STACK_TRACE);
  __uint(max_entries, 10000);
   __uint(value_size, PERF_MAX_STACK_DEPTH * sizeof(u64));
  __type(key, u32);
} stack_traces SEC(".maps");

And then we use the stack ID to increment a counter for that stack:


u64* count = bpf_map_lookup_elem(& stack_trace_counts, &key);
if(count){
   u64 c = *count;
   c++;
   bpf_map_update_elem(&stack_trace_counts, &key, &c, BPF_EXIST);
}else{
   u64 one = 1;
   bpf_map_update_elem(&stack_trace_counts, &key, &one, BPF_NOEXIST);      
}

struct {
  __uint(type, BPF_MAP_TYPE_HASH);
  __uint(max_entries, 10000);
  __type(key, struct stack_trace_key_t);
  __type(value, u64);
} stack_trace_counts SEC(".maps");

Lastly, in user space we utilize a technique called stack walking to generate the symbols for the stack, and then convert to folded format for further analysis.

Stack Walking Examples

Loop through all of the stacks:


for(auto const &count : stack_trace_counts){
   auto trace = stack_trace_map[count.first.user_stack_id];
   if(!count.second)
       return;
   bpf_map_delete_elem(count_fd, &count.first);
   auto kern_trace = stack_trace_map[count.first.kernel_stack_id];
   syms = syms_cache__get_syms(syms_cache, count.first.pid);
   if(!syms){
       ciq::set_ns mount_namespace{static_cast<pid_t>(count.first.pid),"mnt"};
       syms = syms_cache__get_syms(syms_cache, count.first.pid);
   }

Loop through all of addresses within a given trace:


for(auto const addr : trace){
   if(!addr) break;
   sym = syms__map_addr(syms, addr);
   if(sym){
       const char* sym_name = sym->name;
       const int sym_name_len = strlen(sym_name);
       constexpr const char mangled_name[] = "_Z";
       constexpr const int mangled_name_len = sizeof(mangled_name)-1;
       bool mangled = (sym_name_len > mangled_name_len) && strncmp(mangled_name, sym_name, mangled_name_len) == 0;
       if(mangled){
           fmt::format_to(inserter, "{};", util::demangle(sym_name));
       }else{
           fmt::format_to(inserter, "{};", sym_name);
       }
   }else{
       fmt::format_to(inserter, "[unknown];");
   }
}

Techniques for Interpreted Languages

Things are more interesting with interpreted languages where symbol resolution isn’t as easy. Existing profiling solutions for interpreted or JIT languages usually require that the language generate a perf-map that correlates symbol addresses to their human readable names or, in some cases, that it read from the process memory, directly mapping addresses to language-specific structs that differ based on version.

Using eBPF, we can take another approach by using language specific USDT probes. USDTs are a low-overhead (sometimes) way of deriving specific insights from the application you’re instrumenting. The code below shows how you can leverage USDTs for specific languages to build out a complete stack.

The first step is to add the probe to the language runtime you’d like to instrument. In the examples below, I’m using libbpf to add the probes and Ruby as the desired language:


SEC("usdt/" RUBY_REGEX ":ruby:method__entry")
int BPF_USDT(ruby_method__entry, const char *classname, const char *method, const char *filename, int lineno){
   // bpf_printk("ruby_method__entry");
   dynamic_stack_frame_t frame = {0};
   build_stack_frame(&frame, classname, method, filename, lineno);
   handle_frame(&frame);
   return 0;
}

After hooking the method entry, we now need to build the stack frame and add it to our maps:


static __always_inline void build_stack_frame(dynamic_stack_frame_t *frame, const char* clazz, const char* method, const char* filename, int lineno){
   if(clazz)
       bpf_probe_read_user_str(&frame->clazz, sizeof(frame->clazz), (void*)clazz);
   if(method)
       bpf_probe_read_user_str(&frame->method, sizeof(frame->method), (void*)method);
   if(filename)
       bpf_probe_read_user_str(&frame->filename, sizeof(frame->filename), (void*)filename);
   frame->lineno = lineno;
}

static __always_inline int handle_frame(dynamic_stack_frame_t *frame){
   dynamic_stack_trace_t* stack = current_stack();
   if(!stack)
       return 0;
   stack_push(stack, frame);
   sample_stack(stack);
   print_stack_trace(stack);
   return 0;
}

Then we push the stack to user space for further analysis:


static __always_inline void sample_stack(dynamic_stack_trace_t* stack){
   if(!stack)
       return;
   if(!should_sample(stack))
       return;
   dynamic_stack_trace_key_t key = {0};
   key.stack_id = stack_id(stack);
   key.pid_tgid = bpf_get_current_pid_tgid();
   bpf_get_current_comm(&key.name, sizeof(key.name));
   bpf_map_update_elem(&dynamic_stack_traces, &key.stack_id, stack, 0);
   u32* count = bpf_map_lookup_elem(&dynamic_stack_trace_counts, &key);
   if(count){
       (*count)++;
       bpf_map_update_elem(&dynamic_stack_trace_counts, &key, count, BPF_EXIST);
   }else{
       u32 one = 1;
       bpf_map_update_elem(&dynamic_stack_trace_counts, &key, &one, BPF_NOEXIST);      
   }
   return;


}

When the method returns, we pop it off the stack:


static __always_inline int generic_method_return(){
   dynamic_stack_trace_t *stack_trace = current_stack();
   stack_pop(stack_trace);
   return 0;
}

Once we have the information in user space, we iterate over the stacks to generate counts and convert them into folded format for further analysis.

Limitations

The performance overhead of the function entry and exit probes is, as expected, relatively poor. Without further modifications, the code above can cause significant drags on your application.

Internally, we developed several methods that make the above implementation more performant. In the next post in this series, we’ll outline the performance improvements necessary to make this implementation feasible.

Questions, comments, or improvements? Reach out to matt@containiq.com.