oss-sec mailing list archives
backtrace_symbols() misuse by Ceph and its supposedly-safe use
From: Alexander Patrakov <patrakov () gmail com>
Date: Fri, 12 Jul 2024 17:37:59 +0800
Hi,
Given the recent discussions regarding the calls to
non-async-signal-safe functions from signal handlers, let me add my
two pesos to the discussion, even though I don't think there is
anything security-related in this email.
A few months ago, I handled a customer report about a deadlock in Ceph
RADOS Gateway. The Ceph version is 17.2.7. In that case, I attached
gdb to the deadlocked process and applied the backtrace to all
threads. One of the threads had this backtrace:
```
Thread 594 (Thread 0x7f41ac610700 (LWP 870073) "radosgw"):
#0 0x00007f43ca050d0e in __lll_lock_wait_private () from
target:/lib64/libc.so.6
#1 0x00007f43ca0568fa in malloc () from target:/lib64/libc.so.6
#2 0x00007f43ca0df884 in backtrace_symbols () from target:/lib64/libc.so.6
#3 0x00007f43cb24485a in handle_oneshot_fatal_signal(int) () from
target:/usr/lib64/libradosgw.so.2
#4 <signal handler called>
#5 0x00007f43ca003d2b in raise () from target:/lib64/libc.so.6
#6 0x00007f43ca0053e5 in abort () from target:/lib64/libc.so.6
#7 0x00007f43ca049c87 in __libc_message () from target:/lib64/libc.so.6
#8 0x00007f43ca051d2a in malloc_printerr () from target:/lib64/libc.so.6
#9 0x00007f43ca0525f6 in unlink_chunk.isra () from target:/lib64/libc.so.6
#10 0x00007f43ca0555c0 in _int_malloc () from target:/lib64/libc.so.6
#11 0x00007f43ca0567d8 in malloc () from target:/lib64/libc.so.6
#12 0x00007f43ca28e3fc in operator new(unsigned long) () from
target:/usr/lib64/libstdc++.so.6
#13 0x00007f43cae3b42f in std::map<rgw_obj, RGWObjState,
std::less<rgw_obj>, std::allocator<std::pair<rgw_obj const,
RGWObjState> > >::operator[](rgw_obj const&) () from
target:/usr/lib64/libradosgw.so.2
#14 0x00007f43caddb5a5 in RGWObjectCtx::set_atomic(rgw_obj const&) ()
from target:/usr/lib64/libradosgw.so.2
#15 0x00007f43caf189c0 in
rgw::sal::RadosObject::set_atomic(RGWObjectCtx*) const () from
target:/usr/lib64/libradosgw.so.2
#16 0x00007f43cad8319b in
RGWDeleteMultiObj::handle_individual_object(rgw_obj_key const&,
optional_yield,
boost::asio::basic_deadline_timer<boost::posix_time::ptime,
boost::asio::time_traits<boost::posix_time::ptime>,
boost::asio::executor>*) () from target:/usr/lib64/libradosgw.so.2
#17 0x00007f43cad83b99 in void
boost::context::detail::context_entry<boost::context::detail::record<boost::context::continuation,
boost::context::basic_fixedsize_stack<boost::context::stack_traits>,
spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(),
boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>,
0u> > >,
RGWDeleteMultiObj::execute(optional_yield)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void
(*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>,
0u> > > >)#3}, boost::context::basic_fixedsize_stack<boost::context::stack_traits>
::operator()()::{lambda(boost::context::continuation&&)#1}>
(boost::context::detail::transfer_t) () from
target:/usr/lib64/libradosgw.so.2 #18 0x00007f43cb293c1f in make_fcontext () from target:/usr/lib64/libradosgw.so.2 ``` Here is my attempt to make sense of it from the bottom up. In frame #13, something was written into a C++ std::map; this has led to a memory allocation via malloc() in frame #11. However, the internal logic of the glibc implementation of malloc() detected some corruption, and there was, indeed this message in the log: Jun 21 04:52:35 abc-osd08 radosgw[868945]: corrupted size vs. prev_size Jun 21 04:52:36 abc-osd08 radosgw[868945]: *** Caught signal (Aborted) ** Jun 21 04:52:36 abc-osd08 radosgw[868945]: in thread 7f41ac610700 thread_name:radosgw After printing this message, glibc tried to safely abort the process to prevent further damage/exploitation, see frame #6. Ceph daemons, however, have a signal handler that catches SIGABRT and SIGSEGV and tries to format and log a backtrace. Their systemd units are also set to automatically restart the daemons. Without the handler, this is what would have happened. However, within the handler, radosgw calls backtrace_symbols(), which, in turn, calls malloc() in frame #1, which is not async-signal-safe. So, here is a usage bug by definition, and it converted a restartable crash into a classical A-A deadlock because the outer malloc() call from frame #11 already holds a lock. In Ceph public bug tracker, there is also a similar report with MDS instead of the RADOS gateway: https://tracker.ceph.com/issues/65039 The question here is whether this is only a Ceph bug or whether backtrace_symbols() is an unusable piece of API a-la gets() because it is calling malloc() and is documented to do so. The same manual page documents backtrace_symbols_fd(), but, as-is, it is not good enough because Ceph also wants to demangle C++ symbols, to log the result, and to post the crash information into the designated directory. I acknowledge that there is a todo comment in src/global/signal_handler.cc: // TODO: don't use an ostringstream here. It could call malloc(), which we // don't want inside a signal handler. // Also fix the backtrace code not to allocate memory. ClibBackTrace bt(1); ostringstream oss; bt.print(oss); dout_emergency(oss.str()); Apparently, the authors don't realize that the use of ostringstream is not the only cause of possible malloc() calls. What would be a good solution (as in: something that does not convert crashes into deadlocks) here? I understand that, after memory corruption, we are already in the UB territory, but is there anything better possible than what is implemented? -- Alexander Patrakov
Current thread:
- backtrace_symbols() misuse by Ceph and its supposedly-safe use Alexander Patrakov (Jul 12)
- Re: backtrace_symbols() misuse by Ceph and its supposedly-safe use Jacob Bachmeyer (Jul 13)
- Re: backtrace_symbols() misuse by Ceph and its supposedly-safe use Simon McVittie (Jul 13)
