My dumping ground for what I've been up to

Hello! I’m Joel and I run this site! I work in the Android kernel team at Google. My interests are scheduler, tracing, synchronization and kernel internals. I also love contributing to the upstream Linux kernel and other open source projects.

Connect with me on Twitter and LinkedIn. Or drop me an email at: joel at joelfernandes dot org

Here’s a list of recent kernel patches I submitted. I got featured on hackaday and have written for LWN as well. Check out my resume and also see a list of past talks/presentations. I also created a resource that collects articles and resources exploring Linux kernel and internals topics.

Full list of all posts on this site:
  • 10 May 2018   RCU-preempt: What happens on a context switch [linuxinternals rcu]
  • 10 Feb 2018   USDT for reliable Userspace event tracing [linuxinternals]
  • 08 Jan 2018   BPFd- Running BCC tools remotely across systems [linuxinternals]
  • 31 Dec 2016   ARMv8: flamegraph and NMI support [linuxinternals]
  • 18 Jun 2016   Ftrace events mechanism [linuxinternals]
  • 20 Mar 2016   TIF_NEED_RESCHED: why is it needed [linuxinternals]
  • 25 Dec 2015   Tying 2 voltage sources/signals together [electronics,linuxinternals]
  • 04 Jun 2014   MicroSD card remote switch [linuxinternals]
  • 07 May 2014   Linux Spinlock Internals [linuxinternals]
  • 24 Apr 2014   Studying cache-line sharing effects on SMP systems [linuxinternals]
  • 22 Apr 2014   Design of fork followed by exec in Linux [linuxinternals]
  • Most Recent Post:

    RCU-preempt: What happens on a context switch

    Note: This article requires knowledge of RCU (read copy update) basics and its different flavors.

    RCU’s main job is to detect when it is safe to reclaim objects that RCU readers no longer need. The “RCU-sched” flavor of RCU does this by simply disabling preemption across the read section. So any time a CPU is not running in a preempt-disabled section (such as with preemption off, or interrupts off), that CPU is said to be in a “quiescent state” (QS). Once all CPUs have reached a QS after the reclaimer filed a request to release an object, the object can be safely released. The time from when RCU is asked to release an object to when RCU says it’s OK to release it is called the grace period.
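To make the QS and grace period bookkeeping concrete, here is a minimal user-space sketch. It is a toy model, not kernel code: `NR_CPUS`, `struct toy_gp` and the `toy_*` functions are names made up for illustration, and the real implementation tracks this state in a tree of nodes rather than a single bitmask.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4  /* hypothetical CPU count for this model */

/* Toy model of one grace period: a bit is set for every CPU that
 * has not yet passed through a quiescent state (QS). */
struct toy_gp {
    unsigned int qs_pending;  /* bit n set: CPU n still owes a QS */
};

/* The reclaimer asks for an object to be released: every CPU now owes a QS. */
static void toy_gp_start(struct toy_gp *gp)
{
    gp->qs_pending = (1u << NR_CPUS) - 1;
}

/* CPU 'cpu' was observed outside any preempt-disabled section. */
static void toy_report_qs(struct toy_gp *gp, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    gp->qs_pending &= ~(1u << cpu);
}

/* The object may be released only once no CPU owes a QS. */
static bool toy_gp_done(const struct toy_gp *gp)
{
    return gp->qs_pending == 0;
}
```

In this model the grace period is the span between `toy_gp_start()` and the first moment `toy_gp_done()` returns true. A single CPU that has not reported a QS is enough to hold it open.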

    RCU-sched is kind of a big hammer: having readers disable preemption can hurt performance. After all, read sections are expected to be light in RCU. It can also affect the real-time response of applications.

    For this reason, preemptible RCU came about (also called RCU-preempt). Obviously in this flavor, RCU reader sections can get preempted to run something else.

    A recent discussion on LKML clarified to me that “preempted to run something else” covers not only involuntary preemption but also voluntary sleeping. The design is this way because, with PREEMPT_RT kernels, the “rt” versions of spinlocks are actually mutexes that can put the RCU reader to sleep.

    So coming back to the point of this article, I want to go over what happens on a context switch. When the scheduler is called, we end up in the __schedule function. At its beginning, rcu_note_context_switch is called with the preempt parameter. The preempt parameter indicates whether the task blocked voluntarily by calling schedule() (preempt is false), or whether a kernel path (such as return from interrupts or system calls) called into the scheduler to preempt the currently running task (preempt is true).
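The two ways of reaching the scheduler can be sketched as follows. This is a stand-alone toy, not the kernel source: every `toy_*` name is hypothetical, and the only thing it tries to show is which value of `preempt` the RCU hook sees on each path.

```c
#include <stdbool.h>

/* What the RCU hook was told on the most recent context switch. */
static bool last_preempt_seen;

/* Stand-in for rcu_note_context_switch(preempt). */
static void toy_rcu_note_context_switch(bool preempt)
{
    last_preempt_seen = preempt;
}

/* Stand-in for the core scheduler function: RCU is told about the
 * context switch first, then the next task would be picked. */
static void toy_schedule_core(bool preempt)
{
    toy_rcu_note_context_switch(preempt);
    /* ... pick and switch to the next task ... */
}

/* Voluntary path: the task itself decided to block. */
static void toy_schedule(void)
{
    toy_schedule_core(false);
}

/* Involuntary path: e.g. return from interrupt found a reschedule pending. */
static void toy_preempt_schedule(void)
{
    toy_schedule_core(true);
}
```

Calling `toy_schedule()` leaves `last_preempt_seen` false, while `toy_preempt_schedule()` leaves it true. That one bit is all the RCU side gets to distinguish sleeping from being preempted.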

    rcu_note_context_switch first calls rcu_preempt_note_context_switch for RCU-preempt to take note. Let’s discuss this function.

    First, note that the RCU-preempt flavor does warn you if you voluntarily sleep inside an RCU read-side section. I’m not sure how the “RT-spinlock” for RT kernels doesn’t trigger this warning. Probably the PREEMPT_RT patchset removes it, but I don’t know. The warning is WARN_ON_ONCE(!preempt && t->rcu_read_lock_nesting > 0);. It seems pretty clear to me that a non-RT kernel would scream with this warning if an RCU-preempt read section went to sleep. Getting preempted is OK, but voluntary sleeping is not, according to this code! (See the side note in the last paragraph.)

    If the task being preempted is in a read-side RCU section, then (and only then) it calls rcu_preempt_ctxt_queue. Here the task being preempted is added to a list of blocked tasks. The reason we need to add it is that RCU-preempt has 2 perspectives on the quiescent state (QS). Recall that a QS is reached whenever an entity is not blocking the current grace period (GP). RCU-preempt considers 2 entities: the task and the CPU. In the RCU-preempt world, if a task that is currently in an RCU read section gets preempted, then the CPU has reached a QS because it is no longer running the RCU read section that is blocking the GP. But now the task is in a non-QS (it is blocking the GP). The list records exactly this fact. If there are blocking tasks, then the GP cannot complete even though the CPU reports its QS. Paul McKenney explains this here. The other benefit of having a list of tasks is that preempted RCU read sections can be priority boosted. Paul McKenney again came to the rescue to explain this to me.
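The two perspectives can be sketched with a small model. This is an assumption-laden toy, not the kernel's data structures: `struct toy_state` and the `toy_*` helpers are hypothetical, a counter stands in for the blocked-tasks list, and there is only one CPU.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy view of what can hold up the current grace period (GP). */
struct toy_state {
    bool cpu_qs_pending;     /* CPU perspective: the CPU still owes a QS */
    int  nr_blocked_readers; /* task perspective: readers preempted mid-section */
};

/* A task inside a read section is preempted: the CPU reaches a QS,
 * but the task now blocks the GP and is counted as blocked. */
static void toy_reader_preempted(struct toy_state *s)
{
    s->nr_blocked_readers++;
    s->cpu_qs_pending = false;
}

/* The preempted reader later runs again and leaves its read section. */
static void toy_reader_unlocked(struct toy_state *s)
{
    assert(s->nr_blocked_readers > 0);
    s->nr_blocked_readers--;
}

/* The GP may end only when both perspectives agree. */
static bool toy_gp_can_complete(const struct toy_state *s)
{
    return !s->cpu_qs_pending && s->nr_blocked_readers == 0;
}
```

After `toy_reader_preempted()` the CPU no longer owes a QS, yet `toy_gp_can_complete()` stays false until `toy_reader_unlocked()` runs. A real list, unlike this counter, also tells RCU which tasks to boost.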

    Finally, you see that rcu_preempt_note_context_switch does report a QS. This is because if the task was in a read section, it has just been added to the blocked task list. If it’s not, then we just reached a QS for the CPU. Either way we entered a CPU QS, so it is recorded with a call to rcu_preempt_qs();.
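Putting the three steps together (the warning, the queuing, and the QS report), the control flow described above can be modeled roughly like this. It is a simplified user-space sketch: the structs and `toy_note_context_switch()` are hypothetical, only the `rcu_read_lock_nesting` field name and the warning condition are taken from the text, and details the real function handles are left out.

```c
#include <stdbool.h>

struct toy_task {
    int  rcu_read_lock_nesting; /* > 0 while inside a read-side section */
    bool queued_as_blocked;     /* on the blocked-tasks list */
};

struct toy_cpu {
    bool qs_reported; /* CPU-side QS recorded */
    bool warned;      /* models the one-shot warning having fired */
};

/* Toy model of the RCU-preempt context switch hook. */
static void toy_note_context_switch(struct toy_cpu *cpu,
                                    struct toy_task *t, bool preempt)
{
    /* Same condition as WARN_ON_ONCE(!preempt && t->rcu_read_lock_nesting > 0) */
    if (!preempt && t->rcu_read_lock_nesting > 0)
        cpu->warned = true;

    /* Only a task inside a read section is queued
     * (stands in for rcu_preempt_ctxt_queue). */
    if (t->rcu_read_lock_nesting > 0)
        t->queued_as_blocked = true;

    /* Either way the CPU has reached a QS
     * (stands in for rcu_preempt_qs()). */
    cpu->qs_reported = true;
}
```

Note that in this model a voluntary sleep inside a read section sets `warned` but is otherwise queued and reported exactly like a preemption.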

    Please go through the Expedited GP document which also explains some of the RCU-preempt behaviors.

    Side note: At the moment, I don’t immediately see why blocking in an RCU-preempt section shouldn’t be allowed. Since we’re tracking blocked tasks the same way as preempted tasks, it should be possible to handle them the same way. They both cause a CPU QS and a task non-QS to be entered, and they both need priority boosting. Perhaps the warning should be removed? Let me know your feedback in the comments.