Age is of no importance unless you're a cheese -- Billie Burke

Hello! I’m Joel and this my personal website built with Jekyll! I currently work at Google. My interests are scheduler, RCU, tracing, synchronization, memory models and other kernel internals. I also love contributing to the upstream Linux kernel and other open source projects.

Connect with me on Twitter, and LinkedIn. Or, drop me an email at: joel at joelfernandes dot org

Look for my name in the kernel git log to find my upstream kernel patches. Check out my resume for full details of my work experience. I also actively present at conferences, see a list of my past talks, presentations and publications.

Full list of all blog posts on this site:
  • 24 Feb 2023   Getting YouCompleteMe working for kernel development
  • 29 Jan 2023   Figuring out herd7 memory models
  • 13 Nov 2022   On workings of hrtimer's slack time functionality
  • 25 Oct 2020   C++ rvalue references
  • 06 Mar 2020   SRCU state double scan
  • 18 Oct 2019   Modeling (lack of) store ordering using PlusCal - and a wishlist
  • 02 Sep 2019   Making sense of scheduler deadlocks in RCU
  • 23 Dec 2018   Dumping User and Kernel stacks on Kernel events
  • 15 Jun 2018   RCU and dynticks-idle mode
  • 10 Jun 2018   Single-stepping the kernel's C code
  • 10 May 2018   RCU-preempt: What happens on a context switch
  • 11 Feb 2018   USDT for reliable Userspace event tracing
  • 08 Jan 2018   BPFd- Running BCC tools remotely across systems
  • 01 Jan 2017   ARMv8: flamegraph and NMI support
  • 19 Jun 2016   Ftrace events mechanism
  • 20 Mar 2016   TIF_NEED_RESCHED: why is it needed
  • 25 Dec 2015   Tying 2 voltage sources/signals together
  • 04 Jun 2014   MicroSD card remote switch
  • 08 May 2014   Linux Spinlock Internals
  • 25 Apr 2014   Studying cache-line sharing effects on SMP systems
  • 23 Apr 2014   Design of fork followed by exec in Linux
  • Most Recept Post:

    On workings of hrtimer's slack time functionality

    | Comments

    Below are some notes I wrote while studying hrtimer slack behavior (range timers), which was added to reduce wakeups and save power, in the commit below. The idea is that:

    1. Normal hrtimers will have both a soft and hard expiry which are equal to each other.
    2. But hrtimers with timer slack will have a soft expiry and a hard expiry which is the soft expiry + delta.

    The slack/delay effect is achieved by splitting the execution of the timer function, and the programming of the next timer event into 2 separate steps. That is, we execute the timer function as soon as we notice that its soft expiry has passed (hrtimer_run_queues()). However, for programming the next timer interrupt, we only look at the hard expiry (hrtimer_update_next_event() -> __hrtimer_get_next_event() -> __hrtimer_next_event_base()->hrtimer_get_expires()). As a result, the only way a slack-based timer will execute before its slack time elapses, is, if another timer without any slack time gets queued such that it hard-expires before the slack time of the slack-based timer passes.

    The commit containing the original code added for range timers is:

    commit 654c8e0b1c623b156c5b92f28d914ab38c9c2c90
    Author: Arjan van de Ven <>
    Date:   Mon Sep 1 15:47:08 2008 -0700
        hrtimer: turn hrtimers into range timers
        this patch turns hrtimers into range timers;
        they have 2 expire points
        1) the soft expire point
        2) the hard expire point
        the kernel will do it's regular best effort attempt to get the timer run at
    the hard expire point. However, if some other time fires after the soft expire
    point, the kernel now has the freedom to fire this timer at this point, and
    thus grouping the events and preventing a power-expensive wakeup in the future.

    The original code seems a bit buggy. I got a bit confused about how/where we handle the case in hrtimer_interrupt() where other normal timers that expire before the slack time elapses, have their next timer interrupt programmed correctly such that the interrupt goes off before the slack time passes.

    To see the issue, consider the case where we have 2 timers queued:

    1. The first one soft expires at t = 10, and say it has a slack of 50, so it hard expires at t = 60.

    2. The second one is a normal timer, so the soft/hard expiry of it is both at t = 30.

    Now say, an hrtimer interrupt happens at t=5 courtesy of an unrelated expiring timer. In the below code, we notice that the next expiring timer is (the one with slack one), which has not soft-expired yet. So we have no reason to run it. However, we reprogram the next timer interrupt to be t=60 which is its hard expiry time (this is stored in expires_next to use as the value to program the next timer interrupt with).  Now we have a big problem, because the timer expiring at t=30 will not run in time and run much later.

    As shown below, the loop in hrtimer_interrupt() goes through all the active timers in the timerqueue, _softexpires is made to be the real expiry, and the old _expires now becomes _softexpires + slack.

           while((node = timerqueue_getnext(&base->active))) {
                  struct hrtimer *timer;
                  timer = container_of(node, struct hrtimer, node);
                   * The immediate goal for using the softexpires is
                   * minimizing wakeups, not running timers at the
                   * earliest interrupt after their soft expiration.
                   * This allows us to avoid using a Priority Search
                   * Tree, which can answer a stabbing querry for
                   * overlapping intervals and instead use the simple
                   * BST we already have.
                   * We don't add extra wakeups by delaying timers that
                   * are right-of a not yet expired timer, because that
                   * timer will have to trigger a wakeup anyway.
                  if (basenow.tv64 < hrtimer_get_softexpires_tv64(timer)) {
                          ktime_t expires;
                          expires = ktime_sub(hrtimer_get_expires(timer),
                          if (expires.tv64 < expires_next.tv64)
                                  expires_next = expires;
                  __run_hrtimer(timer, &basenow);

    However, this seems to be an old kernel issue, as, in upstream v6.0, I believe the next hrtimer interrupt will be programmed correctly because __hrtimer_next_event_base() calls hrtimer_get_expires() which correctly use the “hard expiry” times to do the programming.