The previous part was the first in this chapter, which describes timers and time management in the Linux kernel. There we got acquainted with two concepts:
jiffies
clocksource
The first is a global variable that is defined in the include/linux/jiffies.h header file and represents a counter that is incremented on each timer interrupt. So if we can access this global variable and we know the timer interrupt rate, we can convert jiffies to human time units. As we already know, the timer interrupt rate is represented by the compile-time constant called HZ in the Linux kernel. The value of HZ is equal to the value of the CONFIG_HZ kernel configuration option, and if we look into the arch/x86/configs/x86_64_defconfig kernel configuration file, we will see that the following kernel configuration option is set:
CONFIG_HZ_1000=y
This means that the value of CONFIG_HZ will be 1000 by default for the x86_64 architecture. So, if we divide the value of jiffies by the value of HZ:
jiffies / HZ
we will get the number of seconds that have elapsed since the Linux kernel started to work, or in other words the system uptime (approximately: the kernel starts the counter from a non-zero initial value, INITIAL_JIFFIES, which has to be subtracted first). Since HZ represents the number of timer interrupts in a second, we can also compute a jiffies value for some moment in the future. For example:
/* one minute from now */
unsigned long later = jiffies + 60*HZ;
/* five minutes from now */
unsigned long later = jiffies + 5*60*HZ;
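The arithmetic above can be tried outside the kernel. The following is only a user-space sketch: HZ is hard-coded to 1000 to mirror CONFIG_HZ_1000=y, and ticks_to_secs and deadline_after are hypothetical helper names, not kernel functions.

```c
/* Assumption: HZ is fixed at 1000 to mirror CONFIG_HZ_1000=y. In the
 * kernel it is a compile-time constant taken from the configuration. */
#define HZ 1000UL

/* Hypothetical helper: whole seconds represented by a number of ticks. */
static unsigned long ticks_to_secs(unsigned long ticks)
{
	return ticks / HZ;
}

/* Hypothetical helper: the tick value `secs` seconds after `now`.
 * Unsigned arithmetic wraps around silently, just like jiffies does. */
static unsigned long deadline_after(unsigned long now, unsigned long secs)
{
	return now + secs * HZ;
}
```

With these definitions, ticks_to_secs(60 * HZ) is 60 and deadline_after(5000, 60) is 65000.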
Computing a deadline this way is a very common practice in the Linux kernel. For example, if you look into the arch/x86/kernel/smpboot.c source code file, you will find the do_boot_cpu function. This function is used to boot the processors other than the bootstrap processor. You can find a snippet there that waits ten seconds for a response from the application processor:
if (!boot_error) {
timeout = jiffies + 10*HZ;
while (time_before(jiffies, timeout)) {
...
...
...
udelay(100);
}
...
...
...
}
We assign the jiffies + 10*HZ value to the timeout variable here. As you have probably already understood, this means a ten-second timeout. After this we enter a loop where we use the time_before macro to compare the current jiffies value with our timeout.
Or, for example, if we look into the sound/isa/sscape.c source code file, which represents the driver for the Ensoniq Soundscape Elite sound card, we will see the obp_startup_ack function that waits up to a given timeout for the On-Board Processor to return its start-up acknowledgement sequence:
static int obp_startup_ack(struct soundscape *s, unsigned timeout)
{
unsigned long end_time = jiffies + msecs_to_jiffies(timeout);
do {
...
...
...
x = host_read_unsafe(s->io_base);
...
...
...
if (x == 0xfe || x == 0xff)
return 1;
msleep(10);
} while (time_before(jiffies, end_time));
return 0;
}
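Both snippets compare jiffies with time_before instead of a plain less-than, because the counter eventually wraps around. The idea can be sketched in user space as below. This is a re-creation of the technique (a signed comparison of the unsigned difference), not the kernel's macros, and the sketch_ names are made up for this example.

```c
#include <limits.h>

/* True if tick value `a` is later than `b`, even if the counter wrapped
 * in between. The unsigned difference is reinterpreted as signed, which
 * works as long as the two values are less than half the counter range
 * apart (and assumes the usual two's complement conversion). */
static int sketch_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* True if tick value `a` is earlier than `b`. */
static int sketch_time_before(unsigned long a, unsigned long b)
{
	return sketch_time_after(b, a);
}
```

If now is ULONG_MAX - 5 and the timeout is now + 10, the timeout wraps around to 4. A naive now < timeout test then reports that the deadline has already passed, while sketch_time_before(now, timeout) still returns true.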
As you can see, the jiffies variable is very widely used in the Linux kernel code. As I already wrote, we met another new time management related concept in the previous part: clocksource. We have only seen a short description of this concept and the API for clock source registration. Let's take a closer look at it in this part.
The clocksource concept represents the generic API for clock source management in the Linux kernel. Why do we need a separate framework for this? Let's go back to the beginning. Time is a fundamental concept in the Linux kernel, as in other operating system kernels, and timekeeping is one of the necessities for using it. For example, the Linux kernel must know and update the time elapsed since system startup, it must determine how long the current process has been running on every processor, and much more. Where can the Linux kernel get information about time? First of all, there is the Real Time Clock or RTC, which is a nonvolatile device. You can find a set of architecture-independent real time clock drivers in the drivers/rtc directory of the Linux kernel. Besides this, each architecture can provide a driver for the architecture-dependent real time clock, for example CMOS/RTC in arch/x86/kernel/rtc.c for the x86 architecture. The second is the system timer, a timer that generates interrupts at a periodic rate. For example, on IBM PC compatibles it was the programmable interval timer.
We already know that for timekeeping purposes we can use jiffies in the Linux kernel. The jiffies can be considered a read-only global variable which is updated with HZ frequency. We know that HZ is a compile-time kernel parameter whose reasonable range is from 100 to 1000 Hz. So, we are guaranteed to have an interface for time measurement with a resolution of 1 to 10 milliseconds. Besides standard jiffies, we saw the refined_jiffies clock source in the previous part, which is based on the i8253/i8254 programmable interval timer tick rate of about 1193182 hertz. So we can get a resolution of about 1 microsecond with refined_jiffies. Nowadays, nanoseconds are the favorite choice for the time value units of a given clock source.
The availability of more precise techniques for time interval measurement is hardware-dependent. We have learned only a little about x86-specific timer hardware, but each architecture provides its own timer hardware. Earlier, each architecture had its own implementation for this purpose. The solution to this problem is an abstraction layer and associated API in a common code framework for managing various clock sources, independent of the timer interrupt. This common code framework became the clocksource framework.
The generic timeofday and clock source management framework moved a lot of timekeeping code into the architecture-independent portion of the code, with the architecture-dependent portion reduced to defining and managing the low-level hardware pieces of clock sources. Measuring time intervals on different architectures with different hardware takes a lot of effort and is very complex. The implementation of each clock-related service is strongly associated with an individual hardware device and, as you can understand, without a common framework this results in similar implementations being repeated for different architectures.
Within this framework, each clock source is required to maintain a representation of time as a monotonically increasing value. As we can see in the Linux kernel code, nanoseconds are currently the favorite choice for the time value units of a clock source. One of the main points of the clock source framework is to allow a user to select a clock source among a range of available hardware devices supporting clock functions when configuring the system, and to provide a common way of selecting, accessing and scaling different clock sources.
The foundation of the clocksource framework is the clocksource structure, which is defined in the include/linux/clocksource.h header file. We already saw some fields that are provided by the clocksource structure in the previous part. Let's look at the full definition of this structure and try to describe all of its fields:
struct clocksource {
cycle_t (*read)(struct clocksource *cs);
cycle_t mask;
u32 mult;
u32 shift;
u64 max_idle_ns;
u32 maxadj;
#ifdef CONFIG_ARCH_CLOCKSOURCE_DATA
struct arch_clocksource_data archdata;
#endif
u64 max_cycles;
const char *name;
struct list_head list;
int rating;
int (*enable)(struct clocksource *cs);
void (*disable)(struct clocksource *cs);
unsigned long flags;
void (*suspend)(struct clocksource *cs);
void (*resume)(struct clocksource *cs);
#ifdef CONFIG_CLOCKSOURCE_WATCHDOG
struct list_head wd_list;
cycle_t cs_last;
cycle_t wd_last;
#endif
struct module *owner;
} ____cacheline_aligned;
We already saw the first field of the clocksource structure in the previous part. It is a pointer to the read function, which returns the current cycle value of the counter behind the clock source. For example, we use the jiffies_read function to read the jiffies value:
static struct clocksource clocksource_jiffies = {
...
.read = jiffies_read,
...
}
where jiffies_read just returns:
static cycle_t jiffies_read(struct clocksource *cs)
{
return (cycle_t) jiffies;
}
Or the read_tsc function:
static struct clocksource clocksource_tsc = {
...
.read = read_tsc,
...
};
for the time stamp counter reading.
The next field is mask, a bitmask covering the valid bits of the counter. It ensures that subtraction between counter values from counters narrower than 64 bits does not need special overflow logic. After the mask field, we can see two fields: mult and shift. These fields are the basis of the arithmetic that converts time values specific to each clock source. In other words, these two fields help us to convert the abstract machine time units of a counter to nanoseconds.
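To make the roles of mask, mult and shift more concrete, here is a minimal user-space sketch. The helper names sketch_delta and sketch_cyc2ns are assumptions made for this example, and the code only illustrates the arithmetic, not the kernel's actual helpers.

```c
#include <stdint.h>

/* Difference between two counter reads. Masking the result makes it
 * correct across one wraparound of a counter that is only as wide as
 * `mask` (for example 0xffffff for a 24-bit counter). */
static uint64_t sketch_delta(uint64_t now, uint64_t last, uint64_t mask)
{
	return (now - last) & mask;
}

/* Fixed-point conversion of cycles to nanoseconds: the real factor
 * "nanoseconds per cycle" is approximated by mult / 2^shift. */
static uint64_t sketch_cyc2ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	return (cycles * mult) >> shift;
}
```

For a 24-bit counter that was read as 0xfffff0 and then as 0x000010, sketch_delta returns 0x20, that is 32 cycles, although the counter wrapped in between. For a counter with 4 nanoseconds per cycle, one possible pair is mult = 4 << 10 and shift = 10, so sketch_cyc2ns(3, 4 << 10, 10) gives 12.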
After these two fields we can see the 64-bit max_idle_ns field, which represents the maximum idle time permitted by the clocksource, in nanoseconds. We need this field for Linux kernels built with the CONFIG_NO_HZ kernel configuration option enabled. This kernel configuration option enables the Linux kernel to run without a regular timer tick (we will see a full explanation of this in another part). The problem is that the dynamic tick allows the kernel to sleep for periods longer than a single tick, and moreover the sleep time could otherwise be unlimited. The max_idle_ns field represents this sleeping limit.
The next field after max_idle_ns is the maxadj field, which is the maximum adjustment value to mult. The main formula by which we convert cycles to nanoseconds:
((u64) cycles * mult) >> shift;
is not 100% accurate. Instead, the number is taken as close as possible to a nanosecond, and maxadj helps to correct this and allows the clocksource API to avoid mult values that might overflow when adjusted. The next four fields are pointers to functions:
enable - optional function to enable the clocksource;
disable - optional function to disable the clocksource;
suspend - suspend function for the clocksource;
resume - resume function for the clocksource;
The next field is max_cycles and, as we can understand from its name, this field represents the maximum cycle value before potential overflow. The last field is owner, which is a reference to the kernel module that owns the clocksource. This is all. We just went through all the standard fields of the clocksource structure. But you may have noticed that we missed some fields of the clocksource structure. We can divide the missed fields into two types. Fields of the first type are already known to us: for example, the name field that represents the name of a clocksource, the rating field that helps the Linux kernel to select the best clocksource, and so on. Fields of the second type depend on different Linux kernel configuration options. Let's look at these fields.
The first field is archdata. This field has the arch_clocksource_data type and depends on the CONFIG_ARCH_CLOCKSOURCE_DATA kernel configuration option. At the moment this field is relevant only for the x86 and IA64 architectures. And again, as we can understand from the field's name, it represents architecture-specific data for a clock source. For example, it represents the vDSO clock mode:
struct arch_clocksource_data {
int vclock_mode;
};
for the x86 architectures, where the vDSO clock mode can be one of:
#define VCLOCK_NONE 0
#define VCLOCK_TSC 1
#define VCLOCK_HPET 2
#define VCLOCK_PVCLOCK 3
The last three fields, wd_list, cs_last and wd_last, depend on the CONFIG_CLOCKSOURCE_WATCHDOG kernel configuration option. First of all, let's try to understand what a watchdog is. In simple words, a watchdog is a timer that is used for detecting computer malfunctions and recovering from them. All three of these fields contain watchdog-related data that is used by the clocksource framework. If we grep the Linux kernel source code, we will see that only the arch/x86/Kconfig kernel configuration file contains the CONFIG_CLOCKSOURCE_WATCHDOG kernel configuration option. So, why do x86 and x86_64 need a watchdog? You may already know that all x86 processors have a special 64-bit register, the time stamp counter. This register contains the number of cycles since reset. Sometimes the time stamp counter needs to be verified against another clock source. We will not look at the initialization of the watchdog timer in this part; before that we must learn more about timers.
That's all. Now we know all the fields of the clocksource structure. This knowledge will help us to learn the insides of the clocksource framework.
We saw only one function from the clocksource framework in the previous part. This function was __clocksource_register. It is defined in the include/linux/clocksource.h header file and, as we can understand from the function's name, its main purpose is to register a new clocksource. If we look at the implementation of the __clocksource_register function, we will see that it just calls the __clocksource_register_scale function and returns its result:
static inline int __clocksource_register(struct clocksource *cs)
{
return __clocksource_register_scale(cs, 1, 0);
}
Before we look at the implementation of the __clocksource_register_scale function, note that the clocksource framework provides an additional API for registering a new clock source:
static inline int clocksource_register_hz(struct clocksource *cs, u32 hz)
{
return __clocksource_register_scale(cs, 1, hz);
}
static inline int clocksource_register_khz(struct clocksource *cs, u32 khz)
{
return __clocksource_register_scale(cs, 1000, khz);
}
All of these functions do the same thing. They return the value of the __clocksource_register_scale function, but call it with a different set of parameters. The __clocksource_register_scale function is defined in the kernel/time/clocksource.c source code file. To understand the difference between these functions, let's look at the parameters of the __clocksource_register_scale function. As we can see, this function takes three parameters:
cs - clocksource to be installed;
scale - scale factor of a clock source. In other words, if we multiply the value of this parameter by freq, we will get the frequency of the clocksource in hertz;
freq - clock source frequency divided by scale.
Now let's look at the implementation of the __clocksource_register_scale function:
int __clocksource_register_scale(struct clocksource *cs, u32 scale, u32 freq)
{
__clocksource_update_freq_scale(cs, scale, freq);
mutex_lock(&clocksource_mutex);
clocksource_enqueue(cs);
clocksource_enqueue_watchdog(cs);
clocksource_select();
mutex_unlock(&clocksource_mutex);
return 0;
}
First of all, we can see that the __clocksource_register_scale function starts with a call to the __clocksource_update_freq_scale function, which is defined in the same source code file and updates the given clock source with the new frequency. Let's look at the implementation of this function. In the first step we need to check the given frequency and, if it was not passed as zero, we need to calculate the mult and shift parameters for the given clock source. Why do we need to check the value of the frequency? Because it can actually be zero. If you looked attentively at the implementation of the __clocksource_register function, you may have noticed that we passed the frequency as 0. We do this only for clock sources that have self-defined mult and shift parameters. Look at the previous part and you will see the calculation of mult and shift for jiffies. The __clocksource_update_freq_scale function will do it for us for other clock sources.
So at the start of the __clocksource_update_freq_scale function we check the value of the freq parameter and, if it is not zero, we calculate mult and shift for the given clock source. Let's look at the mult and shift calculation:
void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq)
{
u64 sec;
if (freq) {
sec = cs->mask;
do_div(sec, freq);
do_div(sec, scale);
if (!sec)
sec = 1;
else if (sec > 600 && cs->mask > UINT_MAX)
sec = 600;
clocks_calc_mult_shift(&cs->mult, &cs->shift, freq,
NSEC_PER_SEC / scale, sec * scale);
}
...
...
...
}
Here we can see the calculation of the maximum number of seconds which we can run before a clock source counter overflows. First of all, we fill the sec variable with the value of the clock source mask. Remember that a clock source's mask covers the bits that are valid for the given clock source, so it is also the maximum value of its counter. After this, we can see two division operations. At first we divide our sec variable by the clock source frequency and then by the scale factor. The freq parameter tells us how many counter cycles occur in one second, expressed in units of scale. So, we divide the mask value, which represents the maximum value of a counter (for example jiffies), by the frequency of the counter and get the maximum number of seconds for the given clock source. The second division accounts for the scale factor, which can be 1 (the frequency is given in hertz) or 1000 (the frequency is given in kilohertz, 10^3 Hz).
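The two divisions and the clamping that follows them can be replayed in user space. This is a sketch that copies only the arithmetic of the snippet above: sketch_max_sec is a hypothetical name, plain division replaces do_div, and the caller is expected to pass a non-zero freq, as the kernel code does by checking it first.

```c
#include <stdint.h>
#include <limits.h>

/* Number of seconds used as the conversion range for a counter with the
 * given mask, running at freq * scale hertz. */
static uint64_t sketch_max_sec(uint64_t mask, uint32_t scale, uint32_t freq)
{
	uint64_t sec = mask;

	sec /= freq;
	sec /= scale;
	if (!sec)
		sec = 1;	/* very fast or very narrow counter */
	else if (sec > 600 && mask > UINT_MAX)
		sec = 600;	/* wide counter: cap at ten minutes */
	return sec;
}
```

For a 32-bit counter running at 1193182 Hz this gives 3599 seconds, while a 64-bit counter running at 2500000 kHz is capped at 600.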
After we have the maximum number of seconds, we check this value in the next step and set it to 1 if it is zero, or to 600 if it is greater than 600 and the mask is wider than 32 bits. These values are the maximum sleeping time for a clocksource in seconds. In the next step we can see the call of clocks_calc_mult_shift. The main purpose of this function is to calculate the mult and shift values for a given clock source. At the end of the __clocksource_update_freq_scale function we check that the just calculated mult value of the given clock source will not cause an overflow after adjustment, update the max_idle_ns and max_cycles values of the clock source with the maximum number of nanoseconds that can be converted from the clock source counter, and print the result to the kernel log buffer:
pr_info("%s: mask: 0x%llx max_cycles: 0x%llx, max_idle_ns: %lld ns\n",
cs->name, cs->mask, cs->max_cycles, cs->max_idle_ns);
that we can see in the dmesg output:
$ dmesg | grep "clocksource:"
[ 0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[ 0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484882848 ns
[ 0.094084] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[ 0.205302] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[ 1.452979] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x7350b459580, max_idle_ns: 881591204237 ns
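How a mult and shift pair can be derived from a frequency is easier to see in code. The sketch below is a user-space adaptation modeled on the approach of clocks_calc_mult_shift. It is written from the general idea (pick the largest shift for which converting maxsec seconds of cycles still fits into 64 bits) and may differ in details from any particular kernel version. All sketch_ names are assumptions made for this example.

```c
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC 1000000000U

/* Find mult and shift so that (cycles * mult) >> shift converts a
 * counter running at `from` Hz into units of `to` per second, without
 * overflowing 64 bits for intervals of up to `maxsec` seconds. */
static void sketch_calc_mult_shift(uint32_t *mult, uint32_t *shift,
				   uint32_t from, uint32_t to, uint32_t maxsec)
{
	uint64_t tmp;
	uint32_t sft, sftacc = 32;

	/* The more cycles fit into maxsec, the fewer bits mult may use. */
	tmp = ((uint64_t)maxsec * from) >> 32;
	while (tmp) {
		tmp >>= 1;
		sftacc--;
	}

	/* Take the largest shift whose rounded mult fits into that budget. */
	for (sft = 32; sft > 0; sft--) {
		tmp = (uint64_t)to << sft;
		tmp += from / 2;
		tmp /= from;
		if ((tmp >> sftacc) == 0)
			break;
	}
	*mult = (uint32_t)tmp;
	*shift = sft;
}

/* Conversion that uses the computed pair. */
static uint64_t sketch_to_ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	return (cycles * mult) >> shift;
}
```

For a hypothetical 1 GHz counter and a 600 second range, the search ends with mult equal to 2^23 and shift equal to 23, so one cycle maps to exactly one nanosecond. For a frequency that does not divide evenly, such as 1193182 Hz, the conversion of one second worth of cycles lands within a nanosecond of 1000000000.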
After the __clocksource_update_freq_scale function finishes its work, we return to the __clocksource_register_scale function, which registers the new clock source. We can see the calls of the following three functions:
mutex_lock(&clocksource_mutex);
clocksource_enqueue(cs);
clocksource_enqueue_watchdog(cs);
clocksource_select();
mutex_unlock(&clocksource_mutex);
Note that before the first of them is called, we lock the clocksource_mutex mutex. The point of the clocksource_mutex mutex is to protect the curr_clocksource variable, which represents the currently selected clocksource, and the clocksource_list variable, which represents the list of registered clocksources. Now, let's look at these three functions.
The first function, clocksource_enqueue, and the other two are defined in the same source code file. It goes through all already registered clocksources, or in other words through all elements of the clocksource_list, and tries to find the best place for the given clocksource:
static void clocksource_enqueue(struct clocksource *cs)
{
struct list_head *entry = &clocksource_list;
struct clocksource *tmp;
list_for_each_entry(tmp, &clocksource_list, list)
if (tmp->rating >= cs->rating)
entry = &tmp->list;
list_add(&cs->list, entry);
}
In the end we just insert the new clocksource into the clocksource_list, so the list stays sorted by rating in descending order. The second function, clocksource_enqueue_watchdog, does almost the same as the previous function, but it inserts the new clock source into the wd_list depending on the flags of the clock source and starts a new watchdog timer. As I already wrote, we will not consider watchdog-related stuff in this part but will do it in the next parts.
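The rating-ordered insertion done by clocksource_enqueue can be reproduced with an ordinary singly linked list. The following is only an illustrative sketch: struct sketch_cs and sketch_enqueue are made-up names, and the kernel uses its own list_head machinery instead.

```c
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for struct clocksource. */
struct sketch_cs {
	const char *name;
	int rating;
	struct sketch_cs *next;
};

/* Insert `cs` after the last entry whose rating is >= cs->rating, so
 * that the list stays sorted from the best rating to the worst. */
static void sketch_enqueue(struct sketch_cs **head, struct sketch_cs *cs)
{
	struct sketch_cs **link = head;

	while (*link && (*link)->rating >= cs->rating)
		link = &(*link)->next;
	cs->next = *link;
	*link = cs;
}
```

If entries with illustrative ratings 1, 250, 200 and 300 are enqueued in that order, the list ends up as 300, 250, 200, 1, so the best-rated clock source is always at the head.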
The last function is clocksource_select. As we can understand from the function's name, the main purpose of this function is to select the best clocksource from the registered clocksources. This function consists only of a call to a helper function:
static void clocksource_select(void)
{
return __clocksource_select(false);
}
Note that the __clocksource_select function takes one parameter (false in our case). This bool parameter controls how the clocksource_list is traversed (in the kernel source it is named skipcur, and true makes the search skip the currently selected clock source). In our case we pass false, which means that we will go through all entries of the clocksource_list. We already know that the clocksource with the best rating will be first in the clocksource_list after the call of the clocksource_enqueue function, so we can easily get it from this list. After we have found the clock source with the best rating, we switch to it:
if (curr_clocksource != best && !timekeeping_notify(best)) {
pr_info("Switched to clocksource %s\n", best->name);
curr_clocksource = best;
}
We can see the result of this operation in the dmesg output:
$ dmesg | grep Switched
[ 0.199688] clocksource: Switched to clocksource hpet
[ 2.452966] clocksource: Switched to clocksource tsc
Note that we can see two clock sources in the dmesg output (hpet and tsc in our case). Yes, there can actually be many different clock sources on particular hardware. The Linux kernel knows about all registered clock sources and switches to a clock source with a better rating each time a new clock source is registered.
If we look at the bottom of the kernel/time/clocksource.c source code file, we will see that it has a sysfs interface. The main initialization occurs in the init_clocksource_sysfs function, which is called during device initcalls. Let's look at the implementation of the init_clocksource_sysfs function:
static struct bus_type clocksource_subsys = {
.name = "clocksource",
.dev_name = "clocksource",
};
static int __init init_clocksource_sysfs(void)
{
int error = subsys_system_register(&clocksource_subsys, NULL);
if (!error)
error = device_register(&device_clocksource);
if (!error)
error = device_create_file(
&device_clocksource,
&dev_attr_current_clocksource);
if (!error)
error = device_create_file(&device_clocksource,
&dev_attr_unbind_clocksource);
if (!error)
error = device_create_file(
&device_clocksource,
&dev_attr_available_clocksource);
return error;
}
device_initcall(init_clocksource_sysfs);
First of all, we can see that it registers the clocksource subsystem with a call to the subsys_system_register function. In other words, after the call of this function, we will have the following directory:
$ pwd
/sys/devices/system/clocksource
After this step, we can see the registration of the device_clocksource device, which is represented by the following structure:
static struct device device_clocksource = {
.id = 0,
.bus = &clocksource_subsys,
};
and the creation of three files:
dev_attr_current_clocksource;
dev_attr_unbind_clocksource;
dev_attr_available_clocksource.
These files provide information about the current clock source in the system, the available clock sources in the system, and an interface which allows unbinding a clock source.
After the init_clocksource_sysfs function has been executed, we will be able to find information about the available clock sources in:
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
Or for example information about current clock source in the system:
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
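The same files can be read from a program. Below is a small user-space sketch. The function name sketch_read_line is hypothetical; only the sysfs path comes from the output shown above, and the file may be missing on systems without this sysfs interface.

```c
#include <stdio.h>
#include <string.h>

/* Read the first line of `path` into `buf` (at most len - 1 characters)
 * and strip the trailing newline. Returns 0 on success and -1 if the
 * file cannot be opened or read. */
static int sketch_read_line(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```

Calling it with "/sys/devices/system/clocksource/clocksource0/current_clocksource" and a buffer of a few dozen bytes should leave the name of the current clock source in the buffer, or return -1 where the file does not exist.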
In the previous part, we saw the API for the registration of the jiffies clock source, but didn't dive into details of the clocksource framework. In this part we did, and saw the implementation of new clock source registration and the selection of the clock source with the best rating value in the system. Of course, this is not the whole API that the clocksource framework provides. There are a couple of additional functions, like clocksource_unregister for removing a given clock source from the clocksource_list, and so on. But I will not describe these functions in this part, because they are not important for us right now. Anyway, if you are interested in them, you can find them in kernel/time/clocksource.c.
That's all.
This is the end of the second part of the chapter that describes timers and timer management related stuff in the Linux kernel. In the previous part we got acquainted with the following two concepts: jiffies and clocksource. In this part we saw some examples of jiffies usage and learned more details about the clocksource concept.
If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.
Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.