Driver Basics

Driver Entry and Exit points

module_init(x)

driver initialization entry point

Parameters

x

function to be run at kernel boot time or module insertion

Description

module_init() will either be called during do_initcalls() (if builtin) or at module insertion time (if a module). There can only be one per module.

module_exit(x)

driver exit entry point

Parameters

x

function to be run when driver is removed

Description

module_exit() will wrap the driver clean-up code with cleanup_module() when used with rmmod when the driver is a module. If the driver is statically compiled into the kernel, module_exit() has no effect. There can only be one per module.

Driver device table

struct pci_device_id

PCI device ID structure

Definition

struct pci_device_id {
  __u32 vendor, device;
  __u32 subvendor, subdevice;
  __u32 class, class_mask;
  kernel_ulong_t driver_data;
};

Members

vendor

Vendor ID to match (or PCI_ANY_ID)

device

Device ID to match (or PCI_ANY_ID)

subvendor

Subsystem vendor ID to match (or PCI_ANY_ID)

subdevice

Subsystem device ID to match (or PCI_ANY_ID)

class

Device class, subclass, and “interface” to match. See Appendix D of the PCI Local Bus Spec or include/linux/pci_ids.h for a full list of classes. Most drivers do not need to specify class/class_mask as vendor/device is normally sufficient.

class_mask

Limit which sub-fields of the class field are compared. See drivers/scsi/sym53c8xx_2/ for an example of usage.

driver_data

Data private to the driver. Most drivers don’t need to use driver_data field. Best practice is to use driver_data as an index into a static list of equivalent device types, instead of using it as a pointer.
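The matching rules implied by the members above (PCI_ANY_ID as a wildcard, class_mask selecting which class bits are compared) can be sketched in plain userspace C. This is a simplified model, not the kernel's code: the sk_ names, the local SK_PCI_ANY_ID constant, and the class_code field (standing in for class) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins; not the kernel's own definitions. */
#define SK_PCI_ANY_ID (~0u)

struct sk_pci_id {
    uint32_t vendor, device;
    uint32_t subvendor, subdevice;
    uint32_t class_code, class_mask;   /* "class" in the kernel struct */
    unsigned long driver_data;
};

struct sk_pci_dev {
    uint32_t vendor, device;
    uint32_t subvendor, subdevice;
    uint32_t class_code;               /* base class, subclass, interface */
};

/* A table field matches if it is the wildcard or equals the device value. */
static bool sk_field_ok(uint32_t want, uint32_t have)
{
    return want == SK_PCI_ANY_ID || want == have;
}

/* True if every ID field matches and the masked class bits are equal. */
static bool sk_pci_match(const struct sk_pci_id *id,
                         const struct sk_pci_dev *dev)
{
    assert(id != NULL && dev != NULL);
    return sk_field_ok(id->vendor, dev->vendor) &&
           sk_field_ok(id->device, dev->device) &&
           sk_field_ok(id->subvendor, dev->subvendor) &&
           sk_field_ok(id->subdevice, dev->subdevice) &&
           ((id->class_code ^ dev->class_code) & id->class_mask) == 0;
}
```

With class_mask left at 0 the class comparison always succeeds, which is why an entry that only sets vendor and device is normally sufficient.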

struct usb_device_id

identifies USB devices for probing and hotplugging

Definition

struct usb_device_id {
  __u16 match_flags;
  __u16 idVendor;
  __u16 idProduct;
  __u16 bcdDevice_lo;
  __u16 bcdDevice_hi;
  __u8 bDeviceClass;
  __u8 bDeviceSubClass;
  __u8 bDeviceProtocol;
  __u8 bInterfaceClass;
  __u8 bInterfaceSubClass;
  __u8 bInterfaceProtocol;
  __u8 bInterfaceNumber;
  kernel_ulong_t driver_info;
};

Members

match_flags

Bit mask controlling which of the other fields are used to match against new devices. Any field except for driver_info may be used, although some only make sense in conjunction with other fields. This is usually set by a USB_DEVICE_*() macro, which sets all other fields in this structure except for driver_info.

idVendor

USB vendor ID for a device; numbers are assigned by the USB forum to its members.

idProduct

Vendor-assigned product ID.

bcdDevice_lo

Low end of range of vendor-assigned product version numbers. This is also used to identify individual product versions, for a range consisting of a single device.

bcdDevice_hi

High end of version number range. The range of product versions is inclusive.

bDeviceClass

Class of device; numbers are assigned by the USB forum. Products may choose to implement classes, or be vendor-specific. Device classes specify behavior of all the interfaces on a device.

bDeviceSubClass

Subclass of device; associated with bDeviceClass.

bDeviceProtocol

Protocol of device; associated with bDeviceClass.

bInterfaceClass

Class of interface; numbers are assigned by the USB forum. Products may choose to implement classes, or be vendor-specific. Interface classes specify behavior only of a given interface; other interfaces may support other classes.

bInterfaceSubClass

Subclass of interface; associated with bInterfaceClass.

bInterfaceProtocol

Protocol of interface; associated with bInterfaceClass.

bInterfaceNumber

Number of interface; composite devices may use fixed interface numbers to differentiate between vendor-specific interfaces.

driver_info

Holds information used by the driver. Usually it holds a pointer to a descriptor understood by the driver, or perhaps device flags.

Description

In most cases, drivers will create a table of device IDs by using USB_DEVICE(), or similar macros designed for that purpose. They will then export it to userspace using MODULE_DEVICE_TABLE(), and provide it to the USB core through their usb_driver structure.

See the usb_match_id() function for information about how matches are performed. Briefly, you will normally use one of several macros to help construct these entries. Each entry you provide will either identify one or more specific products, or will identify a class of products which have agreed to behave the same. You should put the more specific matches towards the beginning of your table, so that driver_info can record quirks of specific products.
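The advice to list specific matches first can be illustrated with a reduced userspace model of a first-match table walk. Everything here is an assumption made for the sketch: the sk_ names, the trimmed-down structures, the local flag values, and the use of match_flags == 0 as the table terminator. The real logic lives in usb_match_id().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values local to this sketch; kernel code uses USB_DEVICE_ID_MATCH_*. */
#define SK_MATCH_VENDOR    0x0001u
#define SK_MATCH_PRODUCT   0x0002u
#define SK_MATCH_INT_CLASS 0x0080u

struct sk_usb_id {
    uint16_t match_flags;
    uint16_t idVendor;
    uint16_t idProduct;
    uint8_t bInterfaceClass;
    unsigned long driver_info;
};

struct sk_usb_dev {
    uint16_t idVendor;
    uint16_t idProduct;
    uint8_t bInterfaceClass;
};

/* Only the fields selected by match_flags take part in the comparison. */
static int sk_usb_match_one(const struct sk_usb_id *id,
                            const struct sk_usb_dev *dev)
{
    if ((id->match_flags & SK_MATCH_VENDOR) && id->idVendor != dev->idVendor)
        return 0;
    if ((id->match_flags & SK_MATCH_PRODUCT) && id->idProduct != dev->idProduct)
        return 0;
    if ((id->match_flags & SK_MATCH_INT_CLASS) &&
        id->bInterfaceClass != dev->bInterfaceClass)
        return 0;
    return 1;
}

/* Return the first matching entry, or NULL; earlier entries win. */
static const struct sk_usb_id *sk_usb_match_id(const struct sk_usb_id *table,
                                               const struct sk_usb_dev *dev)
{
    assert(table != NULL && dev != NULL);
    for (; table->match_flags != 0; table++)
        if (sk_usb_match_one(table, dev))
            return table;
    return NULL;
}
```

Because the walk stops at the first hit, a vendor/product entry placed ahead of a class-wide entry lets its driver_info carry a quirk for that one product.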

struct mdio_device_id

identifies PHY devices on an MDIO/MII bus

Definition

struct mdio_device_id {
  __u32 phy_id;
  __u32 phy_id_mask;
};

Members

phy_id

The result of (mdio_read(MII_PHYSID1) << 16 | mdio_read(MII_PHYSID2)) & phy_id_mask for this PHY type

phy_id_mask

Defines the significant bits of phy_id. A value of 0 is used to terminate an array of struct mdio_device_id.
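As a rough illustration of the phy_id formula above, the following userspace sketch builds the 32-bit ID from two 16-bit register values and compares it under the mask. The sk_ names are hypothetical and the register values are passed in directly rather than read over MDIO.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sk_mdio_id {
    uint32_t phy_id;
    uint32_t phy_id_mask;
};

/* Combine the MII_PHYSID1 (high) and MII_PHYSID2 (low) register values. */
static uint32_t sk_phy_id_from_regs(uint16_t physid1, uint16_t physid2)
{
    return ((uint32_t)physid1 << 16) | physid2;
}

/* Compare only the bits that phy_id_mask marks as significant. */
static bool sk_phy_id_matches(const struct sk_mdio_id *id, uint32_t hw_id)
{
    assert(id != NULL);
    return (hw_id & id->phy_id_mask) == (id->phy_id & id->phy_id_mask);
}
```

A mask such as 0xfffffff0 would ignore the low four bits, so one table entry can cover several revisions of the same PHY type.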

struct amba_id

identifies a device on an AMBA bus

Definition

struct amba_id {
  unsigned int            id;
  unsigned int            mask;
  void *data;
};

Members

id

The significant bits of the hardware device ID

mask

Bitmask specifying which bits of the id field are significant when matching. A driver binds to a device when ((hardware device ID) & mask) == id.

data

Private data used by the driver.

struct mips_cdmm_device_id

identifies devices in MIPS CDMM bus

Definition

struct mips_cdmm_device_id {
  __u8 type;
};

Members

type

Device type identifier.

struct mei_cl_device_id

MEI client device identifier

Definition

struct mei_cl_device_id {
  char name[MEI_CL_NAME_SIZE];
  uuid_le uuid;
  __u8 version;
  kernel_ulong_t driver_info;
};

Members

name

helper name

uuid

client uuid

version

client protocol version

driver_info

information used by the driver.

Description

identifies mei client device by uuid and name

struct rio_device_id

RIO device identifier

Definition

struct rio_device_id {
  __u16 did, vid;
  __u16 asm_did, asm_vid;
};

Members

did

RapidIO device ID

vid

RapidIO vendor ID

asm_did

RapidIO assembly device ID

asm_vid

RapidIO assembly vendor ID

Description

Identifies a RapidIO device based on both the device/vendor IDs and the assembly device/vendor IDs.

struct fsl_mc_device_id

MC object device identifier

Definition

struct fsl_mc_device_id {
  __u16 vendor;
  const char obj_type[16];
};

Members

vendor

vendor ID

obj_type

MC object type

Description

Type of entries in the “device Id” table for MC object devices supported by a MC object device driver. The last entry of the table has vendor set to 0x0
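A table terminated by an entry with vendor set to 0x0 can be walked as in the userspace sketch below. The sk_ names and the string comparison on obj_type are assumptions for illustration; the vendor and type values used with it are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct sk_mc_id {
    uint16_t vendor;
    const char obj_type[16];
};

/* Walk the table until the vendor == 0x0 sentinel; NULL if nothing matches. */
static const struct sk_mc_id *sk_mc_lookup(const struct sk_mc_id *table,
                                           uint16_t vendor,
                                           const char *obj_type)
{
    assert(table != NULL && obj_type != NULL);
    for (; table->vendor != 0x0; table++)
        if (table->vendor == vendor &&
            strcmp(table->obj_type, obj_type) == 0)
            return table;
    return NULL;
}
```

Forgetting the sentinel entry would let the loop run past the end of the array, so the terminator is part of the table's contract.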

struct tb_service_id

Thunderbolt service identifiers

Definition

struct tb_service_id {
  __u32 match_flags;
  char protocol_key[8 + 1];
  __u32 protocol_id;
  __u32 protocol_version;
  __u32 protocol_revision;
  kernel_ulong_t driver_data;
};

Members

match_flags

Flags used to match the structure

protocol_key

Protocol key the service supports

protocol_id

Protocol id the service supports

protocol_version

Version of the protocol

protocol_revision

Revision of the protocol software

driver_data

Driver specific data

Description

Thunderbolt XDomain services are exposed as devices where each device carries the protocol information the service supports. Thunderbolt XDomain service drivers match against that information.

struct typec_device_id

USB Type-C alternate mode identifiers

Definition

struct typec_device_id {
  __u16 svid;
  __u8 mode;
  kernel_ulong_t driver_data;
};

Members

svid

Standard or Vendor ID

mode

Mode index

driver_data

Driver specific data

struct tee_client_device_id

tee based device identifier

Definition

struct tee_client_device_id {
  uuid_t uuid;
};

Members

uuid

For TEE based client devices we use the device uuid as the identifier.

struct wmi_device_id

WMI device identifier

Definition

struct wmi_device_id {
  const char guid_string[UUID_STRING_LEN+1];
  const void *context;
};

Members

guid_string

36 char string of the form fa50ff2b-f2e8-45de-83fa-65417f2f49ba

Delaying, scheduling, and timer routines

struct prev_cputime

snapshot of system and user cputime

Definition

struct prev_cputime {
#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
  u64 utime;
  u64 stime;
  raw_spinlock_t lock;
#endif
};

Members

utime

time spent in user mode

stime

time spent in system mode

lock

protects the above two fields

Description

Stores previous user/system time values such that we can guarantee monotonicity.

struct task_cputime

collected CPU time counts

Definition

struct task_cputime {
  u64 utime;
  u64 stime;
  unsigned long long              sum_exec_runtime;
};

Members

utime

time spent in user mode, in nanoseconds

stime

time spent in kernel mode, in nanoseconds

sum_exec_runtime

total time spent on the CPU, in nanoseconds

Description

This structure groups together three kinds of CPU time that are tracked for threads and thread groups. Most things considering CPU time want to group these counts together and treat all three of them in parallel.

struct util_est

Estimated utilization of FAIR tasks

Definition

struct util_est {
  unsigned int                    enqueued;
  unsigned int                    ewma;
#define UTIL_EST_WEIGHT_SHIFT           2
};

Members

enqueued

instantaneous estimated utilization of a task/cpu

ewma

the Exponential Weighted Moving Average (EWMA) utilization of a task

Description

Support data structure to track an Exponential Weighted Moving Average (EWMA) of a FAIR task’s utilization. New samples are added to the moving average each time a task completes an activation. Sample’s weight is chosen so that the EWMA will be relatively insensitive to transient changes to the task’s workload.

The enqueued attribute has a slightly different meaning for tasks and CPUs:

  • task: the task’s util_avg at last task dequeue time
  • cfs_rq: the sum of util_est.enqueued for each RUNNABLE task on that CPU

Thus, the util_est.enqueued of a task represents the contribution to the estimated utilization of the CPU where that task is currently enqueued.

A moving average of the past instantaneous estimated utilization is tracked only for tasks. This makes it possible to absorb sporadic drops in utilization of an otherwise almost periodic task.
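The EWMA update can be sketched in userspace C, assuming that UTIL_EST_WEIGHT_SHIFT of 2 means each new sample carries a weight of 1/4, so that new_ewma = ewma + (sample - ewma) / 4. The sk_ names and the input bound are assumptions of this sketch, not kernel definitions.

```c
#include <assert.h>

#define SK_UTIL_EST_WEIGHT_SHIFT 2   /* sample weight = 1 / (1 << 2) = 1/4 */

/* Fold one utilization sample into the moving average using shifts only. */
static unsigned int sk_ewma_update(unsigned int ewma, unsigned int sample)
{
    long diff, scaled;

    /* Keep inputs small so the intermediate arithmetic cannot overflow. */
    assert(ewma <= (1u << 20) && sample <= (1u << 20));

    diff = (long)sample - (long)ewma;
    /* (ewma << 2) + diff equals 3 * ewma + sample, which is never negative. */
    scaled = ((long)ewma << SK_UTIL_EST_WEIGHT_SHIFT) + diff;
    return (unsigned int)(scaled >> SK_UTIL_EST_WEIGHT_SHIFT);
}
```

With this weight, one sample of 0 after a steady 400 lowers the average only to 300, which is the insensitivity to transient changes described above.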

int pid_alive(const struct task_struct * p)

check that a task structure is not stale

Parameters

const struct task_struct * p

Task structure to be checked.

Description

Test if a process is not yet dead (at most zombie state). If pid_alive() fails, then pointers within the task structure can be stale and must not be dereferenced.

Return

1 if the process is alive. 0 otherwise.

int is_global_init(struct task_struct * tsk)

check if a task structure is init. Since init is free to have sub-threads we need to check tgid.

Parameters

struct task_struct * tsk

Task structure to be checked.

Description

Check if a task structure is the first user space task the kernel created.

Return

1 if the task structure is init. 0 otherwise.

int task_nice(const struct task_struct * p)

return the nice value of a given task.

Parameters

const struct task_struct * p

the task in question.

Return

The nice value [ -20 … 0 … 19 ].

bool is_idle_task(const struct task_struct * p)

is the specified task an idle task?

Parameters

const struct task_struct * p

the task in question.

Return

1 if p is an idle task. 0 otherwise.

int wake_up_process(struct task_struct * p)

Wake up a specific process

Parameters

struct task_struct * p

The process to be woken up.

Description

Attempt to wake up the nominated process and move it to the set of runnable processes.

Return

1 if the process was woken up, 0 if it was already running.

This function executes a full memory barrier before accessing the task state.

void preempt_notifier_register(struct preempt_notifier * notifier)

tell me when current is being preempted & rescheduled

Parameters

struct preempt_notifier * notifier

notifier struct to register

void preempt_notifier_unregister(struct preempt_notifier * notifier)

no longer interested in preemption notifications

Parameters

struct preempt_notifier * notifier

notifier struct to unregister

Description

This is not safe to call from within a preemption notifier.

__visible void notrace preempt_schedule_notrace(void)

preempt_schedule called by tracing

Parameters

void

no arguments

Description

The tracing infrastructure uses preempt_enable_notrace to prevent recursion and tracing preempt enabling caused by the tracing infrastructure itself. But as tracing can happen in areas coming from userspace or just about to enter userspace, a preempt enable can occur before user_exit() is called. This will cause the scheduler to be called when the system is still in usermode.

To prevent this, the preempt_enable_notrace will use this function instead of preempt_schedule() to exit user context if needed before calling the scheduler.

int sched_setscheduler(struct task_struct * p, int policy, const struct sched_param * param)

change the scheduling policy and/or RT priority of a thread.

Parameters

struct task_struct * p

the task in question.

int policy

new policy.

const struct sched_param * param

structure containing the new RT priority.

Return

0 on success. An error code otherwise.

NOTE that the task may be already dead.

int sched_setscheduler_nocheck(struct task_struct * p, int policy, const struct sched_param * param)

change the scheduling policy and/or RT priority of a thread from kernelspace.

Parameters

struct task_struct * p

the task in question.

int policy

new policy.

const struct sched_param * param

structure containing the new RT priority.

Description

Just like sched_setscheduler, only don’t bother checking if the current context has permission. For example, this is needed in stop_machine(): we create temporary high priority worker threads, but our caller might not have that capability.

Return

0 on success. An error code otherwise.

void yield(void)

yield the current processor to other threads.

Parameters

void

no arguments

Description

Do not ever use this function, there’s a 99% chance you’re doing it wrong.

The scheduler is at all times free to pick the calling task as the most eligible task to run. If removing the yield() call from your code breaks it, it’s already broken.

Typical broken usage is:

  while (!event)
    yield();

where one assumes that yield() will let ‘the other’ process run that will make event true. If the current task is a SCHED_FIFO task that will never happen. Never use yield() as a progress guarantee!!

If you want to use yield() to wait for something, use wait_event(). If you want to use yield() to be ‘nice’ for others, use cond_resched(). If you still want to use yield(), do not!

int yield_to(struct task_struct * p, bool preempt)

yield the current processor to another thread in your thread group, or accelerate that thread toward the processor it’s on.

Parameters

struct task_struct * p

target task

bool preempt

whether task preemption is allowed or not

Description

It’s the caller’s job to ensure that the target task struct can’t go away on us before we can do any checks.

Return

true (>0) if we indeed boosted the target task. false (0) if we failed to boost the target. -ESRCH if there’s no task to yield to.

int cpupri_find(struct cpupri * cp, struct task_struct * p, struct cpumask * lowest_mask)

find the best (lowest-pri) CPU in the system

Parameters

struct cpupri * cp

The cpupri context

struct task_struct * p

The task

struct cpumask * lowest_mask

A mask to fill in with selected CPUs (or NULL)

Note

This function returns the recommended CPUs as calculated during the current invocation. By the time the call returns, the CPUs may have in fact changed priorities any number of times. While not ideal, it is not an issue of correctness since the normal rebalancer logic will correct any discrepancies created by racing against the uncertainty of the current priority configuration.

Return

(int)bool - CPUs were found

void cpupri_set(struct cpupri * cp, int cpu, int newpri)

update the CPU priority setting

Parameters

struct cpupri * cp

The cpupri context

int cpu

The target CPU

int newpri

The priority (INVALID-RT99) to assign to this CPU

Note

Assumes cpu_rq(cpu)->lock is locked

Return

(void)

int cpupri_init(struct cpupri * cp)

initialize the cpupri structure

Parameters

struct cpupri * cp

The cpupri context

Return

-ENOMEM on memory allocation failure.

void cpupri_cleanup(struct cpupri * cp)

clean up the cpupri structure

Parameters

struct cpupri * cp

The cpupri context

void update_tg_load_avg(struct cfs_rq * cfs_rq, int force)

update the tg’s load avg

Parameters

struct cfs_rq * cfs_rq

the cfs_rq whose avg changed

int force

update regardless of how small the difference

Description

This function ‘ensures’: tg->load_avg := Sum tg->cfs_rq[]->avg.load. However, because tg->load_avg is a global value there are performance considerations.

In order to avoid having to look at the other cfs_rq’s, we use a differential update where we store the last value we propagated. This in turn allows skipping updates if the differential is ‘small’.

Updating tg’s load_avg is necessary before update_cfs_share().
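The differential update can be modeled in userspace roughly as follows. The sk_ structures are hypothetical reductions of the real ones, and the 1/64 threshold for a "small" difference is an assumption of this sketch rather than something stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct sk_tg {
    long load_avg;               /* global sum across all cfs_rqs */
};

struct sk_cfs_rq {
    struct sk_tg *tg;
    long avg_load;               /* current local average */
    long tg_load_avg_contrib;    /* last value propagated to tg->load_avg */
};

/* Propagate only the change since the last update; skip small deltas. */
static bool sk_update_tg_load_avg(struct sk_cfs_rq *cfs_rq, bool force)
{
    long delta;

    assert(cfs_rq != NULL && cfs_rq->tg != NULL);
    delta = cfs_rq->avg_load - cfs_rq->tg_load_avg_contrib;

    /* Assumed threshold: ignore changes of at most 1/64 of the contribution. */
    if (!force && labs(delta) <= cfs_rq->tg_load_avg_contrib / 64)
        return false;

    cfs_rq->tg->load_avg += delta;
    cfs_rq->tg_load_avg_contrib = cfs_rq->avg_load;
    return true;
}
```

Storing the last propagated value means no other cfs_rq has to be inspected, and the shared tg->load_avg is written only when the change is large enough to matter.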

int update_cfs_rq_load_avg(u64 now, struct cfs_rq * cfs_rq)

update the cfs_rq’s load/util averages

Parameters

u64 now

current time, as per cfs_rq_clock_pelt()

struct cfs_rq * cfs_rq

cfs_rq to update

Description

The cfs_rq avg is the direct sum of all its entities (blocked and runnable) avg. The immediate corollary is that all (fair) tasks must be attached, see post_init_entity_util_avg().

cfs_rq->avg is used for task_h_load() and update_cfs_share() for example.

Returns true if the load decayed or we removed load.

Since both these conditions indicate a changed cfs_rq->avg.load we should call update_tg_load_avg() when this function returns true.

void attach_entity_load_avg(struct cfs_rq * cfs_rq, struct sched_entity * se, int flags)

attach this entity to its cfs_rq load avg

Parameters

struct cfs_rq * cfs_rq

cfs_rq to attach to

struct sched_entity * se

sched_entity to attach

int flags

migration hints

Description

Must call update_cfs_rq_load_avg() before this, since we rely on cfs_rq->avg.last_update_time being current.

void detach_entity_load_avg(struct cfs_rq * cfs_rq, struct sched_entity * se)

detach this entity from its cfs_rq load avg

Parameters

struct cfs_rq * cfs_rq

cfs_rq to detach from

struct sched_entity * se

sched_entity to detach

Description

Must call update_cfs_rq_load_avg() before this, since we rely on cfs_rq->avg.last_update_time being current.

unsigned long cpu_util(int cpu)

Parameters

int cpu

the CPU to get the utilization of

Description

The unit of the return value must be that of capacity, so that the utilization can be compared with the capacity of the CPU that is available for CFS tasks (i.e. cpu_capacity).

cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the recent utilization of currently non-runnable tasks on a CPU. It represents the amount of utilization of a CPU in the range [0..capacity_orig] where capacity_orig is the cpu_capacity available at the highest frequency (arch_scale_freq_capacity()). The utilization of a CPU converges towards a sum equal to or less than the current capacity (capacity_curr <= capacity_orig) of the CPU because it is the running time on this CPU scaled by capacity_curr.

The estimated utilization of a CPU is defined to be the maximum between its cfs_rq.avg.util_avg and the sum of the estimated utilization of the tasks currently RUNNABLE on that CPU. This makes it possible to properly represent the expected utilization of a CPU which has just got a big task running after a long sleep period. At the same time however it preserves the benefits of the “blocked utilization” in describing the potential for other tasks waking up on the same CPU.

Nevertheless, cfs_rq.avg.util_avg can be higher than capacity_curr or even higher than capacity_orig because of unfortunate rounding in cfs_rq.avg.util_avg or just after migrating tasks and new task wakeups until the average stabilizes with the new running time. We need to check that the utilization stays within the range of [0..capacity_orig] and cap it if necessary. Without utilization capping, a group could be seen as overloaded (CPU0 utilization at 121% + CPU1 utilization at 80%) whereas CPU1 has 20% of available capacity. We allow utilization to overshoot capacity_curr (but not capacity_orig) as it is useful for predicting the capacity required after task migrations (scheduler-driven DVFS).

Return

the (estimated) utilization for the specified CPU
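The two rules described above (take the maximum of util_avg and the summed estimate, then cap at capacity_orig) reduce to a few lines. This is a userspace sketch with hypothetical names that takes the three quantities as plain arguments instead of reading them from a run queue.

```c
#include <assert.h>

/* Estimated utilization: max of the two signals, capped at capacity_orig. */
static unsigned long sk_cpu_util(unsigned long util_avg,
                                 unsigned long util_est_enqueued,
                                 unsigned long capacity_orig)
{
    unsigned long util;

    assert(capacity_orig > 0);
    util = util_avg > util_est_enqueued ? util_avg : util_est_enqueued;
    return util < capacity_orig ? util : capacity_orig;
}
```

For a capacity_orig of 1024, a transient util_avg of 1239 (about 121%) is reported as 1024, so the overshoot on one CPU cannot make a whole group look overloaded.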

void update_sg_lb_stats(struct lb_env * env, struct sched_group * group, struct sg_lb_stats * sgs, int * sg_status)

Update sched_group’s statistics for load balancing.

Parameters

struct lb_env * env

The load balancing environment.

struct sched_group * group

sched_group whose statistics are to be updated.

struct sg_lb_stats * sgs

variable to hold the statistics for this group.

int * sg_status

Holds flag indicating the status of the sched_group

bool update_sd_pick_busiest(struct lb_env * env, struct sd_lb_stats * sds, struct sched_group * sg, struct sg_lb_stats * sgs)

return 1 on busiest group

Parameters

struct lb_env * env

The load balancing environment.

struct sd_lb_stats * sds

sched_domain statistics

struct sched_group * sg

sched_group candidate to be checked for being the busiest

struct sg_lb_stats * sgs

sched_group statistics

Description

Determine if sg is a busier group than the previously selected busiest group.

Return

true if sg is a busier group than the previously selected busiest group. false otherwise.

void update_sd_lb_stats(struct lb_env * env, struct sd_lb_stats * sds)

Update sched_domain’s statistics for load balancing.

Parameters

struct lb_env * env

The load balancing environment.

struct sd_lb_stats * sds

variable to hold the statistics for this sched_domain.

int check_asym_packing(struct lb_env * env, struct sd_lb_stats * sds)

Check to see if the group is packed into the sched domain.

Parameters

struct lb_env * env

The load balancing environment.

struct sd_lb_stats * sds

Statistics of the sched_domain which is to be packed

Description

This is primarily intended to be used at the sibling level. Some cores like POWER7 prefer to use lower numbered SMT threads. In the case of POWER7, it can move to lower SMT modes only when higher threads are idle. When in lower SMT modes, the threads will perform better since they share less core resources. Hence when we have idle threads, we want them to be the higher ones.

This packing function is run on idle threads. It checks to see if the busiest CPU in this domain (core in the P7 case) has a higher CPU number than the CPU the packing function is being run on. Here we are assuming a lower CPU number will be equivalent to a lower SMT thread number.

Return

1 when packing is required and a task should be moved to this CPU. The amount of the imbalance is returned in env->imbalance.

void fix_small_imbalance(struct lb_env * env, struct sd_lb_stats * sds)

Calculate the minor imbalance that exists amongst the groups of a sched_domain, during load balancing.

Parameters

struct lb_env * env

The load balancing environment.

struct sd_lb_stats * sds

Statistics of the sched_domain whose imbalance is to be calculated.

void calculate_imbalance(struct lb_env * env, struct sd_lb_stats * sds)

Calculate the amount of imbalance present within the groups of a given sched_domain during load balance.

Parameters

struct lb_env * env

load balance environment

struct sd_lb_stats * sds

statistics of the sched_domain whose imbalance is to be calculated.

struct sched_group * find_busiest_group(struct lb_env * env)

Returns the busiest group within the sched_domain if there is an imbalance.

Parameters

struct lb_env * env

The load balancing environment.

Description

Also calculates the amount of runnable load which should be moved to restore balance.

Return

  • The busiest group if imbalance exists.

DECLARE_COMPLETION(work)

declare and initialize a completion structure

Parameters

work

identifier for the completion structure

Description

This macro declares and initializes a completion structure. Generally used for static declarations. You should use the _ONSTACK variant for automatic variables.

DECLARE_COMPLETION_ONSTACK(work)

declare and initialize a completion structure

Parameters

work

identifier for the completion structure

Description

This macro declares and initializes a completion structure on the kernel stack.

void __init_completion(struct completion * x)

Initialize a dynamically allocated completion

Parameters

struct completion * x

pointer to completion structure that is to be initialized

Description

This inline function will initialize a dynamically created completion structure.

void reinit_completion(struct completion * x)

reinitialize a completion structure

Parameters

struct completion * x

pointer to completion structure that is to be reinitialized

Description

This inline function should be used to reinitialize a completion structure so it can be reused. This is especially important after complete_all() is used.
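Why reinitialization matters after complete_all() can be seen in a single-threaded userspace model of the completion's counter. This sketch assumes a plain done counter where complete_all() saturates it to UINT_MAX; it has no wait queue or locking, and all sk_ names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

struct sk_completion {
    unsigned int done;
};

static void sk_init_completion(struct sk_completion *x)
{
    assert(x != NULL);
    x->done = 0;
}

/* Reinitializing simply clears the counter so the object can be reused. */
static void sk_reinit_completion(struct sk_completion *x)
{
    x->done = 0;
}

/* Signal one waiter: each call allows exactly one wait to succeed. */
static void sk_complete(struct sk_completion *x)
{
    if (x->done != UINT_MAX)
        x->done++;
}

/* Signal everyone: the counter saturates and is never consumed. */
static void sk_complete_all(struct sk_completion *x)
{
    x->done = UINT_MAX;
}

/* Non-blocking wait: consume one signal if any is available. */
static bool sk_try_wait(struct sk_completion *x)
{
    if (x->done == 0)
        return false;
    if (x->done != UINT_MAX)
        x->done--;
    return true;
}
```

In this model every wait after sk_complete_all() succeeds immediately until sk_reinit_completion() is called, which mirrors why a reused completion must be reinitialized first.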

unsigned long __round_jiffies(unsigned long j, int cpu)

function to round jiffies to a full second

Parameters

unsigned long j

the time in (absolute) jiffies that should be rounded

int cpu

the processor number on which the timeout will happen

Description

__round_jiffies() rounds an absolute time in the future (in jiffies) up or down to (approximately) full seconds. This is useful for timers for which the exact time they fire does not matter too much, as long as they fire approximately every X seconds.

By rounding these timers to whole seconds, all such timers will fire at the same time, rather than at various times spread out. The goal of this is to have the CPU wake up less, which saves power.

The exact rounding is skewed for each processor to avoid all processors firing at the exact same time, which could lead to lock contention or spurious cache line bouncing.

The return value is the rounded version of the j parameter.

unsigned long __round_jiffies_relative(unsigned long j, int cpu)

function to round jiffies to a full second

Parameters

unsigned long j

the time in (relative) jiffies that should be rounded

int cpu

the processor number on which the timeout will happen

Description

__round_jiffies_relative() rounds a time delta in the future (in jiffies) up or down to (approximately) full seconds. This is useful for timers for which the exact time they fire does not matter too much, as long as they fire approximately every X seconds.

By rounding these timers to whole seconds, all such timers will fire at the same time, rather than at various times spread out. The goal of this is to have the CPU wake up less, which saves power.

The exact rounding is skewed for each processor to avoid all processors firing at the exact same time, which could lead to lock contention or spurious cache line bouncing.

The return value is the rounded version of the j parameter.

unsigned long round_jiffies(unsigned long j)

function to round jiffies to a full second

Parameters

unsigned long j

the time in (absolute) jiffies that should be rounded

Description

round_jiffies() rounds an absolute time in the future (in jiffies) up or down to (approximately) full seconds. This is useful for timers for which the exact time they fire does not matter too much, as long as they fire approximately every X seconds.

By rounding these timers to whole seconds, all such timers will fire at the same time, rather than at various times spread out. The goal of this is to have the CPU wake up less, which saves power.

The return value is the rounded version of the j parameter.

unsigned long round_jiffies_relative(unsigned long j)

function to round jiffies to a full second

Parameters

unsigned long j

the time in (relative) jiffies that should be rounded

Description

round_jiffies_relative() rounds a time delta in the future (in jiffies) up or down to (approximately) full seconds. This is useful for timers for which the exact time they fire does not matter too much, as long as they fire approximately every X seconds.

By rounding these timers to whole seconds, all such timers will fire at the same time, rather than at various times spread out. The goal of this is to have the CPU wake up less, which saves power.

The return value is the rounded version of the j parameter.

unsigned long __round_jiffies_up(unsigned long j, int cpu)

function to round jiffies up to a full second

Parameters

unsigned long j

the time in (absolute) jiffies that should be rounded

int cpu

the processor number on which the timeout will happen

Description

This is the same as __round_jiffies() except that it will never round down. This is useful for timeouts for which the exact time of firing does not matter too much, as long as they don’t fire too early.

unsigned long __round_jiffies_up_relative(unsigned long j, int cpu)

function to round jiffies up to a full second

Parameters

unsigned long j

the time in (relative) jiffies that should be rounded

int cpu

the processor number on which the timeout will happen

Description

This is the same as __round_jiffies_relative() except that it will never round down. This is useful for timeouts for which the exact time of firing does not matter too much, as long as they don’t fire too early.

unsigned long round_jiffies_up(unsigned long j)

function to round jiffies up to a full second

Parameters

unsigned long j

the time in (absolute) jiffies that should be rounded

Description

This is the same as round_jiffies() except that it will never round down. This is useful for timeouts for which the exact time of firing does not matter too much, as long as they don’t fire too early.

unsigned long round_jiffies_up_relative(unsigned long j)

function to round jiffies up to a full second

Parameters

unsigned long j

the time in (relative) jiffies that should be rounded

Description

This is the same as round_jiffies_relative() except that it will never round down. This is useful for timeouts for which the exact time of firing does not matter too much, as long as they don’t fire too early.
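The behaviour of the rounding functions documented above can be approximated in userspace as follows. The specific constants (a skew of 3 jiffies per CPU and a round-down window of HZ/4) are assumptions of this sketch and may not match any given kernel version. The current time and HZ are passed as parameters instead of being read from globals, and jiffies wraparound is ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Round an absolute time to a second boundary, with a per-CPU skew. */
static unsigned long sk_round_jiffies_common(unsigned long j, int cpu,
                                             bool force_up,
                                             unsigned long now,
                                             unsigned long hz)
{
    unsigned long original = j;
    unsigned long rem;

    assert(hz > 0 && cpu >= 0);

    j += (unsigned long)cpu * 3;       /* assumed skew: 3 jiffies per CPU */
    rem = j % hz;
    if (rem < hz / 4 && !force_up)
        j -= rem;                      /* just past a boundary: round down */
    else
        j = j - rem + hz;              /* otherwise round up */
    j -= (unsigned long)cpu * 3;       /* remove the skew again */

    /* Never hand back a time that is not in the future. */
    return j > now ? j : original;
}

/* Relative variant: convert to absolute, round, convert back to a delta. */
static unsigned long sk_round_jiffies_relative(unsigned long delta, int cpu,
                                               unsigned long now,
                                               unsigned long hz)
{
    return sk_round_jiffies_common(delta + now, cpu, false, now, hz) - now;
}
```

Setting force_up models the _up variants, which never round down, and a non-zero cpu shifts the chosen boundary slightly so that different processors do not all fire at the same instant.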

void init_timer_key(struct timer_list * timer, void (*func)(struct timer_list *), unsigned int flags, const char * name, struct lock_class_key * key)

initialize a timer

Parameters

struct timer_list * timer

the timer to be initialized

void (*)(struct timer_list *) func

timer callback function

unsigned int flags

timer flags

const char * name

name of the timer

struct lock_class_key * key

lockdep class key of the fake lock used for tracking timer sync lock dependencies

Description

init_timer_key() must be done to a timer prior to calling any of the other timer functions.

int mod_timer_pending(struct timer_list * timer, unsigned long expires)

modify a pending timer’s timeout

Parameters

struct timer_list * timer

the pending timer to be modified

unsigned long expires

new timeout in jiffies

Description

mod_timer_pending() is the same for pending timers as mod_timer(), but will not re-activate and modify already deleted timers.

It is useful for unserialized use of timers.

int mod_timer(struct timer_list * timer, unsigned long expires)

modify a timer’s timeout

Parameters

struct timer_list * timer

the timer to be modified

unsigned long expires

new timeout in jiffies

Description

mod_timer() is a more efficient way to update the expire field of an active timer (if the timer is inactive it will be activated)

mod_timer(timer, expires) is equivalent to:

  del_timer(timer);
  timer->expires = expires;
  add_timer(timer);

Note that if there are multiple unserialized concurrent users of the same timer, then mod_timer() is the only safe way to modify the timeout, since add_timer() cannot modify an already running timer.

The function returns whether it has modified a pending timer or not (i.e. mod_timer() of an inactive timer returns 0, mod_timer() of an active timer returns 1).

int timer_reduce(struct timer_list * timer, unsigned long expires)

Modify a timer’s timeout if it would reduce the timeout

Parameters

struct timer_list * timer

The timer to be modified

unsigned long expires

New timeout in jiffies

Description

timer_reduce() is very similar to mod_timer(), except that it will only modify a running timer if that would reduce the expiration time (it will start a timer that isn’t running).
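The different treatment of pending and inactive timers by mod_timer(), mod_timer_pending() and timer_reduce() can be sketched with a small userspace model. This is not kernel code: struct model_timer and the model_* functions are hypothetical names, and the timer wheel, locking and callbacks are left out. The return value of model_timer_reduce() (1 if the timer was already pending) is an assumption that follows mod_timer()'s convention, since the description above does not state it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct timer_list. */
struct model_timer {
    bool pending;           /* true while the timer is queued */
    unsigned long expires;  /* expiry time in model "jiffies" */
};

/* Wrap-safe "a is earlier than b" comparison on a free-running counter. */
bool model_time_before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}

/* Models mod_timer(): always (re)arms; returns 1 if the timer was pending. */
int model_mod_timer(struct model_timer *t, unsigned long expires)
{
    assert(t);
    int was_pending = t->pending;
    t->expires = expires;
    t->pending = true;
    return was_pending;
}

/* Models mod_timer_pending(): never re-activates an inactive timer. */
int model_mod_timer_pending(struct model_timer *t, unsigned long expires)
{
    assert(t);
    if (!t->pending)
        return 0;
    t->expires = expires;
    return 1;
}

/*
 * Models timer_reduce(): starts an inactive timer, but moves a pending
 * timer only if the new expiry is earlier than the current one.
 */
int model_timer_reduce(struct model_timer *t, unsigned long expires)
{
    assert(t);
    if (!t->pending) {
        t->expires = expires;
        t->pending = true;
        return 0;
    }
    if (model_time_before(expires, t->expires))
        t->expires = expires;
    return 1;
}
```

In this model, calling model_timer_reduce() with a later expiry on a pending timer leaves the expiry untouched, whereas model_mod_timer() would push it out.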

void add_timer(struct timer_list * timer)

start a timer

Parameters

struct timer_list * timer

the timer to be added

Description

The kernel will do a ->function(timer) callback from the timer interrupt at the ->expires point in the future. The current time is ‘jiffies’.

The timer’s ->expires and ->function fields must be set prior to calling this function.

Timers with an ->expires field in the past will be executed in the next timer tick.

void add_timer_on(struct timer_list * timer, int cpu)

start a timer on a particular CPU

Parameters

struct timer_list * timer

the timer to be added

int cpu

the CPU to start it on

Description

This is not very scalable on SMP. Double adds are not possible.

int del_timer(struct timer_list * timer)

deactivate a timer.

Parameters

struct timer_list * timer

the timer to be deactivated

Description

del_timer() deactivates a timer - this works on both active and inactive timers.

The function returns whether it has deactivated a pending timer or not (i.e. del_timer() of an inactive timer returns 0, del_timer() of an active timer returns 1).

int try_to_del_timer_sync(struct timer_list * timer)

Try to deactivate a timer

Parameters

struct timer_list * timer

timer to delete

Description

This function tries to deactivate a timer. Upon successful (ret >= 0) exit the timer is not queued and the handler is not running on any CPU.

int del_timer_sync(struct timer_list * timer)

deactivate a timer and wait for the handler to finish.

Parameters

struct timer_list * timer

the timer to be deactivated

Description

This function only differs from del_timer() on SMP: besides deactivating the timer it also makes sure the handler has finished executing on other CPUs.

Synchronization rules: Callers must prevent restarting of the timer, otherwise this function is meaningless. It must not be called from interrupt contexts unless the timer is an irqsafe one. The caller must not hold locks which would prevent completion of the timer’s handler. The timer’s handler must not call add_timer_on(). Upon exit the timer is not queued and the handler is not running on any CPU.

Note

For !irqsafe timers, you must not hold locks that are held in interrupt context while calling this function, even if the lock has nothing to do with the timer in question. Here’s why:

CPU0                             CPU1
----                             ----
                                 <SOFTIRQ>
                                   call_timer_fn();
                                   base->running_timer = mytimer;
spin_lock_irq(somelock);
                                 <IRQ>
                                    spin_lock(somelock);
del_timer_sync(mytimer);
while (base->running_timer == mytimer);

Now del_timer_sync() will never return and never release somelock. The interrupt on the other CPU is waiting to grab somelock but it has interrupted the softirq that CPU0 is waiting to finish.

The function returns whether it has deactivated a pending timer or not.

signed long schedule_timeout(signed long timeout)

sleep until timeout

Parameters

signed long timeout

timeout value in jiffies

Description

Make the current task sleep until timeout jiffies have elapsed. The routine will return immediately unless the current task state has been set (see set_current_state()).

You can set the task state as follows -

TASK_UNINTERRUPTIBLE - at least timeout jiffies are guaranteed to pass before the routine returns unless the current task is explicitly woken up (e.g. by wake_up_process()).

TASK_INTERRUPTIBLE - the routine may return early if a signal is delivered to the current task or the current task is explicitly woken up.

The current task state is guaranteed to be TASK_RUNNING when this routine returns.

Specifying a timeout value of MAX_SCHEDULE_TIMEOUT will schedule the CPU away without a bound on the timeout. In this case the return value will be MAX_SCHEDULE_TIMEOUT.

Returns 0 when the timer has expired; otherwise the remaining time in jiffies is returned. In all cases the return value is guaranteed to be non-negative.

void msleep(unsigned int msecs)

sleep safely even with waitqueue interruptions

Parameters

unsigned int msecs

Time in milliseconds to sleep for

unsigned long msleep_interruptible(unsigned int msecs)

sleep waiting for signals

Parameters

unsigned int msecs

Time in milliseconds to sleep for

void usleep_range(unsigned long min, unsigned long max)

Sleep for an approximate time

Parameters

unsigned long min

Minimum time in usecs to sleep

unsigned long max

Maximum time in usecs to sleep

Description

In non-atomic context where the exact wakeup time is flexible, use usleep_range() instead of udelay(). The sleep improves responsiveness by avoiding the CPU-hogging busy-wait of udelay(), and the range reduces power usage by allowing hrtimers to take advantage of an already-scheduled interrupt instead of scheduling a new one just for this sleep.

Wait queues and Wake events

int waitqueue_active(struct wait_queue_head * wq_head)

locklessly test for waiters on the queue

Parameters

struct wait_queue_head * wq_head

the waitqueue to test for waiters

Description

returns true if the wait list is not empty

NOTE

This function is lockless and requires care; incorrect usage _will_ lead to sporadic and non-obvious failure.

Use either while holding wait_queue_head::lock or when used for wakeups with an extra smp_mb() like:

CPU0 - waker                    CPU1 - waiter

                                for (;;) {
cond = true;                      prepare_to_wait(&wq_head, &wait, state);
smp_mb();                         // smp_mb() from set_current_state()
if (waitqueue_active(wq_head))    if (cond)
  wake_up(wq_head);                 break;
                                  schedule();
                                }
                                finish_wait(&wq_head, &wait);

Because without the explicit smp_mb() it’s possible for the waitqueue_active() load to get hoisted over the cond store such that we’ll observe an empty wait list while the waiter might not observe cond.

Also note that this ‘optimization’ trades a spin_lock() for an smp_mb(), which (when the lock is uncontended) are of roughly equal cost.

bool wq_has_single_sleeper(struct wait_queue_head * wq_head)

check if there is only one sleeper

Parameters

struct wait_queue_head * wq_head

wait queue head

Description

Returns true if wq_head has only one sleeper on the list.

Please refer to the comment for waitqueue_active.

bool wq_has_sleeper(struct wait_queue_head * wq_head)

check if there are any waiting processes

Parameters

struct wait_queue_head * wq_head

wait queue head

Description

Returns true if wq_head has waiting processes

Please refer to the comment for waitqueue_active.

wait_event(wq_head, condition)

sleep until a condition gets true

Parameters

wq_head

the waitqueue to wait on

condition

a C expression for the event to wait for

Description

The process is put to sleep (TASK_UNINTERRUPTIBLE) until the condition evaluates to true. The condition is checked each time the waitqueue wq_head is woken up.

wake_up() has to be called after changing any variable that could change the result of the wait condition.

wait_event_freezable(wq_head, condition)

sleep (or freeze) until a condition gets true

Parameters

wq_head

the waitqueue to wait on

condition

a C expression for the event to wait for

Description

The process is put to sleep (TASK_INTERRUPTIBLE – so as not to contribute to system load) until the condition evaluates to true. The condition is checked each time the waitqueue wq_head is woken up.

wake_up() has to be called after changing any variable that could change the result of the wait condition.

wait_event_timeout(wq_head, condition, timeout)

sleep until a condition gets true or a timeout elapses

Parameters

wq_head

the waitqueue to wait on

condition

a C expression for the event to wait for

timeout

timeout, in jiffies

Description

The process is put to sleep (TASK_UNINTERRUPTIBLE) until the condition evaluates to true. The condition is checked each time the waitqueue wq_head is woken up.

wake_up() has to be called after changing any variable that could change the result of the wait condition.

Return

0 if the condition evaluated to false after the timeout elapsed, 1 if the condition evaluated to true after the timeout elapsed, or the remaining jiffies (at least 1) if the condition evaluated to true before the timeout elapsed.
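The three-way return value described above can be modelled in plain userspace C. The helper below is hypothetical (it is not a kernel function); it only encodes the stated rule, given whether the condition was true at the final check and how many jiffies were left at that point.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model (hypothetical helper, not kernel code) of the value
 * wait_event_timeout() reports once the wait has ended.
 */
long model_wait_timeout_result(bool condition_true, long remaining)
{
    /* The wait only ends when the condition is true or time has run out. */
    assert(remaining >= 0);
    assert(condition_true || remaining == 0);

    if (!condition_true)
        return 0;          /* timed out, condition still false */
    if (remaining == 0)
        return 1;          /* true, but only after the timeout elapsed */
    return remaining;      /* true with time to spare (at least 1) */
}
```

Because 0 is the only value that signals a timeout, a caller can test the result for zero and treat any positive value as success.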

wait_event_cmd(wq_head, condition, cmd1, cmd2)

sleep until a condition gets true

Parameters

wq_head

the waitqueue to wait on

condition

a C expression for the event to wait for

cmd1

the command will be executed before sleep

cmd2

the command will be executed after sleep

Description

The process is put to sleep (TASK_UNINTERRUPTIBLE) until the condition evaluates to true. The condition is checked each time the waitqueue wq_head is woken up.

wake_up() has to be called after changing any variable that could change the result of the wait condition.

wait_event_interruptible(wq_head, condition)

sleep until a condition gets true

Parameters

wq_head

the waitqueue to wait on

condition

a C expression for the event to wait for

Description

The process is put to sleep (TASK_INTERRUPTIBLE) until the condition evaluates to true or a signal is received. The condition is checked each time the waitqueue wq_head is woken up.

wake_up() has to be called after changing any variable that could change the result of the wait condition.

The function will return -ERESTARTSYS if it was interrupted by a signal and 0 if condition evaluated to true.

wait_event_interruptible_timeout(wq_head, condition, timeout)

sleep until a condition gets true or a timeout elapses

Parameters

wq_head

the waitqueue to wait on

condition

a C expression for the event to wait for

timeout

timeout, in jiffies

Description

The process is put to sleep (TASK_INTERRUPTIBLE) until the condition evaluates to true or a signal is received. The condition is checked each time the waitqueue wq_head is woken up.

wake_up() has to be called after changing any variable that could change the result of the wait condition.

Return

0 if the condition evaluated to false after the timeout elapsed, 1 if the condition evaluated to true after the timeout elapsed, the remaining jiffies (at least 1) if the condition evaluated to true before the timeout elapsed, or -ERESTARTSYS if it was interrupted by a signal.

wait_event_hrtimeout(wq_head, condition, timeout)

sleep until a condition gets true or a timeout elapses

Parameters

wq_head

the waitqueue to wait on

condition

a C expression for the event to wait for

timeout

timeout, as a ktime_t

Description

The process is put to sleep (TASK_UNINTERRUPTIBLE) until the condition evaluates to true or the timeout elapses. The condition is checked each time the waitqueue wq_head is woken up.

wake_up() has to be called after changing any variable that could change the result of the wait condition.

The function returns 0 if condition became true, or -ETIME if the timeout elapsed.

wait_event_interruptible_hrtimeout(wq, condition, timeout)

sleep until a condition gets true or a timeout elapses

Parameters

wq

the waitqueue to wait on

condition

a C expression for the event to wait for

timeout

timeout, as a ktime_t

Description

The process is put to sleep (TASK_INTERRUPTIBLE) until the condition evaluates to true or a signal is received. The condition is checked each time the waitqueue wq is woken up.

wake_up() has to be called after changing any variable that could change the result of the wait condition.

The function returns 0 if condition became true, -ERESTARTSYS if it was interrupted by a signal, or -ETIME if the timeout elapsed.
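Note that the hrtimeout variants use the opposite convention from the jiffies-based timeout macros: here 0 means the condition became true, whereas for wait_event_interruptible_timeout() 0 means the timeout elapsed. The userspace sketch below contrasts the two. The enum, the function names and the MODEL_* constants are hypothetical; the numeric values are those used on Linux and are defined locally only so the sketch builds outside the kernel.

```c
#include <assert.h>

/* Linux values, defined locally because this is a userspace sketch. */
#define MODEL_ETIME        62
#define MODEL_ERESTARTSYS  512

enum wait_outcome { WAIT_COND_TRUE, WAIT_TIMED_OUT, WAIT_SIGNALLED };

/* hrtimeout macros: 0 on success, negative error code otherwise. */
enum wait_outcome classify_hrtimeout_result(long ret)
{
    assert(ret == 0 || ret == -MODEL_ETIME || ret == -MODEL_ERESTARTSYS);
    if (ret == 0)
        return WAIT_COND_TRUE;
    if (ret == -MODEL_ETIME)
        return WAIT_TIMED_OUT;
    return WAIT_SIGNALLED;
}

/* Jiffies-based interruptible timeout macro: positive on success, 0 on timeout. */
enum wait_outcome classify_jiffies_timeout_result(long ret)
{
    if (ret > 0)
        return WAIT_COND_TRUE;
    if (ret == 0)
        return WAIT_TIMED_OUT;
    return WAIT_SIGNALLED;
}
```

Mixing up the two conventions inverts the success check, so it may help to wrap each macro family in its own small helper like these.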

wait_event_idle(wq_head, condition)

wait for a condition without contributing to system load

Parameters

wq_head

the waitqueue to wait on

condition

a C expression for the event to wait for

Description

The process is put to sleep (TASK_IDLE) until the condition evaluates to true. The condition is checked each time the waitqueue wq_head is woken up.

wake_up() has to be called after changing any variable that could change the result of the wait condition.

wait_event_idle_exclusive(wq_head, condition)

wait for a condition without contributing to system load

Parameters

wq_head

the waitqueue to wait on

condition

a C expression for the event to wait for

Description

The process is put to sleep (TASK_IDLE) until the condition evaluates to true. The condition is checked each time the waitqueue wq_head is woken up.

The process is put on the wait queue with the WQ_FLAG_EXCLUSIVE flag set, so if other processes wait on the same list, no further processes are considered once this process is woken.

wake_up() has to be called after changing any variable that could change the result of the wait condition.

wait_event_idle_timeout(wq_head, condition, timeout)

sleep without load until a condition becomes true or a timeout elapses

Parameters

wq_head

the waitqueue to wait on

condition

a C expression for the event to wait for

timeout

timeout, in jiffies

Description

The process is put to sleep (TASK_IDLE) until the condition evaluates to true. The condition is checked each time the waitqueue wq_head is woken up.

wake_up() has to be called after changing any variable that could change the result of the wait condition.

Return

0 if the condition evaluated to false after the timeout elapsed, 1 if the condition evaluated to true after the timeout elapsed, or the remaining jiffies (at least 1) if the condition evaluated to true before the timeout elapsed.

wait_event_idle_exclusive_timeout(wq_head, condition, timeout)

sleep without load until a condition becomes true or a timeout elapses

Parameters

wq_head

the waitqueue to wait on

condition

a C expression for the event to wait for

timeout

timeout, in jiffies

Description

The process is put to sleep (TASK_IDLE) until the condition evaluates to true. The condition is checked each time the waitqueue wq_head is woken up.

The process is put on the wait queue with the WQ_FLAG_EXCLUSIVE flag set, so if other processes wait on the same list, no further processes are considered once this process is woken.

wake_up() has to be called after changing any variable that could change the result of the wait condition.

Return

0 if the condition evaluated to false after the timeout elapsed, 1 if the condition evaluated to true after the timeout elapsed, or the remaining jiffies (at least 1) if the condition evaluated to true before the timeout elapsed.

wait_event_interruptible_locked(wq, condition)

sleep until a condition gets true

Parameters

wq

the waitqueue to wait on

condition

a C expression for the event to wait for

Description

The process is put to sleep (TASK_INTERRUPTIBLE) until the condition evaluates to true or a signal is received. The condition is checked each time the waitqueue wq is woken up.

It must be called with wq.lock being held. This spinlock is unlocked while sleeping but condition testing is done while lock is held and when this macro exits the lock is held.

The lock is locked/unlocked using spin_lock()/spin_unlock() functions which must match the way they are locked/unlocked outside of this macro.

wake_up_locked() has to be called after changing any variable that could change the result of the wait condition.

The function will return -ERESTARTSYS if it was interrupted by a signal and 0 if condition evaluated to true.

wait_event_interruptible_locked_irq(wq, condition)

sleep until a condition gets true

Parameters

wq

the waitqueue to wait on

condition

a C expression for the event to wait for

Description

The process is put to sleep (TASK_INTERRUPTIBLE) until the condition evaluates to true or a signal is received. The condition is checked each time the waitqueue wq is woken up.

It must be called with wq.lock being held. This spinlock is unlocked while sleeping but condition testing is done while lock is held and when this macro exits the lock is held.

The lock is locked/unlocked using spin_lock_irq()/spin_unlock_irq() functions which must match the way they are locked/unlocked outside of this macro.

wake_up_locked() has to be called after changing any variable that could change the result of the wait condition.

The function will return -ERESTARTSYS if it was interrupted by a signal and 0 if condition evaluated to true.

wait_event_interruptible_exclusive_locked(wq, condition)

sleep exclusively until a condition gets true

Parameters

wq

the waitqueue to wait on

condition

a C expression for the event to wait for

Description

The process is put to sleep (TASK_INTERRUPTIBLE) until the condition evaluates to true or a signal is received. The condition is checked each time the waitqueue wq is woken up.

It must be called with wq.lock being held. This spinlock is unlocked while sleeping but condition testing is done while lock is held and when this macro exits the lock is held.

The lock is locked/unlocked using spin_lock()/spin_unlock() functions which must match the way they are locked/unlocked outside of this macro.

The process is put on the wait queue with the WQ_FLAG_EXCLUSIVE flag set, so if other processes wait on the same list, no further processes are considered once this process is woken.

wake_up_locked() has to be called after changing any variable that could change the result of the wait condition.

The function will return -ERESTARTSYS if it was interrupted by a signal and 0 if condition evaluated to true.

wait_event_interruptible_exclusive_locked_irq(wq, condition)

sleep exclusively until a condition gets true

Parameters

wq

the waitqueue to wait on

condition

a C expression for the event to wait for

Description

The process is put to sleep (TASK_INTERRUPTIBLE) until the condition evaluates to true or a signal is received. The condition is checked each time the waitqueue wq is woken up.

It must be called with wq.lock being held. This spinlock is unlocked while sleeping but condition testing is done while lock is held and when this macro exits the lock is held.

The lock is locked/unlocked using spin_lock_irq()/spin_unlock_irq() functions which must match the way they are locked/unlocked outside of this macro.

The process is put on the wait queue with the WQ_FLAG_EXCLUSIVE flag set, so if other processes wait on the same list, no further processes are considered once this process is woken.

wake_up_locked() has to be called after changing any variable that could change the result of the wait condition.

The function will return -ERESTARTSYS if it was interrupted by a signal and 0 if condition evaluated to true.

wait_event_killable(wq_head, condition)

sleep until a condition gets true

Parameters

wq_head

the waitqueue to wait on

condition

a C expression for the event to wait for

Description

The process is put to sleep (TASK_KILLABLE) until the condition evaluates to true or a kill signal is received. The condition is checked each time the waitqueue wq_head is woken up.

wake_up() has to be called after changing any variable that could change the result of the wait condition.

The function will return -ERESTARTSYS if it was interrupted by a signal and 0 if condition evaluated to true.

wait_event_killable_timeout(wq_head, condition, timeout)

sleep until a condition gets true or a timeout elapses

Parameters

wq_head

the waitqueue to wait on

condition

a C expression for the event to wait for

timeout

timeout, in jiffies

Description

The process is put to sleep (TASK_KILLABLE) until the condition evaluates to true or a kill signal is received. The condition is checked each time the waitqueue wq_head is woken up.

wake_up() has to be called after changing any variable that could change the result of the wait condition.

Return

0 if the condition evaluated to false after the timeout elapsed, 1 if the condition evaluated to true after the timeout elapsed, the remaining jiffies (at least 1) if the condition evaluated to true before the timeout elapsed, or -ERESTARTSYS if it was interrupted by a kill signal.

Only kill signals interrupt this process.

wait_event_lock_irq_cmd(wq_head, condition, lock, cmd)

sleep until a condition gets true. The condition is checked under the lock. This is expected to be called with the lock taken.

Parameters

wq_head

the waitqueue to wait on

condition

a C expression for the event to wait for

lock

a locked spinlock_t, which will be released before cmd and schedule() and reacquired afterwards.

cmd

a command which is invoked outside the critical section before sleep

Description

The process is put to sleep (TASK_UNINTERRUPTIBLE) until the condition evaluates to true. The condition is checked each time the waitqueue wq_head is woken up.

wake_up() has to be called after changing any variable that could change the result of the wait condition.

This is supposed to be called while holding the lock. The lock is dropped before invoking the cmd and going to sleep and is reacquired afterwards.

wait_event_lock_irq(wq_head, condition, lock)

sleep until a condition gets true. The condition is checked under the lock. This is expected to be called with the lock taken.

Parameters

wq_head

the waitqueue to wait on

condition

a C expression for the event to wait for

lock

a locked spinlock_t, which will be released before schedule() and reacquired afterwards.

Description

The process is put to sleep (TASK_UNINTERRUPTIBLE) until the condition evaluates to true. The condition is checked each time the waitqueue wq_head is woken up.

wake_up() has to be called after changing any variable that could change the result of the wait condition.

This is supposed to be called while holding the lock. The lock is dropped before going to sleep and is reacquired afterwards.

wait_event_interruptible_lock_irq_cmd(wq_head, condition, lock, cmd)

sleep until a condition gets true. The condition is checked under the lock. This is expected to be called with the lock taken.

Parameters

wq_head

the waitqueue to wait on

condition

a C expression for the event to wait for

lock

a locked spinlock_t, which will be released before cmd and schedule() and reacquired afterwards.

cmd

a command which is invoked outside the critical section before sleep

Description

The process is put to sleep (TASK_INTERRUPTIBLE) until the condition evaluates to true or a signal is received. The condition is checked each time the waitqueue wq_head is woken up.

wake_up() has to be called after changing any variable that could change the result of the wait condition.

This is supposed to be called while holding the lock. The lock is dropped before invoking the cmd and going to sleep and is reacquired afterwards.

The macro will return -ERESTARTSYS if it was interrupted by a signal and 0 if condition evaluated to true.

wait_event_interruptible_lock_irq(wq_head, condition, lock)

sleep until a condition gets true. The condition is checked under the lock. This is expected to be called with the lock taken.

Parameters

wq_head

the waitqueue to wait on

condition

a C expression for the event to wait for

lock

a locked spinlock_t, which will be released before schedule() and reacquired afterwards.

Description

The process is put to sleep (TASK_INTERRUPTIBLE) until the condition evaluates to true or a signal is received. The condition is checked each time the waitqueue wq_head is woken up.

wake_up() has to be called after changing any variable that could change the result of the wait condition.

This is supposed to be called while holding the lock. The lock is dropped before going to sleep and is reacquired afterwards.

The macro will return -ERESTARTSYS if it was interrupted by a signal and 0 if condition evaluated to true.

wait_event_interruptible_lock_irq_timeout(wq_head, condition, lock, timeout)

sleep until a condition gets true or a timeout elapses. The condition is checked under the lock. This is expected to be called with the lock taken.

Parameters

wq_head

the waitqueue to wait on

condition

a C expression for the event to wait for

lock

a locked spinlock_t, which will be released before schedule() and reacquired afterwards.

timeout

timeout, in jiffies

Description

The process is put to sleep (TASK_INTERRUPTIBLE) until the condition evaluates to true or a signal is received. The condition is checked each time the waitqueue wq_head is woken up.

wake_up() has to be called after changing any variable that could change the result of the wait condition.

This is supposed to be called while holding the lock. The lock is dropped before going to sleep and is reacquired afterwards.

The function returns 0 if the timeout elapsed, -ERESTARTSYS if it was interrupted by a signal, or the remaining jiffies if the condition evaluated to true before the timeout elapsed.

void __wake_up(struct wait_queue_head * wq_head, unsigned int mode, int nr_exclusive, void * key)

wake up threads blocked on a waitqueue.

Parameters

struct wait_queue_head * wq_head

the waitqueue

unsigned int mode

which threads

int nr_exclusive

how many wake-one or wake-many threads to wake up

void * key

is directly passed to the wakeup function

Description

If this function wakes up a task, it executes a full memory barrier before accessing the task state.
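The effect of nr_exclusive on a queue that mixes exclusive and non-exclusive waiters can be illustrated with a userspace model. This is a sketch under assumptions, not the kernel implementation: struct model_waiter and model_wake_up() are hypothetical, the mode and key arguments are ignored, every wake-up is assumed to succeed, and an nr_exclusive of 0 is taken to mean "wake everyone".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a wait queue entry. */
struct model_waiter {
    bool exclusive;  /* models WQ_FLAG_EXCLUSIVE */
    bool woken;
};

/*
 * Walk the waiters in queue order and wake each one reached. The walk
 * stops once nr_exclusive exclusive waiters have been woken; with
 * nr_exclusive == 0 the counter never hits zero, so all are woken.
 * Returns the number of waiters woken.
 */
size_t model_wake_up(struct model_waiter *q, size_t n, int nr_exclusive)
{
    assert(q || n == 0);
    size_t woken = 0;

    for (size_t i = 0; i < n; i++) {
        q[i].woken = true;
        woken++;
        if (q[i].exclusive && --nr_exclusive == 0)
            break;
    }
    return woken;
}
```

With two non-exclusive waiters queued ahead of three exclusive ones, a call with nr_exclusive set to 1 wakes the two non-exclusive waiters and the first exclusive waiter in this model, and leaves the rest asleep.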

void __wake_up_sync_key(struct wait_queue_head * wq_head, unsigned int mode, int nr_exclusive, void * key)

wake up threads blocked on a waitqueue.

Parameters

struct wait_queue_head * wq_head

the waitqueue

unsigned int mode

which threads

int nr_exclusive

how many wake-one or wake-many threads to wake up

void * key

opaque value to be passed to wakeup targets

Description

The sync wakeup differs in that the waker knows that it will schedule away soon, so while the target thread will be woken up, it will not be migrated to another CPU, i.e. the two threads are ‘synchronized’ with each other. This can prevent needless bouncing between CPUs.

On UP it can prevent extra preemption.

If this function wakes up a task, it executes a full memory barrier before accessing the task state.

void finish_wait(struct wait_queue_head * wq_head, struct wait_queue_entry * wq_entry)

clean up after waiting in a queue

Parameters

struct wait_queue_head * wq_head

waitqueue waited on

struct wait_queue_entry * wq_entry

wait descriptor

Description

Sets current thread back to running state and removes the wait descriptor from the given waitqueue if still queued.

High-resolution timers

ktime_t ktime_set(const s64 secs, const unsigned long nsecs)

Set a ktime_t variable from a seconds/nanoseconds value

Parameters

const s64 secs

seconds to set

const unsigned long nsecs

nanoseconds to set

Return

The ktime_t representation of the value.
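As an illustration, the conversion can be modelled in userspace by treating the result as a signed 64-bit count of nanoseconds. That representation, the saturation at the maximum value, and all model_* and MODEL_* names are assumptions of this sketch rather than statements from the text above.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t model_ktime_t;  /* assumed: signed 64-bit nanoseconds */

#define MODEL_NSEC_PER_SEC   INT64_C(1000000000)
#define MODEL_KTIME_MAX      INT64_MAX
#define MODEL_KTIME_SEC_MAX  (MODEL_KTIME_MAX / MODEL_NSEC_PER_SEC)

/* Userspace model of ktime_set(): combine seconds and nanoseconds. */
model_ktime_t model_ktime_set(int64_t secs, unsigned long nsecs)
{
    /* Model-only sanity check; the text above states no such limit. */
    assert(nsecs < 1000000000UL);

    if (secs >= MODEL_KTIME_SEC_MAX)
        return MODEL_KTIME_MAX;  /* saturate instead of overflowing */
    return secs * MODEL_NSEC_PER_SEC + (int64_t)nsecs;
}
```

For example, one second and 500 nanoseconds becomes the single value 1000000500 in this model.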

int ktime_compare(const ktime_t cmp1, const ktime_t cmp2)

Compares two ktime_t variables for less, greater or equal

Parameters

const ktime_t cmp1

comparable1

const ktime_t cmp2

comparable2

Return

cmp1 < cmp2: return <0
cmp1 == cmp2: return 0
cmp1 > cmp2: return >0

bool ktime_after(const ktime_t cmp1, const ktime_t cmp2)

Compare if a ktime_t value is bigger than another one.

Parameters

const ktime_t cmp1

comparable1

const ktime_t cmp2

comparable2

Return

true if cmp1 happened after cmp2.

bool ktime_before(const ktime_t cmp1, const ktime_t cmp2)

Compare if a ktime_t value is smaller than another one.

Parameters

const ktime_t cmp1

comparable1

const ktime_t cmp2

comparable2

Return

true if cmp1 happened before cmp2.

bool ktime_to_timespec_cond(const ktime_t kt, struct timespec * ts)

convert a ktime_t variable to timespec format only if the variable contains data

Parameters

const ktime_t kt

the ktime_t variable to convert

struct timespec * ts

the timespec variable to store the result in

Return

true if there was a successful conversion, false if kt was 0.
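The "convert only if non-zero" behaviour can be sketched in userspace as follows. The function name is hypothetical, the input is assumed to be a signed nanosecond count, and the handling of negative values (keeping tv_nsec in the range 0 to 999999999) is an assumption of this model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/*
 * Userspace model of ktime_to_timespec_cond(): a zero value is treated
 * as "no data" and *ts is left untouched.
 */
bool model_ktime_to_timespec_cond(int64_t kt_ns, struct timespec *ts)
{
    assert(ts);
    if (kt_ns == 0)
        return false;

    int64_t sec = kt_ns / 1000000000;
    int64_t nsec = kt_ns % 1000000000;
    if (nsec < 0) {          /* normalise negative remainders */
        sec--;
        nsec += 1000000000;
    }
    ts->tv_sec = (time_t)sec;
    ts->tv_nsec = (long)nsec;
    return true;
}
```

A caller can therefore use the boolean result to decide whether the output structure holds meaningful data.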

bool ktime_to_timespec64_cond(const ktime_t kt, struct timespec64 * ts)

convert a ktime_t variable to timespec64 format only if the variable contains data

Parameters

const ktime_t kt

the ktime_t variable to convert

struct timespec64 * ts

the timespec variable to store the result in

Return

true if there was a successful conversion, false if kt was 0.

struct hrtimer

the basic hrtimer structure

Definition

struct hrtimer {
  struct timerqueue_node          node;
  ktime_t _softexpires;
  enum hrtimer_restart            (*function)(struct hrtimer *);
  struct hrtimer_clock_base       *base;
  u8 state;
  u8 is_rel;
  u8 is_soft;
};

Members

node

timerqueue node, which also manages node.expires, the absolute expiry time in the hrtimer's internal representation. The time is related to the clock on which the timer is based. It is set up by adding slack to the _softexpires value. For non-range timers it is identical to _softexpires.

_softexpires

the absolute earliest expiry time of the hrtimer. The time which was given as expiry time when the timer was armed.

function

timer expiry callback function

base

pointer to the timer base (per cpu and per clock)

state

state information (See bit values above)

is_rel

Set if the timer was armed relative

is_soft

Set if hrtimer will be expired in soft interrupt context.

Description

The hrtimer structure must be initialized by hrtimer_init()

struct hrtimer_sleeper

simple sleeper structure

Definition

struct hrtimer_sleeper {
  struct hrtimer timer;
  struct task_struct *task;
};

Members

timer

embedded timer structure

task

task to wake up

Description

task is set to NULL, when the timer expires.

struct hrtimer_clock_base

the timer base for a specific clock

Definition

struct hrtimer_clock_base {
  struct hrtimer_cpu_base *cpu_base;
  unsigned int            index;
  clockid_t clockid;
  seqcount_t seq;
  struct hrtimer          *running;
  struct timerqueue_head  active;
  ktime_t (*get_time)(void);
  ktime_t offset;
};

Members

cpu_base

per cpu clock base

index

clock type index for per_cpu support when moving a timer to a base on another cpu.

clockid

clock id for per_cpu support

seq

seqcount around __run_hrtimer

running

pointer to the currently running hrtimer

active

red black tree root node for the active timers

get_time

function to retrieve the current time of the clock

offset

offset of this clock to the monotonic base

struct hrtimer_cpu_base

the per cpu clock bases

Definition

struct hrtimer_cpu_base {
  raw_spinlock_t lock;
  unsigned int                    cpu;
  unsigned int                    active_bases;
  unsigned int                    clock_was_set_seq;
  unsigned int                    hres_active             : 1,
                                  in_hrtirq               : 1,
                                  hang_detected           : 1,
                                  softirq_activated       : 1;
#ifdef CONFIG_HIGH_RES_TIMERS
  unsigned int                    nr_events;
  unsigned short                  nr_retries;
  unsigned short                  nr_hangs;
  unsigned int                    max_hang_time;
#endif
  ktime_t expires_next;
  struct hrtimer                  *next_timer;
  ktime_t softirq_expires_next;
  struct hrtimer                  *softirq_next_timer;
  struct hrtimer_clock_base       clock_base[HRTIMER_MAX_CLOCK_BASES];
};

Members

lock

lock protecting the base and associated clock bases and timers

cpu

cpu number

active_bases

Bitfield to mark bases with active timers

clock_was_set_seq

Sequence counter of clock was set events

hres_active

State of high resolution mode

in_hrtirq

hrtimer_interrupt() is currently executing

hang_detected

The last hrtimer interrupt detected a hang

softirq_activated

Indicates whether the softirq has been raised; if so, the softirq-related settings do not need to be updated.

nr_events

Total number of hrtimer interrupt events

nr_retries

Total number of hrtimer interrupt retries

nr_hangs

Total number of hrtimer interrupt hangs

max_hang_time

Maximum time spent in hrtimer_interrupt

expires_next

Absolute time of the next event, required for remote hrtimer enqueue. This is the earliest expiry time overall; both hard and soft hrtimers are taken into account.

next_timer

Pointer to the first expiring timer

softirq_expires_next

Time at which to check whether the soft queues also need to be expired

softirq_next_timer

Pointer to the first expiring softirq based timer

clock_base

array of clock bases for this cpu

Note

next_timer is just an optimization for __remove_hrtimer().

Do not dereference the pointer because it is not reliable on cross cpu removals.

void hrtimer_start(struct hrtimer * timer, ktime_t tim, const enum hrtimer_mode mode)

(re)start an hrtimer

Parameters

struct hrtimer * timer

the timer to be added

ktime_t tim

expiry time

const enum hrtimer_mode mode

timer mode: absolute (HRTIMER_MODE_ABS) or relative (HRTIMER_MODE_REL), and pinned (HRTIMER_MODE_PINNED); softirq based mode is considered for debug purpose only!

u64 hrtimer_forward_now(struct hrtimer * timer, ktime_t interval)

forward the timer expiry so it expires after now

Parameters

struct hrtimer * timer

hrtimer to forward

ktime_t interval

the interval to forward

Description

Forward the timer expiry so it will expire after the current time of the hrtimer clock base. Returns the number of overruns.

Can be safely called from the callback function of timer. If called from other contexts timer must neither be enqueued nor running the callback and the caller needs to take care of serialization.

Note

This only updates the timer expiry value and does not requeue the timer.

u64 hrtimer_forward(struct hrtimer * timer, ktime_t now, ktime_t interval)

forward the timer expiry

Parameters

struct hrtimer * timer

hrtimer to forward

ktime_t now

forward past this time

ktime_t interval

the interval to forward

Description

Forward the timer expiry so it will expire in the future. Returns the number of overruns.

Can be safely called from the callback function of timer. If called from other contexts timer must neither be enqueued nor running the callback and the caller needs to take care of serialization.

Note

This only updates the timer expiry value and does not requeue the timer.
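The overrun arithmetic shared by hrtimer_forward() and hrtimer_forward_now() can be sketched as a userspace model. The function name `model_forward()` and the use of plain nanosecond counts are assumptions made for illustration; the model also simply rejects non-positive intervals and leaves out the kernel's state checks.

```c
#include <stdint.h>

/*
 * Userspace model of forwarding an expiry time past "now".
 * All times are plain nanosecond counts. Returns the number of
 * overruns and updates *expires; returns 0 if nothing was changed.
 */
uint64_t model_forward(int64_t *expires, int64_t now, int64_t interval)
{
    int64_t delta = now - *expires;
    uint64_t overruns = 1;

    if (delta < 0)
        return 0;                 /* expiry is still in the future */
    if (interval <= 0)
        return 0;                 /* model restriction: need a real period */

    if (delta >= interval) {
        /* Skip every whole interval that has already elapsed. */
        int64_t whole = delta / interval;

        *expires += interval * whole;
        overruns += (uint64_t)whole;
    }
    *expires += interval;         /* one more step lands after "now" */
    return overruns;
}
```

With an expiry of 100, a current time of 350 and an interval of 100, the model returns 3 and moves the expiry to 400, the first period boundary after the current time.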

void hrtimer_start_range_ns(struct hrtimer * timer, ktime_t tim, u64 delta_ns, const enum hrtimer_mode mode)

(re)start an hrtimer

Parameters

struct hrtimer * timer

the timer to be added

ktime_t tim

expiry time

u64 delta_ns

“slack” range for the timer

const enum hrtimer_mode mode

timer mode: absolute (HRTIMER_MODE_ABS) or relative (HRTIMER_MODE_REL), and pinned (HRTIMER_MODE_PINNED); softirq based mode is considered for debug purpose only!

int hrtimer_try_to_cancel(struct hrtimer * timer)

try to deactivate a timer

Parameters

struct hrtimer * timer

hrtimer to stop

Return

  • 0 when the timer was not active

  • 1 when the timer was active

  • -1 when the timer is currently executing the callback function and cannot be stopped

int hrtimer_cancel(struct hrtimer * timer)

cancel a timer and wait for the handler to finish.

Parameters

struct hrtimer * timer

the timer to be cancelled

Return

  • 0 when the timer was not active

  • 1 when the timer was active
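One way to picture how hrtimer_cancel() relates to hrtimer_try_to_cancel() is a retry loop that repeats the attempt while the callback is still executing. The following is a userspace model under that assumption; `model_timer`, its states and the `callback_steps_left` counter are hypothetical, and a callback that re-arms its timer is not modelled.

```c
enum model_state { MODEL_INACTIVE, MODEL_ENQUEUED, MODEL_RUNNING };

struct model_timer {
    enum model_state state;
    int callback_steps_left;   /* how long the callback keeps running */
};

/* Mirrors the three documented results: 0, 1, or -1. */
int model_try_to_cancel(struct model_timer *t)
{
    switch (t->state) {
    case MODEL_RUNNING:
        return -1;                 /* callback executing: cannot be stopped */
    case MODEL_ENQUEUED:
        t->state = MODEL_INACTIVE;
        return 1;                  /* was active, now removed */
    default:
        return 0;                  /* was not active */
    }
}

/* Retry until the callback has finished, then report 0 or 1. */
int model_cancel(struct model_timer *t)
{
    int ret;

    while ((ret = model_try_to_cancel(t)) < 0) {
        /* Stand-in for waiting: the callback makes progress. */
        if (--t->callback_steps_left <= 0)
            t->state = MODEL_INACTIVE;
    }
    return ret;
}
```

Because the second function waits for a running callback, it must not be called from that callback itself; the non-waiting variant is the one that can report -1 instead.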

ktime_t __hrtimer_get_remaining(const struct hrtimer * timer, bool adjust)

get remaining time for the timer

Parameters

const struct hrtimer * timer

the timer to read

bool adjust

adjust relative timers when CONFIG_TIME_LOW_RES=y

void hrtimer_init(struct hrtimer * timer, clockid_t clock_id, enum hrtimer_mode mode)

initialize a timer to the given clock

Parameters

struct hrtimer * timer

the timer to be initialized

clockid_t clock_id

the clock to be used

enum hrtimer_mode mode

The modes which are relevant for initialization: HRTIMER_MODE_ABS, HRTIMER_MODE_REL, HRTIMER_MODE_ABS_SOFT, HRTIMER_MODE_REL_SOFT

Description

The PINNED variants of the above can be handed in, but the PINNED bit is ignored because pinning happens when the hrtimer is started.

int schedule_hrtimeout_range(ktime_t * expires, u64 delta, const enum hrtimer_mode mode)

sleep until timeout

Parameters

ktime_t * expires

timeout value (ktime_t)

u64 delta

slack in expires timeout, in nanoseconds

const enum hrtimer_mode mode

timer mode

Description

Make the current task sleep until the given expiry time has elapsed. The routine will return immediately unless the current task state has been set (see set_current_state()).

The delta argument gives the kernel the freedom to schedule the actual wakeup to a time that is both power and performance friendly. The kernel gives the normal best-effort behavior for “expires + delta”, and may decide to fire the timer earlier, but no earlier than expires.

You can set the task state as follows -

TASK_UNINTERRUPTIBLE - at least timeout time is guaranteed to pass before the routine returns unless the current task is explicitly woken up (e.g. by wake_up_process()).

TASK_INTERRUPTIBLE - the routine may return early if a signal is delivered to the current task or the current task is explicitly woken up.

The current task state is guaranteed to be TASK_RUNNING when this routine returns.

Returns 0 when the timer has expired. If the task was woken before the timer expired by a signal (only possible in state TASK_INTERRUPTIBLE) or by an explicit wakeup, it returns -EINTR.
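The wakeup window implied by expires and delta can be written down as a small userspace model. `model_window` and `model_slack_window()` are hypothetical names, times are plain nanosecond counts, and clamping on overflow is an assumption of the model rather than a documented guarantee.

```c
#include <stdint.h>

struct model_window {
    int64_t earliest_ns;   /* the timer never fires before this */
    int64_t latest_ns;     /* best-effort upper bound */
};

/* Build the [expires, expires + delta] window, clamping on overflow. */
struct model_window model_slack_window(int64_t expires_ns, uint64_t delta_ns)
{
    struct model_window w;

    w.earliest_ns = expires_ns;
    if (delta_ns > (uint64_t)INT64_MAX ||
        expires_ns > INT64_MAX - (int64_t)delta_ns)
        w.latest_ns = INT64_MAX;   /* clamp instead of wrapping around */
    else
        w.latest_ns = expires_ns + (int64_t)delta_ns;
    return w;
}
```

A larger delta widens the window and gives the kernel more room to batch this wakeup with others; a delta of 0 collapses the window to a single point.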

int schedule_hrtimeout(ktime_t * expires, const enum hrtimer_mode mode)

sleep until timeout

Parameters

ktime_t * expires

timeout value (ktime_t)

const enum hrtimer_mode mode

timer mode

Description

Make the current task sleep until the given expiry time has elapsed. The routine will return immediately unless the current task state has been set (see set_current_state()).

You can set the task state as follows -

TASK_UNINTERRUPTIBLE - at least timeout time is guaranteed to pass before the routine returns unless the current task is explicitly woken up (e.g. by wake_up_process()).

TASK_INTERRUPTIBLE - the routine may return early if a signal is delivered to the current task or the current task is explicitly woken up.

The current task state is guaranteed to be TASK_RUNNING when this routine returns.

Returns 0 when the timer has expired. If the task was woken before the timer expired by a signal (only possible in state TASK_INTERRUPTIBLE) or by an explicit wakeup, it returns -EINTR.

Workqueues and Kevents

struct workqueue_attrs

A struct for workqueue attributes.

Definition

struct workqueue_attrs {
  int nice;
  cpumask_var_t cpumask;
  bool no_numa;
};

Members

nice

nice level

cpumask

allowed CPUs

no_numa

disable NUMA affinity

Unlike other fields, no_numa isn’t a property of a worker_pool. It only modifies how apply_workqueue_attrs() selects pools and thus doesn’t participate in pool hash calculations or equality comparisons.

Description

This can be used to change attributes of an unbound workqueue.

work_pending(work)

Find out whether a work item is currently pending

Parameters

work

The work item in question

delayed_work_pending(w)

Find out whether a delayable work item is currently pending

Parameters

w

The work item in question

struct workqueue_struct * alloc_workqueue(const char * fmt, unsigned int flags, int max_active, ...)

allocate a workqueue

Parameters

const char * fmt

printf format for the name of the workqueue

unsigned int flags

WQ_* flags

int max_active

max in-flight work items, 0 for default

...

remaining args: args for fmt

Description

Allocate a workqueue with the specified parameters. For detailed information on WQ_* flags, please refer to Documentation/core-api/workqueue.rst.

Return

Pointer to the allocated workqueue on success, NULL on failure.

alloc_ordered_workqueue(fmt, flags, args…)

allocate an ordered workqueue

Parameters

fmt

printf format for the name of the workqueue

flags

WQ_* flags (only WQ_FREEZABLE and WQ_MEM_RECLAIM are meaningful)

args...

args for fmt

Description

Allocate an ordered workqueue. An ordered workqueue executes at most one work item at any given time in the queued order. They are implemented as unbound workqueues with max_active of one.

Return

Pointer to the allocated workqueue on success, NULL on failure.

bool queue_work(struct workqueue_struct * wq, struct work_struct * work)

queue work on a workqueue

Parameters

struct workqueue_struct * wq

workqueue to use

struct work_struct * work

work to queue

Description

Returns false if work was already on a queue, true otherwise.

We queue the work to the CPU on which it was submitted, but if the CPU dies it can be processed by another CPU.
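The return value follows from a "pending" flag on the work item: queueing succeeds only if the flag was not already set. The following userspace model illustrates this; `model_work`, `model_queue_work()` and `model_run_one()` are hypothetical names, and the model ignores CPUs, pools and locking entirely.

```c
#include <stdbool.h>

struct model_work {
    bool pending;              /* set while the item sits on a queue */
    int runs;                  /* how many times the handler ran */
};

/* Returns false if the item was already queued, true otherwise. */
bool model_queue_work(struct model_work *w)
{
    if (w->pending)
        return false;
    w->pending = true;
    return true;
}

/* Worker side: clear the pending flag first, then run the handler. */
void model_run_one(struct model_work *w)
{
    if (!w->pending)
        return;
    w->pending = false;        /* the item may be queued again from here on */
    w->runs++;                 /* stands in for calling the work function */
}
```

In this model, queueing the same item twice before it runs results in a single execution, which is why callers cannot use a work item to count events.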

bool queue_delayed_work(struct workqueue_struct * wq, struct delayed_work * dwork, unsigned long delay)

queue work on a workqueue after delay

Parameters

struct workqueue_struct * wq

workqueue to use

struct delayed_work * dwork

delayable work to queue

unsigned long delay

number of jiffies to wait before queueing

Description

Equivalent to queue_delayed_work_on() but tries to use the local CPU.

bool mod_delayed_work(struct workqueue_struct * wq, struct delayed_work * dwork, unsigned long delay)

modify delay of or queue a delayed work

Parameters

struct workqueue_struct * wq

workqueue to use

struct delayed_work * dwork

work to queue

unsigned long delay

number of jiffies to wait before queueing

Description

Equivalent to mod_delayed_work_on(), but uses the local CPU.

bool schedule_work_on(int cpu, struct work_struct * work)

put work task on a specific cpu

Parameters

int cpu

cpu to put the work task on

struct work_struct * work

job to be done

Description

This puts a job on a specific CPU.

bool schedule_work(struct work_struct * work)

put work task in global workqueue

Parameters

struct work_struct * work

job to be done

Description

Returns false if work was already on the kernel-global workqueue and true otherwise.

This puts a job in the kernel-global workqueue if it was not already queued and leaves it in the same position on the kernel-global workqueue otherwise.

void flush_scheduled_work(void)

ensure that any scheduled work has run to completion.

Parameters

void

no arguments

Description

Forces execution of the kernel-global workqueue and blocks until its completion.

Think twice before calling this function! It’s very easy to get into trouble if you don’t take great care. Either of the following situations will lead to deadlock:

  • One of the work items currently on the workqueue needs to acquire a lock held by your code or its caller.

  • Your code is running in the context of a work routine.

They will be detected by lockdep when they occur, but the first might not occur very often. It depends on what work items are on the workqueue and what locks they need, which you have no control over.

In most situations flushing the entire workqueue is overkill; you merely need to know that a particular work item isn’t queued and isn’t running. In such cases you should use cancel_delayed_work_sync() or cancel_work_sync() instead.

bool schedule_delayed_work_on(int cpu, struct delayed_work * dwork, unsigned long delay)

queue work in global workqueue on CPU after delay

Parameters

int cpu

cpu to use

struct delayed_work * dwork

job to be done

unsigned long delay

number of jiffies to wait

Description

After waiting for a given time this puts a job in the kernel-global workqueue on the specified CPU.

bool schedule_delayed_work(struct delayed_work * dwork, unsigned long delay)

put work task in global workqueue after delay

Parameters

struct delayed_work * dwork

job to be done

unsigned long delay

number of jiffies to wait or 0 for immediate execution

Description

After waiting for a given time this puts a job in the kernel-global workqueue.

bool queue_work_on(int cpu, struct workqueue_struct * wq, struct work_struct * work)

queue work on specific cpu

Parameters

int cpu

CPU number to execute work on

struct workqueue_struct * wq

workqueue to use

struct work_struct * work

work to queue

Description

We queue the work to a specific CPU; the caller must ensure that the CPU can’t go away.

Return

false if work was already on a queue, true otherwise.

bool queue_work_node(int node, struct workqueue_struct * wq, struct work_struct * work)

queue work on a “random” cpu for a given NUMA node

Parameters

int node

NUMA node that we are targeting the work for

struct workqueue_struct * wq

workqueue to use

struct work_struct * work

work to queue

Description

We queue the work to a “random” CPU within a given NUMA node. The basic idea here is to provide a way to somehow associate work with a given NUMA node.

This function will only make a best effort attempt at getting this onto the right NUMA node. If no node is requested or the requested node is offline then we just fall back to standard queue_work behavior.

Currently the “random” CPU ends up being the first available CPU in the intersection of cpu_online_mask and the cpumask of the node, unless we are running on the node. In that case we just use the current CPU.

Return

false if work was already on a queue, true otherwise.
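The CPU selection described above can be modelled with bitmasks. This is a userspace sketch only: `model_select_cpu_near()` and `MODEL_CPU_UNBOUND` are hypothetical names, and each mask is limited to 64 CPUs, with bit n standing for CPU n.

```c
#include <stdint.h>

#define MODEL_CPU_UNBOUND (-1)   /* stand-in for "no specific CPU" */

/*
 * node_mask:   CPUs that belong to the requested NUMA node.
 * online_mask: CPUs that are currently online.
 * current_cpu: the CPU the caller is running on.
 */
int model_select_cpu_near(uint64_t node_mask, uint64_t online_mask,
                          int current_cpu)
{
    uint64_t candidates = node_mask & online_mask;
    int cpu;

    /* Already running on the requested node: stay on this CPU. */
    if (current_cpu >= 0 && current_cpu < 64 &&
        ((node_mask >> current_cpu) & 1))
        return current_cpu;

    /* Otherwise take the first online CPU of the node. */
    for (cpu = 0; cpu < 64; cpu++)
        if ((candidates >> cpu) & 1)
            return cpu;

    return MODEL_CPU_UNBOUND;    /* fall back to the default placement */
}
```

The last case corresponds to the fallback to standard queue_work behavior when the node has no usable CPU.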

bool queue_delayed_work_on(int cpu, struct workqueue_struct * wq, struct delayed_work * dwork, unsigned long delay)

queue work on specific CPU after delay

Parameters

int cpu

CPU number to execute work on

struct workqueue_struct * wq

workqueue to use

struct delayed_work * dwork

work to queue

unsigned long delay

number of jiffies to wait before queueing

Return

false if work was already on a queue, true otherwise. If delay is zero and dwork is idle, it will be scheduled for immediate execution.

bool mod_delayed_work_on(int cpu, struct workqueue_struct * wq, struct delayed_work * dwork, unsigned long delay)

modify delay of or queue a delayed work on specific CPU

Parameters

int cpu

CPU number to execute work on

struct workqueue_struct * wq

workqueue to use

struct delayed_work * dwork

work to queue

unsigned long delay

number of jiffies to wait before queueing

Description

If dwork is idle, equivalent to queue_delayed_work_on(); otherwise, modify dwork’s timer so that it expires after delay. If delay is zero, work is guaranteed to be scheduled immediately regardless of its current state.

Return

false if dwork was idle and queued, true if dwork was pending and its timer was modified.

This function is safe to call from any context including IRQ handler. See try_to_grab_pending() for details.
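The difference between queueing and modifying a delayed work item, including the opposite sense of their return values, can be sketched in a userspace model. `model_dwork`, `model_queue_delayed()` and `model_mod_delayed()` are hypothetical names; time is a plain tick counter and the CPU argument is left out.

```c
#include <stdbool.h>

struct model_dwork {
    bool pending;
    unsigned long expires;     /* absolute expiry, in ticks */
};

/* Queue semantics: an already pending item keeps its old timer. */
bool model_queue_delayed(struct model_dwork *d, unsigned long now,
                         unsigned long delay)
{
    if (d->pending)
        return false;          /* already on a queue: nothing changes */
    d->pending = true;
    d->expires = now + delay;
    return true;
}

/* Mod semantics: always (re)arm; report whether it was pending. */
bool model_mod_delayed(struct model_dwork *d, unsigned long now,
                       unsigned long delay)
{
    bool was_pending = d->pending;

    d->pending = true;
    d->expires = now + delay;  /* the new delay replaces the old one */
    return was_pending;
}
```

Note that true means "newly queued" for the queue variant but "was already pending" for the mod variant, so the two results are not interchangeable.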

bool queue_rcu_work(struct workqueue_struct * wq, struct rcu_work * rwork)

queue work after an RCU grace period

Parameters

struct workqueue_struct * wq

workqueue to use

struct rcu_work * rwork

work to queue

Return

false if rwork was already pending, true otherwise. Note that a full RCU grace period is guaranteed only after a true return. While rwork is guaranteed to be executed after a false return, the execution may happen before a full RCU grace period has passed.

void flush_workqueue(struct workqueue_struct * wq)

ensure that any scheduled work has run to completion.

Parameters

struct workqueue_struct * wq

workqueue to flush

Description

This function sleeps until all work items which were queued on entry have finished execution, but it is not livelocked by new incoming ones.

void drain_workqueue(struct workqueue_struct * wq)

drain a workqueue

Parameters

struct workqueue_struct * wq

workqueue to drain

Description

Wait until the workqueue becomes empty. While draining is in progress, only chain queueing is allowed. In other words, only currently pending or running work items on wq can queue further work items on it. wq is flushed repeatedly until it becomes empty. The number of flushes is determined by the depth of chaining and should be relatively small. A warning is printed if it takes too long.

bool flush_work(struct work_struct * work)

wait for a work to finish executing the last queueing instance

Parameters

struct work_struct * work

the work to flush

Description

Wait until work has finished execution. work is guaranteed to be idle on return if it hasn’t been requeued since flush started.

Return

true if flush_work() waited for the work to finish execution, false if it was already idle.

bool cancel_work_sync(struct work_struct * work)

cancel a work and wait for it to finish

Parameters

struct work_struct * work

the work to cancel

Description

Cancel work and wait for its execution to finish. This function can be used even if the work re-queues itself or migrates to another workqueue. On return from this function, work is guaranteed to be not pending or executing on any CPU.

cancel_work_sync(delayed_work->work) must not be used for delayed_work’s. Use cancel_delayed_work_sync() instead.

The caller must ensure that the workqueue on which work was last queued can’t be destroyed before this function returns.

Return

true if work was pending, false otherwise.

bool flush_delayed_work(struct delayed_work * dwork)

wait for a dwork to finish executing the last queueing

Parameters

struct delayed_work * dwork

the delayed work to flush

Description

Delayed timer is cancelled and the pending work is queued for immediate execution. Like flush_work(), this function only considers the last queueing instance of dwork.

Return

true if flush_work() waited for the work to finish execution, false if it was already idle.

bool flush_rcu_work(struct rcu_work * rwork)

wait for a rwork to finish executing the last queueing

Parameters

struct rcu_work * rwork

the rcu work to flush

Return

true if flush_rcu_work() waited for the work to finish execution, false if it was already idle.

bool cancel_delayed_work(struct delayed_work * dwork)

cancel a delayed work

Parameters

struct delayed_work * dwork

delayed_work to cancel

Description

Kill off a pending delayed_work.

Return

true if dwork was pending and canceled; false if it wasn’t pending.

Note

The work callback function may still be running on return, unless this function returns true and the work doesn’t re-arm itself. Explicitly flush or use cancel_delayed_work_sync() to wait on it.

This function is safe to call from any context including IRQ handler.

bool cancel_delayed_work_sync(struct delayed_work * dwork)

cancel a delayed work and wait for it to finish

Parameters

struct delayed_work * dwork

the delayed work to cancel

Description

This is cancel_work_sync() for delayed works.

Return

true if dwork was pending, false otherwise.

int execute_in_process_context(work_func_t fn, struct execute_work * ew)

reliably execute the routine with user context

Parameters

work_func_t fn

the function to execute

struct execute_work * ew

guaranteed storage for the execute work structure (must be available when the work executes)

Description

Executes the function immediately if process context is available, otherwise schedules the function for delayed execution.

Return

0 - function was executed

1 - function was scheduled for execution
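The two return values correspond to two code paths, which a userspace model can make explicit. Everything here is a stand-in: `model_ework`, `model_count()` and `model_execute()` are hypothetical, and the `in_interrupt` parameter replaces the context check that the kernel function performs on its own.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct execute_work. */
struct model_ework {
    void (*deferred_fn)(struct model_ework *);  /* set when deferred */
    int calls;                                  /* times the function ran */
};

/* Example work function: counts its invocations. */
void model_count(struct model_ework *ew)
{
    ew->calls++;
}

/*
 * in_interrupt is a model parameter only.
 * Returns 0 if fn ran immediately, 1 if it was deferred.
 */
int model_execute(void (*fn)(struct model_ework *),
                  struct model_ework *ew, int in_interrupt)
{
    if (!in_interrupt) {
        fn(ew);                 /* process context: run right away */
        return 0;
    }
    ew->deferred_fn = fn;       /* ew must stay valid until this runs */
    return 1;
}
```

The deferred path stores state in ew, which illustrates why the storage passed in must remain available until the work has executed.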

void destroy_workqueue(struct workqueue_struct * wq)

safely terminate a workqueue

Parameters

struct workqueue_struct * wq

target workqueue

Description

Safely destroy a workqueue. All work currently pending will be done first.

void workqueue_set_max_active(struct workqueue_struct * wq, int max_active)

adjust max_active of a workqueue

Parameters

struct workqueue_struct * wq

target workqueue

int max_active

new max_active value.

Description

Set max_active of wq to max_active.

Context

Don’t call from IRQ context.

struct work_struct * current_work(void)

retrieve current task’s work struct

Parameters

void

no arguments

Description

Determine if current task is a workqueue worker and what it’s working on. Useful to find out the context that the current task is running in.

Return

work struct if current task is a workqueue worker, NULL otherwise.

bool workqueue_congested(int cpu, struct workqueue_struct * wq)

test whether a workqueue is congested

Parameters

int cpu

CPU in question

struct workqueue_struct * wq

target workqueue

Description

Test whether wq’s cpu workqueue for cpu is congested. There is no synchronization around this function and the test result is unreliable and only useful as advisory hints or for debugging.

If cpu is WORK_CPU_UNBOUND, the test is performed on the local CPU. Note that both per-cpu and unbound workqueues may be associated with multiple pool_workqueues which have separate congested states. A workqueue being congested on one CPU doesn’t mean the workqueue is also congested on other CPUs / NUMA nodes.

Return

true if congested, false otherwise.

unsigned int work_busy(struct work_struct * work)

test whether a work is currently pending or running

Parameters

struct work_struct * work

the work to be tested

Description

Test whether work is currently pending or running. There is no synchronization around this function and the test result is unreliable and only useful as advisory hints or for debugging.

Return

OR’d bitmask of WORK_BUSY_* bits.

void set_worker_desc(const char * fmt, ...)

set description for the current work item

Parameters

const char * fmt

printf-style format string

...

arguments for the format string

Description

This function can be called by a running work function to describe what the work item is about. If the worker task gets dumped, this information will be printed out together to help debugging. The description can be at most WORKER_DESC_LEN bytes long, including the trailing ‘\0’.
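The length limit means that longer descriptions are silently truncated. A userspace model of that behavior follows; `model_worker`, `model_set_desc()` and the buffer size of 24 are assumptions made for illustration, and the real value of WORKER_DESC_LEN may differ between kernel versions.

```c
#include <stdarg.h>
#include <stdio.h>

#define MODEL_DESC_LEN 24   /* assumed size; includes the trailing '\0' */

struct model_worker {
    char desc[MODEL_DESC_LEN];
};

/* Format a description into the fixed-size buffer. */
void model_set_desc(struct model_worker *w, const char *fmt, ...)
{
    va_list args;

    va_start(args, fmt);
    /* vsnprintf() truncates to the buffer size and always terminates. */
    vsnprintf(w->desc, sizeof(w->desc), fmt, args);
    va_end(args);
}
```

In this model at most 23 characters of the formatted text survive, so the most distinguishing part of a description is best placed first.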

long work_on_cpu(int cpu, long (*fn)(void *), void * arg)

run a function in thread context on a particular cpu

Parameters

int cpu

the cpu to run on

long (*)(void *) fn

the function to run

void * arg

the function arg

Description

It is up to the caller to ensure that the cpu doesn’t go offline. The caller must not hold any locks which would prevent fn from completing.

Return

The value fn returns.

long work_on_cpu_safe(int cpu, long (*fn)(void *), void * arg)

run a function in thread context on a particular cpu

Parameters

int cpu

the cpu to run on

long (*)(void *) fn

the function to run

void * arg

the function argument

Description

Disables CPU hotplug and calls work_on_cpu(). The caller must not hold any locks which would prevent fn from completing.

Return

The value fn returns.

Internal Functions

int wait_task_stopped(struct wait_opts * wo, int ptrace, struct task_struct * p)

Wait for TASK_STOPPED or TASK_TRACED

Parameters

struct wait_opts * wo

wait options

int ptrace

is the wait for ptrace

struct task_struct * p

task to wait for

Description

Handle sys_wait4() work for p in state TASK_STOPPED or TASK_TRACED.

Context

read_lock(tasklist_lock), which is released if return value is non-zero. Also, grabs and releases p->sighand->siglock.

Return

0 if wait condition didn’t exist and search for other wait conditions should continue. Non-zero return, -errno on failure and p’s pid on success, implies that tasklist_lock is released and wait condition search should terminate.

bool task_set_jobctl_pending(struct task_struct * task, unsigned long mask)

set jobctl pending bits

Parameters

struct task_struct * task

target task

unsigned long mask

pending bits to set

Description

Set mask in task->jobctl. mask must be a subset of JOBCTL_PENDING_MASK | JOBCTL_STOP_CONSUME | JOBCTL_STOP_SIGMASK | JOBCTL_TRAPPING. If a stop signo is being set, the existing signo is cleared. If task is already being killed or exiting, this function becomes a no-op.

Context

Must be called with task->sighand->siglock held.

Return

true if mask is set, false if the call was a no-op because task was dying.

void task_clear_jobctl_trapping(struct task_struct * task)

clear jobctl trapping bit

Parameters

struct task_struct * task

target task

Description

If JOBCTL_TRAPPING is set, a ptracer is waiting for us to enter TRACED. Clear it and wake up the ptracer. Note that we don’t need any further locking. task->siglock guarantees that task->parent points to the ptracer.

Context

Must be called with task->sighand->siglock held.

void task_clear_jobctl_pending(struct task_struct * task, unsigned long mask)

clear jobctl pending bits

Parameters

struct task_struct * task

target task

unsigned long mask

pending bits to clear

Description

Clear mask from task->jobctl. mask must be a subset of JOBCTL_PENDING_MASK. If JOBCTL_STOP_PENDING is being cleared, other STOP bits are cleared together.

If clearing of mask leaves no stop or trap pending, this function calls task_clear_jobctl_trapping().

Context

Must be called with task->sighand->siglock held.

bool task_participate_group_stop(struct task_struct * task)

participate in a group stop

Parameters

struct task_struct * task

task participating in a group stop

Description

task has JOBCTL_STOP_PENDING set and is participating in a group stop. Group stop states are cleared and the group stop count is consumed if JOBCTL_STOP_CONSUME was set. If the consumption completes the group stop, the appropriate SIGNAL_* flags are set.

Context

Must be called with task->sighand->siglock held.

Return

true if group stop completion should be notified to the parent, false otherwise.

void ptrace_trap_notify(struct task_struct * t)

schedule trap to notify ptracer

Parameters

struct task_struct * t

tracee wanting to notify tracer

Description

This function schedules sticky ptrace trap which is cleared on the next TRAP_STOP to notify ptracer of an event. t must have been seized by ptracer.

If t is running, STOP trap will be taken. If trapped for STOP and ptracer is listening for events, tracee is woken up so that it can re-trap for the new event. If trapped otherwise, STOP trap will be eventually taken without returning to userland after the existing traps are finished by PTRACE_CONT.

Context

Must be called with task->sighand->siglock held.

void do_notify_parent_cldstop(struct task_struct * tsk, bool for_ptracer, int why)

notify parent of stopped/continued state change

Parameters

struct task_struct * tsk

task reporting the state change

bool for_ptracer

the notification is for ptracer

int why

CLD_{CONTINUED|STOPPED|TRAPPED} to report

Description

Notify tsk’s parent that the stopped/continued state has changed. If for_ptracer is false, tsk’s group leader notifies its real parent. If true, tsk reports to tsk->parent which should be the ptracer.

Context

Must be called with tasklist_lock at least read locked.

bool do_signal_stop(int signr)

handle group stop for SIGSTOP and other stop signals

Parameters

int signr

signr causing group stop if initiating

Description

If JOBCTL_STOP_PENDING is not set yet, initiate group stop with signr and participate in it. If already set, participate in the existing group stop. If participated in a group stop (and thus slept), true is returned with siglock released.

If ptraced, this function doesn’t handle stop itself. Instead, JOBCTL_TRAP_STOP is scheduled and false is returned with siglock untouched. The caller must ensure that INTERRUPT trap handling takes place afterwards.

Context

Must be called with current->sighand->siglock held, which is released on true return.

Return

false if group stop is already cancelled or ptrace trap is scheduled. true if participated in group stop.

void do_jobctl_trap(void)

take care of ptrace jobctl traps

Parameters

void

no arguments

Description

When PT_SEIZED, it’s used for both group stop and explicit SEIZE/INTERRUPT traps. Both generate PTRACE_EVENT_STOP trap with accompanying siginfo. If stopped, lower eight bits of exit_code contain the stop signal; otherwise, SIGTRAP.

When !PT_SEIZED, it’s used only for group stop trap with stop signal number as exit_code and no siginfo.

Context

Must be called with current->sighand->siglock held, which may be released and re-acquired before returning with intervening sleep.

void do_freezer_trap(void)

handle the freezer jobctl trap

Parameters

void

no arguments

Description

Puts the task into the frozen state, but only if the task is not about to quit. In this case it drops JOBCTL_TRAP_FREEZE.

Context

Must be called with current->sighand->siglock held, which is always released before returning.

void signal_delivered(struct ksignal * ksig, int stepping)

Parameters

struct ksignal * ksig

kernel signal struct

int stepping

nonzero if debugger single-step or block-step in use

Description

This function should be called when a signal has successfully been delivered. It updates the blocked signals accordingly (ksig->ka.sa.sa_mask is always blocked, and the signal itself is blocked unless SA_NODEFER is set in ksig->ka.sa.sa_flags). Tracing is notified.

long sys_restart_syscall(void)

restart a system call

Parameters

void

no arguments

void set_current_blocked(sigset_t * newset)

change current->blocked mask

Parameters

sigset_t * newset

new mask

Description

It is wrong to change ->blocked directly, this helper should be used to ensure the process can’t miss a shared signal we are going to block.

long sys_rt_sigprocmask(int how, sigset_t __user * nset, sigset_t __user * oset, size_t sigsetsize)

change the list of currently blocked signals

Parameters

int how

whether to add, remove, or set signals

sigset_t __user * nset

signals to add, remove, or set (if non-null)

sigset_t __user * oset

previous value of signal mask if non-null

size_t sigsetsize

size of sigset_t type
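From userspace this system call is normally reached through the C library's sigprocmask() wrapper. The sketch below shows the three parameters in use; the helper name `demo_block_sigusr1()` and the choice of SIGUSR1 are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/*
 * Block SIGUSR1 for the caller, check the mask, then restore it.
 * Returns 1 if SIGUSR1 was blocked in between, 0 if not, -1 on error.
 */
int demo_block_sigusr1(void)
{
    sigset_t set, old, during;
    int blocked;

    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);

    if (sigprocmask(SIG_BLOCK, &set, &old) != 0)      /* add to the mask */
        return -1;
    if (sigprocmask(SIG_BLOCK, NULL, &during) != 0)   /* query only */
        return -1;
    blocked = sigismember(&during, SIGUSR1);

    if (sigprocmask(SIG_SETMASK, &old, NULL) != 0)    /* restore old mask */
        return -1;
    return blocked;
}
```

Passing a null new set, as in the second call, leaves the mask unchanged and only reports the current value through the old-set argument.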

long sys_rt_sigpending(sigset_t __user * uset, size_t sigsetsize)

examine a pending signal that has been raised while blocked

Parameters

sigset_t __user * uset

stores pending signals

size_t sigsetsize

size of sigset_t type or larger
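A userspace view of pending signals, reached through the C library's sigpending() wrapper, might look as follows. The helper name `demo_pending_sigusr1()` and the use of SIGUSR1 are assumptions; the pending signal is discarded at the end by ignoring it briefly, so that nothing is delivered once the mask is restored.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/*
 * Block SIGUSR1, raise it, and report whether it shows up as pending.
 * Returns 1 if pending, 0 if not, -1 on error.
 */
int demo_pending_sigusr1(void)
{
    sigset_t set, old, pending;
    struct sigaction ign, saved;
    int is_pending;

    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &set, &old) != 0)
        return -1;

    raise(SIGUSR1);                       /* stays pending while blocked */
    if (sigpending(&pending) != 0)
        return -1;
    is_pending = sigismember(&pending, SIGUSR1);

    /* Discard the pending signal by ignoring it, then restore everything. */
    ign.sa_handler = SIG_IGN;
    ign.sa_flags = 0;
    sigemptyset(&ign.sa_mask);
    sigaction(SIGUSR1, &ign, &saved);
    sigaction(SIGUSR1, &saved, NULL);
    sigprocmask(SIG_SETMASK, &old, NULL);
    return is_pending;
}
```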

int do_sigtimedwait(const sigset_t * which, kernel_siginfo_t * info, const struct timespec64 * ts)

wait for queued signals specified in which

Parameters

const sigset_t * which

queued signals to wait for

kernel_siginfo_t * info

if non-null, the signal’s siginfo is returned here

const struct timespec64 * ts

upper bound on process time suspension

long sys_rt_sigtimedwait(const sigset_t __user * uthese, siginfo_t __user * uinfo, const struct __kernel_timespec __user * uts, size_t sigsetsize)

synchronously wait for queued signals specified in uthese

Parameters

const sigset_t __user * uthese

queued signals to wait for

siginfo_t __user * uinfo

if non-null, the signal’s siginfo is returned here

const struct __kernel_timespec __user * uts

upper bound on process time suspension

size_t sigsetsize

size of sigset_t type
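The timeout acts as an upper bound on how long the caller is suspended. A userspace sketch using the C library's sigtimedwait() wrapper is shown below; the helper name `demo_wait_sigusr1()`, the use of SIGUSR1 and the 10 ms timeout are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <time.h>

/*
 * Wait up to 10 ms for SIGUSR1, optionally raising it beforehand.
 * Returns the signal number received, or -1 (for example on timeout).
 */
int demo_wait_sigusr1(int send_first)
{
    sigset_t set, old;
    siginfo_t info;
    struct timespec timeout = { 0, 10 * 1000 * 1000 };  /* 10 ms */
    int sig;

    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &set, &old) != 0)   /* must be blocked first */
        return -1;
    if (send_first)
        raise(SIGUSR1);

    sig = sigtimedwait(&set, &info, &timeout);     /* dequeues a pending signal */
    sigprocmask(SIG_SETMASK, &old, NULL);
    return sig;
}
```

The signals being waited for are blocked first so that they are queued for the wait instead of being delivered to a handler or the default action.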

long sys_kill(pid_t pid, int sig)

send a signal to a process

Parameters

pid_t pid

the PID of the process

int sig

signal to be sent

long sys_pidfd_send_signal(int pidfd, int sig, siginfo_t __user * info, unsigned int flags)

Signal a process through a pidfd

Parameters

int pidfd

file descriptor of the process

int sig

signal to send

siginfo_t __user * info

signal info

unsigned int flags

future flags

Description

The syscall currently only signals via PIDTYPE_PID which covers kill(<positive-pid>, <signal>). It does not signal threads or process groups. In order to extend the syscall to threads and process groups the flags argument should be used. In essence, the flags argument will determine what is signaled and not the file descriptor itself. Put in other words, grouping is a property of the flags argument not a property of the file descriptor.

Return

0 on success, negative errno on failure

long sys_tgkill(pid_t tgid, pid_t pid, int sig)

send signal to one specific thread

Parameters

pid_t tgid

the thread group ID of the thread

pid_t pid

the PID of the thread

int sig

signal to be sent

Description

This syscall also checks the tgid and returns -ESRCH if the PID exists but no longer belongs to the target process. This method solves the problem of threads exiting and PIDs getting reused.
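The tgid check can be observed from userspace on Linux by probing the calling thread with signal 0, which performs the checks without delivering anything. The helper names `demo_tgkill_probe()` and `demo_gettid()` are hypothetical; the raw syscall() interface is used here in case the C library provides no wrapper.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Probe a thread with signal 0 (existence check, nothing is delivered).
 * Returns 0 on success or the errno value on failure.
 */
int demo_tgkill_probe(pid_t tgid, pid_t tid)
{
    if (syscall(SYS_tgkill, tgid, tid, 0) == 0)
        return 0;
    return errno;
}

/* Thread ID of the caller, via the raw system call. */
pid_t demo_gettid(void)
{
    return (pid_t)syscall(SYS_gettid);
}
```

Probing the caller's own thread ID together with its real thread group ID succeeds, while pairing the same thread ID with a different thread group ID is expected to fail with ESRCH even though the thread exists.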

long sys_tkill(pid_t pid, int sig)

send signal to one specific task

Parameters

pid_t pid

the PID of the task

int sig

signal to be sent

Description

Send a signal to only one task, even if it’s a CLONE_THREAD task.

long sys_rt_sigqueueinfo(pid_t pid, int sig, siginfo_t __user * uinfo)

send signal information to a process

Parameters

pid_t pid

the PID of the thread

int sig

signal to be sent

siginfo_t __user * uinfo

signal info to be sent

long sys_sigpending(old_sigset_t __user * uset)

examine pending signals

Parameters

old_sigset_t __user * uset

where the mask of pending signals is returned

long sys_sigprocmask(int how, old_sigset_t __user * nset, old_sigset_t __user * oset)

examine and change blocked signals

Parameters

int how

whether to add, remove, or set signals

old_sigset_t __user * nset

signals to add or remove (if non-null)

old_sigset_t __user * oset

previous value of signal mask if non-null

Description

Some platforms have their own version with special arguments; others support only sys_rt_sigprocmask.
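
Example (userspace sketch, not kernel code): the how argument and the pending mask can be observed through the POSIX wrappers sigprocmask() and sigpending(). On many systems the C library is assumed to implement these on top of the rt_ variants, but the semantics shown are the same. The helper name blocked_signal_is_pending() is made up for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/*
 * Block signo, raise it, and report whether it shows up as pending.
 * The signal is then consumed with sigwait() and the old mask restored,
 * so the caller's signal state is left as it was found.
 * Returns 1 if pending, 0 if not, -1 on error.
 */
static int blocked_signal_is_pending(int signo)
{
	sigset_t block, old, pending;
	int got = 0;
	int is_pending;

	sigemptyset(&block);
	sigaddset(&block, signo);

	/* how = SIG_BLOCK adds to the mask; old receives the previous mask. */
	if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
		return -1;
	if (raise(signo) != 0)
		return -1;
	if (sigpending(&pending) != 0)
		return -1;
	is_pending = sigismember(&pending, signo);

	/* Drain the signal before unblocking so its default action never runs. */
	if (is_pending == 1) {
		if (sigwait(&block, &got) != 0)
			return -1;
		assert(got == signo);
	}

	/* how = SIG_SETMASK replaces the mask wholesale. */
	if (sigprocmask(SIG_SETMASK, &old, NULL) != 0)
		return -1;
	return is_pending;
}
```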

long sys_rt_sigaction(int sig, const struct sigaction __user * act, struct sigaction __user * oact, size_t sigsetsize)

alter an action taken by a process

Parameters

int sig

signal whose action is to be altered

const struct sigaction __user * act

new sigaction

struct sigaction __user * oact

used to save the previous sigaction

size_t sigsetsize

size of sigset_t type

long sys_rt_sigsuspend(sigset_t __user * unewset, size_t sigsetsize)

temporarily replace the signal mask with the unewset value until a signal is received

Parameters

sigset_t __user * unewset

new signal mask value

size_t sigsetsize

size of sigset_t type

kthread_create(threadfn, data, namefmt, arg...)

create a kthread on the current node

Parameters

threadfn

the function to run in the thread

data

data pointer for threadfn()

namefmt

printf-style format string for the thread name

arg...

arguments for namefmt.

Description

This macro will create a kthread on the current node, leaving it in the stopped state. This is just a helper for kthread_create_on_node(); see the documentation there for more details.

kthread_run(threadfn, data, namefmt, ...)

create and wake a thread.

Parameters

threadfn

the function to run until signal_pending(current).

data

data ptr for threadfn.

namefmt

printf-style name for the thread.

...

variable arguments

Description

Convenient wrapper for kthread_create() followed by wake_up_process(). Returns the kthread or ERR_PTR(-ENOMEM).

bool kthread_should_stop(void)

should this kthread return now?

Parameters

void

no arguments

Description

When someone calls kthread_stop() on your kthread, it will be woken and this will return true. You should then return, and your return value will be passed through to kthread_stop().

bool kthread_should_park(void)

should this kthread park now?

Parameters

void

no arguments

Description

When someone calls kthread_park() on your kthread, it will be woken and this will return true. You should then do the necessary cleanup and call kthread_parkme().

Similar to kthread_should_stop(), but this keeps the thread alive and in a park position. kthread_unpark() “restarts” the thread and calls the thread function again.

bool kthread_freezable_should_stop(bool * was_frozen)

should this freezable kthread return now?

Parameters

bool * was_frozen

optional out parameter, indicates whether current was frozen

Description

kthread_should_stop() for freezable kthreads, which will enter refrigerator if necessary. This function is safe from kthread_stop() / freezer deadlock and freezable kthreads should use this function instead of calling try_to_freeze() directly.

struct task_struct * kthread_create_on_node(int (*threadfn)(void *data), void * data, int node, const char namefmt[], ...)

create a kthread.

Parameters

int (*)(void *data) threadfn

the function to run until signal_pending(current).

void * data

data ptr for threadfn.

int node

task and thread structures for the thread are allocated on this node

const char namefmt[]

printf-style name for the thread.

...

variable arguments

Description

This helper function creates and names a kernel thread. The thread will be stopped: use wake_up_process() to start it. See also kthread_run(). The new thread has SCHED_NORMAL policy and is affine to all CPUs.

If thread is going to be bound on a particular cpu, give its node in node, to get NUMA affinity for kthread stack, or else give NUMA_NO_NODE. When woken, the thread will run threadfn() with data as its argument. threadfn() can either call do_exit() directly if it is a standalone thread for which no one will call kthread_stop(), or return when ‘kthread_should_stop()’ is true (which means kthread_stop() has been called). The return value should be zero or a negative error number; it will be passed to kthread_stop().

Returns a task_struct or ERR_PTR(-ENOMEM) or ERR_PTR(-EINTR).

void kthread_bind(struct task_struct * p, unsigned int cpu)

bind a just-created kthread to a cpu.

Parameters

struct task_struct * p

thread created by kthread_create().

unsigned int cpu

cpu (might not be online, must be possible) for p to run on.

Description

This function is equivalent to set_cpus_allowed(), except that cpu doesn’t need to be online, and the thread must be stopped (i.e., just returned from kthread_create()).

void kthread_unpark(struct task_struct * k)

unpark a thread created by kthread_create().

Parameters

struct task_struct * k

thread created by kthread_create().

Description

Sets kthread_should_park() for k to return false, wakes it, and waits for it to return. If the thread is marked percpu then it is bound to the cpu again.

int kthread_park(struct task_struct * k)

park a thread created by kthread_create().

Parameters

struct task_struct * k

thread created by kthread_create().

Description

Sets kthread_should_park() for k to return true, wakes it, and waits for it to return. This can also be called after kthread_create() instead of calling wake_up_process(): the thread will park without calling threadfn().

Returns 0 if the thread is parked, -ENOSYS if the thread exited. If called by the kthread itself just the park bit is set.

int kthread_stop(struct task_struct * k)

stop a thread created by kthread_create().

Parameters

struct task_struct * k

thread created by kthread_create().

Description

Sets kthread_should_stop() for k to return true, wakes it, and waits for it to exit. This can also be called after kthread_create() instead of calling wake_up_process(): the thread will exit without calling threadfn().

If threadfn() may call do_exit() itself, the caller must ensure task_struct can’t go away.

Returns the result of threadfn(), or -EINTR if wake_up_process() was never called.

int kthread_worker_fn(void * worker_ptr)

kthread function to process kthread_worker

Parameters

void * worker_ptr

pointer to initialized kthread_worker

Description

This function implements the main cycle of kthread worker. It processes work_list until it is stopped with kthread_stop(). It sleeps when the queue is empty.

The works are not allowed to keep any locks, or to leave preemption or interrupts disabled, when they finish. A safe point for freezing is defined between the end of one work and the start of the next.

Also the works must not be handled by more than one worker at the same time, see also kthread_queue_work().

struct kthread_worker * kthread_create_worker(unsigned int flags, const char namefmt[], ...)

create a kthread worker

Parameters

unsigned int flags

flags modifying the default behavior of the worker

const char namefmt[]

printf-style name for the kthread worker (task).

...

variable arguments

Description

Returns a pointer to the allocated worker on success, ERR_PTR(-ENOMEM) when the needed structures could not get allocated, and ERR_PTR(-EINTR) when the worker was SIGKILLed.

struct kthread_worker * kthread_create_worker_on_cpu(int cpu, unsigned int flags, const char namefmt[], ...)

create a kthread worker and bind it to a given CPU and the associated NUMA node.

Parameters

int cpu

CPU number

unsigned int flags

flags modifying the default behavior of the worker

const char namefmt[]

printf-style name for the kthread worker (task).

...

variable arguments

Description

Use a valid CPU number if you want to bind the kthread worker to the given CPU and the associated NUMA node.

A good practice is to also add the cpu number to the worker name. For example, use kthread_create_worker_on_cpu(cpu, flags, "helper/%d", cpu).

Returns a pointer to the allocated worker on success, ERR_PTR(-ENOMEM) when the needed structures could not get allocated, and ERR_PTR(-EINTR) when the worker was SIGKILLed.

bool kthread_queue_work(struct kthread_worker * worker, struct kthread_work * work)

queue a kthread_work

Parameters

struct kthread_worker * worker

target kthread_worker

struct kthread_work * work

kthread_work to queue

Description

Queue work to worker for async execution. worker must have been created with kthread_create_worker(). Returns true if work was successfully queued, false if it was already pending.

Reinitialize the work if it needs to be used by another worker. For example, when the worker was stopped and started again.
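
Example (toy model, not the kernel implementation): the return-value contract described above (true when newly queued, false when already pending) can be modelled in plain single-threaded userspace C. All toy_ names are hypothetical, and the model has no locking, no thread and no freezer handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; not the kernel structures. */
struct toy_work {
	void (*func)(struct toy_work *work);
	struct toy_work *next;
	bool pending;
};

struct toy_worker {
	struct toy_work *head;
	struct toy_work *tail;
};

/* Returns true if work was newly queued, false if it was already pending. */
static bool toy_queue_work(struct toy_worker *worker, struct toy_work *work)
{
	assert(worker && work && work->func);
	if (work->pending)
		return false;
	work->pending = true;
	work->next = NULL;
	if (worker->tail)
		worker->tail->next = work;
	else
		worker->head = work;
	worker->tail = work;
	return true;
}

/* Run everything that is queued, in FIFO order; returns the number run. */
static int toy_run_pending(struct toy_worker *worker)
{
	int ran = 0;

	while (worker->head) {
		struct toy_work *work = worker->head;

		worker->head = work->next;
		if (!worker->head)
			worker->tail = NULL;
		/* Clear pending first: from here on the work can be queued again. */
		work->pending = false;
		work->func(work);
		ran++;
	}
	return ran;
}

/* Sample callback that only counts how often it ran. */
static int toy_calls;

static void toy_count(struct toy_work *work)
{
	(void)work;
	toy_calls++;
}
```

Queuing the same work twice before it runs yields true and then false; once it has run, it can be queued again.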

void kthread_delayed_work_timer_fn(struct timer_list * t)

callback that queues the associated kthread delayed work when the timer expires.

Parameters

struct timer_list * t

pointer to the expired timer

Description

The format of the function is defined by struct timer_list. It should have been called from irqsafe timer with irq already off.

bool kthread_queue_delayed_work(struct kthread_worker * worker, struct kthread_delayed_work * dwork, unsigned long delay)

queue the associated kthread work after a delay.

Parameters

struct kthread_worker * worker

target kthread_worker

struct kthread_delayed_work * dwork

kthread_delayed_work to queue

unsigned long delay

number of jiffies to wait before queuing

Description

If the work has not been pending it starts a timer that will queue the work after the given delay. If delay is zero, it queues the work immediately.

Return

false if the work has already been pending. It means that either the timer was running or the work was queued. It returns true otherwise.

void kthread_flush_work(struct kthread_work * work)

flush a kthread_work

Parameters

struct kthread_work * work

work to flush

Description

If work is queued or executing, wait for it to finish execution.

bool kthread_mod_delayed_work(struct kthread_worker * worker, struct kthread_delayed_work * dwork, unsigned long delay)

modify delay of or queue a kthread delayed work

Parameters

struct kthread_worker * worker

kthread worker to use

struct kthread_delayed_work * dwork

kthread delayed work to queue

unsigned long delay

number of jiffies to wait before queuing

Description

If dwork is idle, equivalent to kthread_queue_delayed_work(). Otherwise, modify dwork’s timer so that it expires after delay. If delay is zero, work is guaranteed to be queued immediately.

Return

true if dwork was pending and its timer was modified, false otherwise.

A special case is when the work is being canceled in parallel. It might be caused either by the real kthread_cancel_delayed_work_sync() or yet another kthread_mod_delayed_work() call. We let the other command win and return false here. The caller is supposed to synchronize these operations in a reasonable way.

This function is safe to call from any context including IRQ handler. See __kthread_cancel_work() and kthread_delayed_work_timer_fn() for details.

bool kthread_cancel_work_sync(struct kthread_work * work)

cancel a kthread work and wait for it to finish

Parameters

struct kthread_work * work

the kthread work to cancel

Description

Cancel work and wait for its execution to finish. This function can be used even if the work re-queues itself. On return from this function, work is guaranteed to be not pending or executing on any CPU.

kthread_cancel_work_sync(delayed_work->work) must not be used for delayed_work’s. Use kthread_cancel_delayed_work_sync() instead.

The caller must ensure that the worker on which work was last queued can’t be destroyed before this function returns.

Return

true if work was pending, false otherwise.

bool kthread_cancel_delayed_work_sync(struct kthread_delayed_work * dwork)

cancel a kthread delayed work and wait for it to finish.

Parameters

struct kthread_delayed_work * dwork

the kthread delayed work to cancel

Description

This is kthread_cancel_work_sync() for delayed works.

Return

true if dwork was pending, false otherwise.

void kthread_flush_worker(struct kthread_worker * worker)

flush all current works on a kthread_worker

Parameters

struct kthread_worker * worker

worker to flush

Description

Wait until all currently executing or pending works on worker are finished.

void kthread_destroy_worker(struct kthread_worker * worker)

destroy a kthread worker

Parameters

struct kthread_worker * worker

worker to be destroyed

Description

Flush and destroy worker. The simple flush is enough because the kthread worker API is used only in trivial scenarios. There are no multi-step state machines needed.

void kthread_associate_blkcg(struct cgroup_subsys_state * css)

associate blkcg to current kthread

Parameters

struct cgroup_subsys_state * css

the cgroup info

Description

Current thread must be a kthread. The thread is running jobs on behalf of other threads. In some cases, we expect the jobs to attach the cgroup info of the original threads instead of that of the current thread. This function stores the original thread’s cgroup info in the current kthread context for later retrieval.

struct cgroup_subsys_state * kthread_blkcg(void)

get associated blkcg css of current kthread

Parameters

void

no arguments

Description

Current thread must be a kthread.

Reference counting

struct refcount_struct

variant of atomic_t specialized for reference counts

Definition

struct refcount_struct {
  atomic_t refs;
};

Members

refs

atomic_t counter field

Description

The counter saturates at UINT_MAX and will not move once there. This avoids wrapping the counter and causing ‘spurious’ use-after-free bugs.

void refcount_set(refcount_t * r, unsigned int n)

set a refcount’s value

Parameters

refcount_t * r

the refcount

unsigned int n

value to which the refcount will be set

unsigned int refcount_read(const refcount_t * r)

get a refcount’s value

Parameters

const refcount_t * r

the refcount

Return

the refcount’s value

bool refcount_add_not_zero_checked(unsigned int i, refcount_t * r)

add a value to a refcount unless it is 0

Parameters

unsigned int i

the value to add to the refcount

refcount_t * r

the refcount

Description

Will saturate at UINT_MAX and WARN.

Provides no memory ordering, it is assumed the caller has guaranteed the object memory to be stable (RCU, etc.). It does provide a control dependency and thereby orders future stores. See the comment on top.

Use of this function is not recommended for the normal reference counting use case in which references are taken and released one at a time. In these cases, refcount_inc(), or one of its variants, should instead be used to increment a reference count.

Return

false if the passed refcount is 0, true otherwise

void refcount_add_checked(unsigned int i, refcount_t * r)

add a value to a refcount

Parameters

unsigned int i

the value to add to the refcount

refcount_t * r

the refcount

Description

Similar to atomic_add(), but will saturate at UINT_MAX and WARN.

Provides no memory ordering, it is assumed the caller has guaranteed the object memory to be stable (RCU, etc.). It does provide a control dependency and thereby orders future stores. See the comment on top.

Use of this function is not recommended for the normal reference counting use case in which references are taken and released one at a time. In these cases, refcount_inc(), or one of its variants, should instead be used to increment a reference count.

bool refcount_inc_not_zero_checked(refcount_t * r)

increment a refcount unless it is 0

Parameters

refcount_t * r

the refcount to increment

Description

Similar to atomic_inc_not_zero(), but will saturate at UINT_MAX and WARN.

Provides no memory ordering, it is assumed the caller has guaranteed the object memory to be stable (RCU, etc.). It does provide a control dependency and thereby orders future stores. See the comment on top.

Return

true if the increment was successful, false otherwise
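
Example (userspace model, not the kernel implementation): the increment-unless-zero rule with saturation can be sketched as a compare-and-swap loop on C11 atomics. The toy_ names are hypothetical, relaxed ordering is used to match the "no memory ordering" statement above, and the WARN on saturation is left out.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for refcount_t. */
struct toy_refcount {
	atomic_uint refs;
};

/*
 * Take a reference unless the count is already 0.
 * A saturated counter (UINT_MAX) is reported as success but never moves.
 */
static bool toy_refcount_inc_not_zero(struct toy_refcount *r)
{
	unsigned int val;

	assert(r);
	val = atomic_load_explicit(&r->refs, memory_order_relaxed);
	do {
		if (val == 0)
			return false;	/* object is already being released */
		if (val == UINT_MAX)
			return true;	/* saturated: leave the counter pinned */
	} while (!atomic_compare_exchange_weak_explicit(&r->refs, &val, val + 1,
							memory_order_relaxed,
							memory_order_relaxed));
	return true;
}
```

A typical caller is a lookup path that finds an object in a shared structure and must not resurrect it once its count has dropped to 0.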

void refcount_inc_checked(refcount_t * r)

increment a refcount

Parameters

refcount_t * r

the refcount to increment

Description

Similar to atomic_inc(), but will saturate at UINT_MAX and WARN.

Provides no memory ordering, it is assumed the caller already has a reference on the object.

Will WARN if the refcount is 0, as this represents a possible use-after-free condition.

bool refcount_sub_and_test_checked(unsigned int i, refcount_t * r)

subtract from a refcount and test if it is 0

Parameters

unsigned int i

amount to subtract from the refcount

refcount_t * r

the refcount

Description

Similar to atomic_dec_and_test(), but it will WARN, return false and ultimately leak on underflow and will fail to decrement when saturated at UINT_MAX.

Provides release memory ordering, such that prior loads and stores are done before, and provides an acquire ordering on success such that free() must come after.

Use of this function is not recommended for the normal reference counting use case in which references are taken and released one at a time. In these cases, refcount_dec(), or one of its variants, should instead be used to decrement a reference count.

Return

true if the resulting refcount is 0, false otherwise

bool refcount_dec_and_test_checked(refcount_t * r)

decrement a refcount and test if it is 0

Parameters

refcount_t * r

the refcount

Description

Similar to atomic_dec_and_test(), it will WARN on underflow and fail to decrement when saturated at UINT_MAX.

Provides release memory ordering, such that prior loads and stores are done before, and provides an acquire ordering on success such that free() must come after.

Return

true if the resulting refcount is 0, false otherwise
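
Example (userspace model, not the kernel implementation): the decrement side can be sketched with a release compare-and-swap followed by an acquire fence when the count reaches 0, mirroring the ordering described above. The toy_ names are hypothetical and the WARN on underflow is left out.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for refcount_t. */
struct toy_refcount {
	atomic_uint refs;
};

/* Drop one reference; returns true only when the count reaches 0. */
static bool toy_refcount_dec_and_test(struct toy_refcount *r)
{
	unsigned int val;

	assert(r);
	val = atomic_load_explicit(&r->refs, memory_order_relaxed);
	do {
		if (val == UINT_MAX)
			return false;	/* saturated: never decrement */
		if (val == 0)
			return false;	/* underflow: refuse and leak */
	} while (!atomic_compare_exchange_weak_explicit(&r->refs, &val, val - 1,
							memory_order_release,
							memory_order_relaxed));
	/* On success val still holds the old value. */
	if (val != 1)
		return false;

	/* Last reference dropped: order the upcoming free after the decrement. */
	atomic_thread_fence(memory_order_acquire);
	return true;
}
```

The caller frees the object only when true is returned; an underflowing or saturated counter yields false, so the object is leaked rather than freed twice.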

void refcount_dec_checked(refcount_t * r)

decrement a refcount

Parameters

refcount_t * r

the refcount

Description

Similar to atomic_dec(), it will WARN on underflow and fail to decrement when saturated at UINT_MAX.

Provides release memory ordering, such that prior loads and stores are done before.

bool refcount_dec_if_one(refcount_t * r)

decrement a refcount if it is 1

Parameters

refcount_t * r

the refcount

Description

No atomic_t counterpart, it attempts a 1 -> 0 transition and returns the success thereof.

Like all decrement operations, it provides release memory order and provides a control dependency.

It can be used like a try-delete operator. This explicit case is provided instead of a generic cmpxchg, because a generic cmpxchg would allow implementing unsafe operations.

Return

true if the resulting refcount is 0, false otherwise

bool refcount_dec_not_one(refcount_t * r)

decrement a refcount if it is not 1

Parameters

refcount_t * r

the refcount

Description

No atomic_t counterpart, it decrements unless the value is 1, in which case it will return false.

Was often done like: atomic_add_unless(var, -1, 1)

Return

true if the decrement operation was successful, false otherwise

bool refcount_dec_and_mutex_lock(refcount_t * r, struct mutex * lock)

return holding mutex if able to decrement refcount to 0

Parameters

refcount_t * r

the refcount

struct mutex * lock

the mutex to be locked

Description

Similar to atomic_dec_and_mutex_lock(), it will WARN on underflow and fail to decrement when saturated at UINT_MAX.

Provides release memory ordering, such that prior loads and stores are done before, and provides a control dependency such that free() must come after. See the comment on top.

Return

true and hold mutex if able to decrement refcount to 0, false otherwise

bool refcount_dec_and_lock(refcount_t * r, spinlock_t * lock)

return holding spinlock if able to decrement refcount to 0

Parameters

refcount_t * r

the refcount

spinlock_t * lock

the spinlock to be locked

Description

Similar to atomic_dec_and_lock(), it will WARN on underflow and fail to decrement when saturated at UINT_MAX.

Provides release memory ordering, such that prior loads and stores are done before, and provides a control dependency such that free() must come after. See the comment on top.

Return

true and hold spinlock if able to decrement refcount to 0, false otherwise

bool refcount_dec_and_lock_irqsave(refcount_t * r, spinlock_t * lock, unsigned long * flags)

return holding spinlock with disabled interrupts if able to decrement refcount to 0

Parameters

refcount_t * r

the refcount

spinlock_t * lock

the spinlock to be locked

unsigned long * flags

saved IRQ-flags if the lock is acquired

Description

Same as refcount_dec_and_lock() above except that the spinlock is acquired with disabled interrupts.

Return

true and hold spinlock if able to decrement refcount to 0, false otherwise

Atomics

int arch_atomic_read(const atomic_t * v)

read atomic variable

Parameters

const atomic_t * v

pointer of type atomic_t

Description

Atomically reads the value of v.

void arch_atomic_set(atomic_t * v, int i)

set atomic variable

Parameters

atomic_t * v

pointer of type atomic_t

int i

required value

Description

Atomically sets the value of v to i.

void arch_atomic_add(int i, atomic_t * v)

add integer to atomic variable

Parameters

int i

integer value to add

atomic_t * v

pointer of type atomic_t

Description

Atomically adds i to v.

void arch_atomic_sub(int i, atomic_t * v)

subtract integer from atomic variable

Parameters

int i

integer value to subtract

atomic_t * v

pointer of type atomic_t

Description

Atomically subtracts i from v.

bool arch_atomic_sub_and_test(int i, atomic_t * v)

subtract value from variable and test result

Parameters

int i

integer value to subtract

atomic_t * v

pointer of type atomic_t

Description

Atomically subtracts i from v and returns true if the result is zero, or false for all other cases.

void arch_atomic_inc(atomic_t * v)

increment atomic variable

Parameters

atomic_t * v

pointer of type atomic_t

Description

Atomically increments v by 1.

void arch_atomic_dec(atomic_t * v)

decrement atomic variable

Parameters

atomic_t * v

pointer of type atomic_t

Description

Atomically decrements v by 1.

bool arch_atomic_dec_and_test(atomic_t * v)

decrement and test

Parameters

atomic_t * v

pointer of type atomic_t

Description

Atomically decrements v by 1 and returns true if the result is 0, or false for all other cases.

bool arch_atomic_inc_and_test(atomic_t * v)

increment and test

Parameters

atomic_t * v

pointer of type atomic_t

Description

Atomically increments v by 1 and returns true if the result is zero, or false for all other cases.

bool arch_atomic_add_negative(int i, atomic_t * v)

add and test if negative

Parameters

int i

integer value to add

atomic_t * v

pointer of type atomic_t

Description

Atomically adds i to v and returns true if the result is negative, or false when result is greater than or equal to zero.

int arch_atomic_add_return(int i, atomic_t * v)

add integer and return

Parameters

int i

integer value to add

atomic_t * v

pointer of type atomic_t

Description

Atomically adds i to v and returns the new value of v (the old value plus i).

int arch_atomic_sub_return(int i, atomic_t * v)

subtract integer and return

Parameters

int i

integer value to subtract

atomic_t * v

pointer of type atomic_t

Description

Atomically subtracts i from v and returns the new value of v (the old value minus i).
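
Example (userspace analogue, not the arch code): the value-returning and testing operations can be modelled on C11 atomics. The toy_ names are hypothetical. Note that atomic_fetch_add() and atomic_fetch_sub() yield the old value, so the operand is applied once more to obtain the new one.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Add i to *v and return the new value. */
static int toy_atomic_add_return(int i, atomic_int *v)
{
	assert(v);
	return atomic_fetch_add(v, i) + i;
}

/* Subtract i from *v and return the new value. */
static int toy_atomic_sub_return(int i, atomic_int *v)
{
	assert(v);
	return atomic_fetch_sub(v, i) - i;
}

/* Decrement *v and report whether it reached zero. */
static bool toy_atomic_dec_and_test(atomic_int *v)
{
	return toy_atomic_sub_return(1, v) == 0;
}

/* Add i to *v and report whether the result is negative. */
static bool toy_atomic_add_negative(int i, atomic_int *v)
{
	return toy_atomic_add_return(i, v) < 0;
}
```

The test in each helper is made on the value produced by the same atomic operation, so no other thread can slip in between the update and the comparison.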

Kernel objects manipulation

char * kobject_get_path(struct kobject * kobj, gfp_t gfp_mask)

Allocate memory and fill in the path for kobj.

Parameters

struct kobject * kobj

kobject in question, with which to build the path

gfp_t gfp_mask

the allocation type used to allocate the path

Return

The newly allocated memory, caller must free with kfree().
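
Example (toy model, not the kernel implementation): a path of this kind can be built by walking the parent chain twice, once to size the buffer and once to fill it from the end. struct toy_kobj and toy_get_path() are hypothetical userspace stand-ins, and malloc()/free() take the place of the gfp_mask allocation and kfree().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in: only the two fields the path walk needs. */
struct toy_kobj {
	const char *name;
	struct toy_kobj *parent;
};

/* Build "/root/.../leaf" for kobj; the caller frees the result with free(). */
static char *toy_get_path(const struct toy_kobj *kobj)
{
	const struct toy_kobj *p;
	size_t len = 0;
	char *path;

	assert(kobj);

	/* Pass 1: one '/' plus the name for every level. */
	for (p = kobj; p; p = p->parent)
		len += 1 + strlen(p->name);

	path = malloc(len + 1);
	if (!path)
		return NULL;
	path[len] = '\0';

	/* Pass 2: fill from the end, since the walk goes from leaf to root. */
	for (p = kobj; p; p = p->parent) {
		size_t n = strlen(p->name);

		len -= n;
		memcpy(path + len, p->name, n);
		path[--len] = '/';
	}
	return path;
}
```

For a three-level chain named devices, pci0000:00 and 0000:00:1f.2 (names chosen for illustration), the sketch produces the string /devices/pci0000:00/0000:00:1f.2.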

int kobject_set_name(struct kobject * kobj, const char * fmt, ...)

Set the name of a kobject.

Parameters

struct kobject * kobj

struct kobject to set the name of

const char * fmt

format string used to build the name

...

variable arguments

Description

This sets the name of the kobject. If you have already added the kobject to the system, you must call kobject_rename() in order to change the name of the kobject.

void kobject_init(struct kobject * kobj, struct kobj_type * ktype)

Initialize a kobject structure.

Parameters

struct kobject * kobj

pointer to the kobject to initialize

struct kobj_type * ktype

pointer to the ktype for this kobject.

Description

This function will properly initialize a kobject such that it can then be passed to the kobject_add() call.

After this function is called, the kobject MUST be cleaned up by a call to kobject_put(), not by a call to kfree directly to ensure that all of the memory is cleaned up properly.

int kobject_add(struct kobject * kobj, struct kobject * parent, const char * fmt, ...)

The main kobject add function.

Parameters

struct kobject * kobj

the kobject to add

struct kobject * parent

pointer to the parent of the kobject.

const char * fmt

format to name the kobject with.

...

variable arguments

Description

The kobject name is set and added to the kobject hierarchy in this function.

If parent is set, then the parent of the kobj will be set to it. If parent is NULL, then the parent of the kobj will be set to the kobject associated with the kset assigned to this kobject. If no kset is assigned to the kobject, then the kobject will be located in the root of the sysfs tree.

Note that no “add” uevent is created by this call. The caller should set up all of the necessary sysfs files for the object and then call kobject_uevent() with the KOBJ_ADD action to ensure that userspace is properly notified of this kobject’s creation.

Return

If this function returns an error, kobject_put() must be called to properly clean up the memory associated with the object. Under no circumstances should the kobject that is passed to this function be directly freed with a call to kfree(), as that can leak memory.

If this function returns success, kobject_put() must also be called in order to properly clean up the memory associated with the object.

In short, once this function is called, kobject_put() MUST be called when the use of the object is finished in order to properly free everything.

int kobject_init_and_add(struct kobject * kobj, struct kobj_type * ktype, struct kobject * parent, const char * fmt, ...)

Initialize a kobject structure and add it to the kobject hierarchy.

Parameters

struct kobject * kobj

pointer to the kobject to initialize

struct kobj_type * ktype

pointer to the ktype for this kobject.

struct kobject * parent

pointer to the parent of this kobject.

const char * fmt

the name of the kobject.

...

variable arguments

Description

This function combines the call to kobject_init() and kobject_add().

If this function returns an error, kobject_put() must be called to properly clean up the memory associated with the object. This is the same type of error handling after a call to kobject_add() and kobject lifetime rules are the same here.

int kobject_rename(struct kobject * kobj, const char * new_name)

Change the name of an object.

Parameters

struct kobject * kobj

object in question.

const char * new_name

object’s new name

Description

It is the responsibility of the caller to provide mutual exclusion between two different calls of kobject_rename on the same kobject and to ensure that new_name is valid and won’t conflict with other kobjects.

int kobject_move(struct kobject * kobj, struct kobject * new_parent)

Move object to another parent.

Parameters

struct kobject * kobj

object in question.

struct kobject * new_parent

object’s new parent (can be NULL)

void kobject_del(struct kobject * kobj)

Unlink kobject from hierarchy.

Parameters

struct kobject * kobj

object.

Description

This is the function that should be called to delete an object successfully added via kobject_add().

struct kobject * kobject_get(struct kobject * kobj)

Increment refcount for object.

Parameters

struct kobject * kobj

object.

void kobject_put(struct kobject * kobj)

Decrement refcount for object.

Parameters

struct kobject * kobj

object.

Description

Decrement the refcount, and if 0, call kobject_cleanup().

struct kobject * kobject_create_and_add(const char * name, struct kobject * parent)

Create a struct kobject dynamically and register it with sysfs.

Parameters

const char * name

the name for the kobject

struct kobject * parent

the parent kobject of this kobject, if any.

Description

This function creates a kobject structure dynamically and registers it with sysfs. When you are finished with this structure, call kobject_put() and the structure will be dynamically freed when it is no longer being used.

If the kobject was not able to be created, NULL will be returned.

int kset_register(struct kset * k)

Initialize and add a kset.

Parameters

struct kset * k

kset.

void kset_unregister(struct kset * k)

Remove a kset.

Parameters

struct kset * k

kset.

struct kobject * kset_find_obj(struct kset * kset, const char * name)

Search for object in kset.

Parameters

struct kset * kset

kset we’re looking in.

const char * name

object’s name.

Description

Lock kset via kset->subsys, and iterate over kset->list, looking for a matching kobject. If matching object is found take a reference and return the object.

struct kset * kset_create_and_add(const char * name, const struct kset_uevent_ops * uevent_ops, struct kobject * parent_kobj)

Create a struct kset dynamically and add it to sysfs.

Parameters

const char * name

the name for the kset

const struct kset_uevent_ops * uevent_ops

a struct kset_uevent_ops for the kset

struct kobject * parent_kobj

the parent kobject of this kset, if any.

Description

This function creates a kset structure dynamically and registers it with sysfs. When you are finished with this structure, call kset_unregister() and the structure will be dynamically freed when it is no longer being used.

If the kset was not able to be created, NULL will be returned.

Kernel utility functions

REPEAT_BYTE(x)

repeat the value x multiple times as an unsigned long value

Parameters

x

value to repeat

NOTE

x is not checked for > 0xff; larger values produce odd results.
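
Example (userspace sketch): the macro below is a userspace copy of the commonly seen definition, which is an assumption here since this text does not quote it. The helper all_bytes_equal() is made up for this illustration and checks the result byte by byte, so it works for both 32-bit and 64-bit unsigned long.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed definition: 0x0101...01 multiplied by the byte value. */
#define REPEAT_BYTE(x)	((~0ul / 0xff) * (x))

/* True if every byte of word equals byte. */
static bool all_bytes_equal(unsigned long word, unsigned char byte)
{
	size_t i;

	for (i = 0; i < sizeof(word); i++) {
		if (((word >> (8 * i)) & 0xff) != byte)
			return false;
	}
	return true;
}
```

An argument larger than 0xff spills into the neighbouring bytes, which is the odd result the note above warns about.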

ARRAY_SIZE(arr)

get the number of elements in array arr

Parameters

arr

array to be sized

round_up(x, y)

round up to next specified power of 2

Parameters

x

the value to round

y

multiple to round up to (must be a power of 2)

Description

Rounds x up to next multiple of y (which must be a power of 2). To perform arbitrary rounding up, use roundup() below.

round_down(x, y)

round down to next specified power of 2

Parameters

x

the value to round

y

multiple to round down to (must be a power of 2)

Description

Rounds x down to next multiple of y (which must be a power of 2). To perform arbitrary rounding down, use rounddown() below.

FIELD_SIZEOF(t, f)

get the size of a struct’s field

Parameters

t

the target struct

f

the target struct’s field

Return

the size of f in the struct definition without having a declared instance of t.

roundup(x, y)

round up to the next specified multiple

Parameters

x

the value to round

y

multiple to round up to

Description

Rounds x up to next multiple of y. If y will always be a power of 2, consider using the faster round_up().

rounddown(x, y)

round down to next specified multiple

Parameters

x

the value to round

y

multiple to round down to

Description

Rounds x down to next multiple of y. If y will always be a power of 2, consider using the faster round_down().
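
Example (userspace sketch): the difference between the power-of-2 helpers (round_up(), round_down()) and the arbitrary-multiple helpers (roundup(), rounddown()) can be shown with simplified functions for unsigned long. The toy_ names and the exact expressions are assumptions for illustration; the kernel versions are type-generic macros.

```c
#include <assert.h>

/* Mask-based: y must be a power of 2. */
static unsigned long toy_round_up(unsigned long x, unsigned long y)
{
	assert(y != 0 && (y & (y - 1)) == 0);
	return ((x - 1) | (y - 1)) + 1;
}

static unsigned long toy_round_down(unsigned long x, unsigned long y)
{
	assert(y != 0 && (y & (y - 1)) == 0);
	return x & ~(y - 1);
}

/* Division-based: any non-zero y is accepted. */
static unsigned long toy_roundup(unsigned long x, unsigned long y)
{
	assert(y != 0);
	return ((x + (y - 1)) / y) * y;
}

static unsigned long toy_rounddown(unsigned long x, unsigned long y)
{
	assert(y != 0);
	return x - (x % y);
}
```

For a power-of-2 multiple both families agree; the mask-based pair simply avoids the division.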

upper_32_bits(n)

return bits 32-63 of a number

Parameters

n

the number we’re accessing

Description

A basic shift-right of a 64- or 32-bit quantity. Use this to suppress the “right shift count >= width of type” warning when that quantity is 32-bits.

lower_32_bits(n)

return bits 0-31 of a number

Parameters

n

the number we’re accessing
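A userspace sketch of the two helpers, assuming the usual approach of splitting the 32-bit shift into two 16-bit shifts so that a 32-bit operand is never shifted by its full width. The macro names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Two 16-bit shifts: never a shift by the full width of a 32-bit operand. */
#define UPPER_32_BITS_SKETCH(n) ((uint32_t)(((n) >> 16) >> 16))
#define LOWER_32_BITS_SKETCH(n) ((uint32_t)((n) & 0xffffffff))

static void split_bits_demo(void)
{
	uint64_t wide = 0x1122334455667788ULL;
	uint32_t narrow = 0xdeadbeefU;

	assert(UPPER_32_BITS_SKETCH(wide) == 0x11223344U);
	assert(LOWER_32_BITS_SKETCH(wide) == 0x55667788U);
	assert(UPPER_32_BITS_SKETCH(narrow) == 0);	/* no bits above 31 */
	assert(LOWER_32_BITS_SKETCH(narrow) == 0xdeadbeefU);
}
```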

might_sleep()

annotation for functions that can sleep

Parameters

Description

This macro prints a stack trace if it is executed in an atomic context (spinlock, irq-handler, …).

This is a useful debugging help to be able to catch problems early and not be bitten later when the calling function happens to sleep when it is not supposed to.

cant_sleep()

annotation for functions that cannot sleep

Parameters

Description

This macro prints a stack trace if it is executed with preemption enabled.

abs(x)

return absolute value of an argument

Parameters

x

the value. If it is unsigned type, it is converted to signed type first. char is treated as if it was signed (regardless of whether it really is) but the macro’s return type is preserved as char.

Return

an absolute value of x.

u32 reciprocal_scale(u32 val, u32 ep_ro)

“scale” a value into range [0, ep_ro)

Parameters

u32 val

value

u32 ep_ro

right open interval endpoint

Description

Perform a “reciprocal multiplication” in order to “scale” a value into range [0, ep_ro), where the upper interval endpoint is right-open. This is useful, e.g., for accessing an index of an array containing ep_ro elements. Think of it as sort of modulus, only that the result isn’t that of modulo. ;) Note that if the initial input is a small value, the result will be 0.

Return

a result based on val in interval [0, ep_ro).
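The scaling can be sketched in userspace as a 64-bit multiplication followed by a shift, which maps the full 32-bit input range proportionally onto [0, ep_ro). The function name below is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: (val * ep_ro) / 2^32, computed without a division. */
static uint32_t reciprocal_scale_sketch(uint32_t val, uint32_t ep_ro)
{
	return (uint32_t)(((uint64_t)val * ep_ro) >> 32);
}

static void reciprocal_scale_demo(void)
{
	assert(reciprocal_scale_sketch(0x80000000U, 10) == 5);	/* midpoint */
	assert(reciprocal_scale_sketch(0xffffffffU, 10) == 9);	/* never reaches 10 */
	assert(reciprocal_scale_sketch(100, 10) == 0);		/* small input */
}
```

Because the result depends on the high bits of val, this suits well-distributed inputs such as hash values rather than small counters.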

int kstrtoul(const char * s, unsigned int base, unsigned long * res)

convert a string to an unsigned long

Parameters

const char * s

The start of the string. The string must be null-terminated, and may also include a single newline before its terminating null. The first character may also be a plus sign, but not a minus sign.

unsigned int base

The number base to use. The maximum supported base is 16. If base is given as 0, then the base of the string is automatically detected with the conventional semantics - If it begins with 0x the number will be parsed as a hexadecimal (case insensitive), if it otherwise begins with 0, it will be parsed as an octal number. Otherwise it will be parsed as a decimal.

unsigned long * res

Where to write the result of the conversion on success.

Description

Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. Used as a replacement for the obsolete simple_strtoull. Return code must be checked.
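The parsing rules above can be approximated in userspace with strtoul() plus a few extra checks. This is only a sketch of the documented behaviour, not the kernel implementation; parse_ulong_sketch is a hypothetical name.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Userspace approximation of the rules documented for kstrtoul(). */
static int parse_ulong_sketch(const char *s, unsigned int base,
			      unsigned long *res)
{
	char *end;
	unsigned long val;

	if (*s == '+')
		s++;
	/* strtoul() would accept whitespace or a sign here; reject both. */
	if (!isxdigit((unsigned char)*s))
		return -EINVAL;

	errno = 0;
	val = strtoul(s, &end, (int)base);
	if (errno == ERANGE)
		return -ERANGE;
	if (end == s)
		return -EINVAL;
	if (*end == '\n')	/* a single trailing newline is allowed */
		end++;
	if (*end != '\0')
		return -EINVAL;

	*res = val;
	return 0;
}

static void parse_ulong_demo(void)
{
	unsigned long v = 0;

	assert(parse_ulong_sketch("42\n", 10, &v) == 0 && v == 42);
	assert(parse_ulong_sketch("-1", 10, &v) == -EINVAL);
}
```

As with the kernel helper, the result is only written on success, so callers need to check the return code before using it.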

int kstrtol(const char * s, unsigned int base, long * res)

convert a string to a long

Parameters

const char * s

The start of the string. The string must be null-terminated, and may also include a single newline before its terminating null. The first character may also be a plus sign or a minus sign.

unsigned int base

The number base to use. The maximum supported base is 16. If base is given as 0, then the base of the string is automatically detected with the conventional semantics - If it begins with 0x the number will be parsed as a hexadecimal (case insensitive), if it otherwise begins with 0, it will be parsed as an octal number. Otherwise it will be parsed as a decimal.

long * res

Where to write the result of the conversion on success.

Description

Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. Used as a replacement for the obsolete simple_strtoull. Return code must be checked.

trace_printk(fmt, ...)

printf formatting in the ftrace buffer

Parameters

fmt

the printf format for printing

...

variable arguments

Note

__trace_printk is an internal function for trace_printk() and the ip is passed in via the trace_printk() macro.

This function allows a kernel developer to debug fast path sections that printk is not appropriate for. By scattering in various printk like tracing in the code, a developer can quickly see where problems are occurring.

This is intended as a debugging tool for the developer only. Please refrain from leaving trace_printks scattered around in your code. (Extra memory is used for special buffers that are allocated when trace_printk() is used.)

A little optimization trick is done here. If there’s only one argument, there’s no need to scan the string for printf formats. The trace_puts() will suffice. But how can we take advantage of using trace_puts() when trace_printk() has only one argument? By stringifying the args and checking the size we can tell whether or not there are args. __stringify((__VA_ARGS__)) will turn into “()\0” with a size of 3 when there are no args, anything else will be bigger. All we need to do is define a string to this, and then take its size and compare to 3. If it’s bigger, use do_trace_printk(); otherwise, optimize it to trace_puts(). Then just let gcc optimize the rest.
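The size check can be reproduced in userspace with a two-level stringify macro. The sketch assumes GCC or Clang, which accept a variadic macro invoked with no variadic arguments; all macro names are hypothetical.

```c
#include <assert.h>

/* Two levels so that __VA_ARGS__ is expanded before '#' is applied. */
#define SKETCH_STR_(...) #__VA_ARGS__
#define SKETCH_STR(...) SKETCH_STR_(__VA_ARGS__)

/*
 * sizeof("()") is 3 (two parentheses plus the terminating NUL), so any
 * larger size means at least one argument followed the format string.
 */
#define SKETCH_HAS_ARGS(fmt, ...) (sizeof(SKETCH_STR((__VA_ARGS__))) > 3)

static void has_args_demo(void)
{
	assert(!SKETCH_HAS_ARGS("no arguments\n"));
	assert(SKETCH_HAS_ARGS("value=%d\n", 42));
}
```

The comparison is a compile-time constant, so the compiler can drop whichever branch is not taken.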

trace_puts(str)

write a string into the ftrace buffer

Parameters

str

the string to record

Note

__trace_bputs is an internal function for trace_puts and the ip is passed in via the trace_puts macro.

This is similar to trace_printk() but is made for those really fast paths where a developer wants the least amount of “Heisenbug” effects and where even the processing of the print format is too much.

This function allows a kernel developer to debug fast path sections that printk is not appropriate for. By scattering in various printk like tracing in the code, a developer can quickly see where problems are occurring.

This is intended as a debugging tool for the developer only. Please refrain from leaving trace_puts scattered around in your code. (Extra memory is used for special buffers that are allocated when trace_puts() is used.)

Return

0 if nothing was written, positive # if string was.

(1 when __trace_bputs is used, strlen(str) when __trace_puts is used)

min(x, y)

return minimum of two values of the same or compatible types

Parameters

x

first value

y

second value

max(x, y)

return maximum of two values of the same or compatible types

Parameters

x

first value

y

second value

min3(x, y, z)

return minimum of three values

Parameters

x

first value

y

second value

z

third value

max3(x, y, z)

return maximum of three values

Parameters

x

first value

y

second value

z

third value

min_not_zero(x, y)

return the minimum that is _not_ zero, unless both are zero

Parameters

x

value1

y

value2
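A sketch of the selection rule for unsigned int operands. The kernel macro is type-generic; the function name is hypothetical.

```c
#include <assert.h>

/* Zero means "no value": prefer the other operand, else the smaller one. */
static unsigned int min_not_zero_sketch(unsigned int x, unsigned int y)
{
	if (x == 0)
		return y;
	if (y == 0)
		return x;
	return x < y ? x : y;
}

static void min_not_zero_demo(void)
{
	assert(min_not_zero_sketch(0, 7) == 7);
	assert(min_not_zero_sketch(3, 7) == 3);
	assert(min_not_zero_sketch(0, 0) == 0);
}
```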

clamp(val, lo, hi)

return a value clamped to a given range with strict typechecking

Parameters

val

current value

lo

lowest allowable value

hi

highest allowable value

Description

This macro does strict typechecking of lo/hi to make sure they are of the same type as val. See the unnecessary pointer comparisons in the macro definition.
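The clamping itself amounts to applying the lower bound and then the upper bound. The sketch below fixes the type to long, so the typechecking aspect of the macro is not modelled; the function name is hypothetical.

```c
#include <assert.h>

/* Sketch for long only: clamp is min(max(val, lo), hi). */
static long clamp_long_sketch(long val, long lo, long hi)
{
	long floor_applied = val < lo ? lo : val;

	return floor_applied > hi ? hi : floor_applied;
}

static void clamp_demo(void)
{
	assert(clamp_long_sketch(5, 0, 10) == 5);	/* inside the range */
	assert(clamp_long_sketch(-3, 0, 10) == 0);	/* raised to lo */
	assert(clamp_long_sketch(42, 0, 10) == 10);	/* lowered to hi */
}
```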

min_t(type, x, y)

return minimum of two values, using the specified type

Parameters

type

data type to use

x

first value

y

second value

max_t(type, x, y)

return maximum of two values, using the specified type

Parameters

type

data type to use

x

first value

y

second value

clamp_t(type, val, lo, hi)

return a value clamped to a given range using a given type

Parameters

type

the type of variable to use

val

current value

lo

minimum allowable value

hi

maximum allowable value

Description

This macro does no typechecking and uses temporary variables of type type to make all the comparisons.

clamp_val(val, lo, hi)

return a value clamped to a given range using val’s type

Parameters

val

current value

lo

minimum allowable value

hi

maximum allowable value

Description

This macro does no typechecking and uses temporary variables of whatever type the input argument val is. This is useful when val is an unsigned type and lo and hi are literals that will otherwise be assigned a signed integer type.

swap(a, b)

swap values of a and b

Parameters

a

first value

b

second value

container_of(ptr, type, member)

cast a member of a structure out to the containing structure

Parameters

ptr

the pointer to the member.

type

the type of the container struct this is embedded in.

member

the name of the member within the struct.
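The pointer arithmetic behind this can be sketched with offsetof(). The version below omits the type checks a production macro would add, and the macro, struct, and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Step back from the member's address by the member's offset. */
#define CONTAINER_OF_SKETCH(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct sketch_node {
	struct sketch_node *next;
};

struct sketch_dev {
	int id;
	struct sketch_node node;	/* embedded member */
};

static struct sketch_dev *dev_from_node(struct sketch_node *n)
{
	return CONTAINER_OF_SKETCH(n, struct sketch_dev, node);
}

static void container_of_demo(void)
{
	struct sketch_dev dev = { .id = 7, .node = { NULL } };

	assert(dev_from_node(&dev.node) == &dev);
	assert(dev_from_node(&dev.node)->id == 7);
}
```

This is the pattern that lets generic code hand back a pointer to an embedded member while the owner recovers its own structure.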

container_of_safe(ptr, type, member)

cast a member of a structure out to the containing structure

Parameters

ptr

the pointer to the member.

type

the type of the container struct this is embedded in.

member

the name of the member within the struct.

Description

If IS_ERR_OR_NULL(ptr), ptr is returned unchanged.

__visible int printk(const char * fmt, ...)

print a kernel message

Parameters

const char * fmt

format string

...

variable arguments

Description

This is printk(). It can be called from any context. We want it to work.

We try to grab the console_lock. If we succeed, it’s easy - we log the output and call the console drivers. If we fail to get the semaphore, we place the output into the log buffer and return. The current holder of the console_sem will notice the new output in console_unlock() and will send it to the consoles before releasing the lock.

One effect of this deferred printing is that code which calls printk() and then changes console_loglevel may break. This is because console_loglevel is inspected when the actual printing occurs.

See also: printf(3)

See the vsnprintf() documentation for format string extensions over C99.

void console_lock(void)

lock the console system for exclusive use.

Parameters

void

no arguments

Description

Acquires a lock which guarantees that the caller has exclusive access to the console system and the console_drivers list.

Can sleep, returns nothing.

int console_trylock(void)

try to lock the console system for exclusive use.

Parameters

void

no arguments

Description

Try to acquire a lock which guarantees that the caller has exclusive access to the console system and the console_drivers list.

Returns 1 on success, and 0 on failure to acquire the lock.

void console_unlock(void)

unlock the console system

Parameters

void

no arguments

Description

Releases the console_lock which the caller holds on the console system and the console driver list.

While the console_lock was held, console output may have been buffered by printk(). If this is the case, console_unlock() emits the output prior to releasing the lock.

If there is output waiting, we wake /dev/kmsg and syslog() users.

console_unlock() may be called from any context.

void console_conditional_schedule(void)

yield the CPU if required

Parameters

void

no arguments

Description

If the console code is currently allowed to sleep, and if this CPU should yield the CPU to another task, do so here.

Must be called within console_lock().

bool printk_timed_ratelimit(unsigned long * caller_jiffies, unsigned int interval_msecs)

caller-controlled printk ratelimiting

Parameters

unsigned long * caller_jiffies

pointer to caller’s state

unsigned int interval_msecs

minimum interval between prints

Description

printk_timed_ratelimit() returns true if more than interval_msecs milliseconds have elapsed since the last time printk_timed_ratelimit() returned true.
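The caller-owned state can be modelled in userspace with an explicit clock. In this sketch now_ms stands in for the kernel's jiffies counter; the function name and the extra parameter are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch: *last_ms is the caller's state (0 means "never printed").
 * Returns true, and records the time, only if more than interval_ms
 * has passed since the last call that returned true.
 */
static bool timed_ratelimit_sketch(unsigned long *last_ms,
				   unsigned long now_ms,
				   unsigned int interval_ms)
{
	if (*last_ms && now_ms - *last_ms <= interval_ms)
		return false;
	*last_ms = now_ms;
	return true;
}

static void timed_ratelimit_demo(void)
{
	unsigned long state = 0;

	assert(timed_ratelimit_sketch(&state, 1000, 500));	/* first call */
	assert(!timed_ratelimit_sketch(&state, 1200, 500));	/* too soon */
	assert(timed_ratelimit_sketch(&state, 1501, 500));	/* interval passed */
}
```

A caller would typically keep the state in a static variable and print only when the function returns true.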

int kmsg_dump_register(struct kmsg_dumper * dumper)

register a kernel log dumper.

Parameters

struct kmsg_dumper * dumper

pointer to the kmsg_dumper structure

Description

Adds a kernel log dumper to the system. The dump callback in the structure will be called when the kernel oopses or panics and must be set. Returns zero on success and -EINVAL or -EBUSY otherwise.

int kmsg_dump_unregister(struct kmsg_dumper * dumper)

unregister a kmsg dumper.

Parameters

struct kmsg_dumper * dumper

pointer to the kmsg_dumper structure

Description

Removes a dump device from the system. Returns zero on success and -EINVAL otherwise.

bool kmsg_dump_get_line(struct kmsg_dumper * dumper, bool syslog, char * line, size_t size, size_t * len)

retrieve one kmsg log line

Parameters

struct kmsg_dumper * dumper

registered kmsg dumper

bool syslog

include the “<4>” prefixes

char * line

buffer to copy the line to

size_t size

maximum size of the buffer

size_t * len

length of line placed into buffer

Description

Start at the beginning of the kmsg buffer, with the oldest kmsg record, and copy one record into the provided buffer.

Consecutive calls will return the next available record moving towards the end of the buffer with the youngest messages.

A return value of FALSE indicates that there are no more records to read.

bool kmsg_dump_get_buffer(struct kmsg_dumper * dumper, bool syslog, char * buf, size_t size, size_t * len)

copy kmsg log lines

Parameters

struct kmsg_dumper * dumper

registered kmsg dumper

bool syslog

include the “<4>” prefixes

char * buf

buffer to copy the line to

size_t size

maximum size of the buffer

size_t * len

length of line placed into buffer

Description

Start at the end of the kmsg buffer and fill the provided buffer with as many of the youngest kmsg records as fit into it. If the buffer is large enough, all available kmsg records will be copied with a single call.

Consecutive calls will fill the buffer with the next block of available older records, not including the earlier retrieved ones.

A return value of FALSE indicates that there are no more records to read.

void kmsg_dump_rewind(struct kmsg_dumper * dumper)

reset the iterator

Parameters

struct kmsg_dumper * dumper

registered kmsg dumper

Description

Reset the dumper’s iterator so that kmsg_dump_get_line() and kmsg_dump_get_buffer() can be called again and used multiple times within the same dumper.dump() callback.

void panic(const char * fmt, ...)

halt the system

Parameters

const char * fmt

The text string to print

...

variable arguments

Description

Display a message, then perform cleanups.

This function never returns.

void add_taint(unsigned flag, enum lockdep_ok lockdep_ok)

Parameters

unsigned flag

one of the TAINT_* constants.

enum lockdep_ok lockdep_ok

whether lock debugging is still OK.

Description

If something bad has gone wrong, you’ll want lockdep_ok = false, but for some noteworthy-but-not-corrupting cases, it can be set to true.

bool notrace rcu_is_watching(void)

see if RCU thinks that the current CPU is not idle

Parameters

void

no arguments

Description

Return true if RCU is watching the running CPU, which means that this CPU can safely enter RCU read-side critical sections. In other words, if the current CPU is not in its idle loop or is in an interrupt or NMI handler, return true.

void call_rcu(struct rcu_head * head, rcu_callback_t func)

Queue an RCU callback for invocation after a grace period.

Parameters

struct rcu_head * head

structure to be used for queueing the RCU updates.

rcu_callback_t func

actual callback function to be invoked after the grace period

Description

The callback function will be invoked some time after a full grace period elapses, in other words after all pre-existing RCU read-side critical sections have completed. However, the callback function might well execute concurrently with RCU read-side critical sections that started after call_rcu() was invoked. RCU read-side critical sections are delimited by rcu_read_lock() and rcu_read_unlock(), and may be nested. In addition, regions of code across which interrupts, preemption, or softirqs have been disabled also serve as RCU read-side critical sections. This includes hardware interrupt handlers, softirq handlers, and NMI handlers.

Note that all CPUs must agree that the grace period extended beyond all pre-existing RCU read-side critical sections. On systems with more than one CPU, this means that when “func()” is invoked, each CPU is guaranteed to have executed a full memory barrier since the end of its last RCU read-side critical section whose beginning preceded the call to call_rcu(). It also means that each CPU executing an RCU read-side critical section that continues beyond the start of “func()” must have executed a memory barrier after the call_rcu() but before the beginning of that RCU read-side critical section. Note that these guarantees include CPUs that are offline, idle, or executing in user mode, as well as CPUs that are executing in the kernel.

Furthermore, if CPU A invoked call_rcu() and CPU B invoked the resulting RCU callback function “func()”, then both CPU A and CPU B are guaranteed to execute a full memory barrier during the time interval between the call to call_rcu() and the invocation of “func()” – even if CPU A and CPU B are the same CPU (but again only if the system has more than one CPU).

void synchronize_rcu(void)

wait until a grace period has elapsed.

Parameters

void

no arguments

Description

Control will return to the caller some time after a full grace period has elapsed, in other words after all currently executing RCU read-side critical sections have completed. Note, however, that upon return from synchronize_rcu(), the caller might well be executing concurrently with new RCU read-side critical sections that began while synchronize_rcu() was waiting. RCU read-side critical sections are delimited by rcu_read_lock() and rcu_read_unlock(), and may be nested. In addition, regions of code across which interrupts, preemption, or softirqs have been disabled also serve as RCU read-side critical sections. This includes hardware interrupt handlers, softirq handlers, and NMI handlers.

Note that this guarantee implies further memory-ordering guarantees. On systems with more than one CPU, when synchronize_rcu() returns, each CPU is guaranteed to have executed a full memory barrier since the end of its last RCU read-side critical section whose beginning preceded the call to synchronize_rcu(). In addition, each CPU having an RCU read-side critical section that extends beyond the return from synchronize_rcu() is guaranteed to have executed a full memory barrier after the beginning of synchronize_rcu() and before the beginning of that RCU read-side critical section. Note that these guarantees include CPUs that are offline, idle, or executing in user mode, as well as CPUs that are executing in the kernel.

Furthermore, if CPU A invoked synchronize_rcu(), which returned to its caller on CPU B, then both CPU A and CPU B are guaranteed to have executed a full memory barrier during the execution of synchronize_rcu() – even if CPU A and CPU B are the same CPU (but again only if the system has more than one CPU).

unsigned long get_state_synchronize_rcu(void)

Snapshot current RCU state

Parameters

void

no arguments

Description

Returns a cookie that is used by a later call to cond_synchronize_rcu() to determine whether or not a full grace period has elapsed in the meantime.

void cond_synchronize_rcu(unsigned long oldstate)

Conditionally wait for an RCU grace period

Parameters

unsigned long oldstate

return value from earlier call to get_state_synchronize_rcu()

Description

If a full RCU grace period has elapsed since the earlier call to get_state_synchronize_rcu(), just return. Otherwise, invoke synchronize_rcu() to wait for a full grace period.

Yes, this function does not take counter wrap into account. But counter wrap is harmless. If the counter wraps, we have waited for more than 2 billion grace periods (and way more on a 64-bit system!), so waiting for one additional grace period should be just fine.

void rcu_barrier(void)

Wait until all in-flight call_rcu() callbacks complete.

Parameters

void

no arguments

Description

Note that this primitive does not necessarily wait for an RCU grace period to complete. For example, if there are no RCU callbacks queued anywhere in the system, then rcu_barrier() is within its rights to return immediately, without waiting for anything, much less an RCU grace period.

int rcu_read_lock_sched_held(void)

might we be in RCU-sched read-side critical section?

Parameters

void

no arguments

Description

If CONFIG_DEBUG_LOCK_ALLOC is selected, returns nonzero iff in an RCU-sched read-side critical section. In absence of CONFIG_DEBUG_LOCK_ALLOC, this assumes we are in an RCU-sched read-side critical section unless it can prove otherwise. Note that disabling of preemption (including disabling irqs) counts as an RCU-sched read-side critical section. This is useful for debug checks in functions that required that they be called within an RCU-sched read-side critical section.

Check debug_lockdep_rcu_enabled() to prevent false positives during boot and while lockdep is disabled.

Note that if the CPU is in the idle loop from an RCU point of view (ie: that we are in the section between rcu_idle_enter() and rcu_idle_exit()) then rcu_read_lock_held() returns false even if the CPU did an rcu_read_lock(). The reason for this is that RCU ignores CPUs that are in such a section, considering these as in extended quiescent state, so such a CPU is effectively never in an RCU read-side critical section regardless of what RCU primitives it invokes. This state of affairs is required — we need to keep an RCU-free window in idle where the CPU may possibly enter into low power mode. This way we can notice an extended quiescent state to other CPUs that started a grace period. Otherwise we would delay any grace period as long as we run in the idle task.

Similarly, we avoid claiming an SRCU read lock held if the current CPU is offline.

void rcu_expedite_gp(void)

Expedite future RCU grace periods

Parameters

void

no arguments

Description

After a call to this function, future calls to synchronize_rcu() and friends act as if the corresponding synchronize_rcu_expedited() function had instead been called.

void rcu_unexpedite_gp(void)

Cancel prior rcu_expedite_gp() invocation

Parameters

void

no arguments

Description

Undo a prior call to rcu_expedite_gp(). If all prior calls to rcu_expedite_gp() are undone by a subsequent call to rcu_unexpedite_gp(), and if the rcu_expedited sysfs/boot parameter is not set, then all subsequent calls to synchronize_rcu() and friends will return to their normal non-expedited behavior.

int rcu_read_lock_held(void)

might we be in RCU read-side critical section?

Parameters

void

no arguments

Description

If CONFIG_DEBUG_LOCK_ALLOC is selected, returns nonzero iff in an RCU read-side critical section. In absence of CONFIG_DEBUG_LOCK_ALLOC, this assumes we are in an RCU read-side critical section unless it can prove otherwise. This is useful for debug checks in functions that require that they be called within an RCU read-side critical section.

Checks debug_lockdep_rcu_enabled() to prevent false positives during boot and while lockdep is disabled.

Note that rcu_read_lock() and the matching rcu_read_unlock() must occur in the same context, for example, it is illegal to invoke rcu_read_unlock() in process context if the matching rcu_read_lock() was invoked from within an irq handler.

Note that rcu_read_lock() is disallowed if the CPU is either idle or offline from an RCU perspective, so check for those as well.

int rcu_read_lock_bh_held(void)

might we be in RCU-bh read-side critical section?

Parameters

void

no arguments

Description

Check for bottom half being disabled, which covers both the CONFIG_PROVE_RCU and not cases. Note that if someone uses rcu_read_lock_bh(), but then later enables BH, lockdep (if enabled) will show the situation. This is useful for debug checks in functions that require that they be called within an RCU read-side critical section.

Check debug_lockdep_rcu_enabled() to prevent false positives during boot.

Note that rcu_read_lock_bh() is disallowed if the CPU is either idle or offline from an RCU perspective, so check for those as well.

void wakeme_after_rcu(struct rcu_head * head)

Callback function to awaken a task after grace period

Parameters

struct rcu_head * head

Pointer to rcu_head member within rcu_synchronize structure

Description

Awaken the corresponding task now that a grace period has elapsed.

void init_rcu_head_on_stack(struct rcu_head * head)

initialize on-stack rcu_head for debugobjects

Parameters

struct rcu_head * head

pointer to rcu_head structure to be initialized

Description

This function informs debugobjects of a new rcu_head structure that has been allocated as an auto variable on the stack. This function is not required for rcu_head structures that are statically defined or that are dynamically allocated on the heap. This function has no effect for !CONFIG_DEBUG_OBJECTS_RCU_HEAD kernel builds.

void destroy_rcu_head_on_stack(struct rcu_head * head)

destroy on-stack rcu_head for debugobjects

Parameters

struct rcu_head * head

pointer to rcu_head structure to be destroyed

Description

This function informs debugobjects that an on-stack rcu_head structure is about to go out of scope. As with init_rcu_head_on_stack(), this function is not required for rcu_head structures that are statically defined or that are dynamically allocated on the heap. Also as with init_rcu_head_on_stack(), this function has no effect for !CONFIG_DEBUG_OBJECTS_RCU_HEAD kernel builds.

void call_rcu_tasks(struct rcu_head * rhp, rcu_callback_t func)

Queue an RCU callback for invocation after a task-based grace period

Parameters

struct rcu_head * rhp

structure to be used for queueing the RCU updates.

rcu_callback_t func

actual callback function to be invoked after the grace period

Description

The callback function will be invoked some time after a full grace period elapses, in other words after all currently executing RCU read-side critical sections have completed. call_rcu_tasks() assumes that the read-side critical sections end at a voluntary context switch (not a preemption!), cond_resched_rcu_qs(), entry into idle, or transition to usermode execution. As such, there are no read-side primitives analogous to rcu_read_lock() and rcu_read_unlock() because this primitive is intended to determine that all tasks have passed through a safe state, not so much for data-structure synchronization.

See the description of call_rcu() for more detailed information on memory ordering guarantees.

void synchronize_rcu_tasks(void)

wait until an rcu-tasks grace period has elapsed.

Parameters

void

no arguments

Description

Control will return to the caller some time after a full rcu-tasks grace period has elapsed, in other words after all currently executing rcu-tasks read-side critical sections have elapsed. These read-side critical sections are delimited by calls to schedule(), cond_resched_tasks_rcu_qs(), idle execution, userspace execution, calls to synchronize_rcu_tasks(), and (in theory, anyway) cond_resched().

This is a very specialized primitive, intended only for a few uses in tracing and other situations requiring manipulation of function preambles and profiling hooks. The synchronize_rcu_tasks() function is not (yet) intended for heavy use from multiple CPUs.

Note that this guarantee implies further memory-ordering guarantees. On systems with more than one CPU, when synchronize_rcu_tasks() returns, each CPU is guaranteed to have executed a full memory barrier since the end of its last RCU-tasks read-side critical section whose beginning preceded the call to synchronize_rcu_tasks(). In addition, each CPU having an RCU-tasks read-side critical section that extends beyond the return from synchronize_rcu_tasks() is guaranteed to have executed a full memory barrier after the beginning of synchronize_rcu_tasks() and before the beginning of that RCU-tasks read-side critical section. Note that these guarantees include CPUs that are offline, idle, or executing in user mode, as well as CPUs that are executing in the kernel.

Furthermore, if CPU A invoked synchronize_rcu_tasks(), which returned to its caller on CPU B, then both CPU A and CPU B are guaranteed to have executed a full memory barrier during the execution of synchronize_rcu_tasks() – even if CPU A and CPU B are the same CPU (but again only if the system has more than one CPU).

void rcu_barrier_tasks(void)

Wait for in-flight call_rcu_tasks() callbacks.

Parameters

void

no arguments

Description

Although the current implementation is guaranteed to wait, it is not obligated to, for example, if there are no pending callbacks.

size_t array_size(size_t a, size_t b)

Calculate size of 2-dimensional array.

Parameters

size_t a

dimension one

size_t b

dimension two

Description

Calculates size of 2-dimensional array: a * b.

Return

number of bytes needed to represent the array or SIZE_MAX on overflow.

size_t array3_size(size_t a, size_t b, size_t c)

Calculate size of 3-dimensional array.

Parameters

size_t a

dimension one

size_t b

dimension two

size_t c

dimension three

Description

Calculates size of 3-dimensional array: a * b * c.

Return

number of bytes needed to represent the array or SIZE_MAX on overflow.
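The saturating behaviour can be sketched in userspace with an explicit overflow check before each multiplication. The helper and function names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if a * b does not fit in size_t; otherwise stores it in *out. */
static int mul_overflows(size_t a, size_t b, size_t *out)
{
	if (b != 0 && a > SIZE_MAX / b)
		return 1;
	*out = a * b;
	return 0;
}

static size_t array_size_sketch(size_t a, size_t b)
{
	size_t bytes;

	return mul_overflows(a, b, &bytes) ? SIZE_MAX : bytes;
}

static size_t array3_size_sketch(size_t a, size_t b, size_t c)
{
	size_t bytes;

	if (mul_overflows(a, b, &bytes) || mul_overflows(bytes, c, &bytes))
		return SIZE_MAX;
	return bytes;
}

static void array_size_demo(void)
{
	assert(array_size_sketch(4, 8) == 32);
	assert(array_size_sketch(SIZE_MAX, 2) == SIZE_MAX);	/* saturates */
	assert(array3_size_sketch(2, 3, 4) == 24);
}
```

Returning SIZE_MAX rather than a wrapped product means an allocation request using the result is expected to fail instead of silently under-allocating.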

struct_size(p, member, n)

Calculate size of structure with trailing array.

Parameters

p

Pointer to the structure.

member

Name of the array member.

n

Number of elements in the array.

Description

Calculates size of memory needed for structure p followed by an array of n member elements.

Return

number of bytes needed or SIZE_MAX on overflow.
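A userspace sketch of the same calculation for a structure ending in a flexible array member. It takes the header and element sizes explicitly instead of deriving them from a pointer and member name; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_msg {
	size_t count;
	int items[];	/* trailing flexible array */
};

/* header + n * elem, saturating at SIZE_MAX if the sum would overflow. */
static size_t struct_size_sketch(size_t header, size_t elem, size_t n)
{
	if (elem != 0 && n > (SIZE_MAX - header) / elem)
		return SIZE_MAX;
	return header + n * elem;
}

static void struct_size_demo(void)
{
	size_t need = struct_size_sketch(sizeof(struct sketch_msg),
					 sizeof(int), 4);

	assert(need == sizeof(struct sketch_msg) + 4 * sizeof(int));
	assert(struct_size_sketch(sizeof(struct sketch_msg), sizeof(int),
				  SIZE_MAX) == SIZE_MAX);
}
```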

Device Resource Management

void * devres_alloc_node(dr_release_t release, size_t size, gfp_t gfp, int nid)

Allocate device resource data

Parameters

dr_release_t release

Release function devres will be associated with

size_t size

Allocation size

gfp_t gfp

Allocation flags

int nid

NUMA node

Description

Allocate devres of size bytes. The allocated area is zeroed, then associated with release. The returned pointer can be passed to other devres_*() functions.

Return

Pointer to allocated devres on success, NULL on failure.

void devres_for_each_res(struct device * dev, dr_release_t release, dr_match_t match, void * match_data, void (*fn)(struct device *, void *, void *), void * data)

Resource iterator

Parameters

struct device * dev

Device to iterate resource from

dr_release_t release

Look for resources associated with this release function

dr_match_t match

Match function (optional)

void * match_data

Data for the match function

void (*)(struct device *, void *, void *) fn

Function to be called for each matched resource.

void * data

Data for fn, the 3rd parameter of fn

Description

Call fn for each devres of dev which is associated with release and for which match returns 1.

Return

void

void devres_free(void * res)

Free device resource data

Parameters

void * res

Pointer to devres data to free

Description

Free devres created with devres_alloc().

void devres_add(struct device * dev, void * res)

Register device resource

Parameters

struct device * dev

Device to add resource to

void * res

Resource to register

Description

Register devres res to dev. res should have been allocated using devres_alloc(). On driver detach, the associated release function will be invoked and devres will be freed automatically.

void * devres_find(struct device * dev, dr_release_t release, dr_match_t match, void * match_data)

Find device resource

Parameters

struct device * dev

Device to lookup resource from

dr_release_t release

Look for resources associated with this release function

dr_match_t match

Match function (optional)

void * match_data

Data for the match function

Description

Find the latest devres of dev which is associated with release and for which match returns 1. If match is NULL, it’s considered to match all.

Return

Pointer to found devres, NULL if not found.

void * devres_get(struct device * dev, void * new_res, dr_match_t match, void * match_data)

Find devres, if non-existent, add one atomically

Parameters

struct device * dev

Device to lookup or add devres for

void * new_res

Pointer to new initialized devres to add if not found

dr_match_t match

Match function (optional)

void * match_data

Data for the match function

Description

Find the latest devres of dev which has the same release function as new_res and for which match returns 1. If found, new_res is freed; otherwise, new_res is added atomically.

Return

Pointer to found or added devres.
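
The get-or-add behaviour can be sketched in userspace as follows. This is a single-threaded model, not kernel code: the toy_* names are hypothetical, there is no locking (so the "atomically" part is not modelled), and callbacks omit the device argument.

```c
#include <stddef.h>
#include <stdlib.h>

/* Userspace toy model, NOT kernel code. Every toy_* name is hypothetical. */
typedef void (*toy_release_t)(void *res);
typedef int (*toy_match_t)(void *res, void *match_data);

struct toy_node {
    struct toy_node *next;        /* newest first */
    toy_release_t release;
    unsigned long long data[];
};

struct toy_device {
    struct toy_node *head;
};

static struct toy_node *toy_node_of(void *res)
{
    return (struct toy_node *)((char *)res - offsetof(struct toy_node, data));
}

static void *toy_alloc(toy_release_t release, size_t size)
{
    struct toy_node *n = calloc(1, sizeof(*n) + size);

    if (!n)
        return NULL;
    n->release = release;
    return n->data;
}

/* Models devres_get(): return an existing match, else register new_res. */
void *toy_get(struct toy_device *dev, void *new_res,
              toy_match_t match, void *match_data)
{
    struct toy_node *new_node = toy_node_of(new_res);
    struct toy_node *n;

    for (n = dev->head; n; n = n->next) {
        if (n->release != new_node->release)
            continue;
        if (match && !match(n->data, match_data))
            continue;
        free(new_node);           /* existing entry wins; drop the candidate */
        return n->data;
    }
    new_node->next = dev->head;
    dev->head = new_node;
    return new_res;
}

static void toy_rel(void *res) { (void)res; }

static int toy_match_key(void *res, void *match_data)
{
    return *(int *)res == *(int *)match_data;
}

/* Calls toy_get() twice with the same key; returns 1 if both calls
 * yield the first resource. */
int toy_demo_get_twice(void)
{
    struct toy_device dev = { NULL };
    int key = 7;
    int *a, *b, *first, *second;
    int same;

    a = toy_alloc(toy_rel, sizeof(*a));
    if (!a)
        return -1;
    b = toy_alloc(toy_rel, sizeof(*b));
    if (!b) {
        free(toy_node_of(a));
        return -1;
    }
    *a = key;
    *b = key;

    first = toy_get(&dev, a, toy_match_key, &key);   /* a is added */
    second = toy_get(&dev, b, toy_match_key, &key);  /* b is freed */
    same = (first == a) && (second == a);

    free(toy_node_of(a));
    return same;
}
```

The caller must keep using the returned pointer, because the candidate passed as new_res may no longer exist after the call.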

void * devres_remove(struct device * dev, dr_release_t release, dr_match_t match, void * match_data)

Find a device resource and remove it

Parameters

struct device * dev

Device to find resource from

dr_release_t release

Look for resources associated with this release function

dr_match_t match

Match function (optional)

void * match_data

Data for the match function

Description

Find the latest devres of dev associated with release and for which match returns 1. If match is NULL, it’s considered to match all. If found, the resource is removed atomically and returned.

Return

Pointer to removed devres on success, NULL if not found.

int devres_destroy(struct device * dev, dr_release_t release, dr_match_t match, void * match_data)

Find a device resource and destroy it

Parameters

struct device * dev

Device to find resource from

dr_release_t release

Look for resources associated with this release function

dr_match_t match

Match function (optional)

void * match_data

Data for the match function

Description

Find the latest devres of dev associated with release and for which match returns 1. If match is NULL, it’s considered to match all. If found, the resource is removed atomically and freed.

Note that the release function for the resource will not be called, only the devres-allocated data will be freed. The caller becomes responsible for freeing any other data.

Return

0 if devres is found and freed, -ENOENT if not found.

int devres_release(struct device * dev, dr_release_t release, dr_match_t match, void * match_data)

Find a device resource and destroy it, calling release

Parameters

struct device * dev

Device to find resource from

dr_release_t release

Look for resources associated with this release function

dr_match_t match

Match function (optional)

void * match_data

Data for the match function

Description

Find the latest devres of dev associated with release and for which match returns 1. If match is NULL, it’s considered to match all. If found, the resource is removed atomically, the release function is called, and the resource is freed.

Return

0 if devres is found and freed, -ENOENT if not found.
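
The difference between devres_destroy() and devres_release() is whether the release function runs. A userspace model makes this concrete. It is not kernel code: the toy_* names are hypothetical, the match filter is left out (which corresponds to passing NULL), and the release callback omits the device argument.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace toy model, NOT kernel code. Every toy_* name is hypothetical. */
struct toy_node {
    struct toy_node *next;        /* newest first */
    void (*release)(void *res);
    unsigned long long data[];
};

struct toy_device {
    struct toy_node *head;
};

static int toy_release_calls;     /* how often the callback has run */

static void toy_count_release(void *res)
{
    (void)res;
    toy_release_calls++;
}

static void *toy_alloc_add(struct toy_device *dev,
                           void (*release)(void *), size_t size)
{
    struct toy_node *n = calloc(1, sizeof(*n) + size);

    if (!n)
        return NULL;
    n->release = release;
    n->next = dev->head;
    dev->head = n;
    return n->data;
}

/* Unlink and return the newest node with this release function, or NULL. */
static struct toy_node *toy_unlink(struct toy_device *dev,
                                   void (*release)(void *))
{
    struct toy_node **pp;

    for (pp = &dev->head; *pp; pp = &(*pp)->next) {
        if ((*pp)->release == release) {
            struct toy_node *n = *pp;

            *pp = n->next;
            return n;
        }
    }
    return NULL;
}

/* Models devres_destroy(): remove and free; the callback is NOT called. */
int toy_destroy(struct toy_device *dev, void (*release)(void *))
{
    struct toy_node *n = toy_unlink(dev, release);

    if (!n)
        return -ENOENT;
    free(n);
    return 0;
}

/* Models devres_release(): remove, call the callback, then free. */
int toy_release(struct toy_device *dev, void (*release)(void *))
{
    struct toy_node *n = toy_unlink(dev, release);

    if (!n)
        return -ENOENT;
    n->release(n->data);
    free(n);
    return 0;
}

/* Destroys one resource and releases another; returns callback runs. */
int toy_demo_destroy_vs_release(void)
{
    struct toy_device dev = { NULL };
    int ok;

    toy_release_calls = 0;
    ok = toy_alloc_add(&dev, toy_count_release, sizeof(int)) &&
         toy_alloc_add(&dev, toy_count_release, sizeof(int));
    ok = ok && toy_destroy(&dev, toy_count_release) == 0;       /* no callback */
    ok = ok && toy_release(&dev, toy_count_release) == 0;       /* one callback */
    ok = ok && toy_release(&dev, toy_count_release) == -ENOENT; /* list empty */

    while (toy_destroy(&dev, toy_count_release) == 0)
        ;                         /* clean up after a partial failure */
    return ok ? toy_release_calls : -1;
}
```

Two resources are torn down, but the callback runs only once: the destroyed resource leaves any state it guarded for the caller to clean up.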

void * devres_open_group(struct device * dev, void * id, gfp_t gfp)

Open a new devres group

Parameters

struct device * dev

Device to open devres group for

void * id

Separator ID

gfp_t gfp

Allocation flags

Description

Open a new devres group for dev with id. For id, using a pointer to an object which won’t be used for another group is recommended. If id is NULL, an address-wise unique ID is created.

Return

ID of the new group, NULL on failure.

void devres_close_group(struct device * dev, void * id)

Close a devres group

Parameters

struct device * dev

Device to close devres group for

void * id

ID of target group, can be NULL

Description

Close the group identified by id. If id is NULL, the latest open group is selected.

void devres_remove_group(struct device * dev, void * id)

Remove a devres group

Parameters

struct device * dev

Device to remove group for

void * id

ID of target group, can be NULL

Description

Remove the group identified by id. If id is NULL, the latest open group is selected. Note that removing a group doesn’t affect any other resources.

int devres_release_group(struct device * dev, void * id)

Release resources in a devres group

Parameters

struct device * dev

Device to release group for

void * id

ID of target group, can be NULL

Description

Release all resources in the group identified by id. If id is NULL, the latest open group is selected. The selected group and groups properly nested inside the selected group are removed.

Return

The number of released non-group resources.
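
Group semantics can be pictured as open and close markers interleaved with ordinary resources in one list. The following userspace model is a sketch under that assumption, not kernel code: the toy_* names are hypothetical, the NULL-id shortcut is not modelled, and returning 0 for an unknown group is a choice made for the model.

```c
#include <stddef.h>

/* Userspace toy model, NOT kernel code. Every toy_* name is hypothetical. */
enum toy_kind { TOY_RES, TOY_OPEN, TOY_CLOSE };

struct toy_entry {
    enum toy_kind kind;
    void *id;                     /* group ID for markers, unused for TOY_RES */
};

struct toy_device {
    struct toy_entry e[16];       /* index 0 is the oldest entry */
    size_t n;
};

static void toy_push(struct toy_device *dev, enum toy_kind kind, void *id)
{
    if (dev->n < 16) {
        dev->e[dev->n].kind = kind;
        dev->e[dev->n].id = id;
        dev->n++;
    }
}

/* Models devres_release_group(): drop everything from the group's open
 * marker through its close marker (or through the end of the list if the
 * group is still open). Returns the number of plain resources dropped. */
int toy_release_group(struct toy_device *dev, void *id)
{
    size_t start = dev->n, end, i;
    int released = 0;

    for (i = dev->n; i-- > 0; ) {
        if (dev->e[i].kind == TOY_OPEN && dev->e[i].id == id) {
            start = i;
            break;
        }
    }
    if (start == dev->n)
        return 0;                 /* model choice: unknown group */

    end = dev->n;
    for (i = start + 1; i < dev->n; i++) {
        if (dev->e[i].kind == TOY_CLOSE && dev->e[i].id == id) {
            end = i + 1;
            break;
        }
    }
    for (i = start; i < end; i++)
        if (dev->e[i].kind == TOY_RES)
            released++;

    /* Close the gap left by the removed span. */
    for (i = end; i < dev->n; i++)
        dev->e[start + (i - end)] = dev->e[i];
    dev->n -= end - start;
    return released;
}

/* Builds res, [outer: res, [inner: res], res], res and releases "outer".
 * Returns released * 10 + entries left in the list. */
int toy_demo_groups(void)
{
    struct toy_device dev = { .n = 0 };
    int outer, inner;             /* only their addresses are used, as IDs */

    toy_push(&dev, TOY_RES, NULL);
    toy_push(&dev, TOY_OPEN, &outer);
    toy_push(&dev, TOY_RES, NULL);
    toy_push(&dev, TOY_OPEN, &inner);
    toy_push(&dev, TOY_RES, NULL);
    toy_push(&dev, TOY_CLOSE, &inner);
    toy_push(&dev, TOY_RES, NULL);
    toy_push(&dev, TOY_CLOSE, &outer);
    toy_push(&dev, TOY_RES, NULL);

    return toy_release_group(&dev, &outer) * 10 + (int)dev.n;
}
```

Releasing the outer group takes the nested group with it: three resources go away, while the two resources outside the group stay registered.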

int devm_add_action(struct device * dev, void (*action)(void *), void * data)

add a custom action to the list of managed resources

Parameters

struct device * dev

Device that owns the action

void (*)(void *) action

Function that should be called

void * data

Pointer to data passed to action implementation

Description

This adds a custom action to the list of managed resources so that it gets executed as part of standard resource unwinding.

void devm_remove_action(struct device * dev, void (*action)(void *), void * data)

remove a previously added custom action

Parameters

struct device * dev

Device that owns the action

void (*)(void *) action

Function implementing the action

void * data

Pointer to data passed to action implementation

Description

Removes an instance of action previously added by devm_add_action(), without calling it. Both action and data should match one of the existing entries.

void devm_release_action(struct device * dev, void (*action)(void *), void * data)

release a previously added custom action

Parameters

struct device * dev

Device that owns the action

void (*)(void *) action

Function implementing the action

void * data

Pointer to data passed to action implementation

Description

Releases and removes an instance of action previously added by devm_add_action(): the action is called and its entry is removed. Both action and data should match one of the existing entries.
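
The three action helpers differ in when, or whether, the callback runs. The userspace model below illustrates this. It is not kernel code: the toy_* names are hypothetical, a fixed-size table replaces managed allocation, and the -1 error value and newest-first unwinding are choices made for the model.

```c
#include <stddef.h>

/* Userspace toy model, NOT kernel code. Every toy_* name is hypothetical. */
typedef void (*toy_action_t)(void *data);

struct toy_action {
    toy_action_t action;
    void *data;
};

struct toy_device {
    struct toy_action acts[8];    /* index 0 is the oldest entry */
    size_t n;
};

/* Models devm_add_action(): 0 on success, -1 if the table is full. */
int toy_add_action(struct toy_device *dev, toy_action_t action, void *data)
{
    if (dev->n == 8)
        return -1;
    dev->acts[dev->n].action = action;
    dev->acts[dev->n].data = data;
    dev->n++;
    return 0;
}

/* Remove the newest entry whose action AND data both match.
 * Returns 1 if an entry was removed, 0 otherwise. */
static int toy_take(struct toy_device *dev, toy_action_t action, void *data)
{
    size_t i = dev->n, j;

    while (i-- > 0) {
        if (dev->acts[i].action == action && dev->acts[i].data == data) {
            for (j = i + 1; j < dev->n; j++)
                dev->acts[j - 1] = dev->acts[j];
            dev->n--;
            return 1;
        }
    }
    return 0;
}

/* Models devm_remove_action(): forget the entry without running it. */
void toy_remove_action(struct toy_device *dev, toy_action_t action, void *data)
{
    (void)toy_take(dev, action, data);
}

/* Models devm_release_action(): forget the entry and run it right away. */
void toy_release_action(struct toy_device *dev, toy_action_t action, void *data)
{
    if (toy_take(dev, action, data))
        action(data);
}

/* Models driver detach: run whatever is still registered, newest first. */
void toy_detach(struct toy_device *dev)
{
    while (dev->n > 0) {
        dev->n--;
        dev->acts[dev->n].action(dev->acts[dev->n].data);
    }
}

/* Example action: count how often it ran for a given data pointer. */
static void toy_bump(void *data)
{
    ++*(int *)data;
}

/* Returns the run counts of a, b, c as three decimal digits. */
int toy_demo_actions(void)
{
    struct toy_device dev = { .n = 0 };
    int a = 0, b = 0, c = 0;

    if (toy_add_action(&dev, toy_bump, &a) ||
        toy_add_action(&dev, toy_bump, &b) ||
        toy_add_action(&dev, toy_bump, &c))
        return -1;

    toy_remove_action(&dev, toy_bump, &a);    /* a never runs */
    toy_release_action(&dev, toy_bump, &b);   /* b runs now, exactly once */
    toy_detach(&dev);                         /* c runs at detach */
    return a * 100 + b * 10 + c;
}
```

Because entries are identified by the (action, data) pair, the same function can be registered several times with different data pointers.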

void * devm_kmalloc(struct device * dev, size_t size, gfp_t gfp)

Resource-managed kmalloc

Parameters

struct device * dev

Device to allocate memory for

size_t size

Allocation size

gfp_t gfp

Allocation gfp flags

Description

Managed kmalloc. Memory allocated with this function is automatically freed on driver detach. As with all other devres resources, the guaranteed alignment of the returned memory is that of unsigned long long.

Return

Pointer to allocated memory on success, NULL on failure.
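
One way to picture a managed allocation is a bookkeeping header followed by the payload, with the payload typed so that it is aligned for unsigned long long. The userspace sketch below assumes that layout for illustration only. It is not kernel code: the toy_* names are hypothetical and malloc() stands in for the kernel allocator.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace toy model, NOT kernel code. Every toy_* name is hypothetical. */
struct toy_devres {
    struct toy_devres *next;
    unsigned long long data[];    /* forces the payload's alignment */
};

struct toy_device {
    struct toy_devres *head;
};

/* Models devm_kmalloc(): the caller only ever sees the payload pointer. */
void *toy_kmalloc(struct toy_device *dev, size_t size)
{
    struct toy_devres *dr;

    if (size > SIZE_MAX - sizeof(*dr))
        return NULL;              /* header + payload would overflow */
    dr = malloc(sizeof(*dr) + size);
    if (!dr)
        return NULL;
    dr->next = dev->head;
    dev->head = dr;
    return dr->data;
}

/* Models driver detach: free every tracked allocation; returns how many. */
int toy_detach(struct toy_device *dev)
{
    int freed = 0;

    while (dev->head) {
        struct toy_devres *dr = dev->head;

        dev->head = dr->next;
        free(dr);
        freed++;
    }
    return freed;
}

/* Returns 1 if p satisfies the alignment of unsigned long long. */
int toy_is_aligned(const void *p)
{
    return (uintptr_t)p % _Alignof(unsigned long long) == 0;
}

/* Allocates two odd-sized buffers; returns the number freed at detach,
 * or -1 if an allocation failed or was misaligned. */
int toy_demo_kmalloc(void)
{
    struct toy_device dev = { NULL };
    char *a = toy_kmalloc(&dev, 3);
    char *b = toy_kmalloc(&dev, 17);
    int ok = a && b && toy_is_aligned(a) && toy_is_aligned(b);
    int freed = toy_detach(&dev);

    return ok ? freed : -1;
}
```

The demo never frees a or b individually; both are reclaimed when the model device detaches.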

char * devm_kstrdup(struct device * dev, const char * s, gfp_t gfp)

Allocate resource managed space and copy an existing string into that.

Parameters

struct device * dev

Device to allocate memory for

const char * s

the string to duplicate

gfp_t gfp

the GFP mask used in the devm_kmalloc() call when allocating memory

Return

Pointer to allocated string on success, NULL on failure.

const char * devm_kstrdup_const(struct device * dev, const char * s, gfp_t gfp)

resource managed conditional string duplication

Parameters

struct device * dev

device for which to duplicate the string

const char * s

the string to duplicate

gfp_t gfp

the GFP mask used in the kmalloc() call when allocating memory

Description

Strings allocated by devm_kstrdup_const() are automatically freed when the associated device is detached.

Return

The source string if it is in the .rodata section; otherwise a copy made with devm_kstrdup().

char * devm_kvasprintf(struct device * dev, gfp_t gfp, const char * fmt, va_list ap)

Allocate resource managed space and format a string into that.

Parameters

struct device * dev

Device to allocate memory for

gfp_t gfp

the GFP mask used in the devm_kmalloc() call when allocating memory

const char * fmt

The printf()-style format string

va_list ap

Arguments for the format string

Return

Pointer to allocated string on success, NULL on failure.

char * devm_kasprintf(struct device * dev, gfp_t gfp, const char * fmt, ...)

Allocate resource managed space and format a string into that.

Parameters

struct device * dev

Device to allocate memory for

gfp_t gfp

the GFP mask used in the devm_kmalloc() call when allocating memory

const char * fmt

The printf()-style format string

...

Arguments for the format string

Return

Pointer to allocated string on success, NULL on failure.
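
A common way to implement "allocate space and format a string into it" is to format twice: once to measure, once to write. The userspace sketch below assumes that approach for illustration and does not document the kernel's internals. The toy_* names are hypothetical, and malloc() stands in for devm_kmalloc(), so the caller frees the result here instead of relying on driver detach.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Userspace toy model, NOT kernel code. Every toy_* name is hypothetical. */

/* Models devm_kvasprintf(): measure with a copy of ap, then format. */
char *toy_vasprintf(const char *fmt, va_list ap)
{
    va_list aq;
    int len;
    char *p;

    va_copy(aq, ap);              /* ap itself is needed for the second pass */
    len = vsnprintf(NULL, 0, fmt, aq);
    va_end(aq);
    if (len < 0)
        return NULL;

    p = malloc((size_t)len + 1);  /* +1 for the terminating NUL */
    if (!p)
        return NULL;
    vsnprintf(p, (size_t)len + 1, fmt, ap);
    return p;
}

/* Models devm_kasprintf(): variadic front end for toy_vasprintf(). */
char *toy_asprintf(const char *fmt, ...)
{
    va_list ap;
    char *p;

    va_start(ap, fmt);
    p = toy_vasprintf(fmt, ap);
    va_end(ap);
    return p;
}

/* Returns 1 if formatting "%s-%d" produced exactly "eth-42". */
int toy_demo_asprintf(void)
{
    char *name = toy_asprintf("%s-%d", "eth", 42);
    int ok;

    if (!name)
        return -1;
    ok = strcmp(name, "eth-42") == 0 && strlen(name) == 6;
    free(name);
    return ok;
}
```

The va_copy() is what allows the argument list to be walked twice; reusing ap directly for both passes would be undefined behaviour.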

void devm_kfree(struct device * dev, const void * p)

Resource-managed kfree

Parameters

struct device * dev

Device this memory belongs to

const void * p

Memory to free

Description

Free memory allocated with devm_kmalloc().

void * devm_kmemdup(struct device * dev, const void * src, size_t len, gfp_t gfp)

Resource-managed kmemdup

Parameters

struct device * dev

Device this memory belongs to

const void * src

Memory region to duplicate

size_t len

Memory region length

gfp_t gfp

GFP mask to use

Description

Duplicate a region of memory using resource-managed kmalloc.

unsigned long devm_get_free_pages(struct device * dev, gfp_t gfp_mask, unsigned int order)

Resource-managed __get_free_pages

Parameters

struct device * dev

Device to allocate memory for

gfp_t gfp_mask

Allocation gfp flags

unsigned int order

Allocation size is (1 << order) pages

Description

Managed get_free_pages. Memory allocated with this function is automatically freed on driver detach.

Return

Address of allocated memory on success, 0 on failure.

void devm_free_pages(struct device * dev, unsigned long addr)

Resource-managed free_pages

Parameters

struct device * dev

Device this memory belongs to

unsigned long addr

Memory to free

Description

Free memory allocated with devm_get_free_pages(). Unlike free_pages, there is no need to supply the order.
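
The order argument is a power-of-two page count, and the managed free call can omit it when the order is stored alongside the address at allocation time. The userspace sketch below assumes that bookkeeping. It is not kernel code: the toy_* names are hypothetical, the 4 KiB page size is an assumption, malloc() stands in for the page allocator, and uintptr_t replaces the kernel's unsigned long for portability.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace toy model, NOT kernel code. Every toy_* name is hypothetical. */
#define TOY_PAGE_SIZE ((size_t)4096)  /* assumption: 4 KiB pages */

struct toy_pages {
    struct toy_pages *next;
    uintptr_t addr;
    unsigned int order;           /* remembered so free needs only addr */
};

struct toy_device {
    struct toy_pages *head;
    size_t bytes_live;
};

/* Allocation size for a given order: (1 << order) pages. */
size_t toy_order_to_bytes(unsigned int order)
{
    return TOY_PAGE_SIZE << order;
}

/* Models devm_get_free_pages(): returns the address, or 0 on failure. */
uintptr_t toy_get_free_pages(struct toy_device *dev, unsigned int order)
{
    struct toy_pages *rec = malloc(sizeof(*rec));
    void *mem;

    if (!rec)
        return 0;
    mem = malloc(toy_order_to_bytes(order));
    if (!mem) {
        free(rec);
        return 0;
    }
    rec->addr = (uintptr_t)mem;
    rec->order = order;
    rec->next = dev->head;
    dev->head = rec;
    dev->bytes_live += toy_order_to_bytes(order);
    return rec->addr;
}

/* Unlink one record and free both the pages and the record. */
static void toy_drop(struct toy_device *dev, struct toy_pages **pp)
{
    struct toy_pages *rec = *pp;

    *pp = rec->next;
    dev->bytes_live -= toy_order_to_bytes(rec->order);
    free((void *)rec->addr);
    free(rec);
}

/* Models devm_free_pages(): 0 on success, -1 if addr is not tracked. */
int toy_free_pages(struct toy_device *dev, uintptr_t addr)
{
    struct toy_pages **pp;

    for (pp = &dev->head; *pp; pp = &(*pp)->next) {
        if ((*pp)->addr == addr) {
            toy_drop(dev, pp);
            return 0;
        }
    }
    return -1;
}

/* Models driver detach: free what is left; returns the record count. */
int toy_detach(struct toy_device *dev)
{
    int freed = 0;

    while (dev->head) {
        toy_drop(dev, &dev->head);
        freed++;
    }
    return freed;
}

/* Frees one allocation by hand and leaves the other for detach. */
int toy_demo_pages(void)
{
    struct toy_device dev = { NULL, 0 };
    uintptr_t a = toy_get_free_pages(&dev, 0);   /* 1 page */
    uintptr_t b = toy_get_free_pages(&dev, 2);   /* 4 pages */
    int ok = a && b && dev.bytes_live == 5 * TOY_PAGE_SIZE;
    int freed;

    ok = ok && toy_free_pages(&dev, b) == 0 &&
         dev.bytes_live == TOY_PAGE_SIZE;
    freed = toy_detach(&dev);                    /* reclaims a */
    return ok ? freed : -1;
}
```

Storing the order in the record costs a few bytes per allocation but removes a class of bugs where the caller passes a mismatched order when freeing.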

void __percpu * __devm_alloc_percpu(struct device * dev, size_t size, size_t align)

Resource-managed alloc_percpu

Parameters

struct device * dev

Device to allocate per-cpu memory for

size_t size

Size of per-cpu memory to allocate

size_t align

Alignment of per-cpu memory to allocate

Description

Managed alloc_percpu. Per-cpu memory allocated with this function is automatically freed on driver detach.

Return

Pointer to allocated memory on success, NULL on failure.

void devm_free_percpu(struct device * dev, void __percpu * pdata)

Resource-managed free_percpu

Parameters

struct device * dev

Device this memory belongs to

void __percpu * pdata

Per-cpu memory to free

Description

Free memory allocated with devm_alloc_percpu().