Linux kernel threads

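In Linux, user-space processes, user-space threads, and kernel threads are all represented in the kernel by the same structure, task_struct; the kernel treats all of them as tasks, whatever they are called. As for why the term kernel thread exists, my own view is that all kernel threads share the kernel address space and its resources, which is what earns them the name thread.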

This article is based on kernel version 3.10.0-862.el7.x86_64.

Example

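A user-space process runs in user space and reaches kernel resources by trapping into the kernel through system calls. In Linux a user-space thread is almost the same thing as a user-space process; tasks are called threads when they share a thread group id and part of their resources. A kernel thread runs in kernel space and can access kernel resources directly. Creating one requires calling kernel APIs, so the demonstration here is a kernel module: when the module is loaded it starts a kernel thread that prints a message every five seconds and gives up the CPU in between; when the module is unloaded it stops that thread.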

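Because a kernel thread runs in process context rather than interrupt context, it may call sleeping functions that invoke the scheduler and give up the CPU voluntarily. The output below shows one consequence: with no CPU affinity set, the thread can be woken up on a different CPU from the one it last ran on.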

kthread_test.c

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/smp.h>
#include <linux/sched.h>
#include <linux/kthread.h>
#include <linux/delay.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Hu Yu <hyuuhit@gmail.com>");
MODULE_DESCRIPTION("kthread test");
MODULE_VERSION("0.1");

static struct task_struct *my_task = NULL;

static int my_kthread(void *data) {
    char *str = (char *)data;

    pr_info("my kthread data: %s\n", str);
    pr_info("my kthread smp_processor_id %d\n", smp_processor_id());
    while(!kthread_should_stop()) {
        msleep(5000);
        pr_info("my kthread: living. smp_processor_id %d\n", smp_processor_id());
        pr_info("=========================================\n");
    }
    pr_info("my kthread: stop\n");
    return 0;
}

static int __init my_init(void)
{
    pr_info("my init.\n");
    pr_info("smp_processor_id %d\n", smp_processor_id());

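    my_task = kthread_run(my_kthread, "hello my kthread", "mykthread-%s", "test");
    /* kthread_run reports failure with ERR_PTR(), never NULL */
    if (IS_ERR(my_task)) {
        int err = PTR_ERR(my_task);

        pr_err("kthread_run failed: %d\n", err);
        my_task = NULL;
        return err;
    }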

    pr_info("my init finish.\n");
    pr_info("=========================================\n");
    return 0;
}

static void __exit my_exit(void)
{
    pr_info("my exit.\n");
    pr_info("smp_processor_id %d\n", smp_processor_id());

    if (my_task) {
        pr_info("stop kthread\n");
        kthread_stop(my_task);
    }

    pr_info("my exit finish.\n");
    pr_info("=========================================\n");
}

module_init(my_init);
module_exit(my_exit);

Makefile

obj-m := kthread_test.o

PWD:=$(shell pwd)
KVER:=$(shell uname -r)
KDIR:=/lib/modules/$(KVER)/build

EXTRA_CFLAGS += -Wall -g

all:
	$(MAKE) -C $(KDIR) M=$(PWD) modules

clean:
	$(MAKE) -C $(KDIR) M=$(PWD) clean

Output

[57272.252152] my init.
[57272.252156] smp_processor_id 1
[57272.252222] my init finish.
[57272.252224] =========================================
[57272.252229] my kthread data: hello my kthread
[57272.252232] my kthread smp_processor_id 7
[57277.252530] my kthread: living. smp_processor_id 7
[57277.252534] =========================================
[57282.253374] my kthread: living. smp_processor_id 7
[57282.253379] =========================================
[57287.254189] my kthread: living. smp_processor_id 7
[57287.254193] =========================================
[57292.255016] my kthread: living. smp_processor_id 7
[57292.255020] =========================================
[57292.410807] my exit.
[57292.410810] smp_processor_id 1
[57292.410811] stop kthread
[57297.256008] my kthread: living. smp_processor_id 4
[57297.256013] =========================================
[57297.256014] my kthread: stop
[57297.256050] my exit finish.
[57297.256053] =========================================

Note that the exit path logs stop kthread at 57292.41 but my exit finish only at 57297.26: kthread_stop waits for the thread, and the thread does not test kthread_should_stop again until its current msleep(5000) has run its full length. The last living line also shows the thread waking on CPU 4 after running on CPU 7.

Kernel thread API

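The example above creates the kernel thread with kthread_run, stops it with kthread_stop, and has the thread itself call kthread_should_stop to decide whether to exit. These belong to a larger group of related APIs, all of which rely on the kthreadd kernel thread to create new kernel threads. The group is described below.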

The following APIs are declared in include/linux/kthread.h.

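kthread_create_on_node
Creates a kernel thread that will run the given function, allocating its memory from the given NUMA node; pass -1 if no node is preferred. Once created, the thread puts itself in the TASK_UNINTERRUPTIBLE state and gives up the CPU, waiting to be woken explicitly.
kthread_create
A macro wrapper around kthread_create_on_node with the NUMA node set to -1.
kthread_run
Calls kthread_create to create the kernel thread, then wake_up_process to wake it.
kthread_stop
Tells a kernel thread that it should stop and waits for it to do so. The request is not forcible: the thread's own code has to check whether kthread_should_stop returns true and then return or exit. If the thread function calls do_exit itself, the caller of kthread_stop must make sure a reference to the thread's task_struct is still held; otherwise kthread_stop accesses freed memory.
kthread_should_stop
Called by the thread itself to check whether someone has called kthread_stop on it.
kthread_freezable_should_stop
For freezable threads; called by the kernel thread itself. When the system is suspending, the function freezes the calling thread until the system resumes. The pointer argument reports whether the thread was frozen during the call, and the return value is the same as that of kthread_should_stop.
kthread_park
Asks a kernel thread to park (that is, pause) and waits until it has done so. This is not forcible either: the thread's own code has to see kthread_should_park return true and then call kthread_parkme, which puts the thread in the parked state and gives up the CPU until it is unparked.
kthread_should_park
Called by the thread itself to check whether it should park.
kthread_parkme
Called by the thread itself to enter the parked state.
kthread_unpark
The counterpart of kthread_park: takes the thread out of the parked state.
kthread_data
Returns the argument that was given for the thread function when the kernel thread was created. The caller must make sure the task_struct passed in really is a kthread.
probe_kthread_data
Returns the same argument as kthread_data, but returns NULL if the task_struct is not a kthread or the data cannot be accessed.
kthread_bind
Binds a kernel thread to the given CPU. The thread must be in the TASK_UNINTERRUPTIBLE state, which is the case for a thread just created by kthread_create. CPU-related APIs are in include/linux/cpumask.h and NUMA-node APIs in include/linux/nodemask.h; from user space, lscpu and cat /proc/cpuinfo show CPU and NUMA node information.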

Lifecycle overview

Before reading the code, here is the overall flow of a kernel thread's life cycle.

[Figure: kernel thread life cycle]

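Creating a kernel thread

Kernel thread creation follows much the same idea as interrupt handling: finish the top half quickly and defer the time-consuming work to a bottom half that runs later, so that the CPU is freed for higher-priority tasks.

Top half

As the API list above shows, every way of creating a kernel thread ends up in kthread_create_on_node. Its logic is simple:

Allocate a kthread_create_info structure on the stack, fill in the required members, and append it to the kthread_create_list queue.
Wake the kthreadd thread (kthreadd_task is the task_struct of the kthreadd daemon thread).
Wait for the done completion in kthread_create_info to be signalled (by the newly created thread on success, or by kthreadd itself if creation fails).
Once done is signalled, the result member, which is the task_struct of the new kernel thread, may be read. Set the thread's name, scheduling policy, and allowed CPUs.
Return the task_struct pointer. At this point the new kernel thread is in the TASK_UNINTERRUPTIBLE state, waiting for wake_up_process.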

/**
 * kthread_create_on_node - create a kthread.
 * @threadfn: the function to run until signal_pending(current).
 * @data: data ptr for @threadfn.
 * @node: memory node number.
 * @namefmt: printf-style name for the thread.
 *
 * Description: This helper function creates and names a kernel
 * thread.  The thread will be stopped: use wake_up_process() to start
 * it.  See also kthread_run().
 *
 * If thread is going to be bound on a particular cpu, give its node
 * in @node, to get NUMA affinity for kthread stack, or else give -1.
 * When woken, the thread will run @threadfn() with @data as its
 * argument. @threadfn() can either call do_exit() directly if it is a
 * standalone thread for which no one will call kthread_stop(), or
 * return when 'kthread_should_stop()' is true (which means
 * kthread_stop() has been called).  The return value should be zero
 * or a negative error number; it will be passed to kthread_stop().
 *
 * Returns a task_struct or ERR_PTR(-ENOMEM).
 */
struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
        void *data, int node,
        const char namefmt[],
        ...)
{
    struct kthread_create_info create;

    create.threadfn = threadfn;
    create.data = data;
    create.node = node;
    init_completion(&create.done);

    spin_lock(&kthread_create_lock);
    list_add_tail(&create.list, &kthread_create_list);
    spin_unlock(&kthread_create_lock);

    wake_up_process(kthreadd_task);
    wait_for_completion(&create.done);

    if (!IS_ERR(create.result)) {
        static const struct sched_param param = { .sched_priority = 0 };
        va_list args;

        va_start(args, namefmt);
        vsnprintf(create.result->comm, sizeof(create.result->comm),
                namefmt, args);
        va_end(args);
        /*
         * root may have changed our (kthreadd's) priority or CPU mask.
         * The kernel thread should not inherit these properties.
         */
        sched_setscheduler_nocheck(create.result, SCHED_NORMAL, &param);
        set_cpus_allowed_ptr(create.result, cpu_all_mask);
    }
    return create.result;
}

This function uses the structure kthread_create_info, shown here.

struct kthread_create_info
{
    /* Information passed to kthread() from kthreadd. */
    /* function the kernel thread will run */
    int (*threadfn)(void *data);
    /* the single argument passed to that function */
    void *data;
    /* numa node */
    int node;

    /* Result passed back to kthread_create() from kthreadd. */
    /* return value of kthread_create_on_node */
    struct task_struct *result;
    /* signalled once creation of the kernel thread has finished */
    struct completion done;

    /* links this request into the kthread_create_list queue */
    struct list_head list;
};
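
The doc comment quoted above says kthread_create_on_node returns either a task_struct or ERR_PTR(-ENOMEM), and the function tests create.result with IS_ERR. As a rough userspace illustration of how an error code can travel inside a pointer, the sketch below re-implements the idea with hypothetical demo_ helpers; the kernel's real versions live in include/linux/err.h, and demo_create is an invented stand-in for a creator function.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace re-implementations, for illustration only.
 * Error codes occupy the top DEMO_MAX_ERRNO addresses, which no valid
 * object is assumed to use. */
#define DEMO_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *demo_err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* Decode the errno value carried by an error pointer. */
static inline long demo_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True only for pointers in the reserved error range; NULL is not one. */
static inline int demo_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-DEMO_MAX_ERRNO;
}

static int demo_object;

/* Invented creator: fails with -ENOMEM on request, otherwise returns
 * a pointer to a real object. */
static void *demo_create(int fail)
{
    if (fail)
        return demo_err_ptr(-ENOMEM);
    return &demo_object;
}
```

A caller would test the return value with demo_is_err before using it. Because an encoded error is a non-NULL pointer, a plain NULL check on the result of a function in this style does not detect failure.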

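Bottom half

The bottom half of kernel thread creation is done in the kthreadd daemon thread. This section covers where kthreadd is created and how it works.

kthreadd is a kernel daemon thread with pid 2. It handles requests to create kernel threads and is the parent of the other kernel threads. One exception is thread 1, for reasons shown below. (Kernel threads can also be created without kthreadd: the kernel_thread function creates one directly, and kthreadd itself calls it. Using it directly is not recommended, though, and the kernel version used here does not export the symbol, so a kernel module cannot call it.)

kthreadd is started on the call path start_kernel -> rest_init. Everything before that belongs to system boot and is not covered here.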

static noinline void __init_refok rest_init(void)
{
    int pid;

    rcu_scheduler_starting();
    /*
     * We need to spawn init first so that it obtains pid 1, however
     * the init task will end up wanting to create kthreads, which, if
     * we schedule it before we create kthreadd, will OOPS.
     */
    // Thread 1 is created here, before kthreadd, and its parent is thread 0
    kernel_thread(kernel_init, NULL, CLONE_FS | CLONE_SIGHAND);
    numa_default_policy();
    // Thread 2, kthreadd, is created here
    pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
    rcu_read_lock();
    kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
    rcu_read_unlock();
    // Global completion marking that kthreadd is set up; thread 1 waits for it before continuing.
    complete(&kthreadd_done);

    /*
     * The boot idle thread must execute schedule()
     * at least once to get things moving:
     */
    init_idle_bootup_task(current);
    schedule_preempt_disabled();
    /* Call into cpu_idle with preempt disabled */
    cpu_startup_entry(CPUHP_ONLINE);
}

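The implementation of kernel_thread is not important for now; it is enough to know that it creates a new thread that runs the function passed in. The function run by thread 2 is kthreadd.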

int kthreadd(void *unused)
{
    struct task_struct *tsk = current;

    /* Setup a clean context for our children to inherit. */
    /* set the task name */
    set_task_comm(tsk, "kthreadd");
    /* ignore all signals */
    ignore_signals(tsk);
    /* allow this task to run on every cpu */
    set_cpus_allowed_ptr(tsk, cpu_all_mask);
    /* allow this task to allocate from every memory node */
    set_mems_allowed(node_states[N_MEMORY]);

    /* mark this task as not freezable */
    current->flags |= PF_NOFREEZE;

    for (;;) {
        set_current_state(TASK_INTERRUPTIBLE);
        /* give up the cpu if kthread_create_list is empty */
        if (list_empty(&kthread_create_list))
            schedule();
        /* woken up and running again */
        __set_current_state(TASK_RUNNING);

        /* the list is shared, so take the lock before touching it */
        spin_lock(&kthread_create_lock);
        while (!list_empty(&kthread_create_list)) {
            struct kthread_create_info *create;

            /* while the list is not empty, take one kthread_create_info at a time */
            create = list_entry(kthread_create_list.next,
                    struct kthread_create_info, list);
            /* unlink it from the list */
            list_del_init(&create->list);
            spin_unlock(&kthread_create_lock);

            /* create the kernel thread */
            create_kthread(create);

            spin_lock(&kthread_create_lock);
        }
        spin_unlock(&kthread_create_lock);
    }

    return 0;
}

As shown, kthreadd calls create_kthread to create the kernel thread, passing the kthread_create_info structure (which lives on the stack of kthread_create_on_node). Next, create_kthread.

static void create_kthread(struct kthread_create_info *create)
{
    int pid;

#ifdef CONFIG_NUMA
    current->pref_node_fork = create->node;
#endif
    /* We want our own signal handler (we take no signals by default). */
    pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD);
    if (pid < 0) {
        create->result = ERR_PTR(pid);
        complete(&create->done);
    }
}

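create_kthread is short: it calls kernel_thread to create a kernel thread that runs the function kthread. Note that the new task_struct is created inside kernel_thread, while the result member of kthread_create_info, a task_struct pointer, is filled in by the new thread itself in kthread (or set to an error pointer here if kernel_thread fails). Next, kthread.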

static int kthread(void *_create)
{
    /* Copy data: it's on kthread's stack */
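    /*
     * Save the pointers held in the kthread_create_info argument on this
     * function's own stack. The argument lives on the stack of
     * kthread_create_on_node, which is still blocked waiting for creation
     * to finish, so that memory is valid for now. Once
     * complete(&create->done) has run, kthread_create_on_node may return
     * at any moment; its stack memory then becomes invalid and the
     * pointers in it can no longer be read.
     * result does not need saving because kthread_create_on_node hands it
     * back as its return value.
     */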
    struct kthread_create_info *create = _create;
    int (*threadfn)(void *data) = create->threadfn;
    void *data = create->data;
    /* the struct kthread below is used by many operations on kernel threads */
    struct kthread self;
    int ret;

    self.flags = 0;
    self.data = data;
    init_completion(&self.exited);
    init_completion(&self.parked);
    /* this assignment makes it possible to find the struct kthread above from the task_struct */
    current->vfork_done = &self.exited;

    /* OK, tell user we're spawned, wait for stop or wakeup */
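    /* set the state of the current thread */
    __set_current_state(TASK_UNINTERRUPTIBLE);
    /* becomes the return value of kthread_create_on_node */
    create->result = current;
    /* signal kthread_create_on_node that the thread has been created; my understanding is that this implies a memory barrier, though I have not verified it */
    complete(&create->done);
    /* give up the cpu and wait for wake_up_process to wake this thread */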
    schedule();

    ret = -EINTR;

    /* after waking, first check whether we have been asked to stop */
    if (!test_bit(KTHREAD_SHOULD_STOP, &self.flags)) {
        /* then check whether we should park */
        __kthread_parkme(&self);
        /* run the function given at creation with the given argument */
        ret = threadfn(data);
    }
    /* we can't just return, we must preserve "self" on stack */
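    /*
     * The function given at creation may simply return, but kthread
     * itself must not: returning would release the stack frame that
     * holds "self". do_exit instead cleans up all of the thread's
     * resources and finally gives up the cpu through the scheduler.
     */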
    do_exit(ret);
}

struct kthread {
    unsigned long flags;
    unsigned int cpu;
    void *data;
    struct completion parked;
    struct completion exited;
};

That completes the walk through how kthreadd creates a kernel thread.
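
The two halves can be imitated in userspace to make the handoff easier to follow. The sketch below is only an analogy built on pthreads, not kernel code: struct handoff_request stands in for kthread_create_info, handoff_daemon for kthreadd, and a condition variable for the done completion. All names are hypothetical, and unlike the kernel it does the work while holding the lock to stay short.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kthread_create_info. */
struct handoff_request {
    int input;                    /* what the requester asks for */
    int result;                   /* filled in by the daemon */
    int done;                     /* plays the role of the completion */
    struct handoff_request *next; /* links into the pending queue */
};

static pthread_mutex_t handoff_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t handoff_wake = PTHREAD_COND_INITIALIZER; /* wakes daemon */
static pthread_cond_t handoff_done = PTHREAD_COND_INITIALIZER; /* wakes requester */
static struct handoff_request *handoff_pending;
static int handoff_quit;

/* Bottom half: sleep until a request is queued, then serve it. */
static void *handoff_daemon(void *unused)
{
    (void)unused;
    pthread_mutex_lock(&handoff_lock);
    for (;;) {
        while (!handoff_pending && !handoff_quit)
            pthread_cond_wait(&handoff_wake, &handoff_lock);
        if (!handoff_pending)
            break; /* asked to quit and nothing left to serve */

        struct handoff_request *req = handoff_pending;
        handoff_pending = req->next;
        /* Read the request before signalling: once done is set the
         * requester may return and its stack frame may be reused. */
        int input = req->input;
        req->result = input * 2;
        req->done = 1;
        pthread_cond_broadcast(&handoff_done);
        /* req must not be touched from here on */
    }
    pthread_mutex_unlock(&handoff_lock);
    return NULL;
}

/* Top half: queue a stack-allocated request, wake the daemon, wait. */
static int handoff_submit(int input)
{
    struct handoff_request req = { .input = input };

    pthread_mutex_lock(&handoff_lock);
    req.next = handoff_pending;
    handoff_pending = &req;
    pthread_cond_signal(&handoff_wake);
    while (!req.done)
        pthread_cond_wait(&handoff_done, &handoff_lock);
    pthread_mutex_unlock(&handoff_lock);
    return req.result;
}

/* Start the daemon, submit one request, shut the daemon down. */
static int handoff_demo(void)
{
    pthread_t tid;
    int result;

    handoff_quit = 0;
    if (pthread_create(&tid, NULL, handoff_daemon, NULL) != 0)
        return -1;
    result = handoff_submit(21);

    pthread_mutex_lock(&handoff_lock);
    handoff_quit = 1;
    pthread_cond_signal(&handoff_wake);
    pthread_mutex_unlock(&handoff_lock);
    pthread_join(tid, NULL);
    return result;
}
```

Calling handoff_demo() from a main function runs one request through the daemon. The point to notice is the same one the comments in kthread make: the request lives on the requester's stack, so the worker must finish with it before signalling completion.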

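Stopping a kernel thread

The logic of kthread_stop is as follows:

First take a reference on the task_struct so that its memory is not freed as soon as the kernel thread exits; the exit value still has to be read from it.
Get the struct kthread that lives on the stack of the kthread function.
Set the KTHREAD_SHOULD_STOP flag. This is the flag that kthread_should_stop tests.
Clear the KTHREAD_SHOULD_PARK flag, because the thread may currently be parked.
Wake the kernel thread.
Wait for the thread's exited completion. It is signalled on the path kthread -> do_exit through the task_struct member vfork_done, which points at this exited completion.
Read the kernel thread's exit value.
Drop the reference on the task_struct, which frees its resources if this was the last reference.
Return the kernel thread's exit value.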

/**
 * kthread_stop - stop a thread created by kthread_create().
 * @k: thread created by kthread_create().
 *
 * Sets kthread_should_stop() for @k to return true, wakes it, and
 * waits for it to exit. This can also be called after kthread_create()
 * instead of calling wake_up_process(): the thread will exit without
 * calling threadfn().
 *
 * If threadfn() may call do_exit() itself, the caller must ensure
 * task_struct can't go away.
 *
 * Returns the result of threadfn(), or %-EINTR if wake_up_process()
 * was never called.
 */
int kthread_stop(struct task_struct *k)
{
    struct kthread *kthread;
    int ret;

    trace_sched_kthread_stop(k);

    get_task_struct(k);
    kthread = to_live_kthread(k);
    if (kthread) {
        set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);
        __kthread_unpark(k, kthread);
        wake_up_process(k);
        wait_for_completion(&kthread->exited);
    }
    ret = k->exit_code;
    put_task_struct(k);

    trace_sched_kthread_stop_ret(ret);
    return ret;
}
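
The set flag, wake, wait sequence in kthread_stop can be mimicked in userspace. The following is a pthread analogy under stated assumptions, not the kernel mechanism: struct stoppable, stoppable_fn, and stoppable_stop are hypothetical names, the exit code 7 is arbitrary, and pthread_join plays the part of waiting on the exited completion.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for the state kthread_stop works with. */
struct stoppable {
    pthread_mutex_t lock;
    pthread_cond_t wake;
    int should_stop;   /* plays the role of KTHREAD_SHOULD_STOP */
    int ticks;         /* how many times the loop body ran */
    pthread_t tid;
};

/* Thread body: loop until asked to stop, sleeping between rounds. */
static void *stoppable_fn(void *arg)
{
    struct stoppable *s = arg;

    pthread_mutex_lock(&s->lock);
    while (!s->should_stop) {      /* like kthread_should_stop() */
        s->ticks++;
        /* sleep until someone wakes us, then re-check the flag */
        pthread_cond_wait(&s->wake, &s->lock);
    }
    pthread_mutex_unlock(&s->lock);
    return (void *)(intptr_t)7;    /* stands in for threadfn's return value */
}

/* Like kthread_stop: set the flag, wake the thread, wait for it to
 * exit, and hand back its exit code. */
static int stoppable_stop(struct stoppable *s)
{
    void *ret = NULL;

    pthread_mutex_lock(&s->lock);
    s->should_stop = 1;
    pthread_cond_signal(&s->wake);
    pthread_mutex_unlock(&s->lock);
    pthread_join(s->tid, &ret);
    return (int)(intptr_t)ret;
}

/* Start a thread and stop it again, returning its exit code. */
static int stoppable_demo(void)
{
    struct stoppable s = { .should_stop = 0 };
    int code;

    pthread_mutex_init(&s.lock, NULL);
    pthread_cond_init(&s.wake, NULL);
    if (pthread_create(&s.tid, NULL, stoppable_fn, &s) != 0)
        return -1;
    code = stoppable_stop(&s);
    pthread_cond_destroy(&s.wake);
    pthread_mutex_destroy(&s.lock);
    return code;
}
```

As with kthread_stop, nothing here forces the thread to end: it exits only because its own loop tests the flag after each wake-up.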

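The section on kthreadd mentioned that the kthread function keeps a struct kthread on its stack whose address can be recovered from the task_struct (a function and a structure sharing one name makes this something of a tongue twister). The trick is to take the address of a member and subtract that member's offset within the structure to get back to the structure's address. All operations on a kernel thread go through this structure.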

#define container_of(ptr, type, member) ({                      \
    const typeof( ((type *)0)->member ) *__mptr = (ptr);    \
    (type *)( (char *)__mptr - offsetof(type,member) );})

#define __to_kthread(vfork) \
    container_of(vfork, struct kthread, exited)

static struct kthread *to_live_kthread(struct task_struct *k)
{
    struct completion *vfork = ACCESS_ONCE(k->vfork_done);
    if (likely(vfork))
        return __to_kthread(vfork);
    return NULL;
}
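
To see the pointer arithmetic in isolation, here is a small userspace sketch. The structures are simplified, hypothetical stand-ins (lookup_task for task_struct, lookup_kthread for struct kthread), and lookup_container_of is a portable variant of the macro above that drops the GNU typeof type check.

```c
#include <stddef.h>

/* Portable variant of container_of: step back from a member's address
 * by that member's offset to reach the enclosing structure. */
#define lookup_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified, hypothetical stand-ins; only the layout idea matters. */
struct lookup_completion {
    int done;
};

struct lookup_kthread {
    unsigned long flags;
    void *data;
    struct lookup_completion parked;
    struct lookup_completion exited;
};

struct lookup_task {
    struct lookup_completion *vfork_done;
};

/* Mirrors the shape of to_live_kthread: NULL when no pointer is set. */
static struct lookup_kthread *lookup_to_kthread(struct lookup_task *t)
{
    struct lookup_completion *vfork = t->vfork_done;

    if (!vfork)
        return NULL;
    return lookup_container_of(vfork, struct lookup_kthread, exited);
}

/* Returns 1 when the structure is recovered from its member's address
 * and a task without the pointer yields NULL. */
static int lookup_roundtrip(void)
{
    struct lookup_kthread self = { .flags = 0x5 };
    struct lookup_task task = { .vfork_done = &self.exited };
    struct lookup_task empty = { .vfork_done = NULL };
    struct lookup_kthread *found = lookup_to_kthread(&task);

    return found == &self && found->flags == 0x5 &&
           lookup_to_kthread(&empty) == NULL;
}
```

The task only stores a pointer to the exited member, yet that is enough to reach flags, data, and parked, because the member's offset inside the structure is known at compile time.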