Linux boot process (16): start_kernel() -> rest_init(), boot completes

Date: 2021-09-09 01:27:51

Kernel initialization. Part 10.

Source-code annotations for kernel 5.10.13 have been added to the original text.

End of the linux kernel initialization process

This is the tenth part of the chapter about the Linux kernel initialization process. In the previous part we saw the initialization of RCU and stopped at the call of the acpi_early_init function. This part is the last one in the kernel initialization chapter, so let's finish it.

After the call of the acpi_early_init function in init/main.c, we can see the following code:

#ifdef CONFIG_X86_ESPFIX64
    init_espfix_bsp();
#endif

Here we can see the call of the init_espfix_bsp function, which depends on the CONFIG_X86_ESPFIX64 kernel configuration option. In 5.10.13 it is defined as follows:

void __init init_espfix_bsp(void)
{
    pgd_t *pgd;
    p4d_t *p4d;

    /* Install the espfix pud into the kernel page directory */
    pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
    p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
    p4d_populate(&init_mm, p4d, espfix_pud_page);

    /* Randomize the locations */
    init_espfix_random();

    /* The rest is the same as for any other processor */
    init_espfix_ap(0);
}

As we can understand from the function name, it does something with the stack. This function is defined in arch/x86/kernel/espfix_64.c and prevents leaking bits 31:16 of the esp register when returning to a 16-bit stack. First of all, init_espfix_bsp installs the espfix page upper directory into the kernel page directory. In the older kernel covered by the original article this looked like:

pgd_p = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
pgd_populate(&init_mm, pgd_p, (pud_t *)espfix_pud_page);

Where ESPFIX_BASE_ADDR is:

#define PGDIR_SHIFT         39
#define ESPFIX_PGD_ENTRY    _AC(-2, UL)
#define ESPFIX_BASE_ADDR    (ESPFIX_PGD_ENTRY << PGDIR_SHIFT)

Also we can find it in the Documentation/x86/x86_64/mm:

... unused hole ...
ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
... unused hole ...
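To see that the macros really produce the range listed in the memory map, the shift can be evaluated in a small user-space program. This is only a sketch: the helper names are hypothetical, and (uint64_t)-2 stands in for _AC(-2, UL) under the assumption of a 64-bit unsigned long.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the macros quoted in the text. */
#define PGDIR_SHIFT      39
/* Assumption: stands in for _AC(-2, UL) on a 64-bit build. */
#define ESPFIX_PGD_ENTRY ((uint64_t)-2)

/* Hypothetical helper: evaluates ESPFIX_PGD_ENTRY << PGDIR_SHIFT. */
static uint64_t espfix_base_addr(void)
{
    assert(PGDIR_SHIFT < 64); /* the shift must stay inside the 64-bit width */
    return ESPFIX_PGD_ENTRY << PGDIR_SHIFT;
}

/* Hypothetical helper: last byte covered by one page global directory entry. */
static uint64_t espfix_region_end(void)
{
    return espfix_base_addr() + ((UINT64_C(1) << PGDIR_SHIFT) - 1);
}
```

Under these assumptions espfix_base_addr() evaluates to 0xffffff0000000000 and espfix_region_end() to 0xffffff7fffffffff, the two ends of the "%esp fixup stacks" line.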

After we've filled the page global directory with the espfix pud, the next step is the call of the init_espfix_random and init_espfix_ap functions. The first function picks random locations for the espfix page, and the second enables espfix for the current CPU.

After init_espfix_bsp has finished its work, we can see the call of the thread_info_cache_init function, which is defined in kernel/fork.c and allocates a cache for thread_info if THREAD_SIZE is less than PAGE_SIZE:

#if THREAD_SIZE >= PAGE_SIZE
...
...
...
void thread_info_cache_init(void)
{
    thread_info_cache = kmem_cache_create("thread_info", THREAD_SIZE,
                                          THREAD_SIZE, 0, NULL);
    BUG_ON(thread_info_cache == NULL);
}
...
...
...
#endif

In 5.10.13 it is:

void thread_stack_cache_init(void) /* thread stacks */
{
    thread_stack_cache = kmem_cache_create_usercopy("thread_stack",
                    THREAD_SIZE, THREAD_SIZE, 0, 0,
                    THREAD_SIZE, NULL);
    BUG_ON(thread_stack_cache == NULL);
}

As we already know, PAGE_SIZE is (_AC(1,UL) << PAGE_SHIFT), or 4096 bytes, and THREAD_SIZE is (PAGE_SIZE << THREAD_SIZE_ORDER), or 16384 bytes, on x86_64.
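The two sizes can be checked with a few lines of arithmetic. The sketch below assumes PAGE_SHIFT is 12 and THREAD_SIZE_ORDER is 2, the values that produce the 4096 and 16384 bytes stated above; the function names are hypothetical.

```c
#include <assert.h>

/* Assumed x86_64 values; the text only states the resulting sizes. */
#define PAGE_SHIFT        12
#define THREAD_SIZE_ORDER 2

/* Mirrors (_AC(1,UL) << PAGE_SHIFT). */
static unsigned long page_size(void)
{
    return 1UL << PAGE_SHIFT;
}

/* Mirrors (PAGE_SIZE << THREAD_SIZE_ORDER). */
static unsigned long thread_size(void)
{
    assert(PAGE_SHIFT + THREAD_SIZE_ORDER < 32); /* keep the shift in range */
    return page_size() << THREAD_SIZE_ORDER;
}

/* The rule from the text: a thread_info cache only if THREAD_SIZE < PAGE_SIZE. */
static int uses_thread_info_cache(void)
{
    return thread_size() < page_size();
}
```

With these values the stack spans four pages, so by the rule quoted in the text the THREAD_SIZE < PAGE_SIZE case does not apply on x86_64.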

The next function after thread_info_cache_init is cred_init from kernel/cred.c. This function just allocates a cache for the credentials (like uid, gid, etc.):

/*
 * initialise the credentials stuff
 */
void __init cred_init(void)
{
    /* allocate a slab in which we can store credentials */
    cred_jar = kmem_cache_create("cred_jar", sizeof(struct cred), 0,
            SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
}

You can read more about credentials in Documentation/security/credentials.txt.

The next step is the fork_init function from kernel/fork.c. The fork_init function allocates a cache for task_struct. Let's look at the implementation of fork_init.

First of all we can see the definition of the ARCH_MIN_TASKALIGN macro and the creation of a slab where task_structs will be allocated:

#ifndef CONFIG_ARCH_TASK_STRUCT_ALLOCATOR
#ifndef ARCH_MIN_TASKALIGN
#define ARCH_MIN_TASKALIGN  L1_CACHE_BYTES
#endif
    task_struct_cachep =
        kmem_cache_create("task_struct", sizeof(struct task_struct),
            ARCH_MIN_TASKALIGN, SLAB_PANIC | SLAB_NOTRACK, NULL);
#endif

As we can see, this code depends on the CONFIG_ARCH_TASK_STRUCT_ALLOCATOR kernel configuration option. This option indicates that the given architecture provides its own alloc_task_struct. Because of the #ifndef, the block is compiled only when the option is not set; as x86_64 has no alloc_task_struct function of its own, the generic task_struct cache above is the one used on x86_64.

Allocating cache for init task

After this we can see the call of the arch_task_cache_init function in fork_init.

In 5.10.13 arch_task_cache_init is empty; the original article shows the older implementation:

void arch_task_cache_init(void)
{
    task_xstate_cachep =
        kmem_cache_create("task_xstate", xstate_size,
                  __alignof__(union thread_xstate),
                  SLAB_PANIC | SLAB_NOTRACK, NULL);
    setup_xstate_comp();
}

The arch_task_cache_init function initializes the architecture-specific caches. In our case it is x86_64, so, as we can see, arch_task_cache_init allocates a cache for task_xstate, which represents the FPU state, and sets up the offsets and sizes of all extended states in the xsave area with the call of the setup_xstate_comp function. After arch_task_cache_init we calculate the default maximum number of threads with:

set_max_threads(MAX_THREADS);

where default maximum number of threads is:

#define FUTEX_TID_MASK  0x3fffffff
#define MAX_THREADS     FUTEX_TID_MASK
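The body of set_max_threads is not shown here. As a rough illustration only, the sketch below assumes that the limit is derived from the amount of RAM, so that thread stacks may use at most one eighth of memory, and is then capped by the suggested maximum. The divisor 8, the lower bound of 20 and every name starting with model_ or ending in _ASSUMED are assumptions of this sketch, not quotes from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

#define FUTEX_TID_MASK 0x3fffffff
#define MAX_THREADS    FUTEX_TID_MASK

/* Assumptions of this sketch, not values quoted in the text. */
#define MIN_THREADS_ASSUMED    20
#define THREADS_FACTOR_ASSUMED 8

/* Simplified user-space model of deriving a thread limit from RAM size. */
static uint64_t model_max_threads(uint64_t ram_bytes, uint64_t thread_size,
                                  uint64_t suggested)
{
    uint64_t threads;

    assert(thread_size > 0);
    threads = ram_bytes / (thread_size * THREADS_FACTOR_ASSUMED);

    if (threads > suggested)            /* never exceed the suggested cap */
        threads = suggested;
    if (threads < MIN_THREADS_ASSUMED)  /* keep a small machine usable */
        threads = MIN_THREADS_ASSUMED;
    if (threads > MAX_THREADS)          /* hard upper bound */
        threads = MAX_THREADS;
    return threads;
}
```

In this model a machine with 16 GiB of RAM and 16 KiB thread stacks gets a limit of 131072 threads, far below MAX_THREADS, so on ordinary hardware the RAM-based term decides the result rather than the suggested maximum.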

At the end of the fork_init function we initialize the resource limits stored in the signal descriptor of init_task:

init_task.signal->rlim[RLIMIT_NPROC].rlim_cur = max_threads/2;
init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2;
init_task.signal->rlim[RLIMIT_SIGPENDING] =
    init_task.signal->rlim[RLIMIT_NPROC];

As we know, init_task is an instance of the task_struct structure, so it contains the signal field, of type struct signal_struct, which is the signal descriptor of the process. On the first two lines we can see the setting of the current and maximum values of a resource limit. Every process has an associated set of resource limits. These limits specify the amount of resources which the current process can use. Here rlim is a resource control limit, represented by the:

struct rlimit {
    __kernel_ulong_t    rlim_cur;
    __kernel_ulong_t    rlim_max;
};

structure from include/uapi/linux/resource.h. In our case the resources are RLIMIT_NPROC, which is the maximum number of processes that a user can own, and RLIMIT_SIGPENDING, the maximum number of pending signals. We can see them in:

cat /proc/self/limits
Limit                     Soft Limit           Hard Limit           Units
...
...
...
Max processes             63815                63815                processes
Max pending signals       63815                63815                signals
...
...
...

Initialization of the caches

The next function after fork_init is proc_caches_init from kernel/fork.c.

void __init proc_caches_init(void) /* these caches can be seen in /proc/slabinfo */
{
    unsigned int mm_size;

    sighand_cachep = kmem_cache_create("sighand_cache",
            sizeof(struct sighand_struct), 0,
            SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
            SLAB_ACCOUNT, sighand_ctor);
    signal_cachep = kmem_cache_create("signal_cache",
            sizeof(struct signal_struct), 0,
            SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
            NULL);
    files_cachep = kmem_cache_create("files_cache",
            sizeof(struct files_struct), 0,
            SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
            NULL);
    fs_cachep = kmem_cache_create("fs_cache",
            sizeof(struct fs_struct), 0,
            SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
            NULL);

    /*
     * The mm_cpumask is located at the end of mm_struct, and is
     * dynamically sized based on the maximum CPU number this system
     * can have, taking hotplug into account (nr_cpu_ids).
     */
    mm_size = sizeof(struct mm_struct) + cpumask_size();

    mm_cachep = kmem_cache_create_usercopy("mm_struct",
            mm_size, ARCH_MIN_MMSTRUCT_ALIGN,
            SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
            offsetof(struct mm_struct, saved_auxv),
            sizeof_field(struct mm_struct, saved_auxv),
            NULL);
    vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
    mmap_init();          /* initialize the percpu counter for VM and the region record slabs */
    nsproxy_cache_init(); /* allocate the namespace proxy cache */
}

This function allocates caches for the memory descriptors (the mm_struct structure). At the beginning of proc_caches_init we can see the allocation of different SLAB caches with the call of kmem_cache_create:

sighand_cachep: manages information about installed signal handlers;
signal_cachep: manages information about the process signal descriptor;
files_cachep: manages information about opened files;
fs_cachep: manages filesystem information.

On my system:

[rongtao@localhost src]$ sudo cat /proc/slabinfo | grep -e signal -e fs_cache -e signal -e files_cache -e mm_struct
mm_struct 180 180 1600 20 8 : tunables 0 0 0 : slabdata 9 9 0
files_cache 459 459 640 51 8 : tunables 0 0 0 : slabdata 9 9 0
signal_cache 560 560 1152 28 8 : tunables 0 0 0 : slabdata0

After this we allocate a SLAB cache for the mm_struct structures (the older form of the call is shown here):

mm_cachep = kmem_cache_create("mm_struct",
        sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
        SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_NOTRACK, NULL);

After this we allocate a SLAB cache for the important vm_area_struct, which is used by the kernel to manage the virtual memory space:

vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC);

Note that we use the KMEM_CACHE macro here instead of kmem_cache_create. This macro is defined in include/linux/slab.h and just expands to a kmem_cache_create call:

#define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\
        sizeof(struct __struct), __alignof__(struct __struct),\
        (__flags), NULL)

KMEM_CACHE has one difference from kmem_cache_create. Take a look at the __alignof__ operator. The KMEM_CACHE macro aligns objects to the alignment requirement of the given structure (its __alignof__, not its size), whereas a direct kmem_cache_create call uses whatever alignment value the caller passes. The macro also takes the cache name from the structure name.
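The difference between the size of a structure and its alignment requirement is easy to observe in user space. The sketch below imitates the shape of KMEM_CACHE with a hypothetical MODEL_KMEM_CACHE macro that only records the three values the real macro would pass on; no kernel API is involved, and C11 _Alignof is used in place of __alignof__.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record of what a cache-creation call would receive. */
struct cache_request {
    const char *name;
    size_t size;
    size_t align;
};

/* User-space imitation of the KMEM_CACHE pattern: name, size, alignment. */
#define MODEL_KMEM_CACHE(__struct) \
    ((struct cache_request){ #__struct, sizeof(struct __struct), \
                             _Alignof(struct __struct) })

/* A structure whose size is larger than its alignment requirement. */
struct sample {
    char tag;
    char payload[6];
};

static struct cache_request sample_request(void)
{
    struct cache_request r = MODEL_KMEM_CACHE(sample);

    assert(r.align != 0); /* every object type has a nonzero alignment */
    return r;
}
```

For struct sample the recorded name is "sample", the size covers all seven chars, and the alignment is only that of a single char, which shows that the macro passes the alignment requirement and not the size.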

After this we can see the call of the mmap_init and nsproxy_cache_init functions. The first function initializes the virtual memory area SLAB and the second function initializes the SLAB for namespaces.

int __init nsproxy_cache_init(void) /* namespace proxy cache */
{
    nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);
    return 0;
}

The next function after proc_caches_init is buffer_init. This function is defined in the fs/buffer.c source code file and allocates a cache for buffer_head. The buffer_head is a special structure which is defined in include/linux/buffer_head.h and is used for managing buffers.

$ sudo cat /proc/slabinfo | grep buffer
[sudo] password for rongtao:
buffer_head 486781 594984 104 39 1 : tunables 0 0 0 : slabdata 15256 15256 0

At the start of the buffer_init function we allocate a cache for the struct buffer_head structures with the call of the kmem_cache_create function, as we did in the previous functions. Then we calculate the maximum number of buffer heads in memory with:

nrpages = (nr_free_buffer_pages() * 10) / 100;
max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));

which will be equal to 10% of ZONE_NORMAL (all RAM from the 4GB on x86_64).
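The calculation is simple enough to reproduce with plain integers. In the sketch below the function name is hypothetical, and the object size of 104 bytes used in the usage note is taken from the buffer_head line of the slabinfo output quoted above, not from the kernel headers.

```c
#include <assert.h>
#include <stddef.h>

/*
 * User-space model of the two lines from buffer_init:
 * take 10% of the free buffer pages, then count how many
 * buffer heads fit into that many pages.
 */
static unsigned long model_max_buffer_heads(unsigned long free_buffer_pages,
                                            unsigned long page_size,
                                            size_t bh_size)
{
    unsigned long nrpages;

    assert(bh_size > 0 && bh_size <= page_size);
    nrpages = (free_buffer_pages * 10) / 100;
    return nrpages * (page_size / bh_size);
}
```

With 4096-byte pages and 104-byte objects, 39 buffer heads fit into one page, which agrees with the 39 objects per slab in the slabinfo line. For one million free pages the model therefore gives 100000 * 39 = 3900000 buffer heads.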

The next function after buffer_init is vfs_caches_init.

void __init vfs_caches_init(void) /* virtual filesystem caches */
{
    names_cachep = kmem_cache_create_usercopy("names_cache", PATH_MAX, 0,
            SLAB_HWCACHE_ALIGN|SLAB_PANIC, 0, PATH_MAX, NULL);

    dcache_init();         /* directory entry cache */
    inode_init();          /* inode cache */
    files_init();          /* file cache */
    files_maxfiles_init();
    mnt_init();            /* mounts */
    bdev_cache_init();     /* block device cache */
    chrdev_init();         /* character devices */
}

This function allocates SLAB caches and hash tables for different VFS caches. We already saw the vfs_caches_init_early function in the eighth part of the Linux kernel initialization process, which initialized the caches for dcache (or directory cache) and the inode cache.

The vfs_caches_init function performs the post-early initialization of the dcache and inode caches, the private data cache, hash tables for the mount points, etc. More details about the VFS will be described in a separate part.

After this we can see the signals_init function.

void __init signals_init(void)
{
    siginfo_buildtime_checks();

    sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC);
}

This function is defined in kernel/signal.c and allocates a cache for the sigqueue structures, which represent the queue of real-time signals.

The next function is page_writeback_init. This function initializes the ratio for the dirty pages. Every low-level page entry contains the dirty bit, which indicates whether a page has been written to after being loaded into memory.

In 5.10.13 this call has been moved into the following function:

void __init pagecache_init(void) /* page cache */
{
    int i;

    for (i = 0; i < PAGE_WAIT_TABLE_SIZE; i++)
        init_waitqueue_head(&page_wait_table[i]);

    page_writeback_init(); /* page writeback */
}

Creation of the root for the procfs

After all of these preparations we need to create the root for the proc filesystem. We will do it with the call of the proc_root_init function from fs/proc/root.c.

void __init proc_root_init(void)
{
    proc_init_kmemcache();      /* kmem_cache */
    set_proc_pid_nlink();       /* /proc/PID/ */
    proc_self_init();           /* /proc/self/ */
    proc_thread_self_init();
    proc_symlink("mounts", NULL, "self/mounts"); /* /proc/PID/mounts */

    proc_net_init();            /* /proc/net/ */
    proc_mkdir("fs", NULL);     /* /proc/fs/ */
    proc_mkdir("driver", NULL); /* /proc/driver/ */
    proc_create_mount_point("fs/nfsd"); /* somewhere for the nfsd filesystem to be mounted */
#if defined(CONFIG_SUN_OPENPROMFS) || defined(CONFIG_SUN_OPENPROMFS_MODULE)
    /* just give it a mountpoint */
    proc_create_mount_point("openprom");
#endif
    proc_tty_init();            /* /proc/tty */
    proc_mkdir("bus", NULL);    /* /proc/bus */
    proc_sys_init();            /* /proc/sys */

    register_filesystem(&proc_fs_type);
}

In the older kernel described by the original article, at the start of the proc_root_init function we allocate the cache for the inodes and register a new filesystem in the system (in 5.10.13, as shown above, register_filesystem is called at the end of the function):

err = register_filesystem(&proc_fs_type);
if (err)
    return;

The proc_fs_type structure is as follows:

static struct file_system_type proc_fs_type = {
    .name            = "proc",
    .init_fs_context = proc_init_fs_context,
    .parameters      = proc_fs_parameters,
    .kill_sb         = proc_kill_sb,
    .fs_flags        = FS_USERNS_MOUNT | FS_DISALLOW_NOTIFY_PERM,
};

As I wrote above, we will not dive into details about the VFS and different filesystems in this chapter, but will see them in the chapter about the VFS. After we've registered a new filesystem in our system, we call the proc_self_init function from fs/proc/self.c. This function allocates an inode number for self (the /proc/self directory refers to the process accessing the /proc filesystem). The next step after proc_self_init is proc_setup_thread_self, which sets up the /proc/thread-self directory that contains information about the current thread. After this we create the /proc/self/mounts symlink, which will contain the mount points, with the call of

proc_symlink("mounts", NULL, "self/mounts");

and a couple of directories, depending on different configuration options:

#ifdef CONFIG_SYSVIPC
    proc_mkdir("sysvipc", NULL);
#endif
    proc_mkdir("fs", NULL);
    proc_mkdir("driver", NULL);
    proc_mkdir("fs/nfsd", NULL);
#if defined(CONFIG_SUN_OPENPROMFS) || defined(CONFIG_SUN_OPENPROMFS_MODULE)
    proc_mkdir("openprom", NULL);
#endif
    proc_mkdir("bus", NULL);
    ...
    ...
    ...
    if (!proc_mkdir("tty", NULL))
        return;
    proc_mkdir("tty/ldisc", NULL);
    ...
    ...
    ...

At the end of proc_root_init we call the proc_sys_init function, which creates the /proc/sys directory and initializes Sysctl.
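The effect of proc_self_init can be observed from user space once the system is up: /proc/self is a symbolic link whose target is the PID of the process that reads it. The helper below is a small sketch with a hypothetical name; it assumes procfs is mounted at /proc and returns -1 when it is not.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Returns the PID that the /proc/self symlink points to,
 * or -1 if the link cannot be read (for example, no procfs).
 */
static long proc_self_target(void)
{
    char buf[32];
    ssize_t n = readlink("/proc/self", buf, sizeof(buf) - 1);

    if (n <= 0)
        return -1;
    assert(n < (ssize_t)sizeof(buf)); /* readlink never exceeds the given size */
    buf[n] = '\0';                    /* readlink does not terminate the string */
    return strtol(buf, NULL, 10);
}
```

Called from an ordinary process, the function is expected to return the same number as getpid(), because the link is resolved relative to the process accessing /proc.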

It is the end of the start_kernel function. I did not describe all the functions which are called in start_kernel. I skipped them because they are not important for the generic kernel initialization and depend only on particular kernel configurations:

taskstats_init_early, which exports per-task statistics to user space;
delayacct_init, which initializes per-task delay accounting;
key_init and security_init, which initialize different security stuff;
check_bugs, which fixes some architecture-dependent bugs;
ftrace_init, which initializes ftrace;
cgroup_init, which initializes the rest of the cgroup subsystem;
etc.

Many of these parts and subsystems will be described in the other chapters.

That’s all.

Finally we have passed through the long start_kernel function. But it is not the end of the Linux kernel initialization process. We haven't run the first process yet. At the end of start_kernel we can see the last call, the rest_init function (in 5.10.13 it is reached through arch_call_rest_init). Let's go ahead.

void __init __weak arch_call_rest_init(void)
{
    /*
     * At the end of start_kernel(), rest_init() starts two threads,
     * kernel_init and kthreadd, after which the main thread becomes the
     * idle thread (init/main.c). Linux has three special processes:
     * idle (PID=0), init (PID=1) and kthreadd (PID=2).
     */
    rest_init();
}

First steps after the start_kernel

The rest_init function is defined in the same source code file as the start_kernel function, and this file is init/main.c. At the beginning of rest_init we can see the call of the two following functions:

rcu_scheduler_starting();
smpboot_thread_init();

In 5.10.13 the function is:

noinline void __ref rest_init(void)
{
    struct task_struct *tsk;
    int pid;

    rcu_scheduler_starting(); /* scheduler starting */
    /*
     * We need to spawn init first so that it obtains pid 1, however
     * the init task will end up wanting to create kthreads, which, if
     * we schedule it before we create kthreadd, will OOPS.
     */
    /* create a kernel thread */
    pid = kernel_thread(kernel_init, NULL, CLONE_FS); /* init/systemd kernel thread, PID=1 */
    /*
     * Pin init on the boot CPU. Task migration is not properly working
     * until sched_init_smp() has been run. It will set the allowed
     * CPUs for init to the non isolated CPUs.
     */
    rcu_read_lock();
    tsk = find_task_by_pid_ns(pid, &init_pid_ns);
    set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id()));
    rcu_read_unlock();

    numa_default_policy();
    pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES); /* kthreadd kernel thread, PID=2 */
    rcu_read_lock();
    kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
    rcu_read_unlock();

    /*
     * Enable might_sleep() and smp_processor_id() checks.
     * They cannot be enabled earlier because with CONFIG_PREEMPTION=y
     * kernel_thread() would trigger might_sleep() splats. With
     * CONFIG_PREEMPT_VOLUNTARY=y the init task might have scheduled
     * already, but it's stuck on the kthreadd_done completion.
     */
    system_state = SYSTEM_SCHEDULING;

    complete(&kthreadd_done); /* kernel_init waits for this completion */

    /*
     * The boot idle thread must execute schedule()
     * at least once to get things moving:
     */
    schedule_preempt_disabled();
    /* Call into cpu_idle with preempt disabled */
    cpu_startup_entry(CPUHP_ONLINE);
}

The first, rcu_scheduler_starting, makes the RCU scheduler active, and the second, smpboot_thread_init, registers the smpboot_thread_notifier CPU notifier (you can read more about it in the CPU hotplug documentation). After this we can see the following calls:

pid = kernel_thread(kernel_init, NULL, CLONE_FS);
pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);

Here the kernel_thread function (defined in kernel/fork.c) creates a new kernel thread. As we can see, the kernel_thread function takes three arguments:

Function which will be executed in a new thread;
Parameter for the kernel_init function;
Flags.

We will not dive into details about the kernel_thread implementation (we will see it in the chapter which describes the scheduler; for now it is enough to say that kernel_thread invokes clone).

Now we only need to know that we create a new kernel thread with the kernel_thread function, that the parent and child of the thread will share filesystem information, and that the new thread will start to execute the kernel_init function.

A kernel thread differs from a user thread in that it runs in kernel mode. So with these two kernel_thread calls we create two new kernel threads with:

PID = 1 for the init process (on CentOS this is systemd);
PID = 2 for kthreadd.

We already know what the init process is. Let's look at kthreadd. It is a special kernel thread which manages and helps different parts of the kernel to create other kernel threads. We can see it in the output of the ps util:

[rongtao@localhost src]$ ps -ef | grep -e kthread -e systemd
root 1 0 0 Mar02 ? 00:05:39 systemd --switched-root --system --deserialize 21
root 2 0 0 Mar02 ? 00:00:00 [kthreadd]

Let's postpone kernel_init and kthreadd for now and go ahead in rest_init. In the next step, after we have created two new kernel threads, we can see the following code:

rcu_read_lock();
kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
rcu_read_unlock();

The first function, rcu_read_lock, marks the beginning of an RCU read-side critical section and rcu_read_unlock marks the end of an RCU read-side critical section. We call these functions because we need to protect find_task_by_pid_ns.

The find_task_by_pid_ns function returns a pointer to the task_struct for the given pid. So, here we are getting the pointer to the task_struct for PID = 2 (we got it after the kthreadd creation with kernel_thread). In the next step we call the complete function

complete(&kthreadd_done);

and pass the address of kthreadd_done. The kthreadd_done is defined as

static __initdata DECLARE_COMPLETION(kthreadd_done);

where the DECLARE_COMPLETION macro is defined as:

#define DECLARE_COMPLETION(work) \
    struct completion work = COMPLETION_INITIALIZER(work)

and expands to the definition of the completion structure. This structure is defined in include/linux/completion.h and implements the completions concept.

/*
 * struct completion - structure used to maintain state for a "completion"
 *
 * This is the opaque structure used to maintain the state for a "completion".
 * Completions currently use a FIFO to queue threads that have to wait for
 * the "completion" event.
 *
 * See also: complete(), wait_for_completion() (and friends _timeout,
 * _interruptible, _interruptible_timeout, and _killable), init_completion(),
 * reinit_completion(), and macros DECLARE_COMPLETION(),
 * DECLARE_COMPLETION_ONSTACK().
 */
struct completion {
    unsigned int done;
    struct swait_queue_head wait;
};

Completions are a code synchronization mechanism which provides a race-free solution for threads that must wait for some process to reach a certain point or a specific state.

Using completions consists of three parts:

The first is the definition of the completion structure, which we did with DECLARE_COMPLETION.
The second is the call of wait_for_completion. After the call of this function, the calling thread does not continue to execute and waits until another thread calls the complete function.

Note that we call wait_for_completion with kthreadd_done at the beginning of kernel_init_freeable:

wait_for_completion(&kthreadd_done);
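The whole mechanism, including the complete step discussed next, can be modelled in user space with a counter, a mutex and a condition variable. This is only an analogy built on POSIX threads: the kernel structure uses a simple wait queue, and every name prefixed with model_ is hypothetical.

```c
#include <assert.h>
#include <pthread.h>

/* User-space stand-in for struct completion: a counter plus a way to wait. */
struct completion_model {
    unsigned int done;
    pthread_mutex_t lock;
    pthread_cond_t wait;
};

static void model_init_completion(struct completion_model *c)
{
    assert(c != NULL);
    c->done = 0;
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->wait, NULL);
}

/* Record that the awaited event has happened and wake one waiter. */
static void model_complete(struct completion_model *c)
{
    pthread_mutex_lock(&c->lock);
    c->done++;
    pthread_cond_signal(&c->wait);
    pthread_mutex_unlock(&c->lock);
}

/* Block until model_complete() has been called, then consume one event. */
static void model_wait_for_completion(struct completion_model *c)
{
    pthread_mutex_lock(&c->lock);
    while (c->done == 0)
        pthread_cond_wait(&c->wait, &c->lock);
    c->done--;
    pthread_mutex_unlock(&c->lock);
}
```

Because the event is stored in the done counter, the order of the two calls does not matter: a waiter that arrives after model_complete returns immediately, and one that arrives before it sleeps until the event is signalled. This is the property rest_init and kernel_init rely on with kthreadd_done.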

And the last step is to call the complete function, as we saw above. Until complete is called, the kernel_init_freeable function does not proceed past the wait, so it runs only after the kthreadd thread has been set up. After kthreadd was set up, we can see the three following functions in rest_init:

init_idle_bootup_task(current);
schedule_preempt_disabled();
cpu_startup_entry(CPUHP_ONLINE);

5.10.13 does not have init_idle_bootup_task.

The first function, init_idle_bootup_task from kernel/sched/core.c, sets the scheduling class for the current process (the idle class in our case):

void init_idle_bootup_task(struct task_struct *idle)
{
    idle->sched_class = &idle_sched_class;
}

where the idle class has the lowest priority: its tasks run only when the processor has nothing else to run.

The second function, schedule_preempt_disabled, calls the scheduler from a context where preemption is disabled: it re-enables preemption without rescheduling, calls schedule, and disables preemption again before returning.

/**
 * schedule_preempt_disabled - called with preemption disabled
 *
 * Returns with preemption disabled. Note: preempt_count must be 1
 */
void __sched schedule_preempt_disabled(void)
{
    sched_preempt_enable_no_resched();
    schedule();
    preempt_disable();
}

And the third function, cpu_startup_entry, is defined in kernel/sched/idle.c and calls cpu_idle_loop from the same file.

In 5.10.13 this function is:

void cpu_startup_entry(enum cpuhp_state state)
{
    arch_cpu_idle_prepare();
    cpuhp_online_idle(state);
    while (1)
        do_idle();
}

The cpu_idle_loop function works as the process with PID = 0 and works in the background. The main purpose of cpu_idle_loop is to consume the idle CPU cycles.

When there is no process to run, this process starts to work.

We have one process with the idle scheduling class (we just set the current task to idle with the call of the init_idle_bootup_task function), so the idle thread does not do useful work but just checks whether there is an active task to switch to:

static void cpu_idle_loop(void)
{
    ...
    ...
    ...
    while (1) {
        while (!need_resched()) {
            ...
            ...
            ...
        }
        ...
}

In 5.10.13 the corresponding function is:

/*
 * Generic idle loop implementation
 *
 * Called with polling cleared.
 */
static void do_idle(void)
{
    int cpu = smp_processor_id();

    /*
     * If the arch has a polling bit, we maintain an invariant:
     *
     * Our polling bit is clear if we're not scheduled (i.e. if rq->curr !=
     * rq->idle). This means that, if rq->idle has the polling bit set,
     * then setting need_resched is guaranteed to cause the CPU to
     * reschedule.
     */
    __current_set_polling();
    tick_nohz_idle_enter();

    while (!need_resched()) {
        rmb();

        local_irq_disable();

        if (cpu_is_offline(cpu)) {
            tick_nohz_idle_stop_tick();
            cpuhp_report_idle_dead();
            arch_cpu_idle_dead();
        }

        arch_cpu_idle_enter();

        /*
         * In poll mode we reenable interrupts and spin. Also if we
         * detected in the wakeup from idle path that the tick
         * broadcast device expired for us, we don't want to go deep
         * idle as we know that the IPI is going to arrive right away.
         */
        if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
            tick_nohz_idle_restart_tick();
            cpu_idle_poll();
        } else {
            cpuidle_idle_call();
        }
        arch_cpu_idle_exit();
    }

    /*
     * Since we fell out of the loop above, we know TIF_NEED_RESCHED must
     * be set, propagate it into PREEMPT_NEED_RESCHED.
     *
     * This is required because for polling idle loops we will not have had
     * an IPI to fold the state for us.
     */
    preempt_set_need_resched();
    tick_nohz_idle_exit();
    __current_clr_polling();

    /*
     * We promise to call sched_ttwu_pending() and reschedule if
     * need_resched() is set while polling is set. That means that clearing
     * polling needs to be visible before doing these things.
     */
    smp_mb__after_atomic();

    /*
     * RCU relies on this call to be done outside of an RCU read-side
     * critical section.
     */
    flush_smp_call_function_from_idle();
    schedule_idle();

    if (unlikely(klp_patch_pending(current)))
        klp_update_patch_state(current);
}

More about it will be in the chapter about the scheduler. So at this moment start_kernel calls the rest_init function, which spawns an init process (the kernel_init function) and becomes the idle process itself.

Now it is time to look at kernel_init. Execution of the kernel_init function starts with the call of the kernel_init_freeable function. The kernel_init_freeable function first of all waits for the completion of the kthreadd setup. I already wrote about it above:

wait_for_completion(&kthreadd_done);

After this we set gfp_allowed_mask to __GFP_BITS_MASK, which means that the system is already running,

/* Now the scheduler is fully set up and can do blocking allocations */
/* i.e. the system is already running */
gfp_allowed_mask = __GFP_BITS_MASK;

set allowed cpus/mems to all CPUs and NUMA nodes with the set_mems_allowed function,

allow the init process to run on any CPU with set_cpus_allowed_ptr,

/*
 * init can allocate pages on any node
 *
 * allow `init` process to run on any CPU with the `set_cpus_allowed_ptr`
 */
set_mems_allowed(node_states[N_MEMORY]);

set the pid for cad (Ctrl-Alt-Delete), prepare for booting the other CPUs with the call of smp_prepare_cpus, call the early initcalls with do_pre_smp_initcalls, initialize SMP with smp_init, initialize the lockup detector with the call of lockup_detector_init, and initialize the scheduler with sched_init_smp.

After this we can see the call of the do_basic_setup function. Before we call do_basic_setup, our kernel is already initialized. As the comment says:

Now we can finally start doing some real work..

The definition of do_basic_setup:

/*
 * Ok, the machine is now initialized. None of the devices
 * have been touched yet, but the CPU subsystem is up and
 * running, and memory and process management works.
 *
 * Now we can finally start doing some real work..
 */
static void __init do_basic_setup(void)
{
    cpuset_init_smp(); /* reinitialize [cpuset] */
    driver_init();
    init_irq_proc();
    do_ctors();
    usermodehelper_enable();
    do_initcalls();    /* xxx_initcall() */
}

The do_basic_setup function will reinitialize cpuset to the active CPUs, initialize khelper (a kernel thread which is used for making calls out to user space from within the kernel), initialize tmpfs, initialize the drivers subsystem, enable the user-mode helper workqueue and make the post-early call of the initcalls.

static initcall_entry_t __initdata *initcall_levels[] = {
    __initcall0_start,
    __initcall1_start,
    __initcall2_start,
    __initcall3_start,
    __initcall4_start,
    __initcall5_start,
    __initcall6_start,
    __initcall7_start,
    __initcall_end,
};

static void __init do_initcalls(void)
{
    int level;
    size_t len = strlen(saved_command_line) + 1;
    char *command_line;

    command_line = kzalloc(len, GFP_KERNEL);
    if (!command_line)
        panic("%s: Failed to allocate %zu bytes\n", __func__, len);

    for (level = 0; level < ARRAY_SIZE(initcall_levels) - 1; level++) {
        /* Parser modifies command_line, restore it each time */
        strcpy(command_line, saved_command_line);
        do_initcall_level(level, command_line);
    }

    kfree(command_line);
}
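The level table above can be read as a list of boundaries inside one contiguous array of function pointers: level N runs everything between its own start marker and the start marker of level N+1. The following user-space sketch models that idea with four made-up init functions; all names are hypothetical and the body of do_initcall_level is assumed, since it is not shown in the text.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*initcall_model_t)(void);

/* A small log so the call order can be inspected afterwards. */
#define MODEL_MAX_LOG 8
static int model_log[MODEL_MAX_LOG];
static size_t model_log_len;

static int record(int id)
{
    assert(model_log_len < MODEL_MAX_LOG);
    model_log[model_log_len++] = id;
    return 0;
}

/* Made-up init functions; the numbers only identify them in the log. */
static int early_a(void)  { return record(10); }
static int core_a(void)   { return record(20); }
static int core_b(void)   { return record(21); }
static int device_a(void) { return record(30); }

/* One contiguous table of calls, ordered by level. */
static initcall_model_t model_calls[] = { early_a, core_a, core_b, device_a };

/* level_start[i] plays the role of __initcallN_start, the last entry of __initcall_end. */
static const size_t level_start[] = { 0, 1, 3, 4 };

static void model_do_initcalls(void)
{
    size_t nlevels = sizeof(level_start) / sizeof(level_start[0]) - 1;
    size_t level, i;

    model_log_len = 0;
    for (level = 0; level < nlevels; level++)
        for (i = level_start[level]; i < level_start[level + 1]; i++)
            model_calls[i]();
}
```

Running model_do_initcalls leaves 10, 20, 21, 30 in the log: every function of a lower level runs before any function of a higher one, and the final boundary entry is only used as the end of the last level, which is why the kernel loop stops at ARRAY_SIZE(initcall_levels) - 1.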

After do_basic_setup we can see the opening of /dev/console and two dup calls, which fill the file descriptors from 0 to 2:

if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
    pr_err("Warning: unable to open an initial console.\n");

(void) sys_dup(0);
(void) sys_dup(0);

In 5.10.13 it is:

/* Open /dev/console, for stdin/stdout/stderr, this should never fail */
void __init console_on_rootfs(void)
{
    struct file *file = filp_open("/dev/console", O_RDWR, 0);

    if (IS_ERR(file)) {
        pr_err("Warning: unable to open an initial console.\n");
        return;
    }
    init_dup(file);
    init_dup(file);
    init_dup(file);
    fput(file);
}
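The open-then-dup pattern of the older code has a direct user-space counterpart. The sketch below opens one file and duplicates its descriptor twice, so that three descriptors refer to the same open file; open_three is a hypothetical helper, and the path is left to the caller because /dev/console is usually not accessible to ordinary users.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Opens `path` once and duplicates the descriptor twice.
 * On success fds[0..2] all refer to the same open file and 0 is returned;
 * on failure everything opened so far is closed and -1 is returned.
 */
static int open_three(const char *path, int fds[3])
{
    assert(path != NULL);

    fds[0] = open(path, O_RDWR);
    if (fds[0] < 0)
        return -1;

    fds[1] = dup(fds[0]);
    fds[2] = dup(fds[0]);
    if (fds[1] < 0 || fds[2] < 0) {
        if (fds[1] >= 0)
            close(fds[1]);
        if (fds[2] >= 0)
            close(fds[2]);
        close(fds[0]);
        return -1;
    }
    return 0;
}
```

Since open and dup return the lowest free descriptor number, a process that starts with no open descriptors, like init at this point, ends up with exactly 0, 1 and 2, which become stdin, stdout and stderr.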

The older code uses two system calls here, sys_open and sys_dup. In the next chapters we will see the explanation and implementation of different system calls. After we have opened the initial console, we check whether the rdinit= option was passed on the kernel command line, and otherwise set the default path of the ramdisk init program:

if (!ramdisk_execute_command)
    ramdisk_execute_command = "/init";

The rdinit= option itself is handled by:

static int __init rdinit_setup(char *str)
{
    unsigned int i;

    ramdisk_execute_command = str;
    /* See "auto" comment in init_setup */
    for (i = 1; i < MAX_INIT_ARGS; i++)
        argv_init[i] = NULL;
    return 1;
}
__setup("rdinit=", rdinit_setup);
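The combined effect of the __setup handler and the default can be sketched in user space as a scan over command-line tokens. Both function names below are hypothetical, and the real kernel parser is considerably more involved; the sketch only shows the rule "use the value of rdinit= if present, otherwise /init".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the value part of an "rdinit=..." token, or NULL for other tokens. */
static const char *match_rdinit(const char *token)
{
    static const char key[] = "rdinit=";

    assert(token != NULL);
    if (strncmp(token, key, sizeof(key) - 1) == 0)
        return token + sizeof(key) - 1;
    return NULL;
}

/* Picks the ramdisk init program from already split command-line tokens. */
static const char *choose_ramdisk_command(const char *const *tokens, size_t n)
{
    const char *cmd = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        const char *value = match_rdinit(tokens[i]);

        if (value)
            cmd = value; /* in this sketch a later rdinit= overrides an earlier one */
    }
    return cmd ? cmd : "/init"; /* default when rdinit= was not given */
}
```

For the tokens "quiet" and "rdinit=/sbin/myinit" the function returns "/sbin/myinit"; for "quiet" alone it falls back to "/init".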

Then we check access to the ramdisk init program and, if it cannot be accessed, call the prepare_namespace function from init/do_mounts.c, which checks and mounts the initrd:

if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
    ramdisk_execute_command = NULL;
    prepare_namespace();
}

This is the end of the kernel_init_freeable function and we need to return to kernel_init.

The next step after kernel_init_freeable has finished its execution is async_synchronize_full. This function waits until all asynchronous function calls have been done.

After it we call free_initmem, which releases all memory occupied by the initialization stuff located between __init_begin and __init_end. After this we protect .rodata with mark_rodata_ro (in 5.10.13 it is called from mark_readonly):

static void mark_readonly(void)
{
    if (rodata_enabled) {
        /*
         * load_module() results in W+X mappings, which are cleaned
         * up with call_rcu(). Let's make sure that queued work is
         * flushed so that we don't hit false positives looking for
         * insecure pages which are W+X.
         */
        rcu_barrier();
        mark_rodata_ro();
        rodata_test();
    } else
        pr_info("Kernel memory protection disabled.\n");
}

and update the state of the system from SYSTEM_BOOTING to SYSTEM_RUNNING:

system_state = SYSTEM_RUNNING;

Then the kernel tries to run the init process:

if (ramdisk_execute_command) {
    ret = run_init_process(ramdisk_execute_command);
    if (!ret)
        return 0;
    pr_err("Failed to execute %s (error %d)\n",
           ramdisk_execute_command, ret);
}

First of all it checks ramdisk_execute_command, which we set in the kernel_init_freeable function and which is equal to the value of the rdinit= kernel command line parameter, or /init by default. The run_init_process function fills the first element of the argv_init array:

static const char *argv_init[MAX_INIT_ARGS+2] = { "init", NULL, };

which represents the arguments of the init program, and calls the do_execve function:

argv_init[0] = init_filename;
return do_execve(getname_kernel(init_filename),
    (const char __user *const __user *)argv_init,
    (const char __user *const __user *)envp_init);

In 5.10.13 the whole function looks like this (it calls kernel_execve instead):

static int run_init_process(const char *init_filename)
{
    const char *const *p;

    argv_init[0] = init_filename;
    pr_info("Run %s as init process\n", init_filename);
    pr_debug("  with arguments:\n");
    for (p = argv_init; *p; p++)
        pr_debug("    %s\n", *p);
    pr_debug("  with environment:\n");
    for (p = envp_init; *p; p++)
        pr_debug("    %s\n", *p);
    return kernel_execve(init_filename, argv_init, envp_init);
}

The do_execve function is declared in include/linux/sched.h and runs a program with the given file name and arguments. If we did not pass the rdinit= option on the kernel command line, the kernel checks execute_command, which is equal to the value of the init= kernel command line parameter:

if (execute_command) {
    ret = run_init_process(execute_command);
    if (!ret)
        return 0;
    panic("Requested init %s failed (error %d).",
          execute_command, ret);
}

If we did not pass the init= kernel command line parameter either, the kernel tries to run one of the following executable files:

/*
 * On my system:
 * [rongtao@localhost src]$ ll /sbin/init
 * lrwxrwxrwx 1 root root 22 Jan 28 11:18 /sbin/init -> ../lib/systemd/systemd
 */
if (!try_to_run_init_process("/sbin/init") ||
    !try_to_run_init_process("/etc/init") ||
    !try_to_run_init_process("/bin/init") ||
    !try_to_run_init_process("/bin/sh"))
    return 0;

Otherwise we finish with panic:

panic("No working init found.  Try passing init= option to kernel. "
      "See Linux Documentation/init.txt for guidance.");
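The fallback chain is a "first candidate that works wins" search. The sketch below models it in user space with a hypothetical first_runnable helper. Note the simplification: the kernel actually tries to execute each candidate, while this sketch only checks with access that the file exists and is executable.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Returns the first path in `candidates` that exists and is executable,
 * or NULL if none is (the case where the kernel would panic).
 */
static const char *first_runnable(const char *const *candidates, size_t n)
{
    size_t i;

    assert(candidates != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (access(candidates[i], X_OK) == 0)
            return candidates[i];
    return NULL;
}
```

Called with the list "/sbin/init", "/etc/init", "/bin/init", "/bin/sh", the function returns the first entry that is present on the system. On the machine from the listing above that would be /sbin/init, which is a link to systemd.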

That’s all! Linux kernel initialization process is finished!

Conclusion

It is the end of the tenth part about the Linux kernel initialization process. It is not only the tenth part, but also the last part which describes the initialization of the Linux kernel. As I wrote in the first part of this chapter, we would go through all the steps of the kernel initialization, and we did it. We started at the first architecture-independent function, start_kernel, and finished with the launch of the first init process in our system. I skipped details about different subsystems of the kernel; for example, I almost did not cover the scheduler, interrupts, exception handling, etc. From the next part we will start to dive into the different kernel subsystems. I hope it will be interesting.

If you have any questions or suggestions, write me a comment or ping me on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.

Links

SLAB
xsave
FPU
Documentation/security/credentials.txt
Documentation/x86/x86_64/mm
RCU
VFS
inode
proc
man proc
Sysctl
ftrace
cgroup
CPU hotplug documentation
completions - wait for completion handling
NUMA
cpus/mems
initcalls
Tmpfs
initrd
panic
Previous part
