Process Family Tree
All process are descendants of the init process, whose pid = 1. The init process, in turn, reads the system initscripts and executes more programs, create more child process to finsh the boot .
The relationship between processes is stored in the process descriptor. Each task_struct has a pointer to the parent’s task_struct , named parent , and a list of children, named children.
We can iterate over a process'children with the code below
struct task_struct *task;
struct list_head *list;
list_for_each(list, ¤t->children) {
task = list_entry(list, struct task_struct, sibling);
/* task now points to one of current’s children */
}
We also can get init process pointer with the code below:
struct task_struct *task;
for (task = current; task != &init_task; task = task->parent);
/* task now points to init */
In fact, you can follow the process hierarchy from any one process in the system to any
other. Oftentimes, however, it is desirable simply to iterate over all processes in the system.
This is easy because the task list is a circular, doubly linked list.To obtain the next task or previous task in the list, given any valid task, use
list_entry(task->tasks.next, struct task_struct, tasks);
list_entry(task->tasks.prev, struct task_struct, tasks);
These two routines are provided by the macros next_task(task) and prev_task(task).
Finally, the macro for_each_process(task) is provided, which iterates over the entire task list. On each iteration, task points to the next task in the list.
struct task_struct *task;
for_each_process(task) {
/* this pointlessly prints the name and PID of each task */
printk(“%s[%d]\n”, task->comm, task->pid);
}
Process Creation
Copy-on-Write
Linux use fork() to create new process just duplicating the parent. This approach is naive and inefficient since it copies much data that might otherwise be shared. Worse still, if the new process were to immediately execute a new image, all the copying is wasted. So fork() is actually implemented through a mechinism called copy-on-write pages. COW is a tech to delay or altogether prevent copying of the data. Rather than duplicate the process address space, the parent and the child can share a single copy. The duplication of resource occurs only when they are written; until then, they are shared read-only. This technique delays the copying of each pages until it is written to. In the case that the pages are never written--- for example in the common sense, if exec() is called immediately after fork() ---- they never need to be copied. The only overhead incurred by fork() is the duplication of the parent’s page tables and the creation of a unique process descriptor for the child. This is an important optimization because the Unix philosophy encourages quick process execution.
fork()
Linux implements fork() via the clone() system call.This call takes a series of flags that specify which resources, if any, the parent and child process should share. The clone() system call, in turn, calls do_fork() .
/*
* Ok, this is the main fork-routine.
*
* It copies the process, and if successful kick-starts
* it and waits for it to finish using the VM if required.
*/
long do_fork(unsigned long clone_flags,
unsigned long stack_start,
struct pt_regs *regs,
unsigned long stack_size,
int __user *parent_tidptr,
int __user *child_tidptr)
{
struct task_struct *p;
int trace = 0;
long nr;
/*
* Do some preliminary argument and permissions checking before we
* actually start allocating stuff
*/
if (clone_flags & CLONE_NEWUSER) {
if (clone_flags & CLONE_THREAD)
return -EINVAL;
/* hopefully this check will go away when userns support is
* complete
*/
if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SETUID) ||
!capable(CAP_SETGID))
return -EPERM;
}
/*
* When called from kernel_thread, don't do user tracing stuff.
*/
if (likely(user_mode(regs)))
trace = tracehook_prepare_clone(clone_flags);
p = copy_process(clone_flags, stack_start, regs, stack_size,
child_tidptr, NULL, trace);
/*
* Do this prior waking up the new thread - the thread pointer
* might get invalid after that point, if the thread exits quickly.
*/
if (!IS_ERR(p)) {
struct completion vfork;
trace_sched_process_fork(current, p);
nr = task_pid_vnr(p);
if (clone_flags & CLONE_PARENT_SETTID)
put_user(nr, parent_tidptr);
if (clone_flags & CLONE_VFORK) {
p->vfork_done = &vfork;
init_completion(&vfork);
}
audit_finish_fork(p);
tracehook_report_clone(regs, clone_flags, nr, p);
/*
* We set PF_STARTING at creation in case tracing wants to
* use this to distinguish a fully live task from one that
* hasn't gotten to tracehook_report_clone() yet. Now we
* clear it and set the child going.
*/
p->flags &= ~PF_STARTING;
wake_up_new_task(p, clone_flags);
tracehook_report_clone_complete(trace, regs,
clone_flags, nr, p);
if (clone_flags & CLONE_VFORK) {
freezer_do_not_count();
wait_for_completion(&vfork);
freezer_count();
tracehook_report_vfork_done(p, nr);
}
} else {
nr = PTR_ERR(p);
}
return nr;
}
This function calls copy_process(). We can figure out what copy_process do as below.
- It calls dup_task_struct() , which creates a new kernel stack, thread_info structure, and task_struct for the new process.The new values are identical to those of the current task.At this point, the child and parent process descriptors are identical.
- It then checks that the new child will not exceed the resource limits on the number of processes for the current user.
- The child needs to differentiate itself from its parent.Various members of the
process descriptor are cleared or set to initial values. Members of the process
descriptor not inherited are primarily statistically information.The bulk of the values in task_struct remain unchanged. - The child’s state is set to TASK_UNINTERRUPTIBLE to ensure that it does not yet run.
- copy_process() calls copy_flags() to update the flags member of the task_struct .The PF_SUPERPRIV flag, which denotes whether a task used superuser privileges, is cleared.The PF_FORKNOEXEC flag, which denotes a process that has not called exec() , is set.
- It calls alloc_pid() to assign an available PID to the new task.
- Depending on the flags passed to clone() , copy_process() either duplicates or shares open files, filesystem information, signal handlers, process address space, and namespace.These resources are typically shared between threads in a given process; otherwise they are unique and thus copied here.
- Finally, copy_process() cleans up and returns to the caller a pointer to the new child.
Back in do_fork() , if copy_process() returns successfully, the new child is woken up and run. Deliberately, the kernel runs the child process first. 8 In the common case of the child simply calling exec() immediately, this eliminates any copy-on-write overhead that would occur if the parent ran first and began writing to the address space.
vfork()
The vfork() system call has the same effect as fork() , except that the page table entries of the parent process are not copied. Instead, the child executes as the sole thread in the parent’s address space, and the parent is blocked until the child either calls exec() or exits. The child is not allowed to write to the address space.Today, with copy-on-write and child-runs-first semantics, the only benefit to vfork() is not copying the parent page tables entries.
Thread
Linux has a unique implementation of threads.To the Linux kernel, there is no concept of a thread. Linux implements all threads as standard processes.The Linux kernel does not provide any special scheduling semantics or data structures to represent threads. Instead, a thread is merely a process that shares certain resources with other processes. Each thread has a unique task_struct and appears to the kernel as a normal process— threads just happen to share resources, such as an address space, with other processes.
Creating Thread
Threads are created the same as normal tasks, with the exception that the clone() system call is passed flags corresponding to the specific resources to be shared:
clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);
The previous code results in behavior identical to a normal fork() , except that the address space, filesystem resources, file descriptors, and signal handlers are shared. In other words, the new task and its parent are what are popularly called threads.
In contrast, a normal fork() can be implemented as
clone(SIGCHLD, 0);
The flags provided to clone() help specify the behavior of the new process and detail what resources the parent and child will share.The table below lists the clone flags, which are defined in <linux/sched.h> , and their effect.
Kernel Thread
- Exist and operate solely in kernel space.
- Do NOT have a address space: *mm pointer, which point to address space, is NULL.
3.Kernel threads are created on system boot by other kernel threads. Indeed, a kernel thread can be created only by another kernel thread.
Process Termination
It occurs when the process calls the exit() system call, The bulk of the work is handled by do_exit() ,defined in kernel/exit.c , which completes a number of chores:
- It sets the PF_EXITING flag in the flags member of the task_struct .
- It calls del_timer_sync() to remove any kernel timers. Upon return, it is guaranteed that no timer is queued and that no timer handler is running.
- If BSD process accounting is enabled, do_exit() call acct_update_integrals()
to write out accounting information. - It calls exit_mm() to release the mm_struct held by this process. If no other process is using this address space—that it, if the address space is not shared—the kernel then destroys it.
- It calls exit_sem() . If the process is queued waiting for an IPC semaphore, it is dequeued here.
- It then calls exit_files() and exit_fs() to decrement the usage count of objects related to file descriptors and filesystem data, respectively. If either usage counts reach zero, the object is no longer in use by any process, and it is destroyed.
- It sets the task’s exit code, stored in the exit_code member of the task_struct , to the code provided by exit() or whatever kernel mechanism forced the termination.The exit code is stored here for optional retrieval by the parent.
- It calls exit_notify() to send signals to the task’s parent, reparents any of the task’s children to another thread in their thread group or the init process, and sets the task’s exit state, stored in exit_state in the task_struct structure, to EXIT_ZOMBIE .
- calls schedule() to switch to a new process (see Chapter 4). Because
the process is now not schedulable, this is the last code the task will ever execute. do_exit() never returns.
At this point, all objects associated with the task are freed. The task is not runnable and is in the EXIT_ZOMBIE exit state. The only memory it occupies is its kernel stack, the thread_info structure, and the task_struct structure. The task exists solely to provide information to its parent.After the parent retrieves the information, or notifies the kernel that it is uninterested, the remaining memory held by the process is freed and returned to the system for use.
Removing the Process Descripter
After the parent has obtained information on its terminated child, or signified to the kernel that it does not care, the child’s task_struct is deallocated.
When it is time to finally deallocate the process descriptor, release_task() is invoked. It does the following:
- It calls __exit_signal() , which calls __unhash_process() , which in turns calls detach_pid() to remove the process from the pidhash and remove the process from the task list.
2.__exit_signal() releases any remaining resources used by the now dead process
and finalizes statistics and bookkeeping. - If the task was the last member of a thread group, and the leader is a zombie, then release_task() notifies the zombie leader’s parent.
- release_task() calls put_task_struct() to free the pages containing the
process’s kernel stack and thread_info structure and deallocate the slab cache containing the task_struct .
At this point, the process descriptor and all resources belonging solely to the process has been freed.