User Process Management CLI

It is common to have a CLI tool, similar to systemctl, that manages user processes with commands to start, stop and restart them. This blog records some experience building such a CLI tool in Go. It also lists the scenarios that must be handled for the tool to work well as a PID 1 process.

This blog covers the following topics:

  • signal handling, wait system calls and process management
  • process reaping as the init process
  • start, restart and exit handling

Start: Functions to Trigger Shell

When a command is given to the CLI tool, how should we trigger it in Go code? We have several ways, from high level to very primitive level.

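// Option 1: os/exec, the high-level API.
func main() {
    cmd := exec.Command("/bin/bash", "-c", "echo hello")
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr
    if err := cmd.Start(); err != nil {
        log.Fatal(err) // cmd.Process is nil when Start fails
    }
    fmt.Println(cmd.Process.Pid)
    _ = cmd.Wait()
}

// Option 2: os.StartProcess. argv[0] must be passed explicitly.
func main() {
    p, err := os.StartProcess("/bin/bash", []string{"/bin/bash", "-c", "echo hello"}, &os.ProcAttr{
        // Files maps to the child's fds 0, 1 and 2, so stdin must come first.
        Files: []*os.File{os.Stdin, os.Stdout, os.Stderr},
    })
    if err != nil {
        log.Fatal(err)
    }

    re, err := p.Wait()
    fmt.Println(re, err)
}

// Option 3: syscall.StartProcess, the lowest level. It returns the pid, a handle and an error.
func main() {
    p, h, e := syscall.StartProcess("/bin/bash", []string{"/bin/bash", "-c", "echo hello"}, &syscall.ProcAttr{
        Files: []uintptr{0, 1, 2},
    })
    fmt.Println(p, h, e)
    time.Sleep(1 * time.Second) // nothing waits for the child here, so sleep 1s to let it write to stdout
}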

Each of these APIs wraps the one below it. Moreover, exec stays compatible with the low-level APIs: exec.Cmd still exposes the underlying process attributes through its SysProcAttr field.

exec.Command#Run -> os.StartProcess -> syscall.StartProcess

Hence, we can always use exec, as its documentation states:

Package exec runs external commands. It wraps os.StartProcess to make it easier to remap stdin and stdout, connect I/O with pipes, and do other adjustments.
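To illustrate the compatibility point, here is a minimal sketch that sets a low-level attribute through the SysProcAttr field of exec.Cmd. The helper name newGroupedCommand is hypothetical, and putting the child into its own process group is only an assumption about what a process manager may want (it makes it possible to signal the whole group later). The sketch targets Linux or macOS.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// newGroupedCommand is a hypothetical helper: it builds an exec.Cmd whose
// child is placed in a new process group via the low-level SysProcAttr.
func newGroupedCommand(script string) *exec.Cmd {
	cmd := exec.Command("/bin/bash", "-c", script)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	return cmd
}

func main() {
	cmd := newGroupedCommand("echo hello")
	if err := cmd.Start(); err != nil {
		fmt.Fprintln(os.Stderr, "start failed:", err)
		os.Exit(1)
	}
	// With Setpgid set and Pgid left at 0, the child leads a new group
	// whose ID equals its own PID.
	pgid, err := syscall.Getpgid(cmd.Process.Pid)
	fmt.Println("pid:", cmd.Process.Pid, "pgid:", pgid, "err:", err)
	_ = cmd.Wait()
}
```

The high-level conveniences (Stdout, Stderr, Wait) remain available while the low-level knob is still reachable.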

Reaping: The Init Process Needs to Clean Up Zombie Processes

When a process terminates but its parent has not waited for it, it becomes a zombie process. Moreover, once its parent process dies, it becomes an orphan and is adopted by the init process. As the Linux man page states:

A child that terminates, but has not been waited for becomes a "zombie". The kernel maintains a minimal set of information about the zombie process (PID, termination status, resource usage information) in order to allow the parent to later perform a wait to obtain information about the child. As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes. If a parent process terminates, then its "zombie" children (if any) are adopted by init(1), (or by the nearest "subreaper" process as defined through the use of the prctl(2) PR_SET_CHILD_SUBREAPER operation); init(1) automatically performs a wait to remove the zombies.

On a normal host this is not a problem, because the system's init process reaps the zombies it adopts. For example, on macOS it is /sbin/launchd that reaps them. Inside a container, however, the given command is usually the init process itself. An ordinary program does not reap adopted children, so unless we provide the reaping logic ourselves, the zombies are never cleaned up.

Reaping Processes

Consider a case in which the CLI tool runs a command to start a long-running server. The server forks and execs several ephemeral processes. Once the server command returns (the process ends), the processes it spawned become orphans and are adopted by the init process (in this scenario, the CLI tool process).

There are two possible ways to reap: poll periodically, or wait when a child process goes down. In this blog, I focus on the latter.

The first thing to know is that when a child process stops or terminates, SIGCHLD is sent to the parent process. The second is the wait family of system calls, which wait for state changes in a child of the calling process. As the man page states:

In the case of a terminated child, performing a wait allows the system to release the resources associated with the child; if a wait is not performed, then the terminated child remains in a "zombie" state.

For reaping purposes, we only need to call wait in the parent process whenever SIGCHLD arrives.

Restarting a Service in a Container

Kubernetes (k8s) ensures that the desired number of instances run on the platform. Once a service goes down, which means its container exits as well, the platform starts a new container to reach the expected number of running instances.

Restart here means restarting a service without restarting the container. In this section, I introduce how that is achieved.

Because the restart is triggered from outside, or more precisely from another process, we need an inter-process communication mechanism. What's more, a restart must kill not only the process of the user's command but also the processes it spawned.

Because the CLI tool runs as the init process with PID 1, the destination is always known and we can send a signal to it directly. The signal we choose is SIGHUP.

   SIGHUP       P1990      Term    Hangup detected on controlling terminal
                                   or death of controlling process

The term "hang up" (HUP) comes from the early days of computing when it was used to notify processes associated with a terminal that the user had "hung up" or disconnected.

Over time, this signal evolved to serve additional purposes, and one common convention emerged: using SIGHUP to instruct a process to re-read its configuration files.
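A minimal sketch of a SIGHUP-triggered restart is shown here. The helper names startService and restartService are hypothetical, "exec sleep 30" is a stand-in for the user's long-running command, and the program signals itself only so that the demo is self-contained; in a container, another process would send SIGHUP to PID 1 instead. Killing the spawned descendants is left out of this sketch.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

// startService is a hypothetical helper that launches the user's command.
func startService(script string) (*exec.Cmd, error) {
	cmd := exec.Command("/bin/bash", "-c", script)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd, cmd.Start()
}

// restartService stops the running command and starts a fresh copy.
func restartService(old *exec.Cmd, script string) (*exec.Cmd, error) {
	_ = old.Process.Signal(syscall.SIGTERM) // ask the old process to stop
	_ = old.Wait()                           // reap it before starting the new one
	return startService(script)
}

func main() {
	const script = "exec sleep 30" // stand-in for the user's command

	hup := make(chan os.Signal, 1)
	signal.Notify(hup, syscall.SIGHUP)

	cmd, err := startService(script)
	if err != nil {
		fmt.Fprintln(os.Stderr, "start failed:", err)
		os.Exit(1)
	}
	fmt.Println("started pid", cmd.Process.Pid)

	// Demo only: simulate the external restart request by signalling ourselves.
	_ = syscall.Kill(os.Getpid(), syscall.SIGHUP)

	<-hup
	cmd, err = restartService(cmd, script)
	if err != nil {
		fmt.Fprintln(os.Stderr, "restart failed:", err)
		os.Exit(1)
	}
	fmt.Println("restarted as pid", cmd.Process.Pid)

	// Demo cleanup so the program does not leave the sleep behind.
	_ = cmd.Process.Kill()
	_ = cmd.Wait()
}
```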

The init process must register a handler for SIGHUP. The handler should kill all descendants of the user's process, which takes some work because we first need to discover them. Here is a simple demo that lists the process tree:

package main

import (
    "bytes"
    "fmt"
    "github.com/mitchellh/go-ps"
    "io"
    "os"
)

func main() {
    printProcessTree()
}

func printProcessTree() {
    var w bytes.Buffer
    entry := []int{1}
    printLayer(New(), entry, &w, 0)
    w.WriteTo(os.Stdout)
}

type Processes struct {
    descents   map[int][]int  // ppid -> []pid
    executable map[int]string // pid -> executable string
}

func New() *Processes {
    pros, _ := ps.Processes()
    m := make(map[int][]int)         // ppid -> []pid
    processM := make(map[int]string) // pid -> string

    for _, p := range pros {
        processM[p.Pid()] = p.Executable()

        v, ok := m[p.PPid()]
        if !ok {
            m[p.PPid()] = []int{p.Pid()}
            continue
        }
        m[p.PPid()] = append(v, p.Pid())
    }
    return &Processes{
        descents:   m,
        executable: processM,
    }
}

func printLayer(p *Processes, entry []int, w io.Writer, ident int) {
    for _, e := range entry {
        strLine := fmt.Sprintf("%s%d: %s\n",
            printIndent(ident), e, p.executable[e])

        w.Write([]byte(strLine))
        list, ok := p.descents[e]
        if !ok {
            continue
        }
        printLayer(p, list, w, ident+1)
    }
}

func printIndent(number int) string {
    var idents string
    for i := 0; i < number-1; i++ {
        idents += "\t"
    }
    idents += "|_______"
    return idents
}

The output looks like:

|_______7988: chrome_crashpad_
|_______7986: Electron
        |_______64602: Code Helper (Plu
                |_______16545: Code Helper (Plu
                |_______64711: Code Helper (Plu
                |_______64605: rust-analyzer
                        |_______64964: rust-analyzer-pr
                        |_______64962: rust-analyzer-pr
                |_______64603: Code Helper (Plu
        |_______64601: Code Helper
        |_______64582: Code Helper (Ren
        |_______16695: Code Helper
                |_______17569: zsh
                |_______47274: zsh
        |_______8036: Code Helper
        |_______7990: Code Helper
        |_______7989: Code Helper (GPU
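The same ppid -> []pid map can also drive the kill step. The following is a sketch under assumptions, not the tool's actual handler: descendants and killAll are hypothetical helpers, and the map in main is made-up data shaped like the descents field. Descendants are returned deepest first, so leaves are signalled before their parents.

```go
package main

import (
	"fmt"
	"syscall"
)

// descendants walks a ppid -> child pids map and returns every process
// below root (root itself excluded), deepest first.
func descendants(children map[int][]int, root int) []int {
	var out []int
	for _, child := range children[root] {
		out = append(out, descendants(children, child)...)
		out = append(out, child)
	}
	return out
}

// killAll sends sig to every pid and ignores processes that are already
// gone (ESRCH).
func killAll(pids []int, sig syscall.Signal) {
	for _, pid := range pids {
		if err := syscall.Kill(pid, sig); err != nil && err != syscall.ESRCH {
			fmt.Println("kill", pid, "failed:", err)
		}
	}
}

func main() {
	// Made-up snapshot: ppid -> []pid.
	children := map[int][]int{
		1:  {10, 20},
		10: {11, 12},
		11: {13},
	}
	// killAll is intentionally not called here, because these PIDs are
	// invented and could belong to unrelated real processes.
	fmt.Println(descendants(children, 10)) // [13 11 12]
}
```

In a real handler the map would come from a fresh process snapshot, and the result would be passed to killAll with the chosen signal.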

Exit

The CLI needs to handle termination signals as well. The SIGKILL sent by kill -9 cannot be caught, so the signals to handle are SIGTERM, SIGINT and SIGQUIT. On exit, we should kill the process created by the user's command and all of its child processes, which is the same procedure as restarting.

Conclusion

This blog shares some experience writing a CLI tool that runs as the init process (PID 1). It covers reaping, starting and restarting users' commands, and exiting.