Prerequisites of the Go Sync Mutex Implementation¶
This blog is split out of the sync mutex implementation blog due to its length. For a better reading experience and clearer organization, it makes sense to turn this background knowledge into a separate article.
This blog introduces language-level instructions/functions such as the atomic and synchronization primitives, along with the runtime's spinning, sleeping, and waking. It also briefly covers memory barriers and assembly code in Go.
Share Memory by Communicating¶
Go guides us to share memory by communicating, and not to communicate by sharing memory.
The latter part is easy: don't read and write the same memory to communicate among goroutines. The "share memory by communicating" part says you should hand memory over (such as a pointer) through a communicating mechanism, even though it doesn't state clearly what that "communicating" mechanism is.
It's not a rule that describes the same operation in positive and negative forms, but a guide on how to share memory, and a caution about sharing it.
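A minimal sketch of the idea, with made-up job, jobs, and done names: instead of two goroutines mutating a shared structure under a lock, one goroutine hands a pointer over a channel, and ownership travels with the message.
package main

import "fmt"

// job is a hypothetical piece of work; whichever goroutine
// holds the pointer is its current owner.
type job struct {
	id     int
	result int
}

func main() {
	jobs := make(chan *job)
	done := make(chan *job)

	go func() {
		for j := range jobs {
			j.result = j.id * 2 // safe: only this goroutine touches j now
			done <- j           // hand ownership back
		}
	}()

	jobs <- &job{id: 21}         // share the memory by communicating it
	fmt.Println((<-done).result) // prints 42
	close(jobs)
}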
Atomic¶
As the name suggests, atomic means that an operation executes as a single indivisible step: it cannot be observed or interrupted partway, and the compiler won't reorder other instructions into the middle of it. I won't discuss the CPU instructions, such as barriers, that ensure this atomicity. Instead, we focus on the language-level API built on top of them. The atomic package provides several APIs, which are the foundation for implementing a language-level mutex.
- the swap operation (the SwapT functions) is the atomic equivalent of: old = *addr; *addr = new; return old
- the compare-and-swap operation (the CompareAndSwapT functions) is the atomic equivalent of: if *addr == old { *addr = new; return true }; return false
- the add operation (the AddT functions) is the atomic equivalent of: *addr += delta; return *addr
- the load operation (the LoadT functions) is the atomic equivalent of: return *addr
- the store operation (the StoreT functions) is the atomic equivalent of: *addr = val
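As a quick, illustrative example (the counter variable is arbitrary), the following program exercises each of these APIs:
package main

import (
	"fmt"
	"sync/atomic"
)

func main() {
	var counter int32

	atomic.StoreInt32(&counter, 10)                    // store: *addr = val
	atomic.AddInt32(&counter, 5)                       // add: *addr += delta, now 15
	old := atomic.SwapInt32(&counter, 100)             // swap: returns 15, sets 100
	ok := atomic.CompareAndSwapInt32(&counter, 100, 1) // CAS: 100 == 100, so it swaps

	fmt.Println(atomic.LoadInt32(&counter), old, ok) // prints: 1 15 true
}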
Note that atomics are tightly bound to the memory model, since they are the primitive way to guarantee how memory is accessed and modified without interference from others. As the Go documentation states:
In the terminology of the Go memory model, if the effect of an atomic operation A is observed by atomic operation B, then A “synchronizes before” B.
Additionally, all the atomic operations executed in a program behave as though executed in some sequentially consistent order.
This definition provides the same semantics as C++'s sequentially consistent atomics and Java's volatile variables.
The memory model is too far outside the scope of the current discussion, so I won't go into it much here.
Assembly as Go Func Implementation¶
Before diving deeper, it's better to know how Go code calls assembly code. To be clear, this blog doesn't cover Go assembly syntax or the data layout (ABI) at all. It only records the convention by which a Go function is implemented by a few lines of assembly.
See this blog.
This introduction is necessary because the CAS below is likewise implemented in assembly rather than through a system call.
CAS: Implemented in Assembly¶
The CAS (the CompareAndSwapInt32 func) compares the contents of a memory location with a given value (old) and, only if they are the same, modifies the contents of that memory location to a given new value, all as a single atomic operation.
The atomicity is achieved through architecture-specific assembly code. The signature is defined in src/sync/atomic/doc.go:
// CompareAndSwapInt32 executes the compare-and-swap operation for
// an int32 value. Consider using the more ergonomic and
// less error-prone [Int32.CompareAndSwap] instead.
func CompareAndSwapInt32(addr *int32, old, new int32) (swapped bool)
The func body, which contains just a jump instruction, is in src/sync/atomic/asm.s:
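The stub (quoted from the Go source tree this blog is based on) simply jumps to the runtime-internal implementation:
TEXT ·CompareAndSwapInt32(SB),NOSPLIT,$0
	JMP	runtime∕internal∕atomic·Cas(SB)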
The real implementation lives in src/runtime/internal/atomic/atomic_arm64.s. We only show the arm64 version here; note that it's architecture-specific.
// bool Cas(uint32 *ptr, uint32 old, uint32 new)
// Atomically:
// if(*val == old){
// *val = new;
// return 1;
// } else
// return 0;
TEXT ·Cas(SB), NOSPLIT, $0-17
	MOVD	ptr+0(FP), R0        // R0 = ptr
	MOVW	old+8(FP), R1        // R1 = old
	MOVW	new+12(FP), R2       // R2 = new
	MOVBU	internal∕cpu·ARM64+const_offsetARM64HasATOMICS(SB), R4
	CBZ	R4, load_store_loop  // no ARMv8.1 LSE atomics: take the fallback loop
	MOVD	R1, R3
	CASALW	R3, (R0), R2         // single-instruction compare-and-swap of *ptr
	CMP	R1, R3
	CSET	EQ, R0               // R0 = 1 if the value matched (swap happened), else 0
	MOVB	R0, ret+16(FP)
	RET
load_store_loop:
	LDAXRW	(R0), R3             // load-acquire exclusive: R3 = *ptr
	CMPW	R1, R3
	BNE	ok                   // *ptr != old: report failure
	STLXRW	R2, (R0), R3         // store-release exclusive: R3 = 0 on success
	CBNZ	R3, load_store_loop  // lost exclusivity: retry
ok:
	CSET	EQ, R0
	MOVB	R0, ret+16(FP)
	RET
In the end, we need to highlight that the atomicity of CAS is guaranteed by the assembly instructions, which is still a fairly high-level view because it doesn't consider the CPU cores and their caches (more on that in the cache coherency section below).
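To see why CAS is the foundation of a mutex, here is a minimal, illustrative try-lock built on CompareAndSwapInt32. The spinLock type is made up for this sketch; the real sync.Mutex is far more involved:
package main

import (
	"fmt"
	"sync/atomic"
)

// spinLock is a toy lock: 0 = unlocked, 1 = locked.
type spinLock struct {
	state int32
}

// TryLock succeeds only if the CAS flips state from 0 to 1;
// if another goroutine holds the lock, the CAS fails atomically.
func (l *spinLock) TryLock() bool {
	return atomic.CompareAndSwapInt32(&l.state, 0, 1)
}

func (l *spinLock) Unlock() {
	atomic.StoreInt32(&l.state, 0)
}

func main() {
	var l spinLock
	fmt.Println(l.TryLock()) // true: we acquired the lock
	fmt.Println(l.TryLock()) // false: it's already held
	l.Unlock()
	fmt.Println(l.TryLock()) // true again
}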
runtime doSpin and canSpin¶
The mutex tries spinning while acquiring the lock, and this feature is provided by the Go runtime package. In this section we take a look at it to learn more about runtime spinning.
runtime_doSpin¶
The runtime_doSpin is declared in src/sync/runtime.go, but the definition lives in src/runtime/proc.go thanks to the go:linkname directive. The proc.go file provides many functions in charge of the goroutine scheduler, the famous GMP model in the Go runtime.
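The definition there is a thin wrapper; quoted from src/runtime/proc.go (active_spin_cnt is a constant, 30):
//go:linkname sync_runtime_doSpin sync.runtime_doSpin
//go:nosplit
func sync_runtime_doSpin() {
	procyield(active_spin_cnt)
}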
The procyield it calls is written in assembly as well, in src/runtime/asm_arm64.s:
TEXT runtime·procyield(SB),NOSPLIT,$0-0
MOVWU cycles+0(FP), R0
again:
YIELD
SUBW $1, R0
CBNZ R0, again
RET
The parameter cycles is moved into R0, and the loop keeps subtracting 1 from R0 until it reaches zero. The YIELD instruction hints to the processor that the current thread is willing to yield its execution resources; this can be used for cooperative multitasking or similar scenarios.
In short, it performs the specified number of cycles of subtract-and-yield and returns.
runtime_canSpin¶
Compared to runtime_doSpin, runtime_canSpin is much more complex, as it is tightly coupled with the processor (P) part of the scheduler. It's better to explain it in a future passage after covering the GMP model.
One interesting thing is that if the program is running with GOMAXPROCS<=1, or on a single-core machine, canSpin will always return false. In other words, it won't spin under those circumstances.
// Active spinning for sync.Mutex.
//
//go:linkname sync_runtime_canSpin sync.runtime_canSpin
//go:nosplit
func sync_runtime_canSpin(i int) bool {
// sync.Mutex is cooperative, so we are conservative with spinning.
// Spin only few times and only if running on a multicore machine and
// GOMAXPROCS>1 and there is at least one other running P and local runq is empty.
// As opposed to runtime mutex we don't do passive spinning here,
// because there can be work on global runq or on other Ps.
if i >= active_spin || ncpu <= 1 || gomaxprocs <= sched.npidle.Load()+sched.nmspinning.Load()+1 {
return false
}
if p := getg().m.p.ptr(); !runqempty(p) {
return false
}
return true
}
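Putting the two hooks together, the spinning phase of a lock roughly follows the pattern below. This is a runnable sketch with stand-in canSpin/doSpin functions; the real ones are private to the runtime:
package main

import (
	"fmt"
	"runtime"
)

const activeSpin = 4 // mirrors the runtime's active_spin constant

// canSpin is a stand-in for sync_runtime_canSpin: spin only a few
// times, and only on a multicore machine with GOMAXPROCS > 1.
func canSpin(i int) bool {
	return i < activeSpin && runtime.NumCPU() > 1 && runtime.GOMAXPROCS(0) > 1
}

// doSpin stands in for sync_runtime_doSpin; the real one executes
// YIELD/PAUSE instructions, so Gosched is only a rough substitute.
func doSpin() {
	runtime.Gosched()
}

func main() {
	iter := 0
	for canSpin(iter) {
		doSpin() // burn a little time hoping the lock holder releases soon
		iter++
	}
	fmt.Println("spun", iter, "times before falling back to sleeping")
}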
runtime SemacquireMutex and Semrelease¶
Similarly, the function runtime_SemacquireMutex is linked to the function sync_runtime_SemacquireMutex inside src/runtime/sema.go through go:linkname. The underlying semacquire1 is concerned with how the pending goroutines are put to sleep and woken.
The sema, short for semaphore, is a mechanism for implementing sleep and wakeup at the Go runtime level. It concerns acquiring the mutex in a blocking manner at the goroutine level.
To clarify, sleeping and waking up at the language runtime level can be split into two aspects:
- 1: how the scheduler handles a goroutine's sleeping and waking
- 2: the conditions under which a goroutine sleeps and is woken
It's similar to the discussion around the function call time.Sleep: we care about what happens after sleeping (how the Go scheduler manages it) and when it's woken (the sleep time has passed).
To learn sync_runtime_SemacquireMutex, we look at semacquire1 and focus on question 2. The first question requires a deep understanding of the Go runtime scheduler, which isn't necessary for understanding the mutex.
The runtime_Semacquire family of APIs performs a single task: it waits until *s > 0 and then atomically decrements it. It is intended as a simple sleep primitive for use by the synchronization library and should not be used directly. As the documentation states, it targets the same goal as Linux's futex:
That is, don't think of these as semaphores.
Think of them as a way to implement sleep and wakeup such that every sleep is paired with a single wakeup, even if, due to races, the wakeup happens before the sleep.
Hence, once a goroutine calls the API, it sleeps, blocked there until it's woken at some uncertain point in the future. From this view, how the goroutines are scheduled is a less important problem, and we needn't consider it much. If multiple goroutines call this API, they unblock in sequence: one by one, the blocked goroutines wake up and continue running the code after the API call.
Then we look at runtime_SemacquireMutex, slightly more complex than runtime_Semacquire as it exposes more details to control goroutine scheduling (effectively, the priority of unblocking) at wakeup.
// SemacquireMutex is like Semacquire, but for profiling contended
// Mutexes.
// If lifo is true, queue waiter at the head of wait queue.
// skipframes is the number of frames to omit during tracing, counting from
// runtime_SemacquireMutex's caller.
func runtime_SemacquireMutex(s *uint32, lifo bool, skipframes int)
The lifo flag introduces a notion of priority for a goroutine, and skipframes is an argument for tracing. If lifo is true, the waiter is queued at the head of the wait queue, so we can assume it will unblock earlier than goroutines enqueued with a false value. But if we only want to understand the API's behavior briefly, we can ignore the underlying "queue" entirely and just think of it as providing a way to wait with higher priority.
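For context, the slow path of sync.Mutex calls it roughly like this (a simplified excerpt; waitStartTime tracks whether this goroutine has already waited once):
// If we were already waiting, re-queue at the front of the queue (LIFO)
// so the longest-suffering waiter doesn't keep losing races.
queueLifo := waitStartTime != 0
if waitStartTime == 0 {
	waitStartTime = runtime_nanotime()
}
runtime_SemacquireMutex(&m.sema, queueLifo, 1)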
Conversely, Semrelease atomically increments *s and notifies a waiting goroutine if one is blocked in Semacquire. The implementation is out of scope here, so we won't pay much attention to it; how it works deserves another blog.
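Its declaration in src/sync/runtime.go mirrors the acquire side:
// Semrelease atomically increments *s and notifies a waiting goroutine
// if one is blocked in Semacquire.
// If handoff is true, pass count directly to the first waiter.
func runtime_Semrelease(s *uint32, handoff bool, skipframes int)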
Cache Coherency Mechanism¶
Atomic operations are guaranteed by assembly instructions, which are architecture-specific.
The compiler respects atomic operations as well: it won't insert other instructions among the atomic ones.
CAS works fine on a single-CPU-core machine, as only one instruction can interact with memory at a point in time.
However, with multiple CPU cores, each core may execute instructions that access the same memory at the same time, and each core has its own cache, so an atomic instruction on one core alone isn't enough to keep every core's view of memory consistent. Hence, the CPU provides a mechanism to control this, called the "cache coherency mechanism".
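To make the multicore problem concrete, here is a small program in which plain increments from several goroutines lose updates while the atomic ones never do (run it with GOMAXPROCS > 1; the exact lost count varies per run):
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	const goroutines, increments = 8, 100000

	var plain int64
	var atomicCounter int64
	var wg sync.WaitGroup

	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < increments; i++ {
				plain++ // data race: this read-modify-write is not atomic
				atomic.AddInt64(&atomicCounter, 1)
			}
		}()
	}
	wg.Wait()

	// plain is very likely below 800000; atomicCounter is exactly 800000.
	fmt.Println("plain:", plain, "atomic:", atomicCounter)
}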
The cache coherency mechanism provides memory barriers such as read barriers, write barriers, and so forth. What we need to know is that CAS relies on the cache coherency mechanism to ensure its correctness.
However, this mechanism is a bit out of topic, as it's not limited to the mutex itself.