#go #performance #production #benchmarks #observability

Go 1.26 Green Tea GC in Production: What Changes, How to Measure It, When to Opt Out

Jun 4, 2026

Key Takeaways

→Green Tea scans and tracks whole 8 KiB memory pages instead of individual objects, improving cache locality and cutting the memory stalls that dominate GC marking time
→go.dev reports a 10–40% reduction in GC CPU for workloads that lean heavily on the collector — modal improvement ~10%, workload-dependent, not a guarantee
→It is the default in Go 1.26 (10 Feb 2026). Opt out with GOEXPERIMENT=nogreenteagc, but that flag is expected to be removed in Go 1.27 — so opt-out is a short bridge, not a long-term plan
→Measure before you trust the number: diff /cpu/classes/gc/total against /cpu/classes/total across an A/B build. DoltHub measured zero latency change on their workload

The upgrade that was supposed to be free. A team bumps their Go services from 1.25 to 1.26, reruns CI, and ships. Three days later someone notices the GC CPU panel on the busiest service, a JSON-heavy API gateway, has dropped about 9% at the same request rate. Nobody changed GC tuning. Nobody touched GOGC. The win came from the runtime: Go 1.26 makes the Green Tea garbage collector the default. On the next service over, a storage engine doing large sequential scans, the same panel doesn't move at all. Same Go version, same upgrade, opposite result.

That split is the whole story. Green Tea is a real improvement to how Go marks the heap, and for a lot of allocation-heavy services it quietly gives back single-digit-percent CPU. But the size of the win is workload-dependent, and "workload-dependent" is exactly the kind of phrase that gets skipped in a release-note skim and then turns into a planning assumption. This article explains what actually changed in the collector, the numbers Go's own team published (and how much to trust them), and — the part that matters in production — how to measure the difference on your binary before you bake it into a capacity model.

TL;DR

Green Tea is Go's new mark algorithm: it works with whole memory pages, not individual objects, so marking has better cache locality and stalls less on main memory. It is default in Go 1.26 (10 Feb 2026).

go.dev reports ~10% less GC CPU for most heavy-GC workloads, up to ~40% for some — workload-dependent, not promised (^{[Go Runtime GC]})
Opt out with GOEXPERIMENT=nogreenteagc at build time, but that flag is expected to be removed in Go 1.27 — treat opt-out as a short bridge
Measure it: diff /cpu/classes/gc/total:cpu-seconds against /cpu/classes/total:cpu-seconds on an A/B build; confirm with GODEBUG=gctrace=1. Some workloads (DoltHub) saw no gain at all

Why GC marking is a memory-stall problem, not a CPU problem

To understand why scanning pages beats scanning objects, you have to know where the garbage collector actually spends its time. Go's GC is a concurrent tri-color mark-sweep collector, and the cost is lopsided. Per the Green Tea announcement, about 90% of the cost of the garbage collector is spent marking, and only about 10% is sweeping. So if you want a faster GC, you make marking faster.

The catch is that marking isn't bottlenecked on arithmetic. The classic algorithm — what the Go team calls the "graph flood" — keeps a work list of individual objects. It pops an object, looks at its pointers, pushes the objects those pointers reference, and repeats until the list is empty. That's a graph traversal, and the objects sit at scattered addresses across the heap. Each pop is effectively a random memory access. Per Go's Green Tea writeup, of the time spent marking, a substantial portion, usually at least 35%, is simply spent stalled on accessing heap memory — the CPU is parked waiting for a cache line to arrive from main memory.

That's the lever. The marker is memory-latency-bound, so the way to speed it up is to improve locality, not to spin the ALU faster.

The shape of object that dominates marking

Marking cost is concentrated in small objects that contain pointers — request structs, parsed JSON nodes, cache entries, tree and list nodes. Big []byte buffers and pointer-free arrays are cheap to scan (no pointers to chase) and pure-value structs don't generate marking work at all. If your service is mostly shuffling large byte slices, you have less marking time for Green Tea to optimize, which is one reason wins vary so much.

The mechanism: work with pages, not objects

Green Tea's one-line summary, straight from the Go team, is "Work with pages, not objects." Three concrete changes follow from that:

Instead of scanning objects we scan whole pages. Instead of tracking objects on our work list, we track whole pages. We still need to mark objects at the end of the day, but we'll track marked objects locally to each page, rather than across the whole heap.

A "page" here is Go's runtime page: 8 KiB, regardless of the hardware virtual-memory page size. Each page holds objects of a single size class, which is what makes the bookkeeping uniform. Instead of one mark bit per object spread across the heap, Green Tea keeps two bits per object, local to each page:

a "seen" bit — a pointer to this object has been found, so the object is reachable
a "scanned" bit — this object's own pointers have already been processed

The marker's work list now holds pages, not objects. When it pulls a page, it finds every object on that page that's been seen but not yet scanned and processes them together in one left-to-right pass over the page's memory. Compared to the graph flood — which would touch those same objects at random, interleaved with objects from a dozen other pages — this is a sequential sweep over 8 KiB that's very likely already resident in cache.
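The page-at-a-time loop described above can be sketched as a toy model. This is not the runtime's implementation: the page struct, the 64-slot bitmaps, and the names ref, newPage, mark, and demoHeap are all assumptions made for illustration. What it does show is the shape of the algorithm: the work list holds page indices, and each visit handles every seen-but-unscanned object on that page in one ascending pass.

```go
package main

import (
	"fmt"
	"math/bits"
)

// ref names an object by page index and slot within that page.
type ref struct{ page, slot int }

// page is a toy stand-in for a runtime page holding up to 64 objects.
type page struct {
	seen    uint64  // bit i set: a pointer to slot i has been found
	scanned uint64  // bit i set: slot i's own pointers have been followed
	queued  bool    // page is currently on the work list
	ptrs    [][]ref // ptrs[i] lists the objects that slot i points to
}

func newPage(slots int) *page { return &page{ptrs: make([][]ref, slots)} }

// mark floods the heap from roots one page at a time and returns how many
// times a page was pulled off the work list.
func mark(pages []*page, roots []ref) (pageVisits int) {
	var work []int // page indices, not objects
	see := func(r ref) {
		p := pages[r.page]
		bit := uint64(1) << r.slot
		if p.seen&bit != 0 {
			return
		}
		p.seen |= bit
		if !p.queued {
			p.queued = true
			work = append(work, r.page)
		}
	}
	for _, r := range roots {
		see(r)
	}
	for len(work) > 0 {
		p := pages[work[0]]
		work = work[1:]
		p.queued = false
		pageVisits++
		// Everything seen but not yet scanned is handled in one pass.
		todo := p.seen &^ p.scanned
		p.scanned |= todo
		for ; todo != 0; todo &= todo - 1 {
			slot := bits.TrailingZeros64(todo)
			for _, r := range p.ptrs[slot] {
				see(r)
			}
		}
	}
	return pageVisits
}

// demoHeap builds three 4-slot pages with one unreachable object.
func demoHeap() ([]*page, []ref) {
	pages := []*page{newPage(4), newPage(4), newPage(4)}
	pages[0].ptrs[0] = []ref{{1, 0}, {1, 1}}
	pages[1].ptrs[0] = []ref{{0, 2}}
	pages[1].ptrs[1] = []ref{{2, 3}}
	// Slot 3 on page 0 is never referenced, so it stays unmarked.
	return pages, []ref{{0, 0}}
}

func main() {
	pages, roots := demoHeap()
	visits := mark(pages, roots)
	for i, p := range pages {
		fmt.Printf("page %d: seen=%04b scanned=%04b\n", i, p.seen, p.scanned)
	}
	fmt.Println("page visits:", visits)
}
```

In this model a page can be queued again if a new pointer into it turns up after it was scanned, which is the cost that stops being amortized when only one object per page is found at a time.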

The Go team is blunt about why this helps:

we can scan objects closer together with much higher probability, so there's a better chance we can make use of our caches and avoid main memory.

There's a secondary win on contention. The work list is smaller because a page is one entry no matter how many live objects it holds: tracking pages instead of objects means work lists are smaller, and less pressure on work lists means less contention and fewer CPU stalls. On many-core machines that scalability matters as much as the locality.

flowchart TB
    subgraph flood["Graph flood (pre-1.26): object work list"]
        direction LR
        A1["obj @ page A"] --> B1["obj @ page F"]
        B1 --> C1["obj @ page B"]
        C1 --> D1["obj @ page A"]
        D1 --> E1["obj @ page F"]
        E1 -.->|random addresses,<br/>cache miss each hop| F1["...stall on main memory"]
    end
    subgraph green["Green Tea (1.26): page work list"]
        direction LR
        P1["page A"] -->|scan all live<br/>objs, one pass| P2["page B"]
        P2 -->|sequential,<br/>cache-resident| P3["page F"]
    end
    flood --> green

A genuinely counterintuitive result from the Go team's work: scanning a mere 2% of a page at a time can yield improvements over the graph flood. Even when a page is mostly dead, batching the few live objects by page still beats chasing them individually. The flip side, and the reason some workloads don't benefit, is that some workloads often require the collector to scan only a single object per page at a time. That is potentially worse than the graph flood, because you pay the page-batching overhead without amortizing it over multiple objects.

Vector acceleration (newer amd64 only)

On recent x86 hardware, Green Tea goes further. Because a page's metadata is just bitmaps, the runtime can process an entire page's worth of "seen"/"scanned"/pointer bits using AVX-512 registers — wide enough to hold all of the metadata for an entire page in just two registers — and a Galois-field bit-manipulation instruction (VGF2P8AFFINEQB) to expand object bits to word bits in straight-line code. This kicks in on Intel Ice Lake / AMD Zen 4 and newer. The Go 1.26 release notes scope the extra gain precisely:

Further improvements, on the order of 10% in garbage collection overhead, are expected when running on newer amd64-based CPU platforms (Intel Ice Lake or AMD Zen 4 and newer), as the garbage collector now leverages vector instructions for scanning small objects when possible. (^{[Go 1.26 Release Notes]})

Two practical implications. First, on Arm (Graviton, Apple silicon, Ampere) and on older x86, you get the locality win but not the vector win. A benchmark on an M-series laptop will therefore understate the gain a Zen 4 production fleet sees, and a benchmark on a Zen 4 workstation will overstate the gain for a Graviton fleet. Second, the SIMD machinery here is internal to the runtime; it is unrelated to the separate, experimental simd/archsimd package exposed for user code in 1.26. You don't write any code to get vector-accelerated GC; you just need the hardware.

The numbers, and how much to trust them

Here's the verified data, with citations, stated the way Go's team states it — as a range over their benchmark suite, not a guarantee for your service.

Claim	Verified value	Source
Typical GC CPU reduction	"around 10% less time in the garbage collector"	go.dev/blog/greenteagc
Modal improvement	"~10% reduction ... is roughly the modal improvement"	go.dev/blog/greenteagc
Upper end	"up to 40%" / "between 10% and 40% in our benchmark suite"	go.dev/blog/greenteagc
Vector acceleration (newer amd64)	"an additional 10% GC CPU reduction" (expected)	go.dev/blog/greenteagc
Go 1.26 release-note framing	"10–40% reduction in garbage collection overhead in real-world programs that heavily use the garbage collector"	go.dev/doc/go1.26

Now translate that into something a capacity planner can use. The 10–40% is a reduction in GC CPU, not total CPU. Go's announcement does the arithmetic for you (^{[Go Runtime GC]}):

if an application spends 10% of its time in the garbage collector, then that would translate to between a 1% and 4% overall CPU reduction, depending on the specifics of the workload. (^{[Go Runtime GC]})

So the honest headline is: a service that spends 10% of CPU in GC might get 1–4% of its total CPU back. That's a real, bankable efficiency gain at fleet scale — but it is not the "40% faster" some headlines imply, and it shrinks toward zero for services that barely GC (^{[Go Runtime GC]}).
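The conversion is a single multiplication, sketched below so it can be dropped into a planning script. The function name overallCPUSaving is an assumption for illustration; the inputs are the fraction of total CPU spent in GC and the fraction of that GC CPU you expect (or, better, measured) to disappear.

```go
package main

import "fmt"

// overallCPUSaving converts a reduction in GC CPU into a reduction in total
// CPU. gcShare is GC's share of total CPU (0.10 means 10%); gcReduction is
// the fraction of GC CPU removed (0.10 to 0.40 is Go's published range).
func overallCPUSaving(gcShare, gcReduction float64) float64 {
	return gcShare * gcReduction
}

func main() {
	const gcShare = 0.10 // the example from Go's announcement
	for _, reduction := range []float64{0.10, 0.40} {
		fmt.Printf("GC reduction %.0f%% at %.0f%% GC share: %.1f%% of total CPU\n",
			reduction*100, gcShare*100, overallCPUSaving(gcShare, reduction)*100)
	}
}
```

Feeding in a measured reduction of zero gives a saving of zero, which is the right planning default until a canary says otherwise.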

Don't put 40% in a planning doc

The 40% figure is the top of a benchmark range, achieved by the most GC-bound programs in Go's own suite, and partly contingent on newer-amd64 vector support (^{[Go Runtime GC]}). Using it as a planning input for a typical service is how a capacity model ends up wrong. Plan with 0% until you've measured your own workload (next section); treat anything you measure above that as upside.

The honest caveat: some workloads see nothing

The most useful external data point is a negative one. In September 2025, DoltHub tested the experimental Green Tea collector on Dolt (a SQL database with a Git-style versioned storage engine) and reported it made no difference to their real-world latency. Their finding was more specific than "no change": Green Tea spent slightly more CPU during mark on every GC cycle, and there were more GCs without Green Tea because each one was shorter — the two effects roughly canceling in their latency benchmarks. Their conclusion was that they would not enable it for production builds, while also noting they were not worried about it becoming the default. (dolthub.com)

This is exactly the "single object per page" pathology the Go team warned about: a workload whose live objects are spread thin across pages doesn't amortize the page-batching overhead. It's not a bug and it's not a regression to worry about — it's the reason the answer to "how much will Green Tea help me?" is always "measure it."

Measuring it: the three instruments

You have three tools, in increasing order of precision. Use gctrace to eyeball it, runtime/metrics to get a number you can put on a dashboard, and go test -bench + benchstat to get a statistically defensible A/B.

1. GODEBUG=gctrace=1 — the quick look

Set GODEBUG=gctrace=1 in the environment and the runtime prints one line per GC cycle to stderr. The format, verbatim from the runtime package docs, is:

gc # @#s #%: #+#+# ms clock, #+#/#/#+# ms cpu, #->#-># MB, # MB goal, # MB stacks, #MB globals, # P

Field by field:

Field	Meaning
`gc #`	GC cycle number, incremented each cycle
`@#s`	seconds since program start
`#%`	percentage of total time spent in GC since program start — the headline number
`#+#+# ms clock`	wall-clock time for the phases: STW sweep-termination + concurrent mark/scan + STW mark-termination
`#+#/#/#+# ms cpu`	CPU time for the same three phases: STW sweep-termination + concurrent mark/scan (split as assist/background/idle) + STW mark-termination
`#->#-># MB`	heap size at GC start → at GC end → live heap
`# MB goal`	target heap size for the cycle (`/gc/heap/goal:bytes`)
`# MB stacks` / `#MB globals`	scannable stack / global size
`# P`	number of Ps (processors) used

A representative line under load looks like this (your numbers will differ):

gc 142 @63.119s 2%: 0.018+12+0.004 ms clock, 0.29+4.1/24/61+0.072 ms cpu, 412->418->205 MB, 410 MB goal, 1 MB stacks, 0 MB globals, 16 P

Read it as: cycle 142, 63 s in, GC has taken 2% of total time so far; the concurrent mark phase ran 12 ms wall-clock; CPU cost was 0.29 ms in STW sweep termination, 4.1/24/61 ms of concurrent mark split across assist, background, and idle workers, and 0.072 ms in STW mark termination; the heap grew from 412 MB to 418 MB during the cycle with a 205 MB live set, against a 410 MB goal, on 16 Ps.
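If you want to pull the headline fields out of captured stderr rather than read them by eye, a small parser is enough. The sketch below is an assumption-laden convenience, not a supported interface: traceHead and parseTraceHead are hypothetical names, and the regular expression only covers the leading cycle, elapsed-time, and percentage fields of the documented format.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// gctraceHead matches the leading "gc # @#s #%:" part of a gctrace line.
var gctraceHead = regexp.MustCompile(`^gc (\d+) @([0-9.]+)s (\d+)%:`)

// traceHead holds the three headline fields of one gctrace line.
type traceHead struct {
	Cycle   int     // GC cycle number
	Elapsed float64 // seconds since program start
	GCPct   int     // cumulative percentage of time spent in GC
}

// parseTraceHead extracts the headline fields, reporting false for lines
// that are not gctrace output.
func parseTraceHead(line string) (traceHead, bool) {
	m := gctraceHead.FindStringSubmatch(line)
	if m == nil {
		return traceHead{}, false
	}
	cycle, err1 := strconv.Atoi(m[1])
	elapsed, err2 := strconv.ParseFloat(m[2], 64)
	pct, err3 := strconv.Atoi(m[3])
	if err1 != nil || err2 != nil || err3 != nil {
		return traceHead{}, false
	}
	return traceHead{Cycle: cycle, Elapsed: elapsed, GCPct: pct}, true
}

func main() {
	line := "gc 142 @63.119s 2%: 0.018+12+0.004 ms clock, 0.29+4.1/24/61+0.072 ms cpu, 412->418->205 MB, 410 MB goal, 1 MB stacks, 0 MB globals, 16 P"
	if h, ok := parseTraceHead(line); ok {
		fmt.Printf("cycle=%d elapsed=%.3fs gc=%d%%\n", h.Cycle, h.Elapsed, h.GCPct)
	}
}
```

Keep in mind that the percentage is cumulative since program start and printed as a whole number, so it is coarse; compare it only after both builds have reached steady state.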

To compare 1.25 vs 1.26, run the same workload under both builds and diff the steady-state lines. The instructive part is what doesn't move in lockstep:

# GOEXPERIMENT=nogreenteagc  (graph flood)
gc 980 @120.4s 6%: 0.02+18+0.01 ms clock, ... 401->409->210 MB, 408 MB goal, 16 P
 
# default 1.26  (Green Tea)
gc 870 @120.6s 5%: 0.02+15+0.01 ms clock, ... 404->412->211 MB, 412 MB goal, 16 P

Two readings, both real-world possibilities. First, in this illustrative pair the headline #% dropped from 6% to 5% and the mark phase shrank from 18 ms to 15 ms: a clean Green Tea win. Second, and this is DoltHub's lesson made concrete: notice the cycle count at the same wall-clock time fell from 980 to 870. If instead you saw the per-cycle mark time rise while the cycle count fell (fewer but longer GCs), the overall #% could be flat or even tick up. That is exactly the wash DoltHub measured. Never judge on a single line, and never judge on per-cycle mark time alone; the headline #% over a steady-state window is the number that maps to CPU you get back.

gctrace is per-process and changes timing

gctrace writes to stderr from inside the runtime; it's for diagnostics, not a metrics pipeline. It also perturbs timing slightly. For dashboards and A/B comparisons, prefer runtime/metrics (below), which is designed for in-process sampling.

2. runtime/metrics — the dashboard number

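The runtime/metrics package exposes GC and CPU accounting as cumulative counters you can sample and scrape. The one ratio worth watching is GC CPU as a fraction of the CPU time available to the process: /cpu/classes/gc/total:cpu-seconds divided by /cpu/classes/total:cpu-seconds. Because the denominator is GOMAXPROCS integrated over wall-clock time, it includes idle time, so the ratio is only comparable between builds running at the same load.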

There is one critical rule, and it's stated plainly in the docs for every /cpu/classes metric: they are an overestimate, not directly comparable to system CPU time measurements, and you should "compare only with other /cpu/classes metrics." So never divide a /cpu/classes value by OS-reported CPU time (getrusage, /proc/<pid>/stat) or by a cgroup quota; only ever divide one /cpu/classes value by another. In a ratio of two such metrics, the shared overestimate largely cancels.

Here's a sampler that reads the relevant counters. It compiles and vets clean on Go 1.25.7:

// Package gcstats reads GC and CPU accounting from runtime/metrics and
// reports the fraction of process CPU time spent in the garbage collector.
package gcstats
 
import (
	"fmt"
	"runtime/metrics"
)
 
// GCCPUFraction reports the share of total process CPU time the runtime
// attributed to GC, plus the cumulative compulsory (non-idle) GC CPU seconds.
//
// All /cpu/classes metrics are deliberately self-consistent overestimates: the
// runtime docs state they are "not directly comparable to system CPU time
// measurements" and that you should "compare only with other /cpu/classes
// metrics". So we only ever divide one /cpu/classes value by another.
func GCCPUFraction() (fraction, compulsoryGCSeconds float64) {
	samples := []metrics.Sample{
		{Name: "/cpu/classes/gc/total:cpu-seconds"},
		{Name: "/cpu/classes/gc/mark/idle:cpu-seconds"},
		{Name: "/cpu/classes/total:cpu-seconds"},
	}
	metrics.Read(samples)
 
	gcTotal := samples[0].Value.Float64()
	gcIdle := samples[1].Value.Float64()
	cpuTotal := samples[2].Value.Float64()
 
	if cpuTotal == 0 {
		return 0, 0
	}
	// Idle-mark CPU runs on otherwise-spare Ps, so subtract it to get the GC
	// work that actually competed with your application for CPU.
	compulsoryGCSeconds = gcTotal - gcIdle
	return gcTotal / cpuTotal, compulsoryGCSeconds
}
 
// GCCycles returns the number of completed GC cycles and the current heap goal.
func GCCycles() (cycles, heapGoalBytes uint64) {
	samples := []metrics.Sample{
		{Name: "/gc/cycles/total:gc-cycles"},
		{Name: "/gc/heap/goal:bytes"},
	}
	metrics.Read(samples)
	return samples[0].Value.Uint64(), samples[1].Value.Uint64()
}
 
// Report prints a one-line snapshot, e.g. on a SIGUSR1 handler or a ticker.
func Report(label string) {
	frac, compulsory := GCCPUFraction()
	cycles, goal := GCCycles()
	fmt.Printf("[%s] gc_cpu=%.2f%%  compulsory_gc=%.3fs  cycles=%d  heap_goal=%dMiB\n",
		label, frac*100, compulsory, cycles, goal/(1<<20))
}

The counter names and their exact semantics, verified against the runtime/metrics docs:

Metric	What it is
`/cpu/classes/gc/total:cpu-seconds`	Estimated total CPU time spent on GC tasks (overestimate; compare only within `/cpu/classes`)
`/cpu/classes/gc/mark/idle:cpu-seconds`	GC mark work done on spare CPU the scheduler couldn't otherwise use — "should be subtracted from the total GC CPU time to obtain a measure of compulsory GC CPU time"
`/cpu/classes/gc/mark/assist:cpu-seconds`	GC work goroutines did inline with allocation to keep the GC from falling behind — a rise here signals GC pressure
`/cpu/classes/total:cpu-seconds`	Total CPU available to the process: GOMAXPROCS integrated over wall-clock; the denominator
`/gc/cycles/total:gc-cycles`	Count of completed GC cycles
`/gc/heap/goal:bytes`	Heap size target for the end of the current cycle

That mark/idle subtraction is the subtle one. If your service has spare cores, the runtime opportunistically does mark work on them, which inflates "total GC CPU" without actually stealing time from your application. Subtracting idle mark CPU gives you compulsory GC CPU — the part that genuinely competed with request handling. When you compare Green Tea on vs off, watch the compulsory number, not just the gross total.

Wiring it into a Prometheus exporter is the same metrics.Read call on a ticker — emit gc_cpu_fraction as a gauge and you have the before/after panel from the opening incident.
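One detail matters for a dashboard: the counters are cumulative since process start, so a gauge built from the raw ratio flattens out over time. A minimal sketch of a windowed sampler follows, using only the standard library. The names snapshot, readSnapshot, and windowFractions are assumptions, the 200 ms interval is only there to keep the demo short, and a real exporter would publish the two values instead of printing them.

```go
package main

import (
	"fmt"
	"runtime/metrics"
	"time"
)

// snapshot is one reading of the cumulative CPU counters, in CPU-seconds.
type snapshot struct{ gc, gcIdle, total float64 }

func readSnapshot() snapshot {
	s := []metrics.Sample{
		{Name: "/cpu/classes/gc/total:cpu-seconds"},
		{Name: "/cpu/classes/gc/mark/idle:cpu-seconds"},
		{Name: "/cpu/classes/total:cpu-seconds"},
	}
	metrics.Read(s)
	return snapshot{
		gc:     s[0].Value.Float64(),
		gcIdle: s[1].Value.Float64(),
		total:  s[2].Value.Float64(),
	}
}

// windowFractions returns the gross and compulsory GC CPU fractions for the
// interval between two snapshots. Both are ratios of /cpu/classes deltas.
func windowFractions(prev, cur snapshot) (gross, compulsory float64) {
	dTotal := cur.total - prev.total
	if dTotal <= 0 {
		return 0, 0
	}
	dGC := cur.gc - prev.gc
	dIdle := cur.gcIdle - prev.gcIdle
	return dGC / dTotal, (dGC - dIdle) / dTotal
}

func main() {
	const interval = 200 * time.Millisecond // use a scrape-sized window in practice
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	prev := readSnapshot()
	for i := 0; i < 3; i++ {
		<-ticker.C
		cur := readSnapshot()
		gross, compulsory := windowFractions(prev, cur)
		fmt.Printf("gc_cpu_fraction=%.4f compulsory_gc_cpu_fraction=%.4f\n", gross, compulsory)
		prev = cur
	}
}
```

A window that spans several GC cycles gives a steadier reading than a very short one, so it is reasonable to align the interval with your scrape period.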

3. go test -bench + benchstat — the defensible A/B

For a number you'd defend in review, build the same benchmark binary twice — once with the collector on (the 1.26 default), once with GOEXPERIMENT=nogreenteagc — and compare with benchstat. Report GC's CPU share as a custom benchmark metric so it lands in the benchstat table next to ns/op:

package alloc
 
import (
	"runtime"
	"runtime/metrics"
	"testing"
)
 
// Node is a small pointer-rich object: the shape that dominates GC marking
// cost in real services (request structs, parsed JSON, cache entries).
type Node struct {
	Next    *Node
	Payload [4]int64
}
 
// BuildList allocates a linked list of n nodes. A list spreads small objects
// across many heap pages — exactly where Green Tea's page-at-a-time scanning
// is meant to win (or, for sparse live sets, not to).
func BuildList(n int) *Node {
	var head *Node
	for i := 0; i < n; i++ {
		head = &Node{Next: head, Payload: [4]int64{int64(i)}}
	}
	return head
}
 
// Sum walks the list so the optimiser can't elide the allocation and so the
// caller has real work depending on every node.
func Sum(head *Node) int64 {
	var total int64
	for n := head; n != nil; n = n.Next {
		total += n.Payload[0]
	}
	return total
}
 
// BenchmarkChurn builds and walks a fresh 200k-node list on every iteration,
// so any GC cycle that runs mid-build has a large pointer-rich live set to
// mark. Run both ways and compare gc_cpu_%:
//
//	GOEXPERIMENT=nogreenteagc go test -bench=Churn -count=10 > old.txt
//	go test -bench=Churn -count=10 > new.txt   # default (Green Tea) in 1.26
//	benchstat old.txt new.txt
func BenchmarkChurn(b *testing.B) {
	const listLen = 200_000
	before := readGCCPU()
 
	b.ResetTimer()
	var sink int64
	for i := 0; i < b.N; i++ {
		head := BuildList(listLen)
		sink += Sum(head)
	}
	b.StopTimer()
	runtime.KeepAlive(sink)
 
	after := readGCCPU()
	if d := after.total - before.total; d > 0 {
		// Custom metric: benchstat will track gc_cpu_% across the A/B builds.
		b.ReportMetric((after.gc-before.gc)/d*100, "gc_cpu_%")
	}
}
 
type gcCPU struct{ gc, total float64 }
 
func readGCCPU() gcCPU {
	s := []metrics.Sample{
		{Name: "/cpu/classes/gc/total:cpu-seconds"},
		{Name: "/cpu/classes/total:cpu-seconds"},
	}
	metrics.Read(s)
	return gcCPU{gc: s[0].Value.Float64(), total: s[1].Value.Float64()}
}

Three rules for a result you can trust:

Same hardware, same kernel, same input. Because vector acceleration is amd64-only, a comparison on Arm or older x86 will understate the gain a Zen 4 / Ice Lake fleet sees. Benchmark on hardware that matches production.
-count=10 and benchstat, never a single run. GC timing is noisy; a single ns/op delta is meaningless. benchstat reports the change with a confidence interval and a p-value — if it says the difference is within noise, it's within noise.
A microbenchmark is a hypothesis, not the verdict. A synthetic allocation loop tells you whether Green Tea can help your allocation shape. Only a canary deploy with the runtime/metrics panel tells you what it does to your real traffic. DoltHub's per-cycle GC numbers moved while their end-to-end latency did not; the end-to-end number is the one that counts.

Building the opt-out comparison binary

GOEXPERIMENT is read at build time, not run time:

# 1.26 default: Green Tea on (run from the main package's directory)
go build -o app-green .
 
# Same source, collector off
GOEXPERIMENT=nogreenteagc go build -o app-nogreen .
 
# Confirm which experiments a binary was built with
go version -m ./app-nogreen | grep GOEXPERIMENT

Run both against the same load and compare the gc_cpu_fraction panel. This is also your rollback artifact if a canary regresses.

Status and timeline: the opt-out window is closing

Green Tea's rollout is deliberately staged, and the opt-out is explicitly temporary. The Go 1.26 release notes say so directly:

The Green Tea garbage collector, previously available as an experiment in Go 1.25, is now enabled by default after incorporating feedback. ... The new garbage collector may be disabled by setting GOEXPERIMENT=nogreenteagc at build time. This opt-out setting is expected to be removed in Go 1.27. If you disable the new garbage collector for any reason related to its performance or behavior, please file an issue.

Release	Date	Green Tea status	Flag
Go 1.25	Aug 2025	Experimental, off by default	Opt in: `GOEXPERIMENT=greenteagc`
Go 1.26	10 Feb 2026	Default on	Opt out: `GOEXPERIMENT=nogreenteagc`
Go 1.27	~Aug 2026 (expected)	Default; opt-out expected removed	—

The practical reading: opt-out is a bridge, not a destination. If you hit a real regression on 1.26, nogreenteagc buys you one release cycle — roughly six months — to file an issue, get a fix, and migrate. It is not a setting you can pin indefinitely. Anyone choosing to disable it should treat it as a tracked piece of tech debt with a removal deadline already on the calendar, and should file the issue the release notes ask for, because that feedback is what gets the pathological-workload cases fixed before the escape hatch disappears.

Should you care? A validation checklist

In our experience, most teams should do nothing but upgrade and glance at a dashboard. Here's how to decide how much attention this deserves.

You'll likely benefit (measure to confirm the size):

GC is a visible slice of your CPU — check /cpu/classes/gc/total vs /cpu/classes/total; in our experience, if GC is >5% of CPU there's room to win
You allocate many small, pointer-rich objects per request (parsed JSON, ORM rows, graph/tree/list nodes, cache entries)
You run on newer amd64 (Intel Ice Lake / AMD Zen 4+) — you get the extra vector-acceleration win on top of locality
High core counts — the reduced work-list contention scales with GOMAXPROCS

You may see little or nothing (don't plan on a gain):

GC is already a tiny fraction of CPU — there's little to reduce
Your hot allocations are large pointer-free buffers ([]byte, numeric arrays) — cheap to scan already
Live objects are sparse across pages (DoltHub's case) — page batching doesn't amortize
You run on Arm or older x86 — locality win only, no vector acceleration

The validation procedure:

Upgrade a canary instance to Go 1.26; leave the rest on 1.25 as a control
Scrape /cpu/classes/gc/total:cpu-seconds ÷ /cpu/classes/total:cpu-seconds on both; compare at equal request rate
Confirm direction with GODEBUG=gctrace=1 — watch the #% field trend, and check tail latency (p99) didn't regress even if mean improved
For a defensible number, build app-green and app-nogreen from the same commit and run benchstat with -count=10
If you measure a regression, set GOEXPERIMENT=nogreenteagc, file an issue (the release notes request it), and put a Go-1.27 removal date on the ticket
If you measure a gain, update your capacity model with the measured number — not the 40% headline (^{[Go Runtime GC]}) — and roll the fleet forward

The takeaway

Green Tea is the kind of runtime change that's easy to over- or under-sell (^{[Go Runtime GC]}). It is not a 40%-faster button; it's a smarter mark algorithm — pages instead of objects — that attacks the memory-stall bottleneck where 35%+ of marking time was being wasted, plus a hardware-accelerated path on newer amd64. For allocation-heavy services it tends to give back single-digit-percent total CPU, which at fleet scale is real money. For storage engines and buffer-shuffling workloads it may give back nothing, and that's fine.

The discipline this asks for is the same discipline any performance claim asks for: don't trust the headline number, measure your own. Go ships the instruments — gctrace for the quick look, runtime/metrics for the dashboard, benchstat for the defensible A/B. Use them on a canary before the win (or the wash) goes into a planning doc. And if you do need to opt out, remember the clock: nogreenteagc is a six-month bridge to Go 1.27, not a place to live.

What went wrong: trusting the headline number

After reading the Go 1.26 release notes, a platform team updated their capacity model to assume 10% less GC CPU across the fleet — reasonable, since "up to 40%" was the headline and 10% felt conservative. They cut their node pool by two machines based on the projected headroom. After the rollout, the service that consumed the most CPU was a log-shipping pipeline that shuffled large []byte buffers with almost no pointer-heavy allocations. Its GC CPU moved by less than 1%. The saved headroom didn't exist, and the next traffic spike caused OOM kills on the reduced pool. Two nodes went back in at 2am. The fix was trivial — measure per-service before updating the model — but the 2am page was the price of skipping it.

BackendBytes Engineering Team

Engineering Team

A multidisciplinary team of backend engineers, architects, and DevOps practitioners shipping deep dives into distributed systems and production infrastructure.