go-pipe v2: Stage API, pooled copies, MemoryWatch removal#51
Conversation
409297e to
fa6c12d
Compare
79f4a1a to
911ed5b
Compare
|
This PR is still in draft mode. Do you want review/feedback already? |
@mhagger I ran out of time to review the LLM output before vacation. I wanted to read through the changes more fully myself before inflicting them on anyone else so I left it in draft mode. |
911ed5b to
4764d95
Compare
Ported from version-2 branch commits: - 95dc2e8 pipeline_test.go: get rid of a bunch of unnecessary tmpdirs - 5fdc22a TestPipelineStdinThatIsNeverClosed(): create stdin more simply - c2c9802 pipeline_test.go: use WithStdoutCloser() to close stdout pipes Tests that don't run external commands (or whose commands don't need a specific working directory) don't need t.TempDir().
Add some benchmarks that move MB-scale data through pipelines consisting of alternating commands and functions, one in small writes, and one buffered into larger writes, then processing it one line at a time. This is not so efficient, because every transition from `Function` → `Command` requires an extra (hidden) goroutine that copies the data from an `io.Reader` to a `*os.File`. We can make this faster!
* Rename * `newNopCloser()` → `newReaderNopCloser()` * `nopCloser` → `readerNopCloser` * `nopCloserWriterTo` → `readerWriterToNopCloser` * `nopWriteCloser` → `writerNopCloser` to help keep readers and writers straight and because only the `Close()` part is a NOP. * Move `writerNopCloser` to `nop_closer.go` to be with its siblings.
4764d95 to
f97fddd
Compare
2ed608b to
d14ef7b
Compare
08a9cf4 to
fca1bfc
Compare
There was a problem hiding this comment.
Pull request overview
This PR modernizes the pipeline stage contract for v2 by letting stages declare I/O preferences and receive both stdin and stdout from the pipeline, enabling better pipe selection and removing the synthetic ioCopier stage.
Changes:
- Redesigns
StagewithPreferences()andStart(..., stdin, stdout)soPipeline.Start()can negotiateos.Pipevsio.Pipe. - Reworks command/function/memory-limit stages for the new interface, including pooled stdout copies for non-file command destinations.
- Updates module path to
/v2and adds regression/benchmark coverage for pipe matching, empty pipelines, fast-path stdout, and start-failure cleanup.
Show a summary per file
| File | Description |
|---|---|
README.md |
Updates documentation links for the v2 module path. |
go.mod |
Changes the module path to github.com/github/go-pipe/v2. |
internal/ptree/ptree_test.go |
Updates internal import path for v2. |
pipe/stage.go |
Redefines the public Stage interface and adds I/O preference types. |
pipe/pipeline.go |
Reworks pipeline startup to negotiate pipe types and pass stdout directly. |
pipe/command.go |
Adapts command stages to the new interface and adds pooled stdout copy handling. |
pipe/function.go |
Adapts function stages to receive caller-provided stdout and panic handling. |
pipe/filter-error.go |
Forwards panic handlers through error-filtering wrappers. |
pipe/memorylimit.go |
Ports memory-watching wrappers to the new stage interface. |
pipe/nop_closer.go |
Splits reader/writer nop closers and adds test unwrapping support. |
pipe/copy_pool.go |
Adds pooled-buffer copy helper with ReaderFrom fast-path support. |
pipe/iocopier.go |
Removes the old synthetic copier stage. |
pipe/scanner.go |
Simplifies scanner error return. |
pipe/command_linux.go |
Updates internal import path for v2. |
pipe/command_test.go |
Applies formatting cleanup. |
pipe/command_nil_panic_test.go |
Updates direct Start call for the new signature. |
pipe/pipeline_test.go |
Updates tests/benchmarks for v2 behavior, empty pipelines, and panic forwarding. |
pipe/memorylimit_test.go |
Reworks memory-limit tests for the new pipeline flow. |
pipe/pipe_matching_test.go |
Adds coverage for negotiated stdin/stdout pipe types. |
pipe/export_test.go |
Exposes nop-closer unwrapping for external package tests. |
pipe/command_stdout_fastpath_test.go |
Adds tests pinning direct *os.File stdout handoff. |
pipe/command_starterror_test.go |
Adds regression coverage for start-failure copy-goroutine cleanup. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 22/22 changed files
- Comments generated: 2
d7948da to
33ca03c
Compare
I think this is actually worth taking a look at now. |
Add WithExtraEnv so callers can append environment variables for one wrapped stage without changing the pipeline-wide environment. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
WithExtraEnv appended to the StageOptions Env.Vars slice directly. The StageOptions value is copied per stage, but the slice backing array can still be shared with the pipeline-level environment hooks, so a wrapped stage could overwrite another stage's stage-local hook when the shared slice had spare capacity. Fix, and add tests. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
WithExtraEnv wraps the inner stage, so it would otherwise hide optional command hooks such as Process and Kill. Wrap the inner stage in a way that faithfully preserves whether both are present. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use testify assertions consistently in env_stage_test and add coverage for stage-local env values overriding pipeline values without leaking to later stages. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
add/remove stages: remove MemoryWatch and add WithExtraEnv
also drop one low-value test
|
I'm quite un-psyched about every implementation of |
Because separate Reader and Closer arguments make it look like they could be separate objects for some reason. (I had to argue with the copilot review on that point in making that change, as a matter of fact). If it's just one object, then we don't have to consider the issue. The casting is hidden and cheap, so what's the problem? |
The problem is that the interface is deceptive. If I pass an object as an |
mhagger
left a comment
There was a problem hiding this comment.
Here are a few comments so far. I'm not done reading.
| // ForbidStdin returns a FunctionOption declaring that the stage must not be | ||
| // connected to stdin. | ||
| func ForbidStdin() FunctionOption { | ||
| return func(s *goStage) { | ||
| s.requirements.Stdin = StreamForbidden | ||
| } | ||
| } | ||
|
|
||
| // ForbidStdout returns a FunctionOption declaring that the stage must not be | ||
| // connected to stdout. | ||
| func ForbidStdout() FunctionOption { | ||
| return func(s *goStage) { | ||
| s.requirements.Stdout = StreamForbidden | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
It's weird that Function() can declare that it doesn't want a stdin or stdout but Command() can't. Let me think through this…
It's easy to think of a pipeline that would start with a command that doesn't need any stdin, like
git cat-file --batch-check --batch-all-objects | …`
or end with one that doesn't produce any stdout, like
… | tar -x
So the question is whether there's some reason for functions to be pickier about this than commands.
(If we wanted to offer this feature for commands, it couldn't use an analogous interface. Command() is already variadic, so we couldn't add options to it. We would probably have to add the extra options to CommandStage and the user would have to create the commandStage in two steps. Plus, the options would need different names, because the FunctionOption interface is defined in such a way that it only applies to functions. Or we would have to invent a totally different mechanism.)
It's true that exec.Cmd always needs to pass file descriptors to subprocesses as stdin and stdout. (If none is set explicitly, it internally opens FDs to /dev/null and passes those.) So, in a way, there's always a stdin and stdout even if the command won't make use of it, and we wouldn't be able to save overhead by avoiding their creation. So if there were a way to make such a declaration, then it would only serve as documentation and an internal consistency check, and only if the person who put together the pipeline took the trouble to explicitly declare that the command doesn't want stdin/stdout but also passed the unwanted stream in. So I guess that's currently pretty unlikely.
There are two things that are different about functions relative to commands in this regard:
- If a function doesn't need stdin/stdout, we can just pass
nilto it rather than having to open/dev/nullor do any other work; and - The library already provides pre-written function stages, namely the
Printones. A person who is using one of those stages is one step further removed from the person who wrote those stages in the first place and might be more likely to benefit from the consistency check.
So yes, I see that functions benefit more from this capability than commands would, and this seems ok. Funny, but ok.
If we wanted to make this a more general facility, we could do it as follows: Add functions Source() and Sink() that can be used to wrap any Stage. The results would be a special sourceStage or sinkStage, which wraps the enclosed stage and delegates to it. Its Requirements() method would tweak the requirements before returning them, and the Start() method would do the consistency checks on its arguments.
Co-authored-by: Michael Haggerty <mhagger@github.com>
Motivation: Stage pipe negotiation and pooled buffer copies
The original motivation of this work was the perf optimizations in #49 and #50. But rather than optimizing the
ioCopier, we can instead completely1 remove it by adopting the approach taken in @mhagger's Stage interface redesign (#21). This branch rebasesd #21 onto a recent main and folds in the #49/#50 optimizations according to the new structure.This PR is the integration branch for go-pipe v2. PRs included:
v2 interface changes
Summary of new Stage interface:
Since we're doing a major version bump, we also take advantage of the opportunity to do some additional API cleanups. This happened in phases:
Phase 1: Stage interface pipe negotiation
The interface now hands each stage both its stdin and stdout, plus a new
StageOptions, and asks each stage to declare its I/O requirements so the pipeline can pick the cheapest pipe type between neighbors.Phase 2: rework MemoryWatch constructor
In #54, the three flavors of MemoryWatch stage are folded together to use
MemoryWatchOptionto configure a single kind of stage instead of having several almost-identical variants.Phase 3: Separate close-responsibility and other concerns
In #53, rather than inferring who is responsible for closing a pipeline stage's stdin/stdout from the dynamic type, we communicate that explicitly in StageOptions.
Phase 4: Remove non-essential Stage implementations and add
WithExtraEnvstage.Since it's (almost) possible to implement the MemoryWatch stage entirely outside of go-pipe, in #56 we add the one interface necessary to allow that (captured as
processProvider- a stage that implementsProcess() *os.Process- and remove it. This allows a lot of code to be removed -pipe/memory*as well as all ofinternal/ptree.In the other direction, we add a
WithExtraEnvwrapper that allows a single stage to be run under a modified environment (this complementsStageOptions.Env, which sets the environment for the overall pipeline).Phase 1 Details
Since Phase 1 wasn't a separate PR and was repurposed as this integration branch, its original description is below.
Perf optimizations
We want to minimize data copies, whether by letting the kernel move bytes instead of Go, or if we do copy in Go, doing it more efficiently. With the pipeline owning the connections it can:
*os.Filedestination's fd directly into the child, letting the kernel (sendfile(2)/splice(2)and friends) do the copy (so-called "zero-copy" i/o);exec.Cmdallocate a fresh one per pipeline.#49 and #50 were aiming for similar optimizations, but now they fall out of the structure instead of being bolted onto
ioCopier.Panic Handling
go-pipe runs user code (function stages, memory-limit event handlers) in goroutines it spawns itself, so a panic there used to be able to take down the whole process. There was already panic handling, but it was part of the Stage interface, which caused a lot of problems with error-prone boilerplate in stages that don't do any panic handling (which is most of them). Panic handling is now a callback passed into Start that the pipeline threads to every stage, so a single handler covers all stages.
Supersedes
Stageinterface to make stdin/stdout handling more flexible #21 (stale, merge conflicts)Stage2interface that allows such stages to be started more flexibly #20 (the opt-inStage2variant)sync.PoolforioCopiercopy buffers) — pool preserved incopy_pool.goioCopier) — sendfile preserved structurally; theWriterTopool-bypass workaround is gone, since we control the copy site nowThe
git-systems/pooled-copiesbranch (which carried #49 + #50) can be deleted after this merges./cc @mhagger @migue @carlosmn
Copilot Summary
Interface change
Stage.Startgainsstdoutand aStartOptionsstruct (a struct so future run-scoped options don't break the interface again):Pipeline.Start()uses each stage'sPreferences()to negotiate the pipe type between adjacent stages:os.Pipe()when either neighbor is a command (needs a real*os.Filefd)io.Pipe()when both neighbors are Go functions (all userspace, cheaper)stdoutpassed directly to the last stage — no syntheticioCopierModule path bumped to
github.com/github/go-pipe/v2for the breaking change.Fast paths preserved from #49/#50
ioCopieris deleted; the optimizations it carried now live in the stage structure:commandStagewriting to an*os.Filedestination dup's the fd into the child; for a non-*os.Filewriter that implementsio.ReaderFrom, the copy goes throughReadFrom(sendfile where the kernel supports it). Pinned inpipe/command_stdout_fastpath_test.go.*os.File, non-ReaderFromstdout,setupPooledStdoutbuilds anos.Pipe()and copies through async.Poolbuffer rather than lettingexec.Cmdallocate per-pipeline. Seepipe/copy_pool.go.Type unwrapping (replaces transparent NopCloser forwarding)
The old
NopCloserre-exposedio.WriterToby forwarding, so a naiveio.Copykept the fast path transparently through the wrapper. That's replaced by explicit, exported helpers:UnwrapReader(io.Reader) io.Reader/UnwrapWriter(io.Writer) io.Writerrecover the concrete type go-pipe wrapped around stdin/stdout (nil-safe; non-wrappers pass through unchanged).goStage.StartandcommandStage.Startcall them internally, so everypipe.Functionconsumer transparently regainsWriterTo/ReaderFrom/*os.Fileidentity — a superset of the single method the old forwarding preserved, and it picked up a casegoStagewas previously missing.Panic handling
StartOptions.PanicHandler(StagePanicHandler = func(p any) error) is set on the pipeline viaWithStagePanicHandlerand threaded to every stage'sStart. The previousStagePanicHandlerAwareopt-in interface is removed; wrapper stages just forwardopts.FunctionstageStageFuncs and memory-limit event handlers (both run in library-spawned goroutines). IfPanicHandleris nil the panic propagates and crashes — intentional, since gitrpcd treats an unhandled pipeline panic as catastrophic.PanicHandlerrather than crashing. The over-limit kill is guaranteed even if the event handler panics; a panic in a purely-informational handler (RSS-read error / peak usage) is recovered and the stage keeps running unmonitored ("fail open") — losing monitoring is not, on its own, a reason to stop a healthy stage. This is documented onMemoryLimit/StartOptions.PanicHandler.Commit structure
version-2(authorship preserved): linter shush,pipeline_test.gocleanup, pipeline benchmarks, NopCloser simplification, the Stage interface change, pipe-matching tests.MemoryLimitWithObserver, restore the Function-stage panic handler, fixmemoryWatchStage.Wait()to alwaysstopWatching(), lint.*os.Filestdout, avoid leaking the pooled-stdout goroutine whencmd.Start()fails./v2, removeIOPreferenceNil, handle nil stdin/stdout cleanly, exportUnwrap{Reader,Writer}and fully unwrapgoStagestdin, thread the panic handler throughStart(droppingStagePanicHandlerAware), and recover panics escaping the memory-watch goroutine.Footnotes
almost ↩