Anonymous View
Skip to content

feat(server): support out-of-tree compute drivers via --compute-driver-socket#1703

Open
st-gr wants to merge 2 commits into
NVIDIA:mainfrom
st-gr:feat/external-compute-driver-socket
Open

feat(server): support out-of-tree compute drivers via --compute-driver-socket#1703
st-gr wants to merge 2 commits into
NVIDIA:mainfrom
st-gr:feat/external-compute-driver-socket

Conversation

@st-gr

@st-gr st-gr commented Jun 3, 2026

Copy link
Copy Markdown

Summary

Adds an opt-in --compute-driver-socket=<path> flag (env OPENSHELL_COMPUTE_DRIVER_SOCKET) to the gateway. When set, the gateway dispatches sandbox lifecycle to an out-of-tree compute driver speaking the existing compute_driver.proto contract over a Unix domain socket, instead of one of the four built-in drivers (Kubernetes / Podman / Docker / VM).

The implementation deliberately avoids adding a fifth ComputeDriverKind variant — the gateway treats out-of-tree drivers as an acquired endpoint rather than a parallel kind in the in-tree enum. This was rewritten in response to review feedback from @elezar and @drew (see "Review history" below).

Related Issue

Supersedes the closed draft #1604 (same author, same patch shape — re-opened as a non-draft after rebasing onto current main, with all commits DCO-signed and st-gr-attributed). Same direction as cheese-head's vouch in #1345 ("extend or provide new out-of-tree compute drivers to OpenShell").

No upstream tracking issue filed — happy to file one if reviewers prefer.

Changes

  • crates/openshell-core/src/config.rs: adds Config.external_compute_driver_socket: Option<PathBuf> plus with_external_compute_driver_socket(...). No new ComputeDriverKind variant. The four in-tree variants remain Copy and bare.
  • crates/openshell-server/src/cli.rs: adds --compute-driver-socket=<path> flag (env OPENSHELL_COMPUTE_DRIVER_SOCKET) on RunArgs. When set, effective_single_driver returns None, short-circuiting both --drivers and the auto-detect probe so callers keyed off the in-tree enum skip the match.
  • crates/openshell-server/src/compute/mod.rs:
    • ComputeRuntime::from_driver now takes Option<ComputeDriverKind>; the four in-tree constructors wrap their kind in Some(...). Out-of-tree drivers pass None.
    • connect_external_compute_driver(&Path) produces a tonic Channel over a pre-existing UDS, mirroring the connector pattern the VM driver already uses (hyper-util::TokioIo + tokio::net::UnixStream).
    • ComputeRuntime::new_remote_external consumes the channel via the existing RemoteComputeDriver proxy and skips shutdown cleanup / managed-process supervision (the operator owns the lifecycle).
    • from_driver logs the driver_name advertised by GetCapabilities whenever driver_kind is None, so operators can confirm the gateway connected to the driver they expect (logged for diagnostics — not validated against operator-supplied input in this PR).
  • crates/openshell-server/src/lib.rs: build_compute_runtime short-circuits to connect_external_compute_driver + new_remote_external when Config.external_compute_driver_socket is set, before consulting --drivers / auto-detect. The four in-tree backends keep their existing dispatch arms unchanged.
  • architecture/compute-runtimes.md: adds an "External" row to the Runtime Summary table (activation flag, GetCapabilities.driver_name logging, operator-owned process+socket lifecycle, trust boundary = filesystem permissions on the UDS) and a matching row in Supervisor Delivery.

No compute_driver.proto changes. No new workspace dependencies (hyper-util and tower are already in scope; tokio already exposes UnixStream under cfg(unix)). No security-model changes.

Testing

  • mise run pre-commit passes — not run end-to-end (mise not on author's dev environment), but the equivalent rust pieces verified independently in a rust:1.95.0-slim-bookworm dev container with the workspace's apt deps: cargo fmt --all -- --check clean; cargo clippy --no-deps --workspace --all-targets -- -D warnings clean; cargo test --workspace --lib --bins passes (incl. all 769 openshell-server lib tests).
  • Unit tests added/updated — 3 new CLI-arg tests in crates/openshell-server/src/cli.rs::tests: compute_driver_socket_flag_populates_run_args, compute_driver_socket_overrides_drivers_flag, compute_driver_socket_reads_from_env_var. The four existing in-tree compute-driver tests (configured_compute_driver_*) continue to pass unchanged.
  • E2E tests added/updated — none — running an External driver in CI would require shipping a stub driver binary; deferring until at least one in-tree consumer wants to test against it. The patch is exercised in production at SAP via the st-gr/openshell-driver-kyma reference driver running against a Gardener-managed Kyma cluster since 2026-05-28 (sandbox CR creation, openshell sandbox exec, openshell sandbox upload/download all work).

Checklist

  • Follows Conventional Commits (feat(core):, feat(server):, docs(compute):)
  • Commits are signed off (DCO)
  • Architecture docs updated — added "External" row to architecture/compute-runtimes.md::Runtime Summary and ::Supervisor Delivery. Per-runtime implementation-notes section is intentionally NOT extended for external drivers because they ship their own documentation; the reference implementation lives at st-gr/openshell-driver-kyma.

Review history

The original revision of this PR added ComputeDriverKind::External(PathBuf) as a fifth enum variant. @elezar and @drew both flagged this as the wrong shape — out-of-tree drivers are an acquired endpoint rather than a parallel kind. The current revision implements that direction:

  • No External enum variant. ComputeDriverKind is unchanged; out-of-tree dispatch lives on Config.external_compute_driver_socket and goes through a separate construction path that produces a ComputeRuntime with driver_kind = None.
  • (name, socket) tuple syntax is deferred. The flag remains a single path. GetCapabilities.driver_name is logged for diagnostics, not validated against operator input. This matches @drew's suggestion to defer name-keying / <name>@<socket> to a follow-up that aligns with the broader name-keyed driver registry direction (rfc-0006: add driver config passthrough proposal #1589).
  • Construction / consumption split. The VM driver still spawns its own subprocess; external drivers consume an operator-owned channel. Both paths share RemoteComputeDriver for the actual proto consumption — the only divergence is at construction.

Happy to revisit any of those choices if reviewers prefer a different cut point.

Notes for reviewers

Out of scope (intentional):

  • No new in-tree compute driver implementations. This PR only adds the activation flag + connector + runtime split. Operators bring their own driver process and point --compute-driver-socket at its socket path.
  • No compute_driver.proto changes. The existing protocol is reused verbatim — the same one already used by the in-tree VM driver to talk to its tonic peer.
  • No security model changes. UDS endpoint security is the operator's responsibility (filesystem permissions, sidecar isolation), matching the --drivers vm posture.

Operator context:

The forked gateway image at ghcr.io/st-gr/openshell-gateway (carrying these patches plus an in-tree Dockerfile.gateway) has been driving production-shaped Kubernetes clusters with the st-gr/openshell-driver-kyma compute driver since 2026-05-28. The flag has needed zero changes since it landed — the API shape is stable. Merging this PR lets the driver consume the unmodified upstream gateway image, retiring the fork.

@st-gr st-gr requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners June 3, 2026 05:27
@copy-pr-bot

copy-pr-bot Bot commented Jun 3, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@st-gr

st-gr commented Jun 3, 2026

Copy link
Copy Markdown
Author

I have read the DCO document and I hereby sign the DCO.

@st-gr

st-gr commented Jun 3, 2026

Copy link
Copy Markdown
Author

recheck

@drew drew self-assigned this Jun 3, 2026
@drew

drew commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

/ok to test 8689967

@st-gr st-gr force-pushed the feat/external-compute-driver-socket branch from 8689967 to 464018b Compare June 3, 2026 08:18
@st-gr

st-gr commented Jun 3, 2026

Copy link
Copy Markdown
Author

Pushed a doc_markdown fix (added backticks around compute_driver.proto and OPENSHELL_COMPUTE_DRIVER_SOCKET in the External variant's doc comment, folded into the originating feat(core): commit via autosquash).

/ok to test 464018b7 — superseded by a second force-push (redundant_pub_crate fix); see the follow-up comment below for the current SHA.

st-gr added a commit to st-gr/OpenShell that referenced this pull request Jun 3, 2026
NVIDIA/OpenShell's Cargo.toml has

    [workspace.lints.clippy]
    pedantic = "warn"
    nursery = "warn"
    all = "warn"

so `cargo clippy --workspace -- -D warnings` escalates pedantic lints
(including `doc_markdown`) to errors. Their CI uses this as part of
`mise run pre-commit`. The fork's existing `release-gateway.yml` only
runs `docker build` (which calls `cargo build --release` — no clippy),
so the same lints weren't enforced on the fork.

PR NVIDIA#1703 caught us with two `doc_markdown` violations
in `crates/openshell-core/src/config.rs:56` (`compute_driver.proto` and
`OPENSHELL_COMPUTE_DRIVER_SOCKET` missing backticks in the External
variant's doc comment). The fork's CI didn't catch it; NVIDIA's did.

This workflow runs on push to main + `feat/**` branches and on
workflow_dispatch. It pins to Rust 1.95.0 (matches upstream mise.toml)
and installs the same native deps `Dockerfile.gateway` needs
(libclang-dev + libz3-dev for transitive bindgen + z3-sys).

Workflow is light: ubuntu-latest runner, ~10-15 min cold, ~3-5 min
with the Swatinem/rust-cache hot. Path-filtered so doc-only changes
don't trigger it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
@st-gr st-gr force-pushed the feat/external-compute-driver-socket branch from 464018b to 1c6663d Compare June 3, 2026 09:01
@st-gr

st-gr commented Jun 3, 2026

Copy link
Copy Markdown
Author

Force-pushed once more — caught a second pedantic clippy lint locally that the previous run didn't get to (redundant_pub_crate on connect_external_compute_driver). Folded into the originating feat(server): wire ... commit via autosquash so the PR stays at 5 commits.

New HEAD: 1c6663d. Verified clean with the full-workspace check NVIDIA's CI runs:

      cargo fmt --all -- --check
      cargo clippy --workspace --all-targets -- -D warnings

Could I get a re-vet with /ok to test 1c6663d3?

Apologies for the round-trip — I've added a rust-lint workflow to my fork to mirror your mise run pre-commit so future pushes catch these locally before they reach your runners.

elezar
elezar previously requested changes Jun 3, 2026

@elezar elezar left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking ahead to changes such as #1589 and support for multiple drivers in future, does it make sense to formulate this as a "named driver endpoint" where a user would supply a driver name and a socket path? Especially given that the in-tree vm driver is already an example of such a driver and we would only be extending the existing functionality.

In practice, the gateway would select a driver by name (instead of a kind enum) and one could use GetCapabilities.driver_name to validate the user-provided name if required.

@st-gr

st-gr commented Jun 3, 2026

Copy link
Copy Markdown
Author

@elezar
I read through #1589 / RFC 0005. The named-endpoint shape and GetCapabilities.driver_name validation both fit cleanly. Two ways to implement your suggestion in this PR:

A. Minimal add. Keep External(PathBuf) and add a paired --compute-driver-name=<name> flag. After the UDS connect, the gateway calls GetCapabilities, compares response.driver_name to the supplied name, and errors loudly on mismatch. The enum surface stays flat. Could serve as a stepping stone; when #1589's registry lands, it can replace the enum entirely without re-migrating this code.

B. Restructured External. Extend the variant to External { name: String, socket: PathBuf }. More visible signal of the multi-driver direction at the type level, but touches FromStr / Display / their tests.

Replacing ComputeDriverKind itself with a name-keyed registry (your second paragraph's "select a driver by name instead of a kind enum") reads to me as the actual #1589 implementation. I can work on that as a follow-up once the RFC has a concrete shape, but that's a much bigger refactor than this PR.

I'd lean (A) because it doesn't pre-commit the data model to a shape #1589 might want to rewrite, and it costs ~30 LOC. Either is fine, though. Let me know which you'd prefer to see here.

@elezar

elezar commented Jun 3, 2026

Copy link
Copy Markdown
Member

As a quick follow-up: What would it look like if we first migrate to implementing the vm driver as a specific-instance of a NamedDriver{ name: String, socket: PathBuf } where name == "vm" and then generalize from here?

Also note, that although #1589 is mentioned here as a motivator for restructuring how drivers are managed from a gateway perspective, moving to a name-keyed registry for managing drivers is not specifically related to the driver-specific config.

@st-gr

st-gr commented Jun 3, 2026

Copy link
Copy Markdown
Author

Right, separating from #1589 makes sense, the named-driver registry is its own concern.

Confirmed against the #1051 OpenShell Drivers roadmap: Its stated intent of "the core OpenShell team maintains a small deployment footprint and third parties may develop their own drivers" is exactly the direction your NamedDriver proposal serves. So aligning this PR there feels right; I won't propose alternatives that pull Kyma (or any other operator-specific driver) in-tree.

Here's what Phase A could look like: A pure refactor that pulls today's Vm arm into a generalized shape, no behavior change for existing --drivers vm operators:

    pub struct NamedDriver {
        pub name: String,
        pub socket: PathBuf,
        /// Set when the gateway owns the driver's lifecycle (current
        /// `vm` behavior - `compute::vm::spawn` returning a
        /// `ManagedDriverProcess`). When `None`, the operator runs the
        /// driver out-of-process and supplies the socket path.
        pub spawn: Option<DriverSpawnConfig>,
    }

    pub enum ComputeDriverKind {
        Kubernetes,
        Docker,
        Podman,
        Named(NamedDriver),
    }

Maps cleanly to the existing wiring at crates/openshell-server/src/compute/mod.rs:

  • --drivers vm translates to Named(NamedDriver { name: "vm", socket: <generated under XDG_STATE>, spawn: Some(default_vm_spawn()) }). The CLI front-end keeps the vm alias for backward compat.
  • ManagedDriverProcess (lines 100-128) stays as-is - it's already the lifecycle holder for spawned drivers; it just gets reached via the Named arm instead of being conditional on ComputeDriverKind::Vm.
  • new_remote_vm (line 388) becomes new_remote_named - same signature, same Option<Arc<ManagedDriverProcess>> field.
  • GetCapabilities.driver_name validation lands here: gateway compares Named.name against the proto response and errors at startup on mismatch.

Then Phase B (this PR) becomes additive:

  • --compute-driver=<name>@<path> (or two flags) → Named(NamedDriver { name, socket, spawn: None }).
  • All the dispatch, lifecycle-watcher, and capability-check logic is shared with the VM path.

Ordering: I'd implement Phase A first as a precursor PR (pure refactor, no behavior change since --drivers vm still works through the alias), then rebase this PR on top. Each piece reviews more cleanly on its own. Happy to bundle into one PR if you'd rather see the full motivation in
a single diff.

Before I start coding, anything about the NamedDriver shape you'd want different? Specifically:

  • The spawn: Option<DriverSpawnConfig> distinction (gateway-managed vs operator-managed) - does that match how you'd want the boundary drawn?
  • name: String vs name: &'static str for the in-tree vm case - string keeps the field uniform with operator-supplied names; static would mean a separate type.
  • Whether Kubernetes / Docker / Podman should follow into Named(...) in this same precursor PR or stay as enum arms (they don't speak compute_driver.proto over UDS - they're in-tree adapters - so probably out of scope).

Once we converge on the shape, Phase A is straightforward.

@elezar

elezar commented Jun 4, 2026

Copy link
Copy Markdown
Member

Thanks, this is the right direction, but I'd separate the pieces slightly differently.

For Phase A, I don't think spawn should be part of the named driver definition. A named endpoint should describe how the gateway consumes a remote driver: { name, socket }. Whether that socket came from a gateway-spawned VM driver or from an operator-managed external process is a construction/lifecycle concern.

For the VM refactor in Phase A, the shape I'd like to see is:

  1. Resolve --drivers vm.
  2. Run the VM-specific construction path: read VM config, prepare the socket path, spawn openshell-driver-vm, and wait for readiness.
  3. Treat the result as an acquired named endpoint plus a lifecycle guard.
  4. From that point on, use the generic endpoint path: connect to the socket, wrap the channel in RemoteComputeDriver, and pass it into ComputeRuntime as a normal SharedComputeDriver.

Conceptually:

VM config -> spawn VM driver -> acquired named endpoint
acquired named endpoint -> RemoteComputeDriver -> ComputeRuntime

For Phase B, an external driver should enter at the same acquired-endpoint boundary:

external config -> existing socket -> acquired named endpoint
acquired named endpoint -> RemoteComputeDriver -> ComputeRuntime

That keeps driver construction separate from driver consumption. The VM lifecycle guard can own process/socket cleanup; an external driver can use a no-op lifecycle guard. ComputeRuntime should not need to care which acquisition path produced the driver.

I'd also prefer to avoid a special External enum variant if possible. The <name>@<socket> shape you proposed seems like a good fit here, especially if we expect to support multiple configured drivers later. With that said, I'd like @drew or @johntmyers to weigh in here in terms of the UX of this. We may not want to overload the existing flag, but then we need to consider how we want to handle multiple drivers and their mapping to endpoints.

Reserved names like vm, docker, podman, and kubernetes can keep their current meaning when supplied bare. Endpoint-based drivers can use the explicit endpoint form:

--drivers vm
--drivers vm@/path/to/compute-driver.sock
--drivers kyma@/path/to/compute-driver.sock

In that model, bare vm keeps the managed VM default path. vm@<socket> uses the VM construction path but overrides the managed socket path (if we wanted to allow this). A non-reserved name such as kyma@<socket> is an unmanaged out-of-tree endpoint-based driver. A non-reserved bare name should error until there is some other way to resolve its endpoint.

For the initial parser, I'd keep the rules narrow: parse the current comma-delimited --drivers entries independently, split endpoint entries on the first @, and reject empty names or empty socket paths. docker@..., podman@..., and kubernetes@... can remain reserved/errors for now unless we explicitly add endpoint-backed versions of those drivers.

One follow-up to call out is the gateway TOML mapping. Today [openshell.gateway].compute_drivers is the structured driver selection list, while [openshell.drivers.<name>] is intentionally driver-owned/raw TOML. If we accept <name>@<socket> on the CLI, we should decide whether the TOML uses the same string form:

[openshell.gateway]
compute_drivers = ["kyma@/run/openshell/kyma.sock"]

or a more structured form under the driver table:

[openshell.gateway]
compute_drivers = ["kyma"]

[openshell.drivers.kyma]
endpoint = "/run/openshell/kyma.sock"

I don't think that has to block Phase A, but it should be discussed and clarified before Phase B lands so CLI/env/TOML do not grow incompatible shapes.

Name matching via GetCapabilities.driver_name can be deferred as a later hardening step. The initial refactor only needs to establish the construction/consumption split and the common endpoint-based consumption path.

@st-gr

st-gr commented Jun 5, 2026

Copy link
Copy Markdown
Author

Thanks @elezar — agree the named-driver-endpoint framing is the right shape, and lines up with #1589's name-keyed driver_config. Concrete proposal for this PR:

  • ComputeDriverKind::External(PathBuf)ComputeDriverKind::External { name: String, socket: PathBuf }. Other variants stay bare.
  • CLI: add --compute-driver-name=<name> (env OPENSHELL_COMPUTE_DRIVER_NAME) paired with the existing --compute-driver-socket. Both required when either is set; the pair pins External { name, socket } and skips --drivers and auto-detect.
  • On startup, after the tonic channel connects, the gateway calls GetCapabilities and fails fast if response.driver_name != name.

One scope question before I push: should I generalise the dispatch beyond External in this PR — e.g. fold Vm into the same name-keyed path so the registry is uniform — or keep that for the #1589 follow-up and leave the four in-tree variants as-is for now? My read is the latter (this PR stays narrow; #1589 / a successor RFC handles the registry), but happy to expand if you'd prefer it land in one go.

@drew

drew commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator

I'd also prefer to avoid a special External enum variant if possible. The <name>@<socket> shape you proposed seems like a good fit here, especially if we expect to support multiple configured drivers later.

+1 to avoiding the External enum variant. I like the <name>@<socket> pattern, though I do worry a bit about overloading the string.

We will eventually want to support multiple drivers, and possibly different versions of the same driver.

That said, can we add the <name>@<socket> convention in a followup PR without breaking contracts? Would be good to see how external drivers gets used before including including additional overloads to the path. In the meantime we can rely on GetCapabilities.driver_name and assume it will be unique.

@drew

drew commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator

@st-gr

One scope question before I push: should I generalise the dispatch beyond External in this PR — e.g. fold Vm into the same name-keyed path so the registry is uniform — or keep that for the #1589 follow-up and leave the four in-tree variants as-is for now? My read is the latter (this PR stays narrow; #1589 / a successor RFC handles the registry), but happy to expand if you'd prefer it land in one go.

+1 to your read. Let's leave the four in-tree for now.

@st-gr

st-gr commented Jun 5, 2026

Copy link
Copy Markdown
Author

@elezar @drew — reading the convergence:

This PR:

  • Extract the construction/consumption split as @elezar described: VM and external both produce (channel, advertised_name) via dedicated acquisition paths; consumption (RemoteComputeDriverComputeRuntime::from_driver) is shared. VM construction keeps its ManagedDriverProcess lifecycle guard on its side of the boundary — endpoint type stays slim, no spawn field. UX for --drivers vm unchanged.
  • Drop ComputeDriverKind::External(PathBuf). Restore upstream Copy + const fn as_str. External driver is signalled by a separate Config.compute_driver_socket: Option<PathBuf> that takes precedence over compute_drivers in build_compute_runtime.
  • Keep --compute-driver-socket=<path> (env OPENSHELL_COMPUTE_DRIVER_SOCKET) as the single external flag — no <name>@<socket>, no user-supplied name. Gateway calls GetCapabilities once after connect and logs the advertised driver_name (no validation, per @drew).
  • Bare --drivers vm/docker/podman/kubernetes unchanged.

Deferred to follow-up(s): <name>@<socket> syntax + reserved-name discipline, multi-driver / multi-version, TOML shape choice, GetCapabilities.driver_name matching against a supplied name.

Branch will be rewritten as: (1) refactor extracting the boundary, (2) wire --compute-driver-socket through it, (3) drop External + revert the enum changes, (4) refresh the arch doc. Will start coding and force-push a new revision; flag if anything's misread.

@st-gr st-gr force-pushed the feat/external-compute-driver-socket branch from 1c6663d to be90905 Compare June 5, 2026 11:12
@st-gr

st-gr commented Jun 5, 2026

Copy link
Copy Markdown
Author

@elezar @drew — pushed the rewrite. PR body has been refreshed end-to-end to match the new shape. Summary of what changed since your last comments:

  • No External enum variant. ComputeDriverKind is unchanged. Out-of-tree dispatch now lives on Config.external_compute_driver_socket: Option<PathBuf> and goes through a separate construction path that produces a ComputeRuntime with driver_kind: None.
  • Construction / consumption split. ComputeRuntime::from_driver now takes Option<ComputeDriverKind>; the four in-tree constructors wrap their kind in Some(...), the new new_remote_external passes None. Both paths share the existing RemoteComputeDriver proxy for proto consumption.
  • (name, socket) tuple deferred per @drew's guidance — flag stays a single path, GetCapabilities.driver_name is logged for diagnostics only (no validation against operator input). Name-keying / <name>@<socket> is a follow-up that aligns with rfc-0006: add driver config passthrough proposal #1589.
  • Workspace verification: cargo fmt --check, cargo clippy --workspace --all-targets -D warnings, and cargo test --workspace --lib --bins all clean (incl. all 769 openshell-server lib tests).

Branch is force-pushed; commits are squash-friendly (4 small, named per conventional-commits). Happy to revisit any cut point if you'd prefer a different shape.

@st-gr

st-gr commented Jun 5, 2026

Copy link
Copy Markdown
Author

I have read the DCO document and I hereby sign the DCO.

@st-gr

st-gr commented Jun 5, 2026

Copy link
Copy Markdown
Author

recheck

@st-gr st-gr force-pushed the feat/external-compute-driver-socket branch from be90905 to b1ed362 Compare June 5, 2026 18:17
st-gr added a commit to st-gr/OpenShell that referenced this pull request Jun 6, 2026
Upstream merge changed `ComputeRuntime::from_driver` to take
`driver_kind: ComputeDriverKind` by value, but the fork's external-driver
constructor has no kind to supply, and `External(PathBuf)` makes the enum
non-Copy — multiple `move out of self.driver_kind` errors result.

Adopt the shape PR NVIDIA#1703's branch already shipped:
- `from_driver` accepts `Option<ComputeDriverKind>`
- four in-tree constructors pass `Some(kind)`, external passes `None`
- clone `self.driver_kind` and the local `driver` at the three sites that
  previously moved (struct field is now non-Copy)

Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
@st-gr st-gr force-pushed the feat/external-compute-driver-socket branch from b1ed362 to f0d5b27 Compare June 9, 2026 08:14
@st-gr

st-gr commented Jun 9, 2026

Copy link
Copy Markdown
Author

@drew @elezar — pinging on this one. Pushed the rewrite Jun 5 implementing the construction/consumption split @elezar described and dropping the External enum variant per the convergence in this thread. @drew — your "+1 to your read. Let's leave the four in-tree for now" was the last word; happy to address anything I missed before re-review.

Just rebased onto main to clear the conflict that landed with #1744 (driver config passthrough). Conflict was a single mechanical hunk in compute/mod.rsdriver_sandbox_from_public got a new Option<ComputeDriverKind> parameter from #1744; my connect_external_compute_driver helper sits adjacent. Both coexist cleanly. No semantic change vs the Jun 5 head.

Verified clean: cargo fmt --check, cargo clippy -p openshell-core -p openshell-server --all-targets -- -D warnings, and the 3 new cli::tests::compute_driver_socket_* tests.

New head: f0d5b277. Could I get a /ok to test f0d5b277 and a re-review when you have a moment?

@elezar elezar self-requested a review June 9, 2026 21:39
@elezar elezar dismissed their stale review June 9, 2026 21:40

need to re-review.

@pimlock

pimlock commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

/ok to test f0d5b27

@st-gr st-gr force-pushed the feat/external-compute-driver-socket branch from f0d5b27 to 0925729 Compare June 13, 2026 21:35
@st-gr

st-gr commented Jun 13, 2026

Copy link
Copy Markdown
Author

Rebased onto current upstream/main and resolved the merge conflicts that had the PR sitting at mergeStateStatus: DIRTY.

Conflicts resolved:

  • crates/openshell-server/src/cli.rs — interleaved upstream's new .with_grpc_rate_limit(...) chain element with this PR's .with_external_compute_driver_socket(...).
  • crates/openshell-server/src/compute/mod.rs (twice) — merged the use-block, keeping upstream's tokio::sync::{Mutex, watch} and tracing::{debug, info, warn} (the new watch + debug imports landed in main) alongside this PR's tokio::net::UnixStream, tonic::transport::Endpoint, and tower::service_fn.

No conflicts in crates/openshell-server/src/lib.rs once cli.rs and compute/mod.rs landed — the auto-merge held there.

Verified inside rust:1.95-slim:

  • cargo fmt --all -- --check
  • cargo check --workspace --tests
  • cargo test -p openshell-core: 175/175 ✓
  • cargo test -p openshell-server --lib: 817/817 ✓

4 commits on top of upstream/main, new HEAD 0925729b. mergeStateStatus flipped to BLOCKED (mergeable, review-only). @elezar's earlier review was dismissed automatically when the PR branch was switched to the field-based design — ready for re-review and /ok to test 0925729b when convenient.

Comment thread crates/openshell-server/src/lib.rs Outdated
tracing_log_bus: TracingLogBus,
supervisor_sessions: Arc<supervisor_session::SupervisorSessionRegistry>,
) -> Result<ComputeRuntime> {
if let Some(socket_path) = config.external_compute_driver_socket.as_deref() {

@elezar elezar Jun 15, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we discussed that the current in-tree VM driver is an example of a "remote socket driver", I don't think that we should introduce a new conditional branch for this special case.

I am prepping a PR from a fork that would propose additional changes. See st-gr#1

Comment thread architecture/compute-runtimes.md Outdated
| Podman | Rootless or single-machine deployments. | Container plus nested sandbox namespace. | Uses the Podman REST API, OCI image volumes, and CDI GPU devices when available. |
| Kubernetes | Cluster deployment through Helm. | Pod plus nested sandbox namespace. | Uses Kubernetes API objects, service accounts, secrets, PVC-backed workspace storage, and GPU resources. |
| VM | Experimental microVM isolation. | Per-sandbox libkrun VM. | Gateway spawns `openshell-driver-vm` as a subprocess over a private, state-local Unix socket. The VM driver boots a cached bootstrap `rootfs.ext4`, prepares requested OCI images inside a bootstrap VM with `umoci`, attaches the prepared image disk read-only, and gives each sandbox a writable `overlay.ext4` for merged-root changes and runtime material. The driver persists each accepted launch request beside the overlay and restarts those VMs on driver startup without recreating the overlay. |
| External | Out-of-tree drivers operated alongside the gateway. | Whatever boundary the driver implements. | Activated by `--compute-driver-socket=<path>` (env `OPENSHELL_COMPUTE_DRIVER_SOCKET`). The gateway connects to a UDS the operator already provisioned, runs `GetCapabilities`, logs the advertised `driver_name`, and dispatches all sandbox lifecycle calls through the same `compute_driver.proto` surface as the in-tree drivers. The driver process and socket lifecycle are operator-owned; the gateway does not spawn, supervise, or remove the driver. The trust boundary is the socket's filesystem permissions — the operator must ensure only the gateway uid can read/write it. |

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In working through a local review, I think using "Remote" instead of "External" is clearer going forward. This also makes it less confusing if we implement the VM driver as a specific in-tree instance of a remote driver.

@st-gr

st-gr commented Jun 16, 2026

Copy link
Copy Markdown
Author

@elezar — adopted your refactor PR (st-gr/OpenShell#1) on top of the rebased PR #1703 branch. Branch is now 0925729b + 2979bf9c; new HEAD 2979bf9c carries your feat(server): model remote compute driver endpoints commit verbatim, with authorship preserved.

This addresses both inline comments:

  • lib.rs:712 — VM and out-of-tree remote drivers now route through the same acquired-endpoint path (compute::vm::spawnEndpointRemoteComputeDriverComputeRuntime). The previous External conditional branch is gone.
  • architecture/compute-runtimes.md:37 — renamed "External" → "Remote" throughout.

Verified inside rust:1.95-slim:

  • cargo fmt --all -- --check
  • cargo check --workspace --tests
  • cargo clippy --workspace --all-targets -- -D warnings
  • cargo test -p openshell-core: 176/176 ✅
  • cargo test -p openshell-server: 822/822 ✅
  • No conflict against current upstream/main (git merge-tree exits clean).

The branch now sits at 5 commits ahead of upstream/main. Ready for re-review and /ok to test 2979bf9c when convenient.

st-gr and others added 2 commits June 17, 2026 14:27
Add named remote compute driver endpoint support to the gateway.

Remote drivers are selected by a non-reserved compute driver name and either a CLI/env socket endpoint or [openshell.drivers.<name>].socket_path. The VM driver now enters ComputeRuntime through the same acquired remote endpoint path, while Docker, Podman, and Kubernetes retain their in-process drivers.

Require --drivers/OPENSHELL_DRIVERS when pairing an ad-hoc socket endpoint so the socket does not imply a magic driver name, and keep reserved in-tree names unavailable for unmanaged socket endpoints.

Co-authored-by: Evan Lezar <elezar@nvidia.com>

Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the feat/external-compute-driver-socket branch from 2979bf9 to 02d9b31 Compare June 17, 2026 13:09
@elezar

elezar commented Jun 17, 2026

Copy link
Copy Markdown
Member

Hi @st-gr. I spent some time on this again today. I rebased the changes onto the current main as a single commit (keeping shared authorship) and also added an initial in-tree test case for remote drivers. There are some follow-ups that I would like to see, but I don't think that they block this PR further. Most notably is #1950 which discusses adding end-to-end tests for this driver path, but probably requires more work in general.

There is also #1949 which should further cleanup driver aquisition / configuration, but once again is not a blocker.

If you could verify the current changes against your remote driver, we could look at getting this integrated. We would also need to update the PR description. Did you want to do that, or should I?

@elezar

elezar commented Jun 17, 2026

Copy link
Copy Markdown
Member

/ok-to-test 02d9b31

@st-gr

st-gr commented Jun 17, 2026

Copy link
Copy Markdown
Author

@elezar — verified against my Kyma compute driver. Both the cold-start dispatch and live sandbox traffic work cleanly with 02d9b319.

Build (cargo zigbuild --target x86_64-unknown-linux-gnu.2.31 --features bundled-z3 -p openshell-server --bin openshell-gateway):

  • cargo fmt --check
  • cargo clippy --workspace --all-targets -- -D warnings
  • cargo test -p openshell-core -p openshell-server --lib → 1,083 tests, all green (includes your new test_support.rs UDS lifecycle coverage).
  • Image: ghcr.io/st-gr/openshell-gateway@sha256:343630cda459bcd27507c035c521032e277c36ea2d544d29ec0b14178d30bbaa.

Deploy (Helm upgrade on Gardener-managed Kyma cluster, 4 amd64 workers; the existing ods release rolled gateway.image.tag to the new digest with --reuse-values):

Required one chart-side change in openshell-driver-kyma: pass --drivers kyma alongside the existing --compute-driver-socket, since the refactored gateway requires the explicit driver name (reserved built-ins reject sockets; kyma is non-reserved). One-line addition in the deployment template.

Gateway log on first boot:

INFO openshell_server: Using compute driver driver=kyma
INFO openshell_server: Using remote compute driver endpoint driver=kyma socket=/var/run/openshell-driver.sock
INFO openshell_server::compute: Compute driver connected configured_driver=kyma advertised_driver=kyma in_tree=false

That's the exact shape your refactor advertises — configured/advertised name comparison via GetCapabilities, in_tree=false correctly identifying Kyma as out-of-tree. The pre-existing persistent sandbox kept working through the rollout: GetSandboxConfig, ConnectSupervisor, GetSandboxProviderEnvironment, GetInferenceBundle all returning 200 with sub-5 ms latency. Supervisor session accepted, sandbox-side claude-tui resumed without restart.

Follow-ups in this fork's chart (openshell-driver-kyma) — I'll land these after #1703 merges so the chart and upstream stay in sync:

I'm happy for you to do the PR description update — your wording on the construction/consumption split and the named-endpoint framing is the canonical version, and you've been driving the refactor. From my side, this is ready to merge once you've updated the description.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants