feat(cli): add GPU count requests#1812
Conversation
|
/ok to test abe5b79 |
|
/ok to test abe5b79 |
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
Landing #1815 first should simplify the changes here. |
abe5b79 to
06c69dd
Compare
|
Label |
PR Review StatusValidation: this is maintainer-authored, project-valid GPU CLI/API/runtime work that aligns GPU sandbox intent with structured resource requirements and the related resource-requirements RFC direction. Review findings:
Docs: Fern docs were updated under Next state: |
Re-check After CI UpdateI re-evaluated latest head Disposition: partially resolved. Remaining items:
Next state: |
Re-check After CI UpdateI re-evaluated latest head Disposition: partially resolved. Remaining items:
Next state: |
06c69dd to
87c9a8c
Compare
Re-check After Author UpdateI re-evaluated latest head Disposition: not resolved. Remaining items:
Next state: |
Re-check After CI UpdateI re-evaluated latest head Disposition: partially resolved. Remaining items:
Next state: |
The breaking proto change is intentional. However, I will defer to @drew on whether we should rather reserve the previous |
Re-check After Author UpdateI re-evaluated latest head Disposition: not resolved. Remaining items:
Next state: |
87c9a8c to
da6fbd8
Compare
Re-check After Author UpdateI re-evaluated latest head Disposition: not resolved. Remaining items:
Next state: |
Re-check After CI and Independent ReviewI re-evaluated latest head Disposition: not resolved. Remaining items:
Next state: |
|
Responding to the two blocking findings from gator:
Given that the API is still alpha, we do not want to carry legacy GPU-specific reserved or transitional fields forward into the proto shape we intend to stabilize. I will update RFC 0004 in this branch to reflect that decision, since the current RFC text still describes reserving the old GPU fields. |
Re-check After Author UpdateI re-evaluated latest head Disposition: partially resolved. Remaining items:
Next state: |
c7cf9d6 to
b681234
Compare
b681234 to
60a7ccd
Compare
Re-check After Author UpdateI re-evaluated latest head Disposition: not resolved. Remaining items:
Checks: the latest forced-push head is still early in CI; DCO and required status publication are pending at the time of this re-check. The review findings above must be addressed before gator can move this PR to pipeline watch. Next state: |
Re-check After CI UpdateI re-evaluated latest head Disposition: not resolved. Remaining items:
Next state: |
Re-check After CI and Independent ReviewI re-evaluated latest head Disposition: not resolved. Remaining items:
Independent review summary: no additional code-level blocker was found in this bounded pass beyond the items above. Next state: |
35c7ef2 to
8e108bb
Compare
8e108bb to
cad3745
Compare
Remove the Python GPU smoke test and its fixture. The e2e:k3s:gpu task only depended on e2e:python:gpu and did not have a separate k3s implementation, so remove that stale alias with the task it pointed at. Signed-off-by: Evan Lezar <elezar@nvidia.com>
BREAKING CHANGE: SandboxSpec.gpu and DriverSandboxSpec.gpu were replaced with resource_requirements.gpu, changing protobuf field 9 from a bool to a message for both public and driver APIs. Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
cad3745 to
82f88d8
Compare


Summary
Adds structured GPU resource requirements for sandbox creation and updates the
CLI/API/runtime path so
openshell sandbox create --gpu [COUNT]records GPUintent in
ResourceRequirements.gpu.This is an intentional alpha API break: the public and compute-driver sandbox
specs now carry
resource_requirements.gpuin place of the previous flat GPUfields. Existing live or persisted legacy GPU intent is not migrated; callers
should use a matching OpenShell CLI/API version and recreate GPU sandboxes when
they need the new typed shape. RFC 0004 is updated to document that decision.
Related Issue
Part of #1444. Related to #1338, #1156, #1360, and #1492. Follow-up GPU support
preflight semantics are tracked in #1807.
Changes
ResourceRequirements.gpu.countto the public and compute-driver protos.A present GPU requirement with omitted count means one GPU;
count = 0isrejected.
--gpufor one GPU and--gpu COUNTfor counted requests.
translation, provisioning timeout messages, and driver helper APIs.
nvidia.com/gpulimits from GPU requirements.driver_config: Docker andPodman use
cdi_devices, and VM usesgpu_device_ids.IDs are rejected, a single exact device works with default
--gpu, andmulti-device exact lists require
--gpu COUNTmatching the list length.selector refreshes CDI inventory before validation/create, picks from the
normalized NVIDIA CDI inventory in round-robin order, fails when count exceeds
selectable devices, and treats WSL2
nvidia.com/gpu=allfallback as oneselectable device.
e2e:k3s:gpu; GPU E2E coverage nowlives in the Rust Docker/Podman GPU device-selection suites.
Kubernetes driver READMEs.
Testing
mise run pre-commit/Users/elezar/.local/bin/mise exec -- cargo check -p openshell-core -p openshell-driver-docker -p openshell-driver-podman -p openshell-driver-vm -p openshell-driver-kubernetes/Users/elezar/.local/bin/mise exec -- cargo test -p openshell-core -p openshell-driver-docker -p openshell-driver-podman -p openshell-driver-vm -p openshell-driver-kubernetes gpu --lib/Users/elezar/.local/bin/mise exec -- cargo clippy -p openshell-core -p openshell-driver-docker -p openshell-driver-podman -p openshell-driver-vm -p openshell-driver-kubernetes --all-targets -- -D warningsThe PR has the
test:e2e-gpulabel so the required Docker GPU E2E gate runs inCI for the updated head.
Checklist