securitypolicy: Add enforcement point for blockdev mounts#2762
securitypolicy: Add enforcement point for blockdev mounts#2762micromaomao wants to merge 13 commits into
Conversation
[cherry-picked from 421b12249544a334e36df33dc4846673b2a88279] This set of changes fixes the [Metadata Desync with UVM State](https://msazure.visualstudio.com/One/_workitems/edit/33232631/) bug, by reverting the Rego policy state on mount and some types of unmount failures. For mounts, a minor cleanup code is added to ensure we close down the dm-crypt device if we fails to mount it. Aside from this, it is relatively straightforward - if we get a failure, the clean up functions will remove the directory, remove any dm-devices, and we will revert the Rego metadata. For unmounts, careful consideration needs to be taken, since if the directory has been unmounted successfully (or even partially successful?), and we get an error, we cannot just revert the policy state, as this may allow the host to use a broken / empty mount as one of the image layer. See 615c9a0bdf's commit message for more detailed thoughts. The solution I opted for is, for non-trivial unmount failure cases (i.e. not policy denial, not invalid mountpoint), if it fails, then we will block all further mount, unmount, container creation and deletion attempts. I think this is OK since we really do not expect unmounts to fail - this is especially the case for us since the only writable disk mount we have is the shared scratch disk, which we do not unmount at all unless we're about to kill the UVM. Testing ------- The "Rollback policy state on mount errors" commit message has some instruction for making a deliberately corrupted (but still passes the verifyinfo extraction) VHD that will cause a mount error. The other way we could make mount / unmount fail, and thus test this change, is by mounting a tmpfs or ro bind in relevant places: To make unmount fail: mkdir /run/gcs/c/.../rootfs/a && mount -t tmpfs none /run/gcs/c/.../rootfs/a or mkdir /run/gcs/mounts/scsi/m1/a && mount -t tmpfs none /run/gcs/mounts/scsi/m1/a To make mount fail: mount -o ro --bind /run/mounts/scsi /run/mounts/scsi or mount --bind -o ro /run/gcs/c /run/gcs/c Once failure is triggered, one can make them work again by `umount`ing the tmpfs or ro bind. What about other operations? ---------------------------- This PR covers mount and unmount of disks, overlays and 9p. Aside from setting `metadata.matches` as part of the narrowing scheme, and except for `metadata.started` to prevent re-using a container ID, Rego does not use persistent state for any other operations. Since it's not clear whether reverting the state would be semantically correct (we would need to carefully consider exactly what are the side effects of say, failing to execute a process, start a container, or send a signal, etc), and adding the revert to those operations does not currently affect much behaviour, I've opted not to apply the metadata revert to those for now. Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
…cation IsSNP() now can return an error, although this is not expected on LCOW. Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
…fidential containers [cherry-picked from f81b450894206a79fff4d63182ff034ba503ebdb] This PR contains 2 commits. The first one is the fix: **bridge: Force sequential message handling for confidential containers** This fixes a vulnerability (and reduces the surface for other similar potential vulnerabilities) in confidential containers where if the host sends a mount/unmount modification request concurrently with an ongoing CreateContainer request, the host could re-order or skip image layers for the container to be started. While this could be fixed by adding mutex lock/unlock around the individual modifyMappedVirtualDisk/modifyCombinedLayers/CreateContainer functions, we decided that in order to prevent any more of this class of issues, the UVM, when running in confidential mode, should just not allow concurrent requests (with exception for any actually long-running requests, which for now is just waitProcess). The second one adds a log entry for when the processing thread blocks. This will make it easier to debug should the gcs "hung" on a request. This PR is created on ADO targeting the conf branch as this security vulnerability is not public yet. This fix should be backported to main once deployed. Related work items: #33357501, #34327300 Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
[cherry-picked from 055ee5eb4a802cb407575fb6cc1e9b07069d3319] guest/network: Restrict hostname to valid characters Because we write this hostname to /etc/hosts, without proper validation the host can trick us into writing arbitrary data to /etc/hosts, which can, for example, redirect things like ip6-localhost (but likely not localhost itself) to an attacker-controlled IP address. We implement a check here that the host-provided DNS name in the OCI spec is valid. ACI actually restricts this to 5-63 characters of a-zA-Z0-9 and '-', where the first and last characters can not be '-'. This aligns with the Kubernetes restriction. c.f. IsValidDnsLabel in Compute-ACI. However, there is no consistent official agreement on what a valid hostname can contain. RFC 952 says that "Domain name" can be up to 24 characters of A-Z0-9 '.' and '-', RFC 1123 expands this to 255 characters, but RFC 1035 claims that domain names can in fact contain anything if quoted (as long as the length is within 255 characters), and this is confirmed again in RFC 2181. In practice we see names with underscopes, most commonly \_dmarc.example.com. curl allows 0-9a-zA-Z and -.\_|~ and any other codepoints from \u0001-\u001f and above \u007f: https://github.com/curl/curl/blob/master/lib/urlapi.c#L527-L545 With the above in mind, this commit allows up to 255 characters of a-zA-Z0-9 and '_', '-' and '.'. This change is applied to all LCOW use cases. This fix can be tested with the below code to bypass any host-side checks: +++ b/internal/hcsoci/hcsdoc_lcow.go @@ -52,6 +52,10 @@ func createLCOWSpec(ctx context.Context, coi *createOptionsInternal) (*specs.Spe spec.Linux.Seccomp = nil } + if spec.Annotations[annotations.KubernetesContainerType] == "sandbox" { + spec.Hostname = "invalid-hostname\n1.1.1.1 ip6-localhost ip6-loopback localhost" + } + return spec, nil } Output: time="2025-10-01T15:13:41Z" level=fatal msg="run pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: failed to create container f2209bb2960d5162fc9937d3362e1e2cf1724c56d1296ba2551ce510cb2bcd43: guest RPC failure: hostname \"invalid-hostname\\n1.1.1.1 ip6-localhost ip6-loopback localhost\" invalid: must match ^[a-zA-Z0-9_\\-\\.]{0,999}$: unknown" Related work items: #34370598 Closes: https://msazure.visualstudio.com/One/_workitems/edit/34370598 Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
…nts, and prevent unmounting or deleting in-use things [cherry-picked from d0334883cd43eecbb401a6ded3e0317179a3e54b] This set of changes adds some checks (when running with a confidential policy) to prevent the host from trying to clean up mounts, overlays, or the container states dir when the container is running (or when the overlay has not been unmounted yet). This is through enhancing the existing `hostMounts` utility, as well as adding a `terminated` flag to the Container struct. The correct order of operations should always be: - mount read-only layers and scratch (in any order, and individual containers (not the sandbox) might not have their own scratch) - mount the overlay - start the container - container terminates - unmount overlay - unmount read-only layers and scratch The starting up order is implied, and we now explicitly deny e.g. unmounting layer/scratch before unmounting overlay, or unmounting the overlay while container has not terminated. We also deny deleteContainerState requests when the container is running or when the overlay is mounted. Doing so when a container is running can result in unexpectedly deleting its files, which breaks it in unpredictable ways and is bad. Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
[cherry-picked from 1dd0b7ea0b0f91d3698f6008fb0bd5b0de777da6] Blocks mount option passing for 9p (which is accidental) and SCSI disks. - guest: Restrict plan9 share names to digits only on Confidential mode - hcsv2/uvm: Restrict SCSI mount options in confidential mode (The only one we allow is `ro`) Related work items: #34370380 Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
…n any LinuxDevices [cherry-picked from 9f69e49..72b338a] [forward-ported from original PR 13653011 (e631aa42c6039ab0d228b746b280c26809433f00)] See https://msazure.visualstudio.com/ContainerPlatform/_git/Microsoft.hcsshim/pullrequest/13653011 Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
…allow trailing newlines This makes it less likely that things break because a trailing newline is added by a code editor. Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
A "BlockDev" mount (as introduced by microsoft#2168) is a special type of MappedVirtualDisk mount request, in which the GCS just creates a symlink at the requested target path pointing at the underlying `/dev/sdX`, and can later be consumed by a container by bind-mounting that symlink in the OCI spec. C-ACI does not currently use this feature, so the framework rejects both `mount_blockdev` and `unmount_blockdev` by default. An allow all policy will still let it through, which is useful for testing. Assisted-by: GitHub Copilot:claude-opus-4.7 Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com> Depends-on: microsoft#2559 Depends-on: microsoft#2761
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR extends the security policy + GCS hardening model for confidential containers by adding new enforcement points (blockdev mount/unmount), introducing a revertable policy-metadata mechanism, and tightening runtime behavior around mounts/unmounts and request concurrency.
Changes:
- Add
mount_blockdev/unmount_blockdevenforcement points across API/framework rego, Go enforcer interfaces, and host-side mount flows. - Introduce “revertable sections” for policy metadata (Save/Restore) and use them to keep policy state consistent across mount/unmount failure paths.
- Add host mount tracking + “mountsBroken” kill-switch, sequential bridge mode for SNP, and hostname / Plan9 share-name validation.
Reviewed changes
Copilot reviewed 31 out of 31 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/securitypolicy/version_framework | Bump embedded framework version. |
| pkg/securitypolicy/version_api | Bump embedded API version. |
| pkg/securitypolicy/securitypolicyenforcer_rego.go | Add blockdev enforcement, device input plumbing, and revertable section implementation for rego enforcer. |
| pkg/securitypolicy/securitypolicyenforcer.go | Extend enforcer interface/options for blockdev and revertable sections; add no-op handles for open/closed door. |
| pkg/securitypolicy/securitypolicy.go | Make embedded version parsing robust to trailing whitespace/newlines. |
| pkg/securitypolicy/regopolicy_linux_test.go | Add regression/property tests for revertable sections, overlay retry behavior, devices unsupported, and blockdev policy. |
| pkg/securitypolicy/rego_utils_test.go | Refactor overlay mount helper; add revertable-section test utilities; factor out container-spec creation helper. |
| pkg/securitypolicy/policy.rego | Wire new framework rules for blockdev enforcement points. |
| pkg/securitypolicy/open_door.rego | Allow blockdev mount/unmount in open-door policy. |
| pkg/securitypolicy/framework.rego | Default-deny blockdev; add devices_ok and reject linux devices; add related error strings. |
| pkg/securitypolicy/api.rego | Register new enforcement points and introduced versions for blockdev. |
| internal/regopolicyinterpreter/regopolicyinterpreter_test.go | Add tests for copying rego metadata and Save/Restore behavior. |
| internal/regopolicyinterpreter/regopolicyinterpreter.go | Add SavedMetadata, deep-copy support, Save/Restore APIs, and constants for metadata keys. |
| internal/guest/storage/scsi/scsi_test.go | Extend test to validate dm-crypt cleanup on mount failure. |
| internal/guest/storage/scsi/scsi.go | Ensure dm-crypt device is cleaned up on mount failure. |
| internal/guest/storage/plan9/plan9.go | Add Plan9 share-name validation helper. |
| internal/guest/storage/overlay/overlay.go | Update MountLayer comment to match behavior. |
| internal/guest/storage/mount.go | Log warning when unmount is requested for a non-existent path. |
| internal/guest/runtime/hcsv2/uvm_state_test.go | Update tests for new hostMounts API + add coverage for RO devices and overlay usage tracking. |
| internal/guest/runtime/hcsv2/uvm_state.go | Rework hostMounts tracking to include RO devices + overlays and usage counts; add explicit Lock/Unlock contract. |
| internal/guest/runtime/hcsv2/uvm.go | Add revertable policy sections around mount ops; introduce mountsBroken; add LinuxDevices to policy input; add overlay-in-use checks; centralize delete container state. |
| internal/guest/runtime/hcsv2/standalone_container.go | Validate hostname before writing to /etc/hosts. |
| internal/guest/runtime/hcsv2/sandbox_container.go | Validate hostname before writing to /etc/hosts. |
| internal/guest/runtime/hcsv2/process.go | Mark container terminated when init process exits. |
| internal/guest/runtime/hcsv2/container.go | Track container init termination state. |
| internal/guest/network/network_test.go | Add unit tests for hostname validation. |
| internal/guest/network/network.go | Add hostname validation (regex-based) to prevent injection via /etc/hosts generation. |
| internal/guest/bridge/bridge_v2.go | Route delete-container-state through Host API to centralize policy checks. |
| internal/guest/bridge/bridge.go | Add optional sequential request processing mode (with async exceptions) plus blockage warning. |
| internal/gcs/unrecoverable_error.go | Introduce UnrecoverableError handler (panic on non-SNP; infinite loop on SNP). |
| cmd/gcs/main.go | Enable sequential bridge processing automatically on SNP (confidential). |
Comments suppressed due to low confidence (1)
internal/guest/storage/mount.go:134
- Logging a warning for every unmount of a non-existent path can be noisy in normal flows where “remove if present” semantics are expected (the function already treats this as success). Consider lowering this to Debug/Trace, or gating it behind conditions that indicate a real anomaly, to avoid log spam in steady state.
if _, err := osStat(target); err != nil {
if os.IsNotExist(err) {
log.G(ctx).WithField("target", target).Warnf("UnmountPath called for non-existent path")
return nil
}
return errors.Wrapf(err, "failed to determine if path '%s' exists", target)
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| func appendDeviceData(deviceData []interface{}, devices []oci.LinuxDevice) []interface{} { | ||
| for _, device := range devices { | ||
| deviceData = append(deviceData, inputData{ | ||
| "path": device.Path, | ||
| "type": device.Type, | ||
| "major": device.Major, | ||
| "minor": device.Minor, | ||
| "fileMode": device.FileMode, | ||
| "uid": device.UID, | ||
| "gid": device.GID, | ||
| }) | ||
| } | ||
|
|
||
| return deviceData | ||
| } |
| // Represents an in-progress revertable section. To ensure state is consistent, | ||
| // Commit() and Rollback() must not fail, so they do not return anything, and if | ||
| // an error does occur they should panic. | ||
| type RevertableSectionHandle interface { | ||
| Commit() | ||
| Rollback() | ||
| } |
| func (policy *regoEnforcer) StartRevertableSection() (RevertableSectionHandle, error) { | ||
| policy.revertableSectionLock.Lock() | ||
| var err error | ||
| policy.savedMetadata, err = policy.rego.SaveMetadata() | ||
| if err != nil { | ||
| err = errors.Wrapf(err, "unable to save metadata for revertable section") | ||
| policy.revertableSectionLock.Unlock() | ||
| return &revertableSectionHandle{}, err | ||
| } |
| if device.ty != ty { | ||
| return fmt.Errorf("wrong device type %s, expected %s", ty, device.ty) | ||
| } |
| for name, origObject := range orig { | ||
| if copyObject, ok := copy[name]; ok { | ||
| if !assertObjectsEqual(origObject, copyObject) { | ||
| t.Errorf("original and copy differ on key %s", name) | ||
| } | ||
| } else { | ||
| t.Errorf("copy missing object %s", name) | ||
| } | ||
| } | ||
|
|
||
| return true | ||
| } |
| } | ||
| } | ||
|
|
||
| func Test_Regi_EnforceCreateContainer_RequireNoDevices(t *testing.T) { |
| return errors.Errorf( | ||
| "Mount, unmount, container creation and deletion has been disabled in this UVM due to a previous error (%q)", | ||
| h.mountsBrokenCausedBy, | ||
| ) |
| b := bridge.New(mux, *v4) | ||
| // For confidential containers, we protect ourselves against attacks caused | ||
| // by concurrent modifications, by processing one request at a time. | ||
| forceSequential, err := amdsevsnp.IsSNP() |
There was a problem hiding this comment.
If we are non SNP but enforcing policy we should also force sequential to keep the dev and real environments as similar as possible.
There was a problem hiding this comment.
Unfortunately at this point if we're not in SNP, we do not yet know if we will have a policy or not.
There was a problem hiding this comment.
We could still set this value after getting the policy tho, will check
A "BlockDev" mount (as introduced by
#2168) is a special type of
MappedVirtualDisk mount request, in which the GCS just creates a symlink at the
requested target path pointing at the underlying
/dev/sdX, and can later beconsumed by a container by bind-mounting that symlink in the OCI spec.
C-ACI does not currently use this feature, so the framework rejects both
mount_blockdevandunmount_blockdevby default. An allow all policy willstill let it through, which is useful for testing.
Assisted-by: GitHub Copilot:claude-opus-4.7
Signed-off-by: Tingmao Wang tingmaowang@microsoft.com
Depends-on: #2559
Depends-on: #2761