feat(toolkit-lib): surface CloudTrail control-plane errors for failed custom resources#1676
iankhou wants to merge 3 commits into
Validated end-to-end
The CloudWatch logs alone ("Setup failed due to an internal error.") are not diagnosable and require additional investigation; the CloudTrail line names the exact denied permission. This confirms that the function-scoped (Username) lookup, the rolled-back-stack diagnosis fix, and attribution all work in a real account.
Codecov Report
✅ All modified and coverable lines are covered by tests.

Coverage diff (main vs. #1676):

| Metric | main | #1676 | +/- |
|---|---|---|---|
| Coverage | 89.29% | 89.31% | +0.01% |
| Files | 77 | 77 | |
| Lines | 11716 | 11716 | |
| Branches | 1620 | 1619 | -1 |
| Hits | 10462 | 10464 | +2 |
| Misses | 1223 | 1221 | -2 |
| Partials | 31 | 31 | |
Pull request overview
This PR enhances @aws-cdk/toolkit-lib’s post-failure diagnosis for custom resources by optionally consulting CloudTrail to surface underlying control-plane errors (e.g., AccessDenied) that typically don’t show up clearly in the custom resource Lambda logs. It also fixes a pre-existing diagnosis gap for rolled-back stacks by expanding the stack event poll range to cover both the failed operation and its rollback.
Changes:
- Add a CloudTrail lookup (scoped to the custom resource’s backing Lambda “Username”) to surface errored API calls during `cdk diagnose`, while keeping `cdk deploy` on a hint-only path.
- Fix rolled-back stack diagnosis by introducing `PollRange.mostRecentDeploymentAttempt()` and using it in fresh diagnosis.
- Add/update tests and SDK mocking to cover CloudTrail behavior, pagination, and the rolled-back regression.
Reviewed changes
Copilot reviewed 12 out of 15 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| yarn.lock | Adds the CloudTrail AWS SDK v3 client (and transitive updates) to the workspace lockfile. |
| packages/aws-cdk/THIRD_PARTY_LICENSES | Updates bundled third-party attributions to include the newly bundled CloudTrail client and related deps. |
| packages/@aws-cdk/toolkit-lib/test/api/diagnosing/stack-diagnoser.test.ts | Adds regression test ensuring diagnosis finds failures across rollback + failed create operations. |
| packages/@aws-cdk/toolkit-lib/test/api/diagnosing/resource-investigation.test.ts | Adds tests for CloudTrail integration (gating, scoping, pagination, and “no errors” behavior). |
| packages/@aws-cdk/toolkit-lib/test/_helpers/mock-sdk.ts | Adds CloudTrail client mocking support for unit tests. |
| packages/@aws-cdk/toolkit-lib/package.json | Declares @aws-sdk/client-cloudtrail as a runtime dependency of toolkit-lib. |
| packages/@aws-cdk/toolkit-lib/lib/api/stack-events/stack-event-poller.ts | Introduces PollRange.mostRecentDeploymentAttempt() to span rollback + triggering operation. |
| packages/@aws-cdk/toolkit-lib/lib/api/diagnosing/stack-diagnoser.ts | Threads cloudTrailEnabled and uses the new poll range for diagnose-from-fresh. |
| packages/@aws-cdk/toolkit-lib/lib/api/diagnosing/resource-investigation.ts | Extends the investigation dispatcher to pass options through to custom-resource investigation. |
| packages/@aws-cdk/toolkit-lib/lib/api/diagnosing/investigate-ecs-service.ts | Clarifies shared vs resource-specific investigation options in comments. |
| packages/@aws-cdk/toolkit-lib/lib/api/diagnosing/investigate-custom-resource.ts | Implements CloudTrail lookup + deploy-path hint for custom resource investigation. |
| packages/@aws-cdk/toolkit-lib/lib/api/aws-auth/sdk.ts | Adds ICloudTrailClient and SDK.cloudTrail() wrapper. |
| packages/@aws-cdk/toolkit-lib/.projen/deps.json | Records the new runtime dependency for projen-managed dependency tracking. |
| .projenrc.ts | Adds the CloudTrail SDK dependency to the toolkit-lib project configuration. |
Mostly true, but not exactly: CloudTrail records control plane events, not data plane events. Usually that means: low-traffic configuration events are recorded, high-traffic events are not. So 2 counterexamples to your statement above:
The point still stands that not everything will be recorded, but you need to know what will and won't be, and for what reasons, to give accurate guidance to users. Speaking of, can you make sure that you understand the Pull Request and write the PR body yourself, rather than extrude some text via AI? I would like some proof of understanding. Thank you.
Will do, rewritten. Missed the errant caveats at the end, and thanks for the catch. Removed that section because it's evident from the main section of the PR body anyway. I'll outsource less for the PR descriptions.
Total lines changed 7257 is greater than 1000. Please consider breaking this PR down.
It's from THIRD_PARTY_LICENSES, which should be ignored. See #1680 for fix.
… custom resources

When a custom resource fails because its backing Lambda was denied an AWS API call (a control-plane failure), the function's own logs often don't show it, but CloudTrail records the AccessDenied. This adds a CloudTrail lookup to the custom-resource investigation.

CloudTrail events are delivered with several minutes of latency, so they aren't available when a `cdk deploy` fails. Since deploy and diagnose share the diagnosis code, this is gated on a cloudTrailEnabled flag:

- diagnoseFromFresh (`cdk diagnose`, run after the fact) enables it and looks up CloudTrail, scoped to the function server-side via the Username attribute (the Lambda execution role's session name is the function name), paginated and bounded to a window around the failure. Errored events are surfaced (e.g. "AccessDenied on s3:CreateBucket").
- diagnoseFromErrorCollection (`cdk deploy`) leaves it off and adds a hint to re-run `cdk diagnose` in a few minutes.

Also fixes diagnoseFromFresh reporting "no issues found" on any rolled-back stack: a rolled-back deployment spans two CloudFormation operations (the failed create and the rollback); the poll range only covered the most recent (the rollback's successful deletes), excluding the real CREATE_FAILED. Adds PollRange.mostRecentDeploymentAttempt() spanning both.

- Add @aws-sdk/client-cloudtrail and an ICloudTrailClient.lookupEvents wrapper.
- Thread cloudTrailEnabled through DiagnoseOptions/InvestigateOptions.

Validated end-to-end against a denied s3:CreateBucket. Known limitation: CloudTrail LookupEvents records write/mutating denials, not most read calls. Tests cover both paths, Username scoping, pagination, and the rolled-back regression; new behaviors are mutation-verified.
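The poll-range fix described in this commit message can be illustrated with a small sketch. It is not the PR's implementation of `PollRange.mostRecentDeploymentAttempt()`: the `StackEvent` shape, the status lists, and both function names are hypothetical, and events are assumed to arrive newest-first. The idea is that a range ending at the rollback's start misses the `CREATE_FAILED`, while a range ending at the user-initiated start includes it.

```typescript
// Hypothetical, simplified shape of a CloudFormation stack event.
interface StackEvent {
  logicalId: string;
  status: string;
  /** True when the event belongs to the stack itself rather than a resource. */
  isStackEvent: boolean;
}

// Stack-level statuses that open a user-initiated operation (assumed list).
const USER_STARTS = ['CREATE_IN_PROGRESS', 'UPDATE_IN_PROGRESS', 'DELETE_IN_PROGRESS'];
// Stack-level statuses that open an automatic rollback (assumed list).
const ROLLBACK_STARTS = ['ROLLBACK_IN_PROGRESS', 'UPDATE_ROLLBACK_IN_PROGRESS'];

// Keep events from the newest one through the first boundary event (inclusive).
function takeThrough<T>(items: T[], isBoundary: (item: T) => boolean): T[] {
  const index = items.findIndex(isBoundary);
  return index === -1 ? items : items.slice(0, index + 1);
}

// Old behavior (sketch): stops at the start of the rollback.
function mostRecentOperation(newestFirst: StackEvent[]): StackEvent[] {
  return takeThrough(newestFirst, (e) =>
    e.isStackEvent && (USER_STARTS.includes(e.status) || ROLLBACK_STARTS.includes(e.status)));
}

// New behavior (sketch): spans the rollback and the operation that triggered it.
function mostRecentAttempt(newestFirst: StackEvent[]): StackEvent[] {
  return takeThrough(newestFirst, (e) => e.isStackEvent && USER_STARTS.includes(e.status));
}
```

With a failed create followed by a rollback, only the second function returns a range that still contains the resource's `CREATE_FAILED` event.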
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>


Summary
Today, failed deployments of stacks containing custom resources (CR) often do not present useful error messages to users in the CDK CLI, especially when not using the `aws-cdk-lib/custom-resources.Provider` framework.

Custom resources can be deployed in several ways, many of which put the burden of handling communication between CloudFormation and the CR Lambda on the resource author. If a response is not provided to CloudFormation, or if a response is provided with insufficient context/reason, then the CDK CLI user will not see the root cause of the deployment failure.
Solution
Use AWS CloudTrail LookupEvents to find the events that caused the custom resource deployment to fail. Below is a log example of what this feature delivers, when the CDK CLI user runs `cdk diagnose` ~5 minutes after `cdk deploy` fails:

This will allow the CDK CLI user to get the root cause of their deployment failure, no matter how they deployed their custom resource.
Why isn't CloudTrail lookup used on `cdk deploy` failure?
`cdk deploy` generally does not run for more than a few minutes for most deployments, whether successful or failed. CloudTrail events take ~5 minutes to be available using the `LookupEvents` API. That leaves us with a couple of options:

- Wait for the events before a failed `cdk deploy` exits (terrible dev experience)
- Look them up with a separate command (`cdk diagnose`) later

CloudTrail management events are available for 90 days by default, and incur no charges to the customer. Data-plane events are not recorded by default.
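The trade-off above could be sketched roughly as follows. Only the `cloudTrailEnabled` flag name comes from this PR; the function name, the `CloudTrailLookup` type, and the hint wording are hypothetical and only meant to show the shape of the gating.

```typescript
// Hypothetical options shape; only `cloudTrailEnabled` is named in the PR.
interface InvestigateOptions {
  /** True on the `cdk diagnose` path, false on the `cdk deploy` path. */
  readonly cloudTrailEnabled: boolean;
}

// Hypothetical signature for whatever performs the actual CloudTrail lookup.
type CloudTrailLookup = (functionName: string, failedAt: Date) => Promise<string[]>;

async function cloudTrailFindings(
  functionName: string,
  failedAt: Date,
  options: InvestigateOptions,
  lookup: CloudTrailLookup,
): Promise<string[]> {
  if (!options.cloudTrailEnabled) {
    // Deploy path: events are most likely not delivered yet, so only hint.
    // Hypothetical wording; the PR's actual hint text is not shown here.
    return ['CloudTrail events can take a few minutes to appear; re-run `cdk diagnose` later.'];
  }
  try {
    return await lookup(functionName, failedAt);
  } catch {
    // Best-effort: a failed lookup must not block the rest of the diagnosis.
    // (The PR degrades to debug logging here; logging is omitted in this sketch.)
    return [];
  }
}
```

Keeping the lookup behind an injected function also makes both paths easy to unit test without touching AWS.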
Implementation details
- `cdk deploy` and `cdk diagnose` share much of the same logic, so the CloudTrail investigation is gated by a flag, `cloudTrailEnabled`, which is only true in `cdk diagnose`.
- The lookup is scoped server-side with `LookupAttributes: [{ Username: <functionName> }]`. The Lambda execution role's assumed-role session name is the function name, which CloudTrail indexes as `Username`, so the lookup returns only the function's own calls: no client-side attribution, and no risk of unrelated account activity crowding the results past the page cap.
- The lookup is bounded to a time window around the failure (`ResourceError.timestamp`), with bounded pagination.
- It surfaces events that carry an `errorCode` and is best-effort: any failure degrades to debug logging and is non-blocking.

Higher level change (prerequisite to our CloudTrail feature)
`diagnoseFromFresh` reported "no issues found" on any rolled-back stack, a pre-existing bug that made `cdk diagnose` useless in the common case, where we see a successful rollback. A rolled-back deployment spans two CloudFormation operations (the failed create/update and the rollback); the poll range only included the most recent (the rollback's successful deletes), excluding the actual `CREATE_FAILED`. Added `PollRange.mostRecentDeploymentAttempt()`, which spans both.

Changes
- `lib/api/aws-auth/sdk.ts`: `@aws-sdk/client-cloudtrail` + `ICloudTrailClient.lookupEvents`.
- `lib/api/diagnosing/resource-investigation.ts`: `investigateViaCloudTrail` (Username-scoped, paginated, windowed).
- `lib/api/diagnosing/stack-diagnoser.ts`: thread `cloudTrailEnabled`; use the new poll range for diagnose.
- `lib/api/stack-events/stack-event-poller.ts`: `PollRange.mostRecentDeploymentAttempt()`.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license
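For readers unfamiliar with `LookupEvents`, the Username-scoped, windowed, paginated lookup from the implementation details might look roughly like the sketch below. It is not the code in this PR: `CloudTrailLike` is a hypothetical stand-in for the `ICloudTrailClient` wrapper, `erroredCalls` is a made-up name, and the window size and page cap are assumed values. The request and response fields mirror a subset of the real `LookupEvents` API, where the `{ Username: <functionName> }` shorthand is spelled as an `AttributeKey`/`AttributeValue` pair and each event carries its full record as a JSON string in `CloudTrailEvent`.

```typescript
// Structural subset of the LookupEvents request/response shapes.
interface LookupEventsInput {
  LookupAttributes: { AttributeKey: 'Username'; AttributeValue: string }[];
  StartTime: Date;
  EndTime: Date;
  NextToken?: string | undefined;
}
interface LookupEventsOutput {
  Events?: { CloudTrailEvent?: string }[];
  NextToken?: string | undefined;
}
// Hypothetical stand-in for the PR's ICloudTrailClient wrapper.
interface CloudTrailLike {
  lookupEvents(input: LookupEventsInput): Promise<LookupEventsOutput>;
}

const WINDOW_MS = 15 * 60 * 1000; // assumed window on each side of the failure
const MAX_PAGES = 5; // assumed page cap

async function erroredCalls(client: CloudTrailLike, functionName: string, failedAt: Date): Promise<string[]> {
  const findings: string[] = [];
  let nextToken: string | undefined;
  for (let page = 0; page < MAX_PAGES; page++) {
    const response = await client.lookupEvents({
      // Server-side scoping: only calls made under the function's session name.
      LookupAttributes: [{ AttributeKey: 'Username', AttributeValue: functionName }],
      StartTime: new Date(failedAt.getTime() - WINDOW_MS),
      EndTime: new Date(failedAt.getTime() + WINDOW_MS),
      NextToken: nextToken,
    });
    for (const event of response.Events ?? []) {
      if (!event.CloudTrailEvent) continue;
      const record = JSON.parse(event.CloudTrailEvent);
      if (record.errorCode) {
        // eventSource looks like "s3.amazonaws.com"; keep the service prefix.
        const service = String(record.eventSource ?? '').split('.')[0];
        findings.push(`${record.errorCode} on ${service}:${record.eventName}`);
      }
    }
    nextToken = response.NextToken;
    if (!nextToken) break; // no more pages
  }
  return findings;
}
```

The page cap keeps the call bounded even if the account is busy, at the cost of possibly missing errored events that fall beyond the last page.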