feat(toolkit-lib): surface CloudTrail control-plane errors for failed custom resources#1676

Draft
iankhou wants to merge 3 commits into main from iankhou-cr-cloudtrail

@iankhou

@iankhou iankhou commented Jun 25, 2026

Contributor

Summary

Today, deployment failures involving custom resources (CRs) often do not surface useful error messages in the CDK CLI, especially when the custom resource does not use the aws-cdk-lib/custom-resources.Provider framework.

Custom resources can be deployed in several ways, many of which leave the resource author responsible for handling communication between CloudFormation and the CR Lambda. If the handler never sends a response to CloudFormation, or sends one with an insufficient reason, the CDK CLI user will not see the root cause of the deployment failure.

Solution

Use AWS CloudTrail LookupEvents to check events that caused the custom resource deployment to fail. Below is a log example of what this feature delivers, when the CDK CLI user runs cdk diagnose ~5 minutes after cdk deploy fails:

   AccessDenied on s3.amazonaws.com:CreateBucket — ... not authorized to perform: s3:CreateBucket

This will allow the CDK CLI user to get the root cause of their deployment failure, no matter how they deployed their custom resource.

Why isn't CloudTrail lookup used on cdk deploy failure?

cdk deploy generally does not run for more than a few minutes for most deployments, whether successful or failed. CloudTrail events take ~5 minutes to be available using the LookupEvents API. That leaves us with a couple of options:

  • Wait 5 minutes at the end of cdk deploy (terrible dev experience)
  • Run another command (cdk diagnose) later

CloudTrail management events are available for 90 days by default, and incur no charges to the customer. Data-plane events are not recorded by default.
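
The trade-off above could be sketched roughly as follows. This is an illustration only: `InvestigateOptions`, `CloudTrailLookup`, `cloudTrailFindings`, and the hint wording are hypothetical names chosen for the example, not the toolkit-lib code. It assumes the deploy path returns a hint and the diagnose path performs a best-effort lookup.

```typescript
// Hypothetical shapes for illustration; not the actual toolkit-lib types.
interface InvestigateOptions {
  // Per the PR description, this is true only on the `cdk diagnose` path.
  readonly cloudTrailEnabled: boolean;
}

// Hypothetical lookup function: returns human-readable findings for a Lambda function.
type CloudTrailLookup = (functionName: string) => Promise<string[]>;

async function cloudTrailFindings(
  functionName: string,
  options: InvestigateOptions,
  lookup: CloudTrailLookup,
): Promise<string[]> {
  if (!options.cloudTrailEnabled) {
    // Deploy path: events are likely not queryable yet, so only point at the follow-up command.
    // The hint wording here is made up for the example.
    return ['Run `cdk diagnose` in a few minutes to check CloudTrail for failed API calls.'];
  }
  try {
    return await lookup(functionName);
  } catch {
    // Best-effort: a failed lookup must not block the rest of the diagnosis.
    return [];
  }
}
```

Keeping the lookup behind a single flag means the deploy path never waits on CloudTrail, while the diagnose path can still fail softly.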

Implementation details

  • cdk deploy and cdk diagnose share much of the same logic, so the CloudTrail investigation is gated by a flag, cloudTrailEnabled, which is true only in cdk diagnose.
  • It is scoped server-side via LookupAttributes: [{ Username: <functionName> }]. The Lambda execution role's assumed-role session name is the function name, which CloudTrail indexes as Username, so the lookup returns only the function's own calls — no client-side attribution, and no risk of unrelated account activity crowding the results past the page cap.
  • Investigation is bounded to a time window around the failure event (ResourceError.timestamp), with bounded pagination.
  • The feature surfaces events that carry an errorCode and is best-effort: any failure degrades to debug logging and is non-blocking.
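
A minimal sketch of the lookup described in the list above, under stated assumptions. The request and response fields (`LookupAttributes` with `AttributeKey`/`AttributeValue`, `StartTime`, `EndTime`, `NextToken`, `Events`, `CloudTrailEvent`) follow the CloudTrail LookupEvents API; the `CloudTrailLookupClient` interface, the function name, the 15-minute window, and the 5-page cap are hypothetical values chosen for the example, since the PR text does not state them.

```typescript
// Minimal subset of the LookupEvents request/response shapes used in this sketch.
interface LookupEventsInput {
  LookupAttributes: { AttributeKey: string; AttributeValue: string }[];
  StartTime: Date;
  EndTime: Date;
  NextToken?: string;
}
interface LookupEventsOutput {
  Events?: { CloudTrailEvent?: string }[];
  NextToken?: string;
}
// Hypothetical stand-in for the PR's ICloudTrailClient wrapper.
interface CloudTrailLookupClient {
  lookupEvents(input: LookupEventsInput): Promise<LookupEventsOutput>;
}

interface ErroredCall {
  errorCode: string;
  errorMessage: string | undefined;
  eventSource: string | undefined;
  eventName: string | undefined;
}

const WINDOW_MS = 15 * 60 * 1000; // assumed half-width of the time window
const MAX_PAGES = 5; // assumed pagination bound

async function findErroredCalls(
  client: CloudTrailLookupClient,
  functionName: string,
  failureTime: Date,
): Promise<ErroredCall[]> {
  const errored: ErroredCall[] = [];
  let nextToken: string | undefined;
  for (let page = 0; page < MAX_PAGES; page++) {
    const response = await client.lookupEvents({
      // Server-side scoping: only events whose Username is the function name.
      LookupAttributes: [{ AttributeKey: 'Username', AttributeValue: functionName }],
      StartTime: new Date(failureTime.getTime() - WINDOW_MS),
      EndTime: new Date(failureTime.getTime() + WINDOW_MS),
      ...(nextToken ? { NextToken: nextToken } : {}),
    });
    for (const event of response.Events ?? []) {
      if (!event.CloudTrailEvent) continue;
      // The full record, including errorCode, is a JSON string on each event.
      let detail: { errorCode?: string; errorMessage?: string; eventSource?: string; eventName?: string } | null;
      try {
        detail = JSON.parse(event.CloudTrailEvent);
      } catch {
        continue; // skip records that cannot be parsed
      }
      if (detail?.errorCode) {
        errored.push({
          errorCode: detail.errorCode,
          errorMessage: detail.errorMessage,
          eventSource: detail.eventSource,
          eventName: detail.eventName,
        });
      }
    }
    nextToken = response.NextToken;
    if (!nextToken) break;
  }
  return errored;
}
```

Passing the client in as an interface keeps the sketch independent of the SDK and makes the pagination bound easy to exercise with a fake.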

Higher level change (prerequisite to our CloudTrail feature)

diagnoseFromFresh reported "no issues found" on any rolled-back stack — a pre-existing bug that made cdk diagnose useless on the common case, where we see a successful rollback. A rolled-back deployment spans two CloudFormation operations (the failed create/update and the rollback); the poll range only included the most recent (the rollback's successful deletes), excluding the actual CREATE_FAILED. Added PollRange.mostRecentDeploymentAttempt() which spans both.
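
To illustrate the idea of a poll range that spans both operations, here is a rough sketch. The PR text does not show how `PollRange.mostRecentDeploymentAttempt()` is implemented, so the boundary rule below is an assumption: walk the stack events newest-first (the order DescribeStackEvents returns them) and stop at the stack-level `CREATE_IN_PROGRESS` or `UPDATE_IN_PROGRESS` event that started the attempt. The `StackEvent` shape and the function name are hypothetical.

```typescript
// Hypothetical, simplified view of a CloudFormation stack event.
interface StackEvent {
  logicalResourceId: string;
  resourceType: string;
  resourceStatus: string;
}

// Returns the events belonging to the most recent deployment attempt, including
// its rollback. Input must be ordered newest-first.
function eventsForMostRecentAttempt(stackName: string, eventsNewestFirst: StackEvent[]): StackEvent[] {
  const result: StackEvent[] = [];
  for (const event of eventsNewestFirst) {
    result.push(event);
    // Only events about the stack itself mark operation boundaries; nested stacks
    // share the resource type but have a different logical ID.
    const isStackLevel =
      event.resourceType === 'AWS::CloudFormation::Stack' && event.logicalResourceId === stackName;
    // Assumed boundary: the user-initiated start of the create/update, which
    // precedes both the failure and the rollback events.
    const startsAttempt =
      event.resourceStatus === 'CREATE_IN_PROGRESS' || event.resourceStatus === 'UPDATE_IN_PROGRESS';
    if (isStackLevel && startsAttempt) break;
  }
  return result;
}
```

A range that stopped at the stack-level `ROLLBACK_IN_PROGRESS` event instead would contain only the rollback's deletes, which matches the bug described above.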

Changes

  • lib/api/aws-auth/sdk.ts: @aws-sdk/client-cloudtrail + ICloudTrailClient.lookupEvents.
  • lib/api/diagnosing/resource-investigation.ts: investigateViaCloudTrail (Username-scoped, paginated, windowed).
  • lib/api/diagnosing/stack-diagnoser.ts: thread cloudTrailEnabled; use the new poll range for diagnose.
  • lib/api/stack-events/stack-event-poller.ts: PollRange.mostRecentDeploymentAttempt().
  • Tests for both paths, the Username scoping, pagination, and the rolled-back regression — new behaviors mutation-verified.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@github-actions github-actions Bot added the p2 label Jun 25, 2026
@aws-cdk-automation aws-cdk-automation requested a review from a team June 25, 2026 20:35
@iankhou

iankhou commented Jun 25, 2026

Contributor Author

Validated end-to-end

cdk diagnose on a custom resource whose backing Lambda was denied s3:CreateBucket, ~5 min after the failed deploy:

❌ Stack IamDeniedCustomResourceStack:
Resource updates failed:
IamDeniedCustomResourceStack/MyDeniedResource/Default  (AWS::CloudFormation::CustomResource MyDeniedResource)
  Received response status [FAILED] from custom resource. ...

   Logs from /aws/lambda/IamDeniedCustomResourceStack-CrHandler...:
   INFO  request type: "Create"
   ERROR Setup failed due to an internal error.
   INFO  request type: "Delete"
   Logs: https://...console.aws.amazon.com/cloudwatch/...

   AccessDenied on s3.amazonaws.com:CreateBucket — User: arn:aws:sts::...:assumed-role/...CrHandlerServiceRole.../IamDeniedCustomResourceStack-CrHandler... is not authorized to perform: s3:CreateBucket ... because no identity-based policy allows the s3:CreateBucket action

The CloudWatch logs alone ("Setup failed due to an internal error.") are not diagnosable and require additional investigation; the CloudTrail line names the exact denied permission. Confirms the function-scoped (Username) lookup, the rolled-back-stack diagnosis fix, and attribution all work in a real account.

Comment thread packages/@aws-cdk/toolkit-lib/lib/api/diagnosing/stack-diagnoser.ts
@codecov-commenter

codecov-commenter commented Jun 26, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.31%. Comparing base (221e5de) to head (429255b).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1676      +/-   ##
==========================================
+ Coverage   89.29%   89.31%   +0.01%     
==========================================
  Files          77       77              
  Lines       11716    11716              
  Branches     1620     1619       -1     
==========================================
+ Hits        10462    10464       +2     
+ Misses       1223     1221       -2     
  Partials       31       31              
Flag Coverage Δ
suite.unit 89.31% <ø> (+0.01%) ⬆️


Copilot AI left a comment

Contributor

Pull request overview

This PR enhances @aws-cdk/toolkit-lib’s post-failure diagnosis for custom resources by optionally consulting CloudTrail to surface underlying control-plane errors (e.g., AccessDenied) that typically don’t show up clearly in the custom resource Lambda logs. It also fixes a pre-existing diagnosis gap for rolled-back stacks by expanding the stack event poll range to cover both the failed operation and its rollback.

Changes:

  • Add a CloudTrail lookup (scoped to the custom resource’s backing Lambda “Username”) to surface errored API calls during cdk diagnose, while keeping cdk deploy on a hint-only path.
  • Fix rolled-back stack diagnosis by introducing PollRange.mostRecentDeploymentAttempt() and using it in fresh diagnosis.
  • Add/update tests and SDK mocking to cover CloudTrail behavior, pagination, and the rolled-back regression.

Reviewed changes

Copilot reviewed 12 out of 15 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| yarn.lock | Adds the CloudTrail AWS SDK v3 client (and transitive updates) to the workspace lockfile. |
| packages/aws-cdk/THIRD_PARTY_LICENSES | Updates bundled third-party attributions to include the newly bundled CloudTrail client and related deps. |
| packages/@aws-cdk/toolkit-lib/test/api/diagnosing/stack-diagnoser.test.ts | Adds regression test ensuring diagnosis finds failures across rollback + failed create operations. |
| packages/@aws-cdk/toolkit-lib/test/api/diagnosing/resource-investigation.test.ts | Adds tests for CloudTrail integration (gating, scoping, pagination, and “no errors” behavior). |
| packages/@aws-cdk/toolkit-lib/test/_helpers/mock-sdk.ts | Adds CloudTrail client mocking support for unit tests. |
| packages/@aws-cdk/toolkit-lib/package.json | Declares @aws-sdk/client-cloudtrail as a runtime dependency of toolkit-lib. |
| packages/@aws-cdk/toolkit-lib/lib/api/stack-events/stack-event-poller.ts | Introduces PollRange.mostRecentDeploymentAttempt() to span rollback + triggering operation. |
| packages/@aws-cdk/toolkit-lib/lib/api/diagnosing/stack-diagnoser.ts | Threads cloudTrailEnabled and uses the new poll range for diagnose-from-fresh. |
| packages/@aws-cdk/toolkit-lib/lib/api/diagnosing/resource-investigation.ts | Extends the investigation dispatcher to pass options through to custom-resource investigation. |
| packages/@aws-cdk/toolkit-lib/lib/api/diagnosing/investigate-ecs-service.ts | Clarifies shared vs resource-specific investigation options in comments. |
| packages/@aws-cdk/toolkit-lib/lib/api/diagnosing/investigate-custom-resource.ts | Implements CloudTrail lookup + deploy-path hint for custom resource investigation. |
| packages/@aws-cdk/toolkit-lib/lib/api/aws-auth/sdk.ts | Adds ICloudTrailClient and SDK.cloudTrail() wrapper. |
| packages/@aws-cdk/toolkit-lib/.projen/deps.json | Records the new runtime dependency for projen-managed dependency tracking. |
| .projenrc.ts | Adds the CloudTrail SDK dependency to the toolkit-lib project configuration. |

@rix0rrr

rix0rrr commented Jun 29, 2026

Contributor

Why do this only for custom resources? Can we not do it in a way that works generically for all resources?

I haven't thought too deeply about how we should get "useful information" from CloudTrail; I was hoping you would.

I was hoping we could do something like the following:

image

@rix0rrr

rix0rrr commented Jun 29, 2026

Contributor

> CloudTrail LookupEvents records write/mutating API denials, not most read calls (e.g. ssm:GetParameter).

Mostly true, but not exactly: CloudTrail records control plane events, not data plane events. Usually that means: low-traffic configuration events are recorded, high-traffic events are not.

So 2 counterexamples to your statement above:

  • S3's PutObject is a mutating call, but wouldn't be recorded.
  • Lambda's GetFunctionConfiguration is a read call, but would be recorded.

The point still stands that not everything will be recorded, but you need to know what will and won't be, and for what reasons, to give accurate guidance to users.

Speaking of, can you make sure that you understand the Pull Request and write the PR body yourself, rather than extrude some text via AI? I would like some proof of understanding. Thank you.

@iankhou

iankhou commented Jun 29, 2026

Contributor Author

> > CloudTrail LookupEvents records write/mutating API denials, not most read calls (e.g. ssm:GetParameter).
>
> Mostly true, but not exactly: CloudTrail records control plane events, not data plane events. Usually that means: low-traffic configuration events are recorded, high-traffic events are not.
>
> So 2 counterexamples to your statement above:
>
>   • S3's PutObject is a mutating call, but wouldn't be recorded.
>   • Lambda's GetFunctionConfiguration is a read call, but would be recorded.
>
> The point still stands that not everything will be recorded, but you need to know what will and won't be, and for what reasons, to give accurate guidance to users.
>
> Speaking of, can you make sure that you understand the Pull Request and write the PR body yourself, rather than extrude some text via AI? I would like some proof of understanding. Thank you.

Will do, rewritten. Missed the errant caveats at the end, and thanks for the catch. Removed that section because it's evident from the main section of the PR body anyway. I'll outsource less for the PR descriptions.

@github-actions

Contributor

Total lines changed 7257 is greater than 1000. Please consider breaking this PR down.

@iankhou

iankhou commented Jun 29, 2026

Contributor Author

> Total lines changed 7257 is greater than 1000. Please consider breaking this PR down.

It's from THIRD_PARTY_LICENSES, which should be ignored. See #1680 for fix.

@iankhou

iankhou commented Jun 29, 2026

Contributor Author

> Why do this only for custom resources? Can we not do it in a way that works generically for all resources?
>
> I haven't thought too deeply about how we should get "useful information" from CloudTrail; I was hoping you would.
>
> I was hoping we could do something like the following:
>
> image

Yep this is what we are doing for custom resources. I'll explore how to do this generically.

iankhou and others added 3 commits June 29, 2026 14:21
… custom resources

When a custom resource fails because its backing Lambda was denied an AWS API
call (a control-plane failure), the function's own logs often don't show it — but
CloudTrail records the AccessDenied. This adds a CloudTrail lookup to the
custom-resource investigation.

CloudTrail events are delivered with several minutes of latency, so they aren't
available when a `cdk deploy` fails. Since deploy and diagnose share the diagnosis
code, this is gated on a cloudTrailEnabled flag:
- diagnoseFromFresh (`cdk diagnose`, run after the fact) enables it and looks up
  CloudTrail, scoped to the function server-side via the Username attribute (the
  Lambda execution role's session name is the function name), paginated and bounded
  to a window around the failure. Errored events are surfaced (e.g. "AccessDenied
  on s3:CreateBucket").
- diagnoseFromErrorCollection (`cdk deploy`) leaves it off and adds a hint to re-run
  `cdk diagnose` in a few minutes.

Also fixes diagnoseFromFresh reporting "no issues found" on any rolled-back stack:
a rolled-back deployment spans two CloudFormation operations (the failed create and
the rollback); the poll range only covered the most recent (the rollback's
successful deletes), excluding the real CREATE_FAILED. Adds
PollRange.mostRecentDeploymentAttempt() spanning both.

- Add @aws-sdk/client-cloudtrail and an ICloudTrailClient.lookupEvents wrapper.
- Thread cloudTrailEnabled through DiagnoseOptions/InvestigateOptions.

Validated end-to-end against a denied s3:CreateBucket. Known limitation: CloudTrail
LookupEvents records write/mutating denials, not most read calls. Tests cover both
paths, Username scoping, pagination, and the rolled-back regression; new behaviors
are mutation-verified.
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>