Skip to content

2024.9.x fix workflow error surfacing#2320

Open
Toocky wants to merge 19 commits into
cortezaproject:2024.9.xfrom
Toocky:2024.9.x-fix-workflow-error-surfacing
Open

2024.9.x fix workflow error surfacing#2320
Toocky wants to merge 19 commits into
cortezaproject:2024.9.xfrom
Toocky:2024.9.x-fix-workflow-error-surfacing

Conversation

@Toocky
Copy link
Copy Markdown

@Toocky Toocky commented Apr 11, 2026

The following changes are implemented

  1. Improvement - Workflow errors #2319 Placing an iterator inside a parallel fork/join produced a bogus failed to resolve workflow step dependencies error and the workflow couldn't be saved or run. Iterator conversion required its next and exit children in the graph first; join gateway required all parents in the graph first. When an iterator's exit path led to a join gateway, they formed a circular dependency and the resolution loop stalled.
  • Fix: Join gateway is now created eagerly with whatever parents are already resolved; a new PathAdder interface + joinGateway.AddPath method lets the converter backfill missing parents in a second pass after the main resolution loop finishes. Breaks the cycle without affecting runtime correctness.
  1. Improvement - Workflow errors #2319 After an error handler was registered, author-chosen variables like errorMessage / errStep became unavailable to downstream steps the moment any non-erroring step ran between the handler registration and the actual error. Manifested as "works initially, stops working after some time/uses", clustered in subworkflows and inside exclusive/inclusive gateways.
  • Fix: Split the two concerns. Added a new State.errHandlerResults field holding only the name mapping, carried through State.Next() alongside errHandler with the same lifecycle (set together, cleared together). results goes back to its proper role of only holding the previous step's output.
  1. Improvement - Workflow errors #2319 Any variable set by a step inside a parallel branch could silently vanish at the join if that branch wasn't paths[0].
  • Fix: Merge every parent's full scope (not just paths[0]'s), then merge every parent's results on top, both in reverse path order so paths[0] deterministically wins conflicts. Gives a clean precedence hierarchy: paths[0].results > … > paths[N].results > paths[0].scope > … > paths[N].scope.
  1. Improvement - Workflow errors #2319 Previously, when makeGraph stalled it emitted a single "failed to resolve workflow step dependencies" error attached to whatever step happened to be iterated last. The user couldn't tell which step was stuck or why
  • Fix: When the loop stalls, every unresolved step is now reported individually with its kind, ref, ID, and the specific unresolved child/parent step IDs it is waiting for.
  1. Show appropriate error from workflow editor when backend is unaccessible #687 Saving a workflow when the backend was unreachable produced Cannot read property 'workflowID' of undefined in the browser console. And when the real cause was a trigger update failure, the user saw the misleading "Make sure all trigger steps are properly configured" message even if the actual error was totally unrelated.
  • Fix: Wrapped each save stage in its own try/catch. Network errors produce "Unable to reach the server. Check your connection and try again." with positive-evidence detection (axios err.code matching ECONN|ETIMEDOUT|ENETUNREACH|ENOTFOUND, or err.request && !err.response, or explicit message regex). Trigger update failures surface the actual underlying error instead of the generic configure-triggers red herring. Defensive guard against unexpected saved.workflowID shapes. processingSave = false moved into finally.
  1. Add more information to workflow errors by including which parameter causes the error #1725 Expression parse failures returned bare parser output like "syntax error near X" with no indication of which argument or result failed. Type mismatches named the wrong type but not what was accepted. Missing required args didn't say what type was expected.
  • Fix: New parseExpressionsAs wrapper takes a category ("argument" / "result") and enriches parse failures with the target name, the failing expression (rune-aware truncated to 80 chars), and which lifecycle stage produced the error (parse / type / test parse). requiredArg now says "expected type: X or Y" when reporting a missing argument. checkArg lists the accepted types when the value type is wrong. workflowStepDefConv wrapper messages now lead with the step kind and ref so the offending step is identifiable at a glance.
  1. Improve error message display in workflows #1905 The error step could only carry a single flat message. End users saw generic red toasts with no title, no severity distinction, no structure.
  • Fix: Added optional title and severity (error/warning/info) arguments to the error step alongside the existing required message. Runtime attaches them to the resulting KindAutomation error via the errors package's meta bag under a namespaced workflow.error.* key, gated by a specific workflow.error.safe = true meta flag so only user-authored workflow errors (not every KindAutomation error) bypass production masking. Frontend's shared toastErrorHandler mixin detects the meta keys, renders rich toasts with title/variant/body, and falls back to the legacy plain-toast path for old workflows and non-workflow errors. Configurator gets dedicated Title/Message/Severity fields; severity dropdown falls back to a raw text input with a reset button when the severity expression is not a literal, so custom expressions (vars.level) aren't clobbered.

Changes in the user interface:

Extended workflow Error Step
image

More meaning full errors:
image

Checklist when submitting a final (!draft) PR

  • Commits are tidied up, squashed if needed and follow guidelines in CONTRIBUTING.md
  • Code builds
  • All existing tests pass
  • All new critical code is covered by tests
  • PR is linked to the relevant issue(s)
  • Rebased with the target branch

Toocky added 19 commits April 11, 2026 10:28
When the resolution loop stalls, report every unresolved step
individually with its kind, ref, ID, and the specific unresolved
child/parent step IDs it is waiting for, instead of a single
generic 'failed to resolve workflow step dependencies' error
pointing at an arbitrary step.
Issue cortezaproject#1905: error step now supports optional title, body and severity
arguments alongside the legacy message argument. Fields are attached
to the returned KindAutomation error via the errors package meta bag
under the workflow.error.* namespace.

The errors package http.go is updated to preserve full error details
(message + meta) for KindAutomation errors regardless of mask state,
since workflow error steps are user-authored and explicitly safe to
surface to the client.

Backwards compatible: workflows configured with only 'message' produce
identical wire output. Frontend rendering changes land separately.
Issue cortezaproject#1905: expose the new rich error step arguments (title, body,
severity) alongside the legacy message field in the error step
configurator. The canvas node label now prefers title > body > message
when building its preview.

Config.arguments is normalised on load so all four targets are always
present (with empty exprs), keeping v-model bindings stable even for
legacy workflows that only have message.
Issue cortezaproject#1905: detect the namespaced workflow.error.{title,body,severity}
meta keys attached by the backend error step and render a rich toast
with title/body/variant when present.

The detection probes a handful of error-shape variants (direct reject,
axios wrap, nested response.data.error) and returns null for anything
that does not look like a workflow error step payload. When null, or
when the rich rendering branch throws, the handler falls through to
the existing legacy plain-toast path - so non-workflow errors and
legacy workflow errors render exactly as they do today.
Issue cortezaproject#1725: workflow validation errors previously omitted which
argument or result caused the failure, making them hard to debug.

- parseExpressionsAs now wraps each parser/type/test failure with the
  category (argument vs result), the target name, the failing expression
  (truncated to 80 chars), and the lifecycle stage.
- requiredArg includes the accepted type(s) when reporting a missing
  argument so designers know what to put in the field.
- checkArg includes the list of accepted types when the value type is
  wrong, instead of just naming the unexpected type.
- workflowStepDefConv wrapper messages now lead with the step kind and
  ref so the offending step is identifiable at a glance.

All existing test fixtures continue to work since the wrapped errors
remain *errors.Internal under the hood.
Issue cortezaproject#687: the workflow save handler previously crashed with
'Cannot read property workflowID of undefined' when the backend
was unreachable, and masked trigger-update failures behind a
misleading 'Make sure all trigger steps are properly configured'
message.

The handler now:
- Catches trigger create/update/delete failures separately and
  surfaces the underlying error rather than the configure-triggers
  red herring.
- Wraps each API stage so a network failure short-circuits with a
  clear 'Unable to reach the server' toast.
- Defensively guards against an unexpected workflowCreate /
  workflowUpdate response shape so saved.workflowID never throws.
- Distinguishes network errors (axios timeouts, ENOTFOUND, etc.)
  from backend-returned errors and renders the appropriate message.
Backend
- errors/http.go bypass is now narrowed to KindAutomation errors that
  carry an explicit workflow.error.safe = true meta flag (set only by
  convErrorStep). Generic automation errors from corredor etc. keep
  going through the normal masking path. Flag constant lives in
  errors.MetaWorkflowErrorSafe so both sides share one source of truth.
- exprExcerpt now truncates on rune boundaries so non-ASCII identifiers
  and literals can never split mid-codepoint in error messages.
- Dead parseExpressions wrapper removed; only parseExpressionsAs remains.

Frontend
- Workflow save handler: isNetworkError now requires *positive* evidence
  (axios err.code, err.request && !err.response, or an explicit Network
  Error message). Plain JS errors thrown from inside .then() handlers
  no longer get misreported as "server unreachable". processingSave is
  cleared in finally so an unexpected sync throw can never leave the
  spinner stuck.
- Error step configurator: severity dropdown only renders when the
  severity expression is a literal we can round-trip; otherwise a raw
  text input is shown with a "reset to literal" affordance so custom
  expressions (vars.level etc.) are never clobbered on re-render. Node
  label preview strips literal quotes for a cleaner canvas label.
- toast mixin: added a comment explaining that body/title are
  workflow-author controlled and that BootstrapVue toasts render them
  as plain text — switching to HTML rendering here without sanitising
  would introduce a stored-XSS surface.
- Removed dead notification:configure-triggers locale key.
State.results was doing double duty: it held both the previous step's
output values AND the error handler's variable-name mapping (error ->
myErr, errorMessage -> myMsg, errorStepID -> myID). The error-handler
case merged the mapping into results, but the normal-step case blew
results away with a plain assignment — so the first non-error step
after an error handler registration clobbered the mapping, and any
later error left errorMessage / error / errorStepID undefined in the
downstream scope.

This matched the reported "works for a while then stops working" and
"Log Message can't find errorMessage" symptoms, and explains why the
bug clustered around subworkflows and iterators (both of which have
multi-step paths after the error handler where a normal step is
guaranteed to execute before an error surfaces).

Fix: split the two concerns onto separate State fields.

- Add State.errHandlerResults holding only the name mapping.
- Carry it across states in State.Next() alongside errHandler, same
  lifecycle (set together, cleared together).
- setErrorHandlerResultsToScope now reads errHandlerResults.
- Normal step path leaves the mapping untouched.
- errHandlerResults is cleared alongside errHandler when a handler
  activates, so a subsequent handler cannot inherit stale mappings.
The original design split the error step body from the legacy message
argument so authors could theoretically write a terse flat fallback
and a longer rich toast body. In practice nobody would ever write
two different strings and the extra field just added confusing UI
surface for zero real benefit.

message is now the single source of truth: required, used as both
the flat error string and the rich toast content. title and severity
remain as optional rich-toast decorations. Backwards compatible with
legacy workflows that only configured message, since the flat string
is still what it always was.

- Backend convErrorStep drops body/metaBody entirely; flat message is
  now just the evaluated message arg.
- Backend verifyStep drops body from the allowed args.
- Frontend configurator drops the Body form-group and bodyArg.
- toast.js mixin no longer reads workflow.error.body; the flat
  err.message is the body.
- Locale keys for error-step.body removed and message/title labels
  updated to reflect the simplified semantics.
In makeGraph(), the iterator step requires both its next and exit
child steps to already exist in the graph, and the join gateway
requires all its parent steps to already exist. When an iterator's
exit path leads to a join gateway with multiple parents, neither
can resolve first — the resolution loop stalls and reports a
generic 'failed to resolve workflow step dependencies' error.

Fix:
- Add an exported PathAdder interface and AddPath method on
  joinGateway so parents can be registered after construction
  (mutex-guarded, deduped).
- Change convGateway 'join' case to create the gateway eagerly
  with whatever parents are already resolved, instead of returning
  nil, nil. This breaks the cycle: the join exists, the iterator
  finds its exit step, the iterator resolves, the iterator branch
  resolves, and the missing parents become available.
- Add a backfill pass after the main resolution loop that walks
  every join gateway and registers any parents that resolved
  later via PathAdder.AddPath.

The runtime check at joinGateway.Exec (paths.Contains(r.Parent))
still guards against unknown parents — the backfill pass runs
synchronously to completion before any session executes so the
graph is always fully wired before runtime.
joinGateway.Exec previously started its merge from paths[0]'s scope
only and merged in every parent's immediate step results on top. Any
variable set mid-branch in paths[1..N] lived in that branch's scope
but was never in the final step's results, so it was silently dropped
at the join. Whichever branch happened to be paths[0] worked; the
others lost their intermediate variables.

This matched the reported "variables don't always pass through the
joined flow" symptom and explained why authors worked around it by
moving Base64-style steps out of parallel forks entirely.

Fix: merge every parent's scope (not just paths[0]'s), then merge
every parent's results on top, and do both in reverse path order so
that paths[0] wins conflicts deterministically. This preserves the
intra-branch rule that step results overwrite earlier scope values
and gives a clear precedence hierarchy:

  paths[0].results > ... > paths[N].results >
  paths[0].scope   > ... > paths[N].scope

No behaviour change for workflows that previously relied on paths[0]
winning (that's now the explicit rule). Variables that used to be
silently dropped from non-paths[0] branches are now preserved.
When two parallel branches produce conflicting values for the same
variable, the merge still proceeds with paths[0] winning but the
colliding branches are now reported as non-fatal warnings.

- Frame gains a Warnings []string field alongside the existing
  Action/Error so the editor test panel and admin session log can
  render them in their existing per-step display.
- State gains a matching warnings slice that MakeFrame copies into
  the emitted frame.
- A new responseWithWarnings wrapper response lets a step return
  its normal *expr.Vars payload together with a list of warnings.
  The session unwraps it on dispatch, attaches warnings to the
  current State (for frame emission) and also mirrors each warning
  through zap.Warn so ops can grep the server log independently of
  the editor.
- joinGateway.Exec runs a detectConflicts pass before merging. For
  every key seen across multiple branches it compares stringified
  values; mismatches produce a warning naming the key, the
  contributing branches/origins (scope vs results), and the clipped
  values. The merge then proceeds with paths[0] winning as before.

Workflows with no conflicts get zero behaviour change — the warning
path only kicks in when detectConflicts returns a non-empty slice.
Introduces a design-time analyzer that walks the workflow graph and
warns when two or more parallel branches joining at the same gateway
can write the same variable name. Runs during validateWorkflow
alongside the existing graph conversion so warnings are populated on
every save without requiring a test run.

Output lives in a new Workflow.Warnings field (parallel to Issues,
same WorkflowIssueSet shape) so the frontend can render warnings with
the existing per-step culprit mapping and badge logic.

What the analyzer considers a "branch write":
- Function/Iterator/ExecWorkflow step: its declared Results targets.
- Expressions step: its Arguments targets (the assignment LHS).
- ErrorHandler step: its Results targets (author-chosen error var names).
- Everything else (Error, Termination, Break, Continue, Debug, Delay,
  Prompt, Visual, plain Gateway): no scope writes.

Subworkflow walking is intentionally NOT recursive. A subworkflow's
effect on the calling scope is exactly the ExecWorkflow step's own
declared results (see convExecWorkflowStep which Evals s.Results
against the subworkflow return value). The static declaration is
the full contract, so there's nothing to gain from walking inside.

Dynamic writes (vars[computed] = ...) are not detected. Documented
as a known limitation in the analyzer source.

The analyzer is non-fatal by construction — it never adds to Issues
and never blocks save/run. Pair with the runtime detectConflicts
warning in joinGateway.Exec to catch what static analysis missed.
Mirrors the existing issues pattern for the new workflow.warnings
payload (populated by the backend analyzer) and for the runtime
Frame.Warnings emitted by joinGateway.Exec during a test run.

- New this.warnings map populated from workflow.warnings, keyed by
  stepID, identical shape and assembly to this.issues.
- hasWarnings computed + yellow header badge next to the existing
  red issues badge when wf.warnings is non-empty.
- Yellow warning triangle icon (new public/icons/warning.svg, same
  silhouette as issue.svg with a warning-yellow fill) rendered on
  the canvas node when a step has warnings but no issues. Issues
  take precedence on the badge position since they are more severe.
- Clicking the warning icon opens the shared issues modal which now
  has two sections (Issues in red, Warnings in yellow). The modal
  only renders a section if that list is non-empty.
- Locale keys editor:detected-warnings and editor:warnings added.
When the step resolution loop stalls, a step may have no unresolved
child or parent neighbours. In that case it is not actually waiting
on anything — its own conversion is failing (bad ref, missing
required argument, unknown function, etc.) and verifyStep or the
converter error path has already recorded the specific reason in
wfii. Appending a "waiting for [] and/or []" entry on top adds
noise that actively misleads the user into looking for a
dependency problem that does not exist.

The stall reporter now skips any unresolved step whose unresolved
neighbour lists are both empty; the pre-existing issues on that
step are left to speak for themselves.
Review C2: replace fmt.Sprintf("%v", tv.Get()) as the detectConflicts
fingerprint with encoding/json marshalling. %v prints maps in random
iteration order and pointers as addresses, producing both false
positives (same map compared unequal because keys swapped) and false
negatives (different pointers compared equal because both printed
their address only). encoding/json sorts object keys on output so
structurally equal values serialise identically. Failed marshals
fall back to %v, still consistent within a single run.

Review C3: collectBranchWrites now stops at any join gateway other
than the home join being analyzed. In nested parallels the outer
branch's upstream chain passes through the inner join -> the inner
fork -> the outer branch; walking through the inner join used to
bleed outer-branch writes into every inner branch and flagged
spurious conflicts. The inner join's own analysis handles its
incoming branches independently so we must not cross it.

Review C4: drop iterator from stepWrites. Iterator loop vars (item,
counter, isFirst, ...) are injected into the body scope per
iteration and never survive the iterator's exit path, so two parallel
iterators sharing the idiomatic "item" target are not actually in
conflict at the downstream join. Steps inside the iterator body
still get walked normally via parent edges.

Review M1: warning messages no longer refer to the paths[i]
abstraction that authors cannot correlate with the canvas. They now
name the winning branch by its parent step ID and list each
contributing branch by its parent step ID, origin (scope/results)
and value fingerprint.

Review m5: truncateForWarning is now rune-aware so non-ASCII values
cannot be split mid-codepoint.

Review m11: isNetworkError recognises EAI_AGAIN (DNS transient) and
axios >= 1.x ERR_NETWORK generic network codes.

Review m13: drop the redundant .Wrap(err) on errors.Internal calls
in workflowStepDefConv. parseExpressionsAs already wraps its inner
error through fmt.Errorf("%w") so the outer message already contains
the full chain; the extra .Wrap doubled the inner error in both the
formatted text and the wrap chain.

Review m14: verifyStep now validates severity as a save-time error
when the expression is a simple quoted literal ("error", "warning",
"info"). Non-literal expressions (e.g. vars.level) fall through and
are normalised at runtime as before. Catches authors typing e.g.
"danger" without running the workflow first.
Review M3: Error.vue computed getters (messageArg/titleArg/severityArg)
no longer mutate item.config.arguments. They were previously calling
findOrCreateArg, which pushed a new arg onto the array and invoked
this.\$set — a side effect inside a computed getter that happened to
work today only because created() pre-populated every target and the
create branch was never hit at runtime. Any future filter on the
arguments array would re-enter the mutating getter during reactivity
graph evaluation. The getters are now pure findArg + makeArg fallback
reads; mutation is hoisted into a new ensureArgs() method called from
created() and implicitly by findOrCreateArg for write-path callers.

Review M8: when workflowCreate/workflowUpdate returns an unexpected
shape without a workflowID, the save handler now console.warns the
raw payload before toasting. The toast still shows a user-facing
message but the malformed payload is no longer silently discarded,
so the path is actually diagnosable from the browser console.

Review m2: document the concurrency model of State.warnings instead
of guarding it. Each *State is owned by a single step-execution
goroutine for its entire lifecycle — Session.exec and the
corresponding eventHandler run on the same goroutine — and sibling
parallel branches use distinct *State values. No cross-state sharing
means no mutex is required, just the documentation of the invariant
so future code doesn't accidentally cross-post.

Review m9 (no change, verified): the title-only severity default
allegedly mismatches between Go default and frontend variant. Verified
this is not a real mismatch. Backend attaches workflow.error.severity
meta whenever title or severity is non-default; when only title is
set, severity falls through as "error", which the frontend maps to
variant "danger". When neither is set, no meta attaches and the
legacy plain-toast path runs. All three cases are consistent.
Review M6: add Test_writeHttpJSON_workflowErrorSafeBypass to lock the
four-cell matrix for KindAutomation error masking so the opt-in
MetaWorkflowErrorSafe bypass semantics cannot be silently changed by
future refactors. The test exercises:

  1. automation + safe flag + masked → meta preserved
  2. automation + no flag + masked → meta stripped
  3. non-automation + safe flag + masked → meta stripped
     (flag is a no-op on non-automation kinds — defensive
     regression check against a future author bypassing the
     kind gate)
  4. automation + safe flag + unmasked → meta preserved

All four pass. The flat message is always visible in all cases,
matching the pre-existing contract that Error.Error() is always
sent to the client even under mask.

Review M2: add a prominent doc block on State.errHandlerResults
explaining the intentional non-propagation to the running scope.
An Expressions step downstream of an error handler that references
a to-be-populated error variable name will not see it until the
error actually fires and the injection path runs. This is by
design — silently seeding author-chosen variable names into scope
would pollute the namespace and make debugging harder (a nil from
a reference cannot be distinguished from a typo). Documenting the
contract so future readers do not "fix" it into scope pollution.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant