brief field with colon in YAML frontmatter produces invalid YAML in concept and summary pages
Summary
openkb add generates wiki/concepts/*.md (and to a lesser extent wiki/summaries/*.md) with YAML frontmatter where the brief: value is written as an unquoted plain scalar. When the LLM-generated brief contains ": " (a colon followed by space) — which is very common in natural-language summaries — the resulting frontmatter is invalid YAML and breaks any external tool that strictly parses it (VS Code's Markdown extensions, Obsidian plugins, yaml lint, generic YAML loaders, etc.).
OpenKB's own internal helpers (e.g. _read_concept_briefs in agent/compiler.py) read brief via string slicing (line[len("brief:"):]), so the failure is silent within OpenKB — it only surfaces when wiki pages are consumed by external YAML-aware tools. That is plausibly why this hasn't been reported despite likely being triggered by many users.
Reproduction
- openkb init in an empty directory, then drop in any .md doc whose topic naturally invites an "X: Y" style one-liner (technical specs, comparisons, "why X over Y" themes).
- openkb add path/to/doc.md
- Inspect wiki/concepts/*.md. With non-trivial probability, one of the generated briefs contains ": " and the frontmatter is invalid YAML.
Concrete example actually produced in my run (ADASIS v2 spec corpus):
---
sources: [summaries/v2s_chapter_02.md, summaries/v2s_chapter_01.md]
brief: Why ADASIS v2 supersedes v1: shifting horizon reconstruction complexity from client applications to the provider.
---
VS Code Markdown Preview surfaces:
Failed to parse frontmatter
Nested mappings are not allowed in compact mappings at line 2, column 8:
brief: Why ADASIS v2 supersedes v1: shifting horizon reconstruction complexity ...
^
yaml.safe_load confirms: yaml.scanner.ScannerError (mapping values not allowed here).
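To make the contrast concrete, here is a minimal sketch that feeds the frontmatter from the example above to both a slicing-style reader and a strict parser. read_brief_by_slicing is a hypothetical stand-in for OpenKB's internal helper (only the line[len("brief:"):] idiom is taken from this report), parses_as_yaml is an illustrative name, and PyYAML is assumed to be installed.

```python
import yaml

# Frontmatter body from the ADASIS example above (without the --- fences).
FRONTMATTER = (
    "sources: [summaries/v2s_chapter_02.md, summaries/v2s_chapter_01.md]\n"
    "brief: Why ADASIS v2 supersedes v1: shifting horizon reconstruction "
    "complexity from client applications to the provider.\n"
)


def read_brief_by_slicing(text):
    """Hypothetical stand-in for a slicing-based reader: no YAML parsing involved."""
    for line in text.splitlines():
        if line.startswith("brief:"):
            return line[len("brief:"):].strip()
    return None


def parses_as_yaml(text):
    """Return True if a strict YAML loader accepts the text."""
    try:
        yaml.safe_load(text)
    except yaml.YAMLError:
        return False
    return True


if __name__ == "__main__":
    # The slicing reader returns the brief unchanged, second colon included.
    print(read_brief_by_slicing(FRONTMATTER))
    # The strict loader rejects the same text.
    print(parses_as_yaml(FRONTMATTER))
```

The slicing reader never sees a problem because it treats everything after the first "brief:" as opaque text, which is consistent with the failure staying invisible inside OpenKB.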
Root cause
In openkb/agent/compiler.py, two locations interpolate brief into the frontmatter with a plain f-string, without quoting or escaping:
- Line ~534 (update path):
fm = re.sub(r"brief:.*", f"brief: {brief}", fm)
- Line ~546 (create path):
fm_lines.append(f"brief: {brief}")
The brief value is the only LLM-authored field in the frontmatter (other fields — doc_type, full_text, sources, source paths — are code-generated and sanitized). So the existing assumption in schema.py ("frontmatter is managed by code") is correct in intent but violated for this one field, because the code doesn't quote/escape what the LLM provides.
Suggested fix
Route brief through yaml.safe_dump so PyYAML auto-quotes when needed:
import yaml
def _yaml_kv_line(key: str, value: str) -> str:
    # Let PyYAML decide whether the value needs quoting.
    line = yaml.safe_dump(
        {key: value},
        default_flow_style=False,
        width=10**9,  # very large width so long briefs are not folded
        allow_unicode=True,
    ).strip()
    return line.split("\n")[0]
# replace both call sites with:
fm_lines.append(_yaml_kv_line("brief", brief))
# and:
safe = _yaml_kv_line("brief", brief)
# callable replacement: re.sub inserts the string verbatim
fm = re.sub(r"brief:.*", lambda _m: safe, fm)
I have applied exactly this monkey-patch locally and verified round-trip for 5 cases (colon, double-quote, hash, comma, parentheses) — all pass yaml.safe_load. The patch is ~5 lines.
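A round-trip check along these lines could look like the sketch below. The helper is copied from the suggested fix; the sample strings in CASES are illustrative stand-ins for the five categories (not the exact strings from my run), and round_trips and update_brief are hypothetical names introduced for the sketch.

```python
import re

import yaml


def _yaml_kv_line(key: str, value: str) -> str:
    """Serialize one key/value pair, letting PyYAML quote the value if needed."""
    line = yaml.safe_dump(
        {key: value},
        default_flow_style=False,
        width=10**9,
        allow_unicode=True,
    ).strip()
    return line.split("\n")[0]


# Illustrative briefs, one per category: colon, double-quote, hash, comma, parentheses.
CASES = [
    "Why v2 supersedes v1: shifting complexity to the provider",
    'The "horizon" is rebuilt by the provider',
    "Tracks issue #42 and the C# bindings",
    "Smaller clients, simpler protocol, fewer bugs",
    "Horizon provider (server side) overview",
]


def round_trips(value: str) -> bool:
    """True if the emitted line is a single line that loads back to the same value."""
    line = _yaml_kv_line("brief", value)
    return "\n" not in line and yaml.safe_load(line) == {"brief": value}


def update_brief(fm: str, brief: str) -> str:
    """Update path: swap the existing brief line for a safely serialized one."""
    safe = _yaml_kv_line("brief", brief)
    # A callable replacement is inserted verbatim, so backslashes in the
    # brief are not interpreted as replacement-template escapes.
    return re.sub(r"brief:.*", lambda _m: safe, fm)


if __name__ == "__main__":
    for case in CASES:
        print(round_trips(case), _yaml_kv_line("brief", case))
```

The lambda in the update path matters for the same reason as the quoting: the brief is LLM-authored, so it should not be passed through anything that gives its characters special meaning.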
Impact
- Silent: OpenKB's own pipeline keeps working, so users may not notice until they consume wiki pages with an external tool (Obsidian, VS Code preview, doc generators, git-based publishing, future schema validation in OpenKB itself).
- Probabilistic: depends on whether the LLM happens to produce ": " in the one-sentence brief. Likelihood rises sharply for technical/comparative documents.
- Cumulative: every add adds more potentially-invalid pages; harder to retrofit later than to fix at write time.
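For knowledge bases that already contain affected pages, a small audit script can list them before any retrofit. This is only a sketch: split_frontmatter and find_invalid_pages are hypothetical helpers that are not part of OpenKB, the frontmatter is assumed to sit between leading --- fences as in the example above, and PyYAML is assumed to be installed.

```python
from pathlib import Path

import yaml


def split_frontmatter(text):
    """Return the text between the leading '---' fences, or None if absent."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return "\n".join(lines[1:i])
    return None  # opening fence without a closing one


def find_invalid_pages(wiki_dir):
    """List .md files under wiki_dir whose frontmatter a strict loader rejects."""
    bad = []
    for path in sorted(Path(wiki_dir).rglob("*.md")):
        fm = split_frontmatter(path.read_text(encoding="utf-8"))
        if fm is None:
            continue  # no frontmatter to validate
        try:
            yaml.safe_load(fm)
        except yaml.YAMLError:
            bad.append(path)
    return bad


if __name__ == "__main__":
    for page in find_invalid_pages("wiki"):
        print(page)
```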
Notes
- The same naive f-string pattern exists for sources: [{source_file}], but source_file is code-generated and sanitized, so it's currently safe. Worth keeping in mind if future code lets unsanitized values into the sources list.
- This is unrelated to LLM provider — any model that generates natural-language summaries (Claude, GPT, Gemini, local) will eventually produce a colon-bearing brief.
Diagnostics (auto-collected by openkb feedback)
- openkb: 0.2.1.dev3+g91cf6d22c
- python: 3.11.14
- platform: Linux 6.17.0-1011-oracle
- kb_initialised: yes