[bug] brief field with colon in YAML frontmatter produces invalid … #67

@invu557

brief field with colon in YAML frontmatter produces invalid YAML in concept and summary pages

Summary

openkb add generates wiki/concepts/*.md (and to a lesser extent wiki/summaries/*.md) with YAML frontmatter where the brief: value is written as an unquoted plain scalar. When the LLM-generated brief contains ": " (a colon followed by space) — which is very common in natural-language summaries — the resulting frontmatter is invalid YAML and breaks any external tool that strictly parses it (VS Code's Markdown extensions, Obsidian plugins, yaml lint, generic YAML loaders, etc.).

OpenKB's own internal helpers (e.g. _read_concept_briefs in agent/compiler.py) read brief via string slicing (line[len("brief:"):]), so the failure is silent within OpenKB — it only surfaces when wiki pages are consumed by external YAML-aware tools. That is plausibly why this hasn't been reported despite likely being triggered by many users.
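To illustrate why the failure stays silent, here is a minimal sketch of a slicing-based reader applied to the frontmatter reported further down in this issue. The function `read_brief_by_slicing` is a hypothetical stand-in written for this illustration, not OpenKB's actual `_read_concept_briefs`; it only reuses the `line[len("brief:"):]` idiom quoted above.

```python
PAGE = (
    "---\n"
    "sources: [summaries/v2s_chapter_02.md, summaries/v2s_chapter_01.md]\n"
    "brief: Why ADASIS v2 supersedes v1: shifting horizon reconstruction "
    "complexity from client applications to the provider.\n"
    "---\n"
)


def read_brief_by_slicing(page_text: str) -> str:
    """Hypothetical slicing-based reader; it never parses the YAML."""
    for line in page_text.splitlines():
        if line.startswith("brief:"):
            # Everything after the key is taken verbatim, colons included.
            return line[len("brief:"):].strip()
    return ""


brief = read_brief_by_slicing(PAGE)
print(brief)
```

Because the value is never handed to a YAML parser, the second colon is just another character, and the reader returns the full sentence without complaint.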

Reproduction

  1. Run openkb init in an empty directory, then drop in any .md document whose topic naturally invites an "X: Y" style one-liner (technical specs, comparisons, "why X over Y" themes).
  2. openkb add path/to/doc.md
  3. Inspect wiki/concepts/*.md. With non-trivial probability one of the generated briefs contains ": " and the frontmatter is invalid YAML.

Concrete example actually produced in my run (ADASIS v2 spec corpus):

---
sources: [summaries/v2s_chapter_02.md, summaries/v2s_chapter_01.md]
brief: Why ADASIS v2 supersedes v1: shifting horizon reconstruction complexity from client applications to the provider.
---

VS Code Markdown Preview surfaces:

Failed to parse frontmatter
Nested mappings are not allowed in compact mappings at line 2, column 8:
brief: Why ADASIS v2 supersedes v1: shifting horizon reconstruction complexity ...
       ^

yaml.safe_load confirms: it raises yaml.scanner.ScannerError ("mapping values are not allowed here").
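The check can be reproduced outside OpenKB with a few lines of PyYAML. This is a minimal sketch that assumes PyYAML is installed; `parse_error` is a helper name introduced here for illustration only.

```python
from typing import Optional

import yaml

# Frontmatter body from the example above, without the '---' fences.
FRONTMATTER = (
    "sources: [summaries/v2s_chapter_02.md, summaries/v2s_chapter_01.md]\n"
    "brief: Why ADASIS v2 supersedes v1: shifting horizon reconstruction "
    "complexity from client applications to the provider.\n"
)


def parse_error(text: str) -> Optional[str]:
    """Return the PyYAML exception class name if text fails to parse, else None."""
    try:
        yaml.safe_load(text)
    except yaml.YAMLError as exc:
        return type(exc).__name__
    return None


print(parse_error(FRONTMATTER))
```

Quoting the brief value, or removing the second colon, makes the same text load cleanly.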

Root cause

openkb/agent/compiler.py writes brief through a plain f-string, with no quoting or escaping, in two locations:

  • Line ~534 (update path):
    fm = re.sub(r"brief:.*", f"brief: {brief}", fm)
  • Line ~546 (create path):
    fm_lines.append(f"brief: {brief}")

The brief value is the only LLM-authored field in the frontmatter (other fields — doc_type, full_text, sources, source paths — are code-generated and sanitized). So the existing assumption in schema.py ("frontmatter is managed by code") is correct in intent but violated for this one field, because the code doesn't quote/escape what the LLM provides.

Suggested fix

Route brief through yaml.safe_dump so PyYAML auto-quotes when needed:

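import re
import yaml

def _yaml_kv_line(key: str, value: str) -> str:
    """Serialize one key/value pair as a single YAML line, quoting when needed."""
    line = yaml.safe_dump(
        {key: value},
        default_flow_style=False,
        width=10**9,         # effectively disables line folding
        allow_unicode=True,  # keep non-ASCII text unescaped
    ).strip()
    return line.split("\n")[0]

# replace both call sites with:
# create path
fm_lines.append(_yaml_kv_line("brief", brief))
# and update path; the callable replacement keeps re.sub from
# interpreting backslashes in the brief as escape sequences
safe = _yaml_kv_line("brief", brief)
fm = re.sub(r"brief:.*", lambda _m: safe, fm)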

I have applied exactly this monkey-patch locally and verified round-trip for 5 cases (colon, double-quote, hash, comma, parentheses) — all pass yaml.safe_load. The patch is ~5 lines.
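A round-trip check along those lines might look like the sketch below. The five sample briefs are made up for illustration (they are not the strings from my run), and `yaml_kv_line` is a variant of the helper above with one added assumption: whitespace is collapsed first, since a brief containing a line break may serialize across several lines, in which case keeping only the first line would truncate it.

```python
import yaml


def yaml_kv_line(key: str, value: str) -> str:
    """Variant of the helper above; whitespace collapsing is an added assumption."""
    flat = " ".join(value.split())  # a stray newline becomes a single space
    dumped = yaml.safe_dump(
        {key: flat},
        default_flow_style=False,
        width=10**9,
        allow_unicode=True,
    )
    return dumped.strip().split("\n")[0]


# Illustrative inputs, one per character class mentioned above.
CASES = {
    "colon": "Why v2 supersedes v1: less client-side work",
    "double-quote": 'The "horizon" abstraction explained',
    "hash": "Issue #67 and related fixes",
    "comma": "Fast, simple, and safe",
    "parentheses": "Horizon provider (server side) overview",
}

for name, brief in CASES.items():
    line = yaml_kv_line("brief", brief)
    # The serialized line must load back to exactly the original value.
    assert yaml.safe_load(line) == {"brief": brief}, name
    print(line)
```

PyYAML decides per value whether quoting is needed, so briefs without problematic characters stay unquoted and existing pages do not change more than necessary.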

Impact

  • Silent: OpenKB's own pipeline keeps working, so users may not notice until they consume wiki pages with an external tool (Obsidian, VS Code preview, doc generators, git-based publishing, future schema validation in OpenKB itself).
  • Probabilistic: depends on whether the LLM happens to produce ": " in the one-sentence brief. Likelihood rises sharply for technical/comparative documents.
  • Cumulative: every add adds more potentially-invalid pages; harder to retrofit later than to fix at write time.
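For knowledge bases that already contain affected pages, a scan like the following could locate them. It is only a sketch under assumptions: it expects pages at `<wiki_root>/<subdir>/*.md` (matching wiki/concepts and wiki/summaries), treats the text between the first two `---` lines as frontmatter, and uses helper names (`split_frontmatter`, `find_invalid_pages`) that are invented here and not part of OpenKB.

```python
import tempfile
from pathlib import Path

import yaml


def split_frontmatter(text):
    """Return the text between the leading '---' lines, or None if absent."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i])
    return None  # opening fence without a closing one


def find_invalid_pages(wiki_root):
    """List pages under wiki_root/<subdir>/*.md whose frontmatter fails to parse."""
    bad = []
    for page in sorted(Path(wiki_root).glob("*/*.md")):
        frontmatter = split_frontmatter(page.read_text(encoding="utf-8"))
        if frontmatter is None:
            continue
        try:
            yaml.safe_load(frontmatter)
        except yaml.YAMLError:
            bad.append(page)
    return bad


# Demo on a throwaway directory with one broken and one valid page.
with tempfile.TemporaryDirectory() as tmp:
    concepts = Path(tmp) / "concepts"
    concepts.mkdir()
    (concepts / "broken.md").write_text(
        "---\nbrief: Why v2: less work\n---\nbody\n", encoding="utf-8"
    )
    (concepts / "fine.md").write_text(
        "---\nbrief: 'Why v2: less work'\n---\nbody\n", encoding="utf-8"
    )
    invalid_names = [p.name for p in find_invalid_pages(tmp)]

print(invalid_names)
```

The scan only reports pages; rewriting the brief line of each hit with a quoting helper would be a separate step.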

Notes

  • Same naive f-string pattern exists for sources: [{source_file}], but source_file is code-generated and sanitized, so it's currently safe. Worth keeping in mind if future code lets unsanitized values into the sources list.
  • This is unrelated to LLM provider — any model that generates natural-language summaries (Claude, GPT, Gemini, local) will eventually produce a colon-bearing brief.

Diagnostics (auto-collected by openkb feedback)
  • openkb: 0.2.1.dev3+g91cf6d22c
  • python: 3.11.14
  • platform: Linux 6.17.0-1011-oracle
  • kb_initialised: yes
