Linkml conversion tooling#387

Draft
yarikoptic wants to merge 79 commits into master from linkml-conversion

yarikoptic (Member) commented Mar 20, 2026

This is an extract with amends from

which (branch linkml-auto-converted) would keep merging this branch into itself while reflecting changes in this branch (which could be rebased or gain merges from master), and can also accumulate or drop "patch branches" through the list in its script that defines what to patch with.

This way linkml-auto-converted would reflect the current state of the conversion.

TODO/PLAN

  • Establish branch linkml-auto-converted -- that one in WiP: Branch with auto converted linkml model #381
  • Made ‘hatch’ script (you could add pydantic2linkml as a dependency there) to convert orig_models.py into dandischema/models.yaml: hatch ... TODO
  • Translated the original models.py into dandischema/models.yaml and overlaid it with a [dandischema/models_overlay.yaml] overlay file.
  • script tools/linkml_conversion to convert into ‘linkml-auto-converted’
  • define model_instances.yaml (or alike) which would define pre-populated records such as standards (bids, nwb, ...). aim for potentially multiple classes there.
  • add a github workflow here which would react to changes into 'master' and this branch and with manual dispatch, which would first merge master into this branch, then run the script, and push results to linkml-auto-converted branch. This way we would always have 'up to date' and automatically updated state of that branch.
  • address "notes" about failed conversions one way (changing the current dandi-schema pydantic model) or another (pydantic2linkml), or:
    • we can add a custom script to "enhance" the auto-generated linkml model, addressing any needed changes programmatically!
    • we can have a branch (or just a .patch file) with changes to perform on top of the converted linkml
  • ...
  • Then produce a pydantic model out of this patched model that is sufficient (although potentially more relaxed) to replace the current pydantic model.

candleindark and others added 17 commits March 13, 2026 17:01
Specify Hatch-managed env for auto converting
`dandischema.models` to LinkML schema and
back to Pydantic models

Provide script to translate `dandischema.models`
into a LinkML schema and overlay it with
definitions provided by an overlay file.

Provide script to translate `dandischema/models.yaml`
back to Pydantic models and store them in
`dandischema/models.py`

The previous BRE pattern used `\+` (GNU sed extension) which silently
fails on macOS BSD sed. Switch to `-E` (extended regex) with POSIX
character class `[^[:space:]]` instead of `\S` (also unsupported by BSD
sed), making the normalization work on both macOS and Linux.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Expand comment for linkml-auto-converted hatch env with usage instructions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There is no prefix defined as `dandi_default`.
The intended default prefix is `dandi`
…ed and some symbols from _orig for now

we do it so it does not overlay models.py since then git
is unable to track renames
we had to maintain original filename for models.py to apply patches
easily
Comment thread tools/linkml_conversion Outdated
# Poor man patch queue implementation
# Edit this list if you want to merge or drop PRs branches to be patched with.
# Order matters
branches_to_merge=( remove-discriminated-unions )
Member Author

that is where we define branches from PRs to merge!

@codecov

codecov Bot commented Mar 20, 2026

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 49.05%. Comparing base (d752738) to head (4e77ee5).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| dandischema/models_importstab.py | 0.00% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #387      +/-   ##
==========================================
+ Coverage   48.31%   49.05%   +0.74%     
==========================================
  Files          19       20       +1     
  Lines        2434     2436       +2     
==========================================
+ Hits         1176     1195      +19     
+ Misses       1258     1241      -17     
| Flag | Coverage Δ |
|---|---|
| unittests | 49.05% <0.00%> (+0.74%) ⬆️ |

@candleindark candleindark force-pushed the linkml-conversion branch 2 times, most recently from 59b0587 to c0fbd02 Compare March 31, 2026 00:59
…nator

`dandischema.models` uses `schemaKey` in each
Pydantic model as a de facto type designator
(in LinkML terms). However, direct translation
to LinkML based on an individual model's
definition is not possible. The override
provided in the merge file completes the
translation.
Comment thread pyproject.toml Outdated
candleindark and others added 6 commits April 29, 2026 16:51
Naming the aggregated report `README.md` means GitHub renders it
automatically as the landing view when the output directory is
presented as a repo, so a reader lands on the validation report
without having to click into a file. The on-disk layout is otherwise
unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`references/DESIGN.md` and `references/OUTPUT.md` largely duplicated
content already present in the three scripts' module docstrings and
inline comments — the maintenance cost of two parallel sources
outweighed the progressive-disclosure benefit at this scale. Anyone
needing the on-disk layout or design rationale can read the relevant
script directly.

SKILL.md's "Further reading" section is replaced with a one-line
pointer to the script docstrings.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`validate_metadata.py` now runs ``dandischema.metadata.migrate`` on
each version's raw metadata before validating. The migrated instance
is persisted as ``metadata_migrated.json`` (the verbatim
``metadata.json`` is left untouched) and is what the LinkML validator
sees. Versions whose migration fails are recorded with the error and
skipped for validation — the validator never sees something the
migrator couldn't handle.

`validation.json` gains ``migration_status`` (``"success"`` /
``"failed"``) and ``migration_error`` fields. On migration failure
``problems`` is empty and ``exit_code`` is null. The CLI-equivalent
transcript is replaced by a one-line ``Migration failed: …`` notice
and ``SUMMARY.md`` calls the failure out instead of rendering a
problems block.

`generate_report.py` distinguishes migration failures in both the
overall headline (``N valid / M migration-failed / P with problems``)
and per-bucket sections (extra ``Migration failed:`` count when
non-zero) and renders a ``[migration failed]`` cell in the
per-version table linking to the version's ``SUMMARY.md`` for the
failure detail. Migration-failed versions are excluded from problem
pattern grouping since validation never ran for them.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
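
For readers skimming the commit above, the record shape and the headline it describes might look roughly like the following. This is a minimal sketch, not the code in `validate_metadata.py` or `generate_report.py`: the function names are hypothetical, and setting `migration_error` to `None` on success is an assumption the commit message does not state.

```python
from typing import Any, Optional


def make_validation_record(
    migration_error: Optional[str],
    problems: Optional[list] = None,
    exit_code: Optional[int] = None,
) -> dict:
    """Build a validation.json-like record (hypothetical helper)."""
    if migration_error is not None:
        # Per the commit message: on migration failure `problems` is
        # empty and `exit_code` is null, since validation never ran.
        return {
            "migration_status": "failed",
            "migration_error": migration_error,
            "problems": [],
            "exit_code": None,
        }
    return {
        "migration_status": "success",
        "migration_error": None,  # assumption: null when migration succeeds
        "problems": list(problems or []),
        "exit_code": exit_code,
    }


def headline(records: list) -> str:
    """Render the `N valid / M migration-failed / P with problems` headline."""
    failed = sum(r["migration_status"] == "failed" for r in records)
    valid = sum(
        r["migration_status"] == "success" and not r["problems"] for r in records
    )
    with_problems = len(records) - failed - valid
    return f"{valid} valid / {failed} migration-failed / {with_problems} with problems"
```

Keeping migration failures in their own bucket means they are never counted as either valid or problematic, which matches their exclusion from problem-pattern grouping.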
The output tree no longer namespaces under a `<short-sha>/` directory.
All artifacts live under one flat root:

    linkml-validation-reports/
    ├── README.md
    └── data/<dandiset>/<version>/{metadata.json, info.json,
                                   metadata_migrated.json,
                                   validation.json, validation.txt,
                                   SUMMARY.md}

Raw metadata is schema-independent and only fetched once; subsequent
runs against a different schema reuse it.

`validate_metadata.py`'s resume guard is now schema-aware. Each
`validation.json` is stamped with the SHA-256 of the schema file's
bytes (`schema_sha256` field). On a re-run the guard skips a version
only when its stamp matches the current schema, so a schema-content
change — committed *or* uncommitted — re-runs migration and
validation automatically without `--refresh`. `--refresh` is now
documented as a forceful override only.

Per-version logging moved into `_validate_one`. The function used to
return a `(target_class, migration_status, n_problems)` tuple
consumed *only* by the orchestrator's per-version log line. With the
log emitted at the decision site (resumed / migration failed /
migrated and validated), the return value carried no information and
has been dropped.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
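
The schema-aware resume guard described above could be sketched as follows. The `schema_sha256` field name comes from the commit message; the function names, the `refresh` parameter, and the handling of unreadable stamps are assumptions made for illustration, not the actual `validate_metadata.py` code.

```python
import hashlib
import json
from pathlib import Path


def schema_sha256(schema_path: Path) -> str:
    """SHA-256 of the schema file's bytes, committed or not."""
    return hashlib.sha256(schema_path.read_bytes()).hexdigest()


def should_skip(validation_json: Path, current_sha: str, refresh: bool = False) -> bool:
    """Skip a version only when its stamp matches the current schema."""
    if refresh or not validation_json.exists():
        return False
    try:
        stamp = json.loads(validation_json.read_text()).get("schema_sha256")
    except (json.JSONDecodeError, AttributeError):
        # Assumption: an unreadable or unexpected record is simply re-run.
        return False
    return stamp == current_sha
```

Hashing the file bytes rather than relying on a git revision is what lets an uncommitted schema edit invalidate earlier results.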
- patch -p1 --forward: skip hunks that are already applied (default
  `Assume -R?` would silently revert them since stdin is the diff).
- After the patch loop, inspect captured output: abort on real
  conflicts ("failed" / "FAILED"), tolerate non-zero exit only when a
  skip indicator ("previously applied" / "reversed" / "skipping patch")
  is present.
- Delete .rej files left by --forward.
- Replace `git commit -a` with `git add -A; git commit` so newly
  introduced files from patched branches are included in the merge
  commit.
Comment thread tools/linkml_conversion
Comment on lines +64 to +73
if [ "$status" -ne 0 ]; then
    if grep -qi 'failed' <<<"$out"; then
        echo "patch FAILED for branch $b — see rejects" >&2
        exit 1
    fi
    if ! grep -qiE 'previously applied|reversed|skipping patch' <<<"$out"; then
        echo "patch exited $status for branch $b without a recognized skip indicator; aborting" >&2
        exit 1
    fi
fi
Member Author

Suggested change
-if [ "$status" -ne 0 ]; then
-    if grep -qi 'failed' <<<"$out"; then
-        echo "patch FAILED for branch $b — see rejects" >&2
-        exit 1
-    fi
-    if ! grep -qiE 'previously applied|reversed|skipping patch' <<<"$out"; then
-        echo "patch exited $status for branch $b without a recognized skip indicator; aborting" >&2
-        exit 1
-    fi
-fi
+if [ "$status" -ne 0 ]; then
+    if grep -qiE 'previously applied|reversed|skipping patch' <<<"$out"; then
+        echo "patch exited $status for branch $b but it was about applied patch, ignoring"
+    else
+        echo "patch FAILED for branch $b with exit $status — see rejects" >&2
+        exit $status
+    fi
+fi
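
The decision logic in the suggested version can be restated as a small Python function, which makes the ordering of the checks easier to reason about. This is only an illustrative sketch: the function name and the three return labels are hypothetical, while the indicator strings are the ones used in the shell snippet.

```python
import re

# Same skip indicators as the grep -qiE pattern in the shell snippet.
SKIP_INDICATORS = re.compile(
    r"previously applied|reversed|skipping patch", re.IGNORECASE
)


def classify_patch_result(status: int, output: str) -> str:
    """Return 'applied', 'skipped', or 'failed' for one `patch` run."""
    if status == 0:
        return "applied"
    if SKIP_INDICATORS.search(output):
        # Non-zero exit, but the output says the patch was already applied.
        return "skipped"
    return "failed"
```

One consequence worth noting: with this ordering, output that contains both a skip indicator and a failed hunk is treated as skipped, whereas the earlier version checked for "failed" first and aborted.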

candleindark and others added 13 commits May 13, 2026 23:23
… override

The `2pydantic` hatch script previously stripped LinkML's namespace-prefix
munging from enum member names with `sed -E 's,[a-z]+COLON,,g'` over the
whole generated file. Replace that with an `enum.py.jinja` override passed
via `gen-pydantic --template-dir`, so the substitution happens at the
exact place the labels are emitted (using `pv.label.split("COLON") | last`)
rather than as blind text replacement after the fact.

Verified byte-identical output against the previous `sed`-based pipeline
on `dandischema/models.yaml` from `linkml-auto-converted` (all 116 `COLON`
occurrences are preceded by a lowercase prefix, so `split("COLON") | last`
is equivalent to the `sed` substitution on this schema).

Co-Authored-By: Claude Code 2.1.141 / Claude Opus claude-opus-4-7 <noreply@anthropic.com>
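
The equivalence claimed above between the `sed` substitution and `split("COLON") | last` holds only under the stated condition. A small Python sketch shows both behaviors side by side; the label strings are made-up examples, not values taken from `dandischema/models.yaml`.

```python
import re


def strip_prefix_sed(label: str) -> str:
    """Mimic `sed -E 's,[a-z]+COLON,,g'` on a single label."""
    return re.sub(r"[a-z]+COLON", "", label)


def strip_prefix_split(label: str) -> str:
    """Mimic the Jinja expression `pv.label.split("COLON") | last`."""
    return label.split("COLON")[-1]


# Hypothetical labels: a lowercase prefix, no prefix, and an uppercase prefix.
for label in ("dandiCOLONAuthor", "Author", "ABCCOLONAuthor"):
    print(label, strip_prefix_sed(label), strip_prefix_split(label))
```

The two agree on the first two labels and differ on the third, where the regex leaves the label untouched because no lowercase letters precede `COLON`. That is why the check that all occurrences are preceded by a lowercase prefix matters.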
Replace sed post-processing in `2pydantic` with gen-pydantic template override
…finement

Adds a detached hatch env (`linkml-behavior-test`) and a new test tree
(`tests/linkml_behavior/`) asserting that `linkml-validate` honors a
`slot_usage` entry that tightens an inherited slot's `required` from
`False` to `True` while preserving the slot's other inherited
constraints (here, `range`). This is the LinkML behavior the LinkML
version of `dandischema` relies on; see issue #405.

The env is detached (no dandischema install) with only `linkml` and
`pytest` as dependencies, and it does not pin a LinkML version so the
tests run against the latest release and surface upstream regressions
early. The new directory carries its own `pytest.toml` so pytest's
config-file discovery stops there and does not inherit the repo-root
`tox.ini` `[pytest]` section, which is tailored to the `dandischema`
package test suite.

A follow-up will add a GitHub Actions workflow that invokes the hatch
env so the same checks run in CI.

Co-Authored-By: Claude Code 2.1.139 / Claude Opus 4.7 claude-opus-4-7 <noreply@anthropic.com>
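
The behavior these tests assert can be pictured with a toy model. The sketch below is not LinkML code and does not call any LinkML API; it only illustrates the expected semantics, where a `slot_usage` entry overrides `required` while the inherited `range` still applies. The slot name and helper functions are hypothetical.

```python
def effective_slot(inherited: dict, slot_usage: dict) -> dict:
    """Overlay a class's slot_usage on the inherited slot definition."""
    return {**inherited, **slot_usage}


def check(instance: dict, slots: dict) -> list:
    """Return a list of problems for `instance` against toy slot specs."""
    problems = []
    for name, spec in slots.items():
        value = instance.get(name)
        if value is None:
            if spec.get("required", False):
                problems.append(f"{name}: required but missing")
        elif not isinstance(value, spec["range"]):
            problems.append(f"{name}: expected {spec['range'].__name__}")
    return problems


# Hypothetical parent slot: optional, with a string range.
base_slots = {"name": {"range": str, "required": False}}
# The child tightens `required` only; `range` is inherited unchanged.
child_slots = {"name": effective_slot(base_slots["name"], {"required": True})}
```

In this toy model an empty instance passes against `base_slots` but fails against `child_slots`, and a non-string `name` fails against both, which is the combination the real tests check with `linkml-validate`.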
Extends the `required_refinement` behavior tests to also assert that the
JSON Schema produced by running `gen-json-schema --title-from title`
against `schema.yaml` (matching the invocation in `pyproject.toml`'s
`linkml-auto-converted:2json` script) honors the `required: False -> True`
`slot_usage` refinement while preserving the inherited `range` constraint.
Validation is performed via the `check-jsonschema` CLI, which has JSON
Schema `format` keyword validation enabled by default.

Adds a `conftest.py` with two session-scoped fixtures:

- `json_schemas` — per-target-class JSON schemas generated once from
  `schema.yaml`.
- `json_instances` — the YAML data instances converted once to JSON.

The detached `linkml-behavior-test` hatch env now also depends on
`check-jsonschema` and `PyYAML`.

Co-Authored-By: Claude Code 2.1.139 / Claude Opus 4.7 claude-opus-4-7 <noreply@anthropic.com>
The `test_validate.py` and `test_json_schema_validate.py` files exercise
the same six `(target_class, instance)` cases against different
validators. Move the case lists into a shared `_cases.py` so adding or
adjusting a case updates both test files at once.

Co-Authored-By: Claude Code 2.1.141 / Claude Opus 4.7 claude-opus-4-7 <noreply@anthropic.com>
Add `test_pydantic_validate.py`, mirroring `test_validate.py` and
`test_json_schema_validate.py`. A new session-scoped `pydantic_module`
fixture runs `gen-pydantic --black --template-dir <...>` (matching the
invocation in the `linkml-auto-converted:2pydantic` script) and loads
the generated module dynamically; the new `instance_data` fixture
provides YAML instances parsed into Python dicts for `model_validate`.
The dynamically-loaded module is registered in `sys.modules` under a
topic-folder-suffixed name so future sibling fixtures don't collide.

Also add `black` and `pydantic` to the `linkml-behavior-test` env so
`--black` formatting works and `pydantic.ValidationError` is importable.

Co-Authored-By: Claude Code 2.1.141 / Claude Opus 4.7 claude-opus-4-7 <noreply@anthropic.com>
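
Loading a generated module dynamically and registering it under a unique name, as the commit above describes, can be done with the standard library alone. This is a sketch under assumptions: the function name and the `generated_models_<topic>` naming scheme are hypothetical, and the real fixture also runs `gen-pydantic` first.

```python
import importlib.util
import sys
from pathlib import Path
from types import ModuleType


def load_generated_module(path: Path, topic: str) -> ModuleType:
    """Import the Python file at `path` under a topic-suffixed module name."""
    module_name = f"generated_models_{topic}"  # hypothetical naming scheme
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load module from {path}")
    module = importlib.util.module_from_spec(spec)
    # Register before executing so code that looks the module up in
    # sys.modules during import can find it.
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module
```

Suffixing the name with the topic folder keeps two fixtures that load different generated files from overwriting each other's entry in `sys.modules`.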
Set the environment's Python version to 3.10,
the lowest currently supported Python
Runs the `tests/linkml_behavior/` suite under the `linkml-behavior-test`
hatch env on push/PR to `master`, on a daily 06:00 UTC schedule, and on
manual dispatch. The env doesn't pin a LinkML version, so the daily
schedule surfaces upstream LinkML regressions early.

Co-Authored-By: Claude Code 2.1.141 / Claude Opus 4.7 claude-opus-4-7 <noreply@anthropic.com>
The env provides `hatch run linkml-behavior-typing:check`, which runs
`mypy --install-types --non-interactive` over `tests/linkml_behavior/`.
It is kept distinct from `linkml-behavior-test` and from the tox
`typing` env (which targets `dandischema`).

Co-Authored-By: Claude Code 2.1.141 / Claude Opus 4.7 claude-opus-4-7 <noreply@anthropic.com>
Runs `hatch run linkml-behavior-typing:check` on push/PR to `master`.
No cron — typing breakages from upstream stub or library releases are
rare and typically don't cause runtime errors.

Co-Authored-By: Claude Code 2.1.141 / Claude Opus 4.7 claude-opus-4-7 <noreply@anthropic.com>
Add LinkML behavior tests for required: False -> True slot_usage refinement
Collection is supposed to be confined to `dandischema/` via the literal
positional arg in `commands = pytest -v {posargs} dandischema`. Under
tox 4.25 (local) that works as intended. Under tox 4.54 (CI), however,
the literal `dandischema` following `{posargs}` is dropped from the
invocation, leaving pytest to default to rootdir-based collection. That
made it walk into `tests/linkml_behavior/required_refinement/`, whose
dependencies (PyYAML, linkml, etc.) live in the dedicated
`linkml-behavior-test` hatch env rather than the tox env, and fail
collection with `ModuleNotFoundError: No module named 'yaml'`.

Set `testpaths = dandischema` in the `[pytest]` section so pytest's
default-collection target matches the intended scope of this test
suite. With that in place, the literal `dandischema` in the tox command
is redundant — and was the source of the tox-version-dependent
behavior — so drop it. A path explicitly supplied via `{posargs}` (e.g.
`tox -e py -- tests/foo.py`) still overrides `testpaths` and is now
honored cleanly instead of being combined with `dandischema`.

Co-Authored-By: Claude Code 2.1.143 / Claude Opus claude-opus-4-7 <noreply@anthropic.com>
Resolve conflict in `tox.ini` regarding
pytest settings
candleindark and others added 9 commits May 28, 2026 15:11
Introduce `docs/designs/migration_to_linkml_playbook/` as a living, self-updating
playbook for the ongoing migration of `dandischema` from Pydantic-defined models
to a LinkML-defined schema as the source of truth.

The foundation covers the problem statement and success criteria, the current
wiring of `./tools/linkml_conversion` (including the `2linkml` / `2pydantic` /
`2json` / `pydantic2json` Hatch scripts and the role of each file under
`tools/linkml_conversion_tools/`), a repeatable procedure with parity checks
against the dandi-archive frontend, an inventory of the patch queue applied
during translation (`master`, `remove-discriminated-unions`), and conventions
for the `log.md` / `findings.md` / `tools/` / `context/` subfiles. More content
will be added in subsequent commits.

Co-Authored-By: Claude Code 2.1.154 / Claude Opus 4.7 <noreply@anthropic.com>
Introduce `context/roles/` in the playbook, with a mandatory
`senior-developer.md` baseline (behavioral habits inherited by every
agent and subagent) plus topical stubs for Vue, Django, and LinkML
slices of the migration. Role files are loaded into a working agent's
context when a session touches the corresponding slice; the same files
also serve as spawn-prompt material when a subagent is invoked.

Each topical stub names its scope, its explicit not-in-scope handoffs to
sibling roles, and curated references for filling in its content
(including first-party LinkML AGENTS.md / SKILL.md material in
`linkml/linkml`, the LinkML specification at w3id.org, and community
subagent collections vetted as starting material). OVERVIEW.md gains a
pointer to `context/roles/` from the "How to use this directory"
section.

Co-Authored-By: Claude Code 2.1.154 / Claude Opus 4.7 <noreply@anthropic.com>
Fill in the "What this role needs to know" sections of `vue.md`,
`django.md`, and `linkml.md` from primary sources: the local
`dandi-archive` (frontend `package.json`, backend `pyproject.toml`, dev
docs) and `pydantic2linkml` (README, CLAUDE.md, source). Each role names
its stack landscape, the seam where dandischema crosses into that slice,
and a curated lift from upstream community subagent definitions with
caveats about version mismatches.

Shared findings across roles surfaced:

- @koumoul/vjsf is the form generator that consumes the JSON Schema in
  the dandi-archive frontend — the concrete consumer for criterion 3.
- dandi-archive pins dandischema==0.12.1 exact; dandi-cli pins
  dandischema ~= 0.12.0 — the generated Pydantic must satisfy both.
- `pydantic2linkml -M` is implemented via deepmerge.always_merger; the
  README's "values from the file win on conflict" oversimplifies. The
  actual per-type rule: dicts deep-merge, lists append, sets union,
  type-mismatches and scalars override. OVERVIEW.md's wiring and
  procedure step 7 are corrected to match, and a list-replacement
  escape hatch is documented (use `-O` for top-level lists, or fix on
  the Pydantic side).

Co-Authored-By: Claude Code 2.1.154 / Claude Opus 4.7 <noreply@anthropic.com>
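
The per-type merge rule listed above (dicts deep-merge, lists append, sets union, type mismatches and scalars override) can be illustrated without the `deepmerge` package. The function below is a standard-library sketch of that rule, not `deepmerge`'s implementation; unlike `always_merger`, it returns a new top-level object rather than merging in place.

```python
from typing import Any


def merge(base: Any, overlay: Any) -> Any:
    """Merge `overlay` into `base` following the per-type rule above."""
    if isinstance(base, dict) and isinstance(overlay, dict):
        merged = dict(base)
        for key, value in overlay.items():
            merged[key] = merge(base[key], value) if key in base else value
        return merged
    if isinstance(base, list) and isinstance(overlay, list):
        return base + overlay  # lists append rather than replace
    if isinstance(base, set) and isinstance(overlay, set):
        return base | overlay  # sets union
    return overlay  # scalars and type mismatches: overlay wins


base = {"slots": {"name": {"required": False}}, "examples": ["a"]}
overlay = {"slots": {"name": {"required": True}}, "examples": ["b"]}
print(merge(base, overlay))
```

The `examples` list in the result contains both entries, which is the case where "values from the file win on conflict" is misleading and a list-replacement escape hatch becomes necessary.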
Add a self-contained, runnable exhibit under the playbook's tools/
directory demonstrating that LinkML's `designates_type: true` makes the
generated Pydantic models and JSON Schema resolve a superclass-typed slot
value to its concrete subtype, and pins each class's `schemaKey` to its
class name consistently across both representations.

The exhibit is organized into:
- schemas/ — two source schemas (with/without the type designator) plus
  their generated Pydantic + JSON Schema snapshots
- subtype_resolution/ — demos asserting a Project survives in a
  BareAsset.wasGeneratedBy list only with the designator on
- schemakey_validation/ — demos asserting schemaKey validation
  (valid/absent/null/wrong-class) agrees between Pydantic and JSON Schema

Record the finding in findings.md. Demos run in the linkml-auto-converted
pipeline env (linkml 1.10.0). Exclude each exhibit's schemas/ folder from
pre-commit, since the generated snapshots are raw gen-pydantic /
gen-json-schema output that the linters should not touch.

Co-Authored-By: Claude Code 2.1.159 / Claude Opus 4.8 <noreply@anthropic.com>
The linkml-auto-converted pipeline env now pins linkml==1.11.1 (was
1.10.0). Regenerate the demo's Pydantic/JSON Schema snapshots under 1.11.1
and update the exhibit README + findings.md provenance accordingly. The
finding is unchanged: the designates_type behavior is identical across both
versions, and all four demos still pass.

Co-Authored-By: Claude Code 2.1.159 / Claude Opus 4.8 <noreply@anthropic.com>
LinkML's gen-json-schema hardcodes draft 2019-09; Pydantic and the
dandi-archive frontend's Ajv are both on 2020-12. The frontend deletes
$schema before validating, so the gap is mostly cosmetic except for
tuple arrays. Promote this into findings.md and partly answer the
matching open question in OVERVIEW.md.

Co-Authored-By: Claude Code 2.1.159 / Claude Opus 4.8 claude-opus-4-8 <noreply@anthropic.com>
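
The tuple-array difference mentioned above comes from a keyword change between the drafts: 2019-09 expresses tuples with an array-valued `items` plus `additionalItems`, while 2020-12 uses `prefixItems` plus `items`. The helper below is a hypothetical, top-level-only sketch of that mapping; it is not something the pipeline is known to contain, and it does not recurse into nested schemas.

```python
def tuple_schema_2019_to_2020(schema: dict) -> dict:
    """Rewrite a 2019-09 tuple-style array schema using 2020-12 keywords."""
    if not isinstance(schema.get("items"), list):
        # Not tuple-style: `items` keeps the same meaning in both drafts.
        return dict(schema)
    converted = {
        key: value
        for key, value in schema.items()
        if key not in ("items", "additionalItems")
    }
    converted["prefixItems"] = schema["items"]
    if "additionalItems" in schema:
        converted["items"] = schema["additionalItems"]
    return converted


draft_2019 = {
    "type": "array",
    "items": [{"type": "string"}, {"type": "number"}],
    "additionalItems": False,
}
print(tuple_schema_2019_to_2020(draft_2019))
```

A 2020-12 validator does not accept the array form of `items`, so a tuple-style schema emitted for 2019-09 is the case where the draft gap stops being cosmetic.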
…forces them

Verified against the linkml-auto-converted pipeline env (linkml 1.11.1): a
LinkML class rule (precondition value_presence PRESENT -> postcondition
required/minimum_value) is translated to an if/then block by gen-json-schema
and enforced, but gen-pydantic stores it only as inert linkml_meta metadata
and the generated model accepts violating instances. Also notes the
expressiveness limit: nested-slot and dict-key checks are not cleanly
expressible via rule slot_conditions.

Relevant to preserving dandischema's conditional (publish-only) validators
across the migration without losing Pydantic-side enforcement.

Co-Authored-By: Claude Code 2.1.161 / Claude Opus 4.8 claude-opus-4-8 <noreply@anthropic.com>

2 participants