Add vmagent role#2704

Draft
scibi wants to merge 3 commits into debops:master from scibi:feature/vmagent-role

scibi commented Jun 1, 2026

Member

Summary

Adds a new debops.vmagent role that installs and configures VictoriaMetrics' vmagent on Debian-family hosts. The role manages one or more named instances via a single hardened vmagent@.service systemd template unit, sources the binary from the upstream vmutils GitHub release archive (with mandatory SHA256 verification), and integrates with the standard DebOps secret mechanism for per-instance bearer tokens. It also wires the new playbook into layer/agent.yml so that debops run site configures vmagent alongside the other observability agents (Filebeat, Metricbeat, Telegraf, Zabbix Agent).

A first-class air-gapped install path is built in: the binary can be sourced from an internal HTTP(S) mirror (Nexus / Artifactory / MinIO / nginx), copied from the Ansible Controller, picked up from a path already on the host (e.g. Packer-baked), or its management can be skipped entirely. The same SHA256 contract applies to every channel.

Motivation

VictoriaMetrics does not publish an APT repository - the official binaries are distributed only as .tar.gz archives on the GitHub Releases page, which makes the debops.apt-style pattern unusable. Most DebOps deployments of vmagent today rely on either (a) running it as a container (which adds an unnecessary network namespace between the agent and the host exporters it scrapes) or (b) hand-rolled tasks for the resources role combined with manually maintained systemd units.

A dedicated DebOps role offers:

  • Multi-instance out of the box - one host can run several named instances (vmagent@default.service, vmagent@aggregator.service, ...) sharing the same binary with independent persistent queues and remote-write targets.
  • Air-gap as a first-class deployment mode - the binary delivery layer is a strict waterfall: skip / current / local archive / controller archive / URL download, with mandatory SHA256 verification in every branch. The default URL can be retargeted to an internal mirror with a single variable.
  • Pinned, reproducible installs - vmagent__version + vmagent__archive_sha256_map gate every install; bumping is an explicit two-line change.
  • DebOps secret integration - remote-write bearer tokens live under secret/vmagent/instances/<name>/bearer_token on the Controller, never in cleartext in inventory.
  • Hardened systemd unit - ProtectSystem=strict, PrivateTmp, ProtectKernel*, empty CapabilityBoundingSet, MemoryDenyWriteExecute, StateDirectory=vmagent/%i (only writable path for the daemon), and a TimeoutStopSec large enough for the persistent queue to flush before SIGKILL.
  • Idempotent restart-on-change semantics - changing one instance's config / env / secret / binary triggers a restart of that instance only.

Design

Systemd template unit

A single vmagent@.service template unit drives all instances. Per-instance state lives in two files under /etc/vmagent:

  • <name>.yml - Prometheus-format scrape_configs: consumed by -promscrape.config=.
  • <name>.env - EnvironmentFile= containing ARGS="..." (CLI flags).

Persistent queues live under /var/lib/vmagent/<name>/, restricted to the instance via StateDirectory=vmagent/%i so that hardening (ProtectSystem=strict) does not block writes.

Adding a new instance is a single YAML entry under vmagent__instances; removing one is state: 'absent' with full cleanup of unit / config / env / queue files.
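As a rough sketch of what such inventory entries might look like: the keys name, state and remote_write_urls are the ones the role validates, while the instance names and the example.org URLs below are placeholders. Any other per-instance keys are left out because their names are not shown in this description.

```yaml
# Inventory sketch (group_vars or host_vars); all values are placeholders.
vmagent__instances:

  # Would be deployed as vmagent@default.service
  - name: 'default'
    remote_write_urls:
      - 'https://vmetrics.example.org/api/v1/write'

  # Would be deployed as vmagent@aggregator.service, writing to two targets
  - name: 'aggregator'
    state: 'present'
    remote_write_urls:
      - 'https://vm-a.example.org/api/v1/write'
      - 'https://vm-b.example.org/api/v1/write'

  # Stopped, disabled and cleaned up (unit, config, env and queue files)
  - name: 'legacy'
    state: 'absent'
```

An entry without state is treated as present, which is why the first instance only needs a name and at least one remote-write URL.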

Validated instance definitions

Each instance passes through an assert in manage_instance.yml before any files are deployed:

- name: Validate vmagent instance '{{ instance.name | d("<unnamed>") }}' definition
  ansible.builtin.assert:
    that:
      - instance.name is defined and instance.name | length > 0
      - instance.state | d('present') in ['present', 'absent']
      - (instance.state | d('present') == 'absent')
        or (instance.remote_write_urls | d([]) | length > 0)

A missing name, an unknown state, or a present instance with no remote_write_urls fails the play immediately with a clear message instead of silently deploying a broken systemd unit.
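To illustrate the three conditions, here is a sketch of entries that the assert would reject or accept. The instance names and URLs are placeholders, and the entries are shown in one list only for compactness.

```yaml
vmagent__instances:

  # Rejected: 'name' is missing.
  - remote_write_urls:
      - 'https://vmetrics.example.org/api/v1/write'

  # Rejected: 'state' is neither 'present' nor 'absent'.
  - name: 'default'
    state: 'enabled'
    remote_write_urls:
      - 'https://vmetrics.example.org/api/v1/write'

  # Rejected: 'state' defaults to 'present', but there are no remote_write_urls.
  - name: 'aggregator'

  # Accepted: an absent instance does not need remote_write_urls.
  - name: 'legacy'
    state: 'absent'
```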

Binary install waterfall

The role evaluates five binary delivery stages, in order, and stops at the first one that applies:

  1. vmagent__skip_install: True - role is a no-op for binary management (image-baked deployments).
  2. vmagent -version already reports the matching version - short circuit, no downloads.
  3. vmagent__local_archive_path - archive already present on the remote host.
  4. vmagent__controller_archive_path - archive copied from the Ansible Controller via ansible.builtin.copy.
  5. Default: ansible.builtin.get_url from vmagent__release_url = {{ release_base_url }}/v{{ version }}/{{ archive_name }}.
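A sketch of how the delivery channel might be selected from inventory. The variable names come from the list above; the file paths, the key names inside vmagent__archive_sha256_map, the version format without a leading "v" and the checksum placeholders are assumptions made for illustration.

```yaml
# Pin the release; both values change together on a version bump.
vmagent__version: '1.144.0'  # assumed format, without the leading 'v'
vmagent__archive_sha256_map:
  'amd64': '<sha256 of the linux-amd64 archive>'  # placeholder, key name assumed
  'arm64': '<sha256 of the linux-arm64 archive>'  # placeholder, key name assumed

# Enable at most one of the channels below, depending on the environment.

# Image-baked hosts: the role leaves the binary alone.
# vmagent__skip_install: True

# Archive already present on the remote host (placeholder path).
# vmagent__local_archive_path: '/var/cache/vmagent/vmutils-linux-amd64-v1.144.0.tar.gz'

# Archive copied from the Ansible Controller (placeholder path).
# vmagent__controller_archive_path: 'files/vmutils-linux-amd64-v1.144.0.tar.gz'

# Internal mirror that replaces the GitHub Releases base URL.
vmagent__release_base_url: 'https://nexus.lan/raw/vendor/victoriametrics/releases'
```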

The same SHA256 from vmagent__archive_sha256_map[arch] is enforced in every branch. For the controller-side and local-host branches the check runs as a separate stat + assert step after the file is in place.
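A minimal sketch of such a stat + assert pair, not the role's actual tasks. The variables example_archive_path, example_expected_sha256 and example_archive_stat are hypothetical names introduced here; in the role the expected value would come from vmagent__archive_sha256_map.

```yaml
- name: Compute the SHA256 checksum of the staged archive
  ansible.builtin.stat:
    path: '{{ example_archive_path }}'  # hypothetical variable
    checksum_algorithm: 'sha256'
    get_checksum: true
  register: example_archive_stat

- name: Stop the play when the archive is missing or its checksum differs
  ansible.builtin.assert:
    that:
      - example_archive_stat.stat.exists
      - example_archive_stat.stat.checksum == example_expected_sha256
    fail_msg: 'SHA256 mismatch for {{ example_archive_path }}'
```

Checking existence first keeps the failure message meaningful when the file was never copied, since stat only returns a checksum for files that exist.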

Restart semantics

The role registers separate change variables for binary, config, env, and secret per instance, and only restarts when one of that instance's files changes:

- name: Restart on configuration change for vmagent instance {{ instance.name }}
  ansible.builtin.systemd:
    name: 'vmagent@{{ instance.name }}.service'
    state: 'restarted'
  when: (vmagent__register_config is changed) or
        (vmagent__register_env is changed) or
        (vmagent__register_secret is changed) or
        (vmagent__register_binary is defined and
         vmagent__register_binary is changed)

Editing one instance's scrape config does not bounce other instances on the same host - important when multiple instances target different remote-write tenants.
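The per-instance scoping can be pictured with the following sketch. The loop variable, the template path and the destination come from this description; the owner, group and mode values are assumptions, and the actual tasks in the role may differ.

```yaml
# tasks/main.yml (sketch): one include per instance
- name: Manage vmagent instances
  ansible.builtin.include_tasks: 'manage_instance.yml'
  loop: '{{ vmagent__combined_instances }}'
  loop_control:
    loop_var: 'instance'
    label: '{{ instance.name | d("<unnamed>") }}'

# tasks/manage_instance.yml (sketch): the result is registered inside the
# include, so it reflects only the instance currently being processed
- name: Generate scrape configuration for vmagent instance {{ instance.name }}
  ansible.builtin.template:
    src: 'etc/vmagent/instance.yml.j2'
    dest: '/etc/vmagent/{{ instance.name }}.yml'
    owner: 'root'     # assumption
    group: 'vmagent'  # assumption
    mode: '0640'      # assumption
  register: vmagent__register_config
```

Because the restart task lives in the same included file, it reads the registered results of the current iteration before the next instance overwrites them.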

Hardened systemd unit

vmagent only needs to read its config, write to its queue directory, listen on a loopback port for its HTTP endpoint, and open outbound TCP to the remote-write target. The hardened template applies ProtectSystem=strict, PrivateTmp, ProtectKernel*, ProtectClock, ProtectHostname, RestrictRealtime, RestrictSUIDSGID, RestrictNamespaces, LockPersonality, MemoryDenyWriteExecute, an empty CapabilityBoundingSet, and RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX AF_NETLINK. The only writable path is {{ vmagent__home }}/%i, scoped via ReadWritePaths. Additional restrictions (e.g. IPAddressAllow=) can be appended via vmagent__systemd_hardening_extra without forking the role.
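As an illustration of the extension point, the sketch below restricts network traffic of the unit to the loopback interface and a single remote-write target. It assumes that vmagent__systemd_hardening_extra accepts a block of raw unit directives; the exact data type is not shown in this description, and 192.0.2.10 is a documentation address used as a placeholder.

```yaml
# Assumed format: raw systemd directives appended to the [Service] section.
vmagent__systemd_hardening_extra: |
  IPAddressDeny=any
  IPAddressAllow=localhost
  IPAddressAllow=192.0.2.10
```

IPAddressDeny=any combined with explicit IPAddressAllow= entries is the usual allow-list form of these systemd directives.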

Hook points

tasks/vmagent/{pre_main,post_main}.yml are empty placeholders that project-level overrides can populate via the standard debops.debops.task_src lookup. Useful for hooks like "wait for vmagent to report /-/ready" or "register a Prometheus scrape target for the agent's own metrics".
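A sketch of what a project-level post_main.yml hook for the readiness case might contain. The listen address 127.0.0.1:8429 is an assumption (8429 is the upstream default vmagent HTTP port; the role may configure a different one), and example_vmagent_ready is a hypothetical variable name.

```yaml
- name: Wait for the vmagent HTTP endpoint to report readiness
  ansible.builtin.uri:
    url: 'http://127.0.0.1:8429/-/ready'  # assumed listen address
    status_code: 200
  register: example_vmagent_ready
  until: example_vmagent_ready is succeeded
  retries: 10
  delay: 3
```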

What this PR adds

  • Role: ansible/roles/vmagent/
    • defaults/main.yml (~370 lines, fully documented)
    • meta/main.yml, COPYRIGHT
    • tasks/main.yml - user/group/dirs, binary install include, systemd template unit, loop over vmagent__combined_instances
    • tasks/main_env.yml - pre-task computing vmagent__secret_directories
    • tasks/install_binary.yml - five-stage binary delivery waterfall with mandatory SHA256 verification
    • tasks/manage_instance.yml - per-instance deploy / remove block with validation and selective restart
    • tasks/vmagent/{pre_main,post_main}.yml - hook placeholders
    • templates/etc/vmagent/instance.yml.j2 - Prometheus scrape_configs: renderer
    • templates/etc/vmagent/instance.env.j2 - CLI-flag renderer (handles bools, scalars, lists for repeating flags)
    • templates/etc/systemd/system/vmagent@.service.j2 - hardened systemd template unit
    • templates/etc/ansible/facts.d/vmagent.fact.j2 - Python local facts exposing version + active instances
    • templates/lookup/vmagent_env_secret_directories.j2
  • Global handler: roles/global_handlers/handlers/vmagent.yml with the Restart vmagent instances handler, registered alphabetically in roles/global_handlers/handlers/main.yml.
  • Playbook: ansible/playbooks/service/vmagent.yml
    • Imports vmagent/main_env as pre_task to populate vmagent__secret_directories
    • Role order: secret (creates the secret directories) -> vmagent (main role). No keyring role is needed because there is no APT repository.
  • Site integration: ansible/playbooks/layer/agent.yml
    • Adds the import between Telegraf and Zabbix Agent.
  • Documentation: docs/ansible/roles/vmagent/
    • getting-started.rst - what vmagent is, prerequisites, minimal inventory, multi-instance pattern, secret management, tags
    • defaults-detailed.rst - vmagent__ref_instances, vmagent__ref_binary_source (waterfall), vmagent__ref_systemd_hardening_extra
    • guide-victoriametrics-integration.rst - typical patterns: per-host scrape + remote write, self-monitoring on the VictoriaMetrics host, bearer-token auth, multi-target fan-out, aggregator
    • guide-airgapped-install.rst - the four non-GitHub delivery patterns step by step
  • Role index: docs/ansible/role-index.rst updated.
  • Inventory wiring (view system):
    • ansible/views/system/inventory/groups.yml - new file: debops_service_vmagent.children = debops_all_hosts, so vmagent rolls out to every managed host without touching hosts_*.yml files.
    • ansible/views/system/inventory/group_vars/all/vmagent.yml - central VictoriaMetrics endpoint default (https://vmetrics.sciborek.com/api/v1/write) and a placeholder empty scrape_configs ready for node_exporter.
    • ansible/views/system/inventory/host_vars/vmetrics.sciborek.com/vmagent.yml - on the VictoriaMetrics host itself, writes to http://127.0.0.1:8428/api/v1/write to bypass nginx and DNS.
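The service playbook listed above can be sketched as follows. This is a simplified outline that follows the usual layout of DebOps service playbooks, not the file from this PR: the play name and task name are illustrative, and the collections and environment keys that DebOps playbooks normally carry are omitted.

```yaml
- name: Manage vmagent service
  hosts: [ 'debops_service_vmagent' ]
  become: True

  pre_tasks:

    # Computes vmagent__secret_directories before the secret role runs
    - name: Prepare vmagent environment
      ansible.builtin.import_role:
        name: 'vmagent'
        tasks_from: 'main_env'
      tags: [ 'role::vmagent', 'role::secret' ]

  roles:

    # Creates the per-instance secret directories on the Controller
    - role: secret
      tags: [ 'role::secret', 'role::vmagent' ]
      secret__directories:
        - '{{ vmagent__secret_directories }}'

    - role: vmagent
      tags: [ 'role::vmagent', 'skip::vmagent' ]
```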

Dependencies

Independent - depends only on master. No coupling with any other PR in flight.

Testing

Tested in a homelab DebOps deployment running Debian on unprivileged Proxmox LXC containers, with a central VictoriaMetrics single-node already exposed at vmetrics.sciborek.com:8428 via debops.docker_service + nginx:

  • Default GitHub path - clean install on a host with /usr/local/bin/vmagent absent. debops run service/vmagent downloads vmutils-linux-amd64-v1.144.0.tar.gz, verifies SHA256, installs the binary, deploys the unit and starts vmagent@default.service. Queries on the central VictoriaMetrics instance confirm ingestion within 30 s.
  • Internal mirror - host with vmagent__release_base_url: 'https://nexus.lan/raw/vendor/victoriametrics/releases' and the archive pre-uploaded to the mirror. No GitHub traffic; SHA256 verification still applies.
  • Air-gap (controller archive) - host with outbound TCP/443 blocked, vmagent__controller_archive_path pointed at a tarball in files/. Role copies + verifies + extracts; binary lands at the expected version.
  • Skip install (Packer-baked) - test image with binary at /usr/local/bin/vmagent, vmagent__skip_install: True. Role manages only configuration / systemd, never touches the binary.
  • Multi-instance - one host running vmagent@default.service + vmagent@aggregator.service. Modifying the aggregator's scrape_configs triggers exactly one restart - of the aggregator unit only.
  • State transitions - state: 'absent' on an instance stops + disables the unit, removes config / env / queue. Removing vmagent__instances entries is verified by inspecting ansible_local.vmagent.instances after a re-run.
  • Idempotency - re-running the playbook with no changes results in changed=0 for the role; vmagent.fact reports unchanged version.
  • Hardening - systemd-analyze security vmagent@default.service returns an Exposure level of 1.x SAFE.

Compatibility

  • Debian 12 (bookworm), Debian 13 (trixie), Ubuntu 22.04+.
  • vmagent v1.144.0 (the pinned default; bumping is a two-line change in defaults/main.yml).
  • linux-amd64 and linux-arm64 (via vmagent__arch_map). Other architectures require an extra entry in vmagent__archive_sha256_map.
  • systemd >= 247 (for the hardening directives in the unit; older systemd versions ignore unknown directives and only log a warning, so the unit still starts but with weaker hardening).

Checklist

  • Role follows DebOps role conventions (defaults / meta / tasks structure, vmagent__ variable namespace, [ 'role::vmagent', 'skip::vmagent' ] tags, become: True on the playbook level)
  • Playbook follows the standard service/<name>.yml structure (pre_task computes dependent vars, then secret -> main role)
  • Documentation under docs/ansible/roles/vmagent/ (getting-started, defaults-detailed, two guides, all listed in role-index.rst)
  • All new files include SPDX-License-Identifier and Copyright headers
  • tasks/vmagent/{pre_main,post_main}.yml hook placeholders provided so default-config users don't need to create empty files
  • Binary delivery waterfall: skip / current / local / controller / URL; SHA256 enforced in every branch
  • Hardened systemd unit with TimeoutStopSec large enough for persistent queue flush before SIGKILL
  • CI green (will be verified after push)

scibi force-pushed the feature/vmagent-role branch 4 times, most recently from 6e90968 to 501a042 on June 22, 2026 14:44
scibi and others added 3 commits July 3, 2026 13:40
Install and configure VictoriaMetrics vmagent via systemd template units,
with multi-instance support, SHA256-verified binary installs and DebOps
secret integration for remote-write bearer tokens.

Co-authored-by: Cursor <cursoragent@cursor.com>

Import the service playbook from layer/agent.yml alongside the other
observability agents, register global handlers, add the role to the
Monitoring section of role-index.rst and document the new role in
CHANGELOG.rst.

Co-authored-by: Cursor <cursoragent@cursor.com>
scibi force-pushed the feature/vmagent-role branch from 501a042 to 442dd6d on July 3, 2026 11:51