Skip to content

Export hotblocks storage size metrics NET-910#78

Open
kalabukdima wants to merge 3 commits into
masterfrom
dataset-sizes
Open

Export hotblocks storage size metrics NET-910#78
kalabukdima wants to merge 3 commits into
masterfrom
dataset-sizes

Conversation

@kalabukdima

Copy link
Copy Markdown
Contributor

Run a background task to roughly estimate how much disk space each dataset is using. Expose it with the metrics. These numbers may be used to quickly understand which datasets started growing, but not as a precise disk usage profiler.

Also move all the DB reads out of the scraping path to avoid locking the DB on calls to /metrics.

Run a background task to roughly estimate how much disk
space each dataset is using. Expose it with the metrics.
These numbers may be used to quickly understand which datasets
started growing, but not as precise disk usage profiler.

Also move all the DB reads out of the scraping path to avoid
locking the DB on calls to /metrics.
@kalabukdima kalabukdima requested a review from define-null June 25, 2026 09:55
@kalabukdima kalabukdima marked this pull request as ready for review June 25, 2026 09:56
@kalabukdima kalabukdima changed the title Export hotblocks storage size metrics NET-442 Export hotblocks storage size metrics NET-910 Jun 30, 2026
@define-null

Copy link
Copy Markdown
Contributor

Have you run the benchmarks? I'm not familiar with this code, but by the quick look - PR introduces a basically fetch_all loop across all datasets, which is triggered every 60 seconds. Given that it reads from disk, de-serializes the data - it would be great to validate how it affects the overall performance.

Additionally - have you considered keeping counters per dataset, with recording how much is added/removed at the places where it is done instead? If yes - why that idea was discarded?


#[instrument(name = "dataset_stats", skip_all)]
async fn dataset_stats_loop(db: DBRef, dataset_id: DatasetId, sender: tokio::sync::watch::Sender<DatasetStats>) {
const REFRESH: Duration = Duration::from_secs(60);

@define-null define-null Jun 30, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better move this to config parameters to better control the overhead, that this and similar loop causes

Comment thread crates/storage/src/db/db.rs Outdated
}

/// Approximate on-disk size of each column family as `(cf_name, bytes)`. Counts
/// SST files only, excluding WAL and memtables. Constant-time property reads.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not constant-time property reads
https://github.com/facebook/rocksdb/blob/08809f5e6cd9cc4bc3958dd4d59457ae78c76660/include/rocksdb/db.h#L579

//  "rocksdb.total-sst-files-size" - returns total size (bytes) of all SST
//      files.
//  WARNING: may slow down online queries if there are too many files.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for noticing! Moving to the background task as well

loop {
let db = db.clone();
let span = tracing::Span::current();
let result = tokio::task::spawn_blocking(move || {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more blocking task per dataset, per minute doesn't look right to me.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've measured the time of running, and it's <40 ms per dataset at worst, without blocking anything.
The time of running the global metadata estimator is <10 ms and holds the global locks for writes, but doesn't affect reads.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, would you mind sharing some details related to the overall load, size of the db, etc?

@define-null

Copy link
Copy Markdown
Contributor

There exist write_chunk/delete_chunk that could be instrumented. I suggest we go that route instead. You may get the current numbers from the DB at the startup of the system.

@kalabukdima

Copy link
Copy Markdown
Contributor Author

have you considered keeping counters per dataset, with recording how much is added/removed at the places where it is done instead? If yes - why that idea was discarded?

Even if we knew how much data is added/removed by every operation, we would need to read some initial value at the startup. But I don't think that storing N bytes of data in the KV storage corresponds to storing N bytes on disk. So I don't see a way to combine the values read from disk at startup and the incremental changes.

My plan was to build an image and test it in a live environment. But let me do it in advance for safety.

- Compute rocksdb_column_family_size_bytes on a background loop
  (storage_metrics_loop), published via a watch channel, instead of
  hitting RocksDB on every Prometheus scrape
- Share one configurable --storage-stats-interval-secs between this
  loop and the existing per-dataset stats loop
- Log measurement timing (elapsed_us) for both loops
- Fix doc comments that understated column_family_sizes' cost: it
  holds RocksDB's DB mutex and scales with live SST files, not O(1)

Part of NET-442
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants