Skip to content

Provide an API endpoint with statistics usable for monitoring#183

Merged
praiskup merged 1 commit into
praiskup:mainfrom
FrostyX:pool-statistics
May 18, 2026
Merged

Provide an API endpoint with statistics usable for monitoring#183
praiskup merged 1 commit into
praiskup:mainfrom
FrostyX:pool-statistics

Conversation

@FrostyX

@FrostyX FrostyX commented Apr 22, 2026

Copy link
Copy Markdown
Collaborator

Fix #124

Comment thread resallocserver/alembic/versions/9d9d3a7f7c8f_add_pool_statistics.py Fixed
Comment thread resallocserver/manager.py Fixed
Comment thread resallocwebui/app.py Fixed
@FrostyX

FrostyX commented Apr 22, 2026

Copy link
Copy Markdown
Collaborator Author

What do you think of this approach? Should I rewrite it to calculate the statistics on the fly from the Resource table?

@nikromen nikromen left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the approach (EMA stored in the Pool table + a simple JSON endpoint) is reasonable and simple. A few issues though, some of them are bugs:

Bugs

1. last_successful_start is set unconditionally

dbpool.last_successful_start = start_finished
session.add(dbpool)

This is set regardless of the success parameter. It should be guarded:

if success:
    dbpool.last_successful_start = start_finished

2. startup_success_rate is actually a failure rate

started_successfully = 0 if success else 1
dbpool.startup_success_rate = \
    alpha * started_successfully + (1 - alpha) * dbpool.startup_success_rate

When a resource starts successfully, started_successfully = 0, which pulls the rate toward 0. When it fails, started_successfully = 1, pulling toward 1. So startup_success_rate converges to 1 when everything is broken. That's a failure rate, not a success rate. Either flip the values (1 if success else 0) or rename the field to startup_failure_rate.

3. Race condition on self.pool.last_start

start_initiated = self.pool.last_start
start_finished = time.time()
startup_time = start_finished - start_initiated

self.pool.last_start is set in-memory on the shared Pool config object in allocate() (line 682). If two AllocWorkers run concurrently for the same pool, the second one overwrites self.pool.last_start before the first finishes. The first worker then calculates startup_time using the second worker's start time, yielding a wrong (too short) duration.

To fix this, save the start time per-worker at __init__ time or pass it into the worker, rather than reading a shared mutable attribute later.

Smaller issues

4. No initialization logic for the EMA

When startup_time_avg is 0 (fresh DB), the first calculation produces alpha * real_value + (1 - alpha) * 0, which gives a significantly dampened first measurement. Consider initializing to the actual value on first observation:

if dbpool.startup_time_avg == 0:
    dbpool.startup_time_avg = startup_time
else:
    dbpool.startup_time_avg = alpha * startup_time + (1 - alpha) * dbpool.startup_time_avg

Same applies to startup_success_rate.

5. session.add(dbpool) is unnecessary

dbpool was obtained via session.query(...), so it's already tracked by the session. Attribute changes are detected automatically. Not a bug, but dead code.

6. The TODO should be resolved before merge

# TODO Make sure this is true
# The assumption is, this is the place where we know a resource
# successfully started

This is indeed the right place (it's after run_command for cmd_new returns), so the comment can be removed. If you're unsure, verify in tests rather than leaving a TODO for the reviewer.

7. startup_time_avg is updated even on failure

Currently you record startup_time regardless of success. If the script fails quickly (e.g., instant cloud API rejection), the average will be pulled down, which is misleading. Consider only updating startup_time_avg on success, since you want to measure "how long does it take to get a working VM."

8. No tests

There are existing tests in tests/. It would be good to add a unit test for recalculate_statistics covering at least: first call, success vs failure, and the EMA convergence.


On the approach question

Should I rewrite it to calculate the statistics on the fly from the Resource table?

The current approach (pre-computed EMA in the Pool table) is fine for monitoring. It's O(1) per API call instead of scanning all resources. The tradeoff is that you lose history and can't re-parameterize after the fact, but for Nagios/Zabbix checks that just need a current health signal, this is the simpler approach. I'd keep it.

@nikromen nikromen dismissed their stale review May 7, 2026 11:22

that review was experiment with claude code to replace gemini where we don't have it, I'll post my real review soon

@nikromen nikromen left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is my human review :D

I also think claude has point on 7. We should check whether the resource actually booted so we do not shorten the average time by failures.

IMO the rest of claude comments are not worth spending time on it or missleading

Comment thread resallocserver/manager.py Outdated
"""
dbpool = session.query(models.Pool).get(self.pool.id)

start_initiated = self.pool.last_start

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after investigation, I agree here with claude on point 3 from #183 (review)

in https://github.com/praiskup/resalloc/blob/main/resallocserver/manager.py#L682 self.pool.last_start is set on shared and then AllocWorker is started. But multiple AllocWorkers for the same pool share the same self.pool. Another allocation can overwrite the self.pool.last_start so on reading this, it can lead to shorter startup time than one thinks it is.

However claude could explain why it is problem though 😆

passing the start time into the each worker should do the trick, so each has its own copy then.

so adding attribute to AllocWorker of a start time should do the trick, then something like:

AllocWorker(event, self, int(resource_id), start_time=time.time()).start()

would have the correct start time since each worker would have the correct copy

Comment thread resallocserver/manager.py Outdated
dbpool.startup_time_avg = \
alpha * startup_time + (1 - alpha) * dbpool.startup_time_avg

started_successfully = 0 if success else 1

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I think claude has also point here (point 2).

startup_success_rate should mean if the boot was success then the value should be higher and vice versa? now it is doing:

let alpha = 0.5
let startup_success_rate = 0.6
started_successfully = 0 if success else 1

success = 0:
  rate =0.5 * 0 + 0.5 * 0.6 = 0.3

fail = 1:
  rate = 0.5 * 1 + 0.5 * 0.6 = 0.8

Comment thread resallocserver/manager.py Outdated
dbpool.startup_success_rate = \
alpha * started_successfully + (1 - alpha) * dbpool.startup_success_rate

dbpool.last_successful_start = start_finished

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is point 1 from claude review... again elaboration on its side would be nice 😆

The issue here would be if we want to detect something like nothing has started with success for 2 hours, since the last_successful_start would be just time of when the VM failed. So we should either rename (and comment) the behavior that we want to track also the time "wasted" on booting failed VMs or do the if success: suggested by claude here

Comment thread resallocserver/manager.py Outdated
alpha = 0.5

dbpool.startup_time_avg = \
alpha * startup_time + (1 - alpha) * dbpool.startup_time_avg

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I didn't know this is so simple to implement.

Do we want to initialize the first dbpool.starup_time_avg with non-zero value (Eg the real first startup_time?)?

Comment thread resallocserver/manager.py Outdated

started_successfully = 0 if success else 1
dbpool.startup_success_rate = \
alpha * started_successfully + (1 - alpha) * dbpool.startup_success_rate

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

@FrostyX FrostyX force-pushed the pool-statistics branch from 909f777 to 2519e98 Compare May 16, 2026 13:47
@FrostyX FrostyX force-pushed the pool-statistics branch from 2519e98 to 4e88d05 Compare May 16, 2026 13:50
@FrostyX

FrostyX commented May 16, 2026

Copy link
Copy Markdown
Collaborator Author

Thank you both for the reviews. I addressed all the points except for one

  1. session.add(dbpool) is unnecessary

I like the explicit session.add statements so I left it.

@praiskup praiskup merged commit 4939aab into praiskup:main May 18, 2026
3 checks passed
@praiskup

Copy link
Copy Markdown
Owner

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Pool statistics

4 participants