Provide an API endpoint with statistics usable for monitoring by FrostyX · Pull Request #183 · praiskup/resalloc

FrostyX · 2026-04-22T11:38:46Z

FrostyX · 2026-04-22T11:40:39Z

What do you think of this approach? Should I rewrite it to calculate the statistics on the fly from the Resource table?

nikromen

Overall the approach (EMA stored in the Pool table + a simple JSON endpoint) is reasonable and simple. A few issues though, some of them are bugs:

Bugs

1. `last_successful_start` is set unconditionally

dbpool.last_successful_start = start_finished
session.add(dbpool)

This is set regardless of the success parameter. It should be guarded:

if success:
    dbpool.last_successful_start = start_finished

2. `startup_success_rate` is actually a failure rate

started_successfully = 0 if success else 1
dbpool.startup_success_rate = \
    alpha * started_successfully + (1 - alpha) * dbpool.startup_success_rate

When a resource starts successfully, started_successfully = 0, which pulls the rate toward 0. When it fails, started_successfully = 1, pulling toward 1. So startup_success_rate converges to 1 when everything is broken. That's a failure rate, not a success rate. Either flip the values (1 if success else 0) or rename the field to startup_failure_rate.

3. Race condition on `self.pool.last_start`

start_initiated = self.pool.last_start
start_finished = time.time()
startup_time = start_finished - start_initiated

self.pool.last_start is set in-memory on the shared Pool config object in allocate() (line 682). If two AllocWorkers run concurrently for the same pool, the second one overwrites self.pool.last_start before the first finishes. The first worker then calculates startup_time using the second worker's start time, yielding a wrong (too short) duration.

To fix this, save the start time per-worker at __init__ time or pass it into the worker, rather than reading a shared mutable attribute later.

Smaller issues

4. No initialization logic for the EMA

When startup_time_avg is 0 (fresh DB), the first calculation produces alpha * real_value + (1 - alpha) * 0, which gives a significantly dampened first measurement. Consider initializing to the actual value on first observation:

if dbpool.startup_time_avg == 0:
    dbpool.startup_time_avg = startup_time
else:
    dbpool.startup_time_avg = alpha * startup_time + (1 - alpha) * dbpool.startup_time_avg

Same applies to startup_success_rate.

5. `session.add(dbpool)` is unnecessary

dbpool was obtained via session.query(...), so it's already tracked by the session. Attribute changes are detected automatically. Not a bug, but dead code.

6. The TODO should be resolved before merge

# TODO Make sure this is true
# The assumption is, this is the place where we know a resource
# successfully started

This is indeed the right place (it's after run_command for cmd_new returns), so the comment can be removed. If you're unsure, verify in tests rather than leaving a TODO for the reviewer.

7. `startup_time_avg` is updated even on failure

Currently you record startup_time regardless of success. If the script fails quickly (e.g., instant cloud API rejection), the average will be pulled down, which is misleading. Consider only updating startup_time_avg on success, since you want to measure "how long does it take to get a working VM."

8. No tests

There are existing tests in tests/. It would be good to add a unit test for recalculate_statistics covering at least: first call, success vs failure, and the EMA convergence.

On the approach question

Should I rewrite it to calculate the statistics on the fly from the Resource table?

The current approach (pre-computed EMA in the Pool table) is fine for monitoring. It's O(1) per API call instead of scanning all resources. The tradeoff is that you lose history and can't re-parameterize after the fact, but for Nagios/Zabbix checks that just need a current health signal, this is the simpler approach. I'd keep it.

that review was experiment with claude code to replace gemini where we don't have it, I'll post my real review soon

nikromen

So this is my human review :D

I also think claude has point on 7. We should check whether the resource actually booted so we do not shorten the average time by failures.

IMO the rest of claude comments are not worth spending time on it or missleading

nikromen · 2026-05-07T12:01:32Z

+        """
+        dbpool = session.query(models.Pool).get(self.pool.id)
+
+        start_initiated = self.pool.last_start


after investigation, I agree here with claude on point 3 from #183 (review)

in https://github.com/praiskup/resalloc/blob/main/resallocserver/manager.py#L682 self.pool.last_start is set on shared and then AllocWorker is started. But multiple AllocWorkers for the same pool share the same self.pool. Another allocation can overwrite the self.pool.last_start so on reading this, it can lead to shorter startup time than one thinks it is.

However claude could explain why it is problem though 😆

passing the start time into the each worker should do the trick, so each has its own copy then.

so adding attribute to AllocWorker of a start time should do the trick, then something like:

AllocWorker(event, self, int(resource_id), start_time=time.time()).start()

would have the correct start time since each worker would have the correct copy

nikromen · 2026-05-07T12:18:54Z

+        dbpool.startup_time_avg = \
+            alpha * startup_time + (1 - alpha) * dbpool.startup_time_avg
+
+        started_successfully = 0 if success else 1


so I think claude has also point here (point 2).

startup_success_rate should mean if the boot was success then the value should be higher and vice versa? now it is doing:

let alpha = 0.5 let startup_success_rate = 0.6 started_successfully = 0 if success else 1 success = 0: rate =0.5 * 0 + 0.5 * 0.6 = 0.3 fail = 1: rate = 0.5 * 1 + 0.5 * 0.6 = 0.8

nikromen · 2026-05-07T12:23:31Z

+        dbpool.startup_success_rate = \
+            alpha * started_successfully + (1 - alpha) * dbpool.startup_success_rate
+
+        dbpool.last_successful_start = start_finished


This is point 1 from claude review... again elaboration on its side would be nice 😆

The issue here would be if we want to detect something like nothing has started with success for 2 hours, since the last_successful_start would be just time of when the VM failed. So we should either rename (and comment) the behavior that we want to track also the time "wasted" on booting failed VMs or do the if success: suggested by claude here

praiskup · 2026-05-11T04:08:55Z

+        alpha = 0.5
+
+        dbpool.startup_time_avg = \
+            alpha * startup_time + (1 - alpha) * dbpool.startup_time_avg


Nice, I didn't know this is so simple to implement.

Do we want to initialize the first dbpool.starup_time_avg with non-zero value (Eg the real first startup_time?)?

praiskup · 2026-05-11T04:16:06Z

+
+        started_successfully = 0 if success else 1
+        dbpool.startup_success_rate = \
+            alpha * started_successfully + (1 - alpha) * dbpool.startup_success_rate


Fix praiskup#124

FrostyX · 2026-05-16T14:04:22Z

Thank you both for the reviews. I addressed all the points except for one

session.add(dbpool) is unnecessary

I like the explicit session.add statements so I left it.

praiskup · 2026-05-18T05:27:35Z

Thank you!

github-advanced-security AI found potential problems Apr 22, 2026

View reviewed changes

Comment thread resallocserver/alembic/versions/9d9d3a7f7c8f_add_pool_statistics.py Fixed

Comment thread resallocserver/manager.py Fixed

Comment thread resallocwebui/app.py Fixed

FrostyX force-pushed the pool-statistics branch from a1d74aa to 909f777 Compare April 22, 2026 11:42

nikromen previously requested changes May 7, 2026

View reviewed changes

nikromen reviewed May 7, 2026

View reviewed changes

praiskup reviewed May 11, 2026

View reviewed changes

FrostyX force-pushed the pool-statistics branch from 909f777 to 2519e98 Compare May 16, 2026 13:47

Provide an API endpoint with statistics usable for monitoring

4e88d05

Fix praiskup#124

FrostyX force-pushed the pool-statistics branch from 2519e98 to 4e88d05 Compare May 16, 2026 13:50

praiskup merged commit 4939aab into praiskup:main May 18, 2026
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Provide an API endpoint with statistics usable for monitoring#183

Provide an API endpoint with statistics usable for monitoring#183
praiskup merged 1 commit into
praiskup:mainfrom
FrostyX:pool-statistics

FrostyX commented Apr 22, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

FrostyX commented Apr 22, 2026

Uh oh!

nikromen left a comment

Uh oh!

nikromen left a comment

Uh oh!

nikromen May 7, 2026

Uh oh!

nikromen May 7, 2026

Uh oh!

nikromen May 7, 2026

Uh oh!

praiskup May 11, 2026

Uh oh!

praiskup May 11, 2026

Uh oh!

FrostyX commented May 16, 2026

Uh oh!

Uh oh!

praiskup commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

FrostyX commented Apr 22, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

FrostyX commented Apr 22, 2026

Uh oh!

nikromen left a comment

Choose a reason for hiding this comment

Bugs

1. last_successful_start is set unconditionally

2. startup_success_rate is actually a failure rate

3. Race condition on self.pool.last_start

Smaller issues

4. No initialization logic for the EMA

5. session.add(dbpool) is unnecessary

6. The TODO should be resolved before merge

7. startup_time_avg is updated even on failure

8. No tests

On the approach question

Uh oh!

nikromen left a comment

Choose a reason for hiding this comment

Uh oh!

nikromen May 7, 2026

Choose a reason for hiding this comment

Uh oh!

nikromen May 7, 2026

Choose a reason for hiding this comment

Uh oh!

nikromen May 7, 2026

Choose a reason for hiding this comment

Uh oh!

praiskup May 11, 2026

Choose a reason for hiding this comment

Uh oh!

praiskup May 11, 2026

Choose a reason for hiding this comment

Uh oh!

FrostyX commented May 16, 2026

Uh oh!

Uh oh!

praiskup commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

1. `last_successful_start` is set unconditionally

2. `startup_success_rate` is actually a failure rate

3. Race condition on `self.pool.last_start`

5. `session.add(dbpool)` is unnecessary

7. `startup_time_avg` is updated even on failure