Provide an API endpoint with statistics usable for monitoring#183
Conversation
|
What do you think of this approach? Should I rewrite it to calculate the statistics on the fly from the |
a1d74aa to
909f777
Compare
nikromen
left a comment
There was a problem hiding this comment.
Overall the approach (EMA stored in the Pool table + a simple JSON endpoint) is reasonable and simple. A few issues though, some of them are bugs:
Bugs
1. last_successful_start is set unconditionally
dbpool.last_successful_start = start_finished
session.add(dbpool)This is set regardless of the success parameter. It should be guarded:
if success:
dbpool.last_successful_start = start_finished2. startup_success_rate is actually a failure rate
started_successfully = 0 if success else 1
dbpool.startup_success_rate = \
alpha * started_successfully + (1 - alpha) * dbpool.startup_success_rateWhen a resource starts successfully, started_successfully = 0, which pulls the rate toward 0. When it fails, started_successfully = 1, pulling toward 1. So startup_success_rate converges to 1 when everything is broken. That's a failure rate, not a success rate. Either flip the values (1 if success else 0) or rename the field to startup_failure_rate.
3. Race condition on self.pool.last_start
start_initiated = self.pool.last_start
start_finished = time.time()
startup_time = start_finished - start_initiatedself.pool.last_start is set in-memory on the shared Pool config object in allocate() (line 682). If two AllocWorkers run concurrently for the same pool, the second one overwrites self.pool.last_start before the first finishes. The first worker then calculates startup_time using the second worker's start time, yielding a wrong (too short) duration.
To fix this, save the start time per-worker at __init__ time or pass it into the worker, rather than reading a shared mutable attribute later.
Smaller issues
4. No initialization logic for the EMA
When startup_time_avg is 0 (fresh DB), the first calculation produces alpha * real_value + (1 - alpha) * 0, which gives a significantly dampened first measurement. Consider initializing to the actual value on first observation:
if dbpool.startup_time_avg == 0:
dbpool.startup_time_avg = startup_time
else:
dbpool.startup_time_avg = alpha * startup_time + (1 - alpha) * dbpool.startup_time_avgSame applies to startup_success_rate.
5. session.add(dbpool) is unnecessary
dbpool was obtained via session.query(...), so it's already tracked by the session. Attribute changes are detected automatically. Not a bug, but dead code.
6. The TODO should be resolved before merge
# TODO Make sure this is true
# The assumption is, this is the place where we know a resource
# successfully startedThis is indeed the right place (it's after run_command for cmd_new returns), so the comment can be removed. If you're unsure, verify in tests rather than leaving a TODO for the reviewer.
7. startup_time_avg is updated even on failure
Currently you record startup_time regardless of success. If the script fails quickly (e.g., instant cloud API rejection), the average will be pulled down, which is misleading. Consider only updating startup_time_avg on success, since you want to measure "how long does it take to get a working VM."
8. No tests
There are existing tests in tests/. It would be good to add a unit test for recalculate_statistics covering at least: first call, success vs failure, and the EMA convergence.
On the approach question
Should I rewrite it to calculate the statistics on the fly from the Resource table?
The current approach (pre-computed EMA in the Pool table) is fine for monitoring. It's O(1) per API call instead of scanning all resources. The tradeoff is that you lose history and can't re-parameterize after the fact, but for Nagios/Zabbix checks that just need a current health signal, this is the simpler approach. I'd keep it.
that review was experiment with claude code to replace gemini where we don't have it, I'll post my real review soon
nikromen
left a comment
There was a problem hiding this comment.
So this is my human review :D
I also think claude has point on 7. We should check whether the resource actually booted so we do not shorten the average time by failures.
IMO the rest of claude comments are not worth spending time on it or missleading
| """ | ||
| dbpool = session.query(models.Pool).get(self.pool.id) | ||
|
|
||
| start_initiated = self.pool.last_start |
There was a problem hiding this comment.
after investigation, I agree here with claude on point 3 from #183 (review)
in https://github.com/praiskup/resalloc/blob/main/resallocserver/manager.py#L682 self.pool.last_start is set on shared and then AllocWorker is started. But multiple AllocWorkers for the same pool share the same self.pool. Another allocation can overwrite the self.pool.last_start so on reading this, it can lead to shorter startup time than one thinks it is.
However claude could explain why it is problem though 😆
passing the start time into the each worker should do the trick, so each has its own copy then.
so adding attribute to AllocWorker of a start time should do the trick, then something like:
AllocWorker(event, self, int(resource_id), start_time=time.time()).start()
would have the correct start time since each worker would have the correct copy
| dbpool.startup_time_avg = \ | ||
| alpha * startup_time + (1 - alpha) * dbpool.startup_time_avg | ||
|
|
||
| started_successfully = 0 if success else 1 |
There was a problem hiding this comment.
so I think claude has also point here (point 2).
startup_success_rate should mean if the boot was success then the value should be higher and vice versa? now it is doing:
let alpha = 0.5
let startup_success_rate = 0.6
started_successfully = 0 if success else 1
success = 0:
rate =0.5 * 0 + 0.5 * 0.6 = 0.3
fail = 1:
rate = 0.5 * 1 + 0.5 * 0.6 = 0.8
| dbpool.startup_success_rate = \ | ||
| alpha * started_successfully + (1 - alpha) * dbpool.startup_success_rate | ||
|
|
||
| dbpool.last_successful_start = start_finished |
There was a problem hiding this comment.
This is point 1 from claude review... again elaboration on its side would be nice 😆
The issue here would be if we want to detect something like nothing has started with success for 2 hours, since the last_successful_start would be just time of when the VM failed. So we should either rename (and comment) the behavior that we want to track also the time "wasted" on booting failed VMs or do the if success: suggested by claude here
| alpha = 0.5 | ||
|
|
||
| dbpool.startup_time_avg = \ | ||
| alpha * startup_time + (1 - alpha) * dbpool.startup_time_avg |
There was a problem hiding this comment.
Nice, I didn't know this is so simple to implement.
Do we want to initialize the first dbpool.starup_time_avg with non-zero value (Eg the real first startup_time?)?
|
|
||
| started_successfully = 0 if success else 1 | ||
| dbpool.startup_success_rate = \ | ||
| alpha * started_successfully + (1 - alpha) * dbpool.startup_success_rate |
|
Thank you both for the reviews. I addressed all the points except for one
I like the explicit |
|
Thank you! |
Fix #124