Skip to content

Fix resolve token conflict logic under race condition#7554

Open
alexqyle wants to merge 4 commits into
cortexproject:masterfrom
alexqyle:duplicate-token
Open

Fix resolve token conflict logic under race condition#7554
alexqyle wants to merge 4 commits into
cortexproject:masterfrom
alexqyle:duplicate-token

Conversation

@alexqyle
Copy link
Copy Markdown
Contributor

@alexqyle alexqyle commented May 22, 2026

What this PR does:
Inside ring module, there is logic to check conflict/duplicated tokens and resolve the conflict. However, during race condition that two instances joined the ring at same time, this logic will not detect conflicts during the first CAS cycle because none of new instances appeared in the ring. On the following CAS calls, none of those instance would do conflict check again because it did not detect any token changed.

Also, there was another issue in the original code. The issue was that even resolve token function made the correct operation. Token from instances that were not detected as newly added or updated will not be updated with new tokens even they have conflicts.

This change fixs both issues in the code. With the fix, each CAS call would do conflict check while instance is still in JOINING state. It takes O(n*m) where n is number of instances in ring and m is number of tokens per instance. So the check should not add too much overhead for each CAS call and it only keeps constantly check during observe period. Also, add logic to make sure all instances got updated token during resolving conflict process would be properly updated. And this update should be idempotent to avoid updates around the same time produce the same result.

Which issue(s) this PR fixes:
NA

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
  • docs/configuration/v1-guarantees.md updated if this PR introduces experimental flags

alexqyle added 3 commits May 22, 2026 14:03
Signed-off-by: Alex Le <leqiyue@amazon.com>
…tegy"

This reverts commit aff3b18.

Signed-off-by: Alex Le <leqiyue@amazon.com>
Signed-off-by: Alex Le <leqiyue@amazon.com>
@danielblando
Copy link
Copy Markdown
Contributor

Nice catch. Thanks.

We do some other code that helps to avoid conflict tokens at
https://github.com/cortexproject/cortex/blob/master/pkg/ring/lifecycler.go#L88

would this fix the issue?

I wonder if we should use something similar to this config to check tokens everytime.
I think performance will be impacted if we just remove the tokensHasCahnged flag

Signed-off-by: Alex Le <leqiyue@amazon.com>
@alexqyle
Copy link
Copy Markdown
Contributor Author

Nice catch. Thanks.

We do some other code that helps to avoid conflict tokens at https://github.com/cortexproject/cortex/blob/master/pkg/ring/lifecycler.go#L88

would this fix the issue?

I wonder if we should use something similar to this config to check tokens everytime. I think performance will be impacted if we just remove the tokensHasCahnged flag

I updated the code to only do constantly conflic check when instance is still in JOINING state. JOINING state would last for ObservePeriod of time. And during ObservePeriod, instance would do heartbeat and call CAS and eventually would check token conflicts. So the heavy computation should only last for ObservePeriod of time. After instance becomes ACTIVE, the logic would be same as before this change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants