Skip to content

Avoid O(N) session-cache scan when removing a user#48758

Open
renekloth wants to merge 1 commit intokeycloak:mainfrom
techatspree:avoid-cache-scan-on-user-delete
Open

Avoid O(N) session-cache scan when removing a user#48758
renekloth wants to merge 1 commit intokeycloak:mainfrom
techatspree:avoid-cache-scan-on-user-delete

Conversation

@renekloth
Copy link
Copy Markdown

Summary

Fixes per-user delete latency that grows with Infinispan session-cache size

Problem

PersistentUserSessionProvider.onUserRemoved invokes a full entrySet().stream() over both the user-session and client-session caches (online and offline) to evict the user's entries. The scan cost scales with the total number of cached sessions, not with the user's own.

In a 26.x cluster with PERSISTENT_USER_SESSIONS enabled and many active offline-token users, the offline-session cache fills up over uptime as sessions are lazily loaded on refresh. Observed delete latency for a user with no sessions of their own:

Uptime DELETE /admin/realms/{realm}/users/{id}
just after restart ~50ms
1 day ~150ms
2 days ~1.5s
until max cache size reached ~6s

Independent of how many users have been deleted, only of how full the cache is.

Fix

Replace the full cache scan with O(K) targeted eviction:

  1. Look up the user's user-session id's and associated client UUIDs via a new persister query findUserSessionsByUserId (no lastSessionRefresh filter — idle-expired-but-still-cached entries must also be evicted).
  2. Run the existing bulk DB delete via UserSessionPersisterProvider.onUserRemoved.
  3. Evict only the known cache keys directly via getAdvancedCache().withFlags(...).remove, bypassing JpaChangesPerformer so no per-key DB delete is issued.

Cost is now O(K) over the user's own sessions, independent of total cache size.

Async-batch fallback

When --spi-user-sessions--infinispan--use-batches=true is active, the cache may legitimately hold entries whose INSERT is still in the writer queue and not yet visible to the persister. In that mode the previous full-cache-scan eviction path is retained as a fallback to remain correct.

Naming convention

@NamedQuery names now make the lastSessionRefresh filter explicit:

  • existing findUserSessionsByUserId (with filter) → renamed to findUserSessionsByUserIdLastSessionRefresh
  • new query (no filter) → findUserSessionsByUserId

The persister method follows the same convention. Discussion item: the other find* queries with the same filter (findUserSession, findUserSessionsByBrokerSessionId, findUserSessionsByClientId, findUserSessionsByExternalClientId) could be renamed for consistency in a follow-up cleanup.

Tests

  • UserSessionPersisterProviderTest.testFindUserSessionsByUserId — verifies the new query returns idle-expired sessions that loadUserSessionsStream would filter out.
  • UserSessionPersisterProviderTest.testFindUserSessionsByUserIdWithoutClientSessions — regression guard for the LEFT JOIN: a user session without any client session must still be returned (NULL clientId case).

Reviewer notes

  • The DB-side UserSessionPersisterProvider.onUserRemoved (bulk DELETE by userId) is unchanged — it remains the authoritative DB cleanup. The new code only changes the cache-eviction path.
  • The new findUserSessionsByUserId persister method has a default implementation in the interface that throws IllegalStateException; only JpaUserSessionPersisterProvider and the persistent-session caller need the dedicated implementation.
  • Index IDX_OFFLINE_USS_BY_USER (USER_ID, REALM_ID, OFFLINE_FLAG) on OFFLINE_USER_SESSION exists since 14.0.0 and supports the new query; no schema change needed.

@renekloth renekloth requested review from a team as code owners May 6, 2026 13:35
@renekloth renekloth force-pushed the avoid-cache-scan-on-user-delete branch from 0fd36ef to 1fcab32 Compare May 6, 2026 13:55
@renekloth renekloth marked this pull request as draft May 6, 2026 13:56
@renekloth renekloth force-pushed the avoid-cache-scan-on-user-delete branch from 1fcab32 to 01e3f02 Compare May 6, 2026 14:01
@renekloth renekloth marked this pull request as ready for review May 6, 2026 14:03
@pruivo
Copy link
Copy Markdown
Member

pruivo commented May 6, 2026

@renekloth, thank you for your PR. Can you share your benchmark? I want to run it myself because it may negatively affect users with multi-az deployments: one network call versus K network calls in networks with >10ms latency.

@renekloth renekloth force-pushed the avoid-cache-scan-on-user-delete branch from 01e3f02 to 81201ba Compare May 6, 2026 19:02
Closes keycloak#48755

* PersistentUserSessionProvider.onUserRemoved scanned the entire user-session and client-session caches to evict the user's entries -> O(N) where N is number of total sessions in cache. Thus cost scaled with total cached sessions. To fix this, look up the user's session ids first and then remove the cache entries directly -> O(K) where k is number of offline Sessions for specific user

* Use still old behavior (full-cache-scan) for useBatches `--spi-user-sessions--infinispan--use-batches=true` option, because cache entries can exist whose INSERT is still in the writer queue and not yet visible to the persister.

* additional: rename namedQuery of PersistentUserSessionEntity: `findUserSessionsByUserId` -> `findUserSessionsByUserIdLastSessionRefresh`, because it is conflicting the naming convention and a query without lastSessionRefresh is required

Signed-off-by: Rene Buchholz <rene.buchholz@spree.de>
@renekloth renekloth force-pushed the avoid-cache-scan-on-user-delete branch from 57d4c00 to ef6e35e Compare May 6, 2026 19:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants