HBASE-29912 Codel lifoThreshold should be applied on soft queue limit…#7864
HBASE-29912 Codel lifoThreshold should be applied on soft queue limit…#7864Umeshkumar9414 wants to merge 5 commits into
Conversation
|
|
||
| // so we can calculate actual threshold to switch to LIFO under load | ||
| private int maxCapacity; | ||
| private int currentQueueLimit; |
There was a problem hiding this comment.
Because this can be updated with onConfigurationChange should this be volatile ? Just a nit.
There was a problem hiding this comment.
This is also confusing in that it seems to shadow a variable of the same name that is declared volatile?
Please don't shadow. Why do we need this separately?
There was a problem hiding this comment.
I made currentQueueLimit volatile.
Please don't shadow. Why do we need this separately?
Generally a queue should only handle hard limit and layer above it can handle soft limit and that is what was happening but for codel, we needed queue to know the soft limit as well, I didn't saw a way to use the same variable at both place.
There was a problem hiding this comment.
At least let's not shadow the parent class's variable with the same name. That is a code smell. Pick another one.
34cb3fa to
ba1bb2c
Compare
|
Yetus JDK17 Hadoop3 Unit check failed on TestNamespaceReplication Test case with timout, I checked on my local, it is passing. No related change in this PR as well. Looks like a flakey. |
ba1bb2c to
572aa68
Compare
|
cc @saintstack |
apurtell
left a comment
There was a problem hiding this comment.
Minor changes needed. Otherwise lgtm.
| int newQueueLimit = conf.getInt(configKey, queueLimit); | ||
| if (newQueueLimit > DEFAULT_CALL_QUEUE_SIZE_HARD_LIMIT) { | ||
| LOG.warn( | ||
| "Requested soft limit {} exceeds dynamic-resize ceiling {}. " |
There was a problem hiding this comment.
This WARN will be misleading.
The actual limit is queueHardLimit, initialized in the constructor as Math.max(maxQueueLength, 250).
There was a problem hiding this comment.
Actually I doesn't want to store queueHardLimit as well in the object and without storing it we won't be able to know the actual queueHardLimit. Should I store it anyway ?
There was a problem hiding this comment.
added an variable queueHardLimit in the object and changed it
| "hbase.ipc.server.callqueue.codel.lifo.threshold"; | ||
|
|
||
| public static final int CALL_QUEUE_CODEL_DEFAULT_TARGET_DELAY = 100; | ||
| public static final int CALL_QUEUE_CODEL_DEFAULT_TARGET_DELAY = 5; |
There was a problem hiding this comment.
This change is fine for master branch, and even branch-2 but will need to be left to the current default in the releasing branches per our compatibility guidelines, which disallow changes that affect operational behaviors. Existing clusters relying on the looser 100ms target will drop calls maybe more aggressively than expected or benchmarked under the same load.
There was a problem hiding this comment.
In that case let me raise an different PR for this change. I want rest of the change in release branches.
There was a problem hiding this comment.
We can skip this while backporting.
| } | ||
|
|
||
| // Wait for some calls to complete | ||
| await().atMost(2, TimeUnit.SECONDS).until(() -> completedCalls.size() >= 3); |
There was a problem hiding this comment.
This will probably flake the test.
The assertion below for maxCallIdCompleted inspect up to 5 completions. With only 3 guaranteed by the await(), we might miss a completion that would otherwise land in the 4th/5th slot.
There was a problem hiding this comment.
removed the inspection upto 5
There was a problem hiding this comment.
By the timing we are using we will only need 3 completed run.
… instead of hard limit
572aa68 to
da783dd
Compare
… instead of hard limit
I think that after HBASE-16089 codel target delay was changed to 100, that is not standard for CoDel (patch). I have changed it back.
To visualize why we should have 5 target delay with 100 interval, I used the model, With target delay 100 and interval 100 we are more prone the queue being full, although CoDel will be still helpful in LIFO mode but a full queue is problem as it can cause OOM.