Skip to content

HBASE-29912 Codel lifoThreshold should be applied on soft queue limit…#7864

Open
Umeshkumar9414 wants to merge 5 commits into
apache:masterfrom
Umeshkumar9414:HBASE-29912
Open

HBASE-29912 Codel lifoThreshold should be applied on soft queue limit…#7864
Umeshkumar9414 wants to merge 5 commits into
apache:masterfrom
Umeshkumar9414:HBASE-29912

Conversation

@Umeshkumar9414
Copy link
Copy Markdown
Contributor

@Umeshkumar9414 Umeshkumar9414 commented Mar 5, 2026

… instead of hard limit

I think that after HBASE-16089 codel target delay was changed to 100, that is not standard for CoDel (patch). I have changed it back.
To visualize why we should have 5 target delay with 100 interval, I used the model, With target delay 100 and interval 100 we are more prone the queue being full, although CoDel will be still helpful in LIFO mode but a full queue is problem as it can cause OOM.


// so we can calculate actual threshold to switch to LIFO under load
private int maxCapacity;
private int currentQueueLimit;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this can be updated with onConfigurationChange should this be volatile ? Just a nit.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also confusing in that it seems to shadow a variable of the same name that is declared volatile?

Please don't shadow. Why do we need this separately?

Copy link
Copy Markdown
Contributor Author

@Umeshkumar9414 Umeshkumar9414 Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made currentQueueLimit volatile.

Please don't shadow. Why do we need this separately?

Generally a queue should only handle hard limit and layer above it can handle soft limit and that is what was happening but for codel, we needed queue to know the soft limit as well, I didn't saw a way to use the same variable at both place.

Copy link
Copy Markdown
Contributor

@apurtell apurtell Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least let's not shadow the parent class's variable with the same name. That is a code smell. Pick another one.

@Umeshkumar9414
Copy link
Copy Markdown
Contributor Author

Yetus JDK17 Hadoop3 Unit check failed on TestNamespaceReplication Test case with timout, I checked on my local, it is passing. No related change in this PR as well. Looks like a flakey.

@Umeshkumar9414
Copy link
Copy Markdown
Contributor Author

cc @saintstack

Copy link
Copy Markdown
Contributor

@apurtell apurtell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor changes needed. Otherwise lgtm.

int newQueueLimit = conf.getInt(configKey, queueLimit);
if (newQueueLimit > DEFAULT_CALL_QUEUE_SIZE_HARD_LIMIT) {
LOG.warn(
"Requested soft limit {} exceeds dynamic-resize ceiling {}. "
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This WARN will be misleading.
The actual limit is queueHardLimit, initialized in the constructor as Math.max(maxQueueLength, 250).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I doesn't want to store queueHardLimit as well in the object and without storing it we won't be able to know the actual queueHardLimit. Should I store it anyway ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added an variable queueHardLimit in the object and changed it

"hbase.ipc.server.callqueue.codel.lifo.threshold";

public static final int CALL_QUEUE_CODEL_DEFAULT_TARGET_DELAY = 100;
public static final int CALL_QUEUE_CODEL_DEFAULT_TARGET_DELAY = 5;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is fine for master branch, and even branch-2 but will need to be left to the current default in the releasing branches per our compatibility guidelines, which disallow changes that affect operational behaviors. Existing clusters relying on the looser 100ms target will drop calls maybe more aggressively than expected or benchmarked under the same load.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case let me raise an different PR for this change. I want rest of the change in release branches.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can skip this while backporting.

}

// Wait for some calls to complete
await().atMost(2, TimeUnit.SECONDS).until(() -> completedCalls.size() >= 3);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will probably flake the test.
The assertion below for maxCallIdCompleted inspect up to 5 completions. With only 3 guaranteed by the await(), we might miss a completion that would otherwise land in the 4th/5th slot.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the inspection upto 5

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the timing we are using we will only need 3 completed run.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants