Control Center 6.1 + improve/speed up when monitored servers get temporarily assigned to another EP

When an EP is stopped, it can take several minutes for the monitored servers to get reassigned to another EP. If there is any way that your company can improve and/or speed up this process, it would be greatly appreciated.

What is your industry? Non-Industry Specific
How will this idea be used? When an EP is stopped, it can take several minutes for the monitored servers to get reassigned to another EP. If there is any way that your company can improve and/or speed up this process, it would be greatly appreciated.

Guest

Oct 23, 2020
Hello!
What you want is achievable now, but please let me explain a few things first.
There are two aspects to server reassignment.
- The first is ascertaining that an EP has stopped running, and
- The second is to reassign servers to other EPs in the cluster when one EP has stopped.
Know that Control Center 6.1.3.0 has significant performance improvements compared to 6.1.2, including one in the area of server reassignment. That said, there can still be a delay of more than one minute, but less than two minutes, between when an EP actually stops and when the CEP decides it has stopped.
This time can be shortened, but at a cost to Control Center performance, and with a risk of the CEP falsely deciding that an EP has stopped.
Details:
- All running EPs are expected to update the LAST_CHECKIN value for their row in the CC_SERVER table at the rate dictated by their HEARTBEAT_INTERVAL value.
- The CEP periodically checks CC_SERVER.LAST_CHECKIN for each EP, and if it determines the value is too old (more on that algorithm below), it will set the status of that server to DOWN and reassign its servers according to the policies set. (Note: before actually changing the status of an EP to DOWN, the CEP will attempt to communicate with it, and only if that also fails will it change the status to DOWN.)
- The CC_SERVER.HEARTBEAT_INTERVAL value is set when you install and configure an EP. For EPs, the default HEARTBEAT_INTERVAL value is 30000 (the unit is milliseconds, so this is 30 seconds).
- You may manually change this via SQL (while all EPs are stopped). Making this value smaller would cause the CEP to decide sooner than it does now that an EP is down, and server reassignment would then start sooner.
- Making this value smaller would also increase the risk of the CEP mistakenly concluding that an EP has stopped when a temporary condition has merely prevented the EP from updating its LAST_CHECKIN value in a timely fashion. The CEP would then initiate unnecessary and unwanted server reassignments.
- The CEP checks the CC_SERVER.LAST_CHECKIN value for other EPs at most every 30 seconds (this is a hard-coded value and cannot be changed). Because of this, and to avoid mistakenly believing that a running EP has stopped, the CEP actually waits double the HEARTBEAT_INTERVAL value (plus an extra 5 seconds for good measure) for the LAST_CHECKIN value to be updated before it initiates the last-ditch communication attempt.
So you can cause the CEP to initiate server reassignments for a downed EP faster than what happens now, but I'm not sure you really want to, and I would advise against it.
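To make the timing above concrete, here is a rough sketch in Python. It is not Control Center code, only a simplified model of the rule described in the details: the CEP waits twice the HEARTBEAT_INTERVAL plus 5 seconds, and looks at LAST_CHECKIN at most every 30 seconds. The function names, the strict ">" comparison, and the worst-case estimate are my own assumptions for illustration, and the model ignores the time taken by the last-ditch communication attempt.

```python
# Simplified model of the CEP's "is this EP down?" timing, based only on the
# description above. All names here are hypothetical, not Control Center APIs.

CEP_POLL_INTERVAL_MS = 30_000  # hard-coded CEP check interval (30 seconds)
GRACE_MS = 5_000               # the "extra 5 seconds for good measure"


def staleness_threshold_ms(heartbeat_interval_ms: int) -> int:
    """Age of LAST_CHECKIN beyond which the CEP would start to suspect an EP."""
    return 2 * heartbeat_interval_ms + GRACE_MS


def is_stale(last_checkin_ms: int, now_ms: int, heartbeat_interval_ms: int) -> bool:
    """True if LAST_CHECKIN is older than the threshold (strict '>' is assumed)."""
    return now_ms - last_checkin_ms > staleness_threshold_ms(heartbeat_interval_ms)


def worst_case_detection_ms(heartbeat_interval_ms: int) -> int:
    """Rough upper bound on the delay between an EP stopping and the CEP noticing.

    Assumes the EP stopped immediately after a check-in and that a CEP check
    ran just before the threshold was crossed, so a full poll interval passes
    before the next check. Excludes the final communication attempt.
    """
    return staleness_threshold_ms(heartbeat_interval_ms) + CEP_POLL_INTERVAL_MS


if __name__ == "__main__":
    for hb in (30_000, 15_000, 10_000):
        print(
            f"HEARTBEAT_INTERVAL={hb} ms: "
            f"threshold={staleness_threshold_ms(hb) / 1000:.0f} s, "
            f"worst case={worst_case_detection_ms(hb) / 1000:.0f} s"
        )
```

Under this model, the default of 30000 ms gives a threshold of 65 seconds and a worst case of about 95 seconds, which falls in the one to two minute range mentioned earlier. Halving HEARTBEAT_INTERVAL to 15000 ms would bring the threshold down to 35 seconds, but because the 30 second check interval is fixed, the worst case only drops to about 65 seconds.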

Guest

Oct 23, 2020

Improvements have been made to the server reassignment process.
