| ocf_suse_SAPCMControlZone(7) | SAPCMControlZone | ocf_suse_SAPCMControlZone(7) |
NAME¶
SAPCMControlZone - Manages Convergent Mediation ControlZone services for a single instance as HA resource.
SYNOPSIS¶
SAPCMControlZone [ start | stop | monitor | meta-data | methods | reload | usage | validate-all ]
DESCRIPTION¶
Overview
SAPCMControlZone is a resource agent (RA) for managing the Convergent Mediation (CM) ControlZone platform and UI for a single instance as HA resources.
The CM central ControlZone platform is responsible for providing services to other CM instances. Several platform containers may exist in a CM system, for high availability, but only one is active at a time. The CM central ControlZone UI is used to query, edit, import, and export data.
The SAPCMControlZone RA manages ControlZone services as active/passive resources. The RA relies on the mzsh command of ControlZone as its interface. These calls are used:
- mzsh startup SERVICE
- mzsh shutdown SERVICE
- mzsh status SERVICE
Currently supported services are "platform" and "ui". Please see also the REQUIREMENTS section below.
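For example, the status of the platform service might be queried manually as the mzadmin user (default USER and SERVICE; see also the EXAMPLES section below):
# su - mzadmin -c "mzsh status platform"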
Filesystems
NFS shares with work directories can be mounted statically on all nodes. The HA cluster does not need to control those filesystems. See also manual page SAPCMControlZone_basic_cluster(7).
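A statically mounted NFS share for work directories might look like the following /etc/fstab sketch (server name, export path, mount point and options are assumptions; see nfs(5) and SAPCMControlZone_basic_cluster(7) for details):
nfs1:/export/c11 /mnt/platform nfs4 rw,noac,sync 0 0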
The ControlZone software and Java runtime environment can be installed into a central NFS share, into the cluster nodes' local filesystems, or both. Again, the HA cluster does not need to control those filesystems.
Another option would be to install the ControlZone software into a cluster-managed filesystem on shared storage. This might fit on-premises systems backed by SAN storage infrastructure. From the SAPCMControlZone RA's perspective, this would look like a node's local filesystem. Storage details are not discussed here.
The SAPCMControlZone RA offers the following ways for managing ControlZone services:
- calling mzsh always from cluster node's local filesystem
- calling mzsh startup/shutdown centrally, but mzsh status locally
In all cases the Java runtime environment can be used either from local disk or from central NFS. If the OS provides a compatible Java, that one can be used as well.
The RA hands over MZ_PLATFORM, MZ_HOME and JAVA_HOME as environment variables to the mzadmin user through ~/.bashrc. See the SUPPORTED PARAMETERS and EXAMPLES sections below for details.
Best Practice
* Use two independent corosync rings, at least one of them on a bonded network. This results in at least three physical links. Unicast is preferred.
* Use three STONITH Block Devices (SBD), i.e. shared LUNs accessible from all nodes on all sites. Of course, combine this with a hardware watchdog.
* Align all timeouts in the Linux cluster with the timeouts of the underlying infrastructure, particularly network and storage.
* Prefer cluster node's local filesystem over NFS whenever possible.
* Prefer OS Java runtime whenever possible.
* Check the installation of OS and Linux cluster on all nodes before doing any functional tests.
* Carefully define, perform, and document tests for all scenarios that should be covered, as well as all maintenance procedures.
* Test ControlZone features without Linux cluster before doing the overall cluster tests.
* Test basic Linux cluster features without ControlZone before doing the overall cluster tests.
* Be patient. For detecting the overall ControlZone status, the Linux cluster needs a certain amount of time, depending on the ControlZone services and the configured intervals and timeouts.
* Before doing anything, always check for the Linux cluster's idle status, left-over migration constraints, and resource failures as well as the ControlZone status. Please see also manual page SAPCMControlZone_maintenance_examples(7).
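For example, the overall cluster state might be checked with commands like these (a minimal sketch; crm_mon is part of pacemaker, cs_clusterstate is part of ClusterTools2):
# cs_clusterstate -i
# crm_mon -1r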
SUPPORTED PARAMETERS¶
This resource agent supports the following parameters:
USER
OS user who runs the CM ControlZone services, e.g. mzadmin.
Optional. Unique, string. Default value: "mzadmin".
SERVICE
The CM ControlZone service to be managed by the resource.
Optional. Unique, [ platform | ui ]. Default value: "platform".
MZSHELL
Path to the mzsh command. Two full paths can be given, separated by semicolon, for using different mzsh installations for monitor and start/stop actions (see EXAMPLES).
Optional. Unique, string. Default value: "/opt/cm/bin/mzsh".
MZHOME
Path to the CM ControlZone installation directory (MZ_HOME). Like MZSHELL, two paths can be given, separated by semicolon.
Optional. Unique, string. Default value: "/opt/cm/".
JAVAHOME
Path to the Java runtime environment (JAVA_HOME) used by mzsh.
Optional. Unique, string. Default value: "/usr/lib64/jvm/jre-17-openjdk".
MZPLATFORM
URL used by mzsh for connecting to the CM ControlZone platform (MZ_PLATFORM).
Optional. Unique, string. Default value: "http://localhost:9000".
CALL_TIMEOUT
Timeout for mzsh calls, in seconds. If exceeded, return code 124 is logged.
Optional. Unique, integer. Default value: 60.
SHUTDOWN_RETRIES
Number of retries for checking successful shutdown of the ControlZone service.
Optional. Unique, integer. Default: mzsh builtin value.
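A minimal resource definition using some of these parameters might look like the following sketch (resource name and values are examples only; see the EXAMPLES section below for complete configurations):
# crm configure primitive rsc_cz_C11 ocf:suse:SAPCMControlZone \
 params USER=mzadmin SERVICE=platform CALL_TIMEOUT=60 \
 op monitor interval=90 timeout=120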
SUPPORTED ACTIONS¶
This resource agent supports the following actions (operations):
start
stop
monitor
validate-all
meta-data
methods
reload
RETURN CODES¶
The return codes are defined by the OCF cluster framework. Please refer to the OCF definition on the website mentioned below. In addition, return code 124 will be logged if CALL_TIMEOUT has been exceeded. Also log entries are written, which can be scanned by using a pattern like "SAPCMControlZone.*rc=[1-7,9]" for errors. Regular operations might be found with "SAPCMControlZone.*rc=0". See SUSE TID 7022678 for maximum RA tracing.
The RA also logs mzsh return codes. For those codes, please look for the respective functions at https://infozone.atlassian.net/wiki/spaces/MD91/pages/23375910/Always+Available
EXAMPLES¶
Configuration and basic checks for ControlZone platform resources in Linux clusters. See also manual page SAPCMControlZone_maintenance_examples(7).
* Example ~/.bashrc
Environment variables MZ_PLATFORM, MZ_HOME and JAVA_HOME are needed for handling the ControlZone components. The values are inherited from the RA. The related resource parameters are MZPLATFORM, MZHOME and JAVAHOME. See also manual page bash(1). The lines in ~/.bashrc might look like:
export MZ_PLATFORM=${RA_MZ_PLATFORM:-"http://localhost:9000"}
export MZ_HOME=${RA_MZ_HOME:-"/opt/cm9/c11"}
export JAVA_HOME=${RA_JAVA_HOME:-"/opt/cm9/c11/sapmachine17"}
* Example of a simple resource group with ControlZone platform and IP address.
A ControlZone platform resource rsc_cz_C11 is configured, handled by OS user c11adm. The local /opt/cm9/c11/bin/mzsh is used for monitoring, but for other actions /usr/sap/c11/bin/mzsh is used. This resource is grouped with an IP address resource rsc_ip_C11 into group grp_cz_C11. The resource group might run on either node, but never in parallel.
In case of ControlZone platform failure (or monitor timeout), the resource gets restarted until it succeeds or migration-threshold is reached. In case of IP address failure, the resource group gets restarted until it succeeds or migration-threshold is reached. If migration-threshold is exceeded, or if the node fails where the group is running, the group will be moved to the other node. A priority is configured for correct fencing in split-brain situations. See also SAPCMControlZone_basic_cluster(7) and ocf_heartbeat_IPaddr2(7).
# crm configure primitive rsc_cz_C11 ocf:suse:SAPCMControlZone \
 params USER=c11adm \
 MZSHELL="/opt/cm9/c11/bin/mzsh;/usr/sap/c11/bin/mzsh" \
 MZHOME="/opt/cm9/c11/;/usr/sap/c11/" \
 MZPLATFORM=http://localhost:9000 \
 JAVAHOME=/opt/cm9/c11/sapmachine17 \
 op monitor interval=90 timeout=120 on-fail=restart \
 op start timeout=120 \
 op stop timeout=300 \
 meta priority=100

# crm configure primitive rsc_ip_C11 IPaddr2 \
 params ip=192.168.1.234 \
 op monitor interval=60 timeout=20 on-fail=restart

# crm configure group grp_cz_C11 \
 rsc_ip_C11 rsc_cz_C11
Note: To limit the impact of IP address failures on the ControlZone platform resource, the IP address resource can be placed after the platform. Please check if this is possible with your CM ControlZone setup.
* Example configuration for resource ControlZone UI.
A ControlZone UI resource rsc_ui_C11 is configured, handled by OS user c11adm. The default path to mzsh is used. The resource might run on either node, but never in parallel. In case of ControlZone UI failure (or monitor timeout), the resource gets restarted until it succeeds or migration-threshold is reached. If migration-threshold is exceeded, or if the node fails where the resource is running, the resource will be moved to the other node. The resource rsc_ui_C11 will start after resource group grp_cz_C11 and runs on the same node. See also SAPCMControlZone_basic_cluster(7) and ocf_heartbeat_IPaddr2(7).
# crm configure primitive rsc_ui_C11 ocf:suse:SAPCMControlZone \
 params USER=c11adm SERVICE=ui \
 op monitor interval=90 timeout=120 on-fail=restart \
 op start timeout=120 \
 op stop timeout=120

# crm configure order ord_cz_first Mandatory: grp_cz_C11:start rsc_ui_C11:start

# crm configure colocation col_with_cz 2000: rsc_ui_C11:Started grp_cz_C11:Started
Note: Instead of defining order and colocation, the resource rsc_ui_C11 might just be added to the resource group grp_cz_C11, as sketched below.
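That alternative might look like the following sketch (assuming the resources from the examples above):
# crm configure group grp_cz_C11 \
 rsc_ip_C11 rsc_cz_C11 rsc_ui_C11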
* Optional loadbalancer resource for specific environments.
In some environments a loadbalancer is used for managing access to the virtual IP address. Thus a respective resource agent might be needed. The resource might be grouped with the IPaddr2 resource, and starts just after the IPaddr2. In the example at hand, azure-lb is the loadbalancer RA and 47011 is the used port. See also manual page ocf_heartbeat_azure-lb(7).
# crm configure primitive rsc_lb_C11 azure-lb \
 params port=47011 \
 op monitor timeout=20 interval=10 \
 op_params depth=0 \
 op start timeout=20 \
 op stop timeout=20

# crm configure group grp_cz_C11 \
 rsc_ip_C11 rsc_lb_C11 rsc_cz_C11
* Optional Filesystem resource for monitoring NFS shares.
A shared filesystem might be statically mounted by the OS on both cluster nodes. This filesystem holds work directories. It must not be confused with the ControlZone application itself. Client-side write caching has to be disabled.
A Filesystem resource is configured for a bind-mount of the real NFS share. This resource is grouped with the ControlZone platform and IP address. In case of filesystem failures, the node gets fenced. No mount or umount on the real NFS share is done. Example for the real NFS share is /mnt/platform/check/, example for the bind-mount is /mnt/check/. Both mount points have to be created before the cluster resource is activated. See also manual page SAPCMControlZone_basic_cluster(7), ocf_heartbeat_Filesystem(7) and nfs(5).
# crm configure primitive rsc_fs_C11 Filesystem \
 params device=/mnt/platform/check/ directory=/mnt/check/ \
 fstype=nfs4 options=bind,rw,noac,sync,defaults \
 op monitor interval=60 timeout=120 on-fail=fence \
 op_params OCF_CHECK_LEVEL=20 \
 op start timeout=120 \
 op stop timeout=120

# crm configure group grp_cz_C11 \
 rsc_fs_C11 rsc_ip_C11 rsc_cz_C11
Note: If the cluster should try to recover locally before fencing the node, action on-fail=restart needs to be used instead of on-fail=fence.
* Alternate resource order.
All resources are managed within one group. The order is: filesystem, platform, IP address, loadbalancer, UI. The idea is to minimise impact of IP address and UI on platform. On the other hand, filesystem failures should lead to immediate cluster actions. To make this work, MZ_PLATFORM needs to point to localhost for all actions.
# crm configure group grp_cz_C11 \
 rsc_fs_C11 rsc_cz_C11 rsc_ip_C11 rsc_nc_C11 rsc_ui_C11
* Show configuration of ControlZone platform resource and resource group.
Resource is rsc_cz_C11, resource group is grp_cz_C11.
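The configuration might be shown with the crm shell, for example:
# crm configure show rsc_cz_C11 grp_cz_C11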
* Search for log entries of SAPCMControlZone, show errors only.
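A sketch for searching the system log (the log file path is an assumption; the systemd journal might be used instead, depending on the system):
# grep "SAPCMControlZone.*rc=[1-7,9]" /var/log/messages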
* Show log entry of one specific SAPCMControlZone run.
PID of run is 8558.
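Assuming the log entries contain the PID in square brackets, the entries of that run might be filtered like this (log file path is an assumption):
# grep "SAPCMControlZone.*\[8558\]" /var/log/messages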
* Show and delete failcount for resource.
Resource is rsc_cz_C11, node is node22. Useful after a failure has been fixed, and for testing.
# crm resource failcount rsc_cz_C11 show node22
# crm resource failcount rsc_cz_C11 delete node22
* Manually trigger a SAPCMControlZone probe action.
USER is mzadmin, SERVICE is platform, MZSHELL is /usr/sap/c11/bin/mzsh .
# OCF_RESKEY_USER=mzadmin \
OCF_RESKEY_SERVICE=platform \
OCF_RESKEY_MZSHELL="/usr/sap/c11/bin/mzsh" \
OCF_RESKEY_MZHOME="/usr/sap/c11" \
OCF_RESKEY_JAVAHOME="/usr/sap/sapmachine17" \
OCF_ROOT=/usr/lib/ocf/ \
OCF_RESKEY_CRM_meta_interval=0 \
/usr/lib/ocf/resource.d/suse/SAPCMControlZone monitor
* Basic validation of SAPCMControlZone configuration.
The USER, MZSHELL and SERVICE are looked up in the installed system.
# OCF_RESKEY_CRM_meta_interval=0 \
/usr/lib/ocf/resource.d/suse/SAPCMControlZone validate-all
* Example for identifying running CM platform processes.
The JAVA_HOME is /usr/sap/c11/sapmachine17 .
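Running platform Java processes might be identified by their JAVA_HOME path, for example (pgrep -f matches against the full command line):
# pgrep -fl /usr/sap/c11/sapmachine17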
* Example for checking if a CM platform can be reached.
The MZ_PLATFORM is http://192.168.1.234:9000 , the user is mzadmin.
# su - mzadmin
~> echo $MZ_PLATFORM
~> which mzsh
~> mzsh status platform
~> exit
* Example for checking if a CM platform cannot reach the database.
The user is mzadmin.
~> grep "Failed to load codeserver state from database" \
$MZ_HOME/log/platform_current.log
~> grep "Cannot connect to jdbc:sap:" \
$MZ_HOME/log/platform_current.log
~> exit
* Example for testing the SAPCMControlZone RA.
The ControlZone platform will be terminated, while controlled by
the Linux cluster. This could be done as very basic testing of
SAPCMControlZone RA integration. Terminating ControlZone platform processes
is dangerous. This test should not be done on production systems. Example
user is mzadmin.
Note: Understand the impact before trying.
1. Check ControlZone and Linux cluster for clean and idle state.
2. Terminate ControlZone platform processes.
# su - mzadmin -c "mzsh kill platform"
3. Wait for the cluster to recover from resource failure.
4. Clean up resource fail-count.
5. Check ControlZone and Linux cluster for clean and idle state.
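Steps 3 to 5 might be done with commands like these (a sketch; resource name rsc_cz_C11 and node name node22 are taken from the examples above):
# crm_mon -r
# crm resource failcount rsc_cz_C11 delete node22
# cs_clusterstate -i
# su - mzadmin -c "mzsh status platform"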
FILES¶
- /usr/lib/ocf/resource.d/suse/SAPCMControlZone
- the resource agent
- $HOME/.bashrc, e.g. /home/mzadmin/.bashrc
- the mzadmin's ~/.bashrc, defining JAVA_HOME, MZ_HOME and MZ_PLATFORM
- $MZ_HOME, e.g. /opt/cm/
- the installation directory of a CM ControlZone service
- $MZ_HOME/bin/mzsh
- the default mzshell, used as API for managing CM ControlZone services, contains paths and URL
- $MZ_HOME/log/
- path to logfiles of mzsh as well as platform and UI
- $MZ_HOME/tmp/
- temporary files and lock files of platform and UI
- $JAVA_HOME
- the Java virtual machine, used by mzsh
REQUIREMENTS¶
* Convergent Mediation ControlZone version 9.0.1.1 or higher is installed and configured on both cluster nodes. Either the software is installed once into a shared NFS filesystem and binaries and configuration are then copied to both cluster nodes' local filesystems, or the software is installed on each node directly. In either case, the local configuration finally has to be adjusted. Please refer to the Convergent Mediation documentation for details.
* CM ControlZone is configured identically on both cluster nodes. User, path names and environment settings are the same.
* Only one ControlZone instance per Linux cluster. Thus one platform service and one UI service per cluster.
* The platform and UI are installed into the same MZ_HOME.
* Linux shell of the mzadmin user is /bin/bash.
* The mzadmin's ~/.bashrc inherits MZ_HOME, JAVA_HOME and MZ_PLATFORM from the SAPCMControlZone RA. These variables need to be set as described in the RA's documentation (i.e. this manual page).
* When called by the resource agent, mzsh connects to CM ControlZone services via network. The service's virtual hostname or virtual IP address managed by the cluster should not be used when called by RA monitor actions.
* Technical users and groups are defined locally in the Linux system. If users are resolved by a remote service, local caching is necessary. Substitute user (su) to the mzadmin user needs to work reliably and without customized actions or messages.
* Name resolution for hostnames and virtual hostnames is crucial. Hostnames of cluster nodes and services are resolved locally in the Linux system.
* Strict time synchronization between the cluster nodes is required, e.g. by NTP. All cluster nodes are configured with the same timezone.
* Needed NFS shares (e.g. /usr/sap/<SID>) are mounted statically or by automounter. No client-side write caching. File locking might be configured for application needs.
* The RA monitoring operations have to be active.
* RA runtime almost completely depends on call-outs to controlled resources, OS and Linux cluster. The infrastructure needs to allow these call-outs to return in time.
* The ControlZone services are not started/stopped by OS. Thus there is no SystemV, systemd or cron job.
* As long as a ControlZone service is managed by the Linux cluster, the service is not started/stopped/moved from outside. Thus no manual actions are done. The Linux cluster does not prevent administrative mistakes. However, if the Linux cluster detects the application running on both sites in parallel, it will stop both and restart one.
* The interface between the RA and the ControlZone services is the mzsh command. Ideally, mzsh should be accessed on the cluster nodes' local filesystems. mzsh is called with the arguments startup, shutdown and status. Its return code and output are interpreted by the RA. Thus the command and its output need to be stable. mzsh shall not be customized. Particularly, environment variables set through ~/.bashrc must not be changed.
* mzsh is called on the active node with a defined interval for regular resource monitor operations. It is also called on the active or passive node in certain situations. Those calls might run in parallel.
BUGS¶
In case of any problem, please use your favourite SAP support
process to open a request for the component BC-OP-LNX-SUSE.
Please report feedback and suggestions to feedback@suse.com.
SEE ALSO¶
SAPCMControlZone_basic_cluster(7),
SAPCMControlZone_maintenance_examples(7),
ocf_heartbeat_IPaddr2(7) , ocf_heartbeat_Filesystem(7) ,
crm(8) , crm_mon(8) , cs_show_cluster_actions(8) ,
nfs(5) , mount(8) , bash(1) ,
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-ocf-return-codes.html
,
https://infozone.atlassian.net/wiki/spaces/MD9/pages/4881672/mzsh ,
https://infozone.atlassian.net/wiki/spaces/MD9/pages/4849693/Setting+Environment+Variables+for+Platform
,
https://documentation.suse.com/sbp/sap/ ,
https://documentation.suse.com/#sle-ha ,
https://www.suse.com/support/kb/doc/?id=000019138 ,
https://www.suse.com/support/kb/doc/?id=000019514 ,
https://www.suse.com/support/kb/doc/?id=000019722 ,
https://launchpad.support.sap.com/#/notes/1552925 ,
https://launchpad.support.sap.com/#/notes/3079845
AUTHORS¶
F.Herschel, L.Pinne
COPYRIGHT¶
(c) 2023-2024 SUSE LLC
SAPCMControlZone comes with ABSOLUTELY NO WARRANTY.
For details see the GNU General Public License at
http://www.gnu.org/licenses/gpl.html
| 15 Apr 2024 |