
ocf_suse_SAPCMControlZone(7) SAPCMControlZone ocf_suse_SAPCMControlZone(7)

NAME

SAPCMControlZone - Manages Convergent Mediation ControlZone services for a single instance as HA resource.

SYNOPSIS

SAPCMControlZone [ start | stop | monitor | meta-data | methods | reload | usage | validate-all ]

DESCRIPTION

Overview

SAPCMControlZone is a resource agent (RA) for managing the Convergent Mediation (CM) ControlZone platform and UI for a single instance as HA resources.

The CM central ControlZone platform is responsible for providing services to other CM instances. Several platform containers may exist in a CM system, for high availability, but only one is active at a time. The CM central ControlZone UI is used to query, edit, import, and export data.

The SAPCMControlZone RA manages ControlZone services as active/passive resources. The RA relies on the mzsh command of ControlZone as its interface. These calls are used:

- mzsh startup -f SERVICE
- mzsh status SERVICE
- mzsh shutdown SERVICE

Currently supported services are "platform" and "ui". Please see also the REQUIREMENTS section below.
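
The effect of these calls can be checked manually as the mzadmin user. A minimal sketch, assuming the default user mzadmin and the platform service:

# su - mzadmin -c "mzsh status platform"; echo "rc=$?"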

Filesystems

NFS shares with work directories can be mounted statically on all nodes. The HA cluster does not need to control those filesystems. See also manual page SAPCMControlZone_basic_cluster(7).

The ControlZone software and Java runtime environment can be installed into a central NFS share, or into the cluster nodes´ local filesystems, or both. Again, the HA cluster does not need to control those filesystems.

Another option would be to install the ControlZone software into a cluster-managed filesystem on shared storage. This might fit for on-premise systems backed by SAN storage infrastructure. From the SAPCMControlZone RA´s perspective this would look like a node´s local filesystem. We do not discuss storage details here.

The SAPCMControlZone RA offers three ways for managing ControlZone services:

- calling mzsh always from central shared NFS
- calling mzsh always from cluster node´s local filesystem
- calling mzsh startup/shutdown centrally, but mzsh status locally

In all cases the Java runtime environment can be used either from local disk or from central NFS. If the OS provides a compatible Java, that one can be used as well.

In all cases the RA can define different MZ_HOME and JAVA_HOME environment variables through the mzadmin user´s ~/.bashrc. See the SUPPORTED PARAMETERS and EXAMPLES sections below for details.

Best Practice

* Use two independent corosync rings, at least one of them on a bonded network, resulting in at least three physical links. Unicast is preferred.

* Use three Stonith Block Devices (SBD), shared LUNs across all nodes on all sites, together with a hardware watchdog.

* Align all timeouts in the Linux cluster with the timeouts of the underlying infrastructure - particularly network and storage.

* Prefer cluster node´s local filesystem over NFS whenever possible.

* Prefer OS Java runtime whenever possible.

* Check the installation of OS and Linux cluster on all nodes before doing any functional tests.

* Carefully define, perform, and document tests for all scenarios that should be covered, as well as all maintenance procedures.

* Test ControlZone features without Linux cluster before doing the overall cluster tests.

* Test basic Linux cluster features without ControlZone before doing the overall cluster tests.

* Be patient. For detecting the overall ControlZone status, the Linux cluster needs a certain amount of time, depending on the ControlZone services and the configured intervals and timeouts.

* Before doing anything, always check for the Linux cluster's idle status, left-over migration constraints, and resource failures as well as the ControlZone status. Please see also manual page SAPCMControlZone_maintenance_examples(7).
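
A minimal sketch for such a check, assuming standard cluster tools and left-over migration constraints named cli-*:

# crm_mon -1r
# crm configure show | grep cli-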

SUPPORTED PARAMETERS

This resource agent supports the following parameters:

USER

OS user who calls mzsh, owner of $MZ_HOME (might be different from $HOME).
Optional. Unique, string. Default value: "mzadmin".

SERVICE

The ControlZone service to be managed by the resource agent.
Optional. Unique, [ platform | ui ]. Default value: "platform".

MZSHELL

Path to mzsh. Could be one or two full paths. If one path is given, that path is used for all actions. In case two paths are given, the first one is used for monitor actions, the second one is used for start/stop actions. If two paths are given, the first needs to be on local disk, the second needs to be on the central NFS share with the original CM ControlZone installation. Two paths are separated by a semi-colon (;). The mzsh contains settings that need to be consistent with MZ_PLATFORM, MZ_HOME, JAVA_HOME. Please refer to Convergent Mediation product documentation for details.
Optional. Unique, string. Default value: "/opt/cm/bin/mzsh".
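
A sketch of a two-path setting, with a local copy used for monitor actions and the central NFS installation used for start/stop actions (both paths are examples):

MZSHELL="/opt/cm/bin/mzsh;/usr/sap/cm/bin/mzsh"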

MZHOME

Path to CM ControlZone installation directory, owned by the mzadmin user. Could be one or two full paths. If one path is given, that path is used for all actions. In case two paths are given, the first one is used for monitor actions, the second one is used for start/stop actions. If two paths are given, the first needs to be on local disk, the second needs to be on the central NFS share with the original CM ControlZone installation. See also JAVAHOME. Two paths are separated by semi-colon (;).
Optional. Unique, string. Default value: "/opt/cm/".

JAVAHOME

Path to Java virtual machine used for CM ControlZone. Could be one or two full paths. If one path is given, that path is used for all actions. In case two paths are given, the first one is used for monitor actions, the second one is used for start/stop actions. If two paths are given, the first needs to be on local disk, the second needs to be on the central NFS share with the original CM ControlZone installation. See also MZHOME. Two paths are separated by semi-colon (;).
Optional. Unique, string. Default value: "/usr/lib64/jvm/jre-17-openjdk".

MZPLATFORM

URL used by mzsh for connecting to CM ControlZone services. Could be one or two URLs. If one URL is given, that URL is used for all actions. In case two URLs are given, the first one is used for monitor and stop actions, the second one is used for start actions. Two URLs are separated by semi-colon (;). Should usually not be changed. The service´s virtual hostname or virtual IP address managed by the cluster must never be used for RA monitor actions.
Optional. Unique, string. Default value: "http://localhost:9000".
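
A sketch of a two-URL setting, with localhost used for monitor and stop actions and an example virtual hostname used for start actions. Whether a virtual hostname fits for start actions depends on the CM setup; please refer to the product documentation:

MZPLATFORM="http://localhost:9000;http://cm-c11-vip:9000"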

CALL_TIMEOUT

Defines the timeout for how long calls to the ControlZone service for checking the status may take. If the timeout is reached, the return code will be 124. If you increase this timeout for ControlZone calls, you should also adjust the monitor operation timeout of your Linux cluster resources. (Not yet implemented)
Optional. Unique, integer. Default value: 60.

SHUTDOWN_RETRIES

Number of retries to check for process shutdown. Passed to mzsh. If you increase the number of shutdown retries, you should also adjust the stop operation timeout of your Linux cluster resources. (Not yet implemented)
Optional. Unique, integer. Default: mzsh builtin value.

SUPPORTED ACTIONS

This resource agent supports the following actions (operations):

start

Starts the ControlZone service resource. If the mzsh startup call fails, the RA tries twice. Timeout might be adapted to match expected application timing. The RA start timeout relates to the ControlZone component property term.default.startup.timeout, which defaults to 180 seconds. Suggested minimum timeout: 120.

stop

Stops the ControlZone service resource. If the mzsh shutdown call fails, the RA tries twice. Timeout might be adapted to match expected application timing. For maximum patience, the RA stop timeout would be 300 seconds. Suggested minimum timeout: 300, default/required action on-fail=fence.

monitor

Regularly checks the ControlZone service resource status. If the mzsh status call fails, the RA tries twice. Timeout might be adapted to be greater than expected infrastructure timeouts. The RA monitor timeout also relates to the ControlZone component property pico.rcp.timeout, which defaults to 60 seconds. For maximum patience with this component, the RA monitor timeout would be 140 seconds (60+10+60+10). Suggested minimum timeout: 120, suggested interval: 120, suggested action on-fail=restart.

validate-all

Performs a validation of the resource configuration. It does basic checking of given USER, MZSHELL and SERVICE. Suggested minimum timeout: 5.

meta-data

Retrieves resource agent metadata (internal use only). Suggested minimum timeout: 5.

methods

Reports which methods (operations) the resource agent supports. Suggested minimum timeout: 5.

reload

Changes parameters without forcing a recovery of the resource. Suggested minimum timeout: 5.

RETURN CODES

The return codes are defined by the OCF cluster framework. Please refer to the OCF definition on the website mentioned below. In addition, return code 124 will be logged if CALL_TIMEOUT has been exceeded. Log entries are also written, which can be scanned for errors by using a pattern like "SAPCMControlZone.*rc=[1-7,9]". Regular operations might be found with "SAPCMControlZone.*rc=0". See SUSE TID 7022678 for maximum RA tracing.

The RA also logs mzsh return codes. For those codes, please look for the respective functions at https://infozone.atlassian.net/wiki/spaces/MD91/pages/23375910/Always+Available
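
For example, errors and regular operations might be listed from the system log like this (the log file path depends on the OS logging configuration):

# grep "SAPCMControlZone.*rc=[1-7,9]" /var/log/messages
# grep "SAPCMControlZone.*rc=0" /var/log/messages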

EXAMPLES

Configuration and basic checks for ControlZone platform resources in Linux clusters. See also manual page SAPCMControlZone_maintenance_examples(7).

* Example ~/.bashrc

Environment variables MZ_PLATFORM, MZ_HOME and JAVA_HOME are needed for handling the ControlZone components. The values are inherited from the RA. The related resource parameters are MZPLATFORM, MZHOME and JAVAHOME. See also manual page bash(1). The lines in ~/.bashrc might look like:

# MZ_PLATFORM, MZ_HOME, JAVA_HOME are set by HA RA
export MZ_PLATFORM=${RA_MZ_PLATFORM:-"http://localhost:9000"}
export MZ_HOME=${RA_MZ_HOME:-"/opt/cm9/c11"}
export JAVA_HOME=${RA_JAVA_HOME:-"/opt/cm9/c11/sapmachine17"}

* Example of a simple resource group with ControlZone platform and IP address.

A ControlZone platform resource rsc_cz_C11 is configured, handled by OS user c11adm. The local /opt/cm9/c11/bin/mzsh is used for monitoring, but for other actions /usr/sap/c11/bin/mzsh is used. This resource is grouped with an IP address resource rsc_ip_C11 into group grp_cz_C11. The resource group might run on either node, but never in parallel.

In case of ControlZone platform failure (or monitor timeout), the resource gets restarted until it succeeds or migration-threshold is reached. In case of IP address failure, the resource group gets restarted until it succeeds or migration-threshold is reached. If migration-threshold is exceeded, or if the node fails where the group is running, the group will be moved to the other node. A priority is configured for correct fencing in split-brain situations. See also SAPCMControlZone_basic_cluster(7) and ocf_heartbeat_IPaddr2(7).

primitive rsc_cz_C11 ocf:suse:SAPCMControlZone \
params USER=c11adm \
MZSHELL="/opt/cm9/c11/bin/mzsh;/usr/sap/c11/bin/mzsh" \
MZHOME="/opt/cm9/c11/;/usr/sap/c11/" \
MZPLATFORM=http://localhost:9000 \
JAVAHOME=/opt/cm9/c11/sapmachine17 \
op monitor interval=90 timeout=120 on-fail=restart \
op start timeout=120 \
op stop timeout=300 \
meta priority=100

primitive rsc_ip_C11 ocf:heartbeat:IPaddr2 \
params ip=192.168.1.234 \
op monitor interval=60 timeout=20 on-fail=restart

group grp_cz_C11 \
rsc_ip_C11 rsc_cz_C11

Note: To limit the impact of IP address failures on the ControlZone platform resource, the IP address resource can be placed after the platform, as sketched below. Please check if this is possible with your CM ControlZone setup.
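
Such an alternate group might look like the following sketch, with the platform placed before the IP address:

group grp_cz_C11 \
 rsc_cz_C11 rsc_ip_C11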

* Example configuration for resource ControlZone UI.

A ControlZone UI resource rsc_ui_C11 is configured, handled by OS user c11adm. The default path to mzsh is used. The resource might run on either node, but never in parallel. In case of ControlZone UI failure (or monitor timeout), the resource gets restarted until it succeeds or migration-threshold is reached. If migration-threshold is exceeded, or if the node fails where the resource is running, the resource will be moved to the other node. The resource rsc_ui_C11 will start after resource group grp_cz_C11 and runs on the same node. See also SAPCMControlZone_basic_cluster(7) and ocf_heartbeat_IPaddr2(7).

primitive rsc_ui_C11 ocf:suse:SAPCMControlZone \
params USER=c11adm SERVICE=ui \
op monitor interval=90 timeout=120 on-fail=restart \
op start timeout=120 \
op stop timeout=120

order ord_cz_first Mandatory: grp_cz_C11:start rsc_ui_C11:start

colocation col_with_cz 2000: rsc_ui_C11:Started grp_cz_C11:Started

Note: Instead of defining order and colocation, the resource rsc_ui_C11 might simply be added to the resource group grp_cz_C11, as sketched below.
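
In that case, the group from the example above might be extended like this (a sketch):

group grp_cz_C11 \
 rsc_ip_C11 rsc_cz_C11 rsc_ui_C11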

* Optional loadbalancer resource for specific environments.

In some environments a loadbalancer is used for managing access to the virtual IP address. Thus a respective resource agent might be needed. The resource might be grouped with the IPaddr2 resource, and starts just after the IPaddr2. In the example at hand, azure-lb is the loadbalancer RA and 47011 is the used port. See also manual page ocf_heartbeat_azure-lb(7).

primitive rsc_lb_C11 azure-lb \
params port=47011 \
op monitor timeout=20 interval=10 \
op_params depth=0 \
op start timeout=20 \
op stop timeout=20

group grp_cz_C11 \
rsc_ip_C11 rsc_lb_C11 rsc_cz_C11

* Optional Filesystem resource for monitoring NFS shares.

A shared filesystem might be statically mounted by the OS on both cluster nodes. This filesystem holds work directories. It must not be confused with the ControlZone application itself. Client-side write caching has to be disabled.

A Filesystem resource is configured for a bind-mount of the real NFS share. This resource is grouped with the ControlZone platform and IP address. In case of filesystem failures, the node gets fenced. No mount or umount on the real NFS share is done. Example for the real NFS share is /mnt/platform/check/, example for the bind-mount is /mnt/check/. Both mount points have to be created before the cluster resource is activated. See also manual page SAPCMControlZone_basic_cluster(7), ocf_heartbeat_Filesystem(7) and nfs(5).

primitive rsc_fs_C11 ocf:heartbeat:Filesystem \
params device=/mnt/platform/check/ directory=/mnt/check/ \
fstype=nfs4 options=bind,rw,noac,sync,defaults \
op monitor interval=60 timeout=120 on-fail=fence \
op_params OCF_CHECK_LEVEL=20 \
op start timeout=120 \
op stop timeout=120

group grp_cz_C11 \
rsc_fs_C11 rsc_ip_C11 rsc_cz_C11

Note: If the cluster should try to recover locally before fencing the node, action on-fail=restart needs to be used instead of on-fail=fence.
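
Preparing the statically mounted NFS share and the bind-mount target outside the cluster might look like the following sketch. The NFS server nfs1, the export path /export/check and the mount options are examples:

# mkdir -p /mnt/platform/check /mnt/check
# echo "nfs1:/export/check /mnt/platform/check nfs4 rw,noac,sync 0 0" >> /etc/fstab
# mount /mnt/platform/check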

* Alternate resource order.

All resources are managed within one group. The order is: filesystem, platform, IP address, loadbalancer, UI. The idea is to minimise impact of IP address and UI on platform. On the other hand, filesystem failures should lead to immediate cluster actions. To make this work, MZ_PLATFORM needs to point to localhost for all actions.

group grp_cz_C11 \
rsc_fs_C11 rsc_cz_C11 rsc_ip_C11 rsc_lb_C11 rsc_ui_C11

* Show configuration of ControlZone platform resource and resource group.

Resource is rsc_cz_C11, resource group is grp_cz_C11.

# crm configure show rsc_cz_C11 grp_cz_C11

* Search for log entries of SAPCMControlZone, show errors only.

# grep "SAPCMControlZone.*rc=[1-7,9]" /var/log/messages

* Show log entry of one specific SAPCMControlZone run.

PID of run is 8558.

# grep "SAPCMControlZone.*\[8558\]" /var/log/messages

* Show and delete failcount for resource.

Resource is rsc_cz_C11, node is node22. Useful after a failure has been fixed, and for testing.

# crm resource failcount rsc_cz_C11 show node22
# crm resource failcount rsc_cz_C11 delete node22

* Manually trigger a SAPCMControlZone probe action.

USER is mzadmin, SERVICE is platform, MZSHELL is /usr/sap/c11/bin/mzsh .

# OCF_RESKEY_USER=mzadmin \
OCF_RESKEY_SERVICE=platform \
OCF_RESKEY_MZSHELL="/usr/sap/c11/bin/mzsh" \
OCF_RESKEY_MZHOME="/usr/sap/c11" \
OCF_RESKEY_JAVAHOME="/usr/sap/sapmachine17" \
OCF_ROOT=/usr/lib/ocf/ \
OCF_RESKEY_CRM_meta_interval=0 \
/usr/lib/ocf/resource.d/suse/SAPCMControlZone monitor

* Basic validation of SAPCMControlZone configuration.

The USER, MZSHELL and SERVICE are looked up in the installed system.

# OCF_ROOT=/usr/lib/ocf/ \
OCF_RESKEY_CRM_meta_interval=0 \
/usr/lib/ocf/resource.d/suse/SAPCMControlZone validate-all

* Example for identifying running CM platform processes.

The JAVA_HOME is /usr/sap/c11/sapmachine17 .

# pgrep -f "/usr/sap/c11/sapmachine17/bin/java.*OnOutOfMemoryError=oom platform" -l

* Example for checking if a CM platform can be reached.

The MZ_PLATFORM is http://192.168.1.234:9000 , the user is mzadmin.

# telnet 192.168.1.234 9000
# su - mzadmin
~> echo $MZ_PLATFORM
~> which mzsh
~> mzsh status platform
~> exit

* Example for checking if a CM platform can not reach the database.

The user is mzadmin.

# su - mzadmin
~> grep "Failed to load codeserver state from database" \
$MZ_HOME/log/platform_current.log
~> grep "Cannot connect to jdbc:sap:" \
$MZ_HOME/log/platform_current.log
~> exit

* Example for testing the SAPCMControlZone RA.

The ControlZone platform will be terminated, while controlled by the Linux cluster. This could be done as very basic testing of SAPCMControlZone RA integration. Terminating ControlZone platform processes is dangerous. This test should not be done on production systems. Example user is mzadmin.
Note: Understand the impact before trying.

1. Check ControlZone and Linux cluster for clean and idle state.
2. Terminate ControlZone platform processes.
# su - mzadmin -c "mzsh kill platform"
3. Wait for the cluster to recover from resource failure.
4. Clean up resource fail-count.
5. Check ControlZone and Linux cluster for clean and idle state.
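
Commands for the check and cleanup steps might look like this sketch, with resource rsc_cz_C11 and node node22 as examples:

# crm_mon -1r
# su - mzadmin -c "mzsh status platform"
# crm resource failcount rsc_cz_C11 delete node22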

FILES

/usr/lib/ocf/resource.d/suse/SAPCMControlZone
the resource agent
$HOME/.bashrc, e.g. /home/mzadmin/.bashrc
the mzadmin´s ~/.bashrc, defining JAVA_HOME, MZ_HOME and MZ_PLATFORM
$MZ_HOME, e.g. /opt/cm/
the installation directory of a CM ControlZone service
$MZ_HOME/bin/mzsh
the default mzshell, used as API for managing CM ControlZone services, contains paths and URL
$MZ_HOME/log/
path to logfiles of mzsh as well as platform and UI
$MZ_HOME/tmp/
temporary files and lock files of platform and UI
$JAVA_HOME
the Java virtual machine, used by mzsh

REQUIREMENTS

* Convergent Mediation ControlZone version 9.0.1.1 or higher is installed and configured on both cluster nodes. Either the software is installed once into a shared NFS filesystem and binaries and configuration are then copied into both cluster nodes´ local filesystems, or the software is installed directly on each node. In either case, the local configuration has to be adjusted afterwards. Please refer to Convergent Mediation documentation for details.

* CM ControlZone is configured identically on both cluster nodes. User, path names and environment settings are the same.

* Only one ControlZone instance per Linux cluster. Thus one platform service and one UI service per cluster.

* The platform and UI are installed into the same MZ_HOME.

* Linux shell of the mzadmin user is /bin/bash.

* The mzadmin´s ~/.bashrc inherits MZ_HOME, JAVA_HOME and MZ_PLATFORM from the SAPCMControlZone RA. These variables need to be set as described in the RA´s documentation (i.e. this manual page).

* When called by the resource agent, mzsh connects to CM ControlZone services via network. The service´s virtual hostname or virtual IP address managed by the cluster should not be used when called by RA monitor actions.

* Technical users and groups are defined locally in the Linux system. If users are resolved by a remote service, local caching is necessary. Substitute user (su) to the mzadmin user needs to work reliably and without customized actions or messages.

* Name resolution for hostnames and virtual hostnames is crucial. Hostnames of cluster nodes and services are resolved locally in the Linux system.

* Strict time synchronization between the cluster nodes, e.g. NTP. All nodes of a cluster have the same timezone configured.

* Needed NFS shares (e.g. /usr/sap/<SID>) are mounted statically or by automounter. No client-side write caching. File locking might be configured for application needs.

* The RA monitoring operations have to be active.

* RA runtime almost completely depends on call-outs to controlled resources, OS and Linux cluster. The infrastructure needs to allow these call-outs to return in time.

* The ControlZone services are not started/stopped by OS. Thus there is no SystemV, systemd or cron job.

* As long as a ControlZone service is managed by the Linux cluster, the service is not started/stopped/moved from outside, thus no manual actions are done. The Linux cluster does not prevent administrative mistakes. However, if the Linux cluster detects the application running at both sites in parallel, it will stop both and restart one.

* The interface for the RA to the ControlZone services is the command mzsh. Ideally, mzsh should be accessed on the cluster nodes´ local filesystems. The mzsh is called with the arguments startup, shutdown and status. Its return code and output are interpreted by the RA. Thus the command and its output need to be stable. The mzsh shall not be customized. Particularly, environment variables set through ~/.bashrc must not be changed.

* The mzsh is called on the active node with a defined interval for regular resource monitor operations. It is also called on the active or passive node in certain situations. Those calls might run in parallel.

BUGS

In case of any problem, please use your favourite SAP support process to open a request for the component BC-OP-LNX-SUSE.
Please report feedback and suggestions to feedback@suse.com.

SEE ALSO

SAPCMControlZone_basic_cluster(7), SAPCMControlZone_maintenance_examples(7), ocf_heartbeat_IPaddr2(7), ocf_heartbeat_Filesystem(7), crm(8), crm_mon(8), cs_show_cluster_actions(8), nfs(5), mount(8), bash(1),
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-ocf-return-codes.html ,
https://infozone.atlassian.net/wiki/spaces/MD9/pages/4881672/mzsh ,
https://infozone.atlassian.net/wiki/spaces/MD9/pages/4849693/Setting+Environment+Variables+for+Platform ,
https://documentation.suse.com/sbp/sap/ ,
https://documentation.suse.com/#sle-ha ,
https://www.suse.com/support/kb/doc/?id=000019138 ,
https://www.suse.com/support/kb/doc/?id=000019514 ,
https://www.suse.com/support/kb/doc/?id=000019722 ,
https://launchpad.support.sap.com/#/notes/1552925 ,
https://launchpad.support.sap.com/#/notes/3079845

AUTHORS

F.Herschel, L.Pinne

COPYRIGHT

(c) 2023-2024 SUSE LLC
SAPCMControlZone comes with ABSOLUTELY NO WARRANTY.
For details see the GNU General Public License at http://www.gnu.org/licenses/gpl.html

15 Apr 2024