SAPHanaSR

NAME

susChkSrv.py - Provider for SAP HANA srHook method srServiceStateChanged().

DESCRIPTION

susChkSrv.py can be used to provide a script for the SAP HANA srHook method srServiceStateChanged().

The SAP HANA nameserver provides a Python-based API ("HA/DR providers"), which is called at important points of the host auto-failover and system replication takeover processes. These so called hooks can be used for arbitrary operations that need to be executed. The method srServiceStateChanged() is called when HANA processes are failing, starting or stopping.

The purpose of susChkSrv.py is to detect failing HANA indexserver processes and trigger a fast takeover to the secondary site. With a regular configuration of a HANA database, the resource agent (RA) for HANA in a Linux cluster does not trigger a takeover to the secondary site when:
- A software failure causes one or more HANA processes to be restarted in place by the HANA daemon (hdbdaemon).
- A hardware error (e.g. SIGBUS from an uncorrectable memory error) causes the indexserver to restart locally.
See also SAPHanaSR(7) or SAPHanaSR-ScaleOut(7).

The hook script susChkSrv.py is called on any srServiceStateChanged() event. The script checks for 'isIndexserver and serviceRestart and serviceWasActiveBefore and hostActive and databaseActive'. If all of these conditions are met, it executes the predefined action. As soon as the HANA landscapeHostConfiguration status changes to 1, the Linux cluster will take action. The action depends on HANA system replication status and the RA's configuration parameters PREFER_SITE_TAKEOVER and AUTOMATED_REGISTER, see manual page ocf_suse_SAPHana(7) or ocf_suse_SAPHanaController(7).
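
The check described above can be pictured as a single boolean expression. The following is a minimal sketch, not the code shipped in susChkSrv.py: the function name is_lost_indexserver and the dictionary layout are assumptions made for illustration, only the five flag names are taken from the text above.

```python
# Hypothetical sketch: function name and dict layout are illustrative only.
# The five flag names are the ones listed in the DESCRIPTION.
REQUIRED_FLAGS = (
    "isIndexserver",
    "serviceRestart",
    "serviceWasActiveBefore",
    "hostActive",
    "databaseActive",
)


def is_lost_indexserver(flags):
    """Return True only if every required flag is present and true."""
    return all(bool(flags.get(name, False)) for name in REQUIRED_FLAGS)


# All flags set: the predefined action would be executed.
event = dict.fromkeys(REQUIRED_FLAGS, True)
print(is_lost_indexserver(event))

# One flag not set: the event is not treated as a lost indexserver.
event["serviceWasActiveBefore"] = False
print(is_lost_indexserver(event))
```

Because the flags are combined with a logical AND, a single unmet condition is enough to suppress the action.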

Customising of HANA daemon timeout parameters might be needed for adapting the solution to a given environment. Please refer to SAP HANA documentation.

This hook script needs to be installed, configured and activated on all HANA nodes.

SUPPORTED PARAMETERS

* The "HA/DR providers" API accepts the following parameters for the ha_dr_provider_suschksrv section in global.ini:

[ha_dr_provider_suschksrv]
provider = susChkSrv: Mandatory. Must not be changed.
path = /usr/share/SAPHanaSR-angi: Mandatory. Delivered within RPM package. Please change only if requested.
execution_order = [ INTEGER ]: Mandatory. Order might depend on other hook scripts.
action_on_lost = [ ignore | stop | kill | fence ]: Action to be processed when a lost indexserver is identified.
- ignore: do nothing, just write to tracefiles.
- stop: do 'sapcontrol ... StopSystem'. If this is combined with SAPHana or SAPHanaController RA parameter 'AUTOMATED_REGISTER=true', HANA needs to release all OS resources prior to the automated registering. See also manual page ocf_suse_SAPHanaController(7).
- kill: do 'HDB kill-<signal>'. The signal can be defined by parameter 'kill_signal'. If this is combined with SAPHanaController RA parameter 'AUTOMATED_REGISTER=true', HANA needs to release all OS resources prior to the automated registering.
- fence: do 'crm node fence <host>'. This needs a Linux cluster STONITH method and sudo permission. This action is primarily meant for scale-up. For scale-out, SAPHanaSR-alert-fencing should be configured additionally, see manual page SAPHanaSR-alert-fencing(8) for details.
Optional. Default is ignore.
kill_signal = [ INTEGER ]: Signal to be used with 'HDB kill-<signal>'.
Optional. Default is 9.
stop_timeout = [ INTEGER ]: How many seconds to wait for 'sapcontrol ... StopSystem' to return. Should be greater than value of HANA parameter 'forcedterminationtimeout'. See also SAPHanaSR_basic_cluster(7).
Optional. Default is 20 seconds.
* The "HA/DR providers" API accepts the following parameter for the trace section in global.ini:
[trace]
ha_dr_suschksrv = [ info | debug ]: Optional. Default is info. Will be added automatically if not set.

* The HANA daemon TODO for the daemon section of daemon.ini:

[daemon]

terminationtimeout = [ INTEGER ]: See also SAPHanaSR_basic_cluster(7). Optional. Timeout in milliseconds. Default is 30000.
forcedterminationtimeout = [ INTEGER ]: See also SAPHanaSR_basic_cluster(7). Optional. Timeout in milliseconds. Default is 270000.

* The HANA daemon TODO for the indexserver.<tenant> section of daemon.ini:

[indexserver.<tenant>]
gracetime = [ INTEGER ]: TODO Should be 6000.
Optional. Timeout in milliseconds. Default is 2000.
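
To illustrate how the mandatory and optional parameters of the ha_dr_provider_suschksrv section combine, here is a minimal sketch that parses global.ini text and applies the defaults documented above. It is not how SAP HANA or susChkSrv.py reads the file. The helper name read_suschksrv_options is an assumption, and Python's configparser is used only as a stand-in parser.

```python
import configparser

SECTION = "ha_dr_provider_suschksrv"
MANDATORY = ("provider", "path", "execution_order")
# Defaults as documented in SUPPORTED PARAMETERS.
DEFAULTS = {"action_on_lost": "ignore", "kill_signal": "9", "stop_timeout": "20"}
VALID_ACTIONS = ("ignore", "stop", "kill", "fence")


def read_suschksrv_options(ini_text):
    """Hypothetical helper: return the section's options with defaults applied."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    if not parser.has_section(SECTION):
        raise ValueError(f"section [{SECTION}] is missing")
    opts = {**DEFAULTS, **dict(parser.items(SECTION))}
    missing = [key for key in MANDATORY if key not in opts]
    if missing:
        raise ValueError(f"mandatory parameters missing: {missing}")
    if opts["action_on_lost"] not in VALID_ACTIONS:
        raise ValueError(f"unsupported action_on_lost: {opts['action_on_lost']}")
    # Integer parameters are converted so they can be compared later.
    for key in ("execution_order", "kill_signal", "stop_timeout"):
        opts[key] = int(opts[key])
    return opts


minimal = """
[ha_dr_provider_suschksrv]
provider = susChkSrv
path = /usr/share/SAPHanaSR-angi
execution_order = 3
"""
print(read_suschksrv_options(minimal))
```

With only the mandatory entries present, the sketch falls back to action_on_lost=ignore, kill_signal=9 and stop_timeout=20.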

RETURN CODES

0 Successful program execution.
>0 Usage, syntax or execution errors.

EXAMPLES

* Example for minimal entry in SAP HANA scale-up global configuration /hana/shared/$SID/global/hdb/custom/config/global.ini

In case of a failing indexserver, the event is logged. No action is performed. The section ha_dr_provider_suschksrv is needed on all HANA nodes. The HANA has to be stopped before the file can be changed.

[ha_dr_provider_suschksrv]
provider = susChkSrv
path = /usr/share/SAPHanaSR-angi/
execution_order = 3

* Example for entry in SAP HANA scale-up global configuration /hana/shared/HA1/global/hdb/custom/config/global.ini

In case of a failing indexserver, the complete node will get fenced by calling 'crm node fence ...'. Unlike fence actions from inside the Linux HA cluster, this hook script fence action will be issued even when the cluster is in maintenance and stonith is temporarily disabled. The fence request then is queued until the cluster is set back into active state. See example below for removing queued fence actions.
The section ha_dr_provider_suschksrv is needed on all HANA nodes. The HANA has to be stopped before the file can be changed. Alternatively, use SAPHanaSR-manageProvider for applying an HA/DR provider hook configuration, see manual page SAPHanaSR-manageProvider(8).

[ha_dr_provider_suschksrv]
provider = susChkSrv
path = /usr/share/SAPHanaSR-angi/
execution_order = 2
action_on_lost = fence

* Example for entry in SAP HANA scale-out global configuration /hana/shared/HA1/global/hdb/custom/config/global.ini

In case of a failing indexserver, the HANA instance will be stopped with 'sapcontrol ... StopSystem'. HANA timeouts might be adapted to speed up the stop. If this is combined with SAPHanaController parameter 'AUTOMATED_REGISTER=true', HANA needs to release all OS resources prior to the automated registering.
The hook script should wait at most 25 seconds for the sapcontrol command to return.
The section ha_dr_provider_suschksrv is needed on all HANA nodes. The HANA has to be stopped before the file can be changed.
Note: HANA scale-out is supported only with exactly one master nameserver. No HANA host auto-failover.

[ha_dr_provider_suschksrv]
provider = susChkSrv
path = /usr/share/SAPHanaSR-angi/
execution_order = 2
action_on_lost = stop
stop_timeout = 25

* Example for entry in SAP HANA daemon configuration /hana/shared/HA1/global/hdb/custom/config/daemon.ini

TODO Example SID is HA1, tenant is HA1.
The sections daemon and indexserver.HA1 are needed on all HANA nodes. The HANA has to be stopped before the file can be changed. Please refer to SAP documentation before setting these parameters.

[daemon]
terminationtimeout = 45000
forcedterminationtimeout = 15000

[indexserver.HA1]
gracetime = 6000
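
The SUPPORTED PARAMETERS section states that stop_timeout (given in seconds) should be greater than the HANA parameter forcedterminationtimeout (given in milliseconds). The following minimal sketch shows that comparison with the unit conversion made explicit. The helper name is hypothetical, and the values are the ones used in the examples of this page.

```python
def stop_timeout_is_sufficient(stop_timeout_s, forcedterminationtimeout_ms):
    """Hypothetical check: stop_timeout is in seconds, the daemon.ini value in milliseconds."""
    return stop_timeout_s * 1000 > forcedterminationtimeout_ms


# Values from the scale-out global.ini example and the daemon.ini example.
print(stop_timeout_is_sufficient(25, 15000))

# Documented defaults: 20 seconds versus 270000 milliseconds.
print(stop_timeout_is_sufficient(20, 270000))
```

Going by the arithmetic alone, the example values satisfy the recommendation while the documented defaults do not, so it may be worth reviewing both parameters together when 'action_on_lost=stop' is used.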

* Example for sudo permissions in /etc/sudoers.d/SAPHanaSR

SID is HA1. See also manual page SAPHanaSR-hookHelper(8).

# SAPHanaSR needs for susChkSrv
ha1adm ALL=(ALL) NOPASSWD: /usr/bin/SAPHanaSR-hookHelper --sid=HA1 --case=fenceMe
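
For a SID other than HA1, the sudoers line has to be adapted in two places: the ${sid}adm user name in lower case and the --sid argument in upper case. A minimal sketch of that substitution follows. The function name sudoers_entry is hypothetical, and the check reflects the usual three-character alphanumeric SID convention.

```python
def sudoers_entry(sid):
    """Hypothetical helper: build the susChkSrv sudoers line for a given SID."""
    sid = sid.upper()
    # SAP system IDs are three alphanumeric characters starting with a letter.
    if len(sid) != 3 or not sid.isalnum() or not sid[0].isalpha():
        raise ValueError(f"not a valid SID: {sid!r}")
    return (
        f"{sid.lower()}adm ALL=(ALL) NOPASSWD: "
        f"/usr/bin/SAPHanaSR-hookHelper --sid={sid} --case=fenceMe"
    )


print(sudoers_entry("HA1"))
```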

* Example for looking up the sudo permission for the hook script.

All related files (/etc/sudoers and /etc/sudoers.d/*) are scanned. Example SID is HA1.

# sudo -U ha1adm -l | grep "NOPASSWD.*/usr/bin/SAPHanaSR-hookHelper"

* Example for checking the HANA tracefiles for srServiceStateChanged() events.

Example SID is HA1. To be executed on the respective HANA master nameserver.
If the HANA nameserver process is killed, in some cases hook script actions do not make it into the nameserver tracefile. In such cases the hook script's own tracefile might help, see respective example.

# su - ha1adm
~> cdtrace
~> grep susChkSrv.*srServiceStateChanged nameserver_*.trc
~> grep -C2 Executed.*StopSystem nameserver_*.trc

* Example for checking the HANA tracefiles for when the hook script has been loaded.

Example SID is HA1. To be executed on both sites' master nameservers.

# su - ha1adm
~> cdtrace
~> grep HADR.*load.*susChkSrv nameserver_*.trc
~> grep susChkSrv.init nameserver_*.trc

* Example for checking the hook script tracefile for actions.

Example SID is HA1. To be executed on all nodes. Each incident is logged on the node where it happens.

* Example for checking the hook script tracefile for node fence actions.

Example SID is HA1. To be executed on both sites' master nameservers. See also manual page SAPHanaSR-hookHelper(8).

# su - ha1adm
~> cdtrace
~> grep fence.node nameserver_suschksrv.trc

* Example for revoking a queued fence request from the Linux cluster.

This could be done if a HANA indexserver failure has triggered a node fence action while the Linux cluster is in maintenance. Before revoking a fence request, be sure it has been issued by the HA/DR provider hook script. See example above for checking the hook script tracefile for node fence actions. Example node is node2. To be executed on that node. See also manual pages SAPHanaSR-hookHelper(8) and crm_attribute(8).
Note: This removes the node attribute terminate=true from the Linux cluster CIB. It does not touch any fencing device.

# grep fenced:.termination.was.requested /var/log/pacemaker/pacemaker.log
# crm_attribute -t status -N 'node2' -D -n terminate
# crm_attribute -t status -N 'node2' -G -n terminate
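
When the revoke has to be repeated for different nodes, the two crm_attribute calls above can be assembled programmatically. This is a minimal sketch. The function name is hypothetical, and only the options shown in the example above are used.

```python
import shlex


def terminate_attribute_cmd(node, mode):
    """Hypothetical helper: build the crm_attribute call from the example above.

    mode is "-D" to delete the terminate attribute or "-G" to query it.
    """
    if mode not in ("-D", "-G"):
        raise ValueError("mode must be -D or -G")
    return ["crm_attribute", "-t", "status", "-N", node, mode, "-n", "terminate"]


for mode in ("-D", "-G"):
    print(shlex.join(terminate_attribute_cmd("node2", mode)))
```

The argument lists could be passed to subprocess.run() by the root user. The sketch only prints them, so nothing is changed in the cluster.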

* Example for killing HANA hdbindexserver process.

This could be done for testing the HA/DR provider hook script integration. Killing HANA processes is dangerous. This test should not be done on production systems. Please refer to SAP HANA documentation. See also manual page killall(1).
Note: Understand the impact before trying.

1. Check HANA and Linux cluster for clean idle state.

2. On the secondary master nameserver, kill the hdbindexserver process.

# killall -9 hdbindexserver

3. Check the nameserver tracefile for srServiceStateChanged() events.

4. Check HANA and Linux cluster for clean idle state.

FILES

/usr/share/SAPHanaSR-angi/susChkSrv.py: the hook provider, delivered with the RPM
/usr/bin/SAPHanaSR-hookHelper: the external script for node fencing
/etc/sudoers, /etc/sudoers.d/*: the sudo permissions configuration
/hana/shared/$SID/global/hdb/custom/config/global.ini: the on-disk representation of HANA global system configuration
/hana/shared/$SID/global/hdb/custom/config/daemon.ini: the on-disk representation of HANA daemon configuration
/usr/sap/$SID/HDB$nr/$HOST/trace: path to HANA tracefiles
/usr/sap/$SID/HDB$nr/$HOST/trace/nameserver_suschksrv.trc: HADR provider hook script tracefile
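
The placeholders $SID, $nr and $HOST in the paths above expand per system. A minimal sketch of that expansion follows. The function names, the instance number 10 and the host name node1 are assumptions for illustration, and the instance number is formatted with two digits.

```python
from pathlib import PurePosixPath


def hook_tracefile(sid, instance_nr, host):
    """Hypothetical helper: path of the hook script tracefile on one node."""
    return (
        PurePosixPath("/usr/sap") / sid / f"HDB{instance_nr:02d}" / host
        / "trace" / "nameserver_suschksrv.trc"
    )


def custom_config_file(sid, name):
    """Hypothetical helper: path of global.ini or daemon.ini in the custom config."""
    return PurePosixPath("/hana/shared") / sid / "global/hdb/custom/config" / name


print(hook_tracefile("HA1", 10, "node1"))
print(custom_config_file("HA1", "global.ini"))
```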

REQUIREMENTS

1. SAP HANA 2.0 SPS05 or later provides the HA/DR provider hook method srServiceStateChanged() with needed parameters.

2. No other HADR provider hook script should be configured for the srServiceStateChanged() method. Hook scripts for other methods, provided in SAPHanaSR and SAPHanaSR-ScaleOut, can be used in parallel to susChkSrv.py, unless documented otherwise.

3. The user ${sid}adm needs execution permission as user root for the command SAPHanaSR-hookHelper.

4. The hook provider needs to be added to the HANA global configuration, in memory and on disk (in persistence).

5. HANA daemon timeout TODO

6. The hook script runs in the HANA nameserver. It runs on the node where the event srServiceStateChanged() occurs.

7. HANA scale-out is supported only with exactly one master nameserver. HANA host auto-failover is not supported. Thus no standby nodes.

8. A Linux cluster STONITH method for all nodes is needed, particularly if susChkSrv.py parameter 'action_on_lost=fence' is set.

9. If susChkSrv.py parameter 'action_on_lost=stop' is set and the RA SAPHana or SAPHanaController parameter 'AUTOMATED_REGISTER=true' is set, it depends on HANA to release all OS resources prior to the registering attempt.

10. For HANA scale-out, the susChkSrv.py parameter 'action_on_lost=fence' should be used only if SAPHanaSR-alert-fencing is configured.

11. If the hook provider should be pre-compiled, the particular Python version that comes with SAP HANA has to be used.

BUGS

In case of any problem, please use your favourite SAP support process to open a request for the component BC-OP-LNX-SUSE. Please report any other feedback and suggestions to feedback@suse.com.

AUTHORS

A.Briel, F.Herschel, L.Pinne.

COPYRIGHT

(c) 2022-2024 SUSE LLC
susChkSrv.py comes with ABSOLUTELY NO WARRANTY.
For details see the GNU General Public License at http://www.gnu.org/licenses/gpl.html

24 Jun 2024
