SAPHanaSR(7)                  SAPHanaSR-angi                  SAPHanaSR(7)
NAME
SAPHanaSR - Automating SAP HANA system replication in scale-up setups.
DESCRIPTION
Overview
This manual page SAPHanaSR provides information for setting up and managing automation of SAP HANA system replication (SR) in scale-up setups. For scale-out, please refer to SAPHanaSR-ScaleOut(7), see also SAPHanaSR-angi(7).
System replication replicates the database data from one site to another site in order to compensate for database failures. With this mode of operation, the internal SAP HANA high-availability (HA) mechanisms and the Linux cluster have to work together.
The SAPHanaController RA performs the actual check of the SAP HANA database instances and is configured as a multi-state resource. Managing the two SAP HANA instances means that the resource agent controls the start/stop of the instances. In addition the resource agent is able to monitor the SAP HANA databases on landscape host configuration level. For this monitoring the resource agent relies on interfaces provided by SAP. As long as the HANA landscape status is not "ERROR", the Linux cluster will not act. The main purpose of the Linux cluster is to handle the takeover to the other site. Only if the HANA landscape status indicates that HANA can not recover from the failure, and the replication is in sync, will the Linux cluster act.
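For illustration, the landscape host configuration status can be queried manually. A minimal sketch, assuming SID HA1 and the command being run as user ha1adm:

    # query HANA's landscape host configuration status
    HDBSettings.sh landscapeHostConfiguration.py; echo RC:$?
    # return codes as documented by SAP: 4=OK, 3=INFO, 2=WARNING,
    # 1=ERROR (down), 0=internal error - only ERROR makes the cluster act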
An important task of the resource agent is to check the synchronisation status of the two SAP HANA databases. If the synchronisation status is not "SOK", the cluster avoids a takeover to the secondary site when the primary fails. This preserves data consistency.
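Both views on the synchronisation can be inspected manually. A sketch, again assuming SID HA1:

    # HANA's own view of the replication, as ha1adm on the primary site
    HDBSettings.sh systemReplicationStatus.py; echo RC:$?
    # the Linux cluster's view, incl. sync state (SOK/SFAIL), as root
    SAPHanaSR-showAttr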
Besides the resource agents, the package contains HADR provider hook scripts. These hook scripts are called by HANA on specific events. One important event is when system replication gets lost. The related hook script informs the HA cluster immediately, without waiting for the next RA monitor. This ensures data consistency even in corner cases. Another use case for hook scripts is allowing manual takeover if the HANA resource is in maintenance mode, but blocking manual takeover otherwise. This prevents dual-primary situations.
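For illustration, enabling the srConnectionChanged() hook script susHanaSR.py in the HANA global.ini might look like the following sketch. See susHanaSR.py(7) for the authoritative syntax and the additionally needed sudoers permissions:

    [ha_dr_provider_sushanasr]
    provider = susHanaSR
    path = /usr/share/SAPHanaSR-angi
    execution_order = 1

    [trace]
    ha_dr_sushanasr = info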
Scenarios
To illustrate the overview above, some important scenarios are described below. This is not a complete description of all possible situations.
1. Site takeover after primary has failed
This is the basic use case for HANA SR automation.
1. site_A is primary, site_B is secondary - they are in sync.
2. site_A crashes.
3. site_B does the takeover and now runs as the new primary. The Linux cluster
always makes sure the node at site_A is shut off before any other action is
initiated.
2. Prevention against dual-primary
A primary absolutely must never be started if the cluster does not know
anything about the other site. On initial cluster start, the cluster needs
to detect a valid HANA system replication setup, including system
replication status (SOK) and last primary timestamp (LPT). This is necessary
to ensure data integrity.
The rationale behind this is shown in the following scenario:
1. site_A is primary, site_B is secondary - they are in sync.
2. site_A crashes (remember that this HANA is still marked as primary).
3. site_B does the takeover and now runs as the new primary.
4. DATA GETS CHANGED ON SITE_B BY PRODUCTION
5. The admin also stops the cluster on site_B (now there are two HANAs,
both down and both internally marked as primary).
6. What would happen, if the admin now restarted the cluster on site_A only?
6.1 site_A would take its own CIB after waiting for the initial fencing
time for site_B.
6.2 It would "see" its own (cold) primary and the fact that there was a
secondary.
6.3 It would start the HANA from the point in time of step 1->2 (the crash),
so all data changed in between would be lost.
This is why the Linux cluster needs to enforce a restart inhibit.
There are two options to get both SAP HANA SR and the Linux
cluster back into a fully functional state:
a) The admin starts both nodes again.
b) If site_B is still down, the admin starts the primary on site_A
manually.
The Linux cluster will follow this administrative decision. In both cases the
administrator should register and start a secondary as soon as possible. This
avoids a full log partition, which would lead to a DATABASE STUCK situation.
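Registering the former primary as new secondary could look like the following sketch. The site name SITE_A, remote host node2, SID HA1 and instance number 10 are assumptions; see SAPHanaSR_maintenance_examples(7) for proven procedures:

    # as ha1adm on the former primary (site_A), while HANA there is down
    hdbnsutil -sr_register --remoteHost=node2 --remoteInstance=10 \
     --replicationMode=sync --operationMode=logreplay --name=SITE_A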
3. Automatic registration as secondary after site failure and takeover
The cluster can be configured to register a former primary database
automatically as secondary. If this option is set, the resource agent will
register a former primary database as secondary during cluster/resource
start.
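Whether this option is active can be queried on the SAPHanaController primitive, e.g. as follows (a sketch; the resource name rsc_SAPHanaCon_HA1_HDB10 is an assumption):

    # as root, query the current AUTOMATED_REGISTER setting
    crm_resource --resource rsc_SAPHanaCon_HA1_HDB10 \
     --get-parameter AUTOMATED_REGISTER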
4. Site takeover not preferred over local re-start
The SAPHanaController RA allows you to configure whether you prefer a takeover
to the secondary site after the primary landscape fails. The alternative is to
restart the primary landscape if it fails, and to take over only when no local
restart is possible anymore. See also ocf_suse_SAPHanaController(7).
The current implementation only allows a takeover in case the landscape
status reports 1 (ERROR). The cluster will not take over while SAP HANA
still tries to repair a local failure.
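Both behaviours, takeover preference and automatic registration, are controlled by RA parameters. A minimal crm configuration sketch for the promotable SAPHanaController resource follows; all names, timeouts and values are examples, not a definitive setup (see ocf_suse_SAPHanaController(7) and the SUSE setup guides):

    primitive rsc_SAPHanaCon_HA1_HDB10 ocf:suse:SAPHanaController \
     op start interval=0 timeout=3600 \
     op stop interval=0 timeout=3600 \
     op promote interval=0 timeout=900 \
     op monitor interval=60 role=Promoted timeout=700 \
     op monitor interval=61 role=Unpromoted timeout=700 \
     params SID=HA1 InstanceNumber=10 PREFER_SITE_TAKEOVER=true \
      DUPLICATE_PRIMARY_TIMEOUT=7200 AUTOMATED_REGISTER=false

    clone mst_SAPHanaCon_HA1_HDB10 rsc_SAPHanaCon_HA1_HDB10 \
     meta clone-node-max=1 promotable=true interleave=true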
Implementation
The two HANA database systems (primary and secondary site) are managed by the same single Linux cluster. The number of nodes in that single Linux cluster is usually two.
The HANA consists of two sites with one node each.
Note: A third HANA site can be added, with another system replication (aka
multi-target replication). This will be tolerated by the cluster.
Nevertheless, these additional pieces are not managed by the cluster.
Therefore, in multi-target setups the AUTOMATED_REGISTER feature needs the
register_secondaries_on_takeover feature of HANA.
A common STONITH mechanism is set up for all nodes across all the sites.
Since the IP address of the primary HANA database system is managed by the cluster, only that single IP address is needed to access any nameserver candidate.
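A sketch of the supporting resources and constraints, complementing the SAPHanaController sketch above; again, all names and values are assumptions (see ocf_suse_SAPHanaTopology(7), ocf_heartbeat_IPaddr2(7) and SAPHanaSR_basic_cluster(7)):

    primitive rsc_SAPHanaTop_HA1_HDB10 ocf:suse:SAPHanaTopology \
     op monitor interval=50 timeout=600 \
     params SID=HA1 InstanceNumber=10

    clone cln_SAPHanaTop_HA1_HDB10 rsc_SAPHanaTop_HA1_HDB10 \
     meta clone-node-max=1 interleave=true

    primitive rsc_ip_HA1_HDB10 ocf:heartbeat:IPaddr2 \
     op monitor interval=10 timeout=20 params ip=192.168.1.10

    # the IP address follows the promoted (primary) instance
    colocation col_ip_with_prim 2000: rsc_ip_HA1_HDB10:Started \
     mst_SAPHanaCon_HA1_HDB10:Promoted

    # start the topology agent before the controller
    order ord_top_first Optional: cln_SAPHanaTop_HA1_HDB10 \
     mst_SAPHanaCon_HA1_HDB10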
Best Practice
* Use two independent corosync rings, at least one of them on a bonded network, resulting in at least three physical links. Unicast is preferred.
* Use Stonith Block Device (SBD) with shared LUNs across all nodes on all sites, together with a hardware watchdog.
* Align all timeouts in the Linux cluster with the timeouts of the underlying infrastructure, particularly network, storage and multipathing.
* Check the installation of OS and Linux cluster on all nodes before doing any functional tests.
* Carefully define, perform, and document tests for all scenarios that should be covered, as well as all maintenance procedures.
* Test HANA features without Linux cluster before doing the overall cluster tests.
* Test basic Linux cluster features without HANA before doing the overall cluster tests.
* Be patient. For detecting the overall HANA status, the Linux cluster needs a certain amount of time, depending on the HANA and the configured intervals and timeouts.
* Before doing anything, always check the Linux cluster for idle status, left-over migration constraints, and resource failures, as well as the HANA landscape status and the HANA SR status (see the command sketch after this list).
* Manually activating an HANA primary creates risk of a dual-primary situation. The user is responsible for data integrity. See also susTkOver.py(7).
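A minimal pre-check sketch, combining the commands used throughout this man page (names as assumed above):

    # as root: cluster idle state, resources and failures
    crm_mon -1r
    # as root: left-over migration constraints show up as cli-* rules
    crm configure show | grep -e cli-ban -e cli-prefer
    # as root: CIB attributes, incl. sync state and srHook
    SAPHanaSR-showAttr
    # as ha1adm: HANA landscape status and SR status
    HDBSettings.sh landscapeHostConfiguration.py; echo RC:$?
    HDBSettings.sh systemReplicationStatus.py; echo RC:$?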
REQUIREMENTS
For the current version of the package SAPHanaSR-angi, the scale-up capabilities are limited to the following scenarios and parameters:
1. HANA scale-up cluster with system replication. The two HANA database systems (primary and secondary site) are managed by one Linux cluster. The number of nodes in that single Linux cluster is two. Note: A three-node Linux cluster is possible, but not covered by current best practices.
2. Technical users and groups such as sidadm are defined locally in the Linux system. If users are resolved by remote service, local caching is necessary. Substitute user (su) to sidadm needs to work reliably and without customized actions or messages. The supported shell is bash.
3. Strict time synchronization between the cluster nodes is needed, e.g. NTP. All nodes of the Linux cluster are configured with the same timezone.
4. For scale-up, the following SAP HANA SR scenarios are possible with the SAPHanaSR-angi package:
4.1 performance-optimized (memory preload on secondary)
4.2 cost-optimized (e.g. with QA on secondary)
4.3 multi-tier with HANA 2.0 (replication chaining, third node connected to
second one)
4.4 multi-target with HANA 2.0 (star replication, third node connected to
primary one)
4.5 single-tenant or multi-tenant (MDC) for all of the above
4.6 multiple independent HANA SR pairs (MCOS) in one cluster
Note: For MCOS, there must be no constraints between HANA SR pairs.
5. Only one system replication between the two SAP HANA databases in the Linux cluster. Maximum one system replication to an HANA database outside the Linux cluster. See also item 12 below.
6. The replication mode is either sync or syncmem for the controlled replication. Replication mode async is not supported. The operation modes delta_datashipping, logreplay and logreplay_readaccess are supported. The operation mode logreplay is the default (see also the verification sketch after this list).
7. Both SAP HANA database systems have the same SAP Identifier (SID) and Instance Number (INO).
8. Besides SAP HANA you need SAP hostagent installed and started on your system. For SystemV style, the sapinit script needs to be active. For systemd style, the services saphostagent and SAP${SID}_${INO} can stay enabled (see also the verification sketch after this list). Please refer to the OS documentation for the systemd version. Please refer to SAP documentation for the SAP HANA version. Combining a systemd style hostagent with a SystemV style instance is allowed. However, all nodes in one Linux cluster have to use the same style.
9. Automated start of SAP HANA database systems during system boot must be switched off.
10. The RAs' monitoring operations have to be active.
11. Using HA/DR provider hook for srConnectionChanged() by enabling susHanaSR.py or susHanaSrMultiTarget.py is mandatory.
12. For scale-up, the current resource agent supports SAP HANA in system replication beginning with HANA version 2.0 SPS 05.
13. Colocation constraints between the SAPHanaController RA and other resources are allowed only if they do not affect the RA's scoring. The location scoring finally depends on the system replication status and must not be over-ruled by additional constraints. Thus it is not allowed to define rules forcing a SAPHanaController resource to follow another resource.
14. The Linux cluster needs to be up and running to allow HA/DR provider events being written into CIB attributes. The current HANA SR status might differ from CIB srHook attribute after cluster maintenance.
15. Once an HANA system replication site is known to the Linux cluster, that exact site name has to be used whenever the site is registered manually. At any time only one site is configured as primary replication source.
16. Reliable access to the /hana/shared/ filesystem is crucial for HANA and the Linux cluster.
17. HANA feature Secondary Time Travel is not supported.
18. In MDC configurations the HANA database is treated as a single system including all database containers. Therefore, cluster takeover decisions are based on the complete status independent of the status of individual containers.
19. If a third HANA site is connected by system replication, that HANA is not controlled by another SUSE HA cluster. If that third site should work as part of a fall-back HA cluster in DR case, that HA cluster needs to be in standby.
20. RA and srHook runtime almost completely depends on call-outs to controlled resources, OS and Linux cluster. The infrastructure needs to allow these call-outs to return in time.
21. The SAP HANA Fast Restart feature on RAM-tmpfs as well as HANA on persistent memory can be used, as long as they are transparent to SUSE HA.
22. The SAP HANA site name is from 2 up to 32 characters long. It starts with a letter or digit; subsequent characters may also be dashes and underscores.
23. The SAPHanaController RA, the SUSE HA cluster and several SAP components need read/write access and sufficient space in the Linux /tmp filesystem.
24. SAP HANA Native Storage Extension (NSE) is supported. It is important that this feature does not change the HANA topology or interfaces. In contrast to Native Storage Extension, HANA Extension Nodes change the topology and thus are currently not supported. Please refer to SAP documentation for details.
25. The Linux user root's shell is /bin/bash, or completely compatible.
26. No manual actions must be performed on the HANA database while it is controlled by the Linux cluster. All administrative actions need to be aligned with the cluster. See also SAPHanaSR_maintenance_examples(7).
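For items 6 and 8 above, a hedged verification sketch (SID HA1 and instance number 10 assumed):

    # item 6: as ha1adm, show replication mode and operation mode
    hdbnsutil -sr_state
    # item 8: as root, check hostagent and instance services (systemd style)
    systemctl status saphostagent SAPHA1_10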
BUGS
In case of any problem, please use your favourite SAP support process to open a request for the component BC-OP-LNX-SUSE. Please report any other feedback and suggestions to feedback@suse.com.
SEE ALSO
SAPHanaSR-angi(7), SAPHanaSR-ScaleOut(7),
ocf_suse_SAPHanaTopology(7), ocf_suse_SAPHanaController(7),
ocf_heartbeat_IPaddr2(7), SAPHanaSR_basic_cluster(7),
susHanaSR.py(7), susHanaSrMultiTarget.py(7),
susCostOpt.py(7), susTkOver.py(7), susChkSrv.py(7),
SAPHanaSR-monitor(8), SAPHanaSR-showAttr(8),
chrony.conf(5), systemctl(1), stonith(8),
sbd(8), stonith_sbd(7), stonith_admin(8),
crm(8), corosync.conf(5), crm_no_quorum_policy(7),
saptune(8), cs_show_hana_info(8), supportconfig(8),
ha_related_suse_tids(7), ha_related_sap_notes(7),
https://documentation.suse.com/sbp/sap/ ,
https://documentation.suse.com/sles-sap/ ,
https://www.suse.com/releasenotes/ ,
https://www.susecon.com/doc/2015/sessions/TUT19921.pdf ,
https://www.susecon.com/doc/2016/sessions/TUT90846.pdf ,
https://www.suse.com/media/presentation/TUT90846_towards_zero_downtime%20_how_to_maintain_sap_hana_system_replication_clusters.pdf ,
http://scn.sap.com/community/hana-in-memory/blog/2014/04/04/fail-safe-operation-of-sap-hana-suse-extends-its-high-availability-solution ,
http://scn.sap.com/docs/DOC-60334 ,
http://scn.sap.com/community/hana-in-memory/blog/2015/12/14/sap-hana-sps-11-whats-new-ha-and-dr--by-the-sap-hana-academy ,
https://wiki.scn.sap.com/wiki/display/ATopics/HOW+TO+SET+UP+SAPHanaSR+IN+THE+COST+OPTIMIZED+SAP+HANA+SR+SCENARIO+-+PART+I ,
https://blogs.sap.com/2020/01/30/sap-hana-and-persistent-memory/
AUTHORS
A.Briel, F.Herschel, L.Pinne.
COPYRIGHT
(c) 2015-2017 SUSE Linux GmbH, Germany.
(c) 2018-2024 SUSE LLC
The package SAPHanaSR-angi comes with ABSOLUTELY NO WARRANTY.
For details see the GNU General Public License at
http://www.gnu.org/licenses/gpl.html
20 Sep 2024