A highly available filesystem is a critical component for any business. Filesystems come in various types and capabilities, and they tell host computers how and where to store files and data.

Cloud-based filesystems are the new generation of virtual storage devices. They are reliable, scalable, high-performing, and feature-rich.

AWS FSx is a family of highly available filesystems that covers a range of use cases and operating systems.

Scenario

One of our customers had been using AFS drives as storage devices. These drives sync data with each other and are supported by SAP applications. The customer used the Microsoft Failover Cluster with the AFS drives to determine the owner node and the child nodes (host computers). The owner node has the drives visible and usable by the SAP application. In the event of a failure, the Failover Cluster moves the AFS drives to the next available node, providing high availability for the SAP application. Recently, the customer had run into problems when failing over to other nodes. They suffered repeated downtime from these failures, and data was corrupted each time ownership moved from the owner node to a child node. They wanted to move away from AFS drives and migrate to a modern cloud-based filesystem.

They had a unique use case, and I had to understand the characteristics of both the source and the target systems. The filesystem had to be supported by SAP and compatible with the Microsoft Failover Cluster.

Surveillance

To set up an appropriate environment, I had to understand the performance requirements of the Microsoft Failover Cluster and the SAP applications.

The AWS FSx filesystem is highly scalable and high-performing. It allows you to scale the IOPS and the throughput capacity of the filesystem. It was imperative to review the historical IOPS and throughput usage to determine the capacity requirements. Knowing whether the service needs a Multi-AZ (multiple Availability Zone) or Single-AZ deployment is also essential.

Filesystem size, IOPS and throughput capacity, and Multi-AZ or Single-AZ deployment are the key factors in determining the annual cost of the service. To make sure these values were set correctly in the FSx configuration, I recorded monthly performance patterns of the EBS (Elastic Block Store) volumes attached to the nodes or host computers. The next step was to determine which FSx filesystem met our use case and performance requirements without costing more than EBS volumes of the same size.
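The sizing step above can be sketched in Python. This is a minimal illustration, assuming the per-period sums of the EBS CloudWatch metrics VolumeReadOps, VolumeWriteOps, VolumeReadBytes and VolumeWriteBytes have already been exported. The sample numbers, the 300-second period, the 30% headroom, the list of throughput tiers and all function names are assumptions for the example, not the customer's data.

```python
from dataclasses import dataclass

@dataclass
class Datapoint:
    """One CloudWatch period for an EBS volume (Sum statistic of each metric)."""
    read_ops: float
    write_ops: float
    read_bytes: float
    write_bytes: float

def peak_iops(points, period_seconds):
    # Operations per period divided by the period length gives average IOPS for that period.
    return max((p.read_ops + p.write_ops) / period_seconds for p in points)

def peak_throughput_mbps(points, period_seconds):
    # Bytes per period -> MB/s, taking 1 MB as 1,000,000 bytes (assumption).
    return max((p.read_bytes + p.write_bytes) / period_seconds / 1_000_000 for p in points)

def pick_tier(required_mbps, tiers=(128, 256, 512, 1024, 2048)):
    # Smallest tier that covers the requirement; the tier list is an assumption.
    for tier in sorted(tiers):
        if tier >= required_mbps:
            return tier
    raise ValueError("requirement exceeds the largest tier in the list")

# Illustrative datapoints, 300-second periods.
points = [
    Datapoint(read_ops=150_000, write_ops=90_000, read_bytes=12e9, write_bytes=9e9),
    Datapoint(read_ops=60_000, write_ops=30_000, read_bytes=3e9, write_bytes=1.5e9),
]
PERIOD = 300
HEADROOM = 1.3  # 30% safety margin (assumption)

iops = peak_iops(points, PERIOD)
throughput = peak_throughput_mbps(points, PERIOD)
tier = pick_tier(throughput * HEADROOM)
print(f"peak IOPS={iops:.0f}, peak throughput={throughput:.1f} MB/s, tier={tier} MBps")
```

Averaging over a period hides short spikes, so a shorter CloudWatch period gives a more conservative peak.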

Sandbox test

I started by planning to deploy FSx for Windows File Server, one of the FSx filesystems, in our sandbox environment. I considered this service because it could burst its throughput within a 24-hour window and was cost-effective. It allowed me to present an SMB mount to the nodes or host computers, with no third-party tool required to sync data between the nodes. The requirement was to make the filesystem available to the 3 nodes that are part of the Microsoft Failover Cluster. The cluster manager determines which node is the owner node and which are the child nodes.

Once I had presented the FSx for Windows File Server share to the owner node, I configured the Failover Cluster. I realised that the cluster manager does not support an SMB mount as a storage disk. Bummer!

Setback

I had to scrap the idea of using FSx for Windows File Server. Keeping the Microsoft Failover Cluster was a hard requirement for the customer, so I needed to understand the cluster and its best practices well enough to determine which service would be compatible with it.

Start over

I started learning about the Microsoft Failover Cluster, how it works, and its prerequisites. I learned that you use the Failover Cluster Manager to configure the cluster and that you can add multiple nodes or host computers to it. A Failover Cluster needs a secondary IP address associated with each node in the cluster. The nodes should be in different subnets for the failover to execute. And most importantly, it supports NFS and iSCSI mounts on Windows.

So, I had to find an SAP-certified service that supported the iSCSI protocol, was highly scalable, performed well, and was suitable for Windows. And remember, it had to be cost-effective.
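The two network prerequisites mentioned above (a secondary IP address per node, and nodes in different subnets) can be checked before building the cluster. The following is a minimal sketch using only the Python standard library; the node names, CIDR ranges, addresses and the `check_cluster_nodes` helper are hypothetical and would normally come from your own inventory.

```python
from ipaddress import ip_address, ip_network

def check_cluster_nodes(nodes):
    """Return a list of human-readable problems; an empty list means the checks passed.

    nodes maps a node name to a dict with keys
    "subnet" (CIDR string), "primary_ip" and "secondary_ips" (list of strings).
    """
    problems = []
    seen_subnets = {}  # subnet -> first node found in it
    for name, cfg in nodes.items():
        subnet = ip_network(cfg["subnet"])
        if not cfg["secondary_ips"]:
            problems.append(f"{name}: no secondary IP address")
        # Every address assigned to the node must belong to the node's subnet.
        for ip in [cfg["primary_ip"], *cfg["secondary_ips"]]:
            if ip_address(ip) not in subnet:
                problems.append(f"{name}: {ip} is outside {subnet}")
        if subnet in seen_subnets:
            problems.append(f"{name}: shares subnet {subnet} with {seen_subnets[subnet]}")
        else:
            seen_subnets[subnet] = name
    return problems

# Hypothetical three-node layout, one subnet per node.
good_nodes = {
    "node-a": {"subnet": "10.0.1.0/24", "primary_ip": "10.0.1.10", "secondary_ips": ["10.0.1.11"]},
    "node-b": {"subnet": "10.0.2.0/24", "primary_ip": "10.0.2.10", "secondary_ips": ["10.0.2.11"]},
    "node-c": {"subnet": "10.0.3.0/24", "primary_ip": "10.0.3.10", "secondary_ips": ["10.0.3.11"]},
}
for problem in check_cluster_nodes(good_nodes) or ["all checks passed"]:
    print(problem)
```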

Solution

As I mentioned earlier, AWS FSx is a family of filesystems with multiple services available. One of those services, Amazon FSx for NetApp ONTAP, matches the above requirements.

FSx for ONTAP combines the features, performance, capabilities, and API operations of NetApp file systems with the agility, scalability, and simplicity of a fully managed AWS service.

Amazon FSx for NetApp ONTAP provides access to shared file storage over all versions of the Network File System (NFS) and Server Message Block (SMB) protocols and supports multi-protocol access (i.e., concurrent NFS and SMB access) to the same data.

Amazon FSx for NetApp ONTAP provides shared block storage over the iSCSI protocol.

Shared block storage over iSCSI was exactly what I needed to present the filesystem to the Microsoft Failover Cluster as a storage disk.

Setup

CloudFormation template to set up the filesystem.

https://github.com/aws-samples/mltiaz-fsxontap-eks/blob/main/FSxONTAP/FSxONTAP.yaml
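If you prefer to script the deployment instead of using the template, the same inputs map onto the FSx CreateFileSystem API. The sketch below only builds and validates the request dictionary; it does not reproduce the linked template. The subnet IDs are placeholders, the `build_ontap_request` helper is hypothetical, and the throughput tiers and storage bounds are assumptions that should be confirmed against the current FSx documentation.

```python
# Assumed limits; verify against the current Amazon FSx for NetApp ONTAP documentation.
THROUGHPUT_TIERS_MBPS = (128, 256, 512, 1024, 2048)
MIN_STORAGE_GIB = 1024          # 1 TB, as listed in this article
MAX_STORAGE_GIB = 192 * 1024    # 192 TB, as listed in this article

def build_ontap_request(storage_gib, throughput_mbps, subnet_ids, multi_az=True):
    """Build keyword arguments for the FSx create_file_system call (hypothetical helper)."""
    if not MIN_STORAGE_GIB <= storage_gib <= MAX_STORAGE_GIB:
        raise ValueError(f"storage must be {MIN_STORAGE_GIB}-{MAX_STORAGE_GIB} GiB")
    if throughput_mbps not in THROUGHPUT_TIERS_MBPS:
        raise ValueError(f"throughput must be one of {THROUGHPUT_TIERS_MBPS}")
    expected_subnets = 2 if multi_az else 1
    if len(subnet_ids) != expected_subnets:
        raise ValueError(f"expected {expected_subnets} subnet ID(s)")

    ontap = {
        "DeploymentType": "MULTI_AZ_1" if multi_az else "SINGLE_AZ_1",
        "ThroughputCapacity": throughput_mbps,
    }
    if multi_az:
        # A Multi-AZ filesystem names one of its two subnets as the preferred one.
        ontap["PreferredSubnetId"] = subnet_ids[0]

    return {
        "FileSystemType": "ONTAP",
        "StorageCapacity": storage_gib,
        "StorageType": "SSD",
        "SubnetIds": list(subnet_ids),
        "OntapConfiguration": ontap,
    }

# Placeholder subnet IDs; replace with real ones from your VPC.
request = build_ontap_request(1024, 128, ["subnet-aaaa1111", "subnet-bbbb2222"])
print(request)

# To submit the request (needs boto3 and AWS credentials):
#   import boto3
#   boto3.client("fsx").create_file_system(**request)
```

Keeping the validation separate from the API call makes it possible to review the request, or test it, without touching an AWS account.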

A step-by-step guide to creating the iSCSI LUN (Logical Unit Number).

https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/create-iscsi-lun.html

A step-by-step guide to mounting the LUNs to Windows OS.

https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/mount-iscsi-windows.html

Migrate data from old drives to the new filesystem.

https://docs.aws.amazon.com/fsx/latest/WindowsGuide/migrate-files-to-fsx.html

Schema

  • SSD storage ranges from 1 TB to 192 TB.
  • An SVM (Storage Virtual Machine) is created to host the volumes and LUNs (Logical Unit Numbers) and is joined to your AD (AWS-managed or self-managed).
  • Multiple volumes can be created, depending on the SSD storage pool size.
  • LUNs are created in each volume. Use a volume size at least 5% larger than your LUN size. This margin leaves space for volume snapshots.
  • The minimum throughput capacity is 128 MBps, enough to support 6 LUNs.
  • One LUN per volume is recommended, but a volume can hold more than one LUN.
  • Restoring or resizing a LUN requires increasing the volume size and, if necessary, the SSD storage size.
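The sizing rules in the list above lend themselves to a quick calculation. This is a minimal sketch, assuming one LUN per volume, the 5% snapshot margin, and volumes that are fully reserved from the SSD pool (no thin provisioning). The LUN sizes, the pool size and the function names are made up for the example.

```python
def min_volume_size_gib(lun_size_gib, margin_percent=5):
    # Integer ceiling of lun_size * (100 + margin) / 100, avoiding float rounding.
    return -(-lun_size_gib * (100 + margin_percent) // 100)

def plan_volumes(lun_sizes_gib, ssd_pool_gib, margin_percent=5):
    """Return (volume sizes, total, fits) for a one-LUN-per-volume layout."""
    volumes = [min_volume_size_gib(size, margin_percent) for size in lun_sizes_gib]
    total = sum(volumes)
    return volumes, total, total <= ssd_pool_gib

# Hypothetical LUN sizes (GiB) on the smallest SSD pool from the list above.
volumes, total, fits = plan_volumes([500, 200, 100], ssd_pool_gib=1024)
print(f"volumes={volumes} GiB, total={total} GiB, fits in pool: {fits}")
```

Running the same function with a larger LUN size shows when a resize would also require growing the SSD storage pool.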

Success

After many days of planning and testing, the FSx for NetApp ONTAP service has been deployed in the customer's prod and non-prod environments. Migration downtime was 1 hour in each environment. Since the migration, we have had no outstanding issues related to the Failover Cluster, and the SAP application is working as expected without performance issues. I am glad we suggested this service to the customer.