It was a lazy Sunday afternoon…

Based on the title, this is already a bad beginning, isn’t it? In the end the issue turned out to be less of a problem than it seemed. It all started, as stated, on a Sunday afternoon, with me casually receiving an email that one node of our NetApp MetroCluster had finished an unplanned takeover. Looking over that information, I noticed a rather cryptic message from “the healthy one”, as I will call it from here on, saying: “METROCLUSTER AUTOMATIC UNPLANNED SWITCHOVER COMPLETE”.

At that moment, I thought to myself: “Cool, this thing actually works.” Since there is no official on-call service, I checked the services from a couple of servers and, after seeing they all worked as they should, just dropped it and had a very nice afternoon drink. ‘Cause why the fuck not!?

The very next morning (bear in mind: Monday time): ok, WTF happened yesterday? Apart from having been shitfaced that very Sunday before you could say “switchover -forced-on-disaster”, I noticed a rather odd thing. The node that reported the takeover was healthy as crap, but I could not reach any of the interfaces of the second node, which resides in our other data center. Yep, we had a two-node MetroCluster and one of our nodes did not give a shit. Sweat, shivers, panic… well, not really, but you get the idea.

Long story short, node #2 was down for an unspecified reason. Thankfully, the console interface worked as it should – which is rather surprising at this customer site. After a little fiddling around and frobnicating a couple of bizzbazzes, I did an unsurprising

ssh admin@netappconsole.da-datacenter.com
console> events all

Which is always a good thing to start with… giving me:

Record 1979: Sun Oct 20 14:13:48.356465 2019 [IPMI.notice]: e605 | 02 | EVT: 6f02ffff | PCM_Status | Assertion Event, "Fault"
 Record 1980: Sun Oct 20 14:25:09.832429 2019 [SP.critical]: Heartbeat stopped
 Record 1981: Sun Oct 20 14:25:09.844134 2019 [Trap Event.warning]: hwassist loss_of_heartbeat (30)
 Record 1982: Sun Oct 20 14:26:06.062343 2019 [ASUP.notice]: First notification email | (HEARTBEAT_LOSS) WARNING | Sent
 Record 1983: Sun Oct 20 14:13:38.000000 2019 [Controller.notice]: Appliance panic. See logs for cause of panic.
 Record 1984: Sun Oct 20 14:26:17.730326 2019 [SP.critical]: Filer Reboots
 Record 1985: Sun Oct 20 14:26:18.528417 2019 [IPMI.notice]: e705 | 02 | EVT: 6f406fff | Sensor 255 | Assertion Event, "Storage OS stop/shutdown"
 Record 1986: Sun Oct 20 14:26:53.204775 2019 [SP.notice]: PCIE Config Info received from BIOS
 Record 1987: Sun Oct 20 14:26:53.212795 2019 [SP.notice]: ScratchPad Config Info received from BIOS
 Record 1988: Sun Oct 20 14:27:02.444664 2019 [IPMI.notice]: e805 | c0 | OEM: f9ff7020ff2b | ManufId: 150300 | Undefined
 Record 1989: Sun Oct 20 14:27:02.544488 2019 [IPMI.notice]: e905 | c0 | OEM: f9ff7020ff2c | ManufId: 150300 | Undefined
 Record 1990: Sun Oct 20 14:27:02.639678 2019 [IPMI.notice]: ea05 | c0 | OEM: f9ff7020ff2d | ManufId: 150300 | Undefined
 Record 1991: Sun Oct 20 14:27:02.732490 2019 [IPMI.notice]: eb05 | c0 | OEM: f9ff7020ff2e | ManufId: 150300 | Undefined
 Record 1992: Sun Oct 20 14:27:03.180136 2019 [IPMI.notice]: (PUA) Enable power to all PCIe slots
 Record 1993: Sun Oct 20 14:27:00.000000 2019 [SysFW.notice]: BIOS Version: 10.3
 Record 1994: Sun Oct 20 14:27:03.571040 2019 [IPMI.notice]: (PUA) Enable power to all PCIe on board device
 Record 1995: Sun Oct 20 14:27:03.789752 2019 [IPMI.notice]: (PUA) P_stat :slots=0x0,onboard_devs=0x0,final
 Record 1996: Sun Oct 20 14:27:03.789811 2019 [IPMI.notice]: (PUA) Power status of all PCIe slots unchanged
 Record 1997: Sun Oct 20 14:27:14.852740 2019 [IPMI.notice]: ec05 | 02 | EVT: 6fc200ff | System_FW_Status | Assertion Event, "Unspecified"
 Record 1998: Sun Oct 20 14:27:27.631420 2019 [IPMI.notice]: ed05 | 02 | EVT: 6fc201ff | System_FW_Status | Assertion Event, "Memory initialization in progress"
 Record 1999: Sun Oct 20 14:27:41.720717 2019 [IPMI.notice]: ee05 | 02 | EVT: 6fc203ff | System_FW_Status | Assertion Event, "Memory Initialization done"
 Record 2000: Sun Oct 20 14:28:12.816518 2019 [IPMI.notice]: ef05 | 02 | EVT: 6fc220ff | System_FW_Status | Assertion Event, "Bootloader is running"
 Record 2001: Sun Oct 20 14:28:24.128745 2019 [IPMI.notice]: f005 | 02 | EVT: 6fc22fff | System_FW_Status | Assertion Event, "Storage OS Running"
 Record 2002: Sun Oct 20 14:28:43.824487 2019 [IPMI.notice]: f105 | 02 | EVT: 0301ffff | Attn_Sensor1 | Assertion Event, "State Asserted"
 Record 2003: Sun Oct 20 14:30:06.700487 2019 [IPMI.notice]: f205 | 02 | EVT: 0300ffff | Attn_Sensor1 | Assertion Event, "State Deasserted"
 Record 2004: Sun Oct 20 14:31:44.000000 2019 [Controller.notice]: Appliance user command halt.
 Record 2005: Sun Oct 20 14:31:47.639532 2019 [IPMI.notice]: f305 | 02 | EVT: 6f406fff | Sensor 255 | Assertion Event, "Storage OS stop/shutdown"
 Record 2006: Sun Oct 20 14:31:48.504343 2019 [SP.critical]: Filer Reboots
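
(As an aside: if you don’t feel like scrolling through the whole SEL, the SP command line also takes a couple of friendlier variants. The exact subcommands below are from memory, so check “help events” on your firmware before trusting me: “events newest” limits the output to the most recent records, and “events search” greps the log for a keyword.)

console> events newest 20
console> events search panic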

Welp, a panic, a reboot, and now no connectivity? Let’s look at where the node is at:

console> system log

================ Log #9 start time Sun Oct 20 14:27:24 2019
 NetApp Data ONTAP 9.3P10
 Copyright (C) 1992-2018 NetApp.
 All rights reserved.
 
 *
 Press Ctrl-C for Boot Menu. *
 * 
 
 cryptomod_fips: Executing Crypto FIPS Self Tests.
 cryptomod_fips: Crypto FIPS self-test: 'CPU COMPATIBILITY' passed.
 cryptomod_fips: Crypto FIPS self-test: 'AES-128 ECB, AES-256 ECB' passed.
 cryptomod_fips: Crypto FIPS self-test: 'AES-128 CBC, AES-256 CBC' passed.
 cryptomod_fips: Crypto FIPS self-test: 'CTR_DRBG' passed.
 cryptomod_fips: Crypto FIPS self-test: 'SHA1, SHA256, SHA512' passed.
 cryptomod_fips: Crypto FIPS self-test: 'HMAC-SHA1, HMAC-SHA256, HMAC-SHA512' passed.
 cryptomod_fips: Crypto FIPS self-test: 'PBKDF2' passed.
 cryptomod_fips: Crypto FIPS self-test: 'AES-XTS 128, AES-XTS 256' passed.
 cryptomod_fips: Crypto FIPS self-test: 'Self-integrity' passed.
 Attempting to use existing varfs on /dev/nvrd1
 ifconfig: interface e5a does not exist
 ifconfig: interface e5b does not exist
 Oct 20 16:28:22 Power outage protection flash de-staging: 12 cycles
 OS2SP configured successfully
 Oct 20 16:30:51 [node2:fcvi.update.link.state:notice]: FC-VI adapter: Physical link is up on port 5c. 
 Oct 20 16:30:51 [node2:fcvi.update.link.state:notice]: FC-VI adapter: Physical link is up on port 5a. 
 Oct 20 16:30:51 [node2:fcvi.update.link.state:notice]: FC-VI adapter: Physical link is up on port 5b. 
 Oct 20 16:30:51 [node2:fcvi.update.link.state:notice]: FC-VI adapter: Physical link is up on port 5d. 
 Oct 20 16:30:51 [node2:fcvi.qlgc.rmt.link.up:notice]: FC-VI adapter: Link to partner node over port 5a is up. Partner port id = 0x50000, partner node's system id = 537161992.  
 Oct 20 16:30:51 [node2:fcvi.qlgc.rmt.link.up:notice]: FC-VI adapter: Link to partner node over port 5c is up. Partner port id = 0x50100, partner node's system id = 537161992.  
 Oct 20 16:30:51 [node2:fcvi.qlgc.rmt.link.up:notice]: FC-VI adapter: Link to partner node over port 5b is up. Partner port id = 0x60000, partner node's system id = 537161992.  
 Oct 20 16:30:51 [node2:fcvi.qlgc.rmt.link.up:notice]: FC-VI adapter: Link to partner node over port 5d is up. Partner port id = 0x60100, partner node's system id = 537161992.  
 sk_allocate_memory: large allocation, bzero 12150 MB in  1822 ms
 sk_allocate_memory: large allocation, bzero 4096 MB in  612 ms
 sk_allocate_memory: large allocation, bzero 18950 MB in  2834 ms
 sk_allocate_memory: large allocation, bzero 37901 MB in  5683 ms
 cryptomod_fips: Executing Crypto FIPS Self Tests.
 cryptomod_fips: Crypto FIPS self-test: 'CPU COMPATIBILITY' passed.
 cryptomod_fips: Crypto FIPS self-test: 'AES-128 ECB, AES-256 ECB' passed.
 cryptomod_fips: Crypto FIPS self-test: 'AES-128 CBC, AES-256 CBC' passed.
 cryptomod_fips: Crypto FIPS self-test: 'CTR_DRBG' passed.
 cryptomod_fips: Crypto FIPS self-test: 'SHA1, SHA256, SHA512' passed.
 cryptomod_fips: Crypto FIPS self-test: 'HMAC-SHA1, HMAC-SHA256, HMAC-SHA512' passed.
 cryptomod_fips: Crypto FIPS self-test: 'PBKDF2' passed.
 cryptomod_fips: Crypto FIPS self-test: 'AES-XTS 128, AES-XTS 256' passed.
 cryptomod_fips: Crypto FIPS self-test: 'Self-integrity' passed.
 AutoPartAFFDetermination: total_disks: 96 num_internal_disks: 0 num_ssds: 48 num_unknowns: 0 num_mediator_disks: 0 num_not_supported: 0 all_ssd? false
 Oct 20 16:31:42 [node2:mcc.config.bootConfigNotRead:ALERT]: Configuration could not be read from persistent storage due to internal error Failed to retrieve persistent storage (MBX) due to no mailbox disks being found. The system will try to retrieve or set the configuration automatically later, during boot. 
 Oct 20 16:31:42 [node2:raid.assim.tree.noRootVol:error]: No usable root volume was found! 
 disk:L#: rtID/plx/rgid/rgdn nplx/ngrp/ndsk/tot gen/time shtdwn vnm
 WARNING: Neither label appeared to be valid, the labels may be corrupt.
 random_switch_fqdn:8.126L7:  :Invalid Label!
 CRC 0x9d94b571, mgc 0x0, lbl V0, sz 0 , flags 0x0 SPARE
 Halting node. Ensure healing of DR root aggregates has been performed and has completed successfully before booting this node.
 Uptime: 4m11s
 System halting…
 ================ Log #9 end time Sun Oct 20 14:31:47 2019
 ================ Log #10 start time Sun Oct 20 14:31:48 2019
 BIOS Version: 10.3
 Portions Copyright (C) 2014-2017 NetApp, Inc. All Rights Reserved.
 Initializing System Memory …
 Loading Device Drivers …
 Configuring Devices …
 CPU = 2 Processor(s) Detected.
 Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz (CPU 0)
 CPUID: 0x000406F1. Cores per Processor = 18
 Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz (CPU 1)
 CPUID: 0x000406F1. Cores per Processor = 18
 524288 MB System RAM Installed.
 SATA (AHCI) Device: SV9MST6D120GLM41NP
 Boot Loader version 6.0.1 
 Copyright (C) 2000-2003 Broadcom Corporation.
 Portions Copyright (C) 2002-2016 NetApp, Inc. All Rights Reserved.

Not so good, ay? So it clearly says we have to heal the root aggregate. This can be performed from the “healthy one”. Once you try it, though, you suddenly get an error. Have I lost my mojo, or is it something I cannot fix right away, as instructed by the almighty system?

Error: command failed: Heal DR root aggregates cannot be performed in the current DR mode. Run "metrocluster
        node show" to view the DR mode of the local nodes, and run "heal root-aggregates" only in one of the
        following situations:
    1. The DR mode of all local nodes is "heal aggrs completed".
    2. The DR mode of at least one local node is "heal roots failed".
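
The error at least points you in the right direction: the DR mode of the local nodes. So, before touching anything else, a quick sanity check from the healthy site, using the very command the error suggests. The column to stare at is the DR mode, which at this point presumably showed neither “heal aggrs completed” nor “heal roots failed”, hence the refusal:

node1::> metrocluster node show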

Now the important part. We never had data aggregates on node2… the closest thing we had to a data aggregate on node2, well, is this (or at least so I thought):

node1 Aggregates:
Aggregate     Size Available Used% State   #Vols  Nodes            RAID Status
 
 aggr1_node1_FSAS 
            34.15TB    2.49TB   93% online      35 node1         raid_dp,
                                                                    mirrored,
                                                                    normal
 aggr2_node1_ssd 
            62.71TB   13.30TB   79% online      56 node1         raid_dp,
                                                                    mirrored,
                                                                    normal
 root_aggr_node1 
             3.10TB   154.1GB   95% online       1 node1         raid_dp,
                                                                    mirrored,
                                                                    normal
 node2 Switched Over Aggregates:
 Aggregate     Size Available Used% State   #Vols  Nodes            RAID Status
 
 aggr1_node2_MCC 
             3.10TB    3.08TB    1% online       2 node1         raid_dp,
                                                                    mirrored,
                                                                    normal
 root_aggr_node2 
                 0B        0B    0% offline      0 node1         raid_dp,
                                                                    mirrored,
                                                                    normal
 5 entries were displayed.

The little touch of mojo that I referred to earlier comes into play at this very moment. Admittedly, not knowing the whole story and being new to MetroCluster, it took 38 hours before the answer arrived: “You need to heal the data aggregates and then the root aggregates before booting the node.” Well, what do you know, the data aggregates I thought we did not have take precedence and need to be healed first. Customer first, MOFOS!

From here on, piece of cake… of course, I have no idea what caused this fairly new system to fail for the fifth time now (yeah, the first four were disks), but here we go…

node1::> metrocluster heal aggregates 
 [Job 3743] Job is queued: MetroCluster Heal Aggregates Job.                                                     
 Job 3743 is running in the background.

node1::> metrocluster operation show 
   Operation: heal-aggregates
       State: successful
  Start Time: 10/21/2019 18:47:21
    End Time: 10/21/2019 18:47:22
      Errors: -
node1::> metrocluster heal -phase root-aggregates 
 [Job 3744] Job succeeded: Heal Root Aggregates is successful.                                                   
 node1::> metrocluster operation show              
   Operation: heal-root-aggregates
       State: successful
  Start Time: 10/21/2019 18:53:58
    End Time: 10/21/2019 18:53:59
      Errors: -
node1::> aggr show
 Aggregate     Size Available Used% State   #Vols  Nodes            RAID Status
 
 root_aggr_node2 
                  -         -     - unknown      - node1         -
 node1 Aggregates:
 Aggregate     Size Available Used% State   #Vols  Nodes            RAID Status
 
 aggr1_node1_FSAS 
            34.15TB    2.49TB   93% online      35 node1         raid_dp,
                                                                    mirrored,
                                                                    normal
 aggr2_node1_ssd 
            62.71TB   13.30TB   79% online      56 node1         raid_dp,
                                                                    mirrored,
                                                                    normal
 root_aggr_node1 
             3.10TB   154.1GB   95% online       1 node1         raid_dp,
                                                                    mirrored,
                                                                    normal
 node2 Switched Over Aggregates:
 Aggregate     Size Available Used% State   #Vols  Nodes            RAID Status
 
 aggr1_node2_MCC 
             3.10TB    3.08TB    1% online       2 node1         raid_dp,
                                                                    mirrored,
                                                                    normal

The rest? Is history. Just boot the sucker!

 LOADER-A> boot_ontap
 Loading X86_64/freebsd/image2/kernel:0x200000/14339232 0xfacca0/13549928 Entry at 0xffffffff802cdbb0
 Loading X86_64/freebsd/image2/platform.ko:0x1c99000/2806552 0x1f47318/455240 0x1fb6560/558928 
 Starting program at 0xffffffff802cdbb0
 NetApp Data ONTAP 9.3P10
 Copyright (C) 1992-2018 NetApp.
 All rights reserved.
....
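
For completeness: this last bit is the standard MetroCluster procedure rather than a paste from my scrollback, so take it as a sketch. Once the rebooted node comes up waiting for switchback, the finale is a switchback from the healthy site, plus one last check that the operation actually went through:

node1::> metrocluster switchback
node1::> metrocluster operation show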

The moral of the story? Well, fuck me, this was unexpected… Good night to y’all, we’re up again.