Friday, January 21, 2022

Experimental TrueNAS Server Build

More and more, open source software is appearing that does what used to require very expensive hardware and/or software. TrueNAS (with TrueNAS Core, formerly FreeNAS) is one such open source project.

With a view to moving my organisation's data to TrueNAS, I decided to configure an old circa-2012 HPE ProLiant SE1220 with TrueNAS as a test case. It was quite an adventure, and this naturally led to a blog post.

At this point, I'd like to thank Ben Pridmore, of First Nations Media, for productive discussions and suggestions at all stages of this investigation, and for collaboration with the hardware issues.

I think a lot of organisations have old server hardware lying around, and if not, there is incredibly cheap superseded hardware available online to play around with.

First, let's go over some terms and background, because the tech moves quickly, and perhaps, like me, this is the first time you've had to look at SAS or HBAs in any kind of depth. I'll assume you know what RAID is, since that is foundational for the whole post.

SATA

I'm sure everyone is familiar with SATA, which has been the hard drive interface of choice for a long time now. SATA stands for Serial ATA (ATA standing for Advanced Technology Attachment - hello, marketing terminology!).  SATA took over from PATA, or IDE as it was also known, with PATA being Parallel ATA or originally just called ATA before there was a need to distinguish between serial and parallel versions.

A SATA III Interface can deliver 6Gbit/s (600Mbyte/s) with SATA II half that, and SATA I half again (1.5 Gbit/s). SATA can be used only for single drives: one drive per SATA port/cable.
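
The bit rates and byte rates quoted above differ by a factor of ten rather than eight because SATA uses 8b/10b line encoding: each data byte travels as ten bits on the wire. Here is a minimal Python sketch of that arithmetic (the function name is purely illustrative):

```python
def sata_payload_mbytes_per_s(line_rate_gbit_s: float) -> float:
    """Convert a SATA line rate in Gbit/s to usable payload in Mbyte/s.

    Assumes 8b/10b encoding: every 8 data bits occupy 10 bits on the wire.
    """
    bits_on_wire_per_byte = 10
    return line_rate_gbit_s * 1000 / bits_on_wire_per_byte


# Print the payload rate for each SATA generation mentioned in the text.
for name, rate in [("SATA I", 1.5), ("SATA II", 3.0), ("SATA III", 6.0)]:
    print(f"{name}: {rate} Gbit/s -> {sata_payload_mbytes_per_s(rate):.0f} Mbyte/s")
```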

SAS and SAS Expanders

SAS evolved out of SCSI, and fulfils a similar role to SATA, but it's a higher end product used in servers and enterprise hardware. It has multiple channels, better bidirectional throughput, higher signalling voltages (hence greater maximum cable length) and a number of other advantages, and the common speeds are currently 3Gbit/s, 6Gbit/s and 12Gbit/s.

The key hardware item to be aware of is the SAS Expander, which is the basis of any server RAID unit, allowing typically up to 16 SATA connections on a backplane to be connected to a single SAS cable. With that cable plugged into a compatible SAS controller, the OS can access individual drives much as if each were connected to the motherboard via its own SATA controller.

See these good articles/posts for details: 

    http://sasexpanders.com/faq/

    https://www.truenas.com/community/resources/dont-be-afraid-to-be-sas-sy-a-primer-on-basic-sas-and-sata.48/

The question that immediately came to mind was about bottlenecking, given that we are accessing all those drives through one cable. The above article makes the point that most mechanical drives operate at around 140Mbyte/s (1400Mbit/s), and given multiple channels and that only a few drives in the array are likely to be operating at once, in general there is ample bandwidth to avoid saturation.

With SSDs however, the situation is very different. With a typical 500Mbyte/s (5Gbit/s) bandwidth, several SSDs may rapidly saturate a SAS connector. High bandwidth SAS plus low disk numbers may be necessary for smooth operation of an SSD array.
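
A back-of-envelope check of the saturation argument can be sketched in Python. The drive speeds are the ones quoted above; the link figures are assumptions for illustration only: a 4-lane SAS cable at 6Gbit/s per lane, 8b/10b encoding, and the worst case where every drive streams at full speed at once. The function names are hypothetical.

```python
def link_capacity_mbytes(lanes: int, lane_gbit_s: float) -> float:
    """Usable capacity of a multi-lane SAS link in Mbyte/s (assumes 8b/10b encoding)."""
    return lanes * lane_gbit_s * 1000 / 10


def utilisation(drives: int, per_drive_mbytes: float, capacity_mbytes: float) -> float:
    """Fraction of the link consumed if every drive streams at full speed."""
    return drives * per_drive_mbytes / capacity_mbytes


# Assumed link: 4 lanes x 6 Gbit/s. Adjust to match your own cable and HBA.
capacity = link_capacity_mbytes(lanes=4, lane_gbit_s=6.0)
print(f"link capacity: {capacity:.0f} Mbyte/s")
print(f"12 HDDs @ 140 Mbyte/s: {utilisation(12, 140, capacity):.0%} of the link")
print(f"12 SSDs @ 500 Mbyte/s: {utilisation(12, 500, capacity):.0%} of the link")
```

Under these assumptions, twelve mechanical drives need about 70% of the link even in the worst case, while twelve SSDs would ask for two and a half times what it can carry.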

Host Bus Adapters (HBAs)

The controller cards necessary to manage a SAS expander's drives fall into two categories: HBA and RAID. An HBA card transparently connects the drives on the SAS expander to the motherboard and OS - it doesn't try to provide any management layer, caching or additional smarts. Conversely, a RAID card undertakes the management of the drives into a RAID array - this is 'hardware RAID' - and the motherboard and OS often see only one 'logical' drive, or several, depending on how many logical drives have been set up in the RAID configuration. The RAID card manages all aspects of the RAID array and the OS is simply the 'end user', seeing what the RAID card wants it to.

An HBA or RAID card has operating firmware and a separate firmware BIOS (often referred to as the 'SAS BIOS') that can be accessed during startup (just like the motherboard BIOS - it's easy to get confused!). The SAS BIOS can be used to set up things like boot devices (for HBA) or RAID configuration (for RAID cards). 

For many models of card, flashing different versions of the operating firmware and BIOS converts the card's behaviour between HBA and RAID. However, a card designed to work in one mode may not be as reliable in the other.

SAS Card Compatibility with TrueNAS

The HBA/RAID issue is a central one in the TrueNAS forums. ZFS, and hence TrueNAS, is designed to work with direct and full access to the disk hardware through an HBA: TrueNAS is 100% software RAID.

This is at odds with hardware RAID - it is definitely not recommended to use ZFS on top of hardware RAID:

https://www.truenas.com/community/threads/if-i-had-to-use-hardware-raid-which-option-is-more-preferable.77954/

Likewise, you can use a RAID card in JBOD mode and switch off as much RAID functionality as possible, but there will still not be direct access to the individual disks, and this is going to be a red flag.

But remember that a lot of RAID cards can be re-flashed into HBA mode. How about that option?

Unfortunately the consensus is, just because you can flash a particular card as an HBA and it appears to work, doesn't mean that it's a good idea to do so. 

TrueNAS and ZFS can drive the hardware extremely hard during data rebuilds, and this is likely over time to expose any weaknesses in the controller card. 

Here's some of the debate:

https://www.truenas.com/community/resources/whats-all-the-noise-about-hbas-and-why-cant-i-use-a-raid-controller.139/

https://www.truenas.com/community/resources/multiply-your-problems-with-sata-port-multipliers-and-cheap-sata-controllers.177/

The TL;DR of all this is: if you don't want to roll the dice in regard to your data, buy and use a TrueNAS recommended HBA card to replace any RAID card you might have.

The LSI 9211-8i (PCIe 2.0 6Gbit/s), LSI 9207-8i (PCIe 3.0 6Gbit/s) and LSI 9300-8i (PCIe 3.0 12Gbit/s) appear to be the gold standard and available quite cheaply online.

The post states that 'the LSI 9240-8i, IBM ServeRAID M1015, Dell PERC H200 and H310, and others are readily available on the used market and can be converted to LSI 9211-8i equivalents.'

My server contained a RAID card (the HP SmartArray P212), so I ordered a second-hand LSI 9211-8i HBA card online for around US$60.

Anatomy of the Server

First, let's have a quick look at the anatomy of the server in light of the above discussion.

This is a top view of the server. The top area of the picture, inside the green rectangle, is the SAS Expander - an enclosure where the 12 SATA drives go (these are 2TB 7200RPM drives). If you look along the bottom edge of the drives, you can see the edge of a circuit board running along the entire length of the expander. The chassis and circuit board are basically a drop-in unit. They attach to the power supply, all the SATA drives plug directly into the circuit board, and the whole thing plugs into the rest of the server via a single SAS cable.

A photo from the front of the server, showing the drives, follows the one below.

The next block down, inside the aqua rectangle, contains eight fans - of no configuration consequence, but they are very loud on startup.

Inside the purple rectangle is the area for two processors and RAM for each (only one is installed). There is a near invisible clear plastic air-directing cover over this area, to which I've taped a couple of screws during disassembly.

The metal box inside the red rectangle is a PCI extender, containing a SAS controller card and a matched pair of hard drives for use as mirrored system drives for the server (these are also attached by cable to the SAS Expander backplane). The third photo contains detail of what's inside.



Below is the server with the PCI extender box removed. It has been flipped over 180 degrees: when fitted, the PCI connectors, seen from the top in the green rectangle, fit downwards into the two black PCI slots towards the top of the image. 

The LSI 9211-8i, in the tan rectangle, is shown fitted to the PCI extender slot. Note that the only connection to it is the SAS cable from the SAS expander, which is the long cable with the black braided cover. Below, on the table and in the red rectangle, is the removed HP SmartArray P212, with its memory module and battery (some RAID cards have battery-backed RAM to preserve the integrity of their write cache in the event of power failure).

The enclosure for the dual system disks (aqua rectangle) can be seen poking out from underneath the LSI HBA card. It was tempting to remove these from the SAS Expander and plug them directly into two of the six vacant SATA ports on the motherboard, but the enclosure had a small backplane through which power was delivered, and I was unsure what other smarts might be involved. Rather than reroute power and possibly open up a can of worms in regard to the SATA interfaces, I just left these disks alone.


Setting up TrueNAS

After fitting the LSI 9211-8i HBA card and reassembling the PCI extender chassis, I proceeded to install TrueNAS by creating a bootable USB with the latest version as instructed on the TrueNAS site.

The install went smoothly, all drives were detected, and I was able to mark both the system drives for install, ending up with a mirrored system disk configuration.

On rebooting, however, I found that the server would cycle through all the boot options and end up cycling at network boot, which from experience is where the boot cycle goes to die. I checked the motherboard BIOS and it was set to boot from the HBA card, but wasn't detecting anything bootable.

On googling this, it became clear that the problem was that an unconfigured HBA would just try to boot from the first two available drives, which were very likely to be the data drives. It was necessary to boot into the SAS BIOS and configure the boot order.

Configuring Boot Order in the SAS BIOS

At this point, I did not know what firmware version my HBA card was running and had not fired up the SAS BIOS at all. In hindsight, it would have been good to check this before commencing any operations involving the SAS expander (such as the TrueNAS installation!).

As it happened, there was a problem with the SAS BIOS on the card which prevented me from booting into the BIOS to make these configuration changes, but it appears this problem is mainly specific to HP hardware, so for now I'm going to pretend I didn't have this problem and go ahead with the boot configuration as it should have happened (and did happen once the issue was fixed). I'll return to the other problem, which required removing the HBA card and re-flashing it in another computer, in the next section.

Booting the server takes a while, and eventually the screen displays something like 'hit any key for Option ROM'. At this point, there is no message telling you what keys to hit, but you need to hit Ctrl+C to boot into the SAS BIOS. After a pause, there is a message about the LSI configuration tool, and a few more keystrokes and you are in the SAS BIOS screen.

Once in there, you'll see a single line for the SAS expander, and it's necessary to hit enter a few times to expand the disk tree (there are a few useful YouTube videos covering this whole process). Then you'll see the below.

Bays 12 and 13 here are the system disks (the highlighting obscures the details of the bottom one) and we need to mark them as boot and alternate boot using Alt+B and Alt+A. Hitting Alt+M displays a handy instructional screen showing all the special key codes.

Presumably the motherboard was previously trying to boot from Bay 0, which explains the lack of success.



After saving the config, the server booted straight into TrueNAS and after a bit of further configuration, we were up and running.

Problem with BIOS on HP Hardware and Flashing the HBA Card

As mentioned, initially I couldn't fire up the SAS BIOS. When I hit Ctrl+C, after a few seconds I got:

Fatal pci express device error B00/D09/F00 E0

Worrying that I had some PCI problem with the card, I found Googling initially suggested changing card slots. But then, luckily, I ran across this rambling but ultimately very useful post on this exact issue:


It turns out that while the latest firmware for this card (P20) works fine, the P20 BIOS is not compatible with this HP hardware (and clearly a range of other HP models). To get things working, it's necessary to flash the card with the P20 firmware, but the P19 BIOS! (The post has a detailed matrix of firmware and BIOS versions, reproduced below.)

FW          BIOS   DL380 G7                                      DL380 G6
--------------------------------------------------------------------------------------------------------
P19         P19    works (old)                                   works (old)
P20/< .07   P19    data corrupted (!!AVOID!!)                    data corrupted (!!AVOID!!)
P20/< .07   P20    data corrupted/DEATH on CONFIG2 (!!AVOID!!)   data corrupted/DEATH on CONFIG (!!AVOID!!)
P20/.07     P20    works/DEATH on CONFIG2 (AVOID!)               works/DEATH on CONFIG (AVOID!)
P20/.07     P19    works (THIS!)                                 works (THIS!)

P20/< .07 means all 20.00.XX.00 versions of the firmware earlier than 20.00.07.00. BIOS versions follow a different numbering scheme, with P19 = 7.37.00.00 and P20 = 7.39.02.00 (my numbers, there might be others).
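
The matrix above can also be expressed as a small lookup, which is handy if you have more than one card to check. This is only a sketch: the version numbers and verdicts are transcribed from the matrix and its note, the function names are hypothetical, and any firmware or BIOS version not listed there is reported as not covered rather than guessed at.

```python
P19_BIOS = (7, 37, 0, 0)   # BIOS version numbers as quoted in the note above
P20_BIOS = (7, 39, 2, 0)


def parse_version(text):
    """Turn '20.00.07.00' or '07.39.02.00' into a comparable tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def firmware_phase(fw_version):
    """Map a firmware version string onto the row labels used in the matrix."""
    fw = parse_version(fw_version)
    if fw[0] == 19:
        return "P19"
    if fw[0] == 20 and fw < (20, 0, 7, 0):
        return "P20/<.07"
    if fw == (20, 0, 7, 0):
        return "P20/.07"
    return "unknown"


def bios_phase(bios_version):
    """Map a BIOS version string onto P19 or P20 (only the two quoted versions)."""
    bios = parse_version(bios_version)
    if bios == P19_BIOS:
        return "P19"
    if bios == P20_BIOS:
        return "P20"
    return "unknown"


# (firmware, BIOS) -> verdict, paraphrased from the matrix.
MATRIX = {
    ("P19", "P19"): "works (old)",
    ("P20/<.07", "P19"): "data corrupted (avoid)",
    ("P20/<.07", "P20"): "data corrupted / death on config (avoid)",
    ("P20/.07", "P20"): "works / death on config (avoid)",
    ("P20/.07", "P19"): "works (this one)",
}


def verdict(fw_version, bios_version):
    """Look up a firmware/BIOS pair; unknown combinations are flagged, not guessed."""
    key = (firmware_phase(fw_version), bios_phase(bios_version))
    return MATRIX.get(key, "not covered by the matrix")


print(verdict("20.00.07.00", "07.39.02.00"))
print(verdict("20.00.07.00", "07.37.00.00"))
```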

At this point, I had not been able to fire up the BIOS, so I didn't actually know what versions of firmware and BIOS were on the card. I would have to download the manufacturer's drivers and boot from a USB to probe and possibly flash the system.

Now I had another challenge: I did not have Windows installed on the system, and I was doubtful whether I could get it to boot from a DOS disk.

Anyway, I duly headed off to the Broadcom site, and after a bit of searching managed to find the right files. Trying to download Asset Type 'All' broke the web site, and it took me a while to realise that I had to specify 'Firmware' to get a result.



There are basically two files for the firmware update, a '9211-8i ... FW_BIOS ... for_MSDOS_WINDOWS', and an 'Installer .. for_MSDOS_WINDOWS'. There are IT (Initiator Target) and IR (Integrated RAID) versions of the firmware - it's recommended to stick with the IT version for HBA use in a modern environment.

I also found that the FW_BIOS package contained all the files needed from the Installer package, so there was actually no need to download the Installer package.

I've highlighted the files I ended up using in red:


But first, there was the challenge of being able to run the flash tool. I tried creating an MSDOS boot USB and loaded the DOS version of the flash tool onto it, but as suspected, the server hardware would not recognise this.

At this point, it was really not possible to use the server to boot into the flash tool without installing Windows on it. My two options were to find some really old hardware that would allow DOS boot, or to find a Windows machine that I could fit the HBA card to in order to flash it.

Luckily, I have a modern Windows PC as a spare that I sometimes use for development and gaming. Fitting the HBA card to it was easy (there's no need to attach drives to the HBA card) and I was able to boot into Windows normally.

I copied the three files above into a temporary folder, opened up a CMD window, and ran the flash tool. Using the -listall switch I was able to see immediately that (referring to the firmware matrix in the previously mentioned post) both the firmware and BIOS were at P20.

I:\upd>sas2flash.exe -listall
LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved

        Adapter Selected is a LSI SAS: SAS2008(B2)

Num   Ctlr            FW Ver        NVDATA        x86-BIOS         PCI Addr
----------------------------------------------------------------------------

0  SAS2008(B2)     20.00.07.00    14.01.00.08    07.39.02.00     00:09:00:00

        Finished Processing Commands Successfully.
        Exiting SAS2Flash.
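
If you'd rather pull the version numbers out of that listing with a script than read them by eye, something like the following Python sketch would do it. It works on a pasted copy of the output shown above. The regular expression is an assumption based on the column layout of this one sample, and other versions of the flash tool may print things differently.

```python
import re

# Adapter table copied from the -listall output above.
SAMPLE = """\
Num   Ctlr            FW Ver        NVDATA        x86-BIOS         PCI Addr
----------------------------------------------------------------------------

0  SAS2008(B2)     20.00.07.00    14.01.00.08    07.39.02.00     00:09:00:00
"""

# One adapter row: index, controller name, three dotted versions, PCI address.
ROW = re.compile(
    r"^\s*(?P<num>\d+)\s+(?P<ctlr>\S+)\s+(?P<fw>[\d.]+)\s+"
    r"(?P<nvdata>[\d.]+)\s+(?P<bios>[\d.]+)\s+(?P<pci>[\da-fA-F:]+)\s*$"
)


def parse_listall(text):
    """Return one dict per adapter row found in the listing text."""
    adapters = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            adapters.append(match.groupdict())
    return adapters


for adapter in parse_listall(SAMPLE):
    print(adapter["ctlr"], "firmware", adapter["fw"], "BIOS", adapter["bios"])
```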

At this point, I went back to the Broadcom site and downloaded the P19 version of the firmware and BIOS package. I then replaced just the .rom file (from the 'sasbios_rel' folder) in my temporary folder with the P19 version, and ran the update as below. I also ran an additional command, not listed, to delete the firmware first, but it reported errors that seemed to indicate that it was no longer necessary to run this command in the Windows versions. I would nonetheless follow the instructions on the Broadcom site here.

I:\upd>dir
 Volume in drive I is Temp Install
 Volume Serial Number is 0A7D-0D64

 Directory of I:\upd

15/01/2022  05:09 PM    <DIR>          .
15/01/2022  05:09 PM    <DIR>          ..
11/03/2016  04:30 PM           722,708 2118it.bin
19/03/2014  11:36 AM            83,159 mptbios.txt
19/03/2014  11:39 AM           201,216 mptsas2.rom
11/03/2016  04:29 PM           166,912 sas2flash.exe
               4 File(s)      1,173,995 bytes
               2 Dir(s)  10,252,136,448 bytes free

I:\upd>sas2flash.exe -f 2118it.bin -b mptsas2.rom
LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved

        Adapter Selected is a LSI SAS: SAS2008(B2)

        Executing Operation: Flash Firmware Image

                Firmware Image has a Valid Checksum.
                Firmware Version 20.00.07.00
                Firmware Image compatible with Controller.

                Valid NVDATA Image found.
                NVDATA Version 14.01.00.00
                Checking for a compatible NVData image...

                NVDATA Device ID and Chip Revision match verified.
                NVDATA Versions Compatible.
                Valid Initialization Image verified.
                Valid BootLoader Image verified.

                Beginning Firmware Download...
                Firmware Download Successful.

                Verifying Download...

                Firmware Flash Successful.

                Resetting Adapter...
                Adapter Successfully Reset.

        Executing Operation: Flash BIOS Image

                Validating BIOS Image...

                BIOS Header Signature is Valid

                BIOS Image has a Valid Checksum.

                BIOS PCI Structure Signature Valid.

                BIOS Image Compatible with the SAS Controller.

                Attempting to Flash BIOS Image...

                Verifying Download...

                Flash BIOS Image Successful.

                Updated BIOS Version in BIOS Page 3.

        Finished Processing Commands Successfully.
        Exiting SAS2Flash.

I:\upd>sas2flash.exe -listall
LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved

        Adapter Selected is a LSI SAS: SAS2008(B2)

Num   Ctlr            FW Ver        NVDATA        x86-BIOS         PCI Addr
----------------------------------------------------------------------------

0  SAS2008(B2)     20.00.07.00    14.01.00.08    07.37.00.00     00:09:00:00

        Finished Processing Commands Successfully.
        Exiting SAS2Flash.

I:\upd>

This appeared to have worked as desired. After this, I removed the HBA card from my Windows box, reinserted it to the server, and was then able to boot into the SAS BIOS normally and make the configuration changes as outlined previously.

That's about it for this post. Before I go, I'll include one last useful link of informational videos from the TrueNAS forums. Good background:

https://www.truenas.com/community/resources/informational-videos-mostly-about-sas-hardware.105/


ADDENDUM:

After building this, 6 of the 12 disks in the server were fairly quickly (i.e. within a couple of weeks) knocked out of the RAID pool due to repeated warnings. Initially I wasn't sure if the problem was hardware (SATA ports, SAS setup or HBA card), but the remaining drives seem stable. It's 2022 and this is a 2012 era server that has been in operation up until 2020, so it's not really surprising that the disks are starting to become unreliable - the consensus is about 3-5 years as a typical RAID disk's reliable lifespan.

I'm pretty sure if I was able to populate the server with new disks, this would fix the issues (unfortunately I don't have spare newer disks, or a budget for brand new ones).

However, this highlights again how hard TrueNAS/ZFS drives the disks, and how sensitive it is in reporting and reacting to problems. I suspect in the original hardware RAID configuration, we wouldn't have even heard about problems with the disks.

TrueNAS will potentially place a lot more stress on your hardware than other RAID setups, so the hardware has to be good quality and well integrated. The payback is extremely reliable storage and near-paranoid reporting of any issues.