I have a ProLiant MicroServer in my home lab that I use as a NAS and networked backup storage. Even though I can trust RAID to preserve the data reliably on disk, I cannot trust the data to be preserved reliably in RAM unless I install ECC RAM.
This is quite easy to do as the user manual describes what kind of RAM you can use. In my case it’s 16GB of ECC RAM. I found two 8GB sticks online and this is the changelog behind this upgrade.
Quite easy to do. The user manual describes, with pictures, how to do this so there is not a lot that can go wrong.
The system accepted the two sticks as expected and I could log in without problems and without fiddling with the BIOS.
While it’s nice to have ECC RAM, and you definitely get that warm fuzzy feeling that your data is now safe, we are just getting started. We have to think about what will happen when the ECC RAM yells at you because a bit flipped! And we have to think about what will happen when one of the DIMMs dies and you have to replace it.
The obvious solution is to ignore such reports and to manually go over all DIMMs when you suspect one of them is acting funny. The less obvious solution is setting up monitoring now to get a little more information later.
The easiest way to set up monitoring is to install rasdaemon and to learn how to use it. The only interesting part of the setup is that you get to make a map from the in-kernel logical locations to the actual physical locations on the motherboard.
This map is obviously very specific and depends on how the actual electrical connections were done on the physical motherboard. The map follows this format.
Vendor: <motherboard vendor name>
  Model: <motherboard model name>
    <label>: <mc>.<row>.<channel> ...
On my HP ProLiant MicroServer Gen8 it looks like this. I created this map by booting with only one stick present and looking at the paths available in sysfs.
Vendor: HP
  Model: ProLiant MicroServer Gen8
    DIMM_1A: 0.0.0, 0.1.0
    DIMM_2B: 0.0.1, 0.1.1
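The one-stick-at-a-time lookup can be sketched in a few lines of Python. This is only a rough sketch and not how the map above was actually produced: it assumes the kernel exposes the legacy EDAC csrow layout under /sys/devices/system/edac/mc, with one ch<N>_dimm_label file per populated channel, and the function names populated_locations and label_line are made up for this example.

```python
import re
from pathlib import Path

# Assumption: the legacy EDAC csrow layout is exposed at this path.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")

# Matches the tail of a path such as .../mc0/csrow1/ch0_dimm_label
_LOCATION = re.compile(r"mc(\d+)/csrow(\d+)/ch(\d+)_dimm_label$")


def populated_locations(root=EDAC_ROOT):
    """Return sorted (mc, row, channel) tuples, one per channel label file found."""
    found = []
    for path in Path(root).glob("mc*/csrow*/ch*_dimm_label"):
        match = _LOCATION.search(path.as_posix())
        if match:
            found.append(tuple(int(part) for part in match.groups()))
    return sorted(found)


def label_line(label, locations):
    """Format one entry of the map as '<label>: <mc>.<row>.<channel>, ...'."""
    joined = ", ".join(f"{mc}.{row}.{ch}" for mc, row, ch in locations)
    return f"{label}: {joined}"


if __name__ == "__main__":
    # "DIMM_1A" is a placeholder: use the name printed next to the slot
    # that holds the single stick during this boot.
    print(label_line("DIMM_1A", populated_locations()))
```

Run it once per boot with a single stick in a different slot each time, and collect the printed lines under the Vendor and Model headers.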
The only change from the how-to guide is that I decided to store this map in /etc/edac/labels.d/microserver and symlink it to /etc/ras/dimm_labels.d/microserver. This was necessary because edac-ctl was not picking up the mapping otherwise.
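For completeness, here is a minimal sketch of creating that symlink in a repeatable way. The helper name link_labels is hypothetical, and it needs root to write under /etc. A plain ln -s does the same job; the only extra here is that running it twice is harmless.

```python
import os
from pathlib import Path


def link_labels(source, link):
    """Create `link` as a symlink to `source`.

    Returns False if the correct link already exists, True if it was created.
    Raises FileExistsError if `link` exists but is something else.
    """
    source, link = Path(source), Path(link)
    if link.is_symlink() and Path(os.readlink(link)) == source:
        return False  # already in place, nothing to do
    link.parent.mkdir(parents=True, exist_ok=True)
    link.symlink_to(source)
    return True
```

With the paths from this post the call would be link_labels("/etc/edac/labels.d/microserver", "/etc/ras/dimm_labels.d/microserver").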