Quick notes on expanding a ZFS RaidZ Pool – Solaris 11 Express. Howto (see bottom for update)

So you have what was once a gargantuan ZFS RaidZ1 array, but the family videos, the pictures, and the super cool time-windowed (via snapshot) backup scheme you built for all your local machines have stuffed the pool completely. Like me, you consider just dumping another pair of mirrored drives into the pool to be a hokey kluge that creates dissimilar infrastructure you will have to remember for years (in the event of a failure). Like me, you have also heard that you can replace your drives one at a time with larger drives, and that with the successful replacement of the last drive the array will magically expand in size.

The long/short of my migration:

Whenever you turn your system on, ZFS will automatically find your array drives wherever they are and assemble the pool at boot. For my migration I bought an external eSATA dock (one of the ones where you pop the drive in the top).

For each drive replacement I used the following procedure.

1. Pop open a shell and become root. (I modded my permissions so pfexec works for me; I show how to do that in another post here on the blog. You can su if you like.) $pfexec bash will give you a root shell. Get the status of the pool and make note of the device names in the output. It is also worth noting the pool's current size, as shown after the listing below, so you can confirm the expansion later.

#zpool status

        NAME        STATE     READ WRITE CKSUM
        mypool      ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c9t4d0  ONLINE       0     0     0
            c9t3d0  ONLINE       0     0     0
            c9t2d0  ONLINE       0     0     0
            c9t5d0  ONLINE       0     0     0
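It can also help to record the pool's size before you start. This is just a suggestion on my part, but a quick zpool list now gives you a baseline to compare against once the final drive has been replaced:

#zpool list mypool

The SIZE column is the number that should grow after the last (larger) drive has finished resilvering and the pool expands.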

2. Shut down the machine.

3. Remove the drive I plan to replace from its current location (bay, SATA cable, power, etc.).

4. Place that drive into the eSATA dock.

5. Put the new larger drive in the place of the old drive.

6. Boot the machine; ZFS worked out where the old drive had ended up (the dock) on its own.

7. Become root and look at the devices in the system with the format command (note that Ctrl-D will get you out of format). As you can see, one of the devices that was in my zpool before I swapped drives is now one of the new 2TB drives I'm putting into the pool. From running format before I put a drive into the eSATA dock I knew that any drive in the dock would show up as c7t513d0, but you could also run format before and after the swap and look for the changes (a sketch of that comparison follows the listing below). Do be careful and make sure you know which device is the old drive and which is the new one before the next step, though…

#format

Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c7t512d0 <ATA-WDC WD2500AAKS-0953 cyl 30398 alt 2 hd 255 sec 63>
          /pci@0,0/pci8086,3a42@1c,1/pci1458,b000@0/disk@200,0
       1. c7t513d0 <ATA-SAMSUNG HD103UI-0953-931.51GB>
          /pci@0,0/pci8086,3a42@1c,1/pci1458,b000@0/disk@201,0
       2. c9t0d0 <ATA-WDC WD6401AALS-0-3B01-596.17GB>
          /pci@0,0/pci1458,b005@1f,2/disk@0,0
       3. c9t1d0 <ATA-WDC WD6401AALS-0-3B01-596.17GB>
          /pci@0,0/pci1458,b005@1f,2/disk@1,0
       4. c9t2d0 <ATA-WDC WD20EARS-00M-AB51-1.82TB>
          /pci@0,0/pci1458,b005@1f,2/disk@2,0
       5. c9t3d0 <ATA-WDC WD20EARS-00M-AB51-1.82TB>
          /pci@0,0/pci1458,b005@1f,2/disk@3,0
       6. c9t4d0 <ATA-WDC WD20EARS-00-AB51 cyl 60798 alt 2 hd 255 sec 252>
          /pci@0,0/pci1458,b005@1f,2/disk@4,0
       7. c9t5d0 <ATA-WDC WD20EARS-00M-AB51-1.82TB>
          /pci@0,0/pci1458,b005@1f,2/disk@5,0
Specify disk (enter its number):
^D
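If you would rather diff listings than eyeball them, the usual trick is to pipe an empty line into format so it prints the disk list and then exits at the prompt; the temp file names here are just examples I'm adding for illustration. Before shutting down in step 2:

#echo | format > /tmp/disks-before.txt

Then, after booting with the drives swapped:

#echo | format > /tmp/disks-after.txt
#diff /tmp/disks-before.txt /tmp/disks-after.txt

The new 2TB drive and the old drive sitting in the dock will show up as the changed lines.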

8. This was an interesting little annoyance. It seems that the zpool replace command would only work after a zpool status command was run. Running the replace without running the status first gives you the following.

#zpool replace mypool c7t513d0 c9t4d0
cannot replace c7t513d0 with c9t4d0: no such device in pool

So we know we need to run a status first, then follow it with the replace command:

#zpool status mypool

  pool: mypool
 state: ONLINE
  scan: scrub canceled on Sat Jan 15 20:56:30 2011
config:

        NAME          STATE     READ WRITE CKSUM
        mypool        ONLINE       0     0     0
          raidz1-0    ONLINE       0     0     0
            c7t513d0  ONLINE       0     0     0
            c9t3d0    ONLINE       0     0     0
            c9t2d0    ONLINE       0     0     0
            c9t5d0    ONLINE       0     0     0

errors: No known data errors

#zpool replace mypool c7t513d0 c9t4d0

9. Run another status so you know what is going on

#zpool status mypool

  pool: mypool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Jan 15 21:19:26 2011
        64.4M scanned out of 3.07T at 8.05M/s, 111h4m to go
        15.5M resilvered, 0.00% done
config:

        NAME             STATE     READ WRITE CKSUM
        mypool           ONLINE       0     0     0
          raidz1-0       ONLINE       0     0     0
            replacing-0  ONLINE       0     0     0
              c7t513d0   ONLINE       0     0     0
              c9t4d0     ONLINE       0     0     0  (resilvering)
            c9t3d0       ONLINE       0     0     0
            c9t2d0       ONLINE       0     0     0
            c9t5d0       ONLINE       0     0     0

errors: No known data errors

10. When the resilver is complete, I believe it is advisable to scrub the pool to ensure all is well:

#zpool scrub mypool

The scrub will also take a while; you can check on its progress with:

#zpool status mypool

Notes:

  • When replacing a drive, zpool status will show long estimated times, like the 111 hours above. The estimates kept increasing for at least 2 hours and actually made it up to 423 hours remaining, but after 2 to 3 hours data actually started moving and the estimates became much more realistic. This was true for each drive I replaced. I can tell you that replacing all the drives in a 4-drive RaidZ1 array that was ~85% full took about 12 hours per drive.
  • One crazy note… During one resilver my server dropped its existing connections and would not open a console on the machine. It started failing all connection attempts with out-of-memory errors… Not good! Maybe I should not have been running virtual machines (stored on another pool) while this one was resilvering… I don't know, but it was definitely strange. The resilver succeeded, and the machine did let me back in after a couple of hours. I realized that after installing Oracle Solaris 11 Express I had forgotten to limit the ZFS ARC cache (which I had done before; a good reference is ZFS_Evil_Tuning_Guide#Limiting_the_ARC_Cache). So before the last drive swap I limited the ARC to 7 GB of memory via "set zfs:zfs_arc_max = 0x1C0000000" (see the sketch just after this list).
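For reference, the tuning itself is a one-line entry in /etc/system (per the Evil Tuning Guide linked above) and takes effect after a reboot:

set zfs:zfs_arc_max = 0x1C0000000

0x1C0000000 bytes is the 7 GB limit mentioned above. The kstat checks below are my own suggestion for verifying the cap and watching the current ARC size afterwards:

#kstat -p zfs:0:arcstats:c_max
#kstat -p zfs:0:arcstats:size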

Warning:

  • Remember that in a RaidZ1 array the loss of 2 drives at one time will lose you the entire array! I know I'm paranoid, but I have lost RAID 5 arrays this way in the past, so imagine the following: you are upgrading a multi-drive RaidZ1 array. If you did not precondition the drives (have them powered up and under test for a few days; most home users do not do this), you will have more than one drive in the array that has been spinning for less than 24 hours. My experience with drive failures is as follows.
    • If a drive does not make it past power-on you are OK; you stop the migration and get a different drive… no problem.
    • The next hurdle is drives that fail within 48 hours. It should still be a low percentage, but there will be some.
    • The final, more insidious failures are the drives that go flaky and start losing sectors, then fail. This usually takes a few weeks.

Since most failures happen when drives are relatively new, the odds of having two new drives in an array fail at the same time are far greater than the odds of two simultaneous failures in a seasoned array. So the average home user will probably get a stack of 4 huge new hard drives on their front porch, run to the server, and start swapping out the array. With all brand-new drives in the array, the odds that two will fail in the next week are FAR greater than the odds of two simultaneous failures after the drives have been spinning for a week, and lower still after a month.

  • Some strategies to consider as you expand your home ZFS RaidZ1 array:
  • Expand safely. Replace one drive a week with a newer drive, or alternatively season the new drives in another system for a week before you start putting them into your production array.
  • As long as you have not replaced the last larger drive, each drive is still held to the size dictated by your original array. You can avoid needing spares in the new, larger size by keeping your old drives and swapping them back in the event of a failure (until the last drive is replaced and ZFS starts using the full size of the drives).
  • I VERY highly advise weekly scrubbing of the home array. Monitoring 'zpool status' after scrubs is the easiest way I know of to identify a flaky drive that is losing sectors. An easy way to do a weekly scrub is to call a shell script from your crontab (a sketch of the script follows this list):
    • I have a "zpool scrub <my zfs pool name>" line for each of my pools in a shell script I call zfsmaint.sh.
    • You can add the job to your crontab with "#crontab -e" and then adding the following line (of course replacing <your home dir>; the zpool commands should live in a script called zfsmaint.sh in your home directory):
      0 23 * * 1 /export/home/<your home dir>/zfsmaint.sh
      If you are having problems with the vi editor, please go look up vi commands on the web.
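A minimal zfsmaint.sh along those lines might look like the following. This is just a sketch I'm adding for clarity: the pool name is the one used in this article, and the pfexec assumes the permissions setup mentioned in step 1 (otherwise drop it and put the job in root's crontab instead):

#!/usr/bin/bash
# zfsmaint.sh - weekly ZFS maintenance kicked off from cron (see the crontab line above).
# 'zpool scrub' returns immediately; check the results later with 'zpool status <pool>'.
for pool in mypool; do                      # list every pool you want scrubbed here
    pfexec /usr/sbin/zpool scrub "$pool"
done

Remember to make the script executable (chmod +x zfsmaint.sh), or cron will not be able to run it.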

General:

Excellent ZFS Reference: ZFS_Best_Practices_Guide

Future wishes…

When I started my ZFS array there was no RaidZ2 or RaidZ3 (double/triple redundancy), but now there is… Sun never built an upgrade path from RaidZ1, and I really hope Oracle sees this as an issue and makes one available. At the trivial cost of another disk I would like to move to a double-redundancy array without having to build a whole extra array to migrate the data through.

UPDATE:

I wanted to make this walkthrough for everyone out there as a compilation of all the individual blogs/guides I had to use to perform the task. After all was said and done, it did not work: apparently Oracle broke the auto-expand in Solaris 11. I went through the steps of setting the pool's autoexpand property and trying to force the pool to expand with the new 'zpool online -e' command (roughly the commands sketched below), and nothing worked. So I ended up copying my data to another pool, creating a new RaidZ2 pool (which I wanted anyway), and copying the data back. This was done with zfs send/recv over SSH to another server. After playing around, the command line to do this is as follows:
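For anyone who wants to try it on their own build, the failed expansion attempt boiled down to commands along these lines (this is a reconstruction added for clarity, not a transcript; the device names are the ones from my pool above):

#zpool set autoexpand=on mypool
#zpool online -e mypool c9t2d0
#zpool online -e mypool c9t3d0
#zpool online -e mypool c9t4d0
#zpool online -e mypool c9t5d0
#zpool list mypool

zpool list should report the larger SIZE once every device has been expanded; in my case it never did.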

Create a snapshot on your local machine via

zfs snapshot <mypool>/<filesystem>@<snapshot>

so

# zfs snapshot tank/myshare@today

My destination backup server was at 192.168.1.67, and I created a pool on it called tank2. Running the send/recv below, zfs copied the snapshot over and automatically created the myshare filesystem (and its snapshot) in tank2.

zfs send <source_pool>/<source_filesystem>@<snapshot> | ssh <account>@<server_ip> pfexec /sbin/zfs recv <dest_pool>/<dest_filesystem>@<dest_snapshot>

or

# zfs send tank/myshare@today | ssh myaccount@192.168.1.67 pfexec /sbin/zfs recv tank2/myshare@today
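If you have more than a couple of filesystems to move, a recursive snapshot plus a replication stream can send the whole pool in one go. This is a sketch rather than what I originally ran: -R sends all descendant filesystems and their snapshots, -d recreates them under the destination pool, and -F is needed because tank2 already exists on the receiving side.

# zfs snapshot -r tank@today
# zfs send -R tank@today | ssh myaccount@192.168.1.67 pfexec /sbin/zfs recv -F -d tank2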

After I copied all the filesystems to the backup server I ran a scrub there to ensure the drives/data were good, then destroyed the pool on the original server, created the new RaidZ2 pool (which now uses the full size of the drives), and copied everything back. This is run from the console of the backup server, with 192.168.1.68 being the IP address of the original server:

# zfs send tank2/myshare@today | ssh myaccount@192.168.1.68 pfexec /sbin/zfs recv tank/myshare@today

When it is all moved, scrub the pool and Bob's your uncle… :)

2 Responses to “Quick notes on expanding a ZFS RaidZ Pool – Solaris 11 Express. Howto (see bottom for update)”

  1. evilt says:

I just moved a friend's data to ZFS on Linux this morning. To do this I simply plugged his drives into the new machine, followed the snapshot directions above, created the destination zpool, and did:

    sudo zfs send tank2/myshare@today | sudo zfs receive tank/myshare

    it built the myshare in the tank pool and moved the files.

    Of course I could have used rsync, but that wouldn’t have been nearly as fun…

  2. The Completely Evil Blog » Blog Archive » Notes on building ZFS pools on linux says:

    […] evilt on Quick notes on expanding a ZFS RaidZ Pool – Solaris 11 Express. Howto (see bottom for update) […]
