Posted by: nlfiedler | July 19, 2009

Improvements to Burstsort for Java

Recently I had been sidetracked by the need to do something about the network attached storage situation at home, so I had taken a break from my software projects. But a gentleman by the name of Kimo Crossman wouldn’t let me forget about burstsort. He has been sending me links to research papers on various parallel algorithms and other cache-related subjects, and making suggestions for how to improve my open source Java implementation of the original burstsort. In particular, Kimo felt that applying principles from parallel algorithms would be the most important means for improving the performance of burstsort. I have very much appreciated his help and I thoroughly enjoy reading academic papers. In return, I’ve made an effort recently to work on a few improvements for burstsort4j.

Parallelized

The first major change was to introduce a parallelized version of burstsort. In this initial attempt, it only parallelizes the bucket sorting. The building of the trie and buckets still happens in a single thread. However, once the structure is built, the multi-threaded version creates a thread pool of size equal to the number of available processors, and creates sort jobs that are then run in parallel. These jobs run independently of one another, copying the sorted output to the original array without any unnecessary synchronization. As a result, the overall runtime is shortened considerably on a dual-core CPU, which in my case is the Intel Core 2 Duo in my MacBook Pro.

Engineered Burstsort

With the publication of the WEA 2008 paper by Ranjan Sinha and Anthony Wirth came a newly engineered variation of the original burstsort. In particular, this algorithm made much better use of memory, while still being nearly as fast as the original algorithm. It primarily makes a change in the structure of the buckets, where instead of a single dynamic array they now have an array that points to other arrays of pointers to strings. These sub-buckets, as they are called, are grown at a slower pace and stop growing at a much lower threshold than the original algorithm. Once a sub-bucket is filled, a new sub-bucket is created, and so on until the overall bucket size reaches a threshold equal to that of the original algorithm. As a result, the memory usage is dramatically improved.

Needless to say, my excitement was very high at this point. I desperately wanted to implement this redesigned burstsort in Java as soon as I could. But, certain other obligations got in the way for a time, and after a few months I finally wrote the Java version of the engineered burstsort. After fixing one small mistake it was working and it was better than I could have imagined. Not only did the memory efficiency go from about 25 percent to 95 percent, it was often a little bit faster than my original implementation.

What’s Next

There are yet more improvements to make. First of all, the WEA 2008 paper offers a second improvement, which is to copy the string tails from a bucket to a string buffer and sort them there. That is, the string buffer would only be used during the bucket sorting phase and would be re-used after each bucket is sorted. I have an idea to use a large character array and an implementation of CharSequence to create lightweight strings.

Secondly, I want to experiment with the parallel version of burstsort. In particular, I want to try out the suggestions made by Kimo to parallelize the building of the trie/bucket structure. I think it can definitely be done; the only question is how much contention there will be on the trie nodes. As a means to test these parallel algorithms I’ve bought a quad-core AMD Phenom CPU and mainboard to use as my development machine. I’m really looking forward to seeing the results.

Posted by: nlfiedler | July 18, 2009

Mirroring ZFS root pool with messy disks

Recently I needed to re-purpose my old OpenSolaris-based file server as a development box (upgraded to AMD Phenom X4, woohoo!). Since I wasn’t planning on making backups on a reliable schedule, I wanted to mirror the boot disk, which in ZFS is a cinch. Surprisingly, this took much longer than I had assumed when reading the ZFS documentation. Mostly this was due to the messy disks I was using; that is, they had been configured as data disks so they all had an EFI label. Since most PC BIOSes don’t support EFI labels, ZFS requires that all rpool devices have an SMI label. What’s more, the partition table has to be just right, and I couldn’t figure out how to use the format command to achieve that. Turns out there’s an easier way.

  1. Install OpenSolaris on the first disk as usual.
  2. Make sure the second disk has an SMI label instead of an EFI label so it can be attached to the rpool. Invoke the format -e command (the -e is important; it enables expert options like setting the label type), choose the second disk, type “label”, and choose the SMI option. If it does not allow this change because you have to delete the partitions, then type “fdisk”, delete the partitions, create a new one (Solaris2 type), and save the changes to disk; now you can change the label type. Type “quit” to save your changes to disk. [opensolaris.org]
  3. The partition table of the second disk needs to be made identical to the first one. As the root user, type prtvtoc /dev/rdsk/disk1 | fmthard -s - /dev/rdsk/disk2 where disk1 and disk2 might look like c0d0s2 and c0d1s2 (note the use of slice 2 here, the “whole disk” slice). [opensolaris.org]
  4. Attach the second disk to the root pool: zpool attach rpool disk1 disk2 (where disk1 and disk2 are the device names, typically including the slice, such as c7d0s0 and c7d1s0). You may find it necessary to force the attach as ZFS may complain about slices overlapping. If s0 overlaps s2 that’s actually normal, so just add the -f flag.
  5. Make sure the boot loader is copied to the second disk so it is bootable in the event the first disk becomes unbootable: installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/disk2 [opensolaris.org]
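As a rough sketch, steps 3 through 5 can be collected into a small helper that only prints the commands for a given pair of disks, so they can be reviewed before being run as root. The function name mirror_plan is made up for illustration, the disk names are the examples from the steps above, and using slice 0 for installgrub is an assumption based on the usual root pool layout.

```shell
#!/bin/sh
# mirror_plan DISK1 DISK2
# Prints (does not run) the commands from steps 3-5.
# DISK1 is the existing boot disk, DISK2 the new mirror half,
# both given without a slice suffix, e.g. c0d0 and c0d1.
mirror_plan() {
    disk1=$1
    disk2=$2
    # Step 3: copy the partition table via the "whole disk" slice 2.
    echo "prtvtoc /dev/rdsk/${disk1}s2 | fmthard -s - /dev/rdsk/${disk2}s2"
    # Step 4: attach the new disk; add -f only if ZFS complains about overlap.
    echo "zpool attach rpool ${disk1}s0 ${disk2}s0"
    # Step 5: install the boot loader (slice 0 is an assumption here).
    echo "installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/${disk2}s0"
}

mirror_plan c0d0 c0d1
```

Once the attach has been run for real, zpool status rpool shows the progress of the resilver onto the second disk.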

See, it was a cinch after all. That is, once you know what to do.

[Update: Removed the remark about using s0 or s2 interchangeably. So far the posts I've seen mostly lean toward s2, but the prtvtoc output for s0 and s2 on my disks was identical, so it would not have made a difference in my case. Nonetheless, go with the flow. Also added references to original sources in case you were to think I was some kind of genius or something.]

Posted by: nlfiedler | July 13, 2009

Building a Network Attached Storage box

Introduction

In an earlier post I compared a Drobo to a custom built storage system based on OpenSolaris, with the conclusion that while a Drobo is convenient, building your own server offers many advantages. In this post I want to show how I went about building a new storage system based on an article that came out at the end of last year. Among the improvements over my old server are lower power consumption and a smaller form factor. In fact, the case that I’ve chosen has the same footprint as the Drobo, and is only about five inches taller. In addition, this particular case has hot swappable drive bays, so disks can be replaced while the system is powered on.

Before I get into the particulars, I want to share some useful links that may help if you’re new to OpenSolaris and/or building a system from parts. First of all, there’s the First-Timer’s Guide to Building a Computer from Scratch at LifeHacker. I find that the hardest part of building your own system is getting the parts list right. To that end, check out Ars Technica’s excellent system buyer’s guide, which offers recipes for various types of systems and provides advice on choosing suitable parts. Once the system is put together you might be asking yourself which operating system to choose. For me, OpenSolaris with ZFS is a no-brainer; it really is an excellent software stack that gives me peace of mind knowing that my data is as safe as it can be (for the amount of money I’m willing to spend). But you don’t have to take my word for it; see what Simon and Scott have to say about OpenSolaris and ZFS. If setting up and administering OpenSolaris is too daunting for you, then check out FreeNAS, which is based on FreeBSD with a port of ZFS and is designed to be easier to set up and maintain.

Power Consumption

One of the primary goals with building a new storage server was to minimize power consumption. According to a recently purchased Kill-A-Watt device, my original web server box consumes about 100W while idle, and 120W while the disks are active. During power-on, the consumption peaks at around 260W. This is actually not too bad for a server-class machine. As for the new server, it starts off at around 36W, then hits 117W when everything spins up at once, then levels out at 62W. During active reading and writing, the consumption peaks at 71W and typically hovers around 67W. Not too bad: a little over half of the consumption of my original machine.

Bill of Materials

These prices are based on what was available in March of 2009, so these may have changed by now. Nonetheless, it shows just how little there is to buy, with half of the parts available from a single retailer.

Part                    Vendor             Price
Chenbro ES34069         eWiz.com           $153
Intel D945GCLF2         newegg.com         $79
Kingston 2GB RAM        newegg.com         $22
Panasonic CD/DVD-ROM    logicsupply.com    $51
Seagate 80GB 2.5″ HDD   newegg.com         $50
SYBA SATA II NCQ        eforcity.com       $43
Flexible PCI riser      logicsupply.com    $22

That adds up to $420 for everything you need to equal a Drobo and DroboShare. That’s $150 less than the price of the Drobo plus DroboShare on Amazon, and this system is faster and the components can be sourced from multiple vendors. That, by the way, was one of my goals in building a new system; I didn’t want to be locked in to a particular hardware vendor.
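For anyone who wants to check the total or plug in current prices, the sum is easy to reproduce in the shell. This is only a sketch; the numbers are the March 2009 prices from the parts list, in the same order.

```shell
#!/bin/sh
# Sum the parts list prices (whole dollars, March 2009).
total=0
for price in 153 79 22 51 50 43 22; do
    total=$((total + price))
done
echo "Total: \$$total"
```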

Regarding the selection of the Chenbro case, it was a luxury. It’s rather pricey for an ITX case, but the LogicSupply review convinced me it was worth it. Another option would be to use the Enlight PR-42A1 which can be had for around $60. The major difference is the Enlight is an ATX case, so you would have to get a different mainboard, and chances are it would not have a low power chipset.

You may have noticed that the mainboard does not support ECC (error-checking and correcting) memory. Yes, that was a bit of a letdown, but to get ECC you typically have to go with server-grade parts, which cost more and often consume more power.

Operating System

Unlike with pre-built systems, you get to choose your operating system when building your own system. There are many people who go with Windows Home Server, but obviously that costs money. Free options include FreeBSD, FreeNAS, Linux and OpenSolaris. Being a long time user of Solaris, I went with OpenSolaris. It has the advantage of including the reference implementation of ZFS. While I had been a Linux user for 10 years, and ran a software project/web server on Linux for half that time, I know a better solution when I see one.

Disk Performance

Given that this system was built to replace a Drobo, I wanted to compare their performance. However, there is no reasonable way to compare the performance of a Drobo with a storage server such as the one I’ve built here. Firstly, the Drobo does not have native Ethernet, so it has to rely on an external device connected over USB. That alone is going to add delay and turn the comparison into an apples and oranges argument. What I can say though is that the new server feels faster than the Drobo, and is certainly fast enough for my needs. It is on par with the server-grade equipment that this box is replacing, and meets our file sharing and Time Machine backup requirements perfectly.

Conclusion

If you’re new to building a storage server, I hope that I’ve given you some inspiration to learn more. If you have the time and a handful of tools, building your own system is pretty easy. And if you haven’t spent a lot of time learning how to configure an operating system, then let FreeNAS come to your aid. In short, you can choose your own parts from any of a number of suppliers, install whatever operating system you like, and set up a pretty reliable vault for your data.

Posted by: nlfiedler | March 1, 2009

Installing Logwatch on OpenSolaris

Typically, installing Logwatch is fairly trivial. On Linux, you’d just use the package installer command and you’re done. On OpenSolaris, there doesn’t seem to be a packaged version of Logwatch (yet), so installing from the source tarball is necessary. Fortunately, there’s a shell script that performs the installation. The bad news is this script finds /usr/sbin/install which is the Solaris version of install. This version behaves very differently from those found in other Unix variants. The Logwatch installer is expecting the behavior of the install script found on Linux, so it fails miserably on OpenSolaris.

The good news is, there’s a simple solution. Just install the SUNWscp package. This is the “source compatibility package”, which installs numerous commands that help OpenSolaris behave more like other Unix systems. The Logwatch installer script prepends the /usr/ucb directory to the PATH when it runs, so it finds the install script that it is expecting, and thus it installs Logwatch perfectly. The only thing left is to add the crontab entry, as shown at the end of the install output.

One last note about Logwatch, and it concerns that crontab entry. It seems that the default configuration for Logwatch is to print the report rather than sending an email to the default recipient, root. However, the example crontab entry is redirecting all output to /dev/null. So how exactly is one supposed to get a daily report? The answer is to edit the /etc/logwatch/conf/logwatch.conf file, adding Print = no at the end of the file. That tells Logwatch to email the report rather than printing. It’s a mystery to me why that’s the default given the example crontab entry they display during the installation process. But at least it’s easy to fix, and nicely demonstrates how easy it is to customize Logwatch without touching the default configuration files.
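The edit to logwatch.conf can be scripted so that it is safe to repeat. The following is a minimal sketch: the function name enable_logwatch_mail is invented for this example, and the demonstration runs against a scratch file rather than the real /etc/logwatch/conf/logwatch.conf.

```shell
#!/bin/sh
# enable_logwatch_mail FILE
# Appends "Print = no" to FILE unless that setting is already present.
enable_logwatch_mail() {
    conf=$1
    if grep -i '^ *Print *= *no' "$conf" >/dev/null 2>&1; then
        return 0
    fi
    echo "Print = no" >> "$conf"
}

# Demonstration on a scratch file; the second call changes nothing.
scratch=${TMPDIR:-/tmp}/logwatch.conf.$$
: > "$scratch"
enable_logwatch_mail "$scratch"
enable_logwatch_mail "$scratch"
cat "$scratch"
rm -f "$scratch"
```

To apply it for real, call the function as root with /etc/logwatch/conf/logwatch.conf as the argument.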

Posted by: nlfiedler | February 26, 2009

Making netatalk discoverable in OpenSolaris

Previously I described how I had set up netatalk on my OpenSolaris storage server. That step went a long way toward making it easy to use Time Machine to back up my Mac to the server. But having read kremalicious, I wanted to find a way to make the netatalk daemon discoverable by the Macs on my network. The same technique that Matthias used on Linux was not going to work on OpenSolaris. For starters, OpenSolaris doesn’t use avahi; it has its own solution in the form of the dns/multicast service. In place of creating a static configuration file, you use the dns-sd command line client on OpenSolaris. While this tool is not meant to be used to register long running services, it’s the only feasible solution at the moment. But just running that command in the background and leaving its fate to the Gods is not good enough for me. It should be monitored using the Service Management Framework. This turns out to be surprisingly easy once you’ve read the SMF guide on BigAdmin.

Start by installing netatalk, if you haven’t already, as described in an earlier post. Next, create the SMF manifest file that will register the AFP daemon as a discoverable service. Name the file dnssd_afp.xml and place it in the /var/svc/manifest/site directory.

<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type="manifest" name="dnssd_afp">
  <service
     name="site/dnssd_afp"
     type="service"
     version="1">

    <single_instance/>

    <dependency
       name="filesystem-local"
       grouping="require_all"
       restart_on="none"
       type="service">
      <service_fmri value="svc:/system/filesystem/local:default"/>
    </dependency>

   <dependency
       name="dns-multicast"
       grouping="require_all"
       restart_on="none"
       type="service">
      <service_fmri value="svc:/network/dns/multicast:default"/>
    </dependency>

    <exec_method
       type="method"
       name="start"
       exec="/lib/svc/method/dnssd_afp"
       timeout_seconds="60">
      <method_context>
        <method_credential user="root" group="root"/>
      </method_context>
    </exec_method>

    <exec_method
       type="method"
       name="stop"
       exec=":kill"
       timeout_seconds="60">
    </exec_method>

    <instance name="default" enabled="false" />

    <stability value="Unstable" />

    <template>
      <common_name>
        <loctext xml:lang="C">
          dns-sd registration of afp daemon
        </loctext>
      </common_name>
      <documentation>
        <manpage title="dns-sd" section="1M" manpath="/usr/man"/>
      </documentation>
    </template>
  </service>
</service_bundle>

Change the ownership of the manifest to root:sys and make it read-only by all but the root user. For this service, we’ll need to write a shell script that spawns dns-sd as a background process, otherwise SMF will time out waiting for the service to start (the SMF documentation is better at explaining this than I am).

#!/sbin/sh
#
# Registers the AFP daemon with dns-sd.
#
/usr/bin/dns-sd -R chihiro _afpovertcp._tcp local 548 &
/usr/bin/dns-sd -R chihiro _device-info._tcp. local 548 model=Xserve &
# Sleep to ensure service has enough time to start up,
# otherwise SMF will time out waiting for it to be ready.
sleep 5

Place the shell script in the /lib/svc/method directory, change the ownership to root:bin, and make the file executable by all and writable only by root. Now we’re ready to import the service configuration and start the service. Import it using the command pfexec svccfg -v import /var/svc/manifest/site/dnssd_afp.xml, then start the service using pfexec svcadm enable dnssd_afp, and finally check that the service is running with svcs dnssd_afp. At this point, if the dns-sd process were to unexpectedly die, SMF will immediately restart it. That’s one of the many advantages of SMF over initd, which does not monitor the processes that it initiates. With the AFP service now registered, any Mac on your network should see your storage server as a Mac-compatible file share, which will appear automatically in the Finder sidebar. If you’ve added a shared volume to your Login Items previously, you can remove it, you won’t need it any more.
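Put together, the ownership, permission, and import steps look roughly like the sketch below. The run wrapper and the DRY_RUN variable are not part of OpenSolaris; they are added here so the script prints the commands by default instead of executing them. The modes 644 and 755 are one reading of "read-only by all but root" and "executable by all and writable only by root".

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) prints each command instead of running it.
# Set DRY_RUN=0 to execute the commands for real.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

install_dnssd_afp() {
    manifest=/var/svc/manifest/site/dnssd_afp.xml
    method=/lib/svc/method/dnssd_afp
    # Manifest: owned by root:sys, writable by root only.
    run pfexec chown root:sys "$manifest"
    run pfexec chmod 644 "$manifest"
    # Method script: owned by root:bin, executable by everyone.
    run pfexec chown root:bin "$method"
    run pfexec chmod 755 "$method"
    # Import the manifest, enable the service, and check its state.
    run pfexec svccfg -v import "$manifest"
    run pfexec svcadm enable dnssd_afp
    run svcs dnssd_afp
}

install_dnssd_afp
```

If svcs reports the service in the maintenance state, svcs -x dnssd_afp gives the reason and points at the service log.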

Posted by: nlfiedler | February 22, 2009

Setting up smartmontools on OpenSolaris

As a follow-up to the previous entries concerning my new storage server, I thought I’d talk about installing and configuring the smartmontools monitoring software in OpenSolaris. Like most open source software, it’s fairly easy to compile and install on OpenSolaris; it’s the automating part that’s a little different from Linux, for which smartmontools was developed.

To get started, download the latest release of the smartmontools source and extract it to a temporary directory. Next, make sure you have the gcc-dev packages installed, otherwise compiling the source is going to be a challenge (if which gcc returns nothing, run pfexec pkg install gcc-dev). Now you can build and install the tools quite easily with the following commands.

  1. ./configure
  2. make
  3. pfexec make install

At this point the smartd and smartctl binaries are installed under /usr/local, along with the manual pages and a sample configuration file, /usr/local/etc/smartd.conf. There are just a couple of changes to be made in the configuration file, and a few notes before proceeding. First off, as of today, ATA disk support in smartmontools on Solaris is not there, so SCSI emulation is used instead. While this gives us basic health status, it seems to prevent any detailed SMART data from being collected. It may also be the reason why I can’t run self-tests on my disks. This all worked in Linux with these same disks, so I’m guessing it’s due to the lack of ATA support in Solaris. Secondly, before you can monitor your disks, you’ll need to know the labels for those disks. I found zpool status worked quite well.

$ zpool status
  pool: rpool
 state: ONLINE
 scrub: scrub completed after 0h6m with 0 errors on Sun Feb 15 02:21:37 2009
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c0d0s0    ONLINE       0     0     0

errors: No known data errors

  pool: yubaba
 state: ONLINE
 scrub: scrub completed after 0h59m with 0 errors on Sun Feb 15 03:15:02 2009
config:

        NAME        STATE     READ WRITE CKSUM
        yubaba      ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
            c5t0d0  ONLINE       0     0     0
            c5t1d0  ONLINE       0     0     0

errors: No known data errors

Not only does it show which disks are in which pools, but it gives you the names of the disks that smartmontools expects, namely c4t0d0 and so on. Now we are ready to make changes to the smartd.conf file.

The first change to make in smartd.conf is to comment out the DEVICESCAN line. DEVICESCAN is fine if you want to scan all disks in your system, but I found that smartmontools didn’t like my rpool disk, and it wanted me to declare the disk types as “scsi” for it to do anything at all. Next we have to tell smartd which disks to monitor, so I added the following lines to the end of the smartd.conf file:

/dev/rdsk/c4t0d0 -d scsi -H -m root
/dev/rdsk/c4t1d0 -d scsi -H -m root
/dev/rdsk/c5t0d0 -d scsi -H -m root
/dev/rdsk/c5t1d0 -d scsi -H -m root

This seems to work, as invoking pfexec smartd -q onecheck resulted in output like this:

$ pfexec smartd -q onecheck
smartd version 5.38 [i386-pc-solaris2.11] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Opened configuration file /usr/local/etc/smartd.conf
Configuration file /usr/local/etc/smartd.conf parsed.
Device: /dev/rdsk/c4t0d0, opened
Device: /dev/rdsk/c4t0d0, is SMART capable. Adding to "monitor" list.
Device: /dev/rdsk/c4t1d0, opened
Device: /dev/rdsk/c4t1d0, is SMART capable. Adding to "monitor" list.
Device: /dev/rdsk/c5t0d0, opened
Device: /dev/rdsk/c5t0d0, is SMART capable. Adding to "monitor" list.
Device: /dev/rdsk/c5t1d0, opened
Device: /dev/rdsk/c5t1d0, is SMART capable. Adding to "monitor" list.
Monitoring 0 ATA and 4 SCSI devices
Device: /dev/rdsk/c4t0d0, opened SCSI device
Device: /dev/rdsk/c4t0d0, SMART health: passed
Device: /dev/rdsk/c4t1d0, opened SCSI device
Device: /dev/rdsk/c4t1d0, SMART health: passed
Device: /dev/rdsk/c5t0d0, opened SCSI device
Device: /dev/rdsk/c5t0d0, SMART health: passed
Device: /dev/rdsk/c5t1d0, opened SCSI device
Device: /dev/rdsk/c5t1d0, SMART health: passed
Started with '-q onecheck' option. All devices sucessfully checked once.
smartd is exiting (exit status 0)
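Since the smartd.conf entries differ only in the disk name, they can be generated instead of typed by hand, which helps on pools with many disks. This is just a sketch; the function name smartd_entries is made up, and the options are the ones used in the configuration lines earlier in this post.

```shell
#!/bin/sh
# smartd_entries DISK...
# Prints one smartd.conf line per disk: SCSI device type,
# health check only (-H), mail sent to root (-m root).
smartd_entries() {
    for disk in "$@"; do
        echo "/dev/rdsk/$disk -d scsi -H -m root"
    done
}

smartd_entries c4t0d0 c4t1d0 c5t0d0 c5t1d0
```

The output can be appended to the configuration with something like smartd_entries c4t0d0 c4t1d0 | pfexec tee -a /usr/local/etc/smartd.conf.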

So far so good, but what about having smartd run at bootup, and continuously monitoring the disk status? In Linux, you’d use initd, but since this is OpenSolaris, we’ll use the Service Management Framework (SMF) instead. To do that, paste the following text into /var/svc/manifest/site/smartd.xml, change the file ownership to root:sys, and invoke pfexec svccfg -v import /var/svc/manifest/site/smartd.xml. Then check that the service is running (svcs smartd), and if not, enable it using pfexec svcadm enable smartd.

<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type="manifest" name="smartd">
  <service
     name="site/smartd"
     type="service"
     version="1">
    <single_instance/>
    <dependency
       name="filesystem-local"
       grouping="require_all"
       restart_on="none"
       type="service">
      <service_fmri value="svc:/system/filesystem/local:default"/>
    </dependency>
    <exec_method
       type="method"
       name="start"
       exec="/usr/local/etc/rc.d/init.d/smartd start"
       timeout_seconds="60">
      <method_context>
        <method_credential user="root" group="root"/>
      </method_context>
    </exec_method>
    <exec_method
       type="method"
       name="stop"
       exec="/usr/local/etc/rc.d/init.d/smartd stop"
       timeout_seconds="60">
    </exec_method>
    <instance name="default" enabled="true"/>
    <stability value="Unstable"/>
    <template>
      <common_name>
        <loctext xml:lang="C">
          SMART monitoring service (smartd)
        </loctext>
      </common_name>
      <documentation>
        <manpage title="smartd" section="1M" manpath="/usr/local/share/man"/>
      </documentation>
    </template>
  </service>
</service_bundle>

At this point we have a managed service that is checking the health of our disks, and if anything comes up, it will send an email to the root user. While I would have liked to also set up short and long self-tests, I can live without them for now. In the meantime, I’ve got a weekly cron job that scrubs the data on the disks using zpool scrub, which will identify any data read errors on the disks and attempt to correct them automatically. I’ll talk more about that in a subsequent post, where I’ll describe how I set up OpenSolaris and ZFS on my storage server.
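A weekly scrub job might look something like the crontab lines printed by the sketch below. The schedule (Sundays at 02:15) is an arbitrary choice for illustration, not necessarily the one in use on this server, and scrub_cron_lines is a made-up helper name.

```shell
#!/bin/sh
# scrub_cron_lines POOL...
# Prints one crontab entry per pool that starts a scrub weekly.
# Fields: minute hour day-of-month month day-of-week command.
scrub_cron_lines() {
    for pool in "$@"; do
        echo "15 2 * * 0 /usr/sbin/zpool scrub $pool"
    done
}

scrub_cron_lines rpool yubaba
```

The printed lines belong in root's crontab (edit it with crontab -e as root). The result of the most recent scrub shows up in the "scrub:" line of zpool status.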

Posted by: nlfiedler | February 17, 2009

Comparing Drobo and DroboShare to an OpenSolaris storage server

In my previous entry I recalled the trial of recovering from a corrupted HFS+J volume that contained my Time Machine backups. During that process, I became aware of some of the drawbacks of the storage appliance on which my backups were stored, the Drobo. During all of that intensive reading and writing of large chunks of data, it became obvious that the Drobo was a bit on the slow side. Meanwhile, I had come across a video tutorial of ZFS, the file system that Sun created a few years back. It was so impressive, I decided then and there that the solution was to replace the Drobo with a server running OpenSolaris and ZFS. Granted, there are some advantages to an appliance like the Drobo, and in this entry I’ll describe the pluses and minuses as I see them. [Update: if you want a proper review of the Drobo, rather than my take on home storage solutions, check out George Ou's balanced analysis, as well as Marcel Binder's review on Tom's Hardware. They are reviewing the first generation Drobo, which is what I am discussing here.]

Drobo Advantages
It’s an appliance, you plug it in and stick in the drives and that’s basically all you need to do. In fact, the disks do not even have to be the same size, although if they are then it won’t waste any of the capacity. With four 200GB drives, the Drobo provided 554GB of usable storage space. And it does this while consuming a modest 36W of power.

Drobo Disadvantages
While the list of advantages was rather brief, this list is going to be significantly longer. First, if you are using a Drobo, you almost certainly want to share it on a LAN, in which case you need to buy a DroboShare (another $200 on top of the $500 for the empty Drobo, plus whatever you had to pay for the disks). To make this work properly, you need to run the Drobo Dashboard, which has a couple of idiosyncrasies. First, you have to completely disable the firewall in Mac OS X in order for the Dashboard to detect the DroboShare. Second, the Dashboard application files must be owned by a particular user, the one that installed the software. That would not be a problem except that the application won’t launch for any other user. While the DroboShare can be mounted without using the Dashboard, it seems there is a file ownership issue with at least some of the files on the Drobo, making them appear to be corrupted. The Drobo support tech was stumped and unable to resolve that particular issue, and could not bring themselves to admit that the Dashboard software was flawed.

During all of that Time Machine volume copying I was doing, the used space on the Drobo climbed upward. But as I deleted the botched copies, the Drobo was reluctant to return the space to the free side of the usage dial. Weeks went by and mysteriously one day the space all came back. I’ll never know why, just one of the mysteries of the black box that is the Drobo. In fact, that is exactly what the Drobo is: a closed, black box. There is no access into it, no remedy when something goes wrong, no tool for diagnosis. If the box fails for any reason, you have exactly one place to turn to for help (if you’re Bill Streeter, you’ll know exactly how true that is). You better hope your support contract is still good ($50/year, $150 if you let it lapse). In the worst case, you have to buy a new Drobo just so you can get the data off of your disks. Yes, that is indeed the truth. Like any proprietary RAID-like device, the Drobo’s on-disk format is a closely guarded secret. If the device fails and you can’t find a replacement, your data is gone forever.

While its power consumption is modest, it has insufficient airflow around the disks and as a result they run a bit too hot for my taste. Now that they are in the new storage server, they are much cooler. Speaking of power, the power connector on the Drobo is notoriously loose. It’s fallen out numerous times over the year that I’ve used my Drobo. The tech support person suggested taping the power cord to the side of the Drobo. Brilliant.

Last, but not least, the Drobo has a weird capacity “limit” of 2TB. Apparently it has something to do with the fact that its only interface to the world is through a USB port. I would assume the second generation Drobo is better at this, but I’m not willing to spend $500 to find out. Regardless, your capacity is limited to whatever you can find in four disks, as that is the maximum number of disks any single Drobo can take. But, if you’ve got the money, you can plug two Drobos into a single DroboShare. I doubt too many people have done that, as it would cost $1200 plus the cost of the disks. Meanwhile, I could build a system to hold 10+ disks for a fraction of that cost.

[Update: one more disadvantage, in case there weren't enough. With the Drobo and DroboShare, you have to have a paid support plan to get the firmware updates. So you better hope they find all the bugs before your support contract expires. With OpenSolaris, updates are free and new releases are made about twice a year.]

OpenSolaris and ZFS
The only disadvantage to running a server with OpenSolaris is that you have to install, configure, and maintain it. But hey, I’ve been doing that for years so it’s no trouble for me. In fact, the recent releases of OpenSolaris are remarkably easy to set up and administer. Most of the standard configuration is done using the graphical interface, and for everything else there are well-written manual pages. As for the advantages of OpenSolaris and ZFS, there are many. First of all, ZFS is the most amazing file system on the planet. It’s incredibly easy to set up a storage pool and create file systems. It handles stripes, mirrors, and data/parity formats (RAID 0, 1, 5, and 6) depending on your replication needs. You can add as many disks as your hardware can handle, and you can configure them any way you like.
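To give a feel for how little is involved, here is a sketch of creating a single-parity RAID-Z pool and one file system. The pool name, disk names, and the "backups" file system are placeholders, and the run wrapper with DRY_RUN is an addition of this example so that nothing is executed unless asked for.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) prints each command instead of running it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# create_pool POOL DISK...
create_pool() {
    pool=$1
    shift
    # Single-parity RAID-Z across all the disks given.
    run pfexec zpool create "$pool" raidz1 "$@"
    # One file system inside the pool; "backups" is a made-up name.
    run pfexec zfs create "$pool/backups"
    # Show the pool and everything in it.
    run zfs list -r "$pool"
}

create_pool yubaba c4t0d0 c4t1d0 c5t0d0 c5t1d0
```

There is no separate format or mount step; the new file system is mounted under the pool's mount point as soon as zfs create returns.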

ZFS has invincible data integrity. Unlike most, if not all, RAID 5 implementations, it does not suffer from the infamous write hole. Instead, it never overwrites live data, so all writes go to free blocks, with data being written first, then the metadata blocks, and finally the überblock. If power is lost at any point, when the system comes back, it will only see valid data and metadata. What’s more, all blocks are check-summed, and that checksum is stored in the parent block. This checksum carries upward to the überblock, which effectively has a fingerprint of the entire file system. This guards against the worst kind of data loss, the silent kind, as it detects the occasional bit rot that some disks can suffer. With built-in data replication, these problems can be automatically corrected on the fly.

But what about data portability, which was a major issue for me with the Drobo? Well, get this: ZFS is open source. What’s more, it’s been ported to BSD and Mac OS X. So even if my storage server were to suddenly die, I not only have a choice of vendors to repair/replace the hardware, but I also have a choice of operating systems to access the data in the storage pool. But in my opinion, OpenSolaris is the best choice as it has the reference implementation of ZFS. On top of that, OpenSolaris supports SMB, NFS, iSCSI, and AFP (via netatalk; see earlier blog entry), so I have several choices for how to access the storage over the network.

ZFS has built-in support for snapshots. In fact, OpenSolaris has a feature called Time Slider that, much like Time Machine, makes automatic snapshots and manages their expiration. With snapshots in place, if I ever run into a corrupted HFS disk image again, I can roll back the file system to an earlier snapshot. Granted, I may lose some data, but it’s better than losing the entire image to an uncorrectable error.

Final Notes
I lied earlier when I said there was only one disadvantage to running a storage server instead of a Drobo. The one other issue is that the power consumption of most server-class systems is over 100W. In fact, my current server consumes at least 102W and often hits 120W while actively doing work. But, I have a plan to replace the hardware with low power parts, ones primarily aimed at the mobile and embedded market. I plan to talk about that more in a future entry.

While I could certainly format the Drobo using ZFS, I would only gain snapshots and on-disk consistency. [Update: the Drobo does not support ZFS, only NTFS, HFS+, and FAT32; see Drobo knowledge base article #0043. Nonetheless...] Performance would still be rather poor, and disk management and repair would rely entirely on the Drobo itself. As far as ZFS would know, the Drobo would appear as one big disk. That is not the recommended scenario for ZFS according to its creators.

Earlier I mentioned that Drobo provided 554GB of usable space with the four 200GB disks I had installed. In comparison, ZFS provided just 548GB. Not too bad considering I’m getting rock solid data integrity and automated snapshots.
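For a rough sense of where those numbers come from, here is a back-of-the-envelope calculation in Python. It rests on two assumptions that I have not stated above: that one disk’s worth of space goes to parity, and that the reported figures are in binary units (GiB) while the disks are sold in decimal gigabytes. The function name is purely illustrative.

```python
GB = 10**9    # decimal gigabyte, as printed on the disk label
GIB = 2**30   # binary gigabyte, as many tools report capacity


def usable_gib(disks: int, disk_size_gb: int, parity_disks: int) -> float:
    """Upper bound on usable space before any file system overhead."""
    data_disks = disks - parity_disks
    return data_disks * disk_size_gb * GB / GIB


# Four 200GB disks with one disk's worth of parity (an assumption).
print(round(usable_gib(4, 200, 1), 1))  # prints 558.8
```

Under those assumptions the ceiling is about 559GB, so both 554GB and 548GB are within a couple of percent of the theoretical maximum, with the difference going to each system’s own overhead.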

One final point, regarding reclaiming disk space after deleting files. With ZFS, the freed space was returned within seconds, unlike the many weeks that it took the Drobo to realize I had deleted 100GB of data.

All in all, I’m very happy with the decision I’ve made. I now have a fast, reliable, serviceable, and manageable storage box that I can update with newer software versions indefinitely, as well as easily grow the capacity as my needs change.

Posted by: nlfiedler | February 17, 2009

Time Machine and invalid sibling link, now what?

Last month I was playing around with the timedog script to determine where all the disk space was going on my Time Machine backups. I had mounted the remote disk image using hdiutil attach, when a few minutes later Time Machine kicked off a backup. In a moment of poor judgment, I tried to stop the backup and unmount the second mount of the backup volume that TM had established. At first nothing appeared to have gone wrong as a result of that action, but the next time TM tried to make a backup, it said the volume was corrupted and it couldn’t make another backup. No matter, I thought; I was sure Disk Utility or fsck could correct the problem, as surely the error wasn’t too serious. Ah, but this error was no run of the mill error. It was in fact the dreaded invalid sibling link error. This is the error that makes most disk repair tools turn pale and shrink away into the corner. In most cases, nothing will fix the broken link, and there’s no telling just what might be lost as a result.

But, I wasn’t going to give up too easily. After all, I had over a year’s worth of backups that I wanted to recover. I Googled around for days, reading forum posts and blog entries, and anything else that might bring some hope to my dire situation. In many cases, others who had this problem tried fsck or Disk Warrior, and some of them were successful. By this time, I had tried fsck -r several dozen times, to no avail. Being the inventive type, I tried using hdiutil convert to create a new disk image from the corrupt one. But, as you can probably guess, all that accomplished was creating a new disk image with the same invalid sibling link. That meant that a simple disk block copy was not going to work; I had to try something that would copy the files one by one from the corrupt volume to a new one. Knowing that Time Machine makes heavy use of hard links, I needed a copy program that knew how to manage the hard links. Otherwise, a simple copy would turn every hard link into a separate full copy of the file, resulting in a TM volume that was many times the size of the original.

Using rsync -H was the first thing I tried, but it ran out of memory before it managed to copy anything. Next, I tried SuperDuper!, which was recommended by quite a few people. The free version would only perform a whole disk copy, erasing the destination and starting from scratch. That was fine since that was exactly what I wanted. Sadly, it too failed after about 36 hours, reporting a “type 8 error”. It had managed to copy quite a bit of the TM backups, but I wasn’t satisfied; I wanted everything.

At this point I began to realize that there was only one way I was going to make a faithful copy of the Time Machine volume in a reasonable amount of time. Yep, that’s right, I would have to write a script that would accomplish my goal. Since Python is the language in which I am strongest, after Java, I chose to put its built-in file manipulation routines to good use. I figured I could use the same approach that timedog was using, comparing the inode values of the directory entries from one snapshot to the next. If an entry has the same inode in both snapshots, it is a hard link to an unchanged file; otherwise it is a new or modified file. After just four weeks of spending my spare time hacking away in Python, I finally succeeded.
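The inode comparison can be sketched in a few lines of Python. To be clear, this is not the code from timecopy.py or timedog, just a minimal illustration of the idea; the function name classify_entries is made up, and it only looks at the entries of one directory rather than walking a whole backup.

```python
import os


def classify_entries(prev_dir, curr_dir):
    """Split the entries of curr_dir into (linked, new).

    An entry counts as linked when an entry of the same name in prev_dir
    has the same device and inode number, meaning both names refer to the
    very same file on disk.
    """
    prev_inodes = {}
    for name in os.listdir(prev_dir):
        st = os.lstat(os.path.join(prev_dir, name))
        prev_inodes[name] = (st.st_dev, st.st_ino)

    linked, new = [], []
    for name in sorted(os.listdir(curr_dir)):
        st = os.lstat(os.path.join(curr_dir, name))
        if prev_inodes.get(name) == (st.st_dev, st.st_ino):
            linked.append(name)   # recreate with os.link() at the destination
        else:
            new.append(name)      # copy the file contents
    return linked, new
```

A copy tool built on this would call os.link for the linked entries and copy the data only for the new ones, which is what keeps the destination from ballooning in size.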

Introducing timecopy.py, the fruit of my labor. It’s a Python script that traverses a Time Machine volume and reproduces its contents to the destination of your choice. Aside from knowing about the hard links that Time Machine creates, it also copies over the extended attributes that convince TM to accept the copied backups. As a result of writing this tool, I have recovered my Time Machine backups and everything seems to be working fine once again. Granted, I’ll never know what files may have been lost due to the invalid sibling link error, but at least I managed to save the vast majority of the data.

Posted by: nlfiedler | February 8, 2009

Building netatalk on OpenSolaris 2009.06

For a few months now, basically since the Drobo started supporting third party applications, I have been using my Drobo, via a DroboShare, as a Time Machine backup for my MacBook Pro. I used the BackMyFruitUp toolkit to set up the DroboShare as an AFP server, so the Mac saw it as an Apple-compatible network file share. One particularly fun step in that process was migrating my existing TM volume over to the Drobo, but that’s another story. This story is about how I replaced the Drobo and DroboShare with a server running OpenSolaris.

Installing OpenSolaris

To start with, I took the old web/file server I had sitting around since last year and installed OpenSolaris 2009.06. Why OpenSolaris, you may ask, considering I had been a Linux user for over a decade? Well, I have one word for you: ZFS. If you don’t think ZFS is the most impressive file system on the planet, then you haven’t watched the three hour presentation. But, I can’t give ZFS all of the credit for prompting the switch to OpenSolaris. There are plenty of very good reasons to use OpenSolaris and ZFS is just one of them.

Okay, so once OpenSolaris was on the system disk, what next? Well, you should update the installed packages and reboot into the new boot environment. Next we’ll need the C compiler and related packages: pfexec pkg install gcc-dev

The system is now capable of compiling software from source, in particular netatalk. Try to get the latest stable release, at least version 2.0.4, in which a weird permissions issue introduced in Leopard has been resolved. But first, we must install a compatible version of Berkeley DB.

Installing Berkeley DB

The one prerequisite of netatalk is Berkeley DB, version 4.1.25 or higher. Compiling it is pretty straightforward. Start by adding /usr/local/lib to the library load path (pfexec crle -u -l /usr/local/lib). Then compile and install Berkeley DB like so (consult their build instructions for details, but it basically goes like this):

  1. cd build_unix
  2. ../dist/configure --prefix=/usr/local
  3. make
  4. pfexec make install

Installing netatalk

Recent versions of netatalk expect the Solaris directory structure to look different from what OpenSolaris has these days. To accommodate this, make a symbolic link from /usr/ucbinclude to /usr/include so that netatalk builds. Then edit sys/Makefile.in, removing ‘solaris’ from line 231 (to skip building the modules we don’t really need), then save the file and you’re ready to compile. Here I’m skipping the DDP bits that don’t compile cleanly on OpenSolaris, and I’m giving PAM a miss because it’s more work to set it up.

  1. ./configure --disable-ddp --without-pam
  2. make
  3. pfexec make install

Now comes the configuration stage. This setup suits my own needs, so if you want additional services then check out the netatalk documentation for more information. In general though, you will probably want to make similar changes to the default configuration, so I’ll detail what I’ve done for my environment.

  • Edit /usr/local/etc/netatalk/afpd.conf, adding the following line at the end of the file (this sets up the encrypted password authentication method and tells clients not to save the password, although that seems to be ignored on OS X):
- -transall -uamlist uams_dhx.so -nosavepassword
  • Edit /usr/local/etc/netatalk/netatalk.conf, changing “yes” to “no” for the atalk and papd services (atalk is for pre-OSX systems, and papd is for printer sharing).
  • Edit /usr/local/etc/netatalk/AppleVolumes.default, adding the following (changing the default ~ line as well):
~ cnidscheme:cdb options:usedots,invisibledots,upriv perm:0770
/tank/shared "Shared" allow:@staff cnidscheme:cdb options:usedots,invisibledots,upriv perm:0770
/tank/nathan_backup "Nathan Backup" allow:nfiedler cnidscheme:cdb options:usedots,invisibledots,upriv perm:0770
/tank/antonia_backup "Antonia Backup" allow:akwok cnidscheme:cdb options:usedots,invisibledots,upriv perm:0770

Without the cnidscheme parameter, warnings will be issued by afpd, so just set it to something reasonable (cdb or dbd). The usedots option tells netatalk to use dots instead of “:2e” for encoding dot files, while invisibledots says to make the dot files invisible by default. Now about the permissions issue alluded to above (see this discussion for details). With Tiger, newly created files would be writable by others, but in Leopard the permissions are wacky, so the latest netatalk has a workaround for that. Add the upriv option and perm:0770 to force the permissions for new files to allow others to read and write to them. After all, this is a shared volume, and it’s silly if no one else can access the files.

With the configuration complete, you can start the netatalk services. I’m assuming that it’s not running already, in which case you can just run this command: pfexec /etc/init.d/atalk start

Connecting and Permissions

Now at this point you should be able to connect to the server from your Mac, using the Connect to Server feature in Finder (you can use the Cmd+K shortcut). Type in something like “afp://myserver” in the dialog, replacing myserver with the name of your server, and you will be prompted for a name and password. Use whatever you have for your user accounts on the OpenSolaris server. You could configure netatalk to use PAM, allowing authentication against LDAP or some other service, but for simplicity I just use the system accounts. Once you’ve authenticated, you will be prompted to select an available shared volume. It doesn’t seem to matter which one you pick since the server will be added to the Finder sidebar, and from there you can browse to any of the shared volumes. As for accessing the files on the server, make sure the ownership and permissions are set up such that the user you connect as can read and write to those areas. For instance, the nfiedler user has read/write permission to /tank/nathan_backup, and that same user is a member of the staff group, and the /tank/shared area is owned by the staff group and is group writable. So far this seems to be working for us, but if you have better ideas then by all means please leave a comment.

I can’t take the credit for uncovering this information. In fact, this entry is just pulling together the different bits of information into a single, concise set of instructions. The original blog that I encountered was written by Marc Haisenko, and for step-by-step instructions on configuring netatalk on Linux, I found the kremalicious blog by Matthias Kretschmann.

Posted by: nlfiedler | December 27, 2008

Burstsort for Java

Shortly before working at Quantcast, I became interested in a sorting algorithm that I had not heard of before, called Burstsort. I found it while browsing Wikipedia, reading about various methods of sorting. Burstsort, in case you haven’t heard of it already, is very fast for large sets of strings, much faster than quicksort and its friends, including multikey quicksort and radix sort. It works by inserting the strings to be sorted into a shallow trie structure, where buckets are used to store the string references, to reduce memory usage. A bucket is “burst” when it exceeds a certain size, meaning it is replaced by a new trie node with smaller buckets beneath it. Once all of the strings have been inserted, each bucket is sorted using a multikey quicksort. The structure is then traversed in order to retrieve the sorted strings. Because each bucket is small enough to be sorted within the cache, Burstsort is cache friendly and thus runs considerably faster than algorithms that are not cache-aware.
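Here is a toy Python sketch of the idea, mostly to show the insert, burst, and traverse steps. It is not the burstsort4j code and it takes liberties: trie nodes use a dict rather than an array indexed by character, the burst threshold is tiny, and Python’s built-in sorted() stands in for the multikey quicksort. All names are my own for illustration.

```python
BURST_THRESHOLD = 4  # tiny on purpose; real implementations use thousands


class Node:
    def __init__(self):
        self.children = {}   # character -> Node, or list acting as a bucket
        self.exhausted = []  # strings that end exactly at this node


def insert_at(node, s, depth, threshold):
    while True:
        if depth == len(s):
            node.exhausted.append(s)
            return
        c = s[depth]
        child = node.children.get(c)
        if isinstance(child, Node):
            node, depth = child, depth + 1   # descend one level in the trie
            continue
        if child is None:
            child = node.children[c] = []    # first string for this character
        child.append(s)
        if len(child) > threshold:
            # Burst: replace the bucket with a node and redistribute
            # its strings by their next character.
            new_node = Node()
            node.children[c] = new_node
            for t in child:
                insert_at(new_node, t, depth + 1, threshold)
        return


def traverse(node, out):
    out.extend(node.exhausted)               # these sort before longer strings
    for c in sorted(node.children):
        child = node.children[c]
        if isinstance(child, Node):
            traverse(child, out)
        else:
            out.extend(sorted(child))        # stand-in for multikey quicksort
    return out


def burstsort(strings, threshold=BURST_THRESHOLD):
    root = Node()
    for s in strings:
        insert_at(root, s, 0, threshold)
    return traverse(root, [])
```

For example, burstsort(["pear", "apple", "peach", "app", "apple"]) returns the same list as sorted() would. The speed of the real algorithm comes from details this sketch ignores, such as compact arrays and buckets sized to fit the cache.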

Along with the original paper is a C implementation, but as far as I could tell, there was no Java implementation, at least not in open source. So, after reading all of the Burstsort papers several times, I finally started writing a Java implementation of the original algorithm. You can find the project on Google Code, at the burstsort4j project page. The initial implementation is basically a rewrite of the original C code. After fixing a few bugs that I introduced during the rewrite, it appears to be working well and is indeed much faster than the other algorithms (quicksort and its multikey variant). Of course, I also rewrote those based on their C implementations, so it could be due to mistakes made on my part. Hopefully, since this is all open source now, others can evaluate the code and point out any mistakes I may have made.

In the meantime, I’ll be working on the newer algorithms, in particular the CP-burstsort and the “bucket redesign” Burstsort. The goal there is to reduce the memory usage without substantially increasing the run time.
