Burstsort for Java

Shortly before working at Quantcast, I became interested in a sorting algorithm that I had not heard of before, called Burstsort. I found it while browsing Wikipedia, reading about various methods of sorting. Burstsort, in case you haven’t heard of it already, is very fast for large sets of strings, much faster than quicksort and its friends, including multikey quicksort and radix sort. It works by inserting the strings to be sorted into a shallow trie structure, where buckets are used to store the string references, to reduce memory usage. The buckets are “burst” when they exceed a certain size, and these buckets are sorted using a multikey quicksort. The structure is then traversed in order to retrieve the sorted strings. As a result, Burstsort is cache friendly and thus runs considerably faster than algorithms that are not cache-aware.

Along with the original paper is a C implementation, but as far as I could tell, there was no Java implementation, at least not in open source. So, after reading all of the Burstsort papers several times, I finally started writing a Java implementation of the original algorithm. You can find the project on Google Code, at the burstsort4j project page. The initial implementation is basically a rewrite of the original C code. After fixing a few bugs that I introduced during the rewrite, it appears to be working well and is indeed much faster than the other algorithms (quicksort and its multikey variant). Of course, I also rewrote those based on their C implementations, so it could be due to mistakes made on my part. Hopefully, since this is all open source now, others can evaluate the code and point out any mistakes I may have made.

In the mean time, I’ll be working on the newer algorithms, in particular the CP-burstsort and the “bucket redesign” Burstsort. The goal there is to reduce the memory usage, without trading off substantially from the run time.

About these ads
This entry was posted in Algorithms, Computing and tagged , . Bookmark the permalink.

3 Responses to Burstsort for Java

  1. Kimo C says:

    Interesting, I have followed this sort and the CP version as well.

  2. Nathan Fiedler says:

    It looks like the updated burstsort (http://www.springerlink.com/content/35022477853m05v7/) will be easy to implement because it’s just a few modifications of the original algorithm. Eventually I’d like to compare all of the variations in Java and compare with the published results from Sinha. The trick is going to be getting more realistic data without spending bunches of money.

  3. Nathan Fiedler says:

    Just found their sample data on Sinha’s research page: http://ranjan.sinha.googlepages.com/home

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s