Performance Zone is brought to you in partnership with:

Mark is a graph advocate and field engineer for Neo Technology, the company behind the Neo4j graph database. As a field engineer, Mark helps customers embrace graph data and Neo4j building sophisticated solutions to challenging data problems. When he's not with customers Mark is a developer on Neo4j and writes his experiences of being a graphista on a popular blog at http://markhneedham.com/blog. He tweets at @markhneedham. Mark is a DZone MVB and is not an employee of DZone and has posted 524 posts at DZone. You can read more from them at their website. View Full User Profile

Haskell: An Impressively Non-Performant Union Find

01.03.2013
| 1886 views |
  • submit to reddit

 

I’ve spent the best part of the last day debugging a clustering algorithm I wrote as part of the Algorithms 2 course, eventually coming to the conclusion that the union find data structure I was using wasn’t working as expected.

In our algorithm we’re trying to group together points which are ‘close’ to each other and the data structure is particular useful for doing that.

To paraphrase from my previous post about how we use the union find data structure:

We start out with n connected components i.e. every point is in its own connected component.

We then merge these components together as calculate the neighbours of each point until we’ve iterated through all the points and have grouped all the points into the appropriate components.

I came across 3 libraries which implement this data structure – union-find, equivalence and persistent-equivalence.

union-find seemed like it had the easiest API to understand so I plugged it into my program only to eventually realise that it wasn’t putting the points into components as I expected.

I eventually narrowed the problem down to the following example:

> let uf = emptyEquivalence (0,9)
[(0,0),(1,1),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,8),(9,9)]
 
> components $ equate 0 1 uf
[(0,0),(1,0),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,8),(9,9)]
 
> components $ equate 8 9 $ equate 0 1 $ uf
[(0,0),(1,0),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,8),(9,8)]
 
> components $ equate 0 8 $ equate 8 9 $ equate 0 1 $ uf
[(0,0),(1,0),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,0),(9,8)]

We start out with a union-find where every point is in its own component. The next line puts points ’0′ and ’1′ into the same component which it does by making indexes ’0′ and ’1′ of the array both have the same value, in this case 0, which is known as the component’s leader.

We continue doing that for points ’8′ and ’9′ which works fine and our union find now consists of 8 components – the ones with leaders 8 & 0 which have two elements and then ones with leaders 2,3,4,5,6 & 7 which only contain themselves.

Things go wrong on our next step where we try to join nodes ’0′ and ’8′. As I understand it what should happen here is that all the points connected to either ’0′ or ’8′ should end up in the same component so we should have a component containing points ’0′, ’1′, ’8′ and ’9′ but ’9′ has been missed off in this case.

The implementation is deliberately written to work like this so I thought I’d try writing my own version based on the following Ruby version:

class UnionFind
  def initialize(n)
    @leaders = 1.upto(n).inject([]) { |leaders, i| leaders[i] = i; leaders }
  end
 
  def connected?(id1,id2)
    @leaders[id1] == @leaders[id2]
  end
 
  def union(id1,id2)
    leader_1, leader_2 = @leaders[id1], @leaders[id2]
    @leaders.map! {|i| (i == leader_1) ? leader_2 : i }
  end
end

This is my Haskell equivalent which I adapted from the union-find one that I mentioned above:

module Leaders (UnionSet, create, components, numberOfComponents, indexes, inSameComponent, union) where
 
import Control.Concurrent.MVar
import Control.Monad
import Data.Array.Diff as ArrayDiff
import Data.IORef
import qualified Data.List
import Data.Maybe
import System.IO.Unsafe
import qualified Data.Set as Set
 
arrayFrom :: (IArray a e, Ix i) => (i,i) -> (i -> e) -> a i e
arrayFrom rng f = array rng [ (x, f x) | x <- range rng ]
 
ref :: a -> IORef a
ref x = unsafePerformIO (newIORef x)
 
data UnionSet i = UnionSet { leaders :: IORef (DiffArray i i) }
 
create :: Ix i => (i, i) -> UnionSet i
create is = UnionSet (ref (arrayFrom is id))
 
extractComponents :: Ix i => DiffArray i i -> [(i, i)]    
extractComponents  = Set.toList . Set.fromList . ArrayDiff.assocs
 
components :: Ix i => UnionSet i -> [(i,i)]
components (UnionSet leaders) = unsafePerformIO $ do
    l <- readIORef leaders
    return (extractComponents l)
 
numberOfComponents :: Ix i => UnionSet i -> Int
numberOfComponents (UnionSet leaders) = unsafePerformIO $ do
    l <- readIORef leaders
    return (length $ extractComponents l) 
 
indexes :: Ix i => UnionSet i -> [(i,i)]
indexes (UnionSet leaders) = unsafePerformIO $ do
    l <- readIORef leaders
    return (ArrayDiff.assocs l)       
 
inSameComponent :: Ix i => UnionSet i -> i -> i -> Bool
inSameComponent (UnionSet leaders) x y = unsafePerformIO $ do
    l <- readIORef leaders
    return (l ! x == l ! y)
 
union x y (UnionSet leaders)  = unsafePerformIO $ do
    ls <- readIORef leaders
    let leader1 = ls ! x 
        leader2 = ls ! y
        newLeaders = map (\(index, value) -> (index, leader2)) . filter (\(index, value) -> value == leader1) $ assocs ls        
    modifyIORef leaders (\l -> l // newLeaders)
    return $ UnionSet leaders

We can recreate the above example like so:

> indexes $ Leaders.union 0 8 $ Leaders.union 8 9 $ Leaders.union 0 1 $ create (0,9)
[(0,9),(1,9),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,9),(9,9)]

Unfortunately it takes 44 seconds to execute which is mainly down to the call to assocs on line 46. assocs gives us a list of all the indexes and their corresponding values which we use to work out which indexes need to be updated with a new leader.

The rest of the code is mostly boiler plate around getting the array out of the IORef. The IORef allows us to have a mutable array in this instance. There is a page on the c2 wiki which explains how to use IORef in more detail.

Although using the DiffArray allows us to provide a pure external interface around its use, it is known to be 10-100x slower than an MArray.

I’ve been playing around with a version of the union-find data structure which makes use of an MArray instead and has decreased the execution time to 34 seconds.

Unless anyone has any ideas for how I can get this to perform more quickly I’m thinking that perhaps an array isn’t a good choice of underlying data structure at least when using Haskell.



Published at DZone with permission of Mark Needham, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)