Big Data/Analytics Zone is brought to you in partnership with:

Justin Bozonier is the Product Optimization Specialist at GrubHub formerly Sr. Developer/Analyst at Cheezburger. He's engineered a large, scalable analytics system, worked on actuarial modeling software. As Product Optimization Specialist he is currently leading split test design, implementation, and analysis. The opinions expressed here represent my own and not those of my employer. Justin is a DZone MVB and is not an employee of DZone and has posted 27 posts at DZone. You can read more from them at their website. View Full User Profile

Identifying Features in Images with Cluster Analysis

10.29.2012
| 3443 views |
  • submit to reddit
What Did I Learn to Do?

Over the past couple of weeks I learned about two of the more popular data clustering algorithms: K-Means Clustering and Density Based Clustering (loosely conforms to DBSCAN). In this post I'll be showing my own implementation of DBSCAN and how I used it to separate images into separate files containing assorted features from the original image. Here's a quick example to better explain: My algorithm identified all of the white space in the Cheezburger logo and separated it into a new image file. It looks like this:

Before:
1image
After (gray areas are transparent now):
0image
It also extracted the text:
3image
0image
It should be noted I'll just be discussing the naive implementation of the algorithm. It takes FOREVER to process a larger limit. I will also discuss why that is and some abstract thoughts around how it could be improved.
 
Why Am I Learning This Now?
The main goal of this exercise is not to learn to do image processing nor is it to learn to write the best or fastest clustering algorithm. The point of this experiment was to learn enough about clustering that I could write my own naive implementation. It's one thing to say I know something and it's a whole other to know enough to be able to craft it.
 
What is Data Clustering?
Data clustering is something anyone who has ever looked at a scatter plot before has done. It's where you look for the areas where many points seem to group together and label them as separate groups. A good example is on Wikipedia, here's the image they use:
Image
They have already separated the data into two groups for us. This is also a great example of why one might choose to use a density based data clustering algorithm over something like a k-means algorithm. 
K-means works best at identifying circular groupings like the blue group above. K-means is also very straightforward to create. The biggest downside to that algorithm though is that the red cluster in the above graph would be split into multiple groups because K-means can't account for the long, curved shape. A density based clustering algorithm however can gracefully handle both but gets a bit more tricky as the algorithm essentially becomes a graph traversal problem that, given large numbers of points, prevents the simpler recursive graph travesal solutions from being tractable.
 
How I Applied All of This to Images
As I've been learning more about data analysis I keep relearning a saying I've heard before: "Images are a great data source for machine learning." In this case I chose to use images as a multi-dimensional data source using each pixel as an "ImagePoint" in a 6-dimensional space. Red, green, and blue were the first three dimensions. The last three were the alpha transparency value as well as the X and Y coordinates of the pixel in the image.
My main hypothesis was that I should see images separate by colors somewhat based upon their relative visual location. The Cheezburger logo above is a perfect example of my expectation.
 
Step By Step Through Code
Before we dive into the code there are three concepts you should be aware of that are key to how I implemented DBSCAN:
  1. Smallest Cluster- The minimum number of points necessary to consider a group of points a cluster.
  2. Cluster Distance- The maximum distance between two points for them to be "dense" enough to be considered for inclusion into a cluster.
The names used above are not defacto names but my own from the code.
 
Step 1. Load All Pixels Into Memory
To start with I load all of the image's pixels into memory and convert them to the ImagePoint objects I discussed earlier. For the sake of this conversation let's just establish that function call looks like this:
1
var image_bounds = GetImagePointsFromFile(filename_of_image_to_load, image_points);
I get back the image dimensions so that once I write each cluster's image data to separate files I can position the images in the same places they would have appeared in the original in order to help me see how it worked.
 
Step 2. Cluster!
This step isn't quite as trivial as the previous one. I'll write out the psuedo-code first and then show you the original code.
  1. Take each unvisited image point in the original image space
  2. Mark the point as visited and then find all neighboring points
  3. Neighboring points are found by calculating the distance between the given point and all other points and only keeping the ones that are within the Cluster Distance.
  4. Then, if we have more neighboring points than the Smallest Cluster size we've found ourselves a new cluster!
  5. Create a new cluster and add it to our list of clusters and then also add the given image point to the new cluster.
  6. Now we can expand our newly created cluster!
  7. From here we start exploring every image point our given image point is densely connected to. Basically, it's time to follow the bread crumbs to the end of the road.
  8. Now let's go to all of the neighboring points we haven't visited yet and add them to our "connected image points to be examined" list.
  9. Using this list we'll keep track of the points we started with as well as any new ones that need to be examined. As we find points that meet our criteria we'll be adding them to our current cluster
  10. We also need to mark these points as having been queued to be visited so that they aren't added to our list multiple times and waste time.
  11. Now, while we still have connected image points to be examined
  12. Grab one of them and if it hasn't already been visited, let's start our visit!
  13. Start the visit by marking the image point as being visited
  14. Add this point to our cluster
  15. Find all of the neighboring image points to this image point that is itself a neighboring image point to the image point we've gotten at step 1! Complicated much?
  16. Add all of these newest neighboring image points to our "connected image points to be examined" list IF:
    1. They have not already been visited
    2. They are not already queued for a visit
    3. There are more of them than the Smallest Cluster size.
  17. As we add the image points to the "connected image points to be examined" list mark them as "Queue for Visit".
There are a couple of key simplifying assumptions I'm making. One is that a point only need be visited once. Also that once a single point in our outer loop identifies a new cluster, that cluster will be built in its entirety with the inner loop. This means that we never have to check whether or not a point has already been added to a cluster.
Here is the definition of the ImagePoint class:
 
public class ImagePoint
{
    public bool Visited;
    public int X;
    public int Y;
    public byte R;
    public byte G;
    public byte B;
    public byte A;

    public bool AlreadyFoundCandidates;
    public bool QueuedForVisit;

    public double DistanceTo(ImagePoint other_point)
    {
        return Math.Sqrt(
                Math.Pow(other_point.R - this.R, 2)
            + Math.Pow(other_point.G - this.G, 2)
            + Math.Pow(other_point.B - this.B, 2)
            + Math.Pow(other_point.A - this.A, 2)
            + Math.Pow(other_point.X - this.X, 2)
            + Math.Pow(other_point.Y - this.Y, 2));
    }
}
 
Especially important is the distance function. If I wanted to, it's entirely possible to change the algorithm dramatically just by adding or removing what factors I want to include in the distance computation. One thing I could choose to remove is the X and Y coordinates for example and then I'd up with clusters of pixels that have smoothly transitioned colors (sort of).
 
Anyway, now the actual code that utilizes this:
 
private static IEnumerable<DensityCluster> GetImageClusters(List<ImagePoint> original_image_space, double cluster_distance, int smallest_cluster)
{
    var clusters = new List<DensityCluster>();
    foreach (var image_point in original_image_space.Where(x => !x.Visited))
    {
        Console.WriteLine("Finding cluster for image point @ ({0}, {1})", image_point.X, image_point.Y);
        image_point.Visited = true;
        var neighboring_points = FindNeighboringPointsFor(original_image_space, image_point, cluster_distance);
        if (neighboring_points.Count >= smallest_cluster)
        {
            var cluster = new DensityCluster();
            clusters.Add(cluster);

            cluster.Add(image_point);

            Console.WriteLine("Added to cluster... searching around @ ({0}, {1})", image_point.X, image_point.Y);

            ExpandCluster(cluster, original_image_space, neighboring_points, cluster_distance, smallest_cluster);
        }
    }

    clusters = clusters.Where(x => x.Points.Count >= smallest_cluster).ToList();
    return clusters;
}

private static void ExpandCluster(DensityCluster cluster,
                                    IEnumerable<ImagePoint> original_image_space,
                                    IEnumerable<ImagePoint> neighboring_points,
                                    double cluster_distance,
                                    int smallest_cluster)
{
    foreach (var neighboring_point in neighboring_points)
        neighboring_point.QueuedForVisit = true;

    var connected_image_points_to_be_examined = new Stack<ImagePoint>(neighboring_points.Where(x => !x.Visited));
    while (connected_image_points_to_be_examined.Count > 0)
    {
        var connected_image_point_to_be_examined = connected_image_points_to_be_examined.Pop();
        if (!connected_image_point_to_be_examined.Visited)
        {
            connected_image_point_to_be_examined.Visited = true;
            cluster.Add(connected_image_point_to_be_examined);
            var distant_neighbor_image_points = FindNeighboringPointsFor(original_image_space,
                                                                            connected_image_point_to_be_examined,
                                                                            cluster_distance);
            if (distant_neighbor_image_points.Count >= smallest_cluster)
            {
                foreach (
                    var distant_neighbor_image_point in
                        distant_neighbor_image_points.Where(x => !x.Visited && !x.QueuedForVisit))
                {
                    distant_neighbor_image_point.QueuedForVisit = true;
                    connected_image_points_to_be_examined.Push(distant_neighbor_image_point);
                }
            }
        }
    }
}

public static IList<ImagePoint> FindNeighboringPointsFor(IEnumerable<ImagePoint> image_points, ImagePoint image_point, double cluster_distance)
{
    return image_points.Where(x => x.DistanceTo(image_point) <= cluster_distance).ToArray();
} 
Step 3: Create an image for each cluster
This is as simple as looping through all of our clusters and writing each pixel contained within to a Bitmap on disk. C# makes this fairly straight-forward (one reason I didn't do this in Ruby actually). It doesn't mean I couldn't have done it in Ruby, just that it wasn't immediately apparent as to how.
 
public static void CreateComponentImages(IEnumerable<DensityCluster> clusters, int image_width, int image_height, string filepath_to_save_component_images_to)
{
    var counter = 0;
    foreach (var cluster in clusters)
    {
        Console.WriteLine("Saving cluster {0} to file", counter);
        using (var image = new Bitmap(image_width, image_height))
        {
            foreach (var image_point in cluster.Points)
            {
                image.SetPixel(image_point.X, image_point.Y,
                                Color.FromArgb(image_point.A, image_point.R, image_point.G, image_point.B));
            }

            image.Save(
                filepath_to_save_component_images_to + "_group_" + counter + ".bmp");
        }
        counter++;
    }
}
What Could Be Done Better?
My algorithm takes on the order of hours to run on any medium sized image. Some simple profiling has shown that my reliance on 

image_points.Where(x => x.DistanceTo(image_point) <= cluster_distance).ToArray()

Is where the bottleneck lies (really within the Where expression). If anyone has any concrete tips around how I could cache that data to decrease the run time, I'm all ears. (EDIT: Apparently using a KD-Tree helps significantly with nearest neighbor searches like this. I forsee yet another blog post coming soon!)

Final Results!

As a final test I ran the algorithm over this meme:

2image

 

And here are some samples of the images it was broken into:

A slice of the pie chart

1image

 

The meme watermark:

Image

That's all! Thanks for reading!

 
 
 
Published at DZone with permission of Justin Bozonier, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)