Identifying Features in Images with Cluster Analysis
Over the past couple of weeks I learned about two of the more popular data clustering algorithms: K-Means Clustering and Density Based Clustering (loosely conforms to DBSCAN). In this post I'll be showing my own implementation of DBSCAN and how I used it to separate images into separate files containing assorted features from the original image. Here's a quick example to better explain: My algorithm identified all of the white space in the Cheezburger logo and separated it into a new image file. It looks like this:
- Smallest Cluster- The minimum number of points necessary to consider a group of points a cluster.
- Cluster Distance- The maximum distance between two points for them to be "dense" enough to be considered for inclusion into a cluster.
1
|
var image_bounds = GetImagePointsFromFile(filename_of_image_to_load, image_points); |
- Take each unvisited image point in the original image space
- Mark the point as visited and then find all neighboring points
- Neighboring points are found by calculating the distance between the given point and all other points and only keeping the ones that are within the Cluster Distance.
- Then, if we have more neighboring points than the Smallest Cluster size we've found ourselves a new cluster!
- Create a new cluster and add it to our list of clusters and then also add the given image point to the new cluster.
- Now we can expand our newly created cluster!
- From here we start exploring every image point our given image point is densely connected to. Basically, it's time to follow the bread crumbs to the end of the road.
- Now let's go to all of the neighboring points we haven't visited yet and add them to our "connected image points to be examined" list.
- Using this list we'll keep track of the points we started with as well as any new ones that need to be examined. As we find points that meet our criteria we'll be adding them to our current cluster
- We also need to mark these points as having been queued to be visited so that they aren't added to our list multiple times and waste time.
- Now, while we still have connected image points to be examined
- Grab one of them and if it hasn't already been visited, let's start our visit!
- Start the visit by marking the image point as being visited
- Add this point to our cluster
- Find all of the neighboring image points to this image point that is itself a neighboring image point to the image point we've gotten at step 1! Complicated much?
- Add all of these newest neighboring image points to our "connected image points to be examined" list IF:
- They have not already been visited
- They are not already queued for a visit
- There are more of them than the Smallest Cluster size.
- As we add the image points to the "connected image points to be examined" list mark them as "Queue for Visit".
public class ImagePoint
{
public bool Visited;
public int X;
public int Y;
public byte R;
public byte G;
public byte B;
public byte A;
public bool AlreadyFoundCandidates;
public bool QueuedForVisit;
public double DistanceTo(ImagePoint other_point)
{
return Math.Sqrt(
Math.Pow(other_point.R - this.R, 2)
+ Math.Pow(other_point.G - this.G, 2)
+ Math.Pow(other_point.B - this.B, 2)
+ Math.Pow(other_point.A - this.A, 2)
+ Math.Pow(other_point.X - this.X, 2)
+ Math.Pow(other_point.Y - this.Y, 2));
}
} private static IEnumerable<DensityCluster> GetImageClusters(List<ImagePoint> original_image_space, double cluster_distance, int smallest_cluster)
{
var clusters = new List<DensityCluster>();
foreach (var image_point in original_image_space.Where(x => !x.Visited))
{
Console.WriteLine("Finding cluster for image point @ ({0}, {1})", image_point.X, image_point.Y);
image_point.Visited = true;
var neighboring_points = FindNeighboringPointsFor(original_image_space, image_point, cluster_distance);
if (neighboring_points.Count >= smallest_cluster)
{
var cluster = new DensityCluster();
clusters.Add(cluster);
cluster.Add(image_point);
Console.WriteLine("Added to cluster... searching around @ ({0}, {1})", image_point.X, image_point.Y);
ExpandCluster(cluster, original_image_space, neighboring_points, cluster_distance, smallest_cluster);
}
}
clusters = clusters.Where(x => x.Points.Count >= smallest_cluster).ToList();
return clusters;
}
private static void ExpandCluster(DensityCluster cluster,
IEnumerable<ImagePoint> original_image_space,
IEnumerable<ImagePoint> neighboring_points,
double cluster_distance,
int smallest_cluster)
{
foreach (var neighboring_point in neighboring_points)
neighboring_point.QueuedForVisit = true;
var connected_image_points_to_be_examined = new Stack<ImagePoint>(neighboring_points.Where(x => !x.Visited));
while (connected_image_points_to_be_examined.Count > 0)
{
var connected_image_point_to_be_examined = connected_image_points_to_be_examined.Pop();
if (!connected_image_point_to_be_examined.Visited)
{
connected_image_point_to_be_examined.Visited = true;
cluster.Add(connected_image_point_to_be_examined);
var distant_neighbor_image_points = FindNeighboringPointsFor(original_image_space,
connected_image_point_to_be_examined,
cluster_distance);
if (distant_neighbor_image_points.Count >= smallest_cluster)
{
foreach (
var distant_neighbor_image_point in
distant_neighbor_image_points.Where(x => !x.Visited && !x.QueuedForVisit))
{
distant_neighbor_image_point.QueuedForVisit = true;
connected_image_points_to_be_examined.Push(distant_neighbor_image_point);
}
}
}
}
}
public static IList<ImagePoint> FindNeighboringPointsFor(IEnumerable<ImagePoint> image_points, ImagePoint image_point, double cluster_distance)
{
return image_points.Where(x => x.DistanceTo(image_point) <= cluster_distance).ToArray();
} public static void CreateComponentImages(IEnumerable<DensityCluster> clusters, int image_width, int image_height, string filepath_to_save_component_images_to)
{
var counter = 0;
foreach (var cluster in clusters)
{
Console.WriteLine("Saving cluster {0} to file", counter);
using (var image = new Bitmap(image_width, image_height))
{
foreach (var image_point in cluster.Points)
{
image.SetPixel(image_point.X, image_point.Y,
Color.FromArgb(image_point.A, image_point.R, image_point.G, image_point.B));
}
image.Save(
filepath_to_save_component_images_to + "_group_" + counter + ".bmp");
}
counter++;
}
}image_points.Where(x => x.DistanceTo(image_point) <= cluster_distance).ToArray()
Is where the bottleneck lies (really within the Where expression). If anyone has any concrete tips around how I could cache that data to decrease the run time, I'm all ears. (EDIT: Apparently using a KD-Tree helps significantly with nearest neighbor searches like this. I forsee yet another blog post coming soon!)
Final Results!
As a final test I ran the algorithm over this meme:
And here are some samples of the images it was broken into:
A slice of the pie chart
The meme watermark:
That's all! Thanks for reading!
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)





