Big Data/Analytics Zone is brought to you in partnership with:

Justin Bozonier is the Product Optimization Specialist at GrubHub formerly Sr. Developer/Analyst at Cheezburger. He's engineered a large, scalable analytics system, worked on actuarial modeling software. As Product Optimization Specialist he is currently leading split test design, implementation, and analysis. The opinions expressed here represent my own and not those of my employer. Justin is a DZone MVB and is not an employee of DZone and has posted 27 posts at DZone. You can read more from them at their website. View Full User Profile

Algorithm of the Week: Monte Carlo Methods

04.09.2013
| 7399 views |
  • submit to reddit

First: Why?

Monte Carlo methods are excellent for problems that are complex enough that an exact solution is nigh impossible and 100% perfect accuracy is unnecessary. What kind of problems? Well the one I’m going to tackle in this post is finding area under a curve (AKA Integration for the Calculus inclined).

But what is it?

Monte Carlo methods are when you run MANY random “simulations” and use the results to compute a reasonably accurate solution. In the case of our integration example, we will randomly generate thousands of points, calculate the percentage that occurred under the curve in question, then use that knowledge to compute the area.

Black magic!

INORITE? But seriously. Here’s a visual aid for what I’ll be doing: 

(from the wikipedia article here)

I’ll be choosing a bunch of random points along the x-axis and finding where the function says my maximum and minimum y-values will be. Once I have this, my function splits this rectangular area into two parts. Here is where the magic happens. We know how to compute the area of any rectangle. By randomly testing areas within the rectangle and tracking what proportion of our tests fell under the curve we’ll end up with a percentage. That percentage will represent the percent of the rectangular area falls under the curve of our function.

It seemed so simple until I typed it out…

The Code

Let’s start with the highest level in my code to give you a summary of what all I’m doing.

var integrate = function(fn, left_x, right_x, sample_count){
	var random_x_points = generate_x_values(left_x, right_x, sample_count);
	var y_bounds = compute_y_bounds(fn, random_x_points);
	var random_points = create_random_points(random_x_points, y_bounds);
	var points_under_curve = count_points_under_curve(fn, random_points)

	var percent_under_curve = points_under_curve / (1.0*random_points.length);
	var rectangle_area = (y_bounds.upper_bound-y_bounds.lower_bound)*(right_x-left_x)
	var area_under_curve = percent_under_curve * rectangle_area;

	return area_under_curve;
};

First off let’s look at the parameters. The integrate function takes in a function ‘fn’ to evaluate, a left and a right x boundary for our definite integral… and a sample count. The sample count is a weird one. Remember the image above where we randomly shot thousands of points onto our graph? That’s what this is. This is the number of random points we’ll sample across the given area we’re interested in computing the area.

In fact, an interesting point is the first couple lines of the function. First I generate a bunch of random x values, then I comb over those and figure out what the y-boundaries are. This way I can create a rectangle that includes as small of an area as possible.

Once I’ve gotten my x points and discovered my y-bounds I have enough information to create all of my random points. When that’s done I take all of my random points and compare them with the actual function and see how many of my random points fall under the curve of the function.

The rest of the function is pretty straight forward once you visualize it. I’ve essentially demarcated a rectangular area that contains the function and all of my sample points. We have our left and right boundary (given by left_x and right_x) and an upper and lower bound (given by y_bounds.upper_bound and y_bounds.lower_bound).

You can see my whole example and the code here: Click here to try it out

In the sample I’ve hard coded the function to be something that seemed challenging: sin(log(x)^2). This is how I invoke it:

var fn = function(x){
	return Math.sin(Math.pow(Math.log(x),2));
};
var result = integrate(fn, 0, 2, 10000);

But one gotcha

There’s one specific gotcha and that’s when the curve dips below the x-axis. In Calculus you might remember having to test for this and correcting the negative area to be positive (depending on your application of the technique). This method doesn’t have that issue. Instead by default you might end up with code that is testing the area under the curve all the way down to the lower y boundary.

In the “count_points_under_curve” function here you can see where I’m testing fn_value for being above or below zero before deciding which definition of “under the curve” I decide to test for.

var count_points_under_curve = function(fn, random_points){
	var points_under_curve = 0;
	for(var i in random_points){
		var random_point = random_points[i];
		var fn_value = fn(random_point.x);
		if(fn_value >= 0){
			if(random_point.y < fn_value){
				points_under_curve += 1;
			}
		} else {
			if(random_point.y > fn_value){
				points_under_curve += 1;
			}
		}
	}
	return points_under_curve;
};

Next time

This is one way to use Monte Carlo methods. I have another post I’m working on where I will be using Monte Carlo methods to explore split testing and hopefully verify some of the theory. In the past I’ve also used it for examining a simplified model of software development.

If you ever think, “I could totally find the answer if I could code a million different examples and see the results.” You’ve probably found a great reason to wield this tool.

Published at DZone with permission of Justin Bozonier, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)