Big Data/Analytics Zone is brought to you in partnership with:

Justin Bozonier is the Product Optimization Specialist at GrubHub formerly Sr. Developer/Analyst at Cheezburger. He's engineered a large, scalable analytics system, worked on actuarial modeling software. As Product Optimization Specialist he is currently leading split test design, implementation, and analysis. The opinions expressed here represent my own and not those of my employer. Justin is a DZone MVB and is not an employee of DZone and has posted 27 posts at DZone. You can read more from them at their website. View Full User Profile

Algorithm of the Week: Genetic Algorithms, Pt. 2 - Making It Run

06.11.2013
| 10355 views |
  • submit to reddit

We’re going to do something that initially seemed hard to me. Given a formula let’s find the formula for its indefinite integral. In this previous post I showed that we can find the definite integral of a function using a stochastic process known as a Monte Carlo Siumulation. Because we can find the value of the definite integral at any points we want we can use another technique to turn that data into a formula. The technique we’ll be using is known as a genetic algorithm.

The Solution

Let’s start with seeing the interactive demo and analyzing how a couple of concrete cases worked out for me. Click here for the demo

What is a Genetic Algorithm?

A genetic algorithm is a way of searching for a pretty good solution to a problem by randomly guessing solutions and then combining the ones that do the best. A genetic algorithm is composed of several different concepts.

  • Gene Pool- A population of programs.
  • Chromosome- An individual program. In the prior article I go into extreme depth on how I specify and parse these. Read it here.
  • Mutation- The process through which a program is randomly altered.
  • Breeding- The process used that combines two or more programs in the hopes of creating a new chromosome that has a better fitness than any of the parent chromosomes.
  • Fitness- A test performed on each chromosome to ascertain how good of a solution it is to the problem at hand.
  • Generation- A single iteration of chromosome breeding and fitness testing.

Imagine that you were breeding dogs to be the most aggressive. You would have a kennel (gene pool) filled with many individual dogs (chromosomes). The most aggressive ones (defined by some fitness criteria) you breed together. Occasionally one of the puppies might be born EXTREMELY aggressive… or passive due to some mutation. Once you get to the desired level of aggression you’re done. We’re going to do the same thing.

Connecting the concepts with our challenge

We’ll start with a population of completely randomly generated math formulae. We’re looking for the formulae that most closely match the integral of another function which provide in our fitness test. It’s important to note that the formula we’re fitness testing for remains constant. In our case, we’ll have the formula we’re integrating written in Javascript. We’ll then use Monte Carlo methods to find the definite integral at a series of random points specified by our full algorithm. The chromosomes that come closest to the correct answers will be bred together.

At some point in the process, it’s very common for the chromosomes to start all looking very much alike. For this reason we’ll also implement mutations.

What does this process look like?

Best score: 0.5309003809188878   Generation #0
Best score: 0.5309003809188878   Generation #1
Best score: 0.5309003809188878   Generation #2
Best score: 0.5309003809188878   Generation #3
Best score: 0.5309003809188878   Generation #4
Best score: 0.5309003809188878   Generation #5
Best score: 0.5309003809188878   Generation #6
Best score: 0.5309003809188878   Generation #7
Best score: 0.3806625060392219   Generation #8
Best score: 0.3806625060392219   Generation #9
Best score: 0.37563382661080047  Generation #10
Best score: 0.37563382661080047  Generation #11
Best score: 0.011973503793781733 Generation #12

This was me finding the indefinite integral for the sin(x). It converged remarkably quickly. The first generation, its fitness score was at about .53 and the first generation it scored below .13 the program halted. Turns out the first time it went lower than .13 it actually got an .119. The numbers don’t mean much except that they went down. It’s all relative.

The solution that is found is ( -672 ) - ( cos( x ) )

The actual answer is -cos(x) + c… So you might recognize this is exactly right. The constant of -672 is inconsequential and can be removed. All in just twelve generations!

The Algorithmic Big Picture

Below is the main program loop. It consists of the following parts, in order:

  1. Randomly generate the questions and corresponding correct answers to be used for the fitness test.
  2. Start iterating through each generation of chromosomes. In each generation…
  3. Generate as many random chromosomes as we have room for in the gene pool.
  4. Test the fitness of each one and sort them from best to worst.
  5. Breed each chromosome with the one immediately after it in quality order.
  6. Repeat until a chromosome has achieved our target fitness or until we’ve iterated through the maximum number of generations.
var main = function(data){
	var gene_pool = [];
	var winning_scores = null;

	var questions = generate_questions(data.points_to_test_each_generation);
	var correct_answers = generate_correct_answers(questions, 
		function(x){
			return eval(data.javascript_formula_text);
		});

	for(var i=0; i<data.generation_count; i++){
		var gene_pool = generate_random_chromosomes(gene_pool, data.genes_per_generation);
		var gene_scoreboard = test_fitness(gene_pool, questions, correct_answers);

		winning_scores = gene_scoreboard;//.slice(0, data.winners_per_generation);

		postMessage("Best score: " + String(gene_scoreboard[0].score));
		if(i > 5 && gene_scoreboard[0].score <= data.target_fitness){
			postMessage(
				{
					'type':'great_find', 
					'value': gene_scoreboard.slice(0, data.winners_per_generation)
				});
			break;
		}

		var winning_genes = winning_scores.slice(0,data.genes_per_generation-30)
			.map(
				function(elem){
					return elem.gene;
				});

		gene_pool = breed_with_next_best(
			winning_genes,
			data.mutation_chance,
			data.max_constant,
			data.winners_per_generation,
			data.genes_per_generation);
	}

	postMessage({'type': 'final_result', 'value': winning_scores});
};

Breeding Strategy

I saw a documentary one time about how deer or gazelle or something breeds. Let’s just pretend it was deer since I can’t find an actual reference to this child hood memory. The males form a series of loose rings around the alpha male and the female deer try to make it to the inner most ring they can. Of course the whole time the male deer are also fighting to move up to the vaunted alpha male status.

I’m doing something similar here.

I have an array filled with chromosomes that are sorted by their fitness score, best scores first. I then breed each chromosome with the next one down in quality (the next in the array).

Fitness Function

For the fitness what I’m doing is finding the integral of small segments of the curve one unit wide. I do this so that when I use Monte Carlo methods to find the area under the curve I can keep the density of my random points at least somewhat consistent.

What I do is create 70 random x points and find the integral on the domain x, x+1. I keep each of the 70 questions in an array and ask every chromosome in every iteration the to give me the integral for each of the 70 x values.

I at first tried to come up with a separate set of questions for each generation but found that seemed to make the process more volatile.

var test_fitness = function(chromosomes, fitness_questions, correct_answers){
	var gene_pool = chromosomes;
	var gene_scoreboard = [];
	for(var gene_index=0; gene_index<gene_pool.length; gene_index++){
		var gene = gene_pool[gene_index];
		var chromosome = new Chromosome(gene);
		var gene_answers = generate_gene_answers(fitness_questions, chromosome.evaluate);
		var fitness_score = std_dev(correct_answers, gene_answers)/Math.sqrt(correct_answers.length-1);
		if(!isNaN(fitness_score)){
			gene_scoreboard.push({'gene':gene, 'score':fitness_score});
		}
	}

	gene_scoreboard.sort(function(left, right){
		if(isNaN(left.score)){
			return 1;
		}
		if(isNaN(right.score)){
			return -1;
		}
		return left.score < right.score 
					? -1 
					: left.score > right.score 
						? 1 
						: 0;
	});

	return gene_scoreboard;
};

Here we use the standard error as a fitness function. Initially I happened upon using the distance formula, but then I found it changing as I added more test points. To account for this I began dividing by the number of test points but then realized it was super close to just using the standard error and thought I’d just do that instead.

Here’s my standard deviation implementation:

var std_dev = function(left_answers, right_answers){
	var squared_differences = [];
	for(var i=0; i < left_answers.length; i++){
		var left = left_answers[i];
		var right = right_answers[i];
		var squared_difference = Math.pow(left-right,2);
		squared_differences.push(squared_difference);
	}
	var sum = squared_differences.reduce(function(a, b) { return a + b; }, 0);
	return Math.sqrt(sum);
};

Solutions as Chromosomes

I talked in my previous post on Genetic Algorithms on everything that goes into encoding mathical formulae as chromosomes. You can refer to it here.

Here is how I generate random chromosomes:

var randomize = function(max_length, max_constant){
  var chromosome_components = [];
  var chromosome_definition = [
    '*','/','+','-','n','x','q','s','c','t','^','l'
  ];

  var random_length = Math.floor(Math.random() * max_length);
  for(var i = 0; i<random_length; i++){
    var chromosome_sequence = Math.floor(Math.random() * chromosome_definition.length);
    var random_sequence = chromosome_definition[chromosome_sequence];
    if(random_sequence == 'n'){
      random_sequence += String(Math.floor(Math.random() * 2*max_constant)-max_constant);
    }
    chromosome_components.push(random_sequence);
  }
  return chromosome_components.join('');
};

Each character in the chromosome definition list represents one of the possible operators in the chromosome grammar I’ve created.

Breeding Solutions

When breeding the solutions I use a method that chooses a random mid-point on the two chromosomes being bred and swaps the segments divided by the midpoint. It thus creates two “children” chromosomes.

var breed_with_next_best = function(genes, mutation_chance, max_constant, winners_to_keep, max_population){
	var previous_winners = genes.slice(0, winners_to_keep);
	var babies = previous_winners;
	var second_to_last_gene_index = genes.length-2;
	for(var gene_index=0; gene_index <= second_to_last_gene_index; gene_index++){
		var better_gene = genes[gene_index];
		var worse_gene = genes[gene_index + 1];
		var baby_genes = breed_by_gene_swap(better_gene, worse_gene);
		babies.push(baby_genes[0]);
		babies.push(baby_genes[1]);
	}
	var chromosome_factory = new Chromosome('');
	return babies.slice(0, max_population).map(function(elem){
		var mutated_chromosome = chromosome_factory.mutate(elem, mutation_chance, max_constant);
		return mutated_chromosome;
	});
};

Mutating Solutions

After breeding occurs I run all of the chromosomes through a mutation process that affects X% of the chromosomes. If a chromsome is selected for mutation there are one of four strategies that may be applied to it.

Once the program decides whether or not a mutation will occur, it decides how many mutations to apply with a maximum of 15. The number 15 is a limit I picked arbitrarily. Here’s the code that handles the mutation logic.

var token_map = {
  '*': {'type':'two_term_operator', 'value':'*'},
  '/': {'type':'two_term_operator', 'value':'/'},
  '+': {'type':'two_term_operator', 'value':'+'},
  '-': {'type':'two_term_operator', 'value':'-'},
  '^': {'type':'two_term_operator', 'value':'^'},
  'n': {'type': 'constant', 'value':null},
  'x': {'type': 'input_constant', 'value':'x'},
  'q': {'type': 'one_term_operator', 'value':'q'},
  'l': {'type': 'one_term_operator', 'value':'l'},
  's': {'type': 'one_term_operator', 'value':'s'},
  'c': {'type': 'one_term_operator', 'value':'c'},
  't': {'type': 'one_term_operator', 'value':'t'},
};

var get_values = function(dictionary){
  var values = Object.keys(dictionary).map(function(key){
    return dictionary[key];
  });
  return values;
};

var stringify = function(tokens){
  var text = '';
  for(var token_index in tokens){
    var token = tokens[token_index];
    if(token.type === 'constant'){
      if(isNaN(token.value)){
        text = 'n' + text;
      }else{
        text = 'n' + String(token.value) + text;
      }
    } else {
      text = String(token.value) + text;
    }
  }
  return text;
};

var get_mutation_choice = function(max_constant){
  var mutation_choices = get_values(token_map);
  var mutation_chosen = mutation_choices[Math.floor(Math.random() * (mutation_choices.length))];
  if(mutation_chosen.type === 'constant'){
    mutation_chosen.value = Math.floor(Math.random() * 2*max_constant)-max_constant;
  }
  return mutation_chosen;
};

var mutate = function(chromosome, chance, max_constant){
  var do_mutation = (chance >= Math.random());
  if(do_mutation){
    var tokens = lex_this(chromosome);

    var mutations_to_insert = Math.floor(Math.random()*15 + 1);

    var mutation_decider = Math.random();      
    if( mutation_decider < .25){
      for(var mutation_index=0; mutation_index<mutations_to_insert; mutation_index++){
        var mutation_chosen = get_mutation_choice(max_constant);
        var index_to_mutate = Math.floor(Math.random() * (tokens.length));
        tokens[index_to_mutate] = mutation_chosen;
      }
    } else if(mutation_decider < .5) {
      for(var mutation_index=0; mutation_index<mutations_to_insert; mutation_index++){
        var mutation_chosen = get_mutation_choice(max_constant);
        tokens.push(mutation_chosen);
      }
    } else if(mutation_decider < .75) {
      for(var mutation_index=0; mutation_index<mutations_to_insert; mutation_index++){
        var mutation_chosen = get_mutation_choice(max_constant);
        tokens.unshift(mutation_chosen);
      }
    } else {
      for(var mutation_index=0; mutation_index<mutations_to_insert; mutation_index++){
        if(tokens.length > 1){
          var index_to_mutate = Math.floor(Math.random() * (tokens.length));
          tokens.splice(index_to_mutate, 1);
        }
      }
    }
    
    var new_chromosome = stringify(tokens);
    return new_chromosome;
  } else{
    return chromosome.slice(0,50);
  }
};

The Demo

Well if you’ve made it this far and you haven’t checked out the demo yet you can play with it here: Click here for the demo

Open Challenges

Given more complex formulae the algorithm tends to prematurely converge on local optima. There are techniques to handle this such as using hamming distance to dictate mutations to be more or less likely and thus keep genetic diversity high.

Beyond that, I had to use Web Workers so that the page would be responsive while processing the algorithm. With a little more work, I bet one could run a couple of simulations in parallel at a minimal extra overhead cost and cross breed the highest quality chromosomes from each simulation.

Currently my breeding implementation is fairly simplistic. There are several different techniques which exist that might make this process work better. The technique I’m using is somewhat of an “elitist” solution which can cause rapid convergence because it focuses on breeding the best with the best. It can however also lead to converging prematurely into local optima like I’m experiencing. It might be smart to do some sort of randomized elitism. Perhaps just let the fitness score influence the randomness of the mate pairings.

Any questions or thoughts make sure to hit me up on one of the contact options in the upper left of this page.

Thank you for reading!

(Note: This article and the opinions expressed are solely my own and do not represent those of my employer.) 

Published at DZone with permission of Justin Bozonier, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)