Sunday, June 28, 2015

64% of Canadians Live South of Seattle!!! (Python)

On these long summer days, it's easy to remember that Seattle is REALLY far north.  This time of year, our days are only a few minutes shy of 16 hours long!  That's great, though we pay for it with our long winter nights.

Of course, I'm reminded that we're not SO far north.  I have a few friends in Vancouver, BC, who have a more extreme situation.  But that's Canada.  It's not called the Great White NORTH for nothing (btw, I'm an old SCTV fan).

But wait, are Canadians really "northerners" compared to us?  I mean, I sometimes hear the statistic that 90% of the Canadian population lives within 100 miles of the US border.  Because they love us.  Or because being really far north isn't very fun.  Here's a source:  https://www.cia.gov/library/publications/the-world-factbook/geos/ca.html

And another thing - if you take a good look at a map (preferably a globe), you'll see that a large portion of Eastern Canada isn't that far north at all.  The border takes a big southern dip just east of the Great Lakes and doesn't come back north until the edge of Maine.  That big dip, whatever Canadians call it, seems to hold a large portion of Canada's major cities (Toronto, Ottawa, Montreal, Quebec City...).



It seemed likely to me that most of Canada's population lived in that dip, which would indicate that most Canadians actually live farther south than Seattle!

Note:  On most maps, because of the whole sphericalish nature of Earth (and the projections we flatten it with), lots of parts of Canada that are south of Seattle appear to be comfortably north of it.  On this map from Wikimedia, St. John's, way out on the east coast of Canada, looks easily north of Seattle.  It isn't.  At latitude 47.5675°N, it is south of Seattle (47.6097°N).  Quebec City, at 46.816667°N, is also south of Seattle.



Anyway, I wanted to know how many Canadians live south of Seattle.

So I set out to answer that question, programmatically.

My first thought was to find some Canadian census data.  I quickly found an official-looking site from the 2011 census (http://www12.statcan.gc.ca/census-recensement/2011/dp-pd/hlt-fst/pd-pl/Table-Tableau.cfm?LANG=Eng&T=801&S=51&O=A), and zeroed in on a large list of Canadian population centers with populations of at least 1000 people.  I figured I could take these centers, find their locations, compare them to Seattle, and have my answer!  People living outside of those areas would probably be an insignificant portion of the population (MISTAKE!  My big city biases bit me there).

This solution was a little more interesting than my final solution, so I'll briefly describe it here:
Start with the spreadsheet of all population centers.
Go through the spreadsheet, feeding each population center's name into a website (www.findlatitudeandlongitude.com) and extracting that center's lat/lon.
(Population centers the site couldn't find had to be researched manually.)

As I mentioned earlier, my big city biases bit me there.  I thought, how much of the population actually lived outside of population centers of at least 1000 people?  Well, of the 33+ million Canadians, around 10 million of them.  Oops.  That seemed like too large a part of the population to leave unaccounted for, so I went back to the drawing board.

How about taking all of the county populations?  Unlike the United States, Canada doesn't have counties everywhere - at least, they're not all called 'counties'.  So that thought was a no-go.

Then I discovered that Canada does have "census divisions".  There are 200+ of them, they cover the entire country, and the sum of their populations is the total population of Canada!  Yay!  So I had all of the population data I needed; I just needed each division's location so I could compare its latitude with Seattle's.

The easiest answer came from Wikipedia's List_of_census_divisions_of_Canada_by_population.  Each census division was listed (their source was the same 2011 Canadian census data I was using), and in addition to the census data there was a column listing an illustrative Canadian census subdivision, with a link to a page that typically included that subdivision's lat/lon.  So it ended up being more of a Wikipedia scraping project than a general web scraping project.  A little less diverse, but it would give me the answer.  And since Wikipedia pages are somewhat standardized, I was generally able to find the lat/lon programmatically (with a small number of exceptions that I handled manually).

So the code is in Python, and I got most of my information from Wikipedia (Google searched when I needed more info).

I realize there's a margin for error regarding lat/lon.  My program treats the entire population of a census division (or Seattle) as residing at the exact lat/lon point found.  I didn't think this was too much of a problem, especially when I took a manual look at the populations roughly around the same latitude as Seattle: only a few hundred thousand Canadians might incorrectly be counted as south of Seattle when they are north, or vice versa.  Like the people of Grates Cove, NL (latitude 48.1619°N), counted as being south of Seattle (47.6097°N) because they are in Canadian Census Division Newfoundland Division 1, which is listed in Wikipedia as being at 47.55°N.  I just wanted an estimate, so this is good enough.  If we pick a location latitudinally closer to Canadian population centers, the estimate will be worse.


How I Did It

First, I tried Webdriver.  In my first iteration of this problem, I'd tried to use a website, findlatitudeandlongitude.com, to get the lat/lon coordinates of the population centers I was considering.  This required inputting the name of the city into the form, submitting the info, and then retrieving the lat/lon. That was mostly fine.  Occasionally I couldn't find a population center there, so I'd have to find it elsewhere, but that was the only problem.  
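For the curious, that first pass looked roughly like the sketch below.  This isn't the exact code I ran, and the form field names are placeholders - whatever findlatitudeandlongitude.com actually called its fields at the time is what the real code used.

    # A rough sketch of the original Webdriver approach (not my exact code).
    # The form field names below are placeholders, not the site's real ones.
    from selenium import webdriver

    def LookUpLatLon(driver, place_name):
        driver.get("http://www.findlatitudeandlongitude.com/")
        box = driver.find_element_by_name("address")      # placeholder field name
        box.clear()
        box.send_keys(place_name)
        box.submit()
        lat = driver.find_element_by_name("latitude").get_attribute("value")   # placeholder
        lon = driver.find_element_by_name("longitude").get_attribute("value")  # placeholder
        return float(lat), float(lon)

    driver = webdriver.Firefox()
    print LookUpLatLon(driver, "Toronto, Ontario")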
When I discovered that using population centers missed 10 million of Canada's 33+ million people, I found the census divisions, and those led me to Wikipedia.  There, most of the census division pages included a lat/lon, and I decided that was representative enough for my estimation needs.
I no longer needed to manipulate forms, so I switched to mechanize and BeautifulSoup.

Mechanize gives you a browser-like object to use to interact with web pages.
I opened up the Canada Census Division page on Wikipedia, and put it through BeautifulSoup, which allowed me to more easily parse through the data.

    # mechanize stands in for a browser; BeautifulSoup parses the HTML it fetches.
    # (csv is used further down to write out the results.)
    import csv
    import mechanize
    from bs4 import BeautifulSoup

    br = mechanize.Browser()

    url = "http://en.wikipedia.org/wiki/List_of_census_divisions_of_Canada_by_population"
    response = br.open(url)
    soup = BeautifulSoup(br.response().read())

    # Every row of the big census division table
    table = soup.find_all('tr')

Once I grabbed the table, I quickly went through each row, extracting the information I needed from each Census Division - name, population, and location.

    rowNum = 0
    for row in table:
        rowNum += 1
        print "Evaluating row: ", rowNum
        #print row.text
        print "-----------------------"
        columns = row.find_all('td')
        try:
            # Test that the first column is a number, indicating it was used in 2011 census
            int(columns[0].find('span','sorttext').text)

            # Column 2 - Census Division
            name = columns[2].find('span','sorttext').text.encode('utf-8')

            # ... population and location get pulled from the remaining columns ...
        except (IndexError, ValueError, AttributeError):
            continue   # header/footer rows don't have the expected columns
You might notice that I encoded the name into utf-8.  This is because when I first wrote the data to a .csv file (you'll see this a few lines down), Python complained about the ASCII codec not being able to handle Canadian city names like Thérèse-De Blainville.  After doing a little research, including reading Joel's article on Unicode, I decided to just force everything to utf-8 for simplicity.  I only sorta cared about each city's name.
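Here's a tiny standalone illustration of the problem (the file name and population value are made up):

    # -*- coding: utf-8 -*-
    # Python 2's csv writer deals in byte strings, so writing a unicode name
    # like u"Thérèse-De Blainville" blows up with a UnicodeEncodeError unless
    # it's encoded first.  (File name and population below are made up.)
    import csv

    with open("encoding_demo.csv", "wb") as f:
        writer = csv.writer(f)
        name = u"Th\xe9r\xe8se-De Blainville"
        # writer.writerow([name, 12345])                 # raises UnicodeEncodeError
        writer.writerow([name.encode('utf-8'), 12345])   # works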

Going back to getting the location - that was a little more involved, because the data wasn't in the table itself.  In the table, each division's name had a link to that division's own Wikipedia page.  Often, those pages had a lat/lon for the division.

    # Get the lat/lon coordinates of a municipality from its Wikipedia page
    def GetCoordinatesFromWikipediaPage(br, url):
        response = br.open(url)
        soup = BeautifulSoup(br.response().read())
        # Wikipedia marks coordinates with class="geo", e.g. "47.55; -52.75"
        latlon = soup.find(class_="geo")
        if latlon is not None:
            return tuple([float(x) for x in latlon.text.split(';')])
        else:
            return None

When a division's page didn't have coordinates, I went back to the main table's "Illustrative census subdivision" column and checked that page for a lat/lon.  That worked for everything except Quebec City, which is a weird kind of census division for some reason.  I ended up getting its lat/lon manually.
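In rough outline, the lookup order per division looked something like this sketch (the function and dictionary names here are made up for illustration; GetCoordinatesFromWikipediaPage is the function above):

    # Rough order of the lat/lon lookup for each census division (a sketch).
    # ManualOverrides is a small hand-maintained dict for the stragglers; the
    # Quebec City latitude is the one quoted above, the longitude is approximate.
    ManualOverrides = {"Quebec": (46.816667, -71.2)}

    def GetDivisionCoordinates(br, division_name, division_url, subdivision_url):
        coords = GetCoordinatesFromWikipediaPage(br, division_url)
        if coords is None and subdivision_url is not None:
            # Fall back to the "Illustrative census subdivision" page
            coords = GetCoordinatesFromWikipediaPage(br, subdivision_url)
        if coords is None:
            # Last resort: values I looked up by hand
            coords = ManualOverrides.get(division_name)
        return coords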

Once I retrieved a census division's data, I stuck it in a list, and I put that list into a list of lists representing all of the census divisions.  That, I wrote out to a csv file.

    # Write all locations to the outfile
    with open(outFile, "wb") as csv_file:
        writer = csv.writer(csv_file, delimiter=',')
        for line in locations:
            writer.writerow(line)

And that was the end of my webwork.  To calculate what percentage of Canada's population lives south of Seattle, I took Seattle's latitude (47.6097°N), and ran it against the latitudes in my csv file, tallying the populations of divisions greater than and less than my latitude.  

    # Tally census division populations north and south of the given latitude.
    # CSV columns used: [0] division name, [2] population, [4] latitude.
    def FindNorthAndSouthPopulations(lat):
        north = 0
        south = 0
        with open('CanadaPopulations.csv','rb') as csvfile:
            divisions = csv.reader(csvfile, delimiter=',')
            for division in divisions:
                print division[0], division[2], division[4]
                if float(division[4]) < lat:
                    south += int(division[2])
                else:
                    north += int(division[2])
                print north, south
        print "North = ", north
        print "South = ", south
        return north, south

    def FindPercentageOfCanadaPopulation(latitude):
        populations = FindNorthAndSouthPopulations(latitude)
        totalPopulation = populations[0] + populations[1]
        percentNorth = float(populations[0]) / totalPopulation * 100
        percentSouth = float(populations[1]) / totalPopulation * 100
        print "Total population = ", totalPopulation
        print "Population living north of latitude {0:.5f} is {1}.".format(latitude, populations[0])
        print "Population living south of latitude {0:.5f} is {1}.".format(latitude, populations[1])
        print "{0:.2f}% of Canada's population lives north of latitude {1:.5f}!".format(percentNorth, latitude)
        print "{0:.2f}% of Canada's population lives south of latitude {1:.5f}!".format(percentSouth, latitude)

My answer?  64.18% of Canada's population (21,486,482 people) lives south of Seattle, according to the 2011 census.  Take that, my southerner friends (that's most of you, Canadians).

To borrow a phrase from Game of Thrones (from the show - I don't think the quote is in the book): you're from south of the 47.6th parallel; that makes you a southerner to me.

How Others Compare

Here are a few other cities' results when I ran them through my program.  The closer the city is latitudinally to large Canadian population centers, the less accurate the calculation will be (I'm looking at the Vancouver and Detroit estimates, in particular).


City                                 Latitude       % of Canadians Living Farther South
Juneau, AK                           58.3000°N      99.66%
(just north of) Vancouver, BC        49.24945°N     76.01%
Blaine, WA (at the Canada border)    48.9881°N      68.71%
Everett, WA                          47.9633°N      64.82%
Seattle, WA                          47.6097°N      64.18%
Madawaska, ME (northern Maine)       47.207186°N    62.89%
Fargo, ND                            46.8772°N      62.17%
Portland, OR                         45.5200°N      47.81%
Minneapolis, MN                      44.9778°N      33.78%
Augusta, ME                          44.3235°N      28.63%
Portland, ME                         43.6667°N      14.42%
Detroit, MI                          42.3314°N       1.16%
Chicago, IL                          41.8369°N       0.00%

Sorry, 171 residents of Pelee Island.  I know you're south of Chicago, but your census district representative location isn't!

Tuesday, October 14, 2014

Project Euler 243 - And Python!

Introduction

Project Euler is one of my favorite ways to occasionally exercise my puzzle-solving, math, and programming interests.  The problems tend not to be sophisticated from a software engineering angle, but the more advanced ones require both mathematical and computer science knowledge.  It seems like the more difficult problems tend to be more mathematical.

I chose problem 243 because it was in the 200s and many people had solved it.  243 is one of the few problems necessary to get the "Trinary Triumph" badge, along with 1, 3, 9, 27, and 81 (powers of 3).

I'll say this about 243 - I did a lot of work before finding the quick mathematical solution.  I could give a quick write-up on the actual solution, but I put in so much work on the solutions that didn't quite work that it would seem like a waste not to show the process.  The other nice thing?  Giving out the actual answer rubs me the wrong way; it lets people who haven't done the work take credit for doing so.  There are other places where you can find the answers, but not here - if you want to cheat, you'll find the answer far more easily elsewhere.

The other part of this is my journey into Python. I've only been coding in Python for a few months.  I took up Python basically because I hadn't done it before, and I wanted to work in a language that isn't associated with the Microsoft sphere of influence.  I worked for Microsoft, so I don't necessarily need more of that.

Part 1 - Brute Force


https://projecteuler.net/problem=243

The first step, usually, is to look at the problem from a brute force angle.  For Project Euler, I prefer problems that cannot be solved by the brute force approach within anywhere near a reasonable amount of time.  Since the problems are supposed to be solvable within a minute, I consider the problem too easy if brute force can solve it even within an hour.  Fortunately, Project Euler rarely makes the problems that easy.

Anyway, brute force is pretty much always worth trying.  It gives a way to check the more sophisticated algorithm, and often lets me wrap my head around the problem, so my brain can think of other ways to solve it.

We are looking for irreducible fractions.  On an individual fraction basis, that seems pretty easy to check manually:  just find out whether the numerator and denominator have any common factors.  If they do, the fraction can be reduced.

Python has a greatest common divisor function, gcd (in the fractions module in Python 2), which makes a brute force solution easy.

The basic brute force solution is:
Start from d=2.
Find the resilience of d.
If it is less than 15499/94744, we have the solution; otherwise, move on to d+1 and repeat.  (See the sketch just below.)
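Here's that driver loop as a sketch - it leans on the R(d) function defined in the next section, and it will not finish in any reasonable amount of time:

from fractions import Fraction

# Brute force driver (a sketch): walk d upward until the resilience drops below
# the target.  Uses the R(d) function defined below.  Hopeless in practice.
def brute_force(target):
    d = 2
    while True:
        if R(d) < target:
            return d
        d += 1

# brute_force(Fraction(15499, 94744))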

What is the resilience R(d) of a number?

Take all the numbers from 1 to d-1.
Count how many of those numbers have a gcd of 1 with d; that count divided by d-1 is the resilience.

from fractions import Fraction, gcd   # Python 2's gcd lives in the fractions module

# Resilience: the fraction of numerators 1..d-1 that can't be cancelled with d
def R(d):
    resilient_numerator = 0
    for i in range(1,d):
        if gcd(i,d) == 1:
            resilient_numerator += 1
    return Fraction(resilient_numerator,d-1)

That bit of code correctly returns the value expected for d=12 given in the example, 4/11.  So we'll consider that good for now.

Running R(d) through a range of denominators, I didn't get one less than 15499/94744 within a minute.  Or two.  Or three.

That was the first attempt, and it was no surprise it failed.  I didn't even know how close it was.

Part 2 - How close


Next I wanted to see how close I was actually getting.

def brute_lowest_resilience(maxint):
    best = Fraction(1,1)
    
    for i in range(2,maxint):
        current = R(i)
        if current < best:
            print "New lowest!", i, current, float(current)
            best = current
    return best

The goal is 15499/94744, which is around 0.163588.

So I ran my brute_lowest_resilience to 100000.

Even after a few minutes, the best result was far above the goal, creeping more slowly towards it as time went on.

Here's the output after a few minutes:

New lowest! 4 2/3 0.666666666667
New lowest! 6 2/5 0.4
New lowest! 12 4/11 0.363636363636
New lowest! 18 6/17 0.352941176471
New lowest! 24 8/23 0.347826086957
New lowest! 30 8/29 0.275862068966
New lowest! 60 16/59 0.271186440678
New lowest! 90 24/89 0.269662921348
New lowest! 120 32/119 0.268907563025
New lowest! 150 40/149 0.268456375839
New lowest! 180 48/179 0.268156424581
New lowest! 210 48/209 0.22966507177
New lowest! 420 96/419 0.229116945107
New lowest! 630 144/629 0.22893481717
New lowest! 840 192/839 0.22884386174
New lowest! 1050 240/1049 0.228789323165
New lowest! 1260 288/1259 0.228752978554
New lowest! 1470 336/1469 0.228727025187
New lowest! 1680 384/1679 0.228707564026
New lowest! 1890 432/1889 0.228692429857
New lowest! 2100 480/2099 0.228680323964
New lowest! 2310 480/2309 0.207882200087
New lowest! 4620 960/4619 0.207837194198
New lowest! 6930 1440/6929 0.207822196565
New lowest! 9240 1920/9239 0.20781469856
New lowest! 11550 2400/11549 0.207810200017
New lowest! 13860 2880/13859 0.207807201097
New lowest! 16170 3360/16169 0.207805059064
New lowest! 18480 3840/18479 0.207803452568
New lowest! 20790 4320/20789 0.207802203088
New lowest! 23100 4800/23099 0.207801203515
New lowest! 25410 5280/25409 0.20780038569
New lowest! 27720 5760/27719 0.207799704174
New lowest! 30030 5760/30029 0.19181457924

One of the first things I noticed was a sort of pattern.  6, 12, 18, 24, 30 - multiples of 6.  Then multiples of 30 - 30, 60, 90, 120, 150, 180, 210.  Then multiples of 210.  And so on.

If you take a look at where the pattern made its biggest changes: 6, 30, 210, 2310...
You can see:
6 * 5 = 30
30 * 7 = 210
210 * 11 = 2310

The pattern changes when d becomes a multiple of the next prime.
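As a quick check on that observation (this wasn't part of my code at the time): for a product of distinct primes, you don't need the gcd loop at all, because the count of numerators that can't be cancelled is just the product of (p - 1) over the primes involved.  This only verifies the pattern above; it isn't the solution.

from fractions import Fraction

# Check the observed jumps without the gcd loop: for d = p1*p2*...*pk (distinct
# primes), the count of numerators in 1..d-1 that can't be cancelled with d is
# (p1-1)*(p2-1)*...*(pk-1).
def R_for_prime_product(primes):
    d = 1
    count = 1
    for p in primes:
        d *= p
        count *= p - 1
    return d, Fraction(count, d - 1)

for k in range(2, 6):
    d, r = R_for_prime_product([2, 3, 5, 7, 11][:k])
    print d, r, float(r)    # matches the 6, 30, 210, 2310 rows above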

So, with the information I have now, I can try to find a solution.  At least, I can try narrowing down where I'll find the answer.

2*3*5*7*11*13*17*19 = 9699690

R(9699690) = 1658880/9699689
And that's around 0.171024.
And that took around 16 seconds.

So, as it stands, this path will not get me to an answer in under a minute.  It does get me closer to the answer, and it does give me a better idea of just how large the answer is.

9699690 * 23 = 223092870

R(223092870) = ?
I get a MemoryError when I try to run that.

So I have two problems here.
First, the algorithm is too slow.
Second, it can't handle large enough numbers.

Next time:  MemoryErrors, for loops, the Sieve of Eratosthenes!

Monday, October 13, 2014

Archive: Google AI Challenge! (C#)

Note:  This used to be on my old blog.  That went down, but I still had the file, so here it is!

The (February 2010) Google AI Challenge was a programming contest held from 2/4/2010 - 2/26/2010.
I finished 28th out of 708 entrants. I wrote it in C#, because my most likely next job will have me working at Microsoft. Plus I wanted to try to win, so I didn't want to experiment with a language I've never studied (Python) or a language I've never been particularly good with (Lisp). I toyed with doing it in C++, but I'm more comfortable with C# right now.
My code is available (zipped up) here:

The Game


The challenge was to create an AI that would play the Tron Lightcycles game, also known as Snakes or Worms or Snafu.
In it, the two players begin on a rectangular board (sometimes with obstacles inside it). Simultaneously, each player moves one square in an available direction (N, E, S, or W), and the position the player vacates is replaced with a wall. The first player to hit a wall loses, but if both players hit a wall (or each other) at the same time, the game is a tie.

The Algorithm


My algorithm seemed to me to be pretty straightforward and basic, right out of my college AI class. In fact, I had some known issues with it, and there were places where I figured it could be rewritten to be more efficient, whether that meant a simple code cleanup or replacing a wasteful algorithm. Still, 680 participants ended up finishing behind me, so I guess I did enough.

Evaluation


My algorithm began with the evaluation code. Basically, this took the "current" state of the board, and gave an integer score based on how well my AI was doing.
In what I later learned is called a Voronoi heuristic, I went through each square of the board, giving my AI 100 points if it was closer to the square than the opponent, and subtracting 100 points if the opponent was closer. Positions that were walls went unscored.
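My entry was C#, but the idea fits in a few lines of Python. This is an illustrative sketch, not my contest code, and it assumes a simple board representation: a set of wall cells plus a (row, column) position for each player.

from collections import deque

# Illustrative Python sketch of the Voronoi-style evaluation (the real thing was C#).
def distances(walls, start, rows, cols):
    # Breadth-first search distances from start to every reachable open square.
    dist = {start: 0}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for cell in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            nr, nc = cell
            if 0 <= nr < rows and 0 <= nc < cols and cell not in walls and cell not in dist:
                dist[cell] = dist[(r, c)] + 1
                queue.append(cell)
    return dist

def evaluate(walls, me, opponent, rows, cols):
    # +100 for every open square I reach first, -100 for every one the opponent reaches first.
    mine = distances(walls, me, rows, cols)
    theirs = distances(walls, opponent, rows, cols)
    score = 0
    for r in range(rows):
        for c in range(cols):
            cell = (r, c)
            if cell in walls or cell in (me, opponent):
                continue
            d1, d2 = mine.get(cell), theirs.get(cell)
            if d1 is not None and (d2 is None or d1 < d2):
                score += 100
            elif d2 is not None and (d1 is None or d2 < d1):
                score -= 100
    return score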
This was not the best evaluation function. I knew early on there was an issue on certain boards where a bottleneck could easily separate some available positions from others.

#########
#   #   #
# 1##   #
#       #
#   #   #
#   #   #
#   #   #
#   #   #
#   #   #
#   #2  #
#   #   #
#   #   #
#########


In this example, player 2 isn't really in a particularly bad position. But my AI evaluates this board as being a great position for player 1. Player 1 is closer to all of the spots on the left side of the board, and to some of the spots on the right side (the upper ones), so player 1 is judged to have an advantage. Unfortunately for my AI, player 1's position isn't that strong - it's much closer to a tie. Not recognizing that this board is roughly even for players 1 and 2 was my AI's biggest issue.
Here's an example of a game that was lost because of this issue: http://csclub.uwaterloo.ca/contest/visualizer.php?game_id=4073327
I never got around to solving this problem before the contest ended, but apparently the #1 finisher, Andy Sloane, came up with an idea at 3am on the final day of the contest.

Lookahead


For my lookahead, I used a recursive algorithm, a variant of Negamax with alpha-beta pruning.
Basically this meant that for each possible move my AI could make, it would call my NegamaxAB function (sending the function a board that included the move made), and whichever move returned the best score would be the move to make.
The NegamaxAB function would first check to see if the board it was given was a cutoff node (time ran out, or someone crashed, or the maximum depth to search had been reached). If so, it would evaluate that board and return it.
If it wasn't a cutoff node, the Negamax would generate all possible moves for the player to move, and then send those to the NegamaxAB function, looking for the best move (highest score). Because I was also performing alpha-beta pruning on the algorithm, I kept track of values for alpha (the minimum score for the maximizing player) and beta (the maximum score for the minimizing player). In each call to NegaMax, the positions of alpha and beta are switched, and the evaluation score is negated, allowing for the single Negamax function to put itself in the shoes of each player when it is being run for that player.
Because Negamax was designed for turn-based two player games, I made the design decision to evaluate the board like a turn-based two player game. My player would go first, and the opponent would go second. So my algorithm calculated my opponent's turn using my move as data; basically, I assumed my opponent knew exactly what I would do. This made it a little easier to port the Negamax algorithm to my code, and also made my AI's decision a bit more difficult than it actually was.
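Again in illustrative Python rather than the original C# (move generation and the board type are abstracted away as callbacks here), the shape of the search was roughly:

# Illustrative sketch of Negamax with alpha-beta pruning (not my C# contest code).
# children(state) generates the boards after each legal move for the player to
# move; score(state) evaluates a board from that player's point of view.
def NegamaxAB(state, depth, alpha, beta, children, score):
    moves = children(state)
    if depth == 0 or not moves:
        # Cutoff node: just evaluate the board.
        return score(state)
    best = float("-inf")
    for child in moves:
        # Swap and negate alpha/beta so the one function plays both sides.
        value = -NegamaxAB(child, depth - 1, -beta, -alpha, children, score)
        best = max(best, value)
        alpha = max(alpha, value)
        if alpha >= beta:
            break    # alpha-beta cutoff: the opponent won't allow this line
    return best

def best_move(state, depth, children, score):
    # Whichever immediate move returns the best (negated) score is the move to make.
    inf = float("inf")
    return max(children(state),
               key=lambda child: -NegamaxAB(child, depth - 1, -inf, inf, children, score))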

The Time Limit


Under the contest rules, each move had a one second time limit. I handled this using an iterative deepening search.
I started by performing a search 2 moves ahead (one for me, one for my opponent). That search ended with a "best move".
If there was still time, I performed a brand new search 4 moves ahead. If I didn't run out of time in the middle of this new search, the move returned would become the new "best move".
I continued doing this, adding 2 levels of lookahead at each iteration, until I ran out of time. I added two levels of lookahead to simulate both players making a move.
The best move returned from the most recent completed search was the move I made.
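Sketched in Python (the real thing was C# and also abandoned a search partway through when the clock ran out; this simplified version only checks the clock between searches):

import time

# Iterative deepening driver (illustrative sketch, not my C# contest code).
# best_move is the negamax wrapper sketched above.
def choose_move(state, time_limit, children, score):
    deadline = time.time() + time_limit
    best = None
    depth = 2                    # one move for me, one for the opponent
    while time.time() < deadline:
        # Only a fully completed search replaces the current best move.
        best = best_move(state, depth, children, score)
        depth += 2               # add another full round of moves each iteration
    return best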

Filling a Separated Board


Often, my AI would find itself separated from the opponent, with the two bots walled off in separate zones.

#########
#   #   #
#   #   #
#  1#   #
#   #   #
#   #   #
#   #2  #
#   #   #
#   #   #
#########

In this case, my AI would switch to fill mode.
My intent was to make this a breadth first search, where each node was evaluated by the number of positions accessible to it.
I ended up wastefully reusing my iterative deepening code instead, and didn't get to fixing it before I ran out of time.
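The breadth-first fill I had in mind would have looked something like this (an illustrative Python sketch of the intended version, reusing the distances() helper from the evaluation sketch above - not the code that actually ran):

# Fill mode as intended (a sketch): when the bots are walled off from each other,
# score each candidate move by how many squares remain reachable after making it.
def fill_move(walls, me, rows, cols):
    best_cell, best_area = None, -1
    r, c = me
    for cell in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
        nr, nc = cell
        if not (0 <= nr < rows and 0 <= nc < cols) or cell in walls:
            continue
        new_walls = set(walls) | {me}          # the square I vacate becomes a wall
        area = len(distances(new_walls, cell, rows, cols))
        if area > best_area:
            best_cell, best_area = cell, area
    return best_cell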
I did make one improvement. Sometimes my AI would find itself staying away from small sections it couldn't fill. If there was more than one of these sections, it would put all of them off until it only had the option of even partially filling one of them.
Illustrated in this game: http://csclub.uwaterloo.ca/contest/visualizer.php?game_id=4041576
In this game, my AI finds itself staying away from the two dead ends, until it finds itself forced to go to one of them. By this time, though, trying to stay away from the one-square dead ends made my AI create two multiple-square dead ends. It chose one, leaving many squares open on the board. If the fill had been better, my AI would have won that game.
So to try to fix it, I modified the evaluation to not include dead ends beyond the first. That way, my AI would allow itself to block off a dead end rather than postpone that section of the board until later, which could be deadly if there were multiple dead ends on the board. It was a definite improvement.

The Competition


I don't know how talented the field was. I assume it was pretty good. Granted, not all of the 708 entrants were particularly good or tried very hard, but I'm still assuming they were pretty good. First off, 302 of the participants uploaded code to the server more than twenty times. A code upload usually only comes after making modifications to the AI, so those 302 participants seemed really motivated. Secondly, and I'm just guessing on this, it takes a certain type of motivated individual to participate in a contest like this. I would guess that the majority of participants were above average, especially the 302. Why join otherwise?
I'm sure there were plenty of participants who didn't have the time to invest in trying to win, but the 302 made a good effort. I estimate I submitted about 30 times. I had 14 distinct versions of my AI, and 24 unique submissions. Sometimes I resubmitted old code to get a better view of how my AI was doing.

Summary


This was one fun challenge. I enjoyed the time I put into coding up my AI, and it was icing on the cake that I managed to perform pretty well (28th out of 708, in case you don't remember). It was exciting to see how my AI fared on the leaderboard, and how my improvements were reflected on the preliminary rankings. I coded late into the night sometimes, and I sometimes woke up early to get in some coding before work. I did it because I loved it. Next time, maybe I'll win.

Wednesday, May 21, 2014

Launch!

Welcome to my new blog, rising from the scattered ashes of my old blog!

My old blog has been down for a few years.  I maintained it myself, something went wrong during an upgrade, and it never came back up.  As much as I like the idea of taking care of these things myself, I decided I didn't want that hassle.

And so, I'm letting the fine people at Blogger handle it for me.  I realize this means that I don't technically own my data, but I don't expect to need to own my data anyway.  This is more of a semiprofessional blog than a personal one.  If you want to see all of my public personal bloggings (not that there's much of it), you can go find me on Facebook.  Here, you'll see more of a public professional persona.

In coordination with my new professional public persona here, I've also created a new public github repository here:  https://github.com/EdwardTonai/public

Pretty much everything that goes up there will be commented on here.

Let's go!