Archive for January, 2010

Computer Science and Empiricism

In a talk given at UW today, Alfred Spector, Google’s VP of Research and Special Initiatives, made a point that I hadn’t thought about before: computer science is much more empirical today than it was when he was a graduate student 30 years ago.

I’m currently reading Logicomix, a graphic novel with a twofold role as a biography of Bertrand Russell and a concise history of the quest for mathematical certainty that took place in the early 20th century. Reading it alongside a courseload of discrete math and theory of computation classes has me knee-deep in the mathematical foundations of computer science. While these foundations are valid and necessary for a historical appreciation of the field, they don’t always lend themselves well to contemporary issues faced in industry and academia.

Spector’s talk reinforced the image of Google as grand archiver and distributor of the consolidated sum of human knowledge, a role not always well served by a traditional approach. In recent years, we’ve seen a rise in parallel and probabilistic approaches to emerging problems: MapReduce/Hadoop, machine learning, Bayesian this, Markov that. The astronomical amounts of data that companies like Google have to deal with open the door to statistical methods.

Computer science deals with more measurable quantities than it used to. For example, as networks and systems grow larger, small margins of error or delay become more readily measurable. When the entire world is your testbed, you have to measure and test every aspect of the systems you build. In this sense, CS is becoming an increasingly empirical science.

I only wish this were emphasized more often in classes, rather than by visiting guest lecturers.

Geolocation API for distributed computing research

Last quarter, I quit my web development job at the UW Clinical Trial Center in order to pursue research within UW’s CSE department. As a starter project for a distributed computing research effort called Seattle, I put together a simple geolocation library that uses a Python library called pygeoip to look up location data for hostnames and IP addresses.

The first step was to set up an XML-RPC server to serve remote calls to the pygeoip API. This was fairly easy to do using Python’s SimpleXMLRPCServer class:

from SimpleXMLRPCServer import SimpleXMLRPCServer
import pygeoip
...
# Create server
server = SimpleXMLRPCServer((ip, port), allow_none=True)

The location-lookup methods within the pygeoip library must be registered for use via the XML-RPC server. We first initialize a GeoIP object, passing it the filename of a valid binary GeoIP database. Then the GeoIP object is passed to the XML-RPC server’s register_instance method to expose its methods for remote execution:

# Initialize and register GeoIP object
gi = pygeoip.GeoIP(geoipdb_filename)
server.register_instance(gi)

# Run the server's main loop
server.serve_forever()
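To give a feel for the client side, here is a minimal sketch of a remote call against a server set up this way. It is not the project's actual code. It is written for Python 3, where the server class lives in xmlrpc.server and the client in xmlrpc.client, and it assumes a hypothetical FakeGeoIP class with made-up sample data standing in for pygeoip.GeoIP so that it runs without a GeoIP database. The method name record_by_addr is chosen to mirror pygeoip's lookup of the same name.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


class FakeGeoIP:
    """Hypothetical stand-in for pygeoip.GeoIP (assumption, not the real class)."""

    # Made-up sample data keyed by a documentation-only address (RFC 5737).
    _records = {
        "192.0.2.1": {"city": "Seattle", "latitude": 47.61, "longitude": -122.33},
    }

    def record_by_addr(self, addr):
        # Unknown addresses yield None, which is why allow_none=True is set below.
        return self._records.get(addr)


# Port 0 asks the OS for any free port; logRequests=False keeps the output quiet.
server = SimpleXMLRPCServer(("127.0.0.1", 0), allow_none=True, logRequests=False)
server.register_instance(FakeGeoIP())
host, port = server.server_address

# Serve in a background thread so the client can run in the same process.
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: method calls on the proxy are forwarded to the registered instance.
client = ServerProxy("http://%s:%d" % (host, port), allow_none=True)
known = client.record_by_addr("192.0.2.1")
unknown = client.record_by_addr("192.0.2.99")
print(known["city"], known["latitude"], known["longitude"])
print(unknown)

# Stop the serving loop and release the socket.
server.shutdown()
server.server_close()
```

Only public methods of the registered instance are exposed; names beginning with an underscore are not callable over XML-RPC, which is one reason register_instance is a convenient way to publish an existing object.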

The lookup methods of pygeoip can now be called remotely. To demonstrate my project, I wrote a script that fetches the IPs of all nodes in the distributed computing network that you’ve allocated, looks up their lat/long coordinates, and plots them on a Google map. For funsies I used the geolocation API built into Firefox 3.5+ to include a pointer to the user’s current location.

Screenshot of project demo

Hopefully my little library will get some use. Moving into the new year, I’m hoping to increase my involvement in the Seattle project and become more familiar with networking and API design.