The Surveying Problem Part 5: Graph-Tool

Background

As I mentioned in my previous posts, I wanted to use a graph module in Python to implement a depth-first search (DFS) to solve the surveying problem. My first go was with NetworkX, which is a pure Python implementation that offers a lot of Graph theory-based functionality. However, because it’s pure Python and has the added overhead of importing more modules and running more functions, it’s a lot slower than the solution I implemented.

An alternative is the graph-tool package for Python, which offloads a lot of the heavy lifting to C++. According to the graph-tool website, this results in a much faster implementation compared to NetworkX1https://graph-tool.skewed.de/performance. However, it certainly wasn’t a simple swap-out from NetworkX to graph-tool; I couldn’t even get the library to compile! The folks in the AUR2https://aur.archlinux.org/packages/python-graph-tool/ do warn against parallelizing the build process because of memory consumption, but I was running out of RAM even on a single thread. Luckily, the maintainer of graph-tool maintains a docker image for graph-tool3https://hub.docker.com/r/tiagopeixoto/graph-tool, which gave me a great opportunity to learn docker!

Learning Docker

I have known about docker for a few years but assumed it was just some magic version of virtualization, and never had a reason to dig into it in any more detail than that. I initially tried to just grab the graph-tool image and throw it into docker, but I ran into all sorts of problems around adding other Python packages to the image and getting PyCharm to see Python within the image. This all ultimately stemmed from the fact that I fundamentally didn’t understand docker, specifically the differences between images and containers.

In years gone by I would have either just read the manual or the free tutorials, or just tried to hack through and figure it out myself. However, I’ve reached the point in my life where I realize that if I can spend a few dollars on something that will save me a few hours, then that’s money very well spent! I found a course on Udemy4https://www.udemy.com/course/docker-mastery that I managed to get for $20, but I would have spent $100 on it, and it’s been excellent at explaining the concepts in an easy to understand way. I can’t recommend it enough.

I’m not going to go into the details of what I did with docker, but briefly, I created my own image built on the docker image I linked to above with the additional Python packages I needed for my code. Luckily the original image was built on Arch Linux (which I am very familiar with), so it was pretty straightforward to add a RUN line using pacman to add these packages. Using the Python interpreter within the docker container was also straightforward thanks to PyCharm Professional, which I was able to snag for 6 months as part of a Python Humble Bundle earlier this year.

The Code

Bit more pre-amble than previously for this post, so much so that the code almost seems like an afterthought! Anyway, here’s the code:

""" A graph-tools implementation of the surveying problem
find_reservoirs: Determine the number and location of contiguous reservoirs in a grid
"""

import graph_tool as gt
import graph_tool.topology as topology

from tools import get_neighbors

METHOD_NAME = "Graph-Tool Method"


def find_reservoirs(locations):
    """ Uses a graph approach to find how many wells are needed, making the assumption that
    only one well is needed per contiguous field

    locations: Set containing all locations with oil
    """
    locations = {location: idx for idx, location in enumerate(locations)}

    locations_graph = gt.Graph()
    locations_graph.set_directed(False)

    locations_graph.add_vertex(len(locations))
    locations_prop = locations_graph.new_vertex_property("object", locations)

    edge_list = []
    for location in locations:
        neighbor_coords = get_neighbors(location, locations)
        edge_list.extend([(locations[location], locations[neighbor])
                          for neighbor in neighbor_coords])

    locations_graph.add_edge_list(edge_list)

    components, _ = topology.label_components(locations_graph, directed=False)
    wells = dict()
    for vertex, label in enumerate(components.a):
        if label not in wells:
            wells[label] = []
        wells[label].append(locations_prop[vertex])

    return wells.values()

It’s logically identical to the NetworkX approach, first we create the verticies, then we create the edges between the connected vertices, and then we find our connected sub-graphs in our grid.

The thing that struck me most about graph-tool is that it’s extremely unpythonic. I suppose this shouldn’t be surprising considering it’s really just a Python wrapper on top of C++ code, but that wrapper is very very thin. There aren’t really any helper functions, so you have to do a lot of stuff yourself in terms of getting data into the correct format for graph-tool and post-processing it’s results. This means graph-tool code is about 50% longer than the equivalent NetworkX code. I suppose one can argue that if performance is so important to you that you have gone to the effort to implement graph-tool (and all the compilation pain it brings), you’d want complete control over these wrapper functions. Still, it would be nice if that wrapper was just a little bit thicker.

Results

The next step is to review performance between the NetworkX and graph-tool implementations. The results are shown in the table below. The biggest takeaway from this is that graph-tool starts out slower than NetworkX for very small grids, and only becomes more performant beyond 100×100. Even then though, it remains slower than the Recursive implementation.

Grid SizeTime (Recursive)Time (NetworkX)Time (graph-tool)
10×105.205E-05 s1.686E-04 s (x3.24 slower)3.620E-04 s (x6.95 slower)
100×1002.856E-03 s1.007E-02 s (x3.53 slower)5.851E-03 s (x2.05 slower)
1000×10004.628E-01 s2.260E+00 s (x4.88 slower)8.253E-01 s (x1.78 slower)
5000×50001.827E+01 s9.253E+01 s (x5.06 slower)2.722E+01 s (x1.49 slower)

I have a pretty good idea about why this is, and that’s that the actual work being done by the graph library is small compared to the work being done to build the graph. A good way to check is to split the times from above by the time spent building the graph (which is approximately the same between both methods), and the time spent solving the graph.

Grid SizeGraph Build Time (NetworkX)Clustering Time (NetworkX)Graph Build Time (graph-tool)Clustering Time (graph-tool)
10×101.259E-04 s (73%)4.376E-05 s (27%)2.628E-04 s (67%)1.266E-04 s (33%)
100×1001.103E-02 s (81%)2.564E-03 s (19%)7.290E-03 s (82%)1.574E-03 s (18%)
1000×10001.785E+00 s (87%)2.626E-01 s (13%)6.128E-01 s (78%)1.688E-01 s (22%)
5000×50007.293E+01 s (84%)1.397E+01 s (16%)2.105E+01 s (83%)4.224E+00 s (17%)

The broad thing to take away from this table is that my hypothesis from above is correct: the clustering time is a small fraction of the total time taken to solve the problem, and so any time saved here is not going to have a significant impact on overall performance. If I was really interested in optimizing performance I’d be better off re-implementing some of the graph creation logic in something like NumPy, or ultimately just re-implementing the code in a different language altogether.

There are some additional nuances that might be interesting to look into in more detail. One is that the ratio between build time to clustering time seems to trend pretty clearly with the build time becoming the larger fraction at larger grids, but the 1000×1000 size specifically seems to buck this trend. It might be to do with the specific amounts of memory I am using, and that I reach a point where memory allocation starts to slow down and my machine starts swapping. I’m not sure though, add that to the list of things to look into.

Next Steps

I’d really like to dig into matplotlib to try plotting some of these results to see what they look like. I might update this post in a few weeks with a plot showing the some of the tables above. My go-to environment for crude plots is usually Excel, but I’m stuck in Linux working on this stuff, so I should just bite the bullet and learn matplotlib.

In terms of algorithms and implementation, one final thing I can think of to try and drive performance further is to re-implement the initial recursive solution as a stack-based DFS algorithm. As I have mentioned before, Python isn’t suited for recursive implementations because there’s such a huge overhead on function calls. So there’s a good chance I could see some performance improvement here. However, regardless of the implementation, I’m still going to be executing the find_neighbors code the same number of times, which I already know is where a lot of the thinking goes on, so there’s likely a pretty high limit to how much performance can be improved.