The Surveying Problem Part 5: Graph-Tool

Background

As I mentioned in my previous posts, I wanted to use a graph module in Python to implement a depth-first search (DFS) to solve the surveying problem. My first attempt used NetworkX, a pure Python implementation that offers a lot of graph-theory-based functionality. However, because it’s pure Python and carries the added overhead of importing more modules and making more function calls, it’s a lot slower than the recursive solution I implemented originally.

An alternative is the graph-tool package for Python, which offloads a lot of the heavy lifting to C++. According to the graph-tool website, this results in a much faster implementation than NetworkX (https://graph-tool.skewed.de/performance). However, it certainly wasn’t a simple swap-out from NetworkX to graph-tool; I couldn’t even get the library to compile! The folks in the AUR (https://aur.archlinux.org/packages/python-graph-tool/) do warn against parallelizing the build because of memory consumption, but I was running out of RAM even on a single thread. Luckily, the maintainer of graph-tool also maintains a docker image for it (https://hub.docker.com/r/tiagopeixoto/graph-tool), which gave me a great opportunity to learn docker!

Learning Docker

I have known about docker for a few years but assumed it was just some magic version of virtualization, and never had a reason to dig into it in any more detail than that. I initially tried to just grab the graph-tool image and throw it into docker, but I ran into all sorts of problems around adding other Python packages to the image and getting PyCharm to see Python within the image. This all ultimately stemmed from the fact that I fundamentally didn’t understand docker, specifically the differences between images and containers.

In years gone by I would have either read the manual or the free tutorials, or just tried to hack through and figure it out myself. However, I’ve reached the point in my life where I realize that if I can spend a few dollars on something that will save me a few hours, then that’s money very well spent! I found a course on Udemy (https://www.udemy.com/course/docker-mastery) that I managed to get for $20, though I would happily have spent $100 on it; it’s been excellent at explaining the concepts in an easy-to-understand way. I can’t recommend it enough.

I’m not going to go into the details of what I did with docker, but briefly, I created my own image built on the docker image I linked to above with the additional Python packages I needed for my code. Luckily the original image was built on Arch Linux (which I am very familiar with), so it was pretty straightforward to add a RUN line using pacman to add these packages. Using the Python interpreter within the docker container was also straightforward thanks to PyCharm Professional, which I was able to snag for 6 months as part of a Python Humble Bundle earlier this year.
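
For illustration, a Dockerfile along these lines does the job. The FROM line names the image linked above; the package list is made up, so substitute whatever your own code actually imports:

# Build on the maintainer's graph-tool image, which is Arch-based
FROM tiagopeixoto/graph-tool

# Illustrative extras; swap in the packages your own code needs
RUN pacman -Syu --noconfirm python-networkx python-matplotlib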

The Code

There’s a bit more preamble than usual for this post, so much so that the code almost seems like an afterthought! Anyway, here’s the code:

""" A graph-tools implementation of the surveying problem
find_reservoirs: Determine the number and location of contiguous reservoirs in a grid
"""

import graph_tool as gt
import graph_tool.topology as topology

from tools import get_neighbors

METHOD_NAME = "Graph-Tool Method"


def find_reservoirs(locations):
    """ Uses a graph approach to find how many wells are needed, making the assumption that
    only one well is needed per contiguous field

    locations: Set containing all locations with oil
    """
    # Map each location to an integer vertex index
    locations = {location: idx for idx, location in enumerate(locations)}

    locations_graph = gt.Graph()
    locations_graph.set_directed(False)

    # One vertex per location; keep each vertex's coordinates in a property
    # map (iterating the dict yields the locations in vertex-index order)
    locations_graph.add_vertex(len(locations))
    locations_prop = locations_graph.new_vertex_property("object", locations)

    # Connect each location to its neighbors that also contain oil
    edge_list = []
    for location in locations:
        neighbor_coords = get_neighbors(location, locations)
        edge_list.extend([(locations[location], locations[neighbor])
                          for neighbor in neighbor_coords])

    locations_graph.add_edge_list(edge_list)

    # label_components assigns a component label to every vertex; group the
    # original coordinates by that label to recover the individual wells
    components, _ = topology.label_components(locations_graph, directed=False)
    wells = dict()
    for vertex, label in enumerate(components.a):
        wells.setdefault(label, []).append(locations_prop[vertex])

    return wells.values()
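
A minimal usage sketch (the coordinates are made up, and it assumes the get_neighbors helper in tools works as it’s called above):

if __name__ == "__main__":
    # Two contiguous reservoirs: {(0, 0), (0, 1)} and {(5, 5)}
    oil_locations = {(0, 0), (0, 1), (5, 5)}
    for well in find_reservoirs(oil_locations):
        print(well)  # each well is a list of (x, y) tuples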

It’s logically identical to the NetworkX approach: first we create the vertices, then we create the edges between the connected vertices, and then we find the connected sub-graphs in our grid.

The thing that struck me most about graph-tool is that it’s extremely unpythonic. I suppose this shouldn’t be surprising considering it’s really just a Python wrapper on top of C++ code, but that wrapper is very, very thin. There aren’t really any helper functions, so you have to do a lot of the work yourself, both getting data into the correct format for graph-tool and post-processing its results. This means the graph-tool code is about 50% longer than the equivalent NetworkX code. I suppose one could argue that if performance matters enough that you’ve gone to the effort of using graph-tool (and all the compilation pain it brings), you’d want complete control over these wrapper functions. Still, it would be nice if that wrapper were just a little bit thicker.

Results

The next step is to compare performance between the NetworkX and graph-tool implementations. The results are shown in the table below. The biggest takeaway is that graph-tool starts out slower than NetworkX on very small grids and overtakes it somewhere before the 100×100 case. Even then, though, it remains slower than the recursive implementation.

Grid Size    Time (Recursive)   Time (NetworkX)              Time (graph-tool)
10×10        5.205E-05 s        1.686E-04 s (3.24× slower)   3.620E-04 s (6.95× slower)
100×100      2.856E-03 s        1.007E-02 s (3.53× slower)   5.851E-03 s (2.05× slower)
1000×1000    4.628E-01 s        2.260E+00 s (4.88× slower)   8.253E-01 s (1.78× slower)
5000×5000    1.827E+01 s        9.253E+01 s (5.06× slower)   2.722E+01 s (1.49× slower)

I have a pretty good idea why: the actual work being done by the graph library is small compared to the work needed to build the graph in the first place. A good way to check is to split the times from above into the time spent building the graph (which is approximately the same between both methods) and the time spent clustering it.
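
The instrumentation for that split is just a sketch like the following (build_locations_graph is a hypothetical helper wrapping the graph-construction code from above, and topology is the graph_tool.topology import used earlier):

import time

import graph_tool.topology as topology

def timed_find_reservoirs(locations):
    """ Sketch of the timing split; not the exact harness I used """
    start = time.perf_counter()
    # Hypothetical helper wrapping the vertex/property/edge construction above
    locations_graph = build_locations_graph(locations)
    build_time = time.perf_counter() - start

    start = time.perf_counter()
    components, _ = topology.label_components(locations_graph, directed=False)
    clustering_time = time.perf_counter() - start

    return build_time, clustering_time

The results of the split are shown below.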

Grid Size    Graph Build Time (NetworkX)   Clustering Time (NetworkX)   Graph Build Time (graph-tool)   Clustering Time (graph-tool)
10×10        1.259E-04 s (73%)             4.376E-05 s (27%)            2.628E-04 s (67%)               1.266E-04 s (33%)
100×100      1.103E-02 s (81%)             2.564E-03 s (19%)            7.290E-03 s (82%)               1.574E-03 s (18%)
1000×1000    1.785E+00 s (87%)             2.626E-01 s (13%)            6.128E-01 s (78%)               1.688E-01 s (22%)
5000×5000    7.293E+01 s (84%)             1.397E+01 s (16%)            2.105E+01 s (83%)               4.224E+00 s (17%)

The broad takeaway from this table is that my hypothesis above holds: the clustering time is a small fraction of the total time taken to solve the problem, so any time saved there is not going to have a significant impact on overall performance. If I were really interested in optimizing performance, I’d be better off re-implementing some of the graph-creation logic in something like NumPy, or ultimately just re-implementing the code in a different language altogether.
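
To give a flavor of what the NumPy route might look like, here’s an untested sketch of vectorized edge building. It assumes every coordinate lies in a width × width grid, and build_edges_numpy is just my name for it here; none of this is profiled or from the original project:

import numpy as np

def build_edges_numpy(locations, width):
    """ Untested sketch of vectorized edge building; assumes 0 <= y < width """
    coords = np.array(sorted(locations))             # (N, 2), lexicographically sorted
    codes = coords[:, 0] * width + coords[:, 1]      # encode (x, y) as one int; still sorted
    edges = []
    # Only 'forward' offsets are needed; each edge is then found exactly once
    for dx, dy in [(0, 1), (1, -1), (1, 0), (1, 1)]:
        ys = coords[:, 1] + dy
        in_grid = (ys >= 0) & (ys < width)           # guard the encoding at grid edges
        neighbor_codes = (coords[:, 0] + dx) * width + ys
        exists = in_grid & np.isin(neighbor_codes, codes)
        sources = np.nonzero(exists)[0]
        targets = np.searchsorted(codes, neighbor_codes[exists])
        edges.append(np.column_stack([sources, targets]))
    return np.concatenate(edges)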

There are some additional nuances that might be worth looking into in more detail. One is that the ratio of build time to clustering time trends fairly clearly toward build time dominating on larger grids, but the 1000×1000 case seems to buck that trend. It might be related to the specific amount of memory in use: perhaps I reach a point where memory allocation starts to slow down and my machine starts swapping. I’m not sure though; add that to the list of things to look into.

Next Steps

I’d really like to dig into matplotlib to try plotting some of these results to see what they look like. I might update this post in a few weeks with a plot of some of the data in the tables above. My go-to environment for crude plots is usually Excel, but I’m working on all of this in Linux, so I should just bite the bullet and learn matplotlib.
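
A first pass at that plot might look something like the following sketch. The numbers are taken from the first table above, and log-log axes make sense because both grid size and runtime span several orders of magnitude:

import matplotlib.pyplot as plt

# Timings from the first table above
grid_sizes = [10, 100, 1000, 5000]
recursive = [5.205e-05, 2.856e-03, 4.628e-01, 1.827e+01]
networkx = [1.686e-04, 1.007e-02, 2.260e+00, 9.253e+01]
graph_tool = [3.620e-04, 5.851e-03, 8.253e-01, 2.722e+01]

plt.loglog(grid_sizes, recursive, marker="o", label="Recursive")
plt.loglog(grid_sizes, networkx, marker="o", label="NetworkX")
plt.loglog(grid_sizes, graph_tool, marker="o", label="graph-tool")
plt.xlabel("Grid edge length")
plt.ylabel("Time (s)")
plt.legend()
plt.show()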

In terms of algorithms and implementation, one final thing I can think of to drive performance further is to re-implement the initial recursive solution as a stack-based DFS. As I have mentioned before, Python isn’t well suited to recursive implementations because function calls carry such a large overhead, so there’s a good chance of a performance improvement here. However, regardless of the implementation, I’m still going to execute the get_neighbors code the same number of times, and I already know that’s where a lot of the time goes, so there’s likely a fairly hard ceiling on how much performance can improve.
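
For reference, here’s an untested sketch of what that stack-based version might look like, reusing the get_neighbors helper from tools:

from tools import get_neighbors

def find_reservoirs_stack(locations):
    """ Untested sketch: iterative DFS with an explicit stack instead of recursion """
    visited = set()
    wells = []
    for location in locations:
        if location in visited:
            continue
        # Start a new well and flood out from this location
        well = set()
        stack = [location]
        while stack:
            current = stack.pop()
            if current in visited:
                continue
            visited.add(current)
            well.add(current)
            stack.extend(get_neighbors(current, locations))
        wells.append(well)
    return wells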

The Surveying Problem: XSLT Bonus!

Update

So I started looking at graph-tool, but unfortunately I fell at the first hurdle; I couldn’t even get it to compile. It kept running out of memory at the same point; I suppose I should have gone for more than 16 GB when I upgraded my computer earlier this year.

The solution is to use a docker image that is helpfully provided by the graph-tool maintainer, but in order to do that, I need to learn docker. I’m making some great progress, but it might be another week or so before I can get my Python code running against containerized Python. So while I do that, here’s a post I wrote a few weeks ago but was saving for a rainy day…

Background

Shortly after the interview where I came up with the DFS solution to the surveying problem, I had an amusing thought. The panel said I could use any language I wanted, so I asked myself: what would they have thought if I had decided to try and solve the problem in XSLT (https://en.wikipedia.org/wiki/XSLT)?

Anyone who knows me knows I have a strange fondness for XSLT. That fondness goes back to my first job after my Ph.D., where XSLT was used extensively to transform XML documents. The majority of my colleagues hated the language, since it has a trifecta of attributes working against it:

  1. It’s functional, as opposed to imperative
  2. It’s very rarely used
  3. It is itself written in XML

I personally always enjoyed writing XSLT; I saw the need to learn a functional programming language as an interesting challenge, and once I got used to thinking in a functional way, the code came naturally. The fact that it’s written in XML wasn’t too much of an issue thanks to a decent IDE; the only real issue was that, with a (comparatively) small user community, it was often up to me to figure out the best way to do things.

Another important thing to know about XSLT, besides the fact that it’s a functional language, is that it’s a transformation language, i.e. it works by transforming an input document into an output document. An XSLT file can’t do anything by itself; it must be applied to something.
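
To make that concrete, here’s a minimal sketch of applying a stylesheet from Python with lxml, which wraps libxslt and supports the EXSLT extensions used below (the file names are made up):

from lxml import etree

# Load the input document and the stylesheet (hypothetical file names)
grid = etree.parse("grid.xml")
transform = etree.XSLT(etree.parse("surveying.xslt"))

# Apply the transformation and print the resulting document
result = transform(grid)
print(str(result))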

The Solution

Anyway, enough background. I’ll now go through the code I wrote to solve this problem in XSLT. As I mentioned above, an XSLT file must be applied to an input document, so here’s my (abridged) input document:

<grid>
  <el>
    <x>0</x>
    <y>16</y>
  </el>
  <el>
    <x>0</x>
    <y>17</y>
  </el>
  <el>
    <x>0</x>
    <y>19</y>
  </el>
  <el>
    <x>1</x>
    <y>3</y>
  </el>
  <el>
    <x>1</x>
    <y>4</y>
  </el>
</grid>

It’s a pretty simple document format, and it only took some trivial modifications to the build_graph() code from a previous article to spit out XML instead of a Python set object. Each <el> element represents a location in the set and is defined by its <x> and <y> child elements.
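
That modification might look something like this sketch (grid_to_xml is a name I’m using here for illustration, not the actual code from that article):

def grid_to_xml(locations):
    """ Sketch of the XML-emitting tweak; takes the set produced by build_graph() """
    lines = ["<grid>"]
    for x, y in sorted(locations):
        lines.append(f"  <el>\n    <x>{x}</x>\n    <y>{y}</y>\n  </el>")
    lines.append("</grid>")
    return "\n".join(lines)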

Next is the XSLT itself, starting with the main stylesheet:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:exsl="http://exslt.org/common"
    exclude-result-prefixes="xs exsl"
    version="1.0">
    
   <xsl:include href="firstPass.xslt" />
   <xsl:include href="secondPass.xslt" />
    
    <xsl:output method="xml" indent="yes"/>
    
    <xsl:template match="grid">
        <!-- Actually implement the logic -->
        <xsl:variable name="firstPass">
            <xsl:apply-templates select="el" mode="first" />
        </xsl:variable>
        
        <!-- Strip duplicate locations and empty reservoirs -->
        <xsl:variable name="reservoirs">
            <xsl:apply-templates select="exsl:node-set($firstPass)/reservoir" mode="secondPass"/>
        </xsl:variable>
                
        <!-- Generate summary results -->
        <xsl:element name="results">
            <xsl:element name="numberOfReservoirs">
                <xsl:value-of select="count(exsl:node-set($reservoirs)/reservoir)" />
            </xsl:element>
            <xsl:element name="reservoirs">
                <xsl:apply-templates select="exsl:node-set($reservoirs)/reservoir" mode="results" />
            </xsl:element>
        </xsl:element>
    </xsl:template>
    
    <!-- Results wrapper template -->
    <xsl:template match="reservoir" mode="results">
        <xsl:copy>
            <xsl:attribute name="size">
                <xsl:value-of select="count(location)" />
            </xsl:attribute>
            <xsl:copy-of select="location"/>
        </xsl:copy>
    </xsl:template>
    
</xsl:stylesheet>
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:exsl="http://exslt.org/common"
    exclude-result-prefixes="xs exsl"
    version="1.0">
    
    <!-- First time through we need to create the reservoir element -->
    <xsl:template match="el" mode="first">        
        <xsl:element name="reservoir">

                <xsl:element name="location">
                    <xsl:copy-of select="*" />
                </xsl:element>
                
                <xsl:variable name="this_x" select="number(x/text())"/>
                <xsl:variable name="this_y" select="number(y/text())"/>
                
                <!-- Register that we have visited this element before -->
                <xsl:variable name="visited">
                    <xsl:copy-of select="." />
                </xsl:variable>
                
                <!-- We rely on the grid being sorted in increasing x and y. -->
                <!-- The first time through, we will only find a neighbor with increasing or same x -->
                <xsl:apply-templates select="following::el[
                    (x = $this_x and y = $this_y + 1) or
                    (x = $this_x + 1 and y = $this_y - 1) or
                    (x = $this_x + 1 and y = $this_y) or
                    (x = $this_x + 1 and y = $this_y + 1)]" mode="recursive">
                    <xsl:with-param name="visited" select="$visited" />
                </xsl:apply-templates>
            
        </xsl:element>
    </xsl:template>
    
    <!-- Subsequent times through we don't, but we do need to check we haven't been here before -->
    <xsl:template match="el" mode="recursive">
        <xsl:param name="visited" />
        
        <xsl:variable name="currentElement" select="." />
        
        <!-- Check if we have been here before, which stops infinite recursion -->
        <xsl:if test="not(exsl:node-set($visited)/el[x = $currentElement/x][y = $currentElement/y])">
            
            <xsl:variable name="this_x" select="number(x/text())"/>
            <xsl:variable name="this_y" select="number(y/text())"/>
            
            <!-- Add the current location to the list of visited locations -->
            <xsl:variable name="newVisited">
                <xsl:copy-of select="exsl:node-set($visited)/el" />
                <xsl:copy-of select="$currentElement" />
            </xsl:variable>
            
            <xsl:element name="location">
                <xsl:copy-of select="*" />
            </xsl:element>
            
            <!-- Apply this template over all neighbor locations -->
            <!-- Here we might need to go 'up' and 'left', so we need to check negative offsets -->
            <xsl:apply-templates select="../el[
                (x = $this_x - 1 and y = $this_y - 1) or
                (x = $this_x - 1 and y = $this_y) or
                (x = $this_x - 1 and y = $this_y + 1) or
                (x = $this_x and y = $this_y - 1) or
                (x = $this_x and y = $this_y + 1) or
                (x = $this_x + 1 and y = $this_y - 1) or
                (x = $this_x + 1 and y = $this_y) or
                (x = $this_x + 1 and y = $this_y + 1)]" mode="recursive">
                <xsl:with-param name="visited" select="$newVisited" />
            </xsl:apply-templates>
        </xsl:if>
    </xsl:template>
    
</xsl:stylesheet>
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="1.0">
    
    <!-- We need a separator here, otherwise we can't tell the difference between e.g. 1,20 and 12,0 -->
    <xsl:key name="locations-by-coords" match="location" use="concat(x, ',', y)" />
    
    <!-- Only match reservoirs that have at least one location after stripping duplicates -->
    <xsl:template match="reservoir[count(location[generate-id() = generate-id(key('locations-by-coords', concat(x, ',', y))[1])]) != 0]" mode="secondPass">
        <xsl:copy>
            <!-- Iterate over the unique locations in a reservoir -->
            <xsl:apply-templates select="location[generate-id() = generate-id(key('locations-by-coords', concat(x, ',', y))[1])]" mode="secondPass" />
        </xsl:copy>
    </xsl:template>
    
    <!-- Actually output the location -->
    <xsl:template match="location" mode="secondPass">
        <xsl:copy-of select="." />
    </xsl:template>
    
</xsl:stylesheet>

You can see one of the things I hate about XSLT: how wordy and bloated the source is! In case you didn’t notice, it’s split across three files just to avoid hundreds and hundreds of lines of scrolling.

The code is executed in two parts. The first is extremely similar to the Python DFS solution, although because XSLT is a functional language and all variables are immutable, there’s no way of maintaining a global record of which locations have already been visited. Some optimization is possible by ensuring the locations are visited in a well-defined order using the <xsl:sort /> element, but that doesn’t eliminate visiting the same location twice. Consider the following situation:

Even if the list of locations is sorted by increasing x and then increasing y, or even by increasing distance from the origin, you still wouldn’t necessarily identify (3,0) as being part of the well unless you go all the way down to (3,2) and then back up to (3,0). At that point, you may already have started a new well at (3,0), which is a duplicate of the first well. The global ‘visited’ set in the Python implementation solves this problem, but that’s not possible in a functional implementation.

The solution here is to add a second pass, which does two things. First, it removes duplicate nodes within the same well, which arise when there are multiple possible paths from one active location to another. Second, it removes all reservoirs that are a subset of another reservoir. Through some clever sorting, we can ensure that the full reservoir always appears before any of its subsets, which means we can perform this filtering with a global filter on unique locations. The code implements Muenchian grouping (https://en.wikipedia.org/wiki/XSLT/Muenchian_grouping) to enable this, which is a neat way of removing duplicate nodes based on arbitrary but definable rules for what constitutes a duplicate within an element.

The final step is then a template that wraps the reservoir results with a summary of how many reservoirs were found in total and adds the size of each reservoir to each reservoir element. An excerpt of the output is given below.

<?xml version="1.0" encoding="UTF-8"?>
<results>
   <numberOfReservoirs>21</numberOfReservoirs>
   <reservoirs>
      <reservoir size="3">
         <location>
            <x>0</x>
            <y>16</y>
         </location>
         <location>
            <x>0</x>
            <y>17</y>
         </location>
         <location>
            <x>1</x>
            <y>17</y>
         </location>
      </reservoir>
      <reservoir size="2">
         <location>
            <x>0</x>
            <y>19</y>
         </location>
         <location>
            <x>1</x>
            <y>20</y>
         </location>
      </reservoir>

Thoughts

First things first: I know this is a terrible solution. It’s inefficient in that it requires multiple passes to generate the right output, and it fails completely on even modestly-sized grids (anything above 50×50 takes too long to complete). It might be possible to make some improvements, but it’s not something I really care to do.

I realize that XSLT was never designed to perform this kind of work, but then I thought to myself: how would you go about any kind of efficient BFS implementation without being able to rely on a global ‘visited’ tracker to avoid hitting the same location twice? Is it even possible to implement an O(n) BFS algorithm in a functional language?

The answer seems to be yes! A question on Stack Overflow (https://stackoverflow.com/questions/30464163/functional-breadth-first-search) has a solution in Haskell that seems to do exactly what’s required. I have no experience writing Haskell, but it looks like an extremely interesting approach. If I get time, it’s something I’ll play around with to see how it works and whether it makes sense for a problem like this!

Until then, back to docker…

The Surveying Problem Part 4: Graph Solution

As I mentioned in my previous post, the recursive solution to the Surveying Problem is pretty performant, but I wanted to see if a graph-based solution would be faster or easier to implement.

I started with the NetworkX package (https://networkx.github.io/), which I have used before in some other projects. In my experience it’s reasonably quick, easy to install, and released under a BSD license, which is important when you’re writing in a commercial environment. Without further ado, here’s the code:

import networkx as nx

METHOD_NAME = "NetworkX Method"

def get_neighbors(node, graph):
    """ Returns a list of neighbor location objects

     node: A networkX node. The name is a 2-tuple representing the x and y coordinates of an element
     graph: A networkX graph of locations
     """
    x_coord, y_coord = node
    x_offsets = [-1, 0, 1]
    y_offsets = [-1, 0, 1]

    neighbors = list()
    for x_offset in x_offsets:
        for y_offset in y_offsets:
            if x_offset == 0 and y_offset == 0:
                continue

            coord = (x_coord + x_offset, y_coord + y_offset)
            if coord in graph:
                neighbors.append(coord)
    return neighbors


def find_reservoirs(grid):
    """ Uses a graph approach to find how many wells are needed, making the assumption that
    only one well is needed per contiguous field

    grid: Set containing all locations with oil
    """

    # One node per location that contains oil
    locations_graph = nx.Graph()
    locations_graph.add_nodes_from(grid)

    # Connect each node to any of its eight neighbors that are also in the graph
    edges_to_create = set()
    for node in locations_graph:
        neighbors = get_neighbors(node, locations_graph)
        edges_to_create.update((node, neighbor) for neighbor in neighbors)

    locations_graph.add_edges_from(edges_to_create)

    # Each connected component corresponds to one contiguous reservoir
    connected_subgraphs = nx.connected_components(locations_graph)
    wells = [set(subgraph) for subgraph in connected_subgraphs]

    return wells
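
To show the interface, a minimal usage sketch (the coordinates are made up):

# Two contiguous reservoirs: {(0, 0), (0, 1)} and {(5, 5)}
oil_locations = {(0, 0), (0, 1), (5, 5)}
wells = find_reservoirs(oil_locations)
print(len(wells))  # 2
print(wells)       # e.g. [{(0, 0), (0, 1)}, {(5, 5)}] (component order isn't guaranteed)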

Pros: Simplicity

First, the pros of this approach. The code is conceptually a lot simpler than the recursive version:

  1. Create nodes for every location that has oil
  2. Connect nodes together if they are adjacent
  3. Call nx.connected_components() on our graph
    • This generates the set of connected subgraphs
    • i.e. the subgraphs in which every node is reachable from every other node
  4. List comprehension to get the correct output summary

Another pro is that a lot of the algorithmic design has been done by the NetworkX folks, as opposed to by me. As far as an interview question goes that’s obviously a bad thing since the point of the exercise was to test my problem-solving abilities, not to have an encyclopedic knowledge of which libraries are available. However, in a professional environment, I think it’s always better to use a library and let someone else do the thinking for you.

Cons: Performance

As for the cons, performance is significantly worse, almost by an order of magnitude. Here’s a summary of the comparison between the recursive solution and the NetworkX-based graph solution:

Grid Size    Time (Recursive)   Time (NetworkX)   Time (Recursive / NetworkX)
10×10        4.603E-05 s        1.799E-04 s       25.6%
100×100      2.369E-03 s        1.634E-02 s       14.5%
1000×1000    3.933E-01 s        2.139E+00 s       18.4%
5000×5000    1.649E+01 s        8.869E+01 s       18.5%

The obvious question to ask is, why? To answer this question, I looked at the source for the nx.connected_components() method.

Investigating the Slowdown

Behind the scenes, NetworkX is doing something very similar to what I implemented previously (https://github.com/networkx/networkx/blob/master/networkx/algorithms/components/connected.py). nx.connected_components() implements a breadth-first search (https://en.wikipedia.org/wiki/Breadth-first_search), whereas my recursive code implements a depth-first search (https://en.wikipedia.org/wiki/Depth-first_search). A BFS might make sense if we only wanted to establish the existence of each well, but since we need every well’s size and complete location, the choice between BFS and DFS really doesn’t make a difference; the optimal solution simply has to visit each location exactly once. The Wikipedia page on connected components (https://en.wikipedia.org/wiki/Connected_component_(graph_theory)) states the following:

“It is straightforward to compute the components of a graph in linear time (in terms of the numbers of the vertices and edges of the graph) using either breadth-first search or depth-first search. In either case, a search that begins at some particular vertex v will find the entire component containing v (and no more) before returning. To find all the components of a graph, loop through its vertices, starting a new breadth-first or depth-first search whenever the loop reaches a vertex that has not already been included in a previously found component. Hopcroft & Tarjan (1973) describe essentially this algorithm, and state that at that point it was “well known”.”

https://en.wikipedia.org/wiki/Connected_component_(graph_theory)

In other words, my solution that I gave during my interview was literally the textbook solution to the problem!

But if the algorithms are effectively equivalent, why the performance difference? I strongly suspect the reason is the additional overhead in the NetworkX approach, probably related to the number of extra function calls. Python is notoriously bad with function call latency (https://ilovesymposia.com/2015/12/10/the-cost-of-a-python-function-call/), and the poor performance of NetworkX is well documented (https://graph-tool.skewed.de/performance).
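
As a crude illustration of that overhead (a toy micro-benchmark, not a measurement from this project):

import timeit

def add():
    return 1 + 1

# The same trivial work, with and without a function call wrapped around it
inline_time = timeit.timeit("1 + 1", number=1_000_000)
call_time = timeit.timeit("add()", globals=globals(), number=1_000_000)
print(f"inline: {inline_time:.3f} s, function call: {call_time:.3f} s")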

Moving to a library that does more of the heavy lifting in a more performant language should enable some significant performance improvements. I’m going to take a look at graph-tool and see how that goes. I’ve been compiling it while writing this article and it’s still going, so it might be a while before I can start on the code and write the next article. Don’t hold your breath!

Next Steps

  • Implement the code in graph-tool
    • Pre-requisite: compile graph-tool!
  • Try to get the NetworkX implementation to work on a grid of 10,000×10,000 elements. I did try this for the table above, but my system ran out of RAM. I might have to switch to building the graph from a disk-based copy of the grid. That will come with a pretty significant performance hit, but it might be the only way to get the memory usage down