The Surveying Problem Part 4: Graph Solution

As I mentioned in my previous post, the recursive solution to the Surveying Problem is pretty performant, but I wanted to see if a graph-based solution would be faster or easier to implement.

I started with the NetworkX package1https://networkx.github.io/, which I have used before in some other projects. In my experience, it’s reasonably quick, easy to install, and is released under a BSD license, which is important when you are writing in a commercial environment. Without further ado, here’s the code:

import networkx as nx

METHOD_NAME = "NetworkX Method"

def get_neighbors(node, graph):
    """ Returns a list of neighbor location objects

     node: A networkX node. The name is a 2-tuple representing the x and y coordinates of an element
     graph: A networkX graph of locations
     """
    x_coord, y_coord = node
    x_offsets = [-1, 0, 1]
    y_offsets = [-1, 0, 1]

    neighbors = list()
    for x_offset in x_offsets:
        for y_offset in y_offsets:
            if x_offset == 0 and y_offset == 0:
                continue

            coord = (x_coord + x_offset, y_coord + y_offset)
            if coord in graph:
                neighbors.append(coord)
    return neighbors


def find_reservoirs(grid):
    """ Uses a graph approach to find how many wells are needed, making the assumption that
    only one well is needed per contiguous field

    grid: Set containing all locations with oil
    """

    locations_graph = nx.Graph()
    locations_graph.add_nodes_from(grid)

    edges_to_create = set()
    for node in locations_graph:
        neighbors = get_neighbors(node, locations_graph)
        _ = [edges_to_create.add((node, neighbor)) for neighbor in neighbors]

    locations_graph.add_edges_from(edges_to_create)

    connected_subgraphs = nx.connected_components(locations_graph)
    wells = [{vertex for vertex in subgraph} for subgraph in connected_subgraphs]

    return wells

Pros: Simplicity

First, the pros of this approach. Firstly, the code is conceptually a lot simpler than the recursive approach:

  1. Create nodes for every location that has oil
  2. Connect nodes together if they are adjacent
  3. Call nx.connected_components() on our graph
    • This generates the set of connected subgraphs
    • i.e. the set of subgraphs where every node is connected to every other node in that subgraph
  4. List comprehension to get the correct output summary

Another pro is that a lot of the algorithmic design has been done by the NetworkX folks, as opposed to by me. As far as an interview question goes that’s obviously a bad thing since the point of the exercise was to test my problem-solving abilities, not to have an encyclopedic knowledge of which libraries are available. However, in a professional environment, I think it’s always better to use a library and let someone else do the thinking for you.

Cons: Performance

As for the cons, performance is significantly worse, almost by an order of magnitude. Here’s a summary of the comparison between the recursive solution and the NetworkX-based graph solution:

Grid SizeTime (Recursive)Time (NetworkX)Time (Recursive / NetworkX)
10×104.603E-05 s1.799E-04 s25.6 %
100×1002.369E-03 s1.634E-02 s14.5 %
1000×10003.933E-01 s2.139E+00 s18.4 %
5000×50001.649E+01 s8.869E+01 s18.5 %

The obvious question to ask is, why? To answer this question, I looked at the source for the nx.connected_components() method.

Investigating the Slowdown

Behind the scenes, NetworkX is doing something very similar to what I implement previously 2https://github.com/networkx/networkx/blob/master/networkx/algorithms/components/connected.py. nx.connected_components() implements a Breadth-First Search3https://en.wikipedia.org/wiki/Breadth-first_search, whereas my recursive code implements a Depth-First Search4https://en.wikipedia.org/wiki/Depth-first_search. A BFS might make sense if we just wanted to find the existence of every well, but since we need to find out every well’s size and complete location, the decision between BFS vs DFS really doesn’t make a difference; the optimal solution is just that we should each location exactly once. The Wikipedia page on connected components5https://en.wikipedia.org/wiki/Connected_component_(graph_theory) states the following:

“It is straightforward to compute the components of a graph in linear time (in terms of the numbers of the vertices and edges of the graph) using either breadth-first search or depth-first search. In either case, a search that begins at some particular vertex v will find the entire component containing v (and no more) before returning. To find all the components of a graph, loop through its vertices, starting a new breadth-first or depth-first search whenever the loop reaches a vertex that has not already been included in a previously found component. Hopcroft & Tarjan (1973) describe essentially this algorithm, and state that at that point it was “well known”.”

https://en.wikipedia.org/wiki/Connected_component_(graph_theory)

In other words, my solution that I gave during my interview was literally the textbook solution to the problem!

But if the algorithms are effectively equivalent, then why the performance difference? I strongly suspect the reason is because of the additional overhead in the NetworkX approach, probably related to the number of additional function calls. Python is notoriously bad with function call latency https://ilovesymposia.com/2015/12/10/the-cost-of-a-python-function-call/, and the poor performance of NetworkX is well documented6https://graph-tool.skewed.de/performance.

Moving to a library that does more of the heavy lifting in a more performant language should enable some significant performance improvements. I’m going to take a look at graph-tool and see how that goes. I have been compiling it while writing this article and it’s still going, so it might be a while before I can start and the code and write the next article. Don’t hold your breath!

Next Steps

  • Implement the code in graph-tools
    • Pre-requisite: Compile graph-tools!
  • Try and get the NetworkX implementation to work on a grid of 10,000×10,000 elements. I did try this for the table above, but my system ran out of RAM. I might have to switch to a disk-based copy of the grid to generate the graph as I do. This will have a pretty significant performance hit, but might be the only way to get the memory usage down