The Surveying Problem Part 6: Stack-based DFS

Intro

As I mentioned in my previous post, the final approach I want to try is a stack-based DFS implementation. This will certainly be faster than any of the graph-library-based approaches, and given Python’s performance issues with recursion, it might be the fastest approach of all.

The Code

First things first, here’s the code:

""" A recursive implementation of the surveying problem
find_reservoirs: Determine the number and location of contiguous reservoirs in a grid
get_neighbors: Find the number of active neighbor elements for a given element
"""
from tools import get_neighbors

METHOD_NAME = "Stack Method (List)"


def find_reservoirs(this_grid):
    """ Iteratively determines how many wells are needed, making the assumption that
    only one well is needed per contiguous field

    this_grid: Set of all locations containing oil, to be grouped into reservoirs
    """

    checked_elements = set()
    stack = list()
    reservoirs = []

    # Note: this is an alias, so the pop() and set subtraction below consume the caller's set
    remaining_nodes = this_grid

    while remaining_nodes:
        reservoir = set()
        stack.append(remaining_nodes.pop())

        while stack:
            location = stack.pop()

            if location in checked_elements:
                continue

            reservoir.add(location)
            checked_elements.add(location)

            stack.extend(get_neighbors(location, this_grid))

        reservoirs.append(reservoir)
        remaining_nodes -= reservoir

    return reservoirs

It’s pleasingly concise, especially considering it’s not calling any external libraries. It’s essentially a standard DFS, but since I’m working with a forest instead of a single tree, I wrap the main loop that works through the stack (one tree) in another while loop that iterates over all remaining nodes (the forest). This outer loop ‘seeds’ the stack with an arbitrary item (set.pop() returns an arbitrary element, so there’s no real choice to make), and when the stack is empty I remove the entire tree from the forest with a set subtraction. Once the final tree has been processed and subtracted from the forest, the outer while loop ends.

One design decision I made at this stage was to use a list as my stack. Python does have a deque1https://docs.python.org/3.8/library/collections.html#collections.deque object that I could have used instead, which is optimized for fast appends and pops with O(1) performance at both ends, whereas a list is only O(1) (amortized) at its tail and O(n) at its head. Since a stack only ever touches one end, it’s not obvious how much that buys us here; I’ll look at that in the next article though.
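
For reference, the swap itself only touches two lines (the import and the stack construction). Here’s a minimal sketch of what the deque variant might look like, ahead of the actual comparison in the next article:

from collections import deque
from tools import get_neighbors

def find_reservoirs_deque(this_grid):
    """ Sketch: the same stack algorithm, with collections.deque as the stack """
    checked_elements = set()
    stack = deque()  # append() and pop() both work on the right-hand end, so LIFO order is preserved
    reservoirs = []

    remaining_nodes = this_grid

    while remaining_nodes:
        reservoir = set()
        stack.append(remaining_nodes.pop())

        while stack:
            location = stack.pop()

            if location in checked_elements:
                continue

            reservoir.add(location)
            checked_elements.add(location)

            stack.extend(get_neighbors(location, this_grid))

        reservoirs.append(reservoir)
        remaining_nodes -= reservoir

    return reservoirs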

Results

As I said last time, I wanted to dive into matplotlib to generate some plots of the data I have been collecting. So here’s a bar chart that shows the relative performance of all the different implementations for a range of grid sizes and probability thresholds.

The takeaway from this chart is that the stack method is the fastest implementation in every scenario, but it’s very close between the stack and recursive approaches. This isn’t surprising: we already know that the time taken to figure out which sites are near which other sites is the dominant factor in all methods, so any improvement to the traversal itself can only buy so much.

Next Steps

The chart above is a sneak preview of some of the investigations I have started into the relative performance of the different DFS algorithms as a function of grid size and probability threshold. Grid size is pretty simple to reason about, in that beyond a certain point the number of trees scales linearly with the number of available locations. The probability threshold is more interesting, though, since it interacts with the clustering algorithm to generate very dense forests very rapidly, even at relatively low thresholds.

The next article will be the final one in the series. It will dive into more detail on some of the interesting mathematical features of this problem and come to a final verdict on the best solution, including a look at list vs deque implementations of the stack algorithm.

The Surveying Problem Part 4: Graph Solution

As I mentioned in my previous post, the recursive solution to the Surveying Problem is pretty performant, but I wanted to see if a graph-based solution would be faster or easier to implement.

I started with the NetworkX package1https://networkx.github.io/, which I have used before in some other projects. In my experience, it’s reasonably quick, easy to install, and is released under a BSD license, which is important when you are writing in a commercial environment. Without further ado, here’s the code:

import networkx as nx

METHOD_NAME = "NetworkX Method"

def get_neighbors(node, graph):
    """ Returns a list of neighbor location objects

     node: A networkX node. The name is a 2-tuple representing the x and y coordinates of an element
     graph: A networkX graph of locations
     """
    x_coord, y_coord = node
    x_offsets = [-1, 0, 1]
    y_offsets = [-1, 0, 1]

    neighbors = list()
    for x_offset in x_offsets:
        for y_offset in y_offsets:
            if x_offset == 0 and y_offset == 0:
                continue

            coord = (x_coord + x_offset, y_coord + y_offset)
            if coord in graph:
                neighbors.append(coord)
    return neighbors


def find_reservoirs(grid):
    """ Uses a graph approach to find how many wells are needed, making the assumption that
    only one well is needed per contiguous field

    grid: Set containing all locations with oil
    """

    locations_graph = nx.Graph()
    locations_graph.add_nodes_from(grid)

    edges_to_create = set()
    for node in locations_graph:
        neighbors = get_neighbors(node, locations_graph)
        edges_to_create.update((node, neighbor) for neighbor in neighbors)

    locations_graph.add_edges_from(edges_to_create)

    connected_subgraphs = nx.connected_components(locations_graph)
    wells = [set(subgraph) for subgraph in connected_subgraphs]

    return wells
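
As a quick sanity check, here’s how this might be called on a tiny hand-rolled grid (illustrative only; nx.connected_components makes no guarantee about the order in which the reservoirs come back):

if __name__ == "__main__":
    # Two touching locations plus one isolated location -> two reservoirs
    example_grid = {(0, 0), (0, 1), (5, 5)}
    print(find_reservoirs(example_grid))
    # e.g. [{(0, 0), (0, 1)}, {(5, 5)}]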

Pros: Simplicity

First, the pros of this approach. The code is conceptually a lot simpler than the recursive version:

  1. Create nodes for every location that has oil
  2. Connect nodes together if they are adjacent
  3. Call nx.connected_components() on our graph
    • This generates the set of connected subgraphs
    • i.e. the set of subgraphs where every node is connected to every other node in that subgraph
  4. List comprehension to get the correct output summary

Another pro is that a lot of the algorithmic design has been done by the NetworkX folks, as opposed to by me. As far as an interview question goes that’s obviously a bad thing since the point of the exercise was to test my problem-solving abilities, not to have an encyclopedic knowledge of which libraries are available. However, in a professional environment, I think it’s always better to use a library and let someone else do the thinking for you.

Cons: Performance

As for the cons, performance is significantly worse, almost by an order of magnitude. Here’s a summary of the comparison between the recursive solution and the NetworkX-based graph solution:

Grid Size    Time (Recursive)   Time (NetworkX)   Recursive / NetworkX
10×10        4.603E-05 s        1.799E-04 s       25.6 %
100×100      2.369E-03 s        1.634E-02 s       14.5 %
1000×1000    3.933E-01 s        2.139E+00 s       18.4 %
5000×5000    1.649E+01 s        8.869E+01 s       18.5 %

The obvious question to ask is, why? To answer this question, I looked at the source for the nx.connected_components() method.

Investigating the Slowdown

Behind the scenes, NetworkX is doing something very similar to what I implemented previously2https://github.com/networkx/networkx/blob/master/networkx/algorithms/components/connected.py. nx.connected_components() implements a Breadth-First Search3https://en.wikipedia.org/wiki/Breadth-first_search, whereas my recursive code implements a Depth-First Search4https://en.wikipedia.org/wiki/Depth-first_search. A BFS might make sense if we just wanted to establish the existence of every well, but since we need to know every well’s size and complete location, the choice between BFS and DFS really doesn’t make a difference; either way, the optimal solution is simply to visit each location exactly once. The Wikipedia page on connected components5https://en.wikipedia.org/wiki/Connected_component_(graph_theory) states the following:

“It is straightforward to compute the components of a graph in linear time (in terms of the numbers of the vertices and edges of the graph) using either breadth-first search or depth-first search. In either case, a search that begins at some particular vertex v will find the entire component containing v (and no more) before returning. To find all the components of a graph, loop through its vertices, starting a new breadth-first or depth-first search whenever the loop reaches a vertex that has not already been included in a previously found component. Hopcroft & Tarjan (1973) describe essentially this algorithm, and state that at that point it was “well known”.”

https://en.wikipedia.org/wiki/Connected_component_(graph_theory)

In other words, my solution that I gave during my interview was literally the textbook solution to the problem!

But if the algorithms are effectively equivalent, why the performance difference? I strongly suspect the reason is the additional overhead in the NetworkX approach, probably related to the number of extra function calls. Python is notoriously bad with function call latency (https://ilovesymposia.com/2015/12/10/the-cost-of-a-python-function-call/), and the relatively poor performance of NetworkX is well documented6https://graph-tool.skewed.de/performance.
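
To put a rough number on the cost of a function call, here’s a quick, hypothetical micro-benchmark using the standard library’s timeit module; exact figures will vary by machine and interpreter, but the wrapped version typically comes out noticeably slower:

import timeit

# The same trivial work, with and without a function call wrapped around it
inline_time = timeit.timeit("x = 1 + 1", number=1_000_000)
wrapped_time = timeit.timeit("f()", setup="def f(): return 1 + 1", number=1_000_000)

print(f"inline:        {inline_time:.3f} s per million iterations")
print(f"function call: {wrapped_time:.3f} s per million iterations")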

Moving to a library that does more of the heavy lifting in a more performant language should enable some significant performance improvements. I’m going to take a look at graph-tool and see how that goes. I have been compiling it while writing this article and it’s still going, so it might be a while before I can start on the code and write the next article. Don’t hold your breath!

Next Steps

  • Implement the code in graph-tool
    • Pre-requisite: Compile graph-tool!
  • Try to get the NetworkX implementation to work on a grid of 10,000×10,000 elements. I did try this for the table above, but my system ran out of RAM. I might have to switch to a disk-based copy of the grid and generate the graph as I go. This will come with a pretty significant performance hit, but it might be the only way to get the memory usage down.

The Surveying Problem Part 3: The Recursive Solution

As I mentioned in my previous post, I have been working on some possible solutions to the following problem:

You are working for a company that is responsible for processing survey data for an Oil and Gas company. You are given data that summarizes whether oil was found at a certain grid reference. The assumption is that if oil is found in contiguous grid references (or grid elements), then they form a single oil reservoir. The objective is to write a program that will determine the number and locations of the reservoirs.

The question was posed to me in an interview context, so I only had access to standard Python libraries. As a result, the solution I came up with used a recursive algorithm. Here’s the code:

""" Generate a grid of elements with a random layout of elements that contain oil
build_grid: Build the grid with an optional random seed
get_neighbors: Find the number of active neighbor elements for a given element
"""

import random
import itertools

THRESHOLD = 0.85

def get_neighbors(this_element, grid):
    """ Returns a list of neighbor location objects

     this_element: A 2-tuple representing the x and y coordinates of an element
     grid: A dictionary containing all elements and their state
     """
    x_coord, y_coord = this_element
    x_offsets = [-1, 0, 1]
    y_offsets = [-1, 0, 1]

    neighbors = list()
    for x_offset in x_offsets:
        for y_offset in y_offsets:
            if x_offset == 0 and y_offset == 0:
                continue

            coord = (x_coord + x_offset, y_coord + y_offset)
            if coord in grid:
                neighbors.append(coord)
    return neighbors


def build_grid(size, seed=None):
    """ Build a square grid of elements, where each element may or may not contain oil

     size: The number of elements along one edge of the grid
     seed: Random seed to be used to generate the grid
     """
    random.seed(seed)

    initial_grid = set()
    for location in itertools.product(range(0, size), repeat=2):
        if random.random() > THRESHOLD:
            initial_grid.add(location)

    grid = set()
    # Cluster the grid: keep an active element that has at least one active
    # neighbor, and activate any element with at least 4 active neighbors
    for location in itertools.product(range(0, size), repeat=2):
        state = location in initial_grid
        sites_nearby = get_neighbors(location, initial_grid)
        neighbor_count = len(sites_nearby)
        if (state and neighbor_count != 0) or neighbor_count >= 4:
            grid.add(location)

    return grid
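
The recursive find_reservoirs itself isn’t reproduced in this post, so for reference here’s a minimal sketch of how a recursive DFS over this grid representation might look, reusing the get_neighbors helper above. This is an illustration rather than the exact interview code, and note that Python’s default recursion limit (roughly 1,000 frames) would need raising to handle very large reservoirs:

def find_reservoirs(grid):
    """ Sketch: recursively group contiguous locations into reservoirs """
    reservoirs = []
    remaining = set(grid)  # copy, so the caller's grid isn't consumed

    def explore(location, reservoir):
        """ Add this location to the current reservoir, then recurse into its neighbors """
        reservoir.add(location)
        remaining.discard(location)
        for neighbor in get_neighbors(location, grid):
            if neighbor in remaining:
                explore(neighbor, reservoir)

    while remaining:
        reservoir = set()
        explore(next(iter(remaining)), reservoir)
        reservoirs.append(reservoir)

    return reservoirs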

It’s worth saying that whilst I came up with the broad algorithm during the interview, the version above contains a few improvements over the original, both to the structure of the code and the performance.

One example: the code in the previous post had the locations object as a dict(), where each key was a 2-tuple representing the x and y coordinates, and each value was either True or False. This was fine, but when trying with large grids (1000×1000 and up) it was using a not-insignificant amount of memory, so I switched to using a list containing only the True 2-tuples.
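
As a rough illustration of the two representations (hypothetical variable names, trimmed to the essentials):

import itertools
import random

THRESHOLD = 0.85  # same threshold as above
size = 1000

# Original representation: every coordinate maps to True or False
grid_as_dict = {coord: random.random() > THRESHOLD
                for coord in itertools.product(range(size), repeat=2)}

# Leaner representation: keep only the coordinates that actually contain oil
grid_active_only = [coord for coord, active in grid_as_dict.items() if active]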

Performance Optimization

However, imagine my horror when I realized that while I had reduced my RAM usage, I had slowed down the performance on a 100×100 grid by a factor of over 1000! I must admit I had to resort to the Profiler in PyCharm, which pointed me firmly in the direction of my get_neighbors method.

The Profiler wasn’t kind enough to tell me that it was specifically my coord in grid call that was causing the slowdown, but the get_neighbors function doesn’t do a whole lot, so I quickly realized this must be the cause. After a quick Google, I came up with the Python TimeComplexity page1https://wiki.python.org/moin/TimeComplexity. As long as the keys stored in a dict have sensible hash functions (which my 2-tuples do), an in check on a dict is O(1) on average, compared to O(n) for a list.
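
A quick, hypothetical way to see that difference for coordinate tuples (exact timings will vary; the point is that the list lookup grows with the number of locations while the set/dict lookup doesn’t):

import timeit

setup = """
coords = [(x, y) for x in range(200) for y in range(200)]
coord_list = list(coords)
coord_set = set(coords)
probe = (199, 199)  # worst case for the list: it's right at the end
"""

print(timeit.timeit("probe in coord_list", setup=setup, number=1_000))
print(timeit.timeit("probe in coord_set", setup=setup, number=1_000))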

I moved away from a dict because I had no values, but instead of switching to a list, I should have switched to a set. A set is basically a dict without values (it turns out that in CPython that’s pretty much exactly how it’s implemented), and it’s an extremely useful type, especially when you want to perform set-style operations. Here’s the call graph using a set instead of a list; as you can see, the bottleneck in get_neighbors is gone.

In case you are wondering, the other 60% is in _find_and_load (Python’s import machinery), so there’s not much we can do there. This was for a grid size of around 100×100 though, and as expected the _find_and_load overhead shrinks as the grid size increases.

That’s pretty much it in terms of the recursive approach. Performance is not bad; here are some results from some example runs:

Grid Size      Time          # of Reservoirs   Time / Reservoir   Time / Element   Reservoir Frequency
10×10          1.528E-05 s   2                 7.64E-06 s         1.528E-07 s      0.02
100×100        3.417E-03 s   306               1.117E-05 s        3.417E-07 s      0.0306
1000×1000      3.657E-01 s   28,425            1.287E-05 s        3.657E-07 s      0.0284
10000×10000    8.298E+01 s   2,841,360         2.920E-05 s        8.298E-07 s      0.0284

The Time/Element and Time/Reservoir results are pretty flat, which is roughly what I would expect. The average number of reservoirs per element (the reservoir frequency) is also reasonably flat, trending towards 0.0284. It might be interesting to look into this a bit more, especially how it depends on how the grid itself was generated and on the clustering rules.

Memory Use

One definite negative to the code above is that, even with the shift to storing only active elements in a set, it’s extremely memory intensive. The final run on a 10,000×10,000 grid peaked at around 4 GB of memory, which I think is down to two factors:

Firstly, I can’t use a yield-type idiom because I’m bouncing around looking at elements before and after the one I am processing. Also, during the clustering process, both the original and clustered grids have to be in memory at once, so I can examine the current state while I build the new state.

Secondly, native Python types are pretty bloated, memory-wise2https://stackoverflow.com/questions/1331471/in-memory-size-of-a-python-structure. An int is roughly 24 bytes and the overhead for a tuple is around 56 bytes, which puts each element at roughly 104 bytes. The original grid is 15% active, which means a grid of 10,000×10,000 elements needs about 1.5 GB of RAM. Assuming the clustered grid is around 10% active, that’s roughly 2.5 GB of memory to hold both the original and clustered grids at once.
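
These per-object sizes can be sanity-checked on a given interpreter with sys.getsizeof; a quick, hypothetical check (exact numbers vary by Python version and platform, and getsizeof doesn’t follow references, so the tuple and its two ints are counted separately):

import sys

location = (9_999, 9_999)

print(sys.getsizeof(location))     # size of the tuple object itself
print(sys.getsizeof(location[0]))  # size of one of its ints
# The cost per stored location is roughly the tuple plus its two ints,
# ignoring any sharing of small interned integers.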

Next Steps

So, there are pros and cons to this approach, and it’s definitely worth looking at further. However, my plan is to dig into the graph-based approach next, to see how the code complexity and performance change.