The Surveying Problem Part 6: Stack-based DFS

Intro

As I mentioned in my previous post, the final approach I want to try is a stack-based DFS implementation. This will certainly be faster than any of the graph library-based approaches, and based on Python’s performance issues around recursive algorithms, might be the fastest approach of all.

The Code

First things first, here’s the code:

""" A recursive implementation of the surveying problem
find_reservoirs: Determine the number and location of contiguous reservoirs in a grid
get_neighbors: Find the number of active neighbor elements for a given element
"""
from tools import get_neighbors

METHOD_NAME = "Stack Method (List)"


def find_reservoirs(this_grid):
    """ Recursively determines how many wells are needed, making the assumption that
    only one well is needed per contiguous field

    this_grid: This is the list of locations to be checked for the current reservoir
    reservoir: If being called recursively, this contains the current reservoir that is being built
    original_grid: If being called recursively, this is the full grid to find neighbor elements
    """

    checked_elements = set()
    stack = list()
    reservoirs = []

    remaining_nodes = this_grid

    while remaining_nodes:
        reservoir = set()
        stack.append(remaining_nodes.pop())

        while stack:
            location = stack.pop()

            if location in checked_elements:
                continue

            reservoir.add(location)
            checked_elements.add(location)

            stack.extend(get_neighbors(location, this_grid))

        reservoirs.append(reservoir)
        remaining_nodes -= reservoir

    return reservoirs

It’s pleasingly concise, especially considering it’s not calling any external libraries. It’s essentially a standard DFS, but since I’m working with a forest instead of a single tree I wrap the main loop iterating through the stack (the tree) with another while loop that iterates through all remaining nodes (the forest). This outer loop ‘seeds’ the stack with an arbitrary item (I just pick the first one in the set), and then when the stack is empty I remove the entire tree from the forest with a set subtraction. Once I’ve processed the final tree, it is subtracted from the forest and the outer while loop ends.

One design decision I made at this stage was to use a list as my stack. Python does have a deque1https://docs.python.org/3.8/library/collections.html#collections.deque object that I could have used instead, which is optimized for fast appends and pops and has O(1) performance compared to O(n) performance for lists. I’ll look at that in the next article though.

Results

As I said last time, I wanted to dive into matplotlib to generate some plots of the data I have been generating. So here’s a bar chart that shows the relative performance of all the different implementations for a range of grid sizes and probability thresholds.

The takeaway from this chart is that the stack method is the fastest implementation for all scenarios, but it’s very close between the stack and recursive approaches. This isn’t surprising, we already know that the time taken to figure out which sites are near which other sites is the dominant factor in all methods. Any performance improvement on the recursive side of things is going to be limited.

Next Steps

The chart above is a sneak preview of some of the investigations I have started doing into the relative performance of the different DFS algorithms based on grid size and probability threshold. Grid size is a pretty simple scalar, in that beyond a certain point the number of trees scales linearly with the number of available locations. However, the probability threshold is a bit more interesting since it interacts with the clustering algorithm to very rapidly generate very dense forests at still relatively low thresholds.

The next article will be the final one in the series, and will dive into more detail on some of the interesting mathematical features of this problem, and come up with a final determination on the best solution, including looking into list vs deque implmentations of the stack algorithm.