## The Surveying Problem Part 3: The Recursive Solution

As I mentioned in my previous post, I have been working on some possible solutions to the following problem:

You are working for a company that is responsible for processing survey data for an Oil and Gas company. You are given data that summarizes whether oil was found at a certain grid reference. The assumption is that if oil is found in contiguous grid references (or grid elements), then they form a single oil reservoir. The objective is to write a program that will determine the number and locations of the reservoirs.

The question was posed to me in an interview context, so I only had access to standard Python libraries. As a result, the solution I came up with used a recursive algorithm, here’s the code:

```
""" Generate a grid of elements with a random layout of elements that contain oil

build_grid: Build the grid with an optional random seed
get_neighbors: Find the active neighbors of a given element
"""

import random
import itertools

THRESHOLD = 0.85


def get_neighbors(this_element, grid):
    """ Returns a list of neighbor locations

    this_element: A 2-tuple representing the x and y coordinates of an element
    grid: A set containing the locations of all active elements
    """
    x_coord, y_coord = this_element
    x_offsets = [-1, 0, 1]
    y_offsets = [-1, 0, 1]

    neighbors = list()
    for x_offset in x_offsets:
        for y_offset in y_offsets:
            if x_offset == 0 and y_offset == 0:
                continue

            coord = (x_coord + x_offset, y_coord + y_offset)
            if coord in grid:
                neighbors.append(coord)
    return neighbors


def build_grid(size, seed=None):
    """ Build a square grid of elements, where each element may or may not contain oil

    size: The number of elements along one edge of the grid
    seed: Random seed to be used to generate the grid
    """
    random.seed(seed)

    initial_grid = set()
    for location in itertools.product(range(0, size), repeat=2):
        if random.random() > THRESHOLD:
            initial_grid.add(location)

    grid = set()
    # Cluster the grid: keep an active element if it is not isolated,
    # and activate an inactive element if it has at least 4 active neighbors
    for location in itertools.product(range(0, size), repeat=2):
        state = location in initial_grid
        sites_nearby = get_neighbors(location, initial_grid)
        neighbor_count = len(sites_nearby)
        if (state and neighbor_count != 0) or neighbor_count >= 4:
            grid.add(location)

    return grid
```
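The code above only builds the grid; the recursive reservoir search itself isn't shown. Here's a minimal sketch of how such a recursive flood fill might look — the `find_reservoirs` name and structure are my reconstruction for illustration, not necessarily the post's original code:

```python
def find_reservoirs(grid):
    """ Group active elements into reservoirs using a recursive flood fill.

    grid: a set of (x, y) tuples marking elements that contain oil
    Returns a list of sets, one set of locations per contiguous reservoir.
    """
    def visit(element, reservoir, remaining):
        # Claim this element, then recurse into any unclaimed neighbors
        reservoir.add(element)
        remaining.discard(element)
        x, y = element
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                neighbor = (x + dx, y + dy)
                if neighbor in remaining:
                    visit(neighbor, reservoir, remaining)

    remaining = set(grid)
    reservoirs = []
    while remaining:
        reservoir = set()
        visit(next(iter(remaining)), reservoir, remaining)
        reservoirs.append(reservoir)
    return reservoirs


# Two contiguous elements plus one isolated element → 2 reservoirs
print(len(find_reservoirs({(0, 0), (0, 1), (1, 1), (5, 5)})))  # → 2
```

One caveat with any recursive flood fill: Python's default recursion limit is around 1000 frames, so a single very large reservoir on a big grid could overflow the stack.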

It’s worth saying that whilst I came up with the broad algorithm during the interview, the version above contains a few improvements over the original, both to the structure of the code and the performance.

One example: the code in the previous post stored the locations as a `dict()`, where each key was a 2-tuple representing the x and y coordinates, and each value was either `True` or `False`. This was fine, but with large grids (1000×1000 and up) it used a not-insignificant amount of memory, so I switched to a list containing only the `True` 2-tuples.

## Performance Optimization

However, imagine my horror when I realized that while I had reduced my RAM usage, I had slowed down the performance on a 100×100 grid by a factor of over 1000! I must admit I had to resort to the Profiler in PyCharm, which pointed me firmly in the direction of my `get_neighbors` function.

The Profiler wasn’t kind enough to tell me that it was specifically my `coord in all_locations` call that was causing the slowdown, but the `get_neighbors` function doesn’t do a whole lot, so I quickly realized this must be the cause. After a quick Google, I came across the Python wiki’s [TimeComplexity](https://wiki.python.org/moin/TimeComplexity) page. As long as the keys stored in a `dict` have sensible hash functions (which my 2-tuples do), an `in` check against a `dict` is O(1) on average, compared to O(n) for a list.

I moved away from a `dict` because I had no values, so instead of switching to a list, I should have switched to a `set`. A `set` is essentially a `dict` without values (in CPython, the set implementation is closely modelled on the dict implementation), and it's an extremely useful type, especially when you want to perform set-style operations. Here's the call graph using a set instead of a list; as you can see, the bottleneck in `get_neighbors` is gone.
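The gap between list and set membership tests is easy to demonstrate with a quick micro-benchmark (the numbers and sizes here are illustrative, not from the post). The probe element deliberately isn't present, so the list has to be scanned end to end on every check:

```python
import timeit

n = 100_000
as_list = [(i, i) for i in range(n)]
as_set = set(as_list)
probe = (n, n)  # worst case: not present, so the whole list is scanned

t_list = timeit.timeit(lambda: probe in as_list, number=100)
t_set = timeit.timeit(lambda: probe in as_set, number=100)
print(f"list: {t_list:.4f}s  set: {t_set:.6f}s")
```

On any recent CPython, the set version comes out orders of magnitude faster, which matches the O(1)-vs-O(n) behavior described above.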

In case you are wondering, the other 60% is in `_find_and_load`, so not much we can do there. This was for a grid size of around 100×100 though, and the `_find_and_load` overhead decreases as expected as the grid size is increased.

That’s pretty much it in terms of the recursive approach. Performance is not bad; here are some results from some example runs:

The Time/Element and Time/Reservoir results are pretty flat, which is what I would expect. The average number of reservoirs per element (the reservoir frequency) is also reasonably flat, trending towards 0.0284. It might be interesting to look into this a bit more, especially how it depends on how the grid itself was made and the clustering rules. #TODO

## Memory Use

One definite negative of the code above is that even with the shift to storing only active elements in a set, it’s extremely memory-intensive. The final run on a 10,000×10,000 grid used around 4 GB of memory at peak, which I think comes down to two factors:

Firstly, I can’t use a yield-type idiom because I’m bouncing around, looking at elements before and after the one I am processing. Also, during the clustering process, both the original and clustered grids have to be in memory at once, so I can examine the current state while I build the new state.

Secondly, native Python types are pretty bloated, memory-wise ([in-memory size of a Python structure](https://stackoverflow.com/questions/1331471/in-memory-size-of-a-python-structure)). An `int` is 24 bytes and the overhead for a `tuple` is 56 bytes, so each element (two `int`s plus a `tuple`) needs 104 bytes. The original grid is 15% active, which means a grid of 10,000×10,000 elements requires about 1.5 GB of RAM. Assuming the clustered grid is around 10% active, that’s roughly 2.5 GB of memory to store both the original and clustered grids.
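Those figures can be sanity-checked with a little arithmetic. Note the per-object sizes are the post's assumptions and vary by CPython version, and the estimate ignores the `set` containers' own overhead:

```python
# Per-element cost under the post's assumptions:
# two 24-byte ints plus 56 bytes of tuple overhead
BYTES_PER_ELEMENT = 2 * 24 + 56  # 104 bytes

side = 10_000
initial = side ** 2 * 0.15 * BYTES_PER_ELEMENT    # original grid, 15% active
clustered = side ** 2 * 0.10 * BYTES_PER_ELEMENT  # clustered grid, ~10% active

print(f"initial:   {initial / 1e9:.2f} GB")             # → 1.56 GB
print(f"clustered: {clustered / 1e9:.2f} GB")           # → 1.04 GB
print(f"both:      {(initial + clustered) / 1e9:.2f} GB")  # → 2.60 GB
```

That lands in the right ballpark for the roughly 2.5 GB quoted above, with the remaining gap plausibly down to the containers and interpreter overhead.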

## Next Steps

So, there are pros and cons to this approach, and it's definitely worth looking at further. However, my plan is to dig into the graph-based approach to see how the code complexity and performance change.