During the technical component of an interview for a software developer position I was posed the following problem:
You are working for a company that is responsible for processing survey data for an Oil and Gas company. You are given data that summarizes whether oil was found at a certain grid reference. The assumption is that if oil is found in contiguous grid references (or grid elements), then they form a single oil reservoir. The objective is to write a program that will determine the number and locations of the reservoirs.
The Algorithm
The interview went pretty well, and I managed to write something within 10-15 minutes or so that I think did the job. The coding exercise was done with a shared Google Doc, so I never actually got the chance to run the code, but my interviews seemed happy with the solution. The algorithm I came up with looked something like this:
- Create a new active
reservoir
- For each
element
ingrid
:- If the
element
is in thechecked_locations
set, go to the nextelement
- Add the
element
tochecked_locations
- Add the
element
to the currently activereservoir
- Check if there are any neighbors to this
element
- If yes: go to 2 (passing in the list of neighbors as the
grid
)
- If yes: go to 2 (passing in the list of neighbors as the
- If we are not currently recursing:
- Add the
reservoir
to the result set - Create a new active
reservoir
- Add the
- If the
The Discussion
When I was asked to defend my code, I said I was initially hesitant about going down a recursive route, because Python is generally not a great language for writing functional-style recursive functions. However, a purely iterative approach would mean having to run an arbitrary number of times to see if there were any adjacent reservoirs that could be joined, which also isn’t ideal.
As I said, my interviewers seemed happy with my solution, so I assume this was the solution they had in mind. However, I really wasn’t satisfied with the recursive approach, and I suggested a graph-based approach instead. The crux of the problem is taking a grid of data and understanding the connectivity, which is what graphs are intended to capture. I have a bit of experience with the iGraph and NetworkX Python libraries and knew that a graph-based solution would need probably half as many lines and probably be as performant as the recursive approach, if not more so.
The discussion went along the lines of trading one form of complexity for another. I argued on the side of trying to shift the complexity into the standard library, even potentially at the expense of performance. Yes, a full graph library is maybe overkill for a relatively simple problem. However, I’m always in favor of writing (and maintaining!) as little code as possible.
Outcome
I think the discussion was well-received, but alas I didn’t end up getting the job (probably more to do with COVID-19 than anything else). As a coding exercise, I decided to revisit the code to see firstly if my algorithm above would really work, and how much more/less performant a graph-based approach would be. Expect articles in the near future that show the code I have written for each approach and an analysis of the pros and cons of each!