Jeopardy!

Hi everyone! I know I said I was going to start work on a weight tracking app next, but I came up with something a bit more fun! I was crawling through some of the fascinating datasets on Kaggle looking for some inspiration, and found a set of over 200,000 Jeopardy questions (https://www.kaggle.com/tunguz/200000-jeopardy-questions).

My plan is to use this to learn a little more about front-end development, which is still my weakest area in the software world. I want to build a page that generates random combinations of previous Jeopardy games, potentially allowing two people to play against each other in some kind of shared session.

First things first though, I need to get the raw data from Kaggle into a format that I can use in my front end. Python seems like a sensible tool for that, and the Pandas library should make the process easy. The first thing I did was write a simple script to suck the data out of the CSV and dump it into a SQLite database. Thanks to the magic of Pandas and SQLAlchemy, the code is alarmingly simple…

import pandas as pd
from sqlalchemy import create_engine

df_questions = pd.read_csv('JEOPARDY_CSV.csv', parse_dates=[1])
engine = create_engine('sqlite:///questions.db', echo=False)
df_questions.to_sql('questions', con=engine, if_exists='replace')

And that’s it! Done! Time for a pint! Pandas automatically reads the CSV (including headers) and parses the dates into real DateTime objects, and then SQLAlchemy creates a SQLite database file and dumps the data into it. Five lines including import statements; even for Python, that’s impressive.
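As a quick sanity check (not part of the script itself, just something I’d run once), the data can be read straight back out of the new database:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///questions.db', echo=False)

# Pull a few rows back out to confirm the load worked
df_check = pd.read_sql('SELECT * FROM questions LIMIT 5', con=engine)
print(df_check)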

However, there are some problems with this. The first is that the column names in the CSV have spaces all over the place, which is not ideal since both SQL and Pandas make you jump through all sorts of hoops to address columns with spaces in their names. That’s an easy fix with a renaming dict, which turns the code above into something like this:

import pandas as pd
from sqlalchemy import create_engine

column_rename = {"Show Number": "show_identity",
                 " Air Date": "air_date",
                 " Round": "round_name",
                 " Category": "category",
                 " Value": "value",
                 " Question":"question",
                 " Answer": "answer"}

df_questions = pd.read_csv('JEOPARDY_CSV.csv', parse_dates=[1])
df_questions.rename(columns = column_rename, inplace=True)
engine = create_engine('sqlite:///questions.db', echo=False)
df_questions.to_sql('questions', con=engine, if_exists='replace')

OK, so I’m now up to 7 lines. Still not bad. However, another problem remains: the normalization of the data (or lack thereof). I’ll talk through the pros and cons of normalizing a dataset in a moment, but first, here’s a look at the current state of the database:

| show_identity | air_date | round_name | category | value | question | answer |
|---|---|---|---|---|---|---|
| 5462 | 2008-05-13 | Double Jeopardy! | Who’s Your Mommy? | $800 | … | King Solomon |
| 5462 | 2008-05-13 | Double Jeopardy! | It’s all from the Greek to me | $800 | An institution for the care… | … |
| 5462 | 2008-05-13 | Double Jeopardy! | Dr. Drew | $1200 | People 18 to 25 are vulnerable to… | … |
| 5462 | 2008-05-13 | Double Jeopardy! | Recent Films | $1200 | The attempted assassination… | … |
| 5462 | 2008-05-13 | Double Jeopardy! | Music Class | $2000 | This large woodwind… | … |
| 5462 | 2008-05-13 | Double Jeopardy! | Lend me you Iroquois | $1200 | The Iroquois Nationals… | … |
| 5462 | 2008-05-13 | Double Jeopardy! | Who’s Your Mommy? | $1200 | … | Invention (proverbially) |

This is a random sample of 7 rows from the dataset, and the lack of normalization is pretty obvious. Of the 7 columns shown here, only two contain unique values (question and answer), two contain repeated values, and three contain the same value in every row! In fact, across the 216,930 unique questions in this dataset, there are only 3,640 unique shows and 47,200 unique categories. So there’s certainly huge scope for normalization here!
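Those counts are quick to verify in Pandas (a sketch using the renamed columns from the script above; note that I’m counting a category as unique per show and round, which is also how I treat it during normalization later):

# Total questions
print(len(df_questions))  # 216,930

# Unique shows
print(df_questions['show_identity'].nunique())  # 3,640

# Unique categories, counting a category once per show and round
print(len(df_questions[['show_identity', 'round_name', 'category']].drop_duplicates()))  # ~47,200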

I’ll go through the normalization steps I went through in a moment, but as a starting point, here’s the schema diagram for the current unnormalized table:

The data as pulled from Kaggle doesn’t even really conform to the Unnormalized Form (https://en.wikipedia.org/wiki/Unnormalized_form), since it doesn’t have an explicitly defined primary key (however, Pandas was kind enough to create one for me automatically when I loaded the CSV). It does, however, have no duplicate rows, so it’s halfway there.

Normalizing The Dataset

Typically, before I start writing any database code I want to design my schema to figure out how I’m going to represent the data. I design relational databases for a living, and I’ve developed a pretty simple approach to the initial coarse-grained normalization of a database schema.

The trick is to figure out what concept each table represents.

Throughout my career, I have typically been involved in designing and building databases to store engineering data. If I need to build a database to store mechanical testing data (for example), I’m probably going to need tables about the material being tested (both the specification of the material and the specific instance of the material), the machine that’s doing the testing, the results of the test, the calculations performed on those results, etc.

In this case, the key Jeopardy concepts I came up with were:

  • The Show (show number, air date)
  • The Rounds (Jeopardy!, Double Jeopardy!, Final Jeopardy!)
  • The Categories (History, Brass, Birds, “T”elevision)
  • Question & Answers

Everything associated with Jeopardy (except for the contestants and Aaaaalex Trebek!) fits into one of those buckets, so that sounds like a pretty good way to go about designing the normalized database. To formalize things a bit, here it is in UML:

This feels like the appropriate way to construct the database since it relates the key concepts of a Jeopardy game in a way that makes conceptual sense, e.g. multiple categories are included in a single show, and each category contains multiple questions and answers. Each question and answer pair belongs to a single category, and has a specific ‘value’.

However, I want to be sure I’ve considered the most significant possible ways of normalizing the data, so I’ll now go through the first three normal forms to make sure the database either adheres to that form, or that I have a justification for not doing so.

1st Normal Form (1NF)

The first normal form simply states that the values in a table’s columns should be atomic (i.e. indivisible). We were actually already there with the initial CSV, so we pass this first test straight off. Great!

2nd Normal Form (2NF)

The second normal form gets a bit trickier. To pass this one, we need to ensure that every column that is not part of a candidate key depends on the entire candidate key, and not just a part of it. A common ‘smell’ associated with breaking 2NF is a lot of duplicated data.

Our original database didn’t satisfy 2NF all over the place. A candidate key for the original table was the question itself, but nothing except the answer actually depends on the question. Pulling the question, answer, and value columns into a separate table helps with this, and doing the same for the category name brings all the tables into compliance with 2NF.

3rd Normal Form (3NF)

The Third normal form states that every non-prime column is non-transitively dependent on every key in that table, i.e. there should not be a column whose values depend on something other than a key. Again, this is something that my schema satisfies due to the creation of separate tables for each entity.

For example, in the original dataset, the air date of the show is functionally dependent on the show_identity and not on the question candidate key. Therefore there’s nothing to stop the same episode identity from having multiple different air dates. This is a violation of 3NF.

The normalized schema creates a dedicated Show table with the show identity, and the air_date is dependent purely on this column.
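A quick way to convince yourself of the problem from the flat table (a sketch, run before the normalization code below): count the distinct air dates per show. The flat schema would happily store more than one, even though this particular dataset happens to be clean:

# Each show should have exactly one air date; the flat schema doesn't enforce this
dates_per_show = df_questions.groupby('show_identity')['air_date'].nunique()
print((dates_per_show > 1).sum())  # 0 here, but nothing in the schema guarantees it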

Pros and Cons of Excessive Normalization

Before I get to the code that performs the normalization, I want to address the question of whether too much normalization is a bad thing. The answer, in my experience, is that you have to determine that on a case-by-case basis, based on both the data you are storing and the way that data is being used. It’s pointless having a theoretically perfect database if you have to perform 30 joins that incur a huge performance penalty, and it’s even worse if the front-end tool you are using just can’t do the joins that you need!

With that being said, I always like to start with an over-normalized database, since during the implementation phase of a project it’s generally easier to denormalize a database than it is to normalize one. Also, you only realize what you need to denormalize once you actually try to use the database. So I think it’s always better to over-normalize early in a project, and then denormalize as the front-end comes together and additional constraints become clear.

The Code

The code to normalize the database makes extensive use of copying subsets of DataFrames in Pandas, dropping columns, removing duplicates, and mapping values. Most of it was pretty straightforward, but one area that required some figuring out was generating the category table. This step required creating an entirely new identity that depended on the category name, the round, and the show (since category names can be re-used between shows). My original approach was to:

  1. Copy the questions DataFrame to a ‘category’ DataFrame
  2. Drop the columns I didn’t want (value, question, answer)
  3. Delete duplicate rows
  4. Re-generate the DataFrame index
  5. Find the category identity (index) based on each round, category, show tuple in the questions DataFrame
import swifter  # registers the .swifter accessor used below (pip install swifter)

# Create a categories table
df_categories = pd.DataFrame(columns=['show_identity', 'round_name', 'name'])

df_categories['show_identity'] = df_questions['show_identity'].to_numpy()
df_categories['round_name'] = df_questions['round_name'].to_numpy()
df_categories['name'] = df_questions['category'].to_numpy()
df_categories.drop_duplicates(inplace=True, ignore_index=True)

def get_category_id(question):
    show_identity = question.show_identity
    round_name = question.round_name
    category = question.category

    category_id = df_categories[(df_categories.round_name == round_name) &
                                (df_categories.name == category) &
                                (df_categories.show_identity == show_identity)].index[0]
    return category_id

# Get the category ID for each question
category_ids = df_questions.swifter.apply(get_category_id, axis=1)

# Add the category identity to the questions table, and delete unneeded columns
df_questions.insert(loc=0, column='category_identity', value=category_ids)

This just felt wrong when I was writing it; doing 200,000 lookups across three columns in a DataFrame is never going to be efficient, and even with the swifter library (https://github.com/jmcarpenter2/swifter) parallelizing the operation across my 12-core Ryzen, it still took over 20 minutes to complete.

The better approach was to create the category index before I duplicated the DataFrame, that way the cross-referencing was already done without having to resort to a whole load of lookups. The steps to do this were as follows:

  1. Create the category DataFrame and copy the relevant columns
  2. Create the composite key based on those relevant columns and create a map between those values and an int
  3. Map the composite key in the category DataFrame
  4. Map the composite key in the question DataFrame, and drop the columns that were moved to the category DataFrame

This cut the time taken to build the category table from around 20 minutes to about 10 seconds, an improvement factor of about 120! You can see that code below, starting from the comment line # Create a categories table.

import pandas as pd
from sqlalchemy import create_engine
import numpy as np

column_rename = {"Show Number": "show_identity",
                 " Air Date": "air_date",
                 " Round": "round_name",
                 " Category": "category",
                 " Value": "value",
                 " Question":"question",
                 " Answer": "answer"}

df_questions = pd.read_csv('JEOPARDY_CSV.csv', parse_dates=[1])

df_questions.rename(columns = column_rename, inplace=True)

# Turn the value into an integer (e.g. '$1,200' -> 1200)
df_questions['value'] = df_questions['value'].map(
    lambda a: int(a[1:].replace(',', '')) if a != 'None' else np.nan)

# Create a shows table, and delete corresponding columns in questions table
df_shows = pd.DataFrame(columns=['identity', 'air_date'])
df_shows['identity'] = df_questions['show_identity'].to_numpy()
df_shows['air_date'] = df_questions['air_date'].to_numpy()
df_shows.drop_duplicates(inplace=True, ignore_index=True)
df_shows.set_index('identity', inplace=True, append=False)
df_questions.drop('air_date', axis=1, inplace=True)

# Create a categories table
df_categories = pd.DataFrame()

df_categories['show_identity'] = df_questions['show_identity'].to_numpy()
df_categories['round_name'] = df_questions['round_name'].to_numpy()
df_categories['category'] = df_questions['category'].to_numpy()

# Define the composite key that uniquely identifies a category
category_composite_key = ['show_identity', 'round_name', 'category']
df_categories.drop_duplicates(inplace=True, ignore_index=True)

old_category_identities = df_categories[category_composite_key].apply(
    lambda row: '_'.join(row.values.astype(str)), axis=1)
new_category_identities = range(len(old_category_identities))
category_identity_mapping = dict(zip(old_category_identities, new_category_identities))


def map_category_identity(category):
    composite_key = '_'.join(category.values.astype(str))
    return category_identity_mapping[composite_key]


df_categories['identity'] = df_categories.apply(map_category_identity, axis=1)
df_categories.rename(columns={'category': 'name'}, inplace=True)
df_categories.index.rename('identity', inplace=True)
df_categories.set_index(keys=['identity'], append=False, inplace=True, drop=True)

# Perform the same mapping on the questions themselves, and drop the individual elements of the key
category_identities = df_questions[category_composite_key].apply(map_category_identity, axis=1)
df_questions.insert(0, 'category_identity', value=category_identities)
df_questions.drop(category_composite_key, axis=1, inplace=True)
df_questions.index.rename('identity', inplace=True)

# Create a rounds table
round_name_mapping = {'Jeopardy!': 0, 'Double Jeopardy!': 1, 'Final Jeopardy!': 2, 'Tiebreaker': 3}
df_rounds = pd.DataFrame.from_dict({'round_name': round_name_mapping.keys()})
df_rounds.index.rename('identity', inplace=True)

round_identity = df_categories.apply(lambda c: round_name_mapping[c.round_name], axis=1)
df_categories.insert(loc=0, column='round_identity', value=round_identity)
df_categories.drop('round_name', axis=1, inplace=True)

# Export to SQLite
engine = create_engine('sqlite:///questions.db', echo=False)
df_questions.to_sql('questions', con=engine, if_exists='replace')
df_shows.to_sql('shows', con=engine, if_exists='replace')
df_categories.to_sql('categories', con=engine, if_exists='replace')
df_rounds.to_sql('rounds', con=engine, if_exists='replace')


Referential Integrity

The code above successfully creates the different tables in SQLite with the right columns, which allow all the various joins to be performed to find out which questions appeared on a certain show (through the category table), or ultimately even to re-create the original CSV.

However, the SQLite database doesn’t have any constraints to ensure that new data added to the database satisfies the requirements for the joins described above. And unfortunately, since I’m using SQLite at the moment, there’s no simple way to apply constraints to tables that are created with DataFrame.to_sql() (https://www.thetopsites.net/article/53283866.shtml). Since this is still in the prototype stage it’s not really critical that I handle this now, but before going into production I’d probably change to a MySQL or Postgres backend, and at that point I’d want to make sure my referential integrity is all squared away.
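For reference, here’s a sketch of what that could look like with SQLAlchemy: define the tables (and their foreign keys) explicitly up front, then append the DataFrames into them rather than letting to_sql() create the tables. This isn’t what my current script does, and note that SQLite only enforces foreign keys when the foreign_keys pragma is switched on:

from sqlalchemy import (create_engine, MetaData, Table, Column,
                        Integer, String, Date, ForeignKey)

engine = create_engine('sqlite:///questions.db', echo=False)
metadata = MetaData()

shows = Table('shows', metadata,
              Column('identity', Integer, primary_key=True),
              Column('air_date', Date))

rounds = Table('rounds', metadata,
               Column('identity', Integer, primary_key=True),
               Column('round_name', String))

categories = Table('categories', metadata,
                   Column('identity', Integer, primary_key=True),
                   Column('round_identity', Integer, ForeignKey('rounds.identity')),
                   Column('show_identity', Integer, ForeignKey('shows.identity')),
                   Column('name', String))

questions = Table('questions', metadata,
                  Column('identity', Integer, primary_key=True),
                  Column('category_identity', Integer, ForeignKey('categories.identity')),
                  Column('value', Integer),
                  Column('question', String),
                  Column('answer', String))

metadata.create_all(engine)

# Then append rather than replace, so to_sql() doesn't recreate the tables, e.g.:
# df_shows.to_sql('shows', con=engine, if_exists='append')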

Next Steps

As I mentioned at the top, my goal is to create a front-end that lets me play a randomly generated game of Jeopardy, potentially with other people in some kind of shared session over the web.

However, the first step in that is going to be building a REST API that can return a random Jeopardy ‘game’ when requested. I’ll probably implement the first pass in Flask, but I also want to give some other languages a try; I’ve done some very basic work with Spring Boot in Java, and I think it would make a lot of sense to use that here as well. Maybe have the backend in Spring Boot and the front-end in a Flask-based web app…
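To make that a bit more concrete, here’s the kind of endpoint I have in mind, as a rough Flask sketch against the SQLite file from earlier (the route and response shape are placeholders, not a final design):

import random
import sqlite3

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/game')
def random_game():
    # Pick a random show, then return its questions joined back through categories
    conn = sqlite3.connect('questions.db')
    conn.row_factory = sqlite3.Row
    shows = [row['identity'] for row in conn.execute('SELECT identity FROM shows')]
    show_identity = random.choice(shows)
    rows = conn.execute(
        'SELECT c.name AS category, q.value, q.question, q.answer '
        'FROM questions q '
        'JOIN categories c ON q.category_identity = c.identity '
        'WHERE c.show_identity = ?', (show_identity,))
    return jsonify({'show': show_identity,
                    'questions': [dict(row) for row in rows]})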

Anyway, watch this space and see what I come up with!

The Surveying Problem Part 3: The Recursive Solution

As I mentioned in my previous post, I have been working on some possible solutions to the following problem:

You are working for a company that is responsible for processing survey data for an Oil and Gas company. You are given data that summarizes whether oil was found at a certain grid reference. The assumption is that if oil is found in contiguous grid references (or grid elements), then they form a single oil reservoir. The objective is to write a program that will determine the number and locations of the reservoirs.

The question was posed to me in an interview context, so I only had access to the Python standard library. As a result, the solution I came up with used a recursive algorithm. Here’s the code:

""" Generate a grid of elements with a random layout of elements that contain oil
build_grid: Build the grid with an optional random seed
get_neighbors: Find the number of active neighbor elements for a given element
"""

import random
import itertools

THRESHOLD = 0.85

def get_neighbors(this_element, grid):
    """ Returns a list of neighbor location objects

     this_element: A 2-tuple representing the x and y coordinates of an element
     grid: A dictionary containing all elements and their state
     """
    x_coord, y_coord = this_element
    x_offsets = [-1, 0, 1]
    y_offsets = [-1, 0, 1]

    neighbors = list()
    for x_offset in x_offsets:
        for y_offset in y_offsets:
            if x_offset == 0 and y_offset == 0:
                continue

            coord = (x_coord + x_offset, y_coord + y_offset)
            if coord in grid:
                neighbors.append(coord)
    return neighbors


def build_grid(size, seed=None):
    """ Build a square grid of elements, where each element may or may not contain oil

     size: The number of elements along one edge of the grid
     seed: Random seed to be used to generate the grid
     """
    random.seed(seed)

    initial_grid = set()
    for location in itertools.product(range(0, size), repeat=2):
        if random.random() > THRESHOLD:
            initial_grid.add(location)

    grid = set()
    # Cluster the grid: keep an active element that is not isolated,
    # and activate an inactive element with at least 4 active neighbors
    for location in itertools.product(range(0, size), repeat=2):
        state = location in initial_grid
        sites_nearby = get_neighbors(location, initial_grid)
        neighbor_count = len(sites_nearby)
        if (state and neighbor_count != 0) or neighbor_count >= 4:
            grid.add(location)

    return grid
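The snippet above only generates the input grid; for completeness, here’s a minimal sketch of the recursive reservoir finder itself, following the algorithm I described in Part 1 (the exact names and structure here are illustrative):

def find_reservoirs(grid):
    """ Group contiguous active elements into reservoirs

     grid: A set containing the coordinates of all active elements
     Returns a list of sets, one set of coordinates per reservoir
     """
    checked_locations = set()
    reservoirs = []

    def explore(element, reservoir):
        checked_locations.add(element)
        reservoir.add(element)
        for neighbor in get_neighbors(element, grid):
            if neighbor not in checked_locations:
                explore(neighbor, reservoir)

    for element in grid:
        if element in checked_locations:
            continue
        reservoir = set()
        explore(element, reservoir)
        reservoirs.append(reservoir)

    return reservoirs

# Note: very large reservoirs can exceed Python's default recursion
# limit (1000 frames); sys.setrecursionlimit can raise it if needed.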

It’s worth saying that whilst I came up with the broad algorithm during the interview, the version above contains a few improvements over the original, both to the structure of the code and the performance.

One example: the code in the previous post had the locations object as a dict(), where each key was a 2-tuple representing the x and y coordinates, and each value was either True or False. This was fine, but with large grids (1000×1000 and up) it used a not-insignificant amount of memory, so I switched to using a list containing only the True 2-tuples.

Performance Optimization

However, imagine my horror when I realized that while I had reduced my RAM usage, I had slowed down the performance on a 100×100 grid by a factor of over 1000! I must admit I had to resort to the profiler in PyCharm, which pointed me firmly in the direction of my get_neighbors method.

The profiler wasn’t kind enough to tell me that it was specifically my coord in all_locations call that was causing the slowdown, but the get_neighbors function doesn’t do a whole lot, so I quickly realized this must be the cause. After a quick Google, I found the Python TimeComplexity page (https://wiki.python.org/moin/TimeComplexity). As long as the keys stored in a dict have sensible hashing functions (which my 2-tuples do), an in call on a dict is O(1), compared to O(n) for a list.

I moved away from a dict because I had no values, but instead of switching to a list, I should have switched to a set. A set is basically a dict without values (in CPython it’s implemented with a very similar hash table), and is an extremely useful type, especially when you want to perform set-style operations. Here’s the call graph using a set instead of a list; as you can see, the bottleneck in get_neighbors is gone.

In case you are wondering, the other 60% is in _find_and_load, so not much we can do there. This was for a grid size of around 100×100 though, and the _find_and_load overhead decreases as expected as the grid size is increased.
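The difference is easy to demonstrate in isolation; here’s a quick micro-benchmark (the numbers will obviously vary by machine):

import timeit

coords = [(x, y) for x in range(1000) for y in range(100)]
coord_list = coords
coord_set = set(coords)

# Membership tests: O(n) scan of the list vs O(1) hash lookup in the set
print(timeit.timeit(lambda: (999, 99) in coord_list, number=100))
print(timeit.timeit(lambda: (999, 99) in coord_set, number=100))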

That’s pretty much it in terms of the recursive approach. Performance is not bad; here are the results from some example runs:

| Grid Size | Time | # of Reservoirs | Time / Reservoir | Time / Element | Reservoir Frequency |
|---|---|---|---|---|---|
| 10×10 | 1.528E-05 s | 2 | 7.64E-06 s | 1.528E-07 s | 0.02 |
| 100×100 | 3.417E-03 s | 306 | 1.117E-05 s | 3.417E-07 s | 0.0306 |
| 1000×1000 | 3.657E-01 s | 28,425 | 1.287E-05 s | 3.657E-07 s | 0.0284 |
| 10000×10000 | 8.298E+01 s | 2,841,360 | 2.920E-05 s | 8.298E-07 s | 0.0284 |

The Time/Element and Time/Reservoir results are pretty flat, which is what I would expect. The average number of reservoirs per element (the reservoir frequency) is also reasonably flat, trending towards 0.0284. It might be interesting to look into this a bit more, especially how it depends on how the grid itself was made and the clustering rules. #TODO

Memory Use

One definite negative to the code above is that even with the shift to storing only active elements in a set, it’s extremely memory intensive. The final run on a 10,000×10,000 grid used around 4 GB of memory at peak, which I think is down to two factors:

Firstly, I can’t use a yield-type idiom because I’m bouncing around looking at elements before and after the one I am processing. Also, during the clustering process, both the original and clustered grids have to be in memory at once, so I can examine the current state while I build the new state.

Secondly, native Python types are pretty bloated, memory-wise (https://stackoverflow.com/questions/1331471/in-memory-size-of-a-python-structure). An int is 24 bytes and the overhead for a tuple is 56 bytes, which means each element needs 104 bytes. The original grid is 15% active, which means a 10,000×10,000 grid needs around 1.5 GB of RAM. Assuming the clustered grid is around 10% active, that’s about 2.5 GB of memory to store both the original and clustered grids.
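The exact sizes depend on the Python version and platform, but sys.getsizeof makes it easy to check the arithmetic on your own interpreter:

import sys

x_coord, y_coord = 9999, 9999
element = (x_coord, y_coord)

print(sys.getsizeof(element))  # the tuple container itself
print(sys.getsizeof(x_coord) + sys.getsizeof(y_coord))  # plus the two ints it references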

Next Steps

So, there are pros and cons to this approach, and it’s definitely worth exploring more. However, my plan is to dig into the graph-based approach next, to see how the code complexity and performance change.

The Surveying Problem Part 2: Building a Grid

If I’m going to solve this problem, the first thing I need is a grid. In my interview, the grid itself didn’t actually exist; it was all hypothetical: “assuming a grid in a format that you can dictate, write an algorithm that would solve this problem”. If I am actually going to solve the problem, I need a grid with which to work!

Building a Grid

I assumed a square grid at all times, needing only one parameter to define the size. I then wrote a pretty basic function that returns a grid in the form of a dict, where each key is a 2-tuple representing the x and y coordinates of an element, and each value is either True or False, depending on whether there is oil there.

import random
import itertools

THRESHOLD = 0.85

def build_grid(size, seed=None):
    """ Build a square grid of elements, where each element may or may not contain oil

     size: The number of elements along one edge of the grid
     seed: Random seed to be used to generate the grid
     """
    random.seed(seed)

    grid = dict()
    for x_coord, y_coord in itertools.product(range(0, size), repeat=2):
        grid[(x_coord, y_coord)] = random.random() > THRESHOLD

    return grid

I somewhat arbitrarily picked 0.85 as the threshold for determining whether an element contained oil, which seemed to give reasonable results. However, the grids generated this way often contained a lot of isolated ‘oil’ (or active) elements. I was keen to have some degree of clustering, so that my code would return a single reservoir for a group of adjacent active elements. Here’s an example grid generated with the code above:

Clustering the Grid

There are a few active elements that are adjacent to one or more other active elements, but a good number are isolated. I wanted something a bit more representative of ‘reality’, or at least my idea of reality (I’m not a geologist!), where the active elements cluster together. To achieve this, I modified the build_grid() code to perform an additional round of processing that clusters the grid:

""" Generate a grid of elements with a random layout of elements that contain oil
build_grid: Build the grid with an optional random seed
get_neighbors: Find the number of active neighbor elements for a given element
"""

import random
import itertools

THRESHOLD = 0.85

def get_neighbors(this_element, grid):
    """ Returns a list of neighbor location objects

     this_element: A 2-tuple representing the x and y coordinates of an element
     grid: A dictionary containing all elements and their state
     """
    x_coord, y_coord = this_element
    x_offsets = [-1, 0, 1]
    y_offsets = [-1, 0, 1]

    neighbors = list()
    for x_offset in x_offsets:
        for y_offset in y_offsets:
            if x_offset == 0 and y_offset == 0:
                continue

            coord = (x_coord + x_offset, y_coord + y_offset)
            if grid.get(coord, False):
                neighbors.append(coord)
    return neighbors


def build_grid(size, seed=None):
    """ Build a square grid of elements, where each element may or may not contain oil

     size: The number of elements along one edge of the grid
     seed: Random seed to be used to generate the grid
     """
    random.seed(seed)

    initial_grid = dict()
    for x_coord, y_coord in itertools.product(range(0, size), repeat=2):
        initial_grid[(x_coord, y_coord)] = random.random() > THRESHOLD

    grid = set()
    # Cluster the grid: keep an active element that is not isolated,
    # and activate an inactive element with at least 4 active neighbors
    for location, state in initial_grid.items():
        sites_nearby = get_neighbors(location, initial_grid)
        neighbor_count = len(sites_nearby)
        if (state and neighbor_count != 0) or neighbor_count >= 4:
            grid.add(location)

    return grid

This method implements something inspired by Conway’s Game of Life (https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life), getting the number of adjacent active elements for each element in turn and applying the following rules:

  • If an element is active and has at least 1 active neighbor, it remains active
  • If an element is inactive and has at least 4 active neighbors, it becomes active
  • Any other element is inactive

This tends to produce a more clustered grid, which seems more representative of reality:

Next Steps

And that’s it! I now have code that procedurally generates grids I can use as inputs to my solution. A couple of notes before I move on:

  1. I used two different seeds for the two figures in this article, so the active elements don’t line up. Sorry if that causes any confusion!
  2. The code for drawing the output will be covered in a future article

Next step, solve the problem!

The Surveying Problem Part 1: The Problem

During the technical component of an interview for a software developer position I was posed the following problem:

You are working for a company that is responsible for processing survey data for an Oil and Gas company. You are given data that summarizes whether oil was found at a certain grid reference. The assumption is that if oil is found in contiguous grid references (or grid elements), then they form a single oil reservoir. The objective is to write a program that will determine the number and locations of the reservoirs.

The Algorithm

The interview went pretty well, and I managed to write something within 10-15 minutes that I think did the job. The coding exercise was done in a shared Google Doc, so I never actually got the chance to run the code, but my interviewers seemed happy with the solution. The algorithm I came up with looked something like this:

  1. Create a new active reservoir
  2. For each element in the grid:
    1. If the element is in the checked_locations set, skip to the next element
    2. Add the element to checked_locations
    3. Add the element to the currently active reservoir
    4. Check whether the element has any neighbors; if it does, go to step 2, passing the list of neighbors in as the grid
    5. If we are not currently recursing:
      1. Add the reservoir to the result set
      2. Create a new active reservoir

The Discussion

When I was asked to defend my code, I said I was initially hesitant about going down a recursive route, because Python is generally not a great language for writing functional-style recursive code. However, a purely iterative approach would mean running over the grid an arbitrary number of times to see if there were any adjacent reservoirs that could be joined, which also isn’t ideal.

As I said, my interviewers seemed happy with my solution, so I assume this was the solution they had in mind. However, I really wasn’t satisfied with the recursive approach, and I suggested a graph-based approach instead. The crux of the problem is taking a grid of data and understanding its connectivity, which is exactly what graphs are intended to capture. I have a bit of experience with the iGraph and NetworkX Python libraries, and knew that a graph-based solution would probably need half as many lines and be at least as performant as the recursive approach.

The discussion went along the lines of trading one form of complexity for another. I argued on the side of trying to shift the complexity into the standard library, even potentially at the expense of performance. Yes, a full graph library is maybe overkill for a relatively simple problem. However, I’m always in favor of writing (and maintaining!) as little code as possible.
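As a taste of what I mean, here’s a sketch of the graph-based idea using NetworkX (this is the approach I’ll dig into properly in a later post, so treat it as illustrative rather than final):

import networkx as nx

def find_reservoirs_graph(grid):
    """ grid: A set of (x, y) tuples for the active elements """
    graph = nx.Graph()
    graph.add_nodes_from(grid)

    # Connect each active element to its active neighbors (including diagonals)
    for x_coord, y_coord in grid:
        for x_offset in (-1, 0, 1):
            for y_offset in (-1, 0, 1):
                neighbor = (x_coord + x_offset, y_coord + y_offset)
                if neighbor != (x_coord, y_coord) and neighbor in grid:
                    graph.add_edge((x_coord, y_coord), neighbor)

    # Each connected component is one reservoir
    return list(nx.connected_components(graph))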

Outcome

I think the discussion was well-received, but alas I didn’t end up getting the job (probably more to do with COVID-19 than anything else). As a coding exercise, I decided to revisit the code to see, firstly, whether my algorithm above would really work, and secondly, how much more (or less) performant a graph-based approach would be. Expect articles in the near future that show the code I have written for each approach and an analysis of the pros and cons of each!