Hi everyone! I know I said I was going to start work on a weight tracking app next, but I came up with something a bit more fun! I was crawling through some of the fascinating datasets on Kaggle looking for some inspiration, and found a set of over 200,000 Jeopardy questions!1

My plan is to use this to learn a little more about front-end stuff, which is still my weakest area in the software development world. The idea is to build a page that generates random combinations of previous Jeopardy games, potentially allowing two people to play against each other in some kind of shared session.

First things first though, I need to get the raw data from Kaggle into a format that I can use in my front end. Python seems like a sensible tool to do that, and the Pandas library should make that process easy. The first thing I did was to write a simple script that would suck the data out of the CSV and dump it into a SQLite database. Thanks to the magic of Pandas and SQLAlchemy, the code is alarmingly simple…

import pandas as pd
from sqlalchemy import create_engine

df_questions = pd.read_csv('JEOPARDY_CSV.csv', parse_dates=[1])
engine = create_engine('sqlite:///questions.db', echo=False)
df_questions.to_sql('questions', con=engine, if_exists='replace')

And that’s it! Done! Time for a pint! Pandas automatically reads the CSV (including headers) and parses the dates into real DateTime objects, then SQLAlchemy creates a SQLite database file and dumps the data into it. Five lines including import statements; even for Python, that’s impressive.

However, there are some problems with this. The first is that the column names in the CSV have spaces all over the place, which is not ideal since both SQL and Pandas make you jump through all sorts of hoops to address columns with spaces in the name. However, that’s an easy fix with a renaming dict, which turns the code above into something like this:

import pandas as pd
from sqlalchemy import create_engine

column_rename = {"Show Number": "show_identity",
                 " Air Date": "air_date",
                 " Round": "round_name",
                 " Category": "category",
                 " Value": "value",
                 " Question":"question",
                 " Answer": "answer"}

df_questions = pd.read_csv('JEOPARDY_CSV.csv', parse_dates=[1])
df_questions.rename(columns = column_rename, inplace=True)
engine = create_engine('sqlite:///questions.db', echo=False)
df_questions.to_sql('questions', con=engine, if_exists='replace')

OK, so I’m now up to 7 lines. Still not bad. However, another problem remains, and that is the normalization of the data (or lack thereof). I’ll talk through the pros and cons of normalizing a dataset in a moment, but first, here’s a look at the current state of the database:

Show Number  Air Date    Round             Category                       Value  Question
5462         2008-05-13  Double Jeopardy!  Who’s Your Mommy?              $800   King Solomon
5462         2008-05-13  Double Jeopardy!  It’s all from the Greek to me  $800   An institution for the care…
5462         2008-05-13  Double Jeopardy!  Dr. Drew                       $1200  People 18 to 25 are vulnerable to…
5462         2008-05-13  Double Jeopardy!  Recent Films                   $1200  The attempted assassination…
5462         2008-05-13  Double Jeopardy!  Music Class                    $2000  This large woodwind…
5462         2008-05-13  Double Jeopardy!  Lend me your Iroquois          $1200  The Iroquois Nationals…
5462         2008-05-13  Double Jeopardy!  Who’s Your Mommy?              $1200  Invention (proverbially)

This is a random sample of 7 rows from the dataset, and the lack of normalization is pretty obvious. Of the 7 columns shown here, only two contain unique values (Question and Answer), two contain repeated values, and three contain identical values throughout! In fact, of the 216,930 unique questions in this dataset, there are only 3,640 unique shows and 47,200 unique categories. So there’s certainly huge scope for normalization here!
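For reference, these uniqueness counts fall straight out of Pandas’ nunique(); here’s a minimal sketch on toy data (against the real data, the call would simply be df_questions.nunique() after the column rename):

```python
import pandas as pd

# Toy stand-in for the real questions DataFrame
df = pd.DataFrame({
    'show_identity': [5462, 5462, 4999],
    'category': ['Music Class', 'Music Class', 'Birds'],
    'question': ['q1', 'q2', 'q3'],
})

# nunique() counts the distinct values in each column
counts = df.nunique()
print(counts['show_identity'], counts['category'], counts['question'])  # 2 2 3
```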

I’ll go through the normalization steps I went through in a moment, but as a starting point, here’s the schema diagram for the current unnormalized table:

The data as pulled from Kaggle doesn’t even really conform to the Unnormalized Form2, since it doesn’t have an explicitly defined primary key (however Pandas was kind enough to create one for me automatically when I loaded the CSV). It does however have no duplicate rows, so it’s halfway there.

Normalizing The Dataset

Typically, before I start writing any database code I want to design my schema to figure out how I’m going to represent the data. I design relational databases for a living, and I’ve developed a pretty simple approach to the initial coarse-grained normalization of a database schema.

The trick is to figure out what each table conceptualizes.

Throughout my career, I have typically been involved in designing and building databases to store engineering data. If I need to build a database to store mechanical testing data (for example), I’m probably going to need tables about the material being tested (both the specification of the material and the specific instance of the material), the machine that’s doing the testing, the results of the test, the calculations performed on those results, etc.

In this case, the key Jeopardy concepts I came up with were:

  • The Show (show number, air date)
  • The Rounds (Jeopardy!, Double Jeopardy!, Final Jeopardy!)
  • The Categories (History, Brass, Birds, “T”elevision)
  • Question & Answers

Everything associated with Jeopardy (except for the contestants and Aaaaalex Trebek!) fits into one of those buckets, so that sounds like a pretty good way to go about designing the normalized database. To formalize things a bit, here it is in UML:

This feels like the appropriate way to construct the database since it relates the key concepts of a Jeopardy game in a way that makes conceptual sense, e.g. multiple categories are included in a single show, and each category contains multiple questions and answers. Each question and answer pair belongs to a single category, and has a specific ‘value’.

However, I want to be sure I’ve considered the most significant possible ways of normalizing the data, so I’ll now go through the first three normal forms to make sure the database either adheres to that form, or that I have a justification for not doing so.

1st Normal Form (1NF)

The First normal form simply states that the columns in a table should be atomic (i.e. indivisible). We were actually already there with the initial CSV, so we pass this first test straight off. Great!

2nd Normal Form (2NF)

The Second normal form gets a bit trickier. To pass this one, we need to ensure that every column that is not part of a candidate key depends on the entire candidate key, and not just a part of it. A common ‘smell’ associated with breaking 2NF is that there’s a lot of duplicated data.

Our original database didn’t satisfy 2NF all over the place. A candidate key for our original database was the Question itself, but nothing except the Answer actually depends on the Question. Pulling the Question, Answer, and Value columns into a separate table helps with this, and then doing the same for the category name ensures all tables are compliant with 2NF.

3rd Normal Form (3NF)

The Third normal form states that every non-prime column is non-transitively dependent on every key in that table, i.e. there should not be a column whose values depend on something other than a key. Again, this is something that my schema satisfies due to the creation of separate tables for each entity.

For example, in the original dataset, the air date of the show is functionally dependent on the show_identity and not on the question candidate key. Therefore there’s nothing to stop the same episode identity from having multiple different air dates. This is a violation of 3NF.

The normalized schema creates a dedicated Show table with the show identity, and the air_date is dependent purely on this column.
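One way to sanity-check a functional dependency like this in Pandas is to group by the proposed determinant and count the distinct values of the dependent column; a sketch on toy data:

```python
import pandas as pd

# Toy shows data: each show identity should map to exactly one air date
df_shows = pd.DataFrame({
    'identity': [5462, 5462, 4999],
    'air_date': ['2008-05-13', '2008-05-13', '2005-11-02'],
})

# air_date is functionally dependent on identity iff no show
# identity is associated with more than one distinct air date
dates_per_show = df_shows.groupby('identity')['air_date'].nunique()
print((dates_per_show <= 1).all())  # True
```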

Pros and Cons of Excessive Normalization

Before I go on to the code that performs the normalization, I want to address the question of whether too much normalization is a bad thing. The answer in my experience is that you have to determine that on a case-by-case basis, based on both the data that you are storing and the way that data is being used. It’s pointless having a theoretically perfect database if you have to perform 30 joins that incur a huge performance penalty, and even worse if the front-end tool you are using just can’t do the joins that you need!

With that being said, I always like to start with an over-normalized database since during the implementation phase of a project it’s generally easier to denormalize a database than it is to normalize a database. Also, you only realize what you need to denormalize once you actually try to use the database. So I think it’s always better to over-normalize early on in a project, and then denormalize as the front-end comes together and additional constraints become clear.

The Code

The code to normalize the database makes extensive use of copying subsets of DataFrames in Pandas, dropping columns, removing duplicates, and mapping values. Most of it was pretty straightforward, but one particular area that required some figuring out was where I generate the category table. This step required creating an entirely new identity that was dependent on the category name, the round, and the show (since category names could be re-used between shows). My original approach to doing this was to:

  1. Copy the questions DataFrame to a ‘category’ DataFrame
  2. Drop the columns I didn’t want (value, question, answer)
  3. Delete duplicate rows
  4. Re-generate the DataFrame index
  5. Find the category identity (index) based on each round, category, show tuple in the questions DataFrame
import swifter  # third-party; parallelizes DataFrame.apply

# Create a categories table
df_categories = pd.DataFrame(columns=['show_identity', 'round_name', 'name'])

df_categories['show_identity'] = df_questions['show_identity'].to_numpy()
df_categories['round_name'] = df_questions['round_name'].to_numpy()
df_categories['name'] = df_questions['category'].to_numpy()
df_categories.drop_duplicates(inplace=True, ignore_index=True)

def get_category_id(question):
    show_identity = question.show_identity
    round_name = question.round_name
    category = question.category

    category_id = df_categories[(df_categories['round_name'] == round_name) &
                                (df_categories['name'] == category) &
                                (df_categories['show_identity'] == show_identity)].index[0]
    return category_id

# Get the category ID for each question
category_ids = df_questions.swifter.apply(get_category_id, axis=1)

# Add the category identity to the questions table, and delete unneeded columns
df_questions.insert(loc=0, column='category_identity', value=category_ids)

This just felt wrong when I was writing it; doing 200,000 lookups across three columns in a data frame is never going to be efficient, and even with the swifter3 library that parallelized the operation across my 12-core Ryzen it still took over 20 minutes to complete.

The better approach was to create the category index before I duplicated the DataFrame, that way the cross-referencing was already done without having to resort to a whole load of lookups. The steps to do this were as follows:

  1. Create the category DataFrame and copy the relevant columns
  2. Create the composite key based on those relevant columns and create a map between those values and an int
  3. Map the composite key in the category DataFrame
  4. Map the composite key in the question DataFrame, and drop the columns that were moved to the category DataFrame

This cut the time taken to build the category table from around 20 minutes to about 10 seconds, an improvement factor of about 120! You can see that code below, starting from the comment line # Create a categories table.

import pandas as pd
from sqlalchemy import create_engine
import numpy as np

column_rename = {"Show Number": "show_identity",
                 " Air Date": "air_date",
                 " Round": "round_name",
                 " Category": "category",
                 " Value": "value",
                 " Question":"question",
                 " Answer": "answer"}

df_questions = pd.read_csv('JEOPARDY_CSV.csv', parse_dates=[1])

df_questions.rename(columns = column_rename, inplace=True)

# Turn the value into an integer
df_questions['value'] = df_questions['value'].map(
    lambda a: int(a[1:].replace(',', '')) if a != 'None' else np.nan)

# Create a shows table, and delete corresponding columns in questions table
df_shows = pd.DataFrame(columns=['identity', 'air_date'])
df_shows['identity'] = df_questions['show_identity'].to_numpy()
df_shows['air_date'] = df_questions['air_date'].to_numpy()
df_shows.drop_duplicates(inplace=True, ignore_index=True)
df_shows.set_index('identity', inplace=True, append=False)
df_questions.drop('air_date', axis=1, inplace=True)

# Create a categories table
df_categories = pd.DataFrame()

df_categories['show_identity'] = df_questions['show_identity'].to_numpy()
df_categories['round_name'] = df_questions['round_name'].to_numpy()
df_categories['category'] = df_questions['category'].to_numpy()

# Define the composite key that uniquely identifies a category
category_composite_key = ['show_identity', 'round_name', 'category']
df_categories.drop_duplicates(inplace=True, ignore_index=True)

old_category_identities = df_categories[category_composite_key].apply(
    lambda row: '_'.join(row.values.astype(str)), axis=1)
new_category_identities = range(len(old_category_identities))
category_identity_mapping = dict(zip(old_category_identities, new_category_identities))

def map_category_identity(category):
    composite_key = '_'.join(category.values.astype(str))
    return category_identity_mapping[composite_key]

df_categories['identity'] = df_categories.apply(map_category_identity, axis=1)
df_categories.rename(columns={'category': 'name'}, inplace=True)
df_categories.index.rename('identity', inplace=True)
df_categories.set_index(keys=['identity'], append=False, inplace=True, drop=True)

# Perform the same mapping on the questions themselves, and drop the individual elements of the key
df_questions.insert(0, 'category_identity',
                    value=df_questions[category_composite_key].apply(map_category_identity, axis=1))
df_questions.drop(category_composite_key, axis=1, inplace=True)
df_questions.index.rename('identity', inplace=True)

# Create a rounds table
round_name_mapping = {'Jeopardy!': 0, 'Double Jeopardy!': 1, 'Final Jeopardy!': 2, 'Tiebreaker': 3}
df_rounds = pd.DataFrame.from_dict({'round_name': round_name_mapping.keys()})
df_rounds.index.rename('identity', inplace=True)

round_identity = df_categories.apply(lambda c: round_name_mapping[c.round_name], axis=1)
df_categories.insert(loc=0, column='round_identity', value=round_identity)
df_categories.drop('round_name', axis=1, inplace=True)

# Export to SQLite
engine = create_engine('sqlite:///questions.db', echo=False)
df_questions.to_sql('questions', con=engine, if_exists='replace')
df_shows.to_sql('shows', con=engine, if_exists='replace')
df_categories.to_sql('categories', con=engine, if_exists='replace')
df_rounds.to_sql('rounds', con=engine, if_exists='replace')

Referential Integrity

The code above successfully creates the different tables in SQLite with the right columns, which allow all the various joins to be performed to find out which questions appeared on a certain show (through the category table), or ultimately even to re-create the original CSV.

However, the SQLite database doesn’t have any constraints to ensure that new data added to the database satisfies the requirements for the joins described above. And unfortunately, since I’m using SQLite at the moment, there’s no simple way to apply constraints to tables that are created with DataFrame.to_sql().4 Since this is still in the prototype stage it’s not really critical that I handle this now, but before going into production I’d probably change to a MySQL or Postgres backend, and at that point I’d want to make sure my referential integrity is all squared away.
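For what it’s worth, if I did want constraints while staying on SQLite, one option (a sketch, not what the script above does) would be to create the tables by hand with foreign keys, switch the foreign_keys pragma on, and then append into them with to_sql(..., if_exists='append'):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA foreign_keys = ON')  # FK enforcement is off by default in SQLite
conn.execute('CREATE TABLE shows (identity INTEGER PRIMARY KEY, air_date TEXT)')
conn.execute('''CREATE TABLE categories (
                    identity INTEGER PRIMARY KEY,
                    show_identity INTEGER NOT NULL REFERENCES shows(identity),
                    name TEXT)''')

conn.execute("INSERT INTO shows VALUES (5462, '2008-05-13')")
conn.execute("INSERT INTO categories VALUES (0, 5462, 'Music Class')")  # fine

try:
    # References a show that doesn't exist, so the FK constraint fires
    conn.execute("INSERT INTO categories VALUES (1, 9999, 'No Such Show')")
    print('orphan row accepted')
except sqlite3.IntegrityError:
    print('orphan row rejected')
```

A DataFrame appended into these tables with if_exists='append' would then be subject to the same checks, rather than to_sql() silently replacing the schema.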

Next Steps

As I mentioned at the top, my goal for this is to be able to create something in a front-end that maybe lets me play a randomly generated game of Jeopardy, potentially with other people in some kind of shared session over the web.

However, the first step in that is going to be building a REST API that can return a random Jeopardy ‘game’ when requested. I’ll probably implement the first pass in Flask, but I also want to give some other languages a try; I’ve done some very basic work with Spring Boot in Java, and I think it would make a lot of sense to use that here as well. Maybe have the backend in Spring Boot and the front-end in a Flask-based web app…

Anyway, watch this space and see what I come up with!

The Surveying Problem Part 7: Analysis and Conclusions


This final entry in the series will attempt to summarize the results presented so far in terms of the different solutions found, and explore the different parameters in the problem that might influence the appropriate solution.

I’ve written 5 different solutions to this problem, in order of implementation they are:

  • A recursive depth-first search
  • Using NetworkX
  • Using iGraph
  • Using graph-tools
  • A stack-based depth-first search

To figure out which approach is the most efficient, I need a DoE (Design of Experiments) that takes into consideration the different parameters that might affect performance. I’ve already done some investigation into the amount of time spent building the network compared to finding the clusters in the network, and some approaches are likely better suited to situations where the average reservoir size is higher or lower.

To investigate this, the first thing I looked into was the dependence of the average reservoir size on the grid size and probability threshold. It’s not strictly necessary for figuring out which approach is most performant, but I thought it might be an interesting thing to investigate.

Number of reservoirs vs probability and grid size

This isn’t really a property of the solution, it’s more of a property of the problem itself. Here’s a plot of how the average reservoir size (and therefore the ratio of active sites to grid size) changes as a function of the Probability Threshold (i.e. the number between 0 and 1 I use to determine if a site is ‘active’ or not) and the grid size. I’m only giving one dimension, but the grid itself is square.

Figure 1: Dependence of average reservoir size on grid size and probability threshold

If there were no clustering I’d expect this to be dependent on probability threshold only, although with much higher variability at low grid size because the sample size is so much smaller. However, the clustering step adds a dependency on grid size for small grid sizes only. This is because I’m not using any boundary conditions in my grid, it’s just a hard edge. For a grid size of 10×10, there are 100 elements, but 32 of those only have 5 neighbors instead of 8. And 4 only have 3 neighbors! The table below summarizes the average number of neighbors per site for each grid size:

Grid Size    Number of Edge + Corner Elements  Average Number of Neighbors
10×10        32 + 4                            6.84
50×50        192 + 4                           7.762
100×100      392 + 4                           7.880
500×500      1,992 + 4                         7.976
1000×1000    3,992 + 4                         7.988
5000×5000    19,992 + 4                        7.998
10000×10000  39,992 + 4                        7.999

There’s a pretty clear correlation here between the drop-off of average neighbors and the drop-off of average reservoir size. Once the grid gets to 100×100, the average number of neighbors is within 98% of the 8 neighbors of an infinite grid, so we’d expect 100×100 to be the first grid size at which the average reservoir size stops changing, and that’s basically what figure 1 shows.

I’m not a mathematician, but I’ve got a pretty strong feeling that there’s an analytical relationship between the grid size and the number of neighbors. There’s nothing here that really relates to the puzzle, but it’s an interesting geometrical quirk of the way I’m building the grid. Getting to the bottom of this is something else to add to the backlog!
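For what it’s worth, a closed form does seem to drop out of simply counting the three types of site: in an n×n grid the 4 corners have 3 neighbors, the 4(n−2) edge sites have 5, and the (n−2)² interior sites have 8. A quick sketch reproduces the table above:

```python
def average_neighbors(n):
    """Mean 8-connectivity neighbor count for an n-by-n grid with hard edges."""
    corners = 4 * 3                  # 4 corner sites, 3 neighbors each
    edges = 4 * (n - 2) * 5          # edge (non-corner) sites, 5 neighbors each
    interior = (n - 2) ** 2 * 8      # interior sites, 8 neighbors each
    return (corners + edges + interior) / n ** 2

print(round(average_neighbors(10), 3))   # 6.84
print(round(average_neighbors(100), 3))  # 7.88
```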

Designing the Experiment

Anyway, back to the problem at hand. It’s clear that both the Probability Threshold and Grid Size have an impact on the problem, so my DoE was to run at grid sizes of 10-10000 in logarithmic increments, and probability thresholds of 0.99 (obviously 1.00 is trivial) and 0.95 to 0.6 in 0.05 increments. I was previously unable to hit 10,000×10,000 due to RAM constraints, but I recently treated myself to another 16 gig, so it’s now within reach!

To avoid letting this run for ages, I started off just running the 4 extreme cases with the entire set of algorithms, the idea being that if I see 2 implementations that are fastest in all 4 cases, I’ll run the entire DoE on those 2 only.

But disaster! I hit a recursion limit with the threshold set to 0.6! Even after raising the recursion limit with sys.setrecursionlimit, Python was still segfaulting, which means I was probably hitting my system stack size limit. I could have increased this stack size, but I decided just to back the DoE limit off to 0.7 instead. I did check the populated fraction of the grid at a threshold of 0.6, and it had climbed to an impressive 63%! The default Python recursion limit is 1000, so I must have had a reservoir that exceeded that limit.

Here are the results of my sparse run for all solutions at those four extremes of the parameter ranges (yes, I stole this plot from the previous article):

Figure 2: Comparison of execution times for all methods for extreme parameter values

It’s clear here that the Stack and Recursive methods are the fastest, so I’ll disregard the external library solutions for the rest of the analysis.

Recursive Vs Stack

So let’s take a deeper look into it. Here’s the data across the full parameter range for both the recursive and the stack solutions.

Figure 3: Performance of recursive vs stack DFS algorithms

I experimented with a surface plot to show the explicit dependence on both parameters with all the cross-terms included, but it was difficult to really see what was going on. So I settled instead on the slices along the edges of the parameter space.

The stack approach performs either similarly to or better than the recursive approach in all situations, so it’s clearly the optimum solution. However, there are some situations, both at small grids and at medium probabilities, where the difference between them really isn’t that great. I’m not entirely sure why, but it must be that the time saved by not doing a whole load of recursing balances against the set subtractions required in the stack approach.
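For context, the core of the stack approach (illustrative names here, not my actual implementation verbatim) boils down to something like this, with a set subtraction happening on every pop:

```python
from collections import deque

def find_reservoirs_stack(active_sites):
    """Cluster contiguous active sites (8-connectivity) using an explicit stack."""
    remaining = set(active_sites)
    reservoirs = []
    while remaining:
        stack = deque([remaining.pop()])
        reservoir = set()
        while stack:
            x, y = stack.pop()
            reservoir.add((x, y))
            # Unvisited neighbors of this site: this is the set
            # intersection/subtraction work discussed above
            neighbors = {(x + dx, y + dy)
                         for dx in (-1, 0, 1)
                         for dy in (-1, 0, 1)
                         if (dx, dy) != (0, 0)} & remaining
            remaining -= neighbors
            stack.extend(neighbors)
        reservoirs.append(reservoir)
    return reservoirs

# Two diagonal neighbors form one reservoir; (5, 5) is isolated
print(sorted(len(r) for r in find_reservoirs_stack({(0, 0), (0, 1), (5, 5)})))  # [1, 2]
```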

Set subtraction is an O(n) operation, so the total time spent performing set subtractions scales linearly with the grid size. Recursion, however, becomes more of an issue as the average reservoir size increases. To see why, consider that if the average reservoir size were 1, no recursion would be necessary at all. Therefore, large reservoir sizes will have a disproportionately negative effect on the recursive approach compared to the stack approach. With all that being said, I don’t know why I’m seeing the behavior I am: Figure 3 shows that the difference between the two seems more dependent on the grid size than the probability threshold, while figure 1 shows that reservoir size depends on the probability threshold only beyond a grid size of 100.

The only reason I can think of is that there’s some other effect that’s slowing the recursive method down more for increasing grid size which is dwarfing the real effect of recursive vs stack. One possible culprit might be the set copy operation I perform at the beginning of the recursion, so I can continue to find neighbors of the site in question. It would make sense that as the set gets very large it becomes more expensive to copy it.

If this were the other way around and I was seeing a performance hit in the stack implementation, I would look into it more. However, I know that the stack implementation is the right one in Python, so I’m going to leave it as is. Another task for the backlog, though, is to see if I can get rid of that set copy operation and try to improve the performance of the recursive approach.

List vs Deque

The final step is to investigate which stack implementation is the most performant for this particular scenario. I mentioned previously that both lists and deques can be used for this, since they both support LIFO access. Here are the results of running the same set of experiments on a list version of the code and a deque version of the code.
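As a quick illustration of why the two are drop-in replacements for this purpose (toy values, obviously):

```python
from collections import deque

stack_list = [1, 2, 3]
stack_deque = deque([1, 2, 3])

# Both push with append() and pop from the same end with pop(),
# which is all a LIFO stack needs
stack_list.append(4)
stack_deque.append(4)
print(stack_list.pop(), stack_deque.pop())  # 4 4
```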

Figure 4: Effect of list vs deque on the performance of the stack algorithm

There’s really no difference here. With the exception of the results at 5000, 0.99 and 10, 0.75, they behave pretty much identically. This isn’t surprising: the average size of the stack is probably only around 20 items, since at worst we have a large forest of very small trees.

Conclusions and Discussion

So, the best implementation here is the stack-based implementation using the deque-based stack, which is pretty uncontroversial. However, is this actually a practical solution if this were a real-world problem? To start thinking about this, I’m going to focus on the North Sea.

There doesn’t seem to be any information easily available on the total areas that are surveyed, but the North Sea itself is 220,000 square miles, so that’s a fair place to start. The maximum grid size that I considered here was 10,000×10,000, which when mapped to the North Sea would give a resolution of approximately 2.2 × 10^-3 square miles (1.5 acres, 0.5 hectares) per cell. That’s probably far finer than necessary; a grid size of 500×500 would give a resolution of 0.88 square miles, which seems much more reasonable.
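The arithmetic behind those resolutions, taking 220,000 square miles for the North Sea’s area:

```python
north_sea_area = 220_000  # square miles

# Area per grid cell for the two candidate grid sizes
for n in (10_000, 500):
    cell_area = north_sea_area / n ** 2
    print(f'{n}x{n}: {cell_area} square miles per cell')
```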

There are 173 active rigs in the North Sea. My crude model for building and clustering the grid with a probability threshold of 0.95 generated a 500×500 grid with about 300 reservoirs, so it’s not a million miles away from reality. If this were required to be production code then obviously I’d be given the grid to work with, however the signs here point to Python being a perfectly performant solution for this scenario.

However, there are two realistic-type extensions that I can immediately think of that might be required here.

The first is that we are treating this scenario as a 2-dimensional investigation only. The mathematics of this solution could easily be translated into 3 or more dimensions, but obviously the 3-dimensional case is the only extension that’s actually physically useful. Oil deposits typically form in the strata of the continental shelf, so this would probably only require the addition of 10 or so sites in the z direction.

The second is that there may be other analytics required beyond just determining the number and size of the reservoirs. Some simple additions could be made to the code, but if any other analytics were needed that more closely followed a graph-theory type analysis, I’d certainly move to a library instead of writing custom code. It would then depend on the trade-off between maintainability vs performance vs licensing requirements as to which library I’d go for:

  • graph-tool is the most performant, but it’s LGPL licensed and is a huge pain to compile
  • NetworkX is the slowest, but it’s a simple pip install, even on Windows, and it’s released under a BSD license, so there are no issues with commercial reuse


So that ends this investigation! It was a simple throw-away question in an interview, but the question had some really interesting mathematical features and potential solutions, so I thought it was worth digging into in a bit more detail.

I plan to polish and publish the code, and then I’ll start working on my next project. I have a few ideas, but I’m happy for any inspiration on what I can do next!

The Surveying Problem Part 1: The Problem

During the technical component of an interview for a software developer position I was posed the following problem:

You are working for a company that is responsible for processing survey data for an Oil and Gas company. You are given data that summarizes whether oil was found at a certain grid reference. The assumption is that if oil is found in contiguous grid references (or grid elements), then they form a single oil reservoir. The objective is to write a program that will determine the number and locations of the reservoirs.

The Algorithm

The interview went pretty well, and I managed to write something within 10-15 minutes or so that I think did the job. The coding exercise was done in a shared Google Doc, so I never actually got the chance to run the code, but my interviewers seemed happy with the solution. The algorithm I came up with looked something like this:

  1. Create a new active reservoir
  2. For each element in the grid:
    1. If the element is in the checked_locations set, go to the next element
    2. Add the element to checked_locations
    3. Add the element to the currently active reservoir
    4. Check if there are any neighbors of this element
      1. If yes: go to 2 (passing in the list of neighbors as the grid)
    5. If we are not currently recursing:
      1. Add the reservoir to the result set
      2. Create a new active reservoir
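A sketch of roughly that algorithm in Python (illustrative names, not the interview code verbatim):

```python
def neighbours(site):
    """All 8 surrounding grid references of a site."""
    x, y = site
    return {(x + dx, y + dy)
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)}

def grow_reservoir(site, remaining, reservoir):
    # Add the site to the active reservoir, then recurse into
    # any of its neighbours that haven't been checked yet
    reservoir.add(site)
    for n in neighbours(site) & remaining:
        if n in remaining:  # may have been claimed deeper in the recursion
            remaining.discard(n)
            grow_reservoir(n, remaining, reservoir)

def find_reservoirs(active_sites):
    remaining = set(active_sites)  # doubles as the checked_locations bookkeeping
    reservoirs = []
    while remaining:
        reservoir = set()
        grow_reservoir(remaining.pop(), remaining, reservoir)
        reservoirs.append(reservoir)
    return reservoirs

# Two diagonal neighbours form one reservoir; (4, 4) is isolated
print(sorted(len(r) for r in find_reservoirs({(0, 0), (1, 1), (4, 4)})))  # [1, 2]
```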

The Discussion

When I was asked to defend my code, I said I was initially hesitant about going down a recursive route, because Python is generally not a great language for writing functional-style recursive functions. However, a purely iterative approach would mean having to run an arbitrary number of times to see if there were any adjacent reservoirs that could be joined, which also isn’t ideal.

As I said, my interviewers seemed happy with my solution, so I assume this was the solution they had in mind. However, I really wasn’t satisfied with the recursive approach, and I suggested a graph-based approach instead. The crux of the problem is taking a grid of data and understanding the connectivity, which is exactly what graphs are intended to capture. I have a bit of experience with the iGraph and NetworkX Python libraries and knew that a graph-based solution would probably need half as many lines and be as performant as the recursive approach, if not more so.

The discussion went along the lines of trading one form of complexity for another. I argued on the side of trying to shift the complexity into the standard library, even potentially at the expense of performance. Yes, a full graph library is maybe overkill for a relatively simple problem. However, I’m always in favor of writing (and maintaining!) as little code as possible.


I think the discussion was well-received, but alas I didn’t end up getting the job (probably more to do with COVID-19 than anything else). As a coding exercise, I decided to revisit the code to see firstly whether my algorithm above would really work, and secondly how much more (or less) performant a graph-based approach would be. Expect articles in the near future that show the code I have written for each approach and an analysis of the pros and cons of each!