Building on the data ingestion pipeline I showed last time, today I went a bit further and started to build out a website based on that data! It’s still very early days, but I did get far enough through to have a functioning JSON endpoint. Here are the steps that I went through to get to an initial basic version of JeoPydy!

Database enhancements

My first step was to finish the data ingestion pipeline by adding constraints to the schema. I did this by defining the schema explicitly in SQL instead of relying on the auto-generated schema from DataFrame.to_sql(), which gives me greater control over the resulting database:

  round_name TEXT NOT NULL
  air_date TEXT NOT NULL
CREATE TABLE category (
  round_id INTEGER NOT NULL,
  FOREIGN KEY (round_id) REFERENCES round (id),
  FOREIGN KEY (show_id) REFERENCES show (id)
CREATE TABLE question (
  category_id INTEGER NOT NULL,
  value INTEGER,
  question TEXT NOT NULL,
  answer TEXT,
  FOREIGN KEY (category_id) REFERENCES category (id)

The key things I’m doing here are:

  • Types: Data frame columns are also typed, so these would have been mapped across to the SQL table anyway, but I still think it’s good to have more explicit control over these, especially since my types are not dynamic
  • Constraints: I have set the id columns to be autoincremented primary keys, and I have also set foreign key constraints for the relevant columns to ensure referential integrity, i.e. that a record I create in a linked table has a valid reference to the foreign table. The foreign key constraint is useful to make sure I’m not accidentally creating any orphaned rows during the ingestion pipeline, but the autoincremented primary key will only really be useful if I start adding data to the database through a separate method.

It’s worth saying I could probably use SqlAlchemy to do this instead of writing pure SQL, but that’s a project for another time.


My next step was to write the actual web API. The eventual aim of the project is to write a website that will allow interactions, but my front-end experience is limited, so it makes sense to start with what I know. The HTML and CSS trickery can come later.

In previous projects, I have used the base http.server1 provided in core Python, but I wouldn’t do it again. Recent versions of python (3.7+ I think) have added a new class called ThreadingHTTPServer that serves each request in a separate thread which results in much more performant code when dealing with concurrent requests, but there’s still a whole lot of plumbing that’s needed just to write a server that can do basic GET calls. And then if you want anything like authentication, SSL, session support, etc., you have to manually implement it. It’s kind of fun manually implementing the code for how to handle every method on every endpoint, but only up to a point…

The alternative is the Flask2 framework, and the major advantage is that does all this stuff for you. I don’t have a whole lot of experience with Flask, but I have previously worked my way through the excellent Flask tutorial3 which does a great job of introducing the concepts and philosophy of Flask in the context of a real application. So many package tutorials I have worked through neglect to do this, and it makes the learning curve a lot steeper to implement the package in a real project. Having a tutorial that both follows best practices and results in a real production-ready application makes it so much easier to understand how to apply the package to your specific requirements, as opposed to just what it does from a functionality perspective. It’s so good, it’s also a great place to start from if you are writing your own code, which I unashamedly did here. It’s released under the 3-clause BSD license4, so there’s no legal or ethical issues with doing this either.

The relevant parts of the code are the application factory, database initialization function, and then a single blueprint for accessing the data and returning a dictionary as a json response:

import os
from flask import Flask

def create_app(test_config=None):
    # create and configure the app
    app = Flask(__name__, instance_relative_config=True)
        DATABASE=os.path.join('jeopydy', 'jeopydy.db'),

    # ensure the instance folder exists
    except OSError:

    from . import db

    from . import round

    return app
import sqlite3
import click
from flask import current_app, g
from flask.cli import with_appcontext
import jeopydy.build_database

def get_db():
    if 'db' not in g:
        g.db = sqlite3.connect(

        g.db.row_factory = sqlite3.Row
    return g.db

def close_db(e=None):
    db = g.pop('db', None)
    if db is not None:

def init_db():
    db = get_db()

    with current_app.open_resource('schema.sql') as f:

def init_db_command():
    """Clear the existing data and create new tables."""
    click.echo('Initialized the database.')

def init_app(app):
from flask import (
    Blueprint, flash, g, redirect, render_template, request, url_for
from werkzeug.exceptions import abort
import datetime

from jeopydy.db import get_db

bp = Blueprint('round', __name__)

def get_round():
    db = get_db()
    round = db.execute(
        "SELECT, s.air_date, value, question, answer "
        "FROM question as q "
        "INNER JOIN category c on = q.category_id "
        "INNER JOIN show s on = c.show_id "
        "WHERE q.category_id in ("
            "SELECT "
            "FROM category as c "
            "WHERE c.round_id == 0 "
            "ORDER BY RANDOM() LIMIT 6) "

    response = dict()
    for category_name, air_date, value, question, answer in round:
        category_date = f"{category_name} " \
                        f"({datetime.datetime.strptime(air_date, '%Y-%m-%d %H:%M:%S').strftime('%B %-d, %Y')})"
        if category_date not in response:
            response[category_date] = dict()
        response[category_date][value] = question, answer
    return response

def api():
    return get_round()

The tutorial referenced above gives a thorough run-through of what each line/block does, so I’ll only cover the parts specific to my project here.

The presence of an file denotes the folder it is in as a package, and the code within that file is executed whenever that package is imported. In the world of Flask, that means the Flask application can be written as its own package, which helps in managing and organizing the code, e.g. separate parts of the application can be encapsulated into separate modules, allowing a clearer separation of concerns. My application is nowhere near that scale yet, but as I start adding the concept of users, PvP games, etc., it probably will be.

The code itself is pretty basic, it just implements a factory pattern to create the Flask app class instance (JeoPydy – every Python application needs a witty name, right?) and registers my API blueprint defined below.

The file is responsible for creating the database if it doesn’t already exist, and managing the connection to the database. Here I import the module I wrote in the previous post that actually processes and imports the data, but first I execute the SQL schema above to create the tables with the appropriate constraints.

One really cool feature here is the @click.command('init-db') decorator. The Click library5 lets you turn a function into a command-line application with barely any code. In this case, adding the decorator to the init_db_command() function lets us do flask init-db from the command line, which automatically set up the Flask app and initializes the database, without us having to worry about calling Python and calling it with a specific function of a specific module. Beautiful.

Finally, actually defines the API endpoint that returns the questions and answers for a Jeopardy round. The query is included in the Python code above, but I’ll break it out here separately:

SELECT, s.air_date, value, question, answer
FROM question as q
INNER JOIN category c on = q.category_id
INNER JOIN show s on = c.show_id
WHERE q.category_id in (
    FROM category as c
    WHERE c.round_id == 0

The main query selects a set of rows from the question table, along with the Category Name (very important!) and original Air Date (sometimes important for contemporaneous reasons) for each. However, the query needs to limit the set of rows to all of those linked to 6 randomly selected Jeopardy (or Double Jeopardy) categories. I achieved this through a sub-query, where I first randomly select 6 categories and then feed this into the WHERE statement for my main query.

Performance for this is pretty good, it only takes around 50 ms from an SQLite database. Before this goes into production I’ll switch to a more scalable database engine, I have the most experience with MS SQL Server, and since it’s now available in Linux (and there’s an official Docker image6 – good job Microsoft!), I might give that a go. However, the current situation is fine for this stage of the project. If performance becomes an issue later on I’ll look to denormalize the database to avoid the need for the sub-query, but it’s working well enough right now.

The rest of the code sets up the blueprint. The blueprint contains a single function mapped to the '/api' path, which returns the results of the SQL query above transformed into a JSON payload. Another great feature of Flask is that I don’t need to worry about serializing the JSON before I return it, I just return the dict and Flask figures out what to do automatically!


Here’s the response from the endpoint:

  "BUSINESS & INDUSTRY (December 23, 2002)": {
    "200": [
      "This company serving about 30 million customers features the following in its ads [(actor walking across the <i>Jeopardy!</i> stage)  <i>Can you hear me now?  Good!  Can you hear me now?  Good!</i>]",
      "Verizon Wireless"
    "400": [
      "It was once described as 2 \"beautifully embossed, choc. flavored wafers with a rich cream filling\"",
      "Oreo cookies"
    "600": [
      "This Australian-born moguls's first U.S. newspaper acquisition was the San Antonio Express & News",
      "Rupert Murdoch"
    "800": [
      "The \"K\" in K-Mart's name stands for this founder",
      "S.S. Kresge"
    "1000": [
      "The lion on this French auto company's logo was adopted from Belfort, the city where an early model was made",
  "EPITAPHS (March 26, 1999)": {
    "100": [
      "\"Author of the Declaration of American Independence...\"",
      "Thomas Jefferson"
    "200": [
      "\"Quoth the raven nevermore\"",
      "Edgar Allan Poe"
    "300": [
      "\"Truth and history.  21 men.  The boy bandit king -- he died as he had lived\"",
      "Billy the Kid"
    "500": [
      "\"Man of 1000 Voices\"",
      "Mel Blanc"
    "1000": [
      "\"61/61.  Against all odds\"",
      "Roger Maris"

The real response contained 6 categories, but I’ve truncated it to 2 categories here to save space. Something interesting though is that some categories have the current values ($200 to $1,000 in $200 increments), but some questions have half that ($100 to $500 in $100 increments), and some have some kind of hybrid, with $100 to $300 in $100 increments, and then the final two questions being worth $500 and $1,000. I might want to normalize these at some point, possibly by transforming all questions that were aired before the date on which Jeopardy settled on the current prize money structure.

Another quirk of the dataset is that sometimes there are missing question/answer pairs in a given category. I assume this is because sometimes those questions are never revealed; sometimes the time expires, and so the question may never be revealed, and therefore never makes it onto I’ll have to decide what I want to do with these, it may be that I drop them from the initial database, or maybe flag the category as incomplete so they can be filtered out when the query runs.

Summary & Next Steps

I’m pretty pleased with how far I managed to get in maybe 1 or 2 hours! The code samples above include a few firsts for me:

  • Defining a schema in SQL, including ensuring referential integrity
  • Using a nested SQL query
  • Using Flask to build an API

The surprising thing is, none of this was really that difficult! I think it’s a testament to the Flask framework for eliminating the need to write a whole load of boilerplate code and having a great tutorial, but it also gives me a sense of confidence that this stuff isn’t really that hard, I just need an engaging project as a reason to learn it.

The next steps I want to accomplish are to serve this up using a self-hosted WSGI server and then to write a website using the Jinja7 templating language. Hopefully, the next article will include some screenshots of a real interactive Jeopardy board!


Hi everyone! I know I said I was going to start work on a weight tracking app next, but I came up with something a bit more fun! I was crawling through some of the fascinating datasets on Kaggle looking for some inspiration, and found a set of over 200,000 Jeopardy questions!1

My plan is to use this to learn a little more about front-end stuff, which is still the area I’m weakest in the software development world. My plan is to build a page that generates random combinations of previous Jeopardy games, potentially allowing two people to play against each other in some kind of shared session.

First things first though, I need to get the raw data from Kaggle into a format that I can use in my front end. Python seems like a sensible tool to do that, and the Pandas library should make that process easy. The first thing I did was to write a simple script that would suck the data out of the CSV and dump it into a SQLite database. Thanks to the magic of Pandas and SqlAlchemy, the code is alarmingly simple…

import pandas as pd
from sqlalchemy import create_engine

df_questions = pd.read_csv('JEOPARDY_CSV.csv', parse_dates=[1])
engine = create_engine('sqlite:///questions.db', echo=False)
df_questions.to_sql('questions', con=engine, if_exists='replace')

And that’s it! Done! Time for a pint! Pandas automatically reads the CSV (including headers), parses the dates into real DateTime objects, and then SqlAlchemy creates a SQLite database file and dumps the data into the database. 5 lines including import statements, even for Python that’s impressive.

However, there are some problems with this. The first is that the column names in the CSV have spaces all over the place, which is not ideal since both SQL and Pandas make you jump through all sorts of hoops to address columns with spaces in the name. However, that’s an easy fix with a renaming dict, which turns the code above into something like this:

import pandas as pd
from sqlalchemy import create_engine

column_rename = {"Show Number": "show_identity",
                 " Air Date": "air_date",
                 " Round": "round_name",
                 " Category": "category",
                 " Value": "value",
                 " Question":"question",
                 " Answer": "answer"}

df_questions = pd.read_csv('JEOPARDY_CSV.csv', parse_dates=[1])
df_questions.rename(columns = column_rename, inplace=True)
engine = create_engine('sqlite:///questions.db', echo=False)
df_questions.to_sql('questions', con=engine, if_exists='replace')

OK, so I’m now up to 7 lines. Still not bad. However, another problem remains, and that is the normalization of the data (or lack of). I’ll talk through the pros and cons of normalizing a dataset in a moment, but first, he’s a look at the current state of the database:

54622008-05-13Double Jeopardy!Who’s Your Mommy?$800King Solomon
54622008-05-13Double Jeopardy!It’s all from the Greek to me$800An institution for the care…
54622008-05-13Double Jeopardy!Dr. Drew$1200People 18 to 25 are vulnerable to…
54622008-05-13Double Jeopardy!Recent Films$1200The attempted assassination…
54622008-05-13Double Jeopardy!Music Class$2000This large woodwind…
54622008-05-13Double Jeopardy!Lend me you Iroquois$1200The Iroquois Nationals…
54622008-05-13Double Jeopardy!Who’s Your Mommy?$1200Invention (proverbially)

This is a random sample of 7 rows from the dataset, and the lack of normalization is pretty obvious. From the 7 columns shown here, only two of them contain unique values (Question and Answer), 2 columns contain repeated values, and 3 columns contain all identical values! In fact, from the 216,930 unique questions in this dataset, there are only 3,640 unique shows and 47,200 unique categories. So there’s certainly a huge scope for normalization here!

I’ll go through the normalization steps I went through in a moment, but as a starting point, here’s the schema diagram for the current unnormalized table:

The data as pulled from Kaggle doesn’t even really conform to the Unnormalized Form2, since it doesn’t have an explicitly defined primary key (however Pandas was kind enough to create one for me automatically when I loaded the CSV). It does however have no duplicate rows, so it’s halfway there.

Normalizing The Dataset

Typically, before I start writing any database code I want to design my schema to figure out how I’m going to represent the data. I design relational databases for a living, and I’ve developed a pretty simple approach to the initial coarse-grained normalization of a database schema.

The trick is to figure out what each table conceptualizes.

Throughout my career, I have typically been involved in designing and building databases to store engineering data. If I need to build a database to store mechanical testing data (for example), I’m probably going to need tables about the material being tested (both the specification of the material and the specific instance of the material), the machine that’s doing the testing, the results of the test, the calculations performed on those results, etc.

In this case, the key Jeopardy concepts I came up with were:

  • The Show (show number, air date)
  • The Rounds (Jeopardy!, Double Jeopardy!, Final Jeopardy!)
  • The Categories (History, Brass, Birds, “T”elevision)
  • Question & Answers

Everything associated with Jeopardy (except for the contestants and Aaaaalex Trebek!) fits into one of those buckets, and so that sounds like a pretty good way to go about designing the normalized database. To formalize things up a bit, here it is in UML:

This feels like the appropriate way to construct the database since it relates the key concepts of a Jeopardy game in a way that makes conceptual sense, e.g. multiple categories are included in a single show, and each category contains multiple questions and answers. Each question and answer pair belongs to a single category, and has a specific ‘value’.

However, I want to be sure I’ve considered the most significant possible ways of normalizing the data, so I’ll now go through the first three normal forms to make sure the database either adheres to that form, or that I have a justification for not doing so.

1st Normal Form (1NF)

The First normal form simply states that the columns in a table should be atomic (i.e. indivisible). We were actually already there with the initial CSV, so we pass this first test straight off. Great!

2nd Normal Form (2NF)

The Second normal form get’s a bit trickier. To pass this one, we need to ensure that every column that is not part of a candidate key depends on the entire candidate key, and not just a part of it. A common ‘smell’ associated with breaking 2NF is that there’s a lot of duplicated data.

Our original database didn’ satisfy 2NF all over the place. A candidate key for our original database was the Question itself, but everything except the answer doesn’t depend on the question. Pulling the Question, Answer, and Value columns into a separate table helps with this, and then doing the same for the category name ensures all tables are compliant with 2NF.

3rd Normal Form (3NF)

The Third normal form states that every non-prime column is non-transitively dependent on every key in that table, i.e. there should not be a column whose values depend on something other than a key. Again, this is something that my schema satisfies due to the creation of separate tables for each entity.

For example, in the original dataset, the air date of the show is functionally dependent on the show_identity and not on the question candidate key. Therefore there’s nothing to stop the same episode identity from having multiple different air dates. This is a violation of 3NF.

The normalized schema creates a dedicated Show table with the show identity, and the air_date is dependent purely on this column.

Pros and Cons of Excessive Normalization

Before I go on to the code that performs the nomalization, I want to address the question of whether too much normalization is a bad thing. The answer in my experience is that you have to determine that on a case-by-case basis, based on both the data that you are storing and the way that data is being used. It’s pointless having a theoretically perfect database if you have to perform 30 joins that inur a huge performance penalty, and even worse if the front-end tool you are using just can’t do the joins that you need!

With that being said, I always like to start with an over-normalized database since during the implementation phase of a project it’s generally easier to denormalize a database than it is to normalize a database. Also, you only realize what you need to denormalize once you actually try to use the database. So I think it’s always better to over-normalize early on in a project, and then denormalize as the front-end comes together and additional constraints become clear.

The Code

The code to normalize the database makes extensive use of copying subsets of DataFames in Pandas, dropping columns, removing duplicates, and mapping values. Most of it was pretty straight forward, but one particular area that required some figuring out was where I generate the category table. This step required creating an entirely new identity that was dependent on the category name, the round, and the show (since category names could be re-used between shows). My original approach to doing this was to:

  1. Copy the questions DataFrame to a ‘category’ DataFrame
  2. Drop the columns I didn’t want (value, question, answer)
  3. Delete duplicate rows
  4. Re-generate the DataFrame index
  5. Find the category identity (index) based on each round, category, show tuple in the questions DataFrame
# Create a categories table
df_categories = pd.DataFrame(columns=['show_identity', 'round_name', 'name'])

df_categories['show_identity'] = df_questions['show_identity'].to_numpy()
df_categories['round_name'] = df_questions['round_name'].to_numpy()
df_categories['name'] = df_questions['category'].to_numpy()
df_categories.drop_duplicates(inplace=True, ignore_index=True)

def get_category_id(question):
    show_identity = question.show_identity
    round_name = question.round_name
    category = question.category

    category_id = df_categories[(df_categories.round_name == round_name) &
                                       ( == category) &
                                       (df_categories.show_identity == show_identity)].index[0]
    return category_id

# Get the category ID for each question
category_ids = df_questions.swifter.apply(get_category_id, axis=1)

# Add the category identity to the questions table, and delete unneeded columns
df_questions.insert(loc=0, column='category_identity', value=category_ids)

This just felt wrong when I was writing it; doing 200,000 lookups across three columns in a data frame is never going to be efficient, and even with the swifter3 library that parallelized the operation across my 12-core Ryzen it still took over 20 minutes to complete.

The better approach was to create the category index before I duplicated the DataFrame, that way the cross-referencing was already done without having to resort to a whole load of lookups. The steps to do this were as follows:

  1. Create the category DataFrame and copy the relevant columns
  2. Create the composite key based on those relevant columns and create a map between those values and an int
  3. Map the composite key in the category DataFrame
  4. Map the composite key in the question DataFrame, and drop the columns that were moved to the category DataFrame

This cut the time taken to build the category table from around 20 minutes to about 10 seconds, a factor of improvement of about 120! You can see that code below, starting from the comment line # Create a categories table.

import pandas as pd
from sqlalchemy import create_engine
import numpy as np

column_rename = {"Show Number": "show_identity",
                 " Air Date": "air_date",
                 " Round": "round_name",
                 " Category": "category",
                 " Value": "value",
                 " Question":"question",
                 " Answer": "answer"}

df_questions = pd.read_csv('JEOPARDY_CSV.csv', parse_dates=[1])

df_questions.rename(columns = column_rename, inplace=True)

# Turn the value into an integer
df_questions['value'] = df_questions['value'].map(
    lambda a: int((a[1:].replace(',', '')))
        if a != 'None'
    else np.NaN)

# Create a shows table, and delete corresponding columns in questions table
df_shows = pd.DataFrame(columns=['identity', 'air_date'])
df_shows['identity'] = df_questions['show_identity'].to_numpy()
df_shows['air_date'] = df_questions['air_date'].to_numpy()
df_shows.drop_duplicates(inplace=True, ignore_index=True)
df_shows.set_index('identity', inplace=True, append=False)
df_questions.drop('air_date', axis=1, inplace=True)

# Create a categories table
df_categories = pd.DataFrame()

df_categories['show_identity'] = df_questions['show_identity'].to_numpy()
df_categories['round_name'] = df_questions['round_name'].to_numpy()
df_categories['category'] = df_questions['category'].to_numpy()

# Define the composite key that uniquely identifies a category
category_composite_key = ['show_identity', 'round_name', 'category']
df_categories.drop_duplicates(inplace=True, ignore_index=True)

old_category_identities = df_categories[category_composite_key].apply(lambda row: '_'.join(row.values.astype(str)),
new_category_identities = range(len(old_category_identities))
category_identity_mapping = dict(zip(old_category_identities, new_category_identities))

def map_category_identity(category):
    composite_key = '_'.join(category.values.astype(str))
    return category_identity_mapping[composite_key]

df_categories['identity'] = df_categories.apply(map_category_identity, axis=1)
df_categories.rename(columns={'category': 'name'}, inplace=True)
df_categories.index.rename('identity', inplace=True)
df_categories.set_index(keys=['identity'], append=False, inplace=True, drop=True)

# Perform the same mapping on the questions themselves, and drop the individual elements of the key
df_questions.insert(0, 'category_identity', value=df_questions[category_composite_key].apply(map_category_identity,
df_questions.drop(category_composite_key, axis=1, inplace=True)
df_questions.index.rename('identity', inplace=True)

# Create a rounds table
round_name_mapping = {'Jeopardy!': 0, 'Double Jeopardy!': 1, 'Final Jeopardy!': 2, 'Tiebreaker': 3}
df_rounds = pd.DataFrame.from_dict({'round_name': round_name_mapping.keys()})
df_rounds.index.rename('identity', inplace=True)

round_identity = df_categories.apply(lambda c: round_name_mapping[c.round_name], axis=1)
df_categories.insert(loc=0, column='round_identity', value=round_identity)
df_categories.drop('round_name', axis=1, inplace=True)

# Export to SQLite
engine = create_engine('sqlite:///questions.db', echo=False)
df_questions.to_sql('questions', con=engine, if_exists='replace')
df_shows.to_sql('shows', con=engine, if_exists='replace')
df_categories.to_sql('categories', con=engine, if_exists='replace')
df_rounds.to_sql('rounds', con=engine, if_exists='replace')

Referential Integrity

The code above successfully creates the different tables in SQLite with the right columns, which allow all the various joins to be performed to find out which questions appeared on a certain show (through the category table), or ultimately even to re-create the original CSV.

However, the SQLite database doesn’t have any constraints to ensure that any new data added to the database satisfy the requirements to be able to do the joins described above. And unfortunately, since I’m using SQLite at the moment, there’s no simple way to apply constraints to tables in the database if you are creating them with dataframe_to_sql()4 Since this is still in the prototype stage it’s not really critical that I handle this now, but before going into production I’d probably change this to a MySQL or Postgres backend, and at that point, I’d want to make sure my referential integrity is all squared away.

Next Steps

As I mentioned at the top, my goal for this is to be able to create something in a front-end that maybe lets me play a randomly generated game of jeopardy, potentially with other people in some kind of shared session over the web.

However, the first step in that is going to be building a REST API that can return a random Jeopardy ‘game’ when requested. I’ll probably implement the first pass in Flask, but I also want to give some other languages a try; I’ve done some very basic work with Spring Boot in Java, and I think it would make a lot of sense to use that here as well. Maybe have the backend in Spring Boot and the front-end in a Flask-based web app…

Anyway, watch this space and see what I come up with!