Building on the data ingestion pipeline I showed last time, today I went a bit further and started to build out a website based on that data! It’s still very early days, but I did get far enough through to have a functioning JSON endpoint. Here are the steps that I went through to get to an initial basic version of JeoPydy!
Database enhancements
My first step was to finish the data ingestion pipeline by adding constraints to the schema. I did this by defining the schema explicitly in SQL instead of relying on the auto-generated schema from DataFrame.to_sql(), which gives me greater control over the resulting database:
DROP TABLE IF EXISTS category;
DROP TABLE IF EXISTS question;
DROP TABLE IF EXISTS round;
DROP TABLE IF EXISTS show;

CREATE TABLE round (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    round_name TEXT NOT NULL
);

CREATE TABLE show (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    air_date TEXT NOT NULL
);

CREATE TABLE category (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    round_id INTEGER NOT NULL,
    show_id INTEGER NOT NULL,
    name TEXT NOT NULL,
    FOREIGN KEY (round_id) REFERENCES round (id),
    FOREIGN KEY (show_id) REFERENCES show (id)
);

CREATE TABLE question (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    category_id INTEGER NOT NULL,
    value INTEGER,
    question TEXT NOT NULL,
    answer TEXT,
    FOREIGN KEY (category_id) REFERENCES category (id)
);
The key things I’m doing here are:
- Types: DataFrame columns are already typed, so these would have been mapped across to the SQL table anyway, but I still think it’s good to have explicit control over them, especially since my types are not going to change
- Constraints: I have set the id columns to be autoincremented primary keys, and I have also added foreign key constraints on the relevant columns to ensure referential integrity, i.e. that any record I create in a linked table has a valid reference to the foreign table. The foreign key constraints are useful for making sure I’m not accidentally creating orphaned rows during the ingestion pipeline (there’s a short sketch of this just after the list), whereas the autoincremented primary keys will only really be useful if I start adding data to the database through a separate method.
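To see the referential integrity in action, here’s a minimal sketch using an in-memory database. One gotcha worth noting: SQLite ignores foreign key constraints unless they are switched on per connection, so the ingestion code needs the pragma below for the constraints to actually reject anything.

import sqlite3

# Minimal sketch: SQLite only enforces FOREIGN KEY constraints when the
# foreign_keys pragma is switched on for the connection
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE round (id INTEGER PRIMARY KEY AUTOINCREMENT, round_name TEXT NOT NULL);
    CREATE TABLE show (id INTEGER PRIMARY KEY AUTOINCREMENT, air_date TEXT NOT NULL);
    CREATE TABLE category (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        round_id INTEGER NOT NULL,
        show_id INTEGER NOT NULL,
        name TEXT NOT NULL,
        FOREIGN KEY (round_id) REFERENCES round (id),
        FOREIGN KEY (show_id) REFERENCES show (id)
    );
""")

conn.execute("INSERT INTO round (round_name) VALUES ('Jeopardy!')")
conn.execute("INSERT INTO show (air_date) VALUES ('2002-12-23 00:00:00')")

# Valid: round 1 and show 1 both exist
conn.execute("INSERT INTO category (round_id, show_id, name) VALUES (1, 1, 'BUSINESS & INDUSTRY')")

# Invalid: there is no show with id 99, so the orphaned row is rejected
try:
    conn.execute("INSERT INTO category (round_id, show_id, name) VALUES (1, 99, 'EPITAPHS')")
except sqlite3.IntegrityError as exc:
    print(exc)  # FOREIGN KEY constraint failed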
It’s worth saying I could probably use SQLAlchemy to do this instead of writing raw SQL, but that’s a project for another time.
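Just to illustrate the idea, here’s a rough, untested sketch of what the same schema might look like in SQLAlchemy Core; the hand-written SQL above is what the project actually uses.

from sqlalchemy import Column, ForeignKey, Integer, MetaData, Table, Text, create_engine

metadata = MetaData()

round_table = Table(
    "round", metadata,
    Column("id", Integer, primary_key=True, autoincrement=True),
    Column("round_name", Text, nullable=False),
)

show_table = Table(
    "show", metadata,
    Column("id", Integer, primary_key=True, autoincrement=True),
    Column("air_date", Text, nullable=False),
)

category_table = Table(
    "category", metadata,
    Column("id", Integer, primary_key=True, autoincrement=True),
    Column("round_id", Integer, ForeignKey("round.id"), nullable=False),
    Column("show_id", Integer, ForeignKey("show.id"), nullable=False),
    Column("name", Text, nullable=False),
)

question_table = Table(
    "question", metadata,
    Column("id", Integer, primary_key=True, autoincrement=True),
    Column("category_id", Integer, ForeignKey("category.id"), nullable=False),
    Column("value", Integer),
    Column("question", Text, nullable=False),
    Column("answer", Text),
)

# metadata.create_all(create_engine("sqlite:///jeopydy/jeopydy.db")) would emit
# roughly the same CREATE TABLE statements as the schema above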
Web API
My next step was to write the actual web API. The eventual aim of the project is to build an interactive website, but my front-end experience is limited, so it makes sense to start with what I know. The HTML and CSS trickery can come later.
In previous projects, I have used the base http.server module (https://docs.python.org/3.9/library/http.server.html) provided in core Python, but I wouldn’t do it again. Python 3.7 added a ThreadingHTTPServer class that serves each request in a separate thread, which makes the server much more performant when handling concurrent requests, but there’s still a whole lot of plumbing needed just to write a server that can handle basic GET calls. And if you want anything like authentication, SSL, or session support, you have to implement it manually. It’s kind of fun hand-writing the handling for every method on every endpoint, but only up to a point…
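To give a flavour of that plumbing, here’s roughly what a single JSON GET endpoint looks like with the standard library server; this is just an illustrative sketch, not code from this project.

import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer  # ThreadingHTTPServer is Python 3.7+


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Routing, serialization, headers and error handling are all on you
        if self.path != "/api":
            self.send_error(404)
            return
        body = json.dumps({"hello": "jeopardy"}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Each request is served in its own thread
    ThreadingHTTPServer(("localhost", 8000), Handler).serve_forever()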
The alternative is the Flask framework (https://flask.palletsprojects.com/en/1.1.x/), and its major advantage is that it does all of this for you. I don’t have a whole lot of experience with Flask, but I have previously worked my way through the excellent Flask tutorial (https://flask.palletsprojects.com/en/1.1.x/tutorial/), which does a great job of introducing the concepts and philosophy of Flask in the context of a real application. So many package tutorials I have worked through neglect to do this, which makes the learning curve a lot steeper when you try to use the package in a real project. Having a tutorial that both follows best practices and results in a real, production-ready application makes it much easier to understand how to apply the package to your specific requirements, as opposed to just what it does from a functionality perspective. It’s so good that it’s also a great place to start from if you are writing your own code, which I unashamedly did here. It’s released under the 3-clause BSD license (https://flask.palletsprojects.com/en/1.1.x/license/), so there are no legal or ethical issues with doing this either.
The relevant parts of the code are the application factory, the database initialization function, and then a single blueprint for accessing the data and returning a dictionary as a JSON response:
import os

from flask import Flask


def create_app(test_config=None):
    # create and configure the app
    app = Flask(__name__, instance_relative_config=True)
    app.config.from_mapping(
        SECRET_KEY='dev',
        DATABASE=os.path.join('jeopydy', 'jeopydy.db'),
    )

    # ensure the instance folder exists
    try:
        os.makedirs(app.instance_path)
    except OSError:
        pass

    from . import db
    db.init_app(app)

    from . import round
    app.register_blueprint(round.bp)

    return app
import sqlite3

import click
from flask import current_app, g
from flask.cli import with_appcontext

import jeopydy.build_database


def get_db():
    if 'db' not in g:
        g.db = sqlite3.connect(
            current_app.config['DATABASE'],
            detect_types=sqlite3.PARSE_DECLTYPES
        )
        g.db.row_factory = sqlite3.Row

    return g.db


def close_db(e=None):
    db = g.pop('db', None)

    if db is not None:
        db.close()


def init_db():
    db = get_db()

    with current_app.open_resource('schema.sql') as f:
        db.executescript(f.read().decode('utf8'))

    jeopydy.build_database.build(db)


@click.command('init-db')
@with_appcontext
def init_db_command():
    """Clear the existing data and create new tables."""
    init_db()
    click.echo('Initialized the database.')


def init_app(app):
    app.teardown_appcontext(close_db)
    app.cli.add_command(init_db_command)
from flask import (
    Blueprint, flash, g, redirect, render_template, request, url_for
)
from werkzeug.exceptions import abort

import datetime

from jeopydy.db import get_db

bp = Blueprint('round', __name__)


def get_round():
    db = get_db()
    round = db.execute(
        "SELECT c.name, s.air_date, value, question, answer "
        "FROM question as q "
        "INNER JOIN category c on c.id = q.category_id "
        "INNER JOIN show s on s.id = c.show_id "
        "WHERE q.category_id in ("
        "SELECT c.id "
        "FROM category as c "
        "WHERE c.round_id == 0 "
        "ORDER BY RANDOM() LIMIT 6) "
    ).fetchall()

    response = dict()
    for category_name, air_date, value, question, answer in round:
        category_date = f"{category_name} " \
                        f"({datetime.datetime.strptime(air_date, '%Y-%m-%d %H:%M:%S').strftime('%B %-d, %Y')})"
        if category_date not in response:
            response[category_date] = dict()
        response[category_date][value] = question, answer

    return response


@bp.route('/api')
def api():
    return get_round()
The tutorial referenced above gives a thorough run-through of what each line/block does, so I’ll only cover the parts specific to my project here.
__init__.py
The presence of an __init__.py file denotes the folder it is in as a package, and the code within that file is executed whenever that package is imported. In the world of Flask, that means the Flask application can be written as its own package, which helps in managing and organizing the code, e.g. separate parts of the application can be encapsulated into separate modules, allowing a clearer separation of concerns. My application is nowhere near that scale yet, but as I start adding the concept of users, PvP games, etc., it probably will be.
The code itself is pretty basic: it just implements a factory pattern to create the Flask app instance (JeoPydy – every Python application needs a witty name, right?) and registers the API blueprint defined below.
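One nice side effect of the factory pattern is testability. I haven’t written any tests yet, but something along these lines (hypothetical, and assuming the database has already been initialized with flask init-db) exercises the endpoint without running a server:

from jeopydy import create_app

# Build a throwaway app instance and use Flask's built-in test client
app = create_app()
client = app.test_client()

response = client.get("/api")
print(response.status_code)       # 200
print(list(response.get_json()))  # six "CATEGORY (air date)" keys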
db.py
The db.py file is responsible for creating the database if it doesn’t already exist, and for managing the connection to it. Here I import the module I wrote in the previous post that actually processes and imports the data, but first I execute the SQL schema above to create the tables with the appropriate constraints.
One really cool feature here is the @click.command('init-db') decorator. The Click library (https://click.palletsprojects.com/en/7.x/) lets you turn a function into a command-line application with barely any code. In this case, adding the decorator to the init_db_command() function lets us run flask init-db from the command line, which automatically sets up the Flask app and initializes the database, without us having to worry about invoking Python against a specific function of a specific module. Beautiful.
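To show just how little Click needs, here’s a tiny standalone sketch, separate from the Flask integration above (the command and option names are made up purely for illustration):

import click


@click.command("greet")
@click.option("--name", default="world", help="Who to greet.")
def greet(name):
    """A function turned into a command-line application by two decorators."""
    click.echo(f"Hello, {name}!")


if __name__ == "__main__":
    greet()  # e.g. `python greet.py --name Alex` prints "Hello, Alex!"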
round.py
Finally, round.py actually defines the API endpoint that returns the questions and answers for a Jeopardy round. The query is included in the Python code above, but I’ll break it out here separately:
SELECT c.name, s.air_date, value, question, answer
FROM question as q
INNER JOIN category c on c.id = q.category_id
INNER JOIN show s on s.id = c.show_id
WHERE q.category_id in (
    SELECT c.id
    FROM category as c
    WHERE c.round_id == 0
    ORDER BY RANDOM() LIMIT 6
)
The main query selects a set of rows from the question table, along with the Category Name (very important!) and original Air Date (sometimes important for contemporaneous reasons) for each. However, the query needs to limit that set to the rows linked to 6 randomly selected Jeopardy (or Double Jeopardy) categories. I achieved this through a sub-query, which first randomly selects 6 categories and then feeds them into the WHERE clause of the main query.
Performance is pretty good: the query only takes around 50 ms against the SQLite database. Before this goes into production I’ll switch to a more scalable database engine; I have the most experience with MS SQL Server, and since it’s now available on Linux (and there’s an official Docker image (https://hub.docker.com/_/microsoft-mssql-server) – good job Microsoft!), I might give that a go. However, the current setup is fine for this stage of the project. If performance becomes an issue later on I’ll look at denormalizing the database to avoid the need for the sub-query, but it’s working well enough right now.
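For reference, a quick-and-dirty way to measure this yourself might look something like the following sketch (assuming the database has already been initialized):

import time

from jeopydy import create_app
from jeopydy.round import get_round

# get_round() needs an application context so that get_db() can find the config
app = create_app()
with app.app_context():
    start = time.perf_counter()
    get_round()
    print(f"get_round() took {(time.perf_counter() - start) * 1000:.1f} ms")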
The rest of the round.py code sets up the blueprint. The blueprint contains a single function mapped to the '/api' path, which returns the results of the SQL query above transformed into a JSON payload. Another great feature of Flask is that I don’t need to worry about serializing the JSON before I return it: I just return the dict and Flask figures out what to do automatically!
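Under the hood, returning a dict is shorthand for passing it through Flask’s jsonify, so the view could equally be written like this (anything that isn’t a dict, such as a bare list, would still need the explicit call in this version of Flask):

from flask import jsonify


@bp.route('/api')
def api():
    # Equivalent to returning the dict directly; Flask 1.1+ passes dict
    # return values through jsonify automatically
    return jsonify(get_round())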
Response
Here’s the response from the endpoint:
{ "BUSINESS & INDUSTRY (December 23, 2002)": { "200": [ "This company serving about 30 million customers features the following in its ads [(actor walking across the <i>Jeopardy!</i> stage) <i>Can you hear me now? Good! Can you hear me now? Good!</i>]", "Verizon Wireless" ], "400": [ "It was once described as 2 \"beautifully embossed, choc. flavored wafers with a rich cream filling\"", "Oreo cookies" ], "600": [ "This Australian-born moguls's first U.S. newspaper acquisition was the San Antonio Express & News", "Rupert Murdoch" ], "800": [ "The \"K\" in K-Mart's name stands for this founder", "S.S. Kresge" ], "1000": [ "The lion on this French auto company's logo was adopted from Belfort, the city where an early model was made", "Peugeot" ] }, "EPITAPHS (March 26, 1999)": { "100": [ "\"Author of the Declaration of American Independence...\"", "Thomas Jefferson" ], "200": [ "\"Quoth the raven nevermore\"", "Edgar Allan Poe" ], "300": [ "\"Truth and history. 21 men. The boy bandit king -- he died as he had lived\"", "Billy the Kid" ], "500": [ "\"Man of 1000 Voices\"", "Mel Blanc" ], "1000": [ "\"61/61. Against all odds\"", "Roger Maris" ] }, ... }
The real response contained 6 categories, but I’ve truncated it to 2 here to save space. Something interesting though: some categories have the current values ($200 to $1,000 in $200 increments), some have half that ($100 to $500 in $100 increments), and some have a kind of hybrid, with $100 to $300 in $100 increments and the final two questions worth $500 and $1,000. I might want to normalize these at some point, possibly by scaling up the values of all questions that aired before the date on which Jeopardy settled on the current prize money structure.
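I haven’t written this yet, but a sketch of what that normalization might look like as an extra ingestion step is below (it assumes the cutover to the current dollar values happened on 2001-11-26, which I’d want to double-check before running it):

# Hypothetical clean-up step for the ingestion pipeline: double the value of
# every question from a show that aired before the assumed cutover date
def normalize_values(db):
    db.execute(
        "UPDATE question SET value = value * 2 "
        "WHERE category_id IN ("
        "    SELECT c.id FROM category AS c "
        "    INNER JOIN show s ON s.id = c.show_id "
        "    WHERE s.air_date < '2001-11-26')"
    )
    db.commit()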
Another quirk of the dataset is that some categories are missing question/answer pairs. I assume this is because the round sometimes runs out of time before every question is revealed, and an unrevealed question never makes it onto http://www.j-archive.com/. I’ll have to decide what I want to do with these: I may drop the affected categories from the initial database, or maybe flag them as incomplete so they can be filtered out when the query runs.
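Nothing is decided yet, but the flagging option might look roughly like this (the incomplete column and the five-question threshold are my own assumptions, not anything in the dataset):

# Hypothetical: mark categories with fewer than five surviving questions so
# get_round() could exclude them with an extra "AND c.incomplete = 0" filter
def flag_incomplete_categories(db):
    db.execute("ALTER TABLE category ADD COLUMN incomplete INTEGER NOT NULL DEFAULT 0")
    db.execute(
        "UPDATE category SET incomplete = 1 "
        "WHERE id IN ("
        "    SELECT c.id FROM category AS c "
        "    LEFT JOIN question q ON q.category_id = c.id "
        "    GROUP BY c.id "
        "    HAVING COUNT(q.id) < 5)"
    )
    db.commit()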
Summary & Next Steps
I’m pretty pleased with how far I managed to get in maybe 1 or 2 hours! The code samples above include a few firsts for me:
- Defining a schema in SQL, including ensuring referential integrity
- Using a nested SQL query
- Using Flask to build an API
The surprising thing is, none of this was really that difficult! I think it’s a testament to the Flask framework for eliminating the need to write a whole load of boilerplate code and for having a great tutorial, but it also gives me a sense of confidence that this stuff isn’t really that hard; I just need an engaging project as a reason to learn it.
The next steps I want to accomplish are to serve this up using a self-hosted WSGI server and then to write a website using the Jinja templating language (https://jinja.palletsprojects.com/en/2.11.x/). Hopefully, the next article will include some screenshots of a real interactive Jeopardy board!