pandas – Andy Grigg

Hi everyone! I know I said I was going to start work on a weight tracking app next, but I came up with something a bit more fun! I was crawling through some of the fascinating datasets on Kaggle looking for some inspiration, and found a set of over 200,000 Jeopardy questions!¹.

My plan is to use this to learn a little more about front-end stuff, which is still the area I’m weakest in the software development world. My plan is to build a page that generates random combinations of previous Jeopardy games, potentially allowing two people to play against each other in some kind of shared session.

First things first though, I need to get the raw data from Kaggle into a format that I can use in my front end. Python seems like a sensible tool to do that, and the Pandas library should make that process easy. The first thing I did was to write a simple script that would suck the data out of the CSV and dump it into a SQLite database. Thanks to the magic of Pandas and SqlAlchemy, the code is alarmingly simple…

import pandas as pd
from sqlalchemy import create_engine

df_questions = pd.read_csv('JEOPARDY_CSV.csv', parse_dates=[1])
engine = create_engine('sqlite:///questions.db', echo=False)
df_questions.to_sql('questions', con=engine, if_exists='replace')

And that’s it! Done! Time for a pint! Pandas automatically reads the CSV (including headers), parses the dates into real DateTime objects, and then SqlAlchemy creates a SQLite database file and dumps the data into the database. 5 lines including import statements, even for Python that’s impressive.

However, there are some problems with this. The first is that the column names in the CSV have spaces all over the place, which is not ideal since both SQL and Pandas make you jump through all sorts of hoops to address columns with spaces in the name. However, that’s an easy fix with a renaming dict, which turns the code above into something like this:

import pandas as pd
from sqlalchemy import create_engine

column_rename = {"Show Number": "show_identity",
                 " Air Date": "air_date",
                 " Round": "round_name",
                 " Category": "category",
                 " Value": "value",
                 " Question":"question",
                 " Answer": "answer"}

df_questions = pd.read_csv('JEOPARDY_CSV.csv', parse_dates=[1])
df_questions.rename(columns = column_rename, inplace=True)
engine = create_engine('sqlite:///questions.db', echo=False)
df_questions.to_sql('questions', con=engine, if_exists='replace')

OK, so I’m now up to 7 lines. Still not bad. However, another problem remains, and that is the normalization of the data (or lack of). I’ll talk through the pros and cons of normalizing a dataset in a moment, but first, he’s a look at the current state of the database:

show_identity	air_date	round_name	category	value	question	answer
5462	2008-05-13	Double Jeopardy!	Who’s Your Mommy?	$800	King Solomon	…
5462	2008-05-13	Double Jeopardy!	It’s all from the Greek to me	$800	An institution for the care…	…
5462	2008-05-13	Double Jeopardy!	Dr. Drew	$1200	People 18 to 25 are vulnerable to…	…
5462	2008-05-13	Double Jeopardy!	Recent Films	$1200	The attempted assassination…	…
5462	2008-05-13	Double Jeopardy!	Music Class	$2000	This large woodwind…	…
5462	2008-05-13	Double Jeopardy!	Lend me you Iroquois	$1200	The Iroquois Nationals…	…
5462	2008-05-13	Double Jeopardy!	Who’s Your Mommy?	$1200	Invention (proverbially)	…

This is a random sample of 7 rows from the dataset, and the lack of normalization is pretty obvious. From the 7 columns shown here, only two of them contain unique values (Question and Answer), 2 columns contain repeated values, and 3 columns contain all identical values! In fact, from the 216,930 unique questions in this dataset, there are only 3,640 unique shows and 47,200 unique categories. So there’s certainly a huge scope for normalization here!

I’ll go through the normalization steps I went through in a moment, but as a starting point, here’s the schema diagram for the current unnormalized table:

The data as pulled from Kaggle doesn’t even really conform to the Unnormalized Form², since it doesn’t have an explicitly defined primary key (however Pandas was kind enough to create one for me automatically when I loaded the CSV). It does however have no duplicate rows, so it’s halfway there.

Normalizing The Dataset

Typically, before I start writing any database code I want to design my schema to figure out how I’m going to represent the data. I design relational databases for a living, and I’ve developed a pretty simple approach to the initial coarse-grained normalization of a database schema.

The trick is to figure out what each table conceptualizes.

Throughout my career, I have typically been involved in designing and building databases to store engineering data. If I need to build a database to store mechanical testing data (for example), I’m probably going to need tables about the material being tested (both the specification of the material and the specific instance of the material), the machine that’s doing the testing, the results of the test, the calculations performed on those results, etc.

In this case, the key Jeopardy concepts I came up with were:

The Show (show number, air date)
The Rounds (Jeopardy!, Double Jeopardy!, Final Jeopardy!)
The Categories (History, Brass, Birds, “T”elevision)
Question & Answers

Everything associated with Jeopardy (except for the contestants and Aaaaalex Trebek!) fits into one of those buckets, and so that sounds like a pretty good way to go about designing the normalized database. To formalize things up a bit, here it is in UML:

This feels like the appropriate way to construct the database since it relates the key concepts of a Jeopardy game in a way that makes conceptual sense, e.g. multiple categories are included in a single show, and each category contains multiple questions and answers. Each question and answer pair belongs to a single category, and has a specific ‘value’.

However, I want to be sure I’ve considered the most significant possible ways of normalizing the data, so I’ll now go through the first three normal forms to make sure the database either adheres to that form, or that I have a justification for not doing so.

1st Normal Form (1NF)

The First normal form simply states that the columns in a table should be atomic (i.e. indivisible). We were actually already there with the initial CSV, so we pass this first test straight off. Great!

2nd Normal Form (2NF)

The Second normal form get’s a bit trickier. To pass this one, we need to ensure that every column that is not part of a candidate key depends on the entire candidate key, and not just a part of it. A common ‘smell’ associated with breaking 2NF is that there’s a lot of duplicated data.

Our original database didn’ satisfy 2NF all over the place. A candidate key for our original database was the Question itself, but everything except the answer doesn’t depend on the question. Pulling the Question, Answer, and Value columns into a separate table helps with this, and then doing the same for the category name ensures all tables are compliant with 2NF.

3rd Normal Form (3NF)

The Third normal form states that every non-prime column is non-transitively dependent on every key in that table, i.e. there should not be a column whose values depend on something other than a key. Again, this is something that my schema satisfies due to the creation of separate tables for each entity.

For example, in the original dataset, the air date of the show is functionally dependent on the show_identity and not on the question candidate key. Therefore there’s nothing to stop the same episode identity from having multiple different air dates. This is a violation of 3NF.

The normalized schema creates a dedicated Show table with the show identity, and the air_date is dependent purely on this column.

Pros and Cons of Excessive Normalization

Before I go on to the code that performs the nomalization, I want to address the question of whether too much normalization is a bad thing. The answer in my experience is that you have to determine that on a case-by-case basis, based on both the data that you are storing and the way that data is being used. It’s pointless having a theoretically perfect database if you have to perform 30 joins that inur a huge performance penalty, and even worse if the front-end tool you are using just can’t do the joins that you need!

With that being said, I always like to start with an over-normalized database since during the implementation phase of a project it’s generally easier to denormalize a database than it is to normalize a database. Also, you only realize what you need to denormalize once you actually try to use the database. So I think it’s always better to over-normalize early on in a project, and then denormalize as the front-end comes together and additional constraints become clear.

The Code

The code to normalize the database makes extensive use of copying subsets of DataFames in Pandas, dropping columns, removing duplicates, and mapping values. Most of it was pretty straight forward, but one particular area that required some figuring out was where I generate the category table. This step required creating an entirely new identity that was dependent on the category name, the round, and the show (since category names could be re-used between shows). My original approach to doing this was to:

Copy the questions DataFrame to a ‘category’ DataFrame
Drop the columns I didn’t want (value, question, answer)
Delete duplicate rows
Re-generate the DataFrame index
Find the category identity (index) based on each round, category, show tuple in the questions DataFrame

# Create a categories table
df_categories = pd.DataFrame(columns=['show_identity', 'round_name', 'name'])

df_categories['show_identity'] = df_questions['show_identity'].to_numpy()
df_categories['round_name'] = df_questions['round_name'].to_numpy()
df_categories['name'] = df_questions['category'].to_numpy()
df_categories.drop_duplicates(inplace=True, ignore_index=True)

def get_category_id(question):
    show_identity = question.show_identity
    round_name = question.round_name
    category = question.category

    category_id = df_categories[(df_categories.round_name == round_name) &
                                       (df_categories.name == category) &
                                       (df_categories.show_identity == show_identity)].index[0]
    return category_id

# Get the category ID for each question
category_ids = df_questions.swifter.apply(get_category_id, axis=1)

# Add the category identity to the questions table, and delete unneeded columns
df_questions.insert(loc=0, column='category_identity', value=category_ids)

This just felt wrong when I was writing it; doing 200,000 lookups across three columns in a data frame is never going to be efficient, and even with the swifter³ library that parallelized the operation across my 12-core Ryzen it still took over 20 minutes to complete.

The better approach was to create the category index before I duplicated the DataFrame, that way the cross-referencing was already done without having to resort to a whole load of lookups. The steps to do this were as follows:

Create the category DataFrame and copy the relevant columns
Create the composite key based on those relevant columns and create a map between those values and an int
Map the composite key in the category DataFrame
Map the composite key in the question DataFrame, and drop the columns that were moved to the category DataFrame

This cut the time taken to build the category table from around 20 minutes to about 10 seconds, a factor of improvement of about 120! You can see that code below, starting from the comment line # Create a categories table.

import pandas as pd
from sqlalchemy import create_engine
import numpy as np

column_rename = {"Show Number": "show_identity",
                 " Air Date": "air_date",
                 " Round": "round_name",
                 " Category": "category",
                 " Value": "value",
                 " Question":"question",
                 " Answer": "answer"}

df_questions = pd.read_csv('JEOPARDY_CSV.csv', parse_dates=[1])

df_questions.rename(columns = column_rename, inplace=True)

# Turn the value into an integer
df_questions['value'] = df_questions['value'].map(
    lambda a: int((a[1:].replace(',', '')))
        if a != 'None'
    else np.NaN)

# Create a shows table, and delete corresponding columns in questions table
df_shows = pd.DataFrame(columns=['identity', 'air_date'])
df_shows['identity'] = df_questions['show_identity'].to_numpy()
df_shows['air_date'] = df_questions['air_date'].to_numpy()
df_shows.drop_duplicates(inplace=True, ignore_index=True)
df_shows.set_index('identity', inplace=True, append=False)
df_questions.drop('air_date', axis=1, inplace=True)

# Create a categories table
df_categories = pd.DataFrame()

df_categories['show_identity'] = df_questions['show_identity'].to_numpy()
df_categories['round_name'] = df_questions['round_name'].to_numpy()
df_categories['category'] = df_questions['category'].to_numpy()

# Define the composite key that uniquely identifies a category
category_composite_key = ['show_identity', 'round_name', 'category']
df_categories.drop_duplicates(inplace=True, ignore_index=True)

old_category_identities = df_categories[category_composite_key].apply(lambda row: '_'.join(row.values.astype(str)),
                                                                      axis=1)
new_category_identities = range(len(old_category_identities))
category_identity_mapping = dict(zip(old_category_identities, new_category_identities))


def map_category_identity(category):
    composite_key = '_'.join(category.values.astype(str))
    return category_identity_mapping[composite_key]


df_categories['identity'] = df_categories.apply(map_category_identity, axis=1)
df_categories.rename(columns={'category': 'name'}, inplace=True)
df_categories.index.rename('identity', inplace=True)
df_categories.set_index(keys=['identity'], append=False, inplace=True, drop=True)

# Perform the same mapping on the questions themselves, and drop the individual elements of the key
df_questions.insert(0, 'category_identity', value=df_questions[category_composite_key].apply(map_category_identity,
                                                                                             axis=1))
df_questions.drop(category_composite_key, axis=1, inplace=True)
df_questions.index.rename('identity', inplace=True)

# Create a rounds table
round_name_mapping = {'Jeopardy!': 0, 'Double Jeopardy!': 1, 'Final Jeopardy!': 2, 'Tiebreaker': 3}
df_rounds = pd.DataFrame.from_dict({'round_name': round_name_mapping.keys()})
df_rounds.index.rename('identity', inplace=True)

round_identity = df_categories.apply(lambda c: round_name_mapping[c.round_name], axis=1)
df_categories.insert(loc=0, column='round_identity', value=round_identity)
df_categories.drop('round_name', axis=1, inplace=True)

# Export to SQLite
engine = create_engine('sqlite:///questions.db', echo=False)
df_questions.to_sql('questions', con=engine, if_exists='replace')
df_shows.to_sql('shows', con=engine, if_exists='replace')
df_categories.to_sql('categories', con=engine, if_exists='replace')
df_rounds.to_sql('rounds', con=engine, if_exists='replace')

Referential Integrity

The code above successfully creates the different tables in SQLite with the right columns, which allow all the various joins to be performed to find out which questions appeared on a certain show (through the category table), or ultimately even to re-create the original CSV.

However, the SQLite database doesn’t have any constraints to ensure that any new data added to the database satisfy the requirements to be able to do the joins described above. And unfortunately, since I’m using SQLite at the moment, there’s no simple way to apply constraints to tables in the database if you are creating them with dataframe_to_sql()⁴. Since this is still in the prototype stage it’s not really critical that I handle this now, but before going into production I’d probably change this to a MySQL or Postgres backend, and at that point, I’d want to make sure my referential integrity is all squared away.

Next Steps

As I mentioned at the top, my goal for this is to be able to create something in a front-end that maybe lets me play a randomly generated game of jeopardy, potentially with other people in some kind of shared session over the web.

However, the first step in that is going to be building a REST API that can return a random Jeopardy ‘game’ when requested. I’ll probably implement the first pass in Flask, but I also want to give some other languages a try; I’ve done some very basic work with Spring Boot in Java, and I think it would make a lot of sense to use that here as well. Maybe have the backend in Spring Boot and the front-end in a Flask-based web app…

Anyway, watch this space and see what I come up with!

Tag: pandas

Jeopardy!

Normalizing The Dataset

1st Normal Form (1NF)

2nd Normal Form (2NF)

3rd Normal Form (3NF)

Pros and Cons of Excessive Normalization

The Code

Referential Integrity

Next Steps

CC License

Normalizing The Dataset

1st Normal Form (1NF)

2nd Normal Form (2NF)

3rd Normal Form (3NF)

Pros and Cons of Excessive Normalization

The Code

Referential Integrity

Next Steps

Tags

CC License