lahmanlite

lahmanlite is a project for creating a SQLite database of baseball statistics from the Lahman database/Baseball Databank.

A makefile and SQL scripts are provided that build the database from Lahman's CSV files. As long as new releases continue and keep the same structure, rerunning the build should produce an up-to-date database. I have also done my best to normalize the data, add constraints, and correct the errors I've found.

How to use

Using either export or env, set the LAHMANLITE_CSV_DIR environment variable to the directory containing Lahman data, then run make. This will generate two files:

  • lahman-raw.db, a straight import of the CSV data into SQLite.
  • lahman.db, a modified version of lahman-raw.db with data corrections, key constraints, and additional schema modifications.

If you only want the raw data, run make lahman-raw.db instead.
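
Once make has finished, one quick way to check the result is to list the tables in each generated file. The following is a minimal sketch using Python's standard sqlite3 module, not part of lahmanlite itself; the helper name list_tables is hypothetical, and the script assumes only that the two files named above sit in the current directory.

```python
import os
import sqlite3


def list_tables(db_path):
    """Return the names of all user tables in the SQLite file at db_path."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    for db_file in ("lahman-raw.db", "lahman.db"):
        # sqlite3.connect would create a missing file, so check first.
        if os.path.exists(db_file):
            print(db_file, list_tables(db_file))
        else:
            print(db_file, "not found; run make first")
```

The existence check matters because sqlite3.connect silently creates an empty database when the path does not exist, which would hide a failed build.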

Corrections

Many of the corrections are simple, such as:

  • correcting obvious typos
  • changing empty cells to NULL
  • deleting duplicated data

See the sql directory to view the exact SQL statements run for each table.
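
To give a feel for what such corrections look like in SQLite, here is a small sketch run against an in-memory database. It is an illustration only: the table demo and its columns are hypothetical and do not reflect the real Lahman schema, and the statements are not taken from the sql directory.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table standing in for a raw CSV import; not the real schema.
conn.execute("CREATE TABLE demo (player_id TEXT, team TEXT)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [("p1", "BOS"), ("p1", "BOS"), ("p2", "")],
)

# Change empty cells to NULL.
conn.execute("UPDATE demo SET team = NULL WHERE team = ''")

# Delete duplicated rows, keeping the first copy of each.
conn.execute(
    "DELETE FROM demo WHERE rowid NOT IN ("
    "SELECT MIN(rowid) FROM demo GROUP BY player_id, team)"
)

rows = conn.execute(
    "SELECT player_id, team FROM demo ORDER BY player_id"
).fetchall()
print(rows)
```

After both statements run, one copy of the duplicated row remains and the empty team value is stored as NULL (None in Python).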