diff --git a/README.md b/README.md
new file mode 100644
index 0000000..6513fd5
--- /dev/null
+++ b/README.md
@@ -0,0 +1,27 @@
+# lahmanlite
+`lahmanlite` is a project for creating a SQLite database of baseball statistics
+from [the Lahman database/Baseball Databank](https://seanlahman.com).
+
+A makefile and SQL scripts are provided that can create a database from
+Lahman's CSV files. Ideally, this means that as long as new releases continue,
+(and the structure of the releases is maintained), an up-to-date database can
+be created. I have also done my best to normalize the data, incorporate
+constraints, and correct errors I've found.
+
+## How to use
+Using either `export` or `env`, set the `LAHMANLITE_CSV_DIR` environment
+variable to the directory containing Lahman data, then run `make`. This will
+generate two files:
+* `lahman-raw.db`, a straight import of the CSV data into SQLite.
+* `lahman.db`, a modified version of `lahman-raw.db` with data corrections, key
+  constraints, and additional schema modifications.
+
+If you only want the raw data, run `make lahman-raw.db` instead.
+
+### Corrections
+Many of the corrections are simple in nature, like:
+* correcting obvious typos
+* changing empty cells to NULL
+* deleting duplicated data
+
+See the `sql` directory to view the exact SQL statements run for each table.
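
Review note: the README's "Corrections" list could be illustrated with a small example. The following is a minimal sketch, not the project's actual SQL (that lives in the `sql` directory). It uses Python's standard `sqlite3` module and an in-memory database. The table name `batting_demo`, its columns, and all row values are made up for illustration. It shows two of the listed corrections: changing empty cells to NULL and deleting duplicated rows.

```python
import sqlite3

# Hypothetical table and rows for illustration only; the real schema and
# correction statements are in lahmanlite's `sql` directory.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE batting_demo "
    "(player_id TEXT, year_id INTEGER, team_id TEXT, hits TEXT)"
)
conn.executemany(
    "INSERT INTO batting_demo VALUES (?, ?, ?, ?)",
    [
        ("playera01", 2001, "AAA", "198"),
        ("playera01", 2001, "AAA", "198"),  # exact duplicate row
        ("playerb01", 2001, "BBB", ""),     # empty cell left by a CSV import
    ],
)

# Correction 1: change empty cells to NULL.
conn.execute("UPDATE batting_demo SET hits = NULL WHERE hits = ''")

# Correction 2: delete duplicated rows, keeping the copy with the
# lowest rowid in each group of identical rows.
conn.execute(
    """
    DELETE FROM batting_demo
    WHERE rowid NOT IN (
        SELECT MIN(rowid)
        FROM batting_demo
        GROUP BY player_id, year_id, team_id, hits
    )
    """
)
conn.commit()

rows = conn.execute(
    "SELECT player_id, year_id, team_id, hits "
    "FROM batting_demo ORDER BY player_id"
).fetchall()
print(rows)
```

Running the corrections as plain `UPDATE` and `DELETE` statements keeps them easy to audit per table, which seems to match the README's pointer to per-table SQL scripts, though the project's own statements may be structured differently.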