Analyzing Baseball Data with R

By Max Marchi, Jim Albert

Chapman and Hall/CRC – 2013 – 334 pages

Series: Chapman & Hall/CRC The R Series

Purchasing Options:

  • Paperback: $39.95
    978-1-46-657022-1
    October 28th 2013

Description

With its flexible capabilities and open-source platform, R has become a major tool for analyzing detailed, high-quality baseball data. Analyzing Baseball Data with R provides an introduction to R for sabermetricians, baseball enthusiasts, and students interested in exploring the rich sources of baseball data. It equips readers with the skills and software tools needed to perform every step of an analysis: gathering the datasets, putting them in a convenient format, visualizing the data with graphs, and performing a statistical analysis.

The authors first present an overview of publicly available baseball datasets and a gentle introduction to the data structures and the exploratory and data management capabilities of R. They also cover the traditional graphics functions in the base package and introduce more sophisticated graphical displays available through the lattice and ggplot2 packages. Much of the book illustrates the use of R through popular sabermetrics topics, including the Pythagorean formula, runs expectancy, career trajectories, simulation of games and seasons, patterns of streaky behavior of players, and fielding measures. Each chapter contains exercises that encourage readers to perform their own analyses using R. All of the datasets and R code used in the text are available online.

This book helps readers answer questions about baseball teams, players, and strategy using large, publicly available datasets. It offers detailed instructions on downloading the datasets and putting them into formats that simplify data exploration and analysis. Through the book’s various examples, readers will learn about modern sabermetrics and be able to conduct their own baseball analyses.

Contents

The Baseball Datasets

Introduction

The Lahman Database: Season-by-Season Data

Retrosheet Game-by-Game Data

Retrosheet Play-by-Play Data

Pitch-by-Pitch Data

Introduction to R

Introduction

Installing R and RStudio

Vectors

Objects and Containers in R

Collection of R Commands

Reading and Writing Data in R

Data Frames

Packages

Splitting, Applying, and Combining Data

Traditional Graphics

Introduction

Factor Variable

Saving Graphs

Dot Plots

Numeric Variable: Stripchart and Histogram

Two Numeric Variables

A Numeric Variable and a Factor Variable

Comparing Ruth, Aaron, Bonds, and A-Rod

The 1998 Home Run Race

The Relation between Runs and Wins

Introduction

The Teams Table in Lahman's Database

Linear Regression

The Pythagorean Formula for Winning Percentage

The Exponent in the Pythagorean Formula

Good and Bad Predictions by the Pythagorean Formula

How Many Runs for a Win?

Value of Plays Using Run Expectancy

The Runs Expectancy Matrix

Runs Scored in the Remainder of the Inning

Creating the Matrix

Measuring Success of a Batting Play

Albert Pujols

Opportunity and Success for All Hitters

Position in the Batting Lineup

Run Values of Different Base Hits

Value of Base Stealing

Advanced Graphics

Introduction

The lattice Package

The ggplot2 Package

Balls and Strikes Effects

Introduction

Hitter's Counts and Pitcher's Counts

Behaviors by Count

Career Trajectories

Introduction

Mickey Mantle's Batting Trajectory

Comparing Trajectories

General Patterns of Peak Ages

Trajectories and Fielding Position

Simulation

Introduction

Simulating a Half Inning

Simulating a Baseball Season

Exploring Streaky Performances

Introduction

The Great Streak

Streaks in Individual At-Bats

Local Patterns of Weighted On-Base Average

Learning about Park Effects by Database Management Tools

Introduction

Installing MySQL and Creating a Database

Connecting R to MySQL

Filling a MySQL Game Log Database from R

Querying Data from R

Baseball Data as MySQL Dumps

Calculating Basic Park Factors

Exploring Fielding Metrics with Contributed R Packages

Introduction

A Motivating Example: Comparing Fielding Metrics

Comparing Two Shortstops

Appendix A: Retrosheet Files Reference

Appendix B: Accessing and Using MLBAM Gameday and PITCHf/x Data

Bibliography

Index

Further Reading and Exercises appear at the end of each chapter.

Author Bio

Max Marchi is a baseball analyst with the Cleveland Indians. He was previously a statistician at the Emilia-Romagna Regional Health Agency. He has been a regular contributor to The Hardball Times and Baseball Prospectus websites and has consulted for MLB clubs.

Jim Albert is a professor of statistics at Bowling Green State University. He has authored or coauthored several books and is the editor of the Journal of Quantitative Analysis of Sports. His interests include Bayesian modeling, statistics education, and the application of statistical thinking in sports.

Categories: Quantitative methods in sport, Statistics & Probability, Sports Performance Analysis, Sports Business