Learn R With Baseball Data: Top First Basemen Hitters of 2021 (In Hot Weather)

Spotswood
7 min read · Dec 8, 2021

A walkthrough of my first experience building a project in the R programming language.

Photo by David Lezcano on Unsplash

There are many tools at an analyst’s disposal. Excel has to be the most widely accepted, and it’s what I have the most experience with. It’s quick, efficient, and understood by almost all analysts. With that said, I’m ready to challenge myself to learn R with baseball data.

Some of my earliest memories come from my time in baseball. I started playing at age 3, and my baseball career lasted through my sophomore year of college. I’ve missed it ever since. After working different analyst jobs over the past few years, it’s time to dive back in and learn a new skill while I’m at it.

Learn R with Baseball Data

Those familiar with the current state of baseball are aware of the different use-cases for the vast amount of baseball data available. I’m interested in finding actionable insights that piggyback off the work of giants in the industry.

I’ve considered working on some ideas:

  • Report on the best hitters for teams in certain weather conditions
  • The league-wide effect of weather on batting performance
  • A regression analysis model finding the most significant metrics that contribute to improved hitting performance in bad weather conditions

This is where my mind goes, but there is already a lot of literature out there on the topic of weather and hitting performance in the MLB.

This analysis by Koch and Panorska solidifies my understanding of hitting performance in cold and warm weather conditions.

Here’s a more recent article on how weather can affect lineup decisions. It places more emphasis on the air density during games and how temperature affects air density.

The biggest takeaway: Hitters do best in hot weather conditions.

Okay. That’s helpful. I’m interested in getting more granular.

Freddie Freeman is a player I enjoy watching. He’s also a hot topic right now while he’s in negotiations with the Atlanta Braves.

For this analysis, I will compare Freddie Freeman’s hitting performance to that of every other first baseman in the league, but I want to add one more constraint: I want to analyze performance in games played above 70 degrees. Why? According to the first article I linked to above, Atlanta has the second-lowest number of games played in cold weather.

Why is this relevant? Freddie Freeman helps bring fans to games in Atlanta. Most of those games are in warm weather. Historically, hitters do better in warm weather.

Hypothesis

Freddie Freeman’s hitting performance in warm games is the best in the league on average.

Why do I think this? Freeman plays in more warm games than other first basemen.

Work In Sprints

To get started, I import data from Baseball Savant, view the data in R, and filter for first basemen in games above 70 degrees. I’ll import weather data later.
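The filter step can be sketched with dplyr. This is a minimal sketch on a made-up table: the column names `player`, `position`, and `temp_f` are assumptions for illustration, not the actual Baseball Savant export headers.

```r
library(dplyr)

# Toy rows standing in for an export; all names and values are hypothetical
events <- data.frame(
  player   = c("Player A", "Player B", "Player C", "Player D"),
  position = c("1B", "1B", "SS", "1B"),
  temp_f   = c(84, 61, 90, 73)
)

# Keep first basemen whose game temperature was above 70 degrees Fahrenheit
warm_first_basemen <- events %>%
  filter(position == "1B", temp_f > 70)

warm_first_basemen
```

Separating the conditions with a comma inside filter() combines them with a logical AND, so a row has to satisfy both to be kept.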

Here’s what my first file looks like after 20 minutes:

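# One-time setup; installing tidyverse also installs readxl, ggplot2, and dplyr
install.packages('tidyverse')

library(readxl)   # read_excel()
library(ggplot2)  # ggplot(), geom_point(), geom_smooth()

# Load the Baseball Savant export and open it in the data viewer
savant_data_12_7 <- read_excel("savant_data_12_7.xlsx")
View(savant_data_12_7)

# Scatter plot of pitches (x) against wOBA (y) with a smoothed trend line
ggplot(savant_data_12_7, aes(pitches, woba)) + geom_point() + geom_smooth()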

I want to look at the data from Baseball Savant to see if I’m moving in the right direction. The last line of code from above produced a graph like this:

pitches x wOBA

Let’s switch the pitches and wOBA coordinates and see if that gives a better visual:

wOBA x pitches

This makes sense. Batters that see the most pitches hover around a wOBA of slightly greater than .300.

It’s time to import weather data, filter hitters by those with the highest wOBA and longest hit distance, and present the data nicely.

Tasks In Coding Projects May Take Longer Than You Think

It took me a minute to try to filter data from Baseball Savant. Thankfully, I found better data with Stathead Baseball. I was able to filter data based on how many games every first baseman played in the 2020 and 2021 seasons in temperatures greater than 70 degrees.

Edit: I only ended up using data from the 2021 season because of how long it took to export Excel workbooks for every page of data returned by Stathead Baseball.

I exported the file as an Excel workbook and created a pivot table for the data. Here is a chart depicting the number of games each first baseman played in 2021 where the temperature was greater than 70 degrees:

First basemen with most games played > 70 degrees in 2021

I filtered data in Excel to find the number of games played and created the graph above.
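The same pivot-table count could also be done in R. This is a minimal sketch on a made-up game log; the `temp_f` column and the player names are assumptions for illustration and are not taken from the Stathead export.

```r
library(dplyr)

# Toy game log: one row per player per game (hypothetical names and temperatures)
games <- data.frame(
  Player = c("Player A", "Player A", "Player A", "Player A", "Player B", "Player B"),
  temp_f = c(75, 68, 88, 79, 91, 72)
)

# Count games above 70 degrees per player, most games first
warm_games <- games %>%
  filter(temp_f > 70) %>%
  count(Player, name = "warm_games", sort = TRUE)

warm_games
```

count() is a shortcut for grouping by Player and tallying the rows in each group, which is what the pivot table does.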

In 2021, Freddie Freeman played more games than any other first baseman under these conditions. Let’s look at his average wOBA during these games compared to other players.

Side note: Yuli Gurriel’s columns returned NA when I computed average wOBA, so it’s better to count him out of the group for now.

My Ignorance Showing

When trying to add a column to the summarised data frame, I tried so many things. I finally figured out that I can just carry the existing PA column along by summing it in the same summarise() call as the average. It worked. Here’s the code:

groupedwoba <- wobadata %>%
  group_by(Player) %>%
  # Average the per-row wOBA and total the plate appearances for each player
  summarise(avgwOBA = mean(wOBA), PA = sum(PA)) %>%
  arrange(desc(avgwOBA)) %>%
  filter(Player %in% c("Freddie Freeman", "Pete Alonso", "Paul Goldschmidt",
                       "Matt Olson", "Yuli Gurriel", "Carlos Santana", "José Abreu",
                       "Vladimir Guerrero Jr.", "Joey Votto", "Anthony Rizzo"))

Here’s the above code’s output:

Top First Basemen Hitters in 2021 according to average wOBA

I’m not sure why Yuli Gurriel’s data came back as NA. That leaves work for another time. Anyway, the above is a list of 2021’s top MLB first basemen according to avgwOBA during games where the temperature was above 70 degrees Fahrenheit.
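A possible cause, not verified against the real export: if any of a player’s rows has a zero denominator in the wOBA formula (for example, an appearance with no at-bats, walks, sacrifice flies, or hit-by-pitches), R evaluates 0/0 as NaN, and mean() passes that through unless na.rm = TRUE is set. A blank cell read in as NA has the same effect. A minimal base R sketch with made-up values:

```r
# Made-up per-game values; the third game has a zero numerator and denominator
numerator   <- c(0.89, 2.10, 0)
denominator <- c(4, 5, 0)
game_woba   <- numerator / denominator   # 0 / 0 evaluates to NaN

avg_default <- mean(game_woba)                # NaN: one undefined game spoils the average
avg_na_rm   <- mean(game_woba, na.rm = TRUE)  # averages only the two defined games

avg_default
avg_na_rm
```

Dropping undefined games with na.rm = TRUE is only one option. Summing the counting stats per player first and computing wOBA once from the totals would avoid the per-game division entirely.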

Wrapping It Up

The results are (finally) in, and it looks like Freddie Freeman was not the best-hitting first baseman by average wOBA in 2021 games where the temperature was above 70 degrees.

Here is a graph depicting the findings:

PA x avgwOBA for Top First Basemen Hitters in 2021

Here’s the code that produced the above graph:

# Bar chart of average wOBA against plate appearances, one color per player
testgraph <- ggplot(groupedwoba, aes(x = PA, y = avgwOBA, fill = Player)) +
  geom_col(position = "dodge")
testgraph

Interestingly, this graph included what seems to be the correct data for Yuli Gurriel. The groupedwoba object returned NA values for his stats earlier in the project. Weird, but the graph seems right.

Conclusion

Yuli Gurriel is without a doubt one of the most impressive first basemen of 2021. The results show that. They also show that Freddie Freeman is definitely not the only one who frequently plays in 70-degree weather. Freeman still puts up impressive numbers, but he’s not the best. I think it’s safe to say Freeman is among the top 5 hitting first basemen during warm weather, and I reject my original hypothesis.

Here’s my finished R file:

# One-time setup; installing tidyverse also installs dplyr, readxl, and ggplot2
install.packages('tidyverse')

library(dplyr)
library(readxl)
library(ggplot2)

# Baseball Savant export: keep three columns and plot wOBA against pitches
savant_data_12_7 <- read_excel("savant_data_12_7.xlsx")
mydata <- savant_data_12_7[, c(3, 1, 11)]
mydata
ggplot(savant_data_12_7, aes(woba, pitches)) + geom_point() + geom_smooth()

# Stathead Baseball export
sportsref_download <- read_excel("sportsref_download.xlsx",
col_types = c("numeric", "text", "date",
"text", "text", "text", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"text", "numeric", "numeric"))
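# Per-row wOBA; the 0.89 weight applies to singles (hits minus extra-base hits)
wobadata <- sportsref_download %>%
  select(Player, HBP, H, `2B`, `3B`, HR, AB, BB, IBB, SF, PA) %>%
  mutate(wOBA = (0.69 * (BB - IBB) + 0.72 * HBP + 0.89 * (H - `2B` - `3B` - HR) +
                   1.27 * `2B` + 1.62 * `3B` + 2.1 * HR) /
           (AB + BB - IBB + SF + HBP))
wobadata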
groupedwoba <- wobadata %>%
  group_by(Player) %>%
  # Average the per-row wOBA and total the plate appearances for each player
  summarise(avgwOBA = mean(wOBA), PA = sum(PA)) %>%
  arrange(desc(avgwOBA)) %>%
  filter(Player %in% c("Freddie Freeman", "Pete Alonso", "Paul Goldschmidt",
                       "Matt Olson", "Yuli Gurriel", "Carlos Santana", "José Abreu",
                       "Vladimir Guerrero Jr.", "Joey Votto", "Anthony Rizzo"))
# Bar chart of average wOBA against plate appearances, one color per player
testgraph <- ggplot(groupedwoba, aes(x = PA, y = avgwOBA, fill = Player)) +
  geom_col(position = "dodge")
testgraph

You can find more of the code here: https://github.com/spotswoodb/weathered_hitting
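The wOBA calculation in the script is easier to sanity-check on a single stat line. Here is a minimal base R sketch, assuming the same weights as the script; the function name `woba`, its argument names, and the example game are made up for illustration. In the standard wOBA formula the 0.89 weight belongs to singles, so the sketch derives singles as hits minus doubles, triples, and home runs.

```r
# wOBA for one stat line, using the weights from the script.
# Singles are derived as hits minus doubles, triples, and home runs.
woba <- function(H, X2B, X3B, HR, AB, BB, IBB, SF, HBP) {
  singles <- H - X2B - X3B - HR
  numerator <- 0.69 * (BB - IBB) + 0.72 * HBP + 0.89 * singles +
    1.27 * X2B + 1.62 * X3B + 2.1 * HR
  numerator / (AB + BB - IBB + SF + HBP)
}

# Hypothetical game: 4 at-bats with a single and a home run, plus one walk
example_woba <- woba(H = 2, X2B = 0, X3B = 0, HR = 1,
                     AB = 4, BB = 1, IBB = 0, SF = 0, HBP = 0)
example_woba  # (0.69 + 0.89 + 2.1) / 5 = 0.736
```

Wrapping the formula in a function makes it possible to test it on a line where the answer can be worked out by hand before applying it to a whole data frame.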

Future Improvement

There’s more room for analysis in this space. I’d like to regress some of the numbers like wOBA against plate appearances in warm games vs. cold games or average hit distance against game temperature.
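As a starting point, a regression of hit distance on game temperature could look like the base R sketch below. The data is simulated purely so the code runs; the variable names, the 0.4 slope, and the noise level are assumptions for illustration, not measurements or results.

```r
set.seed(42)

# Simulated data for illustration only: not real MLB measurements
temp_f   <- runif(200, min = 45, max = 95)
distance <- 150 + 0.4 * temp_f + rnorm(200, sd = 25)
hits     <- data.frame(temp_f, distance)

# Simple linear regression of hit distance on game temperature
fit <- lm(distance ~ temp_f, data = hits)

coef(fit)  # intercept and slope (distance units per degree Fahrenheit)
```

With real data, summary(fit) would show whether the temperature coefficient is distinguishable from zero.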

I hope you enjoyed this story. Thank you for reading, and double thank you if you read the whole thing. I appreciate your time more than you know. I’ve learned so much from this project since I began learning R yesterday, and I hope you do too. The opportunities are endless.

Baseball Savant

How the weather should factor in your fantasy baseball lineup decisions

Stathead Baseball

The Impact of Temperature on Major League Baseball

