

Keith Karani


The Lahman package contains a database of pitching, hitting, and fielding statistics for Major League Baseball from 1871 to 2022. It covers the two current leagues (American and National), four other major leagues (American Association, Union Association, Players League, Federal League), and the National Association of 1871 to 1875.

Data Dictionary

The data consists of the following main tables:

  1. People - player names, dates of birth and death, and other biographical information.

  2. Batting - batting statistics

  3. Pitching - pitching statistics

  4. Fielding - fielding statistics

A collection of other tables is also provided:

Teams:

    Teams - yearly stats and standings
    TeamsHalf - split season data for teams
    TeamsFranchises - franchise information

Post-season play:

    BattingPost - post-season batting statistics
    PitchingPost - post-season pitching statistics
    FieldingPost - post-season fielding data
    SeriesPost - post-season series information

Awards:

    AwardsManagers - awards won by managers
    AwardsPlayers - awards won by players
    AwardsShareManagers - award voting for manager awards
    AwardsSharePlayers - award voting for player awards

Hall of Fame: links to People via hofID

    HallOfFame - Hall of Fame voting data

Information relating to a player in the different tables is tagged with his playerID, which links it to names and birthdates in the People table.
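As a minimal sketch of how playerID ties the tables together, the example below joins a batting-style table to a people-style table with base R's `merge()`. The two small data frames are made-up stand-ins (hypothetical IDs and values), not rows from the Lahman data; with the package loaded, the same call should apply to `Batting` and `People`.

```r
# Toy stand-ins for People and Batting; all IDs and values are made up.
people <- data.frame(
  playerID  = c("doejo01", "roeja01"),
  nameFirst = c("John", "Jane"),
  nameLast  = c("Doe", "Roe")
)
batting <- data.frame(
  playerID = c("doejo01", "doejo01", "roeja01"),
  yearID   = c(2021, 2022, 2022),
  HR       = c(10, 15, 7)
)

# Attach names to every batting row via the shared playerID key.
batting_named <- merge(batting, people, by = "playerID")
print(batting_named)
```

Each batting row keeps its season statistics and gains the name columns, so one player with several seasons appears once per season.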

Other tables:

    AllstarFull - All-Star game appearances
    Managers - managerial statistics
    FieldingOF - outfield position data
    ManagersHalf - split season data for managers
    Salaries - player salary data
    Appearances - data on player appearances
    Schools - information on schools players attended
    CollegePlaying - information on schools players attended, by player and year

Variable label tables are provided for some of the tables:

battingLabels, pitchingLabels, fieldingLabels


Lahman, S. (2023) Lahman’s Baseball Database, 1871-2022, Main page,

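Load packages to use

library(Lahman)    # baseball data, including the Teams table
library(dplyr)     # mutate(), filter(), group_by(), summarize(), glimpse()
library(tidyr)     # drop_na()
library(ggplot2)   # plotting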


Take a first look at the Teams table from the Lahman package; its columns are the variables described in the package's data dictionary.

teams <- Teams
glimpse(teams)

Rows: 3,015
Columns: 48
$ yearID         <int> 1871, 1871, 1871, 1871, 1871, 1871, 1871, 1871, 1871, 1…
$ lgID           <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ teamID         <fct> BS1, CH1, CL1, FW1, NY2, PH1, RC1, TRO, WS3, BL1, BR1, …
$ franchID       <fct> BNA, CNA, CFC, KEK, NNA, PNA, ROK, TRO, OLY, BLC, ECK, …
$ divID          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Rank           <int> 3, 2, 8, 7, 5, 1, 9, 6, 4, 2, 9, 6, 1, 7, 8, 3, 4, 5, 1…
$ G              <int> 31, 28, 29, 19, 33, 28, 25, 29, 32, 58, 29, 37, 48, 22,…
$ Ghome          <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ W              <int> 20, 19, 10, 7, 16, 21, 4, 13, 15, 35, 3, 9, 39, 6, 5, 3…
$ L              <int> 10, 9, 19, 12, 17, 7, 21, 15, 15, 19, 26, 28, 8, 16, 19…
$ DivWin         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ WCWin          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ LgWin          <chr> "N", "N", "N", "N", "N", "Y", "N", "N", "N", "N", "N", …
$ WSWin          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ R              <int> 401, 302, 249, 137, 302, 376, 231, 351, 310, 617, 152, …
$ AB             <int> 1372, 1196, 1186, 746, 1404, 1281, 1036, 1248, 1353, 25…
$ H              <int> 426, 323, 328, 178, 403, 410, 274, 384, 375, 753, 248, …
$ X2B            <int> 70, 52, 35, 19, 43, 66, 44, 51, 54, 106, 29, 35, 107, 2…
$ X3B            <int> 37, 21, 40, 8, 21, 27, 25, 34, 26, 31, 9, 10, 30, 5, 9,…
$ HR             <int> 3, 10, 7, 2, 1, 9, 3, 6, 6, 14, 0, 1, 7, 0, 2, 4, 4, 5,…
$ BB             <int> 60, 60, 26, 33, 33, 46, 38, 49, 48, 29, 18, 19, 29, 17,…
$ SO             <int> 19, 22, 25, 9, 15, 23, 30, 19, 13, 28, 40, 25, 26, 13, …
$ SB             <int> 73, 69, 18, 16, 46, 56, 53, 62, 48, 53, 8, 19, 48, 12, …
$ CS             <int> 16, 21, 8, 4, 15, 12, 10, 24, 13, 18, 13, 16, 14, 3, 7,…
$ HBP            <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ SF             <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ RA             <int> 303, 241, 341, 243, 313, 266, 287, 362, 303, 434, 413, …
$ ER             <int> 109, 77, 116, 97, 121, 137, 108, 153, 137, 166, 160, 16…
$ ERA            <dbl> 3.55, 2.76, 4.11, 5.17, 3.72, 4.95, 4.30, 5.51, 4.37, 2…
$ CG             <int> 22, 25, 23, 19, 32, 27, 23, 28, 32, 48, 28, 37, 41, 15,…
$ SHO            <int> 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 4, 0, 0, 3, 1, 2, 0…
$ SV             <int> 3, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 4, 0, 0, 1, 0, 1, 0…
$ IPouts         <int> 828, 753, 762, 507, 879, 747, 678, 750, 846, 1548, 778,…
$ HA             <int> 367, 308, 346, 261, 373, 329, 315, 431, 371, 573, 484, …
$ HRA            <int> 2, 6, 13, 5, 7, 3, 3, 4, 4, 3, 7, 6, 0, 6, 6, 2, 3, 2, …
$ BBA            <int> 42, 28, 53, 21, 42, 53, 34, 75, 45, 63, 36, 21, 27, 24,…
$ SOA            <int> 23, 22, 34, 17, 22, 16, 16, 12, 13, 77, 13, 13, 29, 11,…
$ E              <int> 243, 229, 234, 163, 235, 194, 220, 198, 218, 432, 274, …
$ DP             <int> 24, 16, 15, 8, 14, 13, 14, 22, 20, 22, 9, 15, 44, 17, 1…
$ FP             <dbl> 0.834, 0.829, 0.818, 0.803, 0.840, 0.845, 0.821, 0.845,…
$ name           <chr> "Boston Red Stockings", "Chicago White Stockings", "Cle…
$ park           <chr> "South End Grounds I", "Union Base-Ball Grounds", "Nati…
$ attendance     <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ BPF            <int> 103, 104, 96, 101, 90, 102, 97, 101, 94, 106, 87, 115, …
$ PPF            <int> 98, 102, 100, 107, 88, 98, 99, 100, 98, 102, 96, 122, 1…
$ teamIDBR       <chr> "BOS", "CHI", "CLE", "KEK", "NYU", "ATH", "ROK", "TRO",…
$ teamIDlahman45 <chr> "BS1", "CH1", "CL1", "FW1", "NY2", "PH1", "RC1", "TRO",…
$ teamIDretro    <chr> "BS1", "CH1", "CL1", "FW1", "NY2", "PH1", "RC1", "TRO",…

Let's conduct exploratory data analysis

Games in baseball are won by scoring runs, so for our first exploration we can find the average number of runs scored per game in each Major League Baseball season. We start by computing runs per game for every team.

teams_runs <- teams |> 
  mutate(runs_game = R/(W + L))

Next, we average the team-level runs per game within each year to get one value per season.

runs_per_yr <- teams_runs |> 
  group_by(yearID) |> 
  summarize(mean_runs = mean(runs_game, na.rm = TRUE))
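Note that averaging team-level rates is not the same as dividing total runs by total games when teams play different numbers of games, as they did in the earliest seasons. The sketch below shows the difference on three made-up team lines (illustrative values only, not Lahman data); in seasons with near-equal schedules the two figures should be close.

```r
# Made-up team totals for one season (illustrative values only).
toy <- data.frame(
  yearID = c(2000, 2000, 2000),
  R      = c(100, 330, 40),
  W      = c(10, 40, 5),
  L      = c(10, 20, 5)
)

# Team-level runs per game, as in the mutate() step.
toy$runs_game <- toy$R / (toy$W + toy$L)

# Unweighted mean of the team rates (what the summarize() step computes).
mean_of_rates <- mean(toy$runs_game)

# Pooled rate: all runs divided by all team-games.
pooled_rate <- sum(toy$R) / sum(toy$W + toy$L)

print(c(mean_of_rates = mean_of_rates, pooled_rate = pooled_rate))
```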

# Let's graph this summary and observe it over time

ggplot(runs_per_yr, aes(x = yearID,  y = mean_runs)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Average MLB Runs by Year",
    caption = "Source:"
  )

Which team scored the most runs in 2022?

runs_teams <- Teams |> 
  filter(yearID == 2022) |> 
  select(name, R)


# Arrange the runs in descending order to see which team scored the most
arrange(runs_teams, desc(R))
# plot
ggplot(runs_teams, aes(x = name, y = R)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(
    title = "Runs scored by each team",
    subtitle = "year 2022",
    x = "Teams",
    y = "Runs"
  )

Which team hit the most home runs in 2022?

# HR is the home run column; H holds hits
homeruns <- Teams |> 
  filter(yearID == 2022) |> 
  select(name, HR) 

arrange(homeruns, desc(HR))
# plot 
ggplot(homeruns, aes(x = name, y = HR)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(
    title = "Home runs by each team",
    subtitle = "year 2022",
    x = "Teams",
    y = "Home runs"
  )

How do different metrics compare across teams?

# Restrict to AL and NL teams from 2022 onward
teams <- Teams |> 
  filter(yearID >= 2022 & lgID %in% c("AL", "NL")) |> 
  drop_na() |> 
  group_by(yearID, teamID) |> 
  mutate(TB = H + X2B + 2 * X3B + 3 * HR,   # total bases
         WinPct = W/G,
         rpg = R/G,
         hrpg = HR/G,
         tbpg = TB/G,
         kpg = SO/G,
         k2bb = SO/BB,
         whip = 3 * (HA + BBA) / IPouts)     # walks plus hits allowed per inning pitched
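Two pieces of arithmetic in the mutate() call above are easy to misread: the total-bases shortcut and the factor of 3 applied to IPouts. The sketch below checks both on a single made-up team line (all values are illustrative assumptions, not Lahman data).

```r
# Made-up batting line for one team season (illustrative values only).
H <- 1400; X2B <- 280; X3B <- 20; HR <- 200

# Singles are the hits that are not doubles, triples or home runs.
singles <- H - X2B - X3B - HR

# Total bases from first principles: 1, 2, 3 and 4 bases per hit type.
tb_long <- singles + 2 * X2B + 3 * X3B + 4 * HR

# Shortcut form: every hit is worth one base, extra-base hits add the rest.
tb_short <- H + X2B + 2 * X3B + 3 * HR

# IPouts counts outs recorded, so innings pitched is IPouts / 3.
IPouts      <- 4350   # made-up value
baserunners <- 1800   # made-up count of events to express per inning
innings     <- IPouts / 3

rate_via_factor  <- 3 * baserunners / IPouts
rate_via_innings <- baserunners / innings

print(c(tb_long = tb_long, tb_short = tb_short))
print(c(rate_via_factor = rate_via_factor, rate_via_innings = rate_via_innings))
```

Both pairs agree, which is why the shorter expressions can be used directly inside mutate().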

# ggplot by year for selected team stats

yrPlot <- function(yvar, label) {
  # yvar is a column name passed as a string; .data[[yvar]] looks it up
  ggplot(teams, aes(x = yearID, y = .data[[yvar]])) +
    geom_point(size = 0.5) +
    geom_smooth(method = "loess") +
    labs(x = "Year", y = paste(label, "per game"))
}

Plot of win percentage against run differential (R - RA)

ggplot(teams, aes(x = R - RA, y = WinPct)) +
  geom_point(size = 0.75) +
  geom_smooth(method = "loess") +
  geom_hline(yintercept = 0.5, color = "red") +
  geom_vline(xintercept = 0, color = "orange") +
  labs(
    title = "Teams Win Percentage vs Run Differential",
    x = "Run differential",
    y = "Win percentage")

Teams with over 4 million attendance in a season

teams |> 
  filter(attendance >= 4e6) |> 
  select(yearID, lgID, teamID, Rank, attendance)
# A tibble: 6 × 56
# Groups:   yearID, teamID [6]
  yearID lgID  teamID franchID divID  Rank     G Ghome     W     L DivWin WCWin
   <int> <fct> <fct>  <fct>    <chr> <int> <int> <int> <int> <int> <chr>  <chr>
1   2022 NL    ARI    ARI      W         4   162    81    74    88 N      N    
2   2022 NL    ATL    ATL      E         1   162    81   101    61 Y      N    
3   2022 AL    BAL    BAL      E         4   162    81    83    79 N      N    
4   2022 AL    BOS    BOS      E         5   162    81    78    84 N      N    
5   2022 AL    CHA    CHW      C         2   162    81    81    81 N      N    
6   2022 NL    CHN    CHC      C         3   162    81    74    88 N      N    
# ℹ 44 more variables: LgWin <chr>, WSWin <chr>, R <int>, AB <int>, H <int>,
#   X2B <int>, X3B <int>, HR <int>, BB <int>, SO <int>, SB <int>, CS <int>,
#   HBP <int>, SF <int>, RA <int>, ER <int>, ERA <dbl>, CG <int>, SHO <int>,
#   SV <int>, IPouts <int>, HA <int>, HRA <int>, BBA <int>, SOA <int>, E <int>,
#   DP <int>, FP <dbl>, name <chr>, park <chr>, attendance <int>, BPF <int>,
#   PPF <int>, teamIDBR <chr>, teamIDlahman45 <chr>, teamIDretro <chr>,
#   TB <dbl>, WinPct <dbl>, rpg <dbl>, hrpg <dbl>, tbpg <dbl>, kpg <dbl>, …
ggplot(teams, aes(x = yearID, y = attendance/1000)) +
  geom_point() +
  facet_wrap(~ lgID)

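Average season home runs by park, post-2000

# Start from the full Teams table: `teams` only holds 2022 onward,
# so no park there could reach the 20-season threshold
Teams |> 
  filter(yearID >= 2000) |> 
  group_by(park) |> 
  summarise(meanHRpg = mean((HR + HRA)/Ghome), nyears = n()) |> 
  filter(nyears >= 20) |> 
  arrange(desc(meanHRpg)) |> 
  head(10)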

Of course, every baseball fan wants their team to win. Let's go ahead and create a model to predict wins by team.

base_df <- teams |> 
  ungroup() |>   # drop the yearID/teamID grouping so select() keeps only the listed columns
  drop_na() |> 
  select(name, yearID, W, L, R, H, X2B, X3B, HR, SO, RA) |> 
  filter(yearID >= 2009)

Let's train a linear model on the variables selected above and find out which of them are statistically significant.

lm1 <- lm(W ~ R + H + X2B + X3B + HR + SO + RA, data = base_df)
summary(lm1)

Call:
lm(formula = W ~ R + H + X2B + X3B + HR + SO + RA, data = base_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4969 -1.6329  0.0259  2.5762  4.9378 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 89.821959  21.045980   4.268 0.000314 ***
R            0.080205   0.025537   3.141 0.004749 ** 
H            0.002605   0.014597   0.178 0.860013    
X2B          0.045802   0.032690   1.401 0.175134    
X3B          0.059980   0.088585   0.677 0.505411    
HR           0.007321   0.044776   0.164 0.871610    
SO          -0.009420   0.008719  -1.080 0.291677    
RA          -0.100571   0.009168 -10.970 2.18e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.275 on 22 degrees of freedom
Multiple R-squared:  0.9622,    Adjusted R-squared:  0.9502 
F-statistic: 80.11 on 7 and 22 DF,  p-value: 3.537e-14
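
One way to avoid misreading a coefficient table by eye is to pull the p-values out of the summary object. The sketch below does that on `mtcars`, a data set shipped with R and used here purely as a stand-in; the formula and the 0.05 cutoff are assumptions for illustration, and the same calls should work on `lm1`.

```r
# mtcars ships with base R and is used here only as a stand-in data set.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# coef(summary(fit)) is a matrix with one row per term;
# the "Pr(>|t|)" column holds the p-values.
coefs    <- coef(summary(fit))
p_values <- coefs[, "Pr(>|t|)"]

# Keep the predictors (intercept dropped) whose p-value is below 0.05.
predictors  <- p_values[names(p_values) != "(Intercept)"]
significant <- names(predictors)[predictors < 0.05]

print(round(p_values, 4))
print(significant)
```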
# In lm1 only runs scored (R) and runs allowed (RA) are statistically significant; hits (H), doubles (X2B), triples (X3B), home runs (HR) and strikeouts (SO) are not.

We can fit a smaller model that keeps R and RA, retains H, and drops the remaining predictors, then compare the two.

lm2 <- lm(W ~ R + H + RA, data = base_df)
summary(lm2)

Call:
lm(formula = W ~ R + H + RA, data = base_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.6557 -1.5666  0.0654  2.6116  4.7752 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 69.168997  12.866651   5.376 1.25e-05 ***
R            0.078940   0.014397   5.483 9.43e-06 ***
H            0.021647   0.010548   2.052   0.0503 .  
RA          -0.103148   0.008748 -11.791 6.18e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.275 on 26 degrees of freedom
Multiple R-squared:  0.9554,    Adjusted R-squared:  0.9502 
F-statistic: 185.6 on 3 and 26 DF,  p-value: < 2.2e-16
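
A common way to compare a full and a reduced linear model is an F-test on the nested pair with `anova()`, alongside the adjusted R-squared of each. The sketch below uses `mtcars` as a stand-in data set with arbitrary formulas chosen for illustration; applied to this analysis, the call would be `anova(lm2, lm1)`.

```r
# Stand-in models on mtcars; `reduced` uses a subset of the terms in `full`.
full    <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)

# F-test for nested models: a large p-value suggests the dropped
# terms add little beyond what the reduced model already explains.
cmp <- anova(reduced, full)
print(cmp)

# Adjusted R-squared penalises extra predictors, so it is comparable
# across models with different numbers of terms.
adj <- c(reduced = summary(reduced)$adj.r.squared,
         full    = summary(full)$adj.r.squared)
print(adj)
```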

Using our second model, we can try predicting wins for the same teams the model was fitted on

preds <- predict(lm2, base_df)

# Store the predicted values in a column to compare with the actual wins

base_df$pred <- preds 

# plot the results

base_df |> 
  ggplot(aes(x = pred, y = W)) +
  geom_point() +
  geom_smooth() +
  labs(
    title = "Predicted wins against actual wins",
    x = "Predicted Wins",
    y = "Actual Wins"
  )
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
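
Because the predictions above are made on the same rows used to fit the model, they tend to look better than predictions on unseen data would. A minimal sketch of one way to check this is a hold-out split with root mean squared error (RMSE), shown here on `mtcars` as a stand-in; the seed, split size, formula, and the `rmse` helper are assumptions for illustration. With only 30 teams in `base_df`, a hold-out set would be small, so any such estimate should be read cautiously.

```r
set.seed(1)  # arbitrary seed so the split is reproducible

# Hold out 8 of the 32 mtcars rows for testing.
n        <- nrow(mtcars)
test_idx <- sample(n, size = 8)
train    <- mtcars[-test_idx, ]
test     <- mtcars[test_idx, ]

# Fit on the training rows only.
fit <- lm(mpg ~ wt + hp, data = train)

# Hypothetical helper: root mean squared error of a set of predictions.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse_train <- rmse(train$mpg, predict(fit, newdata = train))
rmse_test  <- rmse(test$mpg,  predict(fit, newdata = test))

print(c(train = rmse_train, test = rmse_test))
```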