baseball_in_r

Author

Keith Karani

Introduction

The Lahman package contains a database of pitching, hitting and fielding statistics from Major League Baseball for 1871 through 2022. It covers the two current leagues (American and National), four other major leagues (American Association, Union Association, Players League, Federal League), and the National Association of 1871 to 1875.

Data Dictionary

The data is comprised of the following main tables:

  1. People - player names, dates of birth and death, and other biographical information.

  2. Batting - batting statistics

  3. Pitching - pitching statistics

  4. Fielding - fielding statistics

A collection of other tables is also provided:

Teams:

Teams - yearly stats and standings
TeamsHalf - split season data for teams
TeamsFranchises - franchise information

Post-season play:

BattingPost - post-season batting statistics
PitchingPost - post-season pitching statistics
FieldingPost - post-season fielding data
SeriesPost - post-season series information

Awards:

AwardsManagers - awards won by managers
AwardsPlayers - awards won by players
AwardsShareManagers - award voting for manager awards
AwardsSharePlayers - award voting for player awards

Hall of Fame: links to People via hofID

HallOfFame - Hall of Fame voting data

Information in different tables relating to a player is tagged with his playerID and linked to names and birthdates in the People table.
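
As a rough illustration of that linkage, the sketch below joins Batting to People on playerID to attach names to the largest 2022 home-run totals. The object name hr_2022 is made up for this example, and summing over rows assumes a player traded mid-season has one Batting row per stint.

```r
library(Lahman)
library(dplyr)

# Sum home runs per player for 2022 (one Batting row per stint),
# then look up first and last names in People via playerID.
hr_2022 <- Batting |>
  filter(yearID == 2022) |>
  group_by(playerID) |>
  summarise(HR = sum(HR), .groups = "drop") |>
  left_join(select(People, playerID, nameFirst, nameLast), by = "playerID") |>
  arrange(desc(HR)) |>
  head(5)

hr_2022
```

The same pattern should work for Pitching and Fielding, since they carry the same playerID key.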

Other tables:

AllstarFull - All-Star game appearances
Managers - managerial statistics
FieldingOF - outfield position data
ManagersHalf - split season data for managers
Salaries - player salary data
Appearances - data on player appearances
Schools - information on schools players attended
CollegePlaying - information on schools players attended, by player and year

Variable label tables are provided for some of the tables:

battingLabels, pitchingLabels, fieldingLabels
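
The label tables are handy for decoding short column names. A small sketch, assuming battingLabels has the two columns variable and label (as described in the package help); the object name batting_key is made up here:

```r
library(Lahman)

# Look up what a few batting column names stand for.
batting_key <- subset(battingLabels,
                      variable %in% c("H", "X2B", "X3B", "HR", "SO"))
batting_key
```

This makes it easy to check, for instance, that H and HR are different statistics before using them in a summary.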

Source

Lahman, S. (2023) Lahman’s Baseball Database, 1871-2022, Main page, https://www.seanlahman.com/baseball-archive/statistics/

Load packages to use

library(Lahman)
library(tidyr)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggplot2)
library(readr)
library(caret)
Loading required package: lattice

Load the Teams table from the Lahman package and inspect its first rows and column types

teams <- Teams

head(teams)
  yearID lgID teamID franchID divID Rank  G Ghome  W  L DivWin WCWin LgWin
1   1871   NA    BS1      BNA  <NA>    3 31    NA 20 10   <NA>  <NA>     N
2   1871   NA    CH1      CNA  <NA>    2 28    NA 19  9   <NA>  <NA>     N
3   1871   NA    CL1      CFC  <NA>    8 29    NA 10 19   <NA>  <NA>     N
4   1871   NA    FW1      KEK  <NA>    7 19    NA  7 12   <NA>  <NA>     N
5   1871   NA    NY2      NNA  <NA>    5 33    NA 16 17   <NA>  <NA>     N
6   1871   NA    PH1      PNA  <NA>    1 28    NA 21  7   <NA>  <NA>     Y
  WSWin   R   AB   H X2B X3B HR BB SO SB CS HBP SF  RA  ER  ERA CG SHO SV
1  <NA> 401 1372 426  70  37  3 60 19 73 16  NA NA 303 109 3.55 22   1  3
2  <NA> 302 1196 323  52  21 10 60 22 69 21  NA NA 241  77 2.76 25   0  1
3  <NA> 249 1186 328  35  40  7 26 25 18  8  NA NA 341 116 4.11 23   0  0
4  <NA> 137  746 178  19   8  2 33  9 16  4  NA NA 243  97 5.17 19   1  0
5  <NA> 302 1404 403  43  21  1 33 15 46 15  NA NA 313 121 3.72 32   1  0
6  <NA> 376 1281 410  66  27  9 46 23 56 12  NA NA 266 137 4.95 27   0  0
  IPouts  HA HRA BBA SOA   E DP    FP                    name
1    828 367   2  42  23 243 24 0.834    Boston Red Stockings
2    753 308   6  28  22 229 16 0.829 Chicago White Stockings
3    762 346  13  53  34 234 15 0.818  Cleveland Forest Citys
4    507 261   5  21  17 163  8 0.803    Fort Wayne Kekiongas
5    879 373   7  42  22 235 14 0.840        New York Mutuals
6    747 329   3  53  16 194 13 0.845  Philadelphia Athletics
                          park attendance BPF PPF teamIDBR teamIDlahman45
1          South End Grounds I         NA 103  98      BOS            BS1
2      Union Base-Ball Grounds         NA 104 102      CHI            CH1
3 National Association Grounds         NA  96 100      CLE            CL1
4               Hamilton Field         NA 101 107      KEK            FW1
5     Union Grounds (Brooklyn)         NA  90  88      NYU            NY2
6     Jefferson Street Grounds         NA 102  98      ATH            PH1
  teamIDretro
1         BS1
2         CH1
3         CL1
4         FW1
5         NY2
6         PH1
glimpse(teams)
Rows: 3,015
Columns: 48
$ yearID         <int> 1871, 1871, 1871, 1871, 1871, 1871, 1871, 1871, 1871, 1…
$ lgID           <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ teamID         <fct> BS1, CH1, CL1, FW1, NY2, PH1, RC1, TRO, WS3, BL1, BR1, …
$ franchID       <fct> BNA, CNA, CFC, KEK, NNA, PNA, ROK, TRO, OLY, BLC, ECK, …
$ divID          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Rank           <int> 3, 2, 8, 7, 5, 1, 9, 6, 4, 2, 9, 6, 1, 7, 8, 3, 4, 5, 1…
$ G              <int> 31, 28, 29, 19, 33, 28, 25, 29, 32, 58, 29, 37, 48, 22,…
$ Ghome          <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ W              <int> 20, 19, 10, 7, 16, 21, 4, 13, 15, 35, 3, 9, 39, 6, 5, 3…
$ L              <int> 10, 9, 19, 12, 17, 7, 21, 15, 15, 19, 26, 28, 8, 16, 19…
$ DivWin         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ WCWin          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ LgWin          <chr> "N", "N", "N", "N", "N", "Y", "N", "N", "N", "N", "N", …
$ WSWin          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ R              <int> 401, 302, 249, 137, 302, 376, 231, 351, 310, 617, 152, …
$ AB             <int> 1372, 1196, 1186, 746, 1404, 1281, 1036, 1248, 1353, 25…
$ H              <int> 426, 323, 328, 178, 403, 410, 274, 384, 375, 753, 248, …
$ X2B            <int> 70, 52, 35, 19, 43, 66, 44, 51, 54, 106, 29, 35, 107, 2…
$ X3B            <int> 37, 21, 40, 8, 21, 27, 25, 34, 26, 31, 9, 10, 30, 5, 9,…
$ HR             <int> 3, 10, 7, 2, 1, 9, 3, 6, 6, 14, 0, 1, 7, 0, 2, 4, 4, 5,…
$ BB             <int> 60, 60, 26, 33, 33, 46, 38, 49, 48, 29, 18, 19, 29, 17,…
$ SO             <int> 19, 22, 25, 9, 15, 23, 30, 19, 13, 28, 40, 25, 26, 13, …
$ SB             <int> 73, 69, 18, 16, 46, 56, 53, 62, 48, 53, 8, 19, 48, 12, …
$ CS             <int> 16, 21, 8, 4, 15, 12, 10, 24, 13, 18, 13, 16, 14, 3, 7,…
$ HBP            <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ SF             <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ RA             <int> 303, 241, 341, 243, 313, 266, 287, 362, 303, 434, 413, …
$ ER             <int> 109, 77, 116, 97, 121, 137, 108, 153, 137, 166, 160, 16…
$ ERA            <dbl> 3.55, 2.76, 4.11, 5.17, 3.72, 4.95, 4.30, 5.51, 4.37, 2…
$ CG             <int> 22, 25, 23, 19, 32, 27, 23, 28, 32, 48, 28, 37, 41, 15,…
$ SHO            <int> 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 4, 0, 0, 3, 1, 2, 0…
$ SV             <int> 3, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 4, 0, 0, 1, 0, 1, 0…
$ IPouts         <int> 828, 753, 762, 507, 879, 747, 678, 750, 846, 1548, 778,…
$ HA             <int> 367, 308, 346, 261, 373, 329, 315, 431, 371, 573, 484, …
$ HRA            <int> 2, 6, 13, 5, 7, 3, 3, 4, 4, 3, 7, 6, 0, 6, 6, 2, 3, 2, …
$ BBA            <int> 42, 28, 53, 21, 42, 53, 34, 75, 45, 63, 36, 21, 27, 24,…
$ SOA            <int> 23, 22, 34, 17, 22, 16, 16, 12, 13, 77, 13, 13, 29, 11,…
$ E              <int> 243, 229, 234, 163, 235, 194, 220, 198, 218, 432, 274, …
$ DP             <int> 24, 16, 15, 8, 14, 13, 14, 22, 20, 22, 9, 15, 44, 17, 1…
$ FP             <dbl> 0.834, 0.829, 0.818, 0.803, 0.840, 0.845, 0.821, 0.845,…
$ name           <chr> "Boston Red Stockings", "Chicago White Stockings", "Cle…
$ park           <chr> "South End Grounds I", "Union Base-Ball Grounds", "Nati…
$ attendance     <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ BPF            <int> 103, 104, 96, 101, 90, 102, 97, 101, 94, 106, 87, 115, …
$ PPF            <int> 98, 102, 100, 107, 88, 98, 99, 100, 98, 102, 96, 122, 1…
$ teamIDBR       <chr> "BOS", "CHI", "CLE", "KEK", "NYU", "ATH", "ROK", "TRO",…
$ teamIDlahman45 <chr> "BS1", "CH1", "CL1", "FW1", "NY2", "PH1", "RC1", "TRO",…
$ teamIDretro    <chr> "BS1", "CH1", "CL1", "FW1", "NY2", "PH1", "RC1", "TRO",…

Let's conduct exploratory data analysis

A baseball game is won by the team that scores more runs, so our first exploration is the average number of runs scored per game in each Major League Baseball season. We start by computing runs per game for every team-season.

teams_runs <- teams |> 
  mutate(runs_game = R/(W + L))


head(teams_runs)
  yearID lgID teamID franchID divID Rank  G Ghome  W  L DivWin WCWin LgWin
1   1871   NA    BS1      BNA  <NA>    3 31    NA 20 10   <NA>  <NA>     N
2   1871   NA    CH1      CNA  <NA>    2 28    NA 19  9   <NA>  <NA>     N
3   1871   NA    CL1      CFC  <NA>    8 29    NA 10 19   <NA>  <NA>     N
4   1871   NA    FW1      KEK  <NA>    7 19    NA  7 12   <NA>  <NA>     N
5   1871   NA    NY2      NNA  <NA>    5 33    NA 16 17   <NA>  <NA>     N
6   1871   NA    PH1      PNA  <NA>    1 28    NA 21  7   <NA>  <NA>     Y
  WSWin   R   AB   H X2B X3B HR BB SO SB CS HBP SF  RA  ER  ERA CG SHO SV
1  <NA> 401 1372 426  70  37  3 60 19 73 16  NA NA 303 109 3.55 22   1  3
2  <NA> 302 1196 323  52  21 10 60 22 69 21  NA NA 241  77 2.76 25   0  1
3  <NA> 249 1186 328  35  40  7 26 25 18  8  NA NA 341 116 4.11 23   0  0
4  <NA> 137  746 178  19   8  2 33  9 16  4  NA NA 243  97 5.17 19   1  0
5  <NA> 302 1404 403  43  21  1 33 15 46 15  NA NA 313 121 3.72 32   1  0
6  <NA> 376 1281 410  66  27  9 46 23 56 12  NA NA 266 137 4.95 27   0  0
  IPouts  HA HRA BBA SOA   E DP    FP                    name
1    828 367   2  42  23 243 24 0.834    Boston Red Stockings
2    753 308   6  28  22 229 16 0.829 Chicago White Stockings
3    762 346  13  53  34 234 15 0.818  Cleveland Forest Citys
4    507 261   5  21  17 163  8 0.803    Fort Wayne Kekiongas
5    879 373   7  42  22 235 14 0.840        New York Mutuals
6    747 329   3  53  16 194 13 0.845  Philadelphia Athletics
                          park attendance BPF PPF teamIDBR teamIDlahman45
1          South End Grounds I         NA 103  98      BOS            BS1
2      Union Base-Ball Grounds         NA 104 102      CHI            CH1
3 National Association Grounds         NA  96 100      CLE            CL1
4               Hamilton Field         NA 101 107      KEK            FW1
5     Union Grounds (Brooklyn)         NA  90  88      NYU            NY2
6     Jefferson Street Grounds         NA 102  98      ATH            PH1
  teamIDretro runs_game
1         BS1 13.366667
2         CH1 10.785714
3         CL1  8.586207
4         FW1  7.210526
5         NY2  9.151515
6         PH1 13.428571
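
To make the runs_game arithmetic concrete, here is a small check using the 1871 Boston Red Stockings row from the output above (R = 401, W = 20, L = 10, G = 31). The variable names are made up for this sketch. It also shows, as an alternative rather than a correction, what dividing by games played (G) would give, since G can exceed W + L, for example when a game ends in a tie.

```r
# Values copied from the 1871 Boston Red Stockings row shown above.
runs   <- 401
wins   <- 20
losses <- 10
games  <- 31

runs_per_decision <- runs / (wins + losses)  # what runs_game computes
runs_per_game     <- runs / games            # alternative: divide by G

round(runs_per_decision, 6)  # 13.366667, matching the runs_game column
round(runs_per_game, 3)      # 12.935
```

For modern seasons the two denominators are almost always equal, so the choice mainly affects the early years.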

We can then average runs per game across all teams for each season

runs_per_yr <- teams_runs |> 
  group_by(yearID) |> 
  summarize(mean_runs = mean(runs_game, na.rm = TRUE))

head(runs_per_yr)
# A tibble: 6 × 2
  yearID mean_runs
   <int>     <dbl>
1   1871     10.5 
2   1872      8.85
3   1873      8.21
4   1874      7.35
5   1875      5.54
6   1876      5.93
# let's graph this summary and observe it over time

ggplot(runs_per_yr, aes(x = yearID,  y = mean_runs)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Average MLB Runs by Year",
    caption = "Source: https://www.seanlahman.com/baseball-archive/statistics/"
  ) +
  theme_minimal()

Which team scored the most runs in 2022?

runs_teams <- Teams |> 
  group_by(name) |> 
  filter(yearID == 2022) |> 
  select(name, R)

#head(runs_teams)

# arrange the runs in descending order to see which team scored the most
arrange(runs_teams, desc(R))
# A tibble: 30 × 2
# Groups:   name [30]
   name                      R
   <chr>                 <int>
 1 Los Angeles Dodgers     847
 2 New York Yankees        807
 3 Atlanta Braves          789
 4 Toronto Blue Jays       775
 5 New York Mets           772
 6 St. Louis Cardinals     772
 7 Philadelphia Phillies   747
 8 Houston Astros          737
 9 Boston Red Sox          735
10 Milwaukee Brewers       725
# ℹ 20 more rows
# plot
ggplot(runs_teams, aes(x = name, y = R)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(
    title = "Runs scored by each team",
    subtitle = "year 2022",
    x = "Teams",
    y = "Runs"
  ) +
  theme_minimal()

Which team recorded the most hits in 2022?

The H column in Teams holds hits (home runs are stored in HR), so the ranking below is by hits.

team_hits <- Teams |> 
  group_by(name) |> 
  filter(yearID == 2022) |> 
  select(name, H) 


arrange(team_hits, desc(H))
# A tibble: 30 × 2
# Groups:   name [30]
   name                      H
   <chr>                 <int>
 1 Toronto Blue Jays      1464
 2 Chicago White Sox      1435
 3 Boston Red Sox         1427
 4 New York Mets          1422
 5 Los Angeles Dodgers    1418
 6 Cleveland Guardians    1410
 7 Colorado Rockies       1408
 8 Atlanta Braves         1394
 9 Philadelphia Phillies  1392
10 St. Louis Cardinals    1386
# ℹ 20 more rows
# plot 
ggplot(team_hits, aes(x = name, y = H)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(
    title = "Hits by each team",
    subtitle = "year 2022",
    x = "Teams",
    y = "Hits"
  ) +
  theme_minimal()

How do different metrics compare across teams?

# Restrict to AL and NL teams from 2022 onward
teams <- Teams |> 
  filter(yearID >= 2022 & lgID %in% c("AL", "NL")) |> 
  drop_na() |> 
  group_by(yearID, teamID) |> 
  mutate(TB = H + X2B + 2 * X3B + 3 * HR,
         WinPct = W/G,
         rpg = R/G,
         hrpg = HR/G,
         tbpg = TB/G,
         kpg = SO/G,
         k2bb = SO/BB,
         whip = 3 * (HA + BBA) / IPouts)  # WHIP uses hits and walks allowed per inning pitched

# ggplot by year for selected team stats
# yvar is the column name as a string, label is the text used on the y axis

yrPlot <- function(yvar, label) {
  ggplot(teams, aes(x = yearID, y = .data[[yvar]])) +
    geom_point(size = 0.5) +
    geom_smooth(method = "loess") +
    labs(x = "Year", y = paste(label, "per game"))
}
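
The helper above is defined but never called, and with only the 2022 season in teams a trend by year would have nothing to show. The sketch below is one way such a helper might be used on a longer span. The names teams_pg and plot_by_year are made up for this example, and the 1901 cutoff for the "modern era" is an assumption, not something fixed by the analysis above.

```r
library(Lahman)
library(dplyr)
library(ggplot2)

# Assumption: 1901 is taken as the start of the modern era; adjust as needed.
teams_pg <- Teams |>
  filter(yearID >= 1901, lgID %in% c("AL", "NL")) |>
  mutate(rpg = R / G, hrpg = HR / G)

# Hypothetical helper: plots one per-game column (named as a string) by year.
plot_by_year <- function(data, yvar, label) {
  ggplot(data, aes(x = yearID, y = .data[[yvar]])) +
    geom_point(size = 0.5) +
    geom_smooth(method = "loess") +
    labs(x = "Year", y = paste(label, "per game"))
}

p <- plot_by_year(teams_pg, "hrpg", "Home runs")
p
```

Passing the data frame as an argument, rather than reading a global teams object, keeps the helper usable on any filtered subset.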

Plot of win percentage against run differential (R - RA)

ggplot(teams, aes(x = R - RA, y = WinPct)) +
  geom_point(size = 0.75) +
  geom_smooth(method = "loess") +
  geom_hline(yintercept = 0.5, color = "red") +
  geom_vline(xintercept = 0, color = "orange") +
  labs(
    title = "Teams Win Percentage vs Run Differential",
    x = "Run differential",
    y = "Win percentage") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

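Teams with over 4 million attendance in a season

Note that teams now holds only the 2022 season, and no club reached 4 million that year, so the query below returns no rows.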

teams |> 
  filter(attendance >= 4e6) |> 
  select(yearID, lgID, teamID, Rank, attendance) |> 
  arrange(desc(attendance))
# A tibble: 0 × 5
# Groups:   yearID, teamID [0]
# ℹ 5 variables: yearID <int>, lgID <fct>, teamID <fct>, Rank <int>,
#   attendance <int>
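
To search every season rather than just 2022, the same query can be run against the full Teams table. This is a sketch only; big_crowds is a name made up for this example, and no output is shown because it depends on the installed Lahman version.

```r
library(Lahman)
library(dplyr)

# Same filter as above, applied to all seasons in Teams.
# Rows with missing attendance are dropped explicitly for clarity.
big_crowds <- Teams |>
  filter(!is.na(attendance), attendance >= 4e6) |>
  select(yearID, lgID, teamID, Rank, attendance) |>
  arrange(desc(attendance))

big_crowds
```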
head(teams)
# A tibble: 6 × 56
# Groups:   yearID, teamID [6]
  yearID lgID  teamID franchID divID  Rank     G Ghome     W     L DivWin WCWin
   <int> <fct> <fct>  <fct>    <chr> <int> <int> <int> <int> <int> <chr>  <chr>
1   2022 NL    ARI    ARI      W         4   162    81    74    88 N      N    
2   2022 NL    ATL    ATL      E         1   162    81   101    61 Y      N    
3   2022 AL    BAL    BAL      E         4   162    81    83    79 N      N    
4   2022 AL    BOS    BOS      E         5   162    81    78    84 N      N    
5   2022 AL    CHA    CHW      C         2   162    81    81    81 N      N    
6   2022 NL    CHN    CHC      C         3   162    81    74    88 N      N    
# ℹ 44 more variables: LgWin <chr>, WSWin <chr>, R <int>, AB <int>, H <int>,
#   X2B <int>, X3B <int>, HR <int>, BB <int>, SO <int>, SB <int>, CS <int>,
#   HBP <int>, SF <int>, RA <int>, ER <int>, ERA <dbl>, CG <int>, SHO <int>,
#   SV <int>, IPouts <int>, HA <int>, HRA <int>, BBA <int>, SOA <int>, E <int>,
#   DP <int>, FP <dbl>, name <chr>, park <chr>, attendance <int>, BPF <int>,
#   PPF <int>, teamIDBR <chr>, teamIDlahman45 <chr>, teamIDretro <chr>,
#   TB <dbl>, WinPct <dbl>, rpg <dbl>, hrpg <dbl>, tbpg <dbl>, kpg <dbl>, …
ggplot(teams, aes(x = yearID, y = attendance/1000)) +
  geom_point() +
  facet_wrap(~ lgID)

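Average season home runs per home game by park, post-2000

Because teams holds only the 2022 season, the yearID filter has no effect and no park can reach the 20-season threshold, so the result below is empty.

teams |> 
   filter(yearID >= 2000) |>
   group_by(park) |>
     summarise(meanHRpg = mean((HR + HRA)/Ghome), nyears = n()) |>
     filter(nyears >= 20) |>
     arrange(desc(meanHRpg)) |>
     head(10) 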
# A tibble: 0 × 3
# ℹ 3 variables: park <chr>, meanHRpg <dbl>, nyears <int>
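
A version of this summary that can return rows needs more than one season, so the sketch below runs it on the full Teams table. The name park_hr is made up here. Note also that HR and HRA are season totals over all games, home and away, so dividing by Ghome gives only a rough proxy for how homer-friendly a park is.

```r
library(Lahman)
library(dplyr)

# Seasons from 2000 on, all teams; skip rows without a home-game count.
park_hr <- Teams |>
  filter(yearID >= 2000, !is.na(Ghome), Ghome > 0) |>
  group_by(park) |>
  summarise(meanHRpg = mean((HR + HRA) / Ghome), nyears = n()) |>
  filter(nyears >= 20) |>
  arrange(desc(meanHRpg)) |>
  head(10)

park_hr
```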

Of course, every baseball fan wants their team to win. Let's create a model to predict wins by team.

base_df <- teams |> 
  drop_na() |> 
  select(name, yearID, W, L, R, H, X2B, X3B, HR, SO, RA) |> 
  filter(yearID >= 2009)
Adding missing grouping variables: `teamID`
head(base_df) 
# A tibble: 6 × 12
# Groups:   yearID, teamID [6]
  teamID name       yearID     W     L     R     H   X2B   X3B    HR    SO    RA
  <fct>  <chr>       <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 ARI    Arizona D…   2022    74    88   702  1232   262    24   173  1341   740
2 ATL    Atlanta B…   2022   101    61   789  1394   298    11   243  1498   609
3 BAL    Baltimore…   2022    83    79   674  1281   275    25   171  1390   688
4 BOS    Boston Re…   2022    78    84   735  1427   352    12   155  1373   787
5 CHA    Chicago W…   2022    81    81   686  1435   272     9   149  1269   717
6 CHN    Chicago C…   2022    74    88   657  1293   265    31   159  1448   731

Let's train a linear model on the variables selected above and find out how statistically significant they are.

lm1 <- lm(W ~ R + H + X2B + X3B + HR + SO + RA, data = base_df)

summary(lm1)

Call:
lm(formula = W ~ R + H + X2B + X3B + HR + SO + RA, data = base_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4969 -1.6329  0.0259  2.5762  4.9378 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 89.821959  21.045980   4.268 0.000314 ***
R            0.080205   0.025537   3.141 0.004749 ** 
H            0.002605   0.014597   0.178 0.860013    
X2B          0.045802   0.032690   1.401 0.175134    
X3B          0.059980   0.088585   0.677 0.505411    
HR           0.007321   0.044776   0.164 0.871610    
SO          -0.009420   0.008719  -1.080 0.291677    
RA          -0.100571   0.009168 -10.970 2.18e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.275 on 22 degrees of freedom
Multiple R-squared:  0.9622,    Adjusted R-squared:  0.9502 
F-statistic: 80.11 on 7 and 22 DF,  p-value: 3.537e-14
# observation
# hits (H), doubles (X2B), triples (X3B), home runs (HR) and strikeouts (SO) are not statistically significant at the 5% level; only runs scored (R) and runs allowed (RA) are

We can fit a second model that keeps the significant variables R and RA, along with H, and compare it with the first

lm2 <- lm(W ~ R + H + RA, data = base_df)

summary(lm2)

Call:
lm(formula = W ~ R + H + RA, data = base_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.6557 -1.5666  0.0654  2.6116  4.7752 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 69.168997  12.866651   5.376 1.25e-05 ***
R            0.078940   0.014397   5.483 9.43e-06 ***
H            0.021647   0.010548   2.052   0.0503 .  
RA          -0.103148   0.008748 -11.791 6.18e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.275 on 26 degrees of freedom
Multiple R-squared:  0.9554,    Adjusted R-squared:  0.9502 
F-statistic: 185.6 on 3 and 26 DF,  p-value: < 2.2e-16
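
Both summaries report an adjusted R-squared of 0.9502, which hints that the four dropped predictors add little. One way to compare the two fits more formally is a nested-model F test together with AIC. The sketch below rebuilds the 2022 data directly from Teams so that it runs on its own; team_2022, m_small, m_full, cmp and aic_tab are names made up for this example.

```r
library(Lahman)

# 2022 team totals, the same season used for the models above.
team_2022 <- subset(Teams, yearID == 2022,
                    select = c(W, R, H, X2B, X3B, HR, SO, RA))

m_full  <- lm(W ~ R + H + X2B + X3B + HR + SO + RA, data = team_2022)
m_small <- lm(W ~ R + H + RA, data = team_2022)

# F test: do the extra terms in m_full reduce residual error significantly?
cmp <- anova(m_small, m_full)
cmp

# AIC penalises extra parameters; lower is better.
aic_tab <- AIC(m_small, m_full)
aic_tab
```

A large p-value in the anova table would support keeping the smaller model.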

Using our second model, we can try predicting team wins

preds <- predict(lm2, base_df)

# store the predicted values in a column to compare with the actual wins

base_df$pred <- preds 

base_df
# A tibble: 30 × 13
# Groups:   yearID, teamID [30]
   teamID name      yearID     W     L     R     H   X2B   X3B    HR    SO    RA
   <fct>  <chr>      <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
 1 ARI    Arizona …   2022    74    88   702  1232   262    24   173  1341   740
 2 ATL    Atlanta …   2022   101    61   789  1394   298    11   243  1498   609
 3 BAL    Baltimor…   2022    83    79   674  1281   275    25   171  1390   688
 4 BOS    Boston R…   2022    78    84   735  1427   352    12   155  1373   787
 5 CHA    Chicago …   2022    81    81   686  1435   272     9   149  1269   717
 6 CHN    Chicago …   2022    74    88   657  1293   265    31   159  1448   731
 7 CIN    Cincinna…   2022    62   100   648  1264   235    18   156  1430   815
 8 CLE    Clevelan…   2022    92    70   698  1410   273    31   127  1122   634
 9 COL    Colorado…   2022    68    94   698  1408   280    34   149  1330   873
10 DET    Detroit …   2022    66    96   557  1240   235    27   110  1413   713
# ℹ 20 more rows
# ℹ 1 more variable: pred <dbl>
# plot the results

base_df |> 
  ggplot(aes(x = pred, y = W)) +
  geom_point() +
  geom_smooth() +
  labs(
    title = "Predicted wins against actual wins",
    x = "Predicted Wins",
    y = "Actual Wins"
  ) +
  theme_minimal()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
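
The plot can be complemented with a numeric summary of how far the predictions are from actual wins. The sketch below computes root mean squared error (RMSE) and mean absolute error (MAE) in base R, rebuilding the 2022 data and the three-variable model so that it runs on its own. The object names are made up for this example. Because the model is scored on the same 30 teams it was fitted to, these are in-sample errors and will tend to look better than errors on unseen seasons.

```r
library(Lahman)

team_2022 <- subset(Teams, yearID == 2022, select = c(name, W, R, H, RA))

fit <- lm(W ~ R + H + RA, data = team_2022)
team_2022$pred <- predict(fit, newdata = team_2022)

# Error in wins: actual minus predicted.
resid_wins <- team_2022$W - team_2022$pred

rmse <- sqrt(mean(resid_wins^2))
mae  <- mean(abs(resid_wins))
c(rmse = rmse, mae = mae)

# The three teams the model misses by the most.
worst <- team_2022[order(-abs(resid_wins)), c("name", "W", "pred")][1:3, ]
worst
```

Fitting on earlier seasons and predicting a held-out year would give a more honest estimate of predictive accuracy.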