Batter Up

A Computer Simulated Look at Baseball

By: Nathan Miller

The idea ... how it works ... numerical breakdown ... run production ... pitch count ... NLCS ... free agency ... NLCS again ... evaluation of results ... evaluation of program ... references

The idea...
    .through the creation of a computer simulation, I established a means by which any two major league baseball teams can be compared.  For the purposes of this project, I used the San Francisco Giants and the Los Angeles Dodgers.  I chose the Giants because they are the team I follow.  I chose the Dodgers for a slightly different reason.  Not only are the Giants and Dodgers longtime rivals, but their statistics are easily comparable, so neither team will always dominate the other.
    .the comparisons are based on the team batting statistics from the 2000 MLB Season.  The use of these batting statistics allows for the establishment of a foundation for each team's batting averages based on the team's statistical breakdown of the exact number of singles, doubles, triples and homeruns.  The teams have similar batting averages but contrastingly different statistical breakdowns.  Given this information, the results are quite interesting.  The exact numerical breakdown can be seen on the tree diagram.  The diagram illustrates, numerically, exactly what happens when the program is run.

...how it works...
    .the program was created entirely in Maple.  The program uses random numbers to produce results based on the inputs given to it.  The program selects a random number (throws a pitch), which determines a numerical output (in the strike zone or out of the strike zone).  If the pitch is out of the strike zone, one of four things may happen: the pitch is a ball (four balls result in a walk), the pitch hits the batter, the pitch is a strike (out of the strike zone but swung at by the batter; three strikes result in an out), or the pitch is hit into fair territory.  If the pitch is in the strike zone, one of two things may happen: the pitch is a strike (swung at and missed by the batter, or watched; three strikes result in an out), or the pitch is hit into fair territory.
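
The pitch-by-pitch logic described above can be sketched as follows. This is a minimal Python illustration, not the original Maple program. Every probability in it (`P_IN_ZONE` and the two outcome tables) is a made-up placeholder, since the write-up does not state the values actually used, and the names `pick` and `plate_appearance` are hypothetical.

```python
import random

# All probabilities below are illustrative assumptions; the write-up does
# not give the values used in the Maple program.
P_IN_ZONE = 0.55  # chance that a pitch is in the strike zone
OUT_OF_ZONE = {"ball": 0.60, "hit_by_pitch": 0.01, "strike": 0.24, "in_play": 0.15}
IN_ZONE = {"strike": 0.70, "in_play": 0.30}


def pick(rng, table):
    """Choose a key from `table` with probability equal to its value."""
    outcomes = list(table)
    return rng.choices(outcomes, weights=[table[o] for o in outcomes])[0]


def plate_appearance(rng):
    """Throw pitches until the at-bat ends; return (result, pitch_log)."""
    balls = strikes = 0
    log = []
    while True:
        # First random draw: is the pitch in or out of the strike zone?
        in_zone = rng.random() < P_IN_ZONE
        # Second random draw: what happens to that pitch?
        pitch = pick(rng, IN_ZONE if in_zone else OUT_OF_ZONE)
        log.append(pitch)
        if pitch == "ball":
            balls += 1
            if balls == 4:
                return "walk", log
        elif pitch == "strike":
            strikes += 1
            if strikes == 3:
                return "strikeout", log
        else:
            # "hit_by_pitch" or "in_play" ends the at-bat immediately
            return pitch, log


rng = random.Random(2000)
result, log = plate_appearance(rng)
print(result, len(log), log)
```

Keeping the two outcome tables separate mirrors the description: four possible outcomes for a pitch out of the strike zone, two for a pitch in it.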
    .once the pitch has been hit into fair territory, a random number is again selected to decide if the result will be an out or a base hit.  If the result is an out, the program stores it as such.  After three outs, the computer breaks its cycle and ends the inning.  If the result is a base hit, a random number will again be selected to decide if the base hit is a single, double, triple or homerun.  The program tallies all runs scored at the end of one inning as well as the number of pitches thrown.  The program's output shows runs scored, pitches thrown and the pitch-by-pitch result of the inning.
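
A rough Python sketch of the two draws described above (out or base hit, then the kind of hit) and of the three-out loop. It is not the original Maple code. The hit probability, the hit-type mix, and the base-running rule (every runner advances as many bases as the batter) are assumptions made for illustration only. To keep the sketch short, every batter is assumed to put the ball in play, so pitches are not counted here.

```python
import random

# Hypothetical inputs: neither the chance of a hit nor the hit-type mix is
# given in the write-up, so these numbers are placeholders only.
P_HIT_ON_BALL_IN_PLAY = 0.30
HIT_MIX = {"single": 0.70, "double": 0.18, "triple": 0.02, "homerun": 0.10}
BASES = {"single": 1, "double": 2, "triple": 3, "homerun": 4}


def ball_in_play(rng):
    """First draw: out or base hit. Second draw: which kind of hit."""
    if rng.random() >= P_HIT_ON_BALL_IN_PLAY:
        return "out"
    kinds = list(HIT_MIX)
    return rng.choices(kinds, weights=[HIT_MIX[k] for k in kinds])[0]


def half_inning(rng):
    """Play until three outs; return (runs, play_log).

    Assumed rule: each runner advances exactly as many bases as the batter.
    """
    outs = runs = 0
    runners = []  # base (1, 2 or 3) currently held by each runner
    log = []
    while outs < 3:
        play = ball_in_play(rng)
        log.append(play)
        if play == "out":
            outs += 1
            continue
        # The batter starts at base 0; move everyone forward together.
        advanced = [base + BASES[play] for base in runners + [0]]
        runs += sum(1 for base in advanced if base >= 4)
        runners = [base for base in advanced if base < 4]
    return runs, log


runs, log = half_inning(random.Random(2000))
print(runs, log)
```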
    .with this basic program established, a second program was created.  The second program runs the basic program a given number of times, in most cases nine (extra-inning games being the exception), to simulate a complete game.  The output yields the total number of runs and pitches for the game as well as a pitch-by-pitch account of the game.
    .a third program was then created to run the second program.  The purpose of the third program was to establish useful numerical results.  The third program outputs the average runs and pitch count for 100 games.
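
The layering of the second and third programs might look like this in Python. It is a sketch under stated assumptions, not the author's Maple code. `toy_inning` is a hypothetical stand-in for the basic inning program, and its run and pitch distributions are placeholders.

```python
import random
from statistics import mean


def toy_inning(rng):
    """Stand-in for the basic inning program; returns (runs, pitches).

    The distributions are placeholders, not the author's model.
    """
    runs = rng.choices([0, 1, 2, 3], weights=[70, 18, 8, 4])[0]
    pitches = rng.randint(6, 20)
    return runs, pitches


def simulate_game(rng, innings=9):
    """Second program: run the inning program `innings` times and total it.

    Pass a larger `innings` value to model an extra-inning game.
    """
    total_runs = total_pitches = 0
    for _ in range(innings):
        runs, pitches = toy_inning(rng)
        total_runs += runs
        total_pitches += pitches
    return total_runs, total_pitches


def average_over_games(rng, games=100):
    """Third program: average runs and pitch count over `games` games."""
    results = [simulate_game(rng) for _ in range(games)]
    avg_runs = mean(r for r, _ in results)
    avg_pitches = mean(p for _, p in results)
    return avg_runs, avg_pitches


avg_runs, avg_pitches = average_over_games(random.Random(2000))
print(avg_runs, avg_pitches)
```

Averaging over 100 games smooths out the game-to-game randomness, which is why the third program's output is more useful for comparison than any single simulated game.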

...numerical breakdown...

...run production...

Minimum: 1.06
q1: 1.6, q2: 2.41, q3: 3.22
Maximum: 3.86
Line of best fit: y = -2.33+17.45x

Minimum: 1.31
q1: 1.71, q2: 2.51, q3: 3.19
Maximum: 3.94
Line of best fit: y = -2.31+17.63x

...pitch count...

Minimum: 72.6
q1: 74.02, q2: 77.93, q3: 80.45
Maximum: 86.69
Line of best fit: y = 54.52+86.65x
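
Summaries like the ones above (minimum, quartiles, maximum, and a line of best fit) can be computed along these lines. This is a hedged Python sketch rather than the method actually used. The quartile convention (`method="inclusive"`) is an assumption, and the demonstration data are synthetic, chosen only so that the fitted line is easy to check. They are not the simulation's output.

```python
from statistics import mean, quantiles


def five_number_summary(values):
    """Return (minimum, q1, q2, q3, maximum) for a list of numbers."""
    # Quartile conventions differ between tools; "inclusive" is one choice.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


def best_fit_line(xs, ys):
    """Ordinary least squares; returns (intercept, slope) for y = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Synthetic demonstration data: 16 batting averages from .200 to .350 and
# an exactly linear made-up response y = 2 + 10x.
averages = [0.200 + 0.010 * i for i in range(16)]
made_up_runs = [2.0 + 10.0 * x for x in averages]

print(five_number_summary(made_up_runs))
print(best_fit_line(averages, made_up_runs))
```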

...National League Championship Series...
    .to make my simulation exciting, I set up a situation in which the San Francisco Giants played the Los Angeles Dodgers in the National League Championship Series (NLCS).  The Giants carried a team batting average of 0.275 into the series, while the Dodgers carried a 0.265.  As predicted, the Giants, with the higher batting average, won the series in six games.  The run production of the games can be seen in the graph below.

    .the Giants' run production is represented by the orange boxes and the Dodgers' by the blue crosses.  The Giants' scoring average for the series was 5.3 runs per game.  The Dodgers' scoring average was 4 runs per game.

...free agency...
    .the Giants' free agent acquisition during the season produced large increases in run production.  The graph below represents the increase in run production from one season to the next.  The boxes represent the first season and the crosses the second.  To simplify the graphs, the runs are plotted against the numbers 1-16, representing the averages.  The averages vary between .200 and .350 before the acquisition and .217 and .350 after.  The Giants' acquisition increased the run production from the previous year by an average of 0.65 runs, shown in the graph on the left.  If the Dodgers had been able to acquire the free agent, their run production would not have increased as much as the Giants'.  The Dodgers' run production would have increased by an average of only 0.093 runs.  This is seen in the graph on the right.  The blue boxes represent the Dodgers' run production prior to the acquisition and the black crosses represent the run production after.

    .during the off-season, the San Francisco Giants, due to their lower payroll, were able to pick up a free-agent all-star with a .350 batting average.  This bolstered their team average to 0.283.  They again met the Dodgers in the National League Championship Series.  This time, however, despite the Giants' new acquisition, the Dodgers beat the Giants in seven games.  The run production of the games can be seen in the graph below.

    .again the Giants are represented by the orange boxes and the Dodgers by the blue crosses.  The Giants' scoring average dropped to 2.6 runs per game; the Dodgers prevailed with an average of 2.14 runs per game.

...evaluation of results...
    .in the first NLCS the results were exactly what I expected.  The team with the higher batting average, and what I considered to be the better statistical breakdown, won the series.  The numbers for the two teams were close enough to make the results interesting, but not close enough to disprove my hypothesis.  The second NLCS, however, perplexed me.  The Giants' free agent acquisition bolstered the team average to 0.283, 0.008 points higher than the previous year, while the Dodgers remained at 0.265.  Even so, the Dodgers won the second NLCS in seven games.  Three things about the second NLCS are peculiar.  First, both the Giants' and the Dodgers' average run production decreased quite drastically from one year to the next, which is particularly strange given the Giants' free agent acquisition.  Second, the Dodgers managed to win the series despite having a lower run production average than the Giants over the seven games.  Third, the Giants were unable to win despite having a team average 0.018 points higher than the Dodgers'.  I believe the program's use of random numbers is to blame for these peculiarities...
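
How often randomness alone can overturn a short series can be estimated with a standard binomial argument. The Python sketch below is not part of the original project. It assumes games are independent and that one team wins each game with the same fixed probability `p`. The function name and the sample values of `p` are hypothetical.

```python
from math import comb


def series_win_probability(p, wins_needed=4):
    """Probability of winning a best-of-(2*wins_needed - 1) series.

    Assumes independent games, each won with probability p. The team wins
    the series in wins_needed + k games if it loses exactly k of the games
    played before its final, clinching win.
    """
    q = 1.0 - p
    return sum(
        comb(wins_needed - 1 + k, k) * p**wins_needed * q**k
        for k in range(wins_needed)
    )


# Hypothetical per-game win probabilities for the stronger team
for p in (0.50, 0.55, 0.60):
    print(p, round(series_win_probability(p), 3))
```

With a hypothetical per-game edge of 0.55, the stronger team wins a seven-game series only about 61% of the time under these assumptions, so an upset like the second NLCS is not unusual for a single simulated series.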

...evaluation of program
    .some aspects of real baseball are not included in the program.  Given time constraints and my lack of in-depth programming knowledge, some assumptions needed to be made.  The assumptions can be divided into four categories:

Offensive assumptions:
1) No mental errors
2) No base stealing
Batter assumptions:
1) Statistics based on 2000 S.F. Giants and L.A. Dodgers
Defensive assumptions:
1) No mental errors
2) No fielding errors
3) No special plays
Pitcher assumptions:
1) No wild pitches
2) No pitcher fatigue
3) Uniform pitcher

    .some of these assumptions may produce results that do not coincide with realistic baseball results.  For example, the pitch count produced by my program is extremely low when compared to real life results.  Through experimentation, though, I have noted that slight variations in the 'in the strike zone' versus 'out of the strike zone' percentages could cause the pitch count to vary up or down by as much as fifteen pitches.  The program's use of random numbers as a means for producing results could also allow for unrealistically high or low outputs.  However, in defense of the outputs produced by my programs, if I wanted the outputs to be identical to reality, I would have gone and bought a video game.

...references
    Hogg and Tanis: Probability and Statistical Inference
    Dr. Cynthia Wyels

Page updated: May 3, 2001
By: Nathan Miller