MLB Player Similarity - Calculation

As I mentioned in Similar Players in MLB, I want to see how similar players can be. I decided to take a somewhat different approach by looking at how a player’s career compares against another player’s career. To put things in terms of a career, I didn’t want to simply sum up their statistics or normalize by the number of years they played. I wanted to be able to compare a player’s second year against another player’s second year. This isn’t a simple problem, though.

The first part of the problem: what statistics do you use to compare two different players? Using the Lahman Database, I had easy access to the common statistics like games played, at bats, runs scored, hits, doubles, triples, etc. However, this database doesn’t just hold counting stats by year. It records a player’s stats separately for each league and team they played for in a season. To simplify the data collection, I aggregated the information up to the combination of player and year.
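The aggregation step can be sketched in a few lines of Python. This is only an illustration, not the code I used: the rows below are made up, and the column layout (playerID, yearID, stint, teamID, then counting stats) is my assumption about how the Lahman batting records are shaped.

```python
from collections import defaultdict

# Hypothetical rows shaped like Lahman batting records (values are made up):
# (playerID, yearID, stint, teamID, G, AB, H)
rows = [
    ("playerA", 2001, 1, "BOS", 60, 200, 55),
    ("playerA", 2001, 2, "NYA", 70, 250, 70),
    ("playerA", 2002, 1, "NYA", 150, 580, 170),
    ("playerB", 2001, 1, "SEA", 140, 520, 150),
]

def aggregate_by_player_year(rows):
    """Sum counting stats over every team stint for each (player, year) pair."""
    totals = defaultdict(lambda: {"G": 0, "AB": 0, "H": 0})
    for player, year, _stint, _team, games, at_bats, hits in rows:
        season = totals[(player, year)]
        season["G"] += games
        season["AB"] += at_bats
        season["H"] += hits
    return dict(totals)

season_totals = aggregate_by_player_year(rows)
print(season_totals[("playerA", 2001)])  # {'G': 130, 'AB': 450, 'H': 125}
```

A player traded mid-season ends up with one row per year, which is the shape needed to line up one career against another.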

The second part of the problem: how accurate is it to compare a player’s first year against another player’s first year? Is it possible to correct for a player being sent to the majors a little early or taking a couple of years to develop? This means being able to compare a player’s first year against another player’s first, second, or perhaps third year to determine the closest match. How do you adjust for a player’s third year against a player’s first year? While I use SAS at work, I don’t have access to its functions at home. I noticed that SAS’s PROC SIMILARITY can calculate the minimum distance between two time series. Consider the example below of two players’ games played.
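To see why the year-for-year comparison can mislead, here is a small Python sketch. The two games-played series are invented for illustration, and `lockstep_distance` is a hypothetical helper name, not something from SAS or R.

```python
import math

def lockstep_distance(a, b):
    """Euclidean distance comparing career year i of a with career year i of b."""
    n = min(len(a), len(b))  # only compare the years both players have
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in range(n)))

# Hypothetical games played by career year (made-up numbers).
early_callup = [20, 150, 155, 160]  # brief debut, then full seasons
regular = [150, 155, 160, 158]      # full seasons from year one

# Year-for-year, the short debut season dominates the distance.
print(lockstep_distance(early_callup, regular))

# Dropping the debut year lines the two careers up exactly.
print(lockstep_distance(early_callup[1:], regular))  # 0.0
```

The two careers are nearly identical once the offset is removed, but a strict first-year-against-first-year comparison hides that.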

Note that they’re not exactly the same, but you can see the similarity between the two. Using dynamic programming, you can find the minimum distance between these two time series. The distance between two time series can be a simple Euclidean distance or something slightly different. Since I can’t use SAS at home, I was lucky to find an R package with the same capability. Using the “dtw” package, you can easily calculate the distance between two different time series. Applying this package to the above example gives the results below. This three-way plot shows each player’s data in the margins and how each point in one series maps to a point in the other. The closer the mapping runs to the line Y=X, the more closely one series tracks the other year for year.
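For readers who want to see the dynamic programming idea itself, here is a minimal Python sketch of basic dynamic time warping. It is not the R “dtw” package and its numbers should not be expected to match that package’s output: I am assuming absolute difference as the point-to-point cost, the plain match/insert/delete step pattern, and no window constraint. The two series are made up.

```python
def dtw(a, b):
    """Return (distance, path) for a basic dynamic time warping alignment.

    Assumptions: absolute difference as the local cost, unweighted
    diagonal/vertical/horizontal steps, no window constraint.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimum cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
            )

    # Walk back from the end to recover which years were matched.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path

# Hypothetical games played by career year (made-up numbers).
quick_riser = [50, 100, 150, 150, 150]
slow_starter = [50, 50, 100, 150, 150]  # one extra year at 50 games

distance, path = dtw(quick_riser, slow_starter)
print(distance)  # 0.0
print(path)
```

Each pair in `path` says which career year of the first player was matched to which career year of the second. Two identical series produce pairs like (0, 0), (1, 1), (2, 2), which is the Y=X diagonal; the further the pairs drift from that diagonal, the more one career had to be stretched to fit the other.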

The next problem is how to use these similarity measures to figure out which players are actually similar to each other. Keep an eye out for my next blog post on using Social Network Analysis to cluster these players together and incorporating more than just Games Played.

Christopher Teixeira
Department Chief Engineer

My interests include using my skills for the public good and playing with baseball data.