A little about me…




Christopher Teixeira

Data Scientist



Interests

  • Applied Probability
  • Data Visualization
  • Machine Learning
  • Responsible AI

---
displayMode: ""
---
gantt
    title Professional Timeline
    dateFormat  YYYY-MM
    axisFormat %Y
    todayMarker off
    section Education
    WPI (BS Mathematics):a1, 2002-09, 2006-06
    GMU (MS Operations Research):a2, 2008-08, 2010-12
    section Experience
    SAIC                    :b1, 2006-06, 2010-11
    IBM                     :b2, 2010-11, 2012-08
    Epsilon                 :b3, 2012-08, 2014-07
    MITRE                   :b4, 2014-07, 2026-04
    To be Announced :active, b5, 2026-04, 2028-12

Reading in data

Code
# baseballr provides functions for accessing baseball data, including 
#   the statcast_search function which allows you to access Statcast data from MLB.

# More info can be found here: https://baseballr.com/reference/statcast_search/

library(baseballr)

# Download a week of batting data from Statcast.
df <- statcast_search(
  start_date = "2025-05-01", 
  end_date = "2025-05-07", 
  player_type = "batter"
)

# View the data frame in a friendly format for Quarto
knitr::kable(head(df), format="html")
pitch_type game_date release_speed release_pos_x release_pos_z player_name batter pitcher events description spin_dir spin_rate_deprecated break_angle_deprecated break_length_deprecated zone des game_type stand p_throws home_team away_team type hit_location bb_type balls strikes game_year pfx_x pfx_z plate_x plate_z on_3b on_2b on_1b outs_when_up inning inning_topbot hc_x hc_y tfs_deprecated tfs_zulu_deprecated umpire sv_id vx0 vy0 vz0 ax ay az sz_top sz_bot hit_distance_sc launch_speed launch_angle effective_speed release_spin_rate release_extension game_pk fielder_2 fielder_3 fielder_4 fielder_5 fielder_6 fielder_7 fielder_8 fielder_9 release_pos_y estimated_ba_using_speedangle estimated_woba_using_speedangle woba_value woba_denom babip_value iso_value launch_speed_angle at_bat_number pitch_number pitch_name home_score away_score bat_score fld_score post_away_score post_home_score post_bat_score post_fld_score if_fielding_alignment of_fielding_alignment spin_axis delta_home_win_exp delta_run_exp bat_speed swing_length estimated_slg_using_speedangle delta_pitcher_run_exp hyper_speed home_score_diff bat_score_diff home_win_exp bat_win_exp age_pit_legacy age_bat_legacy age_pit age_bat n_thruorder_pitcher n_priorpa_thisgame_player_at_bat pitcher_days_since_prev_game batter_days_since_prev_game pitcher_days_until_next_game batter_days_until_next_game api_break_z_with_gravity api_break_x_arm api_break_x_batter_in arm_angle attack_angle attack_direction swing_path_tilt intercept_ball_minus_batter_pos_x_inches intercept_ball_minus_batter_pos_y_inches
KC 2025-05-07 85.8 -1.23 6.69 Chapman, Matt 656305 676962 strikeout swinging_strike_blocked NA NA NA NA 14 Matt Chapman strikes out swinging. R R R CHC SF S 2 2 2 2025 0.24 -0.10 0.1722480 0.8157616 NA 642715 NA 2 5 Top NA NA NA NA NA NA 2.854902 -124.7786 -7.422442 1.923976 26.60315 -31.88787 3.440000 1.660000 NA NA NA 86.2 2227 6.7 778020 608348 457759 663538 542932 621020 664023 691718 673548 53.78 NA 0.000000 0 1 0 0 NA 39 5 Knuckle Curve 1 3 3 1 3 1 3 1 Standard Standard 70 0.024 -0.220 65.7 8.7 NA 0.220 NA -2 2 0.265 0.735 25 32 26 32 3 2 5 1 6 2 3.22 -0.24 -0.24 50.2 16.39766 -36.450081 30.99105 40.65807 47.07546
FF 2025-05-07 95.1 -1.44 6.63 Chapman, Matt 656305 676962 ball NA NA NA NA 11 Ball R R R CHC SF B NA 1 2 2025 -0.18 1.28 -0.2905371 4.1049142 NA 642715 NA 2 5 Top NA NA NA NA NA NA 3.446445 -138.4992 -3.908060 -3.035323 29.83358 -14.98578 3.212147 1.583246 NA NA NA 96.3 2178 6.8 778020 608348 457759 663538 542932 621020 664023 691718 673548 53.67 NA NA NA NA NA NA NA 39 4 4-Seam Fastball 1 3 3 1 3 1 3 1 Standard Standard 204 -0.004 0.012 NA NA NA -0.012 NA -2 2 0.269 0.731 25 32 26 32 3 2 5 1 6 2 1.23 0.18 0.18 49.0 NA NA NA NA NA
FF 2025-05-07 94.5 -1.15 6.56 Chapman, Matt 656305 676962 ball NA NA NA NA 14 Ball R R R CHC SF B NA 0 2 2025 -0.39 1.35 0.8352713 2.1587841 NA 642715 NA 2 5 Top NA NA NA NA NA NA 6.063019 -137.3188 -8.826828 -6.217584 30.38614 -13.29819 3.254478 1.625378 NA NA NA 95.0 2074 6.7 778020 608348 457759 663538 542932 621020 664023 691718 673548 53.82 NA NA NA NA NA NA NA 39 3 4-Seam Fastball 1 3 3 1 3 1 3 1 Standard Standard 204 -0.003 0.015 NA NA NA -0.015 NA -2 2 0.272 0.728 25 32 26 32 3 2 5 1 6 2 1.21 0.39 0.39 48.0 NA NA NA NA NA
SI 2025-05-07 93.0 0.74 5.67 Capra, Vinny 681962 664285 field_out hit_into_play NA NA NA NA 8 Brewers challenged (play at 1st), call on the field was upheld: Vinny Capra grounds out, pitcher Framber Valdez to first baseman Christian Walker. R R L MIL HOU X 1 ground_ball 2 1 2025 1.19 0.26 0.2660518 1.9673470 NA NA NA 2 7 Bot 119.89 194.25 NA NA NA NA -3.792811 -135.3023 -4.220491 15.361850 32.19198 -28.18434 3.180000 1.500000 2 65.5 -44 92.3 2138 6.1 778015 673237 572233 663898 670623 665161 701305 676694 676801 54.44 0.160 0.148000 0 1 0 0 2 57 4 Sinker 1 6 1 6 6 1 1 6 Standard Standard 134 -0.003 -0.296 67.5 6.9 0.161 0.296 88.0 -5 -5 0.021 0.021 31 28 32 29 3 2 5 2 6 17 2.39 1.19 -1.19 36.1 10.63560 9.183412 32.07321 38.23280 23.51717
FF 2025-05-07 96.3 1.21 6.06 Maldonado, Martín 455117 608331 field_out hit_into_play NA NA NA NA 6 Martín Maldonado flies out sharply to center fielder Cody Bellinger. R R L NYY SD X 8 fly_ball 2 2 2025 0.18 1.47 0.6825741 2.4890805 NA NA NA 2 7 Top 132.54 42.17 NA NA NA NA -1.821405 -140.1033 -7.259126 2.824731 34.04066 -11.35005 3.380000 1.570000 387 103.7 31 96.8 2363 6.8 778019 669224 502671 678391 665828 683011 691176 641355 592450 53.74 0.641 1.279542 0 1 0 0 6 48 5 4-Seam Fastball 0 1 1 0 1 0 1 0 Standard Standard 158 0.011 -0.224 71.7 7.7 2.526 0.224 103.7 -1 1 0.353 0.647 31 38 31 39 3 2 5 2 6 2 1.00 0.18 -0.18 51.3 14.48879 5.455989 33.84591 45.37516 29.01512
CU 2025-05-07 77.0 0.74 5.74 Capra, Vinny 681962 664285 ball NA NA NA NA 12 Ball R R L MIL HOU B NA 1 1 2025 -1.04 -1.05 1.2998572 2.7571789 NA NA NA 2 7 Bot NA NA NA NA NA NA 3.054727 -112.0218 2.561620 -9.318751 21.90586 -41.66357 3.221396 1.607159 NA NA NA 76.1 2727 5.8 778015 673237 572233 663898 670623 665161 701305 676694 676801 54.68 NA NA NA NA NA NA NA 57 3 Curveball 1 6 1 6 6 1 1 6 Standard Standard 314 0.000 0.053 NA NA NA -0.053 NA -5 -5 0.021 0.021 31 28 32 29 3 2 5 2 6 17 4.92 -1.04 1.04 38.3 NA NA NA NA NA
Code
# pybaseball provides functions for accessing baseball data, including 
#   the statcast function which allows you to access Statcast data from MLB.

# More info can be found here: https://github.com/jldbc/pybaseball

import pybaseball as pb
import pandas as pd

df = pb.statcast(
    start_dt="2025-05-01", 
    end_dt="2025-05-07",
    verbose=False
) 
# View the data frame in a friendly format for Quarto
print(df.head().to_markdown(index=False))
pitch_type game_date release_speed release_pos_x release_pos_z player_name batter pitcher events description spin_dir spin_rate_deprecated break_angle_deprecated break_length_deprecated zone des game_type stand p_throws home_team away_team type hit_location bb_type balls strikes game_year pfx_x pfx_z plate_x plate_z on_3b on_2b on_1b outs_when_up inning inning_topbot hc_x hc_y tfs_deprecated tfs_zulu_deprecated umpire sv_id vx0 vy0 vz0 ax ay az sz_top sz_bot hit_distance_sc launch_speed launch_angle effective_speed release_spin_rate release_extension game_pk fielder_2 fielder_3 fielder_4 fielder_5 fielder_6 fielder_7 fielder_8 fielder_9 release_pos_y estimated_ba_using_speedangle estimated_woba_using_speedangle woba_value woba_denom babip_value iso_value launch_speed_angle at_bat_number pitch_number pitch_name home_score away_score bat_score fld_score post_away_score post_home_score post_bat_score post_fld_score if_fielding_alignment of_fielding_alignment spin_axis delta_home_win_exp delta_run_exp bat_speed swing_length estimated_slg_using_speedangle delta_pitcher_run_exp hyper_speed home_score_diff bat_score_diff home_win_exp bat_win_exp age_pit_legacy age_bat_legacy age_pit age_bat n_thruorder_pitcher n_priorpa_thisgame_player_at_bat pitcher_days_since_prev_game batter_days_since_prev_game pitcher_days_until_next_game batter_days_until_next_game api_break_z_with_gravity api_break_x_arm api_break_x_batter_in arm_angle attack_angle attack_direction swing_path_tilt intercept_ball_minus_batter_pos_x_inches intercept_ball_minus_batter_pos_y_inches
FC 2025-05-07 00:00:00 86.9 -1.96 5.41 Pagán, Emilio 592696 641941 strikeout swinging_strike 7 Eddie Rosario strikes out swinging. R L R ATL CIN S 2 nan 1 2 2025 0.33 0.65 -0.619877 2.03626 671739 2 9 Bot 2.57022 -126.55 -3.30744 3.07245 24.4291 -24.6717 3.35 1.62 88 2400 6.8 778023 663886 668715 680574 669289 682829 694362 670770 677956 53.66 0.0 0.0 1 0 0 70 5 Cutter 3 4 3 4 4 3 3 4 Infield shade Strategic 199 -0.099 -0.183 68.4 7.9 0.183 -1 -1 0.099 0.099 34 33 34 34 1 0 1 7 3 2.35 -0.33 0.33 38.2 19.318001205408 -12.604511466980634 28.10243518548305 44.832212177848795 35.250630209870266
FF 2025-05-07 00:00:00 95.6 -1.81 5.58 Pagán, Emilio 592696 641941 nan foul 11 Foul R L R ATL CIN S nan 1 2 2025 -0.88 1.54 -0.516144 3.70284 671739 2 9 Bot 5.41895 -139.086 -2.81051 -12.6175 31.9237 -11.7958 3.35 1.62 226 80.9 62 96.2 2648 6.7 778023 663886 668715 680574 669289 682829 694362 670770 677956 53.8 70 4 4-Seam Fastball 3 4 3 4 4 3 3 4 Infield shade Strategic 214 0 0 65.2 7.1 0 88.0 -1 -1 0.099 0.099 34 33 34 34 1 0 1 7 3 0.96 0.88 -0.88 40 11.65636616091739 -13.12215688959648 22.8287189010902 42.578184020970454 30.960237026850052
FS 2025-05-07 00:00:00 83.2 -2.01 5.39 Pagán, Emilio 592696 641941 nan blocked_ball 13 Ball In Dirt R L R ATL CIN B nan 0 2 2025 -0.95 0.03 -0.598467 -0.570105 671739 2 9 Bot 5.1799 -121.001 -7.38721 -10.3128 23.0312 -30.6815 3.36039 1.44416 84.7 1028 7.2 778023 663886 668715 680574 669289 682829 694362 670770 677956 53.28 70 3 Split-Finger 3 4 3 4 4 3 3 4 Infield shade Strategic 241 0 0.009 -0.009 -1 -1 0.099 0.099 34 33 34 34 1 0 1 7 3 3.26 0.95 -0.95 37.9
FC 2025-05-07 00:00:00 87.3 -1.92 5.49 Pagán, Emilio 592696 641941 nan swinging_strike 13 Swinging Strike R L R ATL CIN S nan 0 1 2025 0.16 0.58 -1.08005 2.42234 671739 2 9 Bot 1.70213 -127.283 -2.51703 1.45595 23.9962 -25.5608 3.35 1.62 88.6 2455 6.8 778023 663886 668715 680574 669289 682829 694362 670770 677956 53.7 70 2 Cutter 3 4 3 4 4 3 3 4 Infield shade Strategic 178 0 -0.055 66.3 7.0 0.055 -1 -1 0.099 0.099 34 33 34 34 1 0 1 7 3 2.38 -0.16 0.16 41.6 5.126280742107504 23.36120966854403 27.126194862765757 51.43128585676405 13.633519173010594
FC 2025-05-07 00:00:00 89.1 -1.95 5.52 Pagán, Emilio 592696 641941 nan swinging_strike 14 Swinging Strike R L R ATL CIN S nan 0 0 2025 0.24 0.67 0.197498 1.27333 671739 2 9 Bot 4.82683 -129.612 -5.84193 1.62425 27.9346 -23.5849 3.35 1.62 90 2465 6.9 778023 663886 668715 680574 669289 682829 694362 670770 677956 53.55 70 1 Cutter 3 4 3 4 4 3 3 4 Infield shade Strategic 195 0 -0.042 70.4 8.7 0.042 -1 -1 0.099 0.099 34 33 34 34 1 0 1 7 3 2.21 -0.24 0.24 40.9 29.90112031496792 -38.31602940321417 35.061709811554415 36.5660666665002 54.539916242397666

Changing data types

Code
library(dplyr)

# Subset the data down to a select few columns to work with. 
# Then convert the player_name and description variables to factors. 
df.subset <- df |> 
  select(player_name, 
         launch_speed, 
         launch_angle, 
         launch_speed_angle, 
         description) |>
  mutate(across(c(player_name, description), factor))

# View the data frame in a friendly format for Quarto
knitr::kable(head(df.subset), format="html")
player_name launch_speed launch_angle launch_speed_angle description
Chapman, Matt NA NA NA swinging_strike_blocked
Chapman, Matt NA NA NA ball
Chapman, Matt NA NA NA ball
Capra, Vinny 65.5 -44 2 hit_into_play
Maldonado, Martín 103.7 31 6 hit_into_play
Capra, Vinny NA NA NA ball
Code
# Subset the data down to a select few columns to work with. 
# Then convert the player_name and description variables to categorical variables. 
df_subset = df[['player_name', 'description', 'launch_speed', 'launch_angle', 'launch_speed_angle']].copy()
categorical_cols = ["player_name", "description"]
df_subset[categorical_cols] = df_subset[categorical_cols].apply(pd.Categorical)

# View the data frame in a friendly format for Quarto
print(df_subset.head().to_markdown(index=False))
player_name description launch_speed launch_angle launch_speed_angle
Pagán, Emilio swinging_strike
Pagán, Emilio foul 80.9 62
Pagán, Emilio blocked_ball
Pagán, Emilio swinging_strike
Pagán, Emilio swinging_strike

Exploratory data analysis

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.1


Four primary types of EDA:

  1. Univariate non-graphical: Describe the data and find patterns that exist within a single variable.
  2. Univariate graphical: For a single variable, explore the values visually using graphs like box plots or histograms.
  3. Multivariate non-graphical: Describe the relationships between two or more variables in the data.
  4. Multivariate graphical: Visualize the relationships between two or more variables through graphs like scatter plots or heat maps.
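
The two non-graphical types can be sketched with nothing more than Python's standard library. The values below are a handful of illustrative launch speeds and launch angles, and the helper names `describe` and `pearson` are made up for this sketch; they are not part of any package used in these slides.

```python
import statistics

# Illustrative values only (a handful of batted balls), not the full dataset.
launch_speed = [108.2, 85.8, 102.2, 84.5, 84.2, 64.2, 93.1, 106.2]
launch_angle = [6, 39, 15, 48, 45, -33, 56, -18]


def describe(values):
    """Univariate non-graphical EDA: summarize a single variable."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def pearson(x, y):
    """Multivariate non-graphical EDA: linear correlation of two variables."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


speed_summary = describe(launch_speed)
speed_angle_corr = pearson(launch_speed, launch_angle)

print(speed_summary)
print(round(speed_angle_corr, 3))
```

With a pandas data frame, `DataFrame.describe()` and `DataFrame.corr()` cover the same ground in one call each.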

Applying EDA: Univariate

Code
# Use the skimr package to examine the dataset. 
# Produces a high level summary (# of rows/columns, column types)
# For each column type, it produces details about each variable.

library(skimr)
skim(df.subset)
Data summary
Name df.subset
Number of rows 25000
Number of columns 5
Key NULL
_______________________
Column type frequency:
factor 2
numeric 3
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
player_name 0 1 FALSE 401 Alo: 140, Kwa: 139, Sot: 138, Ols: 130
description 0 1 FALSE 15 bal: 8409, fou: 4532, hit: 4457, cal: 4087

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
launch_speed 16607 0.34 83.33 15.32 11.1 73.6 83.2 95.3 118.4 ▁▁▅▇▃
launch_angle 16588 0.34 17.35 32.77 -87.0 -5.0 20.0 42.0 89.0 ▁▃▇▇▃
launch_speed_angle 20553 0.18 3.23 1.30 1.0 2.0 3.0 4.0 6.0 ▇▆▆▂▂
Code
# Use the pyskim package to examine the dataset. 
# Produces a high level summary (# of rows/columns, column types)
# For each column type, it produces details about each variable.

from pyskim import skim
skim(df_subset)
── Data Summary ────────────────────────────────────────────────────────────────────────────────────
type                 value
-----------------  -------
Number of rows       27706
Number of columns        5
──────────────────────────────────────────────────
Column type frequency:
            Count
--------  -------
Int64           2
category        1
category        1
Float64         1

── Variable type: number ───────────────────────────────────────────────────────────────────────────
    name                  na_count    mean    sd     p0    p25    p50    p75    p100  hist
--  ------------------  ----------  ------  ----  -----  -----  -----  -----  ------  ----------
 0  launch_speed             18407   83.3   15.3   11.1   73.5   83.2   95.3     118  ▁▁▁▁▂▆▇▆▆▁
 1  launch_angle             18388   17.5   32.6  -87     -4     20     42        89  ▁▁▂▄▄▇▇▆▄▁
 2  launch_speed_angle       22751    3.23   1.3    1      2      3      4         6  ▁▁▇▁▆▁▆▁▂▂

── Variable type: category ─────────────────────────────────────────────────────────────────────────
    name           na_count    n_unique  top_counts
--  -----------  ----------  ----------  ------------------------------------------------------
 0  player_name           0         424  Allen, Logan: 195, Patrick, Chad: 194, Fried, Max: 192
 1  description           0          15  ball: 9327, foul: 4993, hit_into_play: 4966

Applying EDA: Multivariate

Code
library(dplyr)
library(ggplot2)
library(ggExtra)
library(ggthemes)
library(scales)

# Create a scatter plot with the launch speed and launch angle.
# theme_hc() is applied before theme() so it does not overwrite the legend setting.
g <- ggplot(df.subset |>
              filter(description=="hit_into_play"), 
            aes(x=launch_speed, 
                y=launch_angle)) + 
  geom_point() +
  labs(x="Launch Speed",
       y="Launch Angle") +
  scale_x_continuous(labels=label_number(scale_cut=cut_short_scale())) +
  scale_y_continuous(labels=label_number(scale_cut=cut_short_scale())) +
  theme_hc() +
  theme(legend.position="none")

ggMarginal(g, type="histogram")

Code
import seaborn as sns
import matplotlib.pyplot as plt

# Create a scatter plot with the launch speed and launch angle.
g = sns.jointplot(
  data=df_subset,
  x='launch_speed',
  y='launch_angle',
  kind='scatter')
g.set_axis_labels("Launch Speed", "Launch Angle")

plt.show()

Working with missing data

Questions to ask when working with missing data

  • Does “missing” mean something different from “0”?
    • If you have data on the amount of candy sold per day, does a missing value mean that no candy was sold, or that the amount sold is unknown?
  • Is “missing” captured in another way?
    • Sometimes negative values or “99” can imply a value is missing.
  • Was there a change in how data was being captured?
    • For long standing data capture initiatives (e.g., surveys), the data collection methods can change without notice to the analysts.
      • Was the way data was being captured changed?
      • Did the range of values change?
      • Do the values represent something different?
  • Does it make sense to replace missing values?
    • If a variable is mostly missing, replacing it with any method could lead to false conclusions.
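
The first two questions can be made concrete with a small sketch. The candy sales values and the sentinel codes `-1` and `99` are hypothetical; in practice the codes would come from the data dictionary or from whoever collected the data.

```python
# Hypothetical daily candy sales. In this made-up feed, -1 and 99 are sentinel
# codes for "not recorded", while 0 is a real observation (nothing was sold).
raw_sales = [12, 0, -1, 7, 99, 15, 0, 99]
SENTINELS = {-1, 99}  # assumption: taken from the data dictionary


def decode_missing(values, sentinels):
    """Replace sentinel codes with None so they are not treated as real numbers."""
    return [None if v in sentinels else v for v in values]


sales = decode_missing(raw_sales, SENTINELS)
n_missing = sum(v is None for v in sales)   # days with an unknown amount
n_zero = sum(v == 0 for v in sales)         # days where nothing was sold

print(sales)
print(n_missing, n_zero)
```

Keeping "unknown" and "zero" as separate values matters: averaging the raw list would treat 99 as a real sales figure and inflate the result.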

Imputing missing data

There are two general approaches:

  • Overly simple approach: replace missing values with the mean, median, or mode
  • Sophisticated approach: replace missing values by building a model for each variable with missing data, using the rest of the dataset as predictors
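
As a point of comparison for the model-based methods, here is a minimal sketch of the simple approach using the median. The launch speeds are hypothetical, and `impute_median` is a helper written for this illustration rather than a function from any library.

```python
import statistics

# Hypothetical launch speeds (mph) with missing readings stored as None.
launch_speed = [95.1, None, 84.6, 109.5, None, 62.3]


def impute_median(values):
    """Fill None entries with the median of the observed values.

    Also returns one flag per row so later analysis can tell which
    values were observed and which were filled in.
    """
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    filled = [fill if v is None else v for v in values]
    was_imputed = [v is None for v in values]
    return filled, was_imputed


filled, was_imputed = impute_median(launch_speed)

print(filled)
print(was_imputed)
```

Every missing entry receives the same number, which shrinks the variable's spread and ignores its relationship with other variables. That is the main reason this approach is labeled overly simple.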

Multivariate Imputation by Chained Equations (MICE)

Code
library(dplyr)
library(mice)

# Filter down to balls put in play that should have values.
df.to.impute <- df.subset |> filter(description=="hit_into_play")

# Impute missing data using predictive mean matching
imputed <- df.to.impute |>
    mice(m=1, maxit=10, seed=42, method="pmm", printFlag=FALSE)

# Show the complete dataset including imputed values
complete(imputed) |> 
  head() |> 
  knitr::kable(format="html")
player_name launch_speed launch_angle launch_speed_angle description
Capra, Vinny 65.5 -44 2 hit_into_play
Maldonado, Martín 103.7 31 6 hit_into_play
Turner, Justin 98.6 33 5 hit_into_play
McNeil, Jeff 101.2 12 4 hit_into_play
Durbin, Caleb 92.2 -6 2 hit_into_play
Cameron, Daz 94.3 37 3 hit_into_play
Code
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Filter down to balls put in play that should have values.
df_train = df_subset[df_subset['description'] == 'hit_into_play'].copy()

numeric_cols = ['launch_speed', 'launch_angle']
df_numeric = df_train[numeric_cols]

# Impute missing data with chained regression models.
# Note: IterativeImputer defaults to a BayesianRidge estimator, not predictive mean matching.
imputer = IterativeImputer(random_state=42, max_iter=10)
imputed_data = imputer.fit_transform(df_numeric)

df_imputed = pd.DataFrame(imputed_data, columns=numeric_cols, index=df_train.index)

df_final = pd.concat([df_train[['player_name', 'description']], df_imputed], axis=1)

print(df_final.head(20).to_markdown(index=False))
player_name description launch_speed launch_angle
Pagán, Emilio hit_into_play 108.2 6
Pagán, Emilio hit_into_play 85.8 39
De Los Santos, Enyel hit_into_play 102.2 15
Santillan, Tony hit_into_play 84.5 48
Santillan, Tony hit_into_play 84.2 45
Lee, Dylan hit_into_play 64.2 -33
Lee, Dylan hit_into_play 93.1 56
Mey, Luis hit_into_play 106.2 -18
Mey, Luis hit_into_play 72.8 -23
Bummer, Aaron hit_into_play 95.1 28
Bummer, Aaron hit_into_play 100.7 -19
Rogers, Taylor hit_into_play 84.6 -8
Rogers, Taylor hit_into_play 95.9 26
Rogers, Taylor hit_into_play 109.5 -4
Barlow, Scott hit_into_play 85.9 13
Barlow, Scott hit_into_play 87.8 48
Bummer, Aaron hit_into_play 91.8 3
Holmes, Grant hit_into_play 103.1 14
Holmes, Grant hit_into_play 101.2 4
Suter, Brent hit_into_play 62.3 -42

Cautionary tales in working with data

Data drift

Data can change over time. This “data drift” can degrade model performance or skew other decisions, and it is easy to overlook if only near-term changes are considered.

Four different types of data drift described visually.
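
One crude way to look beyond near-term changes is to compare a recent window against a longer reference period. The sketch below is only an illustration: the numbers are hypothetical, `mean_shift` is a made-up helper, and the alert threshold is arbitrary. It would catch a shift in the average but not every kind of drift.

```python
import statistics


def mean_shift(reference, recent):
    """Return how far the recent mean sits from the reference mean,
    measured in reference standard deviations."""
    ref_mean = statistics.mean(reference)
    ref_sd = statistics.stdev(reference)
    return (statistics.mean(recent) - ref_mean) / ref_sd


# Hypothetical weekly averages of some model input.
reference = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0]
recent_stable = [10.1, 9.9, 10.0]
recent_drifted = [10.6, 10.8, 11.0]

THRESHOLD = 3.0  # assumption: an arbitrary alert level, tune per use case

for name, window in [("stable", recent_stable), ("drifted", recent_drifted)]:
    shift = mean_shift(reference, window)
    print(name, round(shift, 2), abs(shift) > THRESHOLD)
```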

Cognitive biases

Cognitive biases are systematic patterns of deviation from norm and/or rationality in judgment. They are often studied in psychology, sociology and behavioral economics.1

Some biases to be aware of:

  • Survivorship bias: Analyzing only the cases that made it through a selection process while ignoring those that did not.
  • False causality: Assuming that a correlation between two variables means one causes the other.2
  • Availability bias: Drawing conclusions from whatever data is most readily available or most easily recalled.
  • Confirmation bias: Favoring or interpreting data in ways that confirm your existing hypothesis.

Simpson’s paradox

Simpson’s paradox occurs when groups of data show one particular trend, but this trend is reversed when the groups are combined together. Understanding and identifying this paradox is important for correctly interpreting data.1


A baseball player can have a higher batting average than another player in each of two years, yet a lower average when the two years are combined. For example, David Justice had a higher batting average than Derek Jeter in both 1995 and 1996, but across the two years combined, Jeter’s average was higher.2

Picture explaining Simpson's paradox.
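
The reversal is easy to verify with arithmetic. The hits and at-bats below are the figures commonly cited for this example; they do not appear in these slides, so treat them as an assumption and check them against an official source before reusing them.

```python
# (hits, at-bats) per season, as commonly cited for this example.
records = {
    "Jeter":   {1995: (12, 48),   1996: (183, 582)},
    "Justice": {1995: (104, 411), 1996: (45, 140)},
}


def batting_average(hits, at_bats):
    """Batting average is hits divided by at-bats."""
    return hits / at_bats


def combined_average(seasons):
    """Pool hits and at-bats across seasons before dividing."""
    total_hits = sum(hits for hits, _ in seasons.values())
    total_at_bats = sum(at_bats for _, at_bats in seasons.values())
    return total_hits / total_at_bats


for year in (1995, 1996):
    for player, seasons in records.items():
        print(year, player, round(batting_average(*seasons[year]), 3))

for player, seasons in records.items():
    print("combined", player, round(combined_average(seasons), 3))
```

Justice leads in each season, but most of Jeter's at-bats fall in 1996, when both players hit well, while most of Justice's fall in 1995. Pooling the seasons weights each player's years differently, which is what flips the comparison.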

Goodhart’s law

Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.1


or a better way of putting it is:


When a measure becomes a target, it ceases to be a good measure.2

Sketchplanations cartoon explaining Goodhart's Law

Sharing your data with others

Reproducibility

Reproducibility is the ability of independent investigators to draw the same conclusions from an experiment by following the documentation shared by the original investigators.1

Easy tips for enabling others to reproduce your work:

  • Use a static random seed
    • in R: set.seed(42)
    • in Python: random.seed(42)
  • Document your environment
    • in R: renv::snapshot()
    • in Python: pip freeze > requirements.txt
  • Use version control (e.g., Git with GitHub or Bitbucket)
  • Use notebooks
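
The effect of a static seed can be checked directly. This is a minimal sketch using Python's built-in `random` module; `simulate` is a made-up function standing in for any analysis step that relies on random numbers.

```python
import random


def simulate(seed, n=5):
    """Draw n pseudo-random numbers from a generator seeded with `seed`."""
    rng = random.Random(seed)  # a local generator avoids touching global state
    return [rng.random() for _ in range(n)]


first_run = simulate(seed=42)
second_run = simulate(seed=42)
other_seed = simulate(seed=7)

print(first_run == second_run)  # same seed, same draws
print(first_run == other_seed)  # different seed, different draws
```

Libraries that keep their own generators need their own seeds. NumPy, for instance, is unaffected by `random.seed` and is usually seeded with `numpy.random.default_rng(42)`.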


Using pins

The pins package publishes data, models, and other R objects, making it easy to share them across projects and with your colleagues. You can pin objects to a variety of pin boards, including:

  • folders (to share on a networked drive or with services like Dropbox)
  • Posit Connect
  • Amazon S3
  • Google Cloud Storage
  • Azure storage
  • Microsoft 365 (OneDrive and SharePoint).

Pins can be automatically versioned, making it straightforward to track changes, re-run analyses on historical data, and undo mistakes.1 Pins is available in R and Python.

How might AI start to impact working with messy data?

Shifting from manual scripting to “ambient” analysis

Historically, Exploratory Data Analysis (EDA) required a context switch: writing code to see what the data looked like. Tools like Databot represent a shift toward Ambient EDA, where the data’s profile is always visible alongside the code.

Key Concepts for the Future of Data Quality

  • Zero-Latency Profiling: Instead of “Ask and Wait,” Databot provides “Always-On” insights. The moment a dataframe is created or modified, quality metrics (distributions, missing values, and types) are updated instantly.
  • The “Sidecar” UI Pattern: The future of data science IDEs is moving toward a hybrid approach: using the console for transformation and a dedicated, persistent UI for observation.
  • Integrated Data Health Monitoring: Future tools will likely treat data quality as a “linter” for datasets. Just as code editors highlight syntax errors, Databot highlights “data errors” as you work.
  • Language-Agnostic Exploration: Databot works across Python and R, pointing to a future where EDA tools are tied to the environment rather than specific language libraries, ensuring a consistent quality-check workflow regardless of the stack.

Discussion / Contact Info