Working with Messy Data

Christopher Teixeira
April 10, 2026

A little about me…

Christopher Teixeira

Data Scientist

Interests

Applied Probability
Data Visualization
Machine Learning
Responsible AI

---
displayMode: ""
---
gantt
    title Professional Timeline
    dateFormat  YYYY-MM
    axisFormat %Y
    todayMarker off
    section Education
    WPI (BS Mathematics):a1, 2002-09, 2006-06
    GMU (MS Operations Research):a2, 2008-08, 2010-12
    section Experience
    SAIC                    :b1, 2006-06, 2010-11
    IBM                     :b2, 2010-11, 2012-08
    Epsilon                 :b3, 2012-08, 2014-07
    MITRE                   :b4, 2014-07, 2026-04
    To be Announced :active, b5, 2026-04, 2028-12

Reading in data

R
Python

Code

# baseballr provides functions for accessing baseball data, including 
#   the statcast_search function which allows you to access Statcast data from MLB.

# More info can be found here: https://baseballr.com/reference/statcast_search/

library(baseballr)

# Download a week of batting data from Statcast.
df <- statcast_search(
  start_date = "2025-05-01", 
  end_date = "2025-05-07", 
  player_type = "batter"
)

# View the data frame in a friendly format for Quarto
knitr::kable(head(df), format="html")

pitch_type	game_date	release_speed	release_pos_x	release_pos_z	player_name	batter	pitcher	events	description	spin_dir	spin_rate_deprecated	break_angle_deprecated	break_length_deprecated	zone	des	game_type	stand	p_throws	home_team	away_team	type	hit_location	bb_type	balls	strikes	game_year	pfx_x	pfx_z	plate_x	plate_z	on_3b	on_2b	on_1b	outs_when_up	inning	inning_topbot	hc_x	hc_y	tfs_deprecated	tfs_zulu_deprecated	umpire	sv_id	vx0	vy0	vz0	ax	ay	az	sz_top	sz_bot	hit_distance_sc	launch_speed	launch_angle	effective_speed	release_spin_rate	release_extension	game_pk	fielder_2	fielder_3	fielder_4	fielder_5	fielder_6	fielder_7	fielder_8	fielder_9	release_pos_y	estimated_ba_using_speedangle	estimated_woba_using_speedangle	woba_value	woba_denom	babip_value	iso_value	launch_speed_angle	at_bat_number	pitch_number	pitch_name	home_score	away_score	bat_score	fld_score	post_away_score	post_home_score	post_bat_score	post_fld_score	if_fielding_alignment	of_fielding_alignment	spin_axis	delta_home_win_exp	delta_run_exp	bat_speed	swing_length	estimated_slg_using_speedangle	delta_pitcher_run_exp	hyper_speed	home_score_diff	bat_score_diff	home_win_exp	bat_win_exp	age_pit_legacy	age_bat_legacy	age_pit	age_bat	n_thruorder_pitcher	n_priorpa_thisgame_player_at_bat	pitcher_days_since_prev_game	batter_days_since_prev_game	pitcher_days_until_next_game	batter_days_until_next_game	api_break_z_with_gravity	api_break_x_arm	api_break_x_batter_in	arm_angle	attack_angle	attack_direction	swing_path_tilt	intercept_ball_minus_batter_pos_x_inches	intercept_ball_minus_batter_pos_y_inches
KC	2025-05-07	85.8	-1.23	6.69	Chapman, Matt	656305	676962	strikeout	swinging_strike_blocked	NA	NA	NA	NA	14	Matt Chapman strikes out swinging.	R	R	R	CHC	SF	S	2		2	2	2025	0.24	-0.10	0.1722480	0.8157616	NA	642715	NA	2	5	Top	NA	NA	NA	NA	NA	NA	2.854902	-124.7786	-7.422442	1.923976	26.60315	-31.88787	3.440000	1.660000	NA	NA	NA	86.2	2227	6.7	778020	608348	457759	663538	542932	621020	664023	691718	673548	53.78	NA	0.000000	0	1	0	0	NA	39	5	Knuckle Curve	1	3	3	1	3	1	3	1	Standard	Standard	70	0.024	-0.220	65.7	8.7	NA	0.220	NA	-2	2	0.265	0.735	25	32	26	32	3	2	5	1	6	2	3.22	-0.24	-0.24	50.2	16.39766	-36.450081	30.99105	40.65807	47.07546
FF	2025-05-07	95.1	-1.44	6.63	Chapman, Matt	656305	676962		ball	NA	NA	NA	NA	11	Ball	R	R	R	CHC	SF	B	NA		1	2	2025	-0.18	1.28	-0.2905371	4.1049142	NA	642715	NA	2	5	Top	NA	NA	NA	NA	NA	NA	3.446445	-138.4992	-3.908060	-3.035323	29.83358	-14.98578	3.212147	1.583246	NA	NA	NA	96.3	2178	6.8	778020	608348	457759	663538	542932	621020	664023	691718	673548	53.67	NA	NA	NA	NA	NA	NA	NA	39	4	4-Seam Fastball	1	3	3	1	3	1	3	1	Standard	Standard	204	-0.004	0.012	NA	NA	NA	-0.012	NA	-2	2	0.269	0.731	25	32	26	32	3	2	5	1	6	2	1.23	0.18	0.18	49.0	NA	NA	NA	NA	NA
FF	2025-05-07	94.5	-1.15	6.56	Chapman, Matt	656305	676962		ball	NA	NA	NA	NA	14	Ball	R	R	R	CHC	SF	B	NA		0	2	2025	-0.39	1.35	0.8352713	2.1587841	NA	642715	NA	2	5	Top	NA	NA	NA	NA	NA	NA	6.063019	-137.3188	-8.826828	-6.217584	30.38614	-13.29819	3.254478	1.625378	NA	NA	NA	95.0	2074	6.7	778020	608348	457759	663538	542932	621020	664023	691718	673548	53.82	NA	NA	NA	NA	NA	NA	NA	39	3	4-Seam Fastball	1	3	3	1	3	1	3	1	Standard	Standard	204	-0.003	0.015	NA	NA	NA	-0.015	NA	-2	2	0.272	0.728	25	32	26	32	3	2	5	1	6	2	1.21	0.39	0.39	48.0	NA	NA	NA	NA	NA
SI	2025-05-07	93.0	0.74	5.67	Capra, Vinny	681962	664285	field_out	hit_into_play	NA	NA	NA	NA	8	Brewers challenged (play at 1st), call on the field was upheld: Vinny Capra grounds out, pitcher Framber Valdez to first baseman Christian Walker.	R	R	L	MIL	HOU	X	1	ground_ball	2	1	2025	1.19	0.26	0.2660518	1.9673470	NA	NA	NA	2	7	Bot	119.89	194.25	NA	NA	NA	NA	-3.792811	-135.3023	-4.220491	15.361850	32.19198	-28.18434	3.180000	1.500000	2	65.5	-44	92.3	2138	6.1	778015	673237	572233	663898	670623	665161	701305	676694	676801	54.44	0.160	0.148000	0	1	0	0	2	57	4	Sinker	1	6	1	6	6	1	1	6	Standard	Standard	134	-0.003	-0.296	67.5	6.9	0.161	0.296	88.0	-5	-5	0.021	0.021	31	28	32	29	3	2	5	2	6	17	2.39	1.19	-1.19	36.1	10.63560	9.183412	32.07321	38.23280	23.51717
FF	2025-05-07	96.3	1.21	6.06	Maldonado, Martín	455117	608331	field_out	hit_into_play	NA	NA	NA	NA	6	Martín Maldonado flies out sharply to center fielder Cody Bellinger.	R	R	L	NYY	SD	X	8	fly_ball	2	2	2025	0.18	1.47	0.6825741	2.4890805	NA	NA	NA	2	7	Top	132.54	42.17	NA	NA	NA	NA	-1.821405	-140.1033	-7.259126	2.824731	34.04066	-11.35005	3.380000	1.570000	387	103.7	31	96.8	2363	6.8	778019	669224	502671	678391	665828	683011	691176	641355	592450	53.74	0.641	1.279542	0	1	0	0	6	48	5	4-Seam Fastball	0	1	1	0	1	0	1	0	Standard	Standard	158	0.011	-0.224	71.7	7.7	2.526	0.224	103.7	-1	1	0.353	0.647	31	38	31	39	3	2	5	2	6	2	1.00	0.18	-0.18	51.3	14.48879	5.455989	33.84591	45.37516	29.01512
CU	2025-05-07	77.0	0.74	5.74	Capra, Vinny	681962	664285		ball	NA	NA	NA	NA	12	Ball	R	R	L	MIL	HOU	B	NA		1	1	2025	-1.04	-1.05	1.2998572	2.7571789	NA	NA	NA	2	7	Bot	NA	NA	NA	NA	NA	NA	3.054727	-112.0218	2.561620	-9.318751	21.90586	-41.66357	3.221396	1.607159	NA	NA	NA	76.1	2727	5.8	778015	673237	572233	663898	670623	665161	701305	676694	676801	54.68	NA	NA	NA	NA	NA	NA	NA	57	3	Curveball	1	6	1	6	6	1	1	6	Standard	Standard	314	0.000	0.053	NA	NA	NA	-0.053	NA	-5	-5	0.021	0.021	31	28	32	29	3	2	5	2	6	17	4.92	-1.04	1.04	38.3	NA	NA	NA	NA	NA

Code

# pybaseball provides functions for accessing baseball data, including 
#   the statcast function which allows you to access Statcast data from MLB.

# More info can be found here: https://github.com/jldbc/pybaseball

import pybaseball as pb
import pandas as pd

df = pb.statcast(
    start_dt="2025-05-01", 
    end_dt="2025-05-07",
    verbose=False
) 
# View the data frame in a friendly format for Quarto
print(df.head().to_markdown(index=False))

pitch_type	game_date	release_speed	release_pos_x	release_pos_z	player_name	batter	pitcher	events	description	zone	des	game_type	stand	p_throws	home_team	away_team	type	hit_location	bb_type	balls	strikes	game_year	pfx_x	pfx_z	plate_x	plate_z	on_1b	outs_when_up	inning	inning_topbot	vx0	vy0	vz0	ax	ay	az	sz_top	sz_bot	hit_distance_sc	launch_speed	launch_angle	effective_speed	release_spin_rate	release_extension	game_pk	fielder_2	fielder_3	fielder_4	fielder_5	fielder_6	fielder_7	fielder_8	fielder_9	release_pos_y	estimated_woba_using_speedangle	woba_value	woba_denom	babip_value	iso_value	at_bat_number	pitch_number	pitch_name	home_score	away_score	bat_score	fld_score	post_away_score	post_home_score	post_bat_score	post_fld_score	if_fielding_alignment	of_fielding_alignment	spin_axis	delta_home_win_exp	delta_run_exp	bat_speed	swing_length	delta_pitcher_run_exp	hyper_speed	home_score_diff	bat_score_diff	home_win_exp	bat_win_exp	age_pit_legacy	age_bat_legacy	age_pit	age_bat	n_thruorder_pitcher	pitcher_days_since_prev_game	batter_days_since_prev_game	pitcher_days_until_next_game	api_break_z_with_gravity	api_break_x_arm	api_break_x_batter_in	arm_angle	attack_angle	attack_direction	swing_path_tilt	intercept_ball_minus_batter_pos_x_inches	intercept_ball_minus_batter_pos_y_inches
FC	2025-05-07 00:00:00	86.9	-1.96	5.41	Pagán, Emilio	592696	641941	strikeout	swinging_strike	7	Eddie Rosario strikes out swinging.	R	L	R	ATL	CIN	S	2	nan	1	2	2025	0.33	0.65	-0.619877	2.03626	671739	2	9	Bot	2.57022	-126.55	-3.30744	3.07245	24.4291	-24.6717	3.35	1.62				88	2400	6.8	778023	663886	668715	680574	669289	682829	694362	670770	677956	53.66	0.0	0.0	1	0	0	70	5	Cutter	3	4	3	4	4	3	3	4	Infield shade	Strategic	199	-0.099	-0.183	68.4	7.9	0.183		-1	-1	0.099	0.099	34	33	34	34	1	1	7	3	2.35	-0.33	0.33	38.2	19.318001205408	-12.604511466980634	28.10243518548305	44.832212177848795	35.250630209870266
FF	2025-05-07 00:00:00	95.6	-1.81	5.58	Pagán, Emilio	592696	641941	nan	foul	11	Foul	R	L	R	ATL	CIN	S		nan	1	2	2025	-0.88	1.54	-0.516144	3.70284	671739	2	9	Bot	5.41895	-139.086	-2.81051	-12.6175	31.9237	-11.7958	3.35	1.62	226	80.9	62	96.2	2648	6.7	778023	663886	668715	680574	669289	682829	694362	670770	677956	53.8						70	4	4-Seam Fastball	3	4	3	4	4	3	3	4	Infield shade	Strategic	214	0	0	65.2	7.1	0	88.0	-1	-1	0.099	0.099	34	33	34	34	1	1	7	3	0.96	0.88	-0.88	40	11.65636616091739	-13.12215688959648	22.8287189010902	42.578184020970454	30.960237026850052
FS	2025-05-07 00:00:00	83.2	-2.01	5.39	Pagán, Emilio	592696	641941	nan	blocked_ball	13	Ball In Dirt	R	L	R	ATL	CIN	B		nan	0	2	2025	-0.95	0.03	-0.598467	-0.570105	671739	2	9	Bot	5.1799	-121.001	-7.38721	-10.3128	23.0312	-30.6815	3.36039	1.44416				84.7	1028	7.2	778023	663886	668715	680574	669289	682829	694362	670770	677956	53.28						70	3	Split-Finger	3	4	3	4	4	3	3	4	Infield shade	Strategic	241	0	0.009			-0.009		-1	-1	0.099	0.099	34	33	34	34	1	1	7	3	3.26	0.95	-0.95	37.9
FC	2025-05-07 00:00:00	87.3	-1.92	5.49	Pagán, Emilio	592696	641941	nan	swinging_strike	13	Swinging Strike	R	L	R	ATL	CIN	S		nan	0	1	2025	0.16	0.58	-1.08005	2.42234	671739	2	9	Bot	1.70213	-127.283	-2.51703	1.45595	23.9962	-25.5608	3.35	1.62				88.6	2455	6.8	778023	663886	668715	680574	669289	682829	694362	670770	677956	53.7						70	2	Cutter	3	4	3	4	4	3	3	4	Infield shade	Strategic	178	0	-0.055	66.3	7.0	0.055		-1	-1	0.099	0.099	34	33	34	34	1	1	7	3	2.38	-0.16	0.16	41.6	5.126280742107504	23.36120966854403	27.126194862765757	51.43128585676405	13.633519173010594
FC	2025-05-07 00:00:00	89.1	-1.95	5.52	Pagán, Emilio	592696	641941	nan	swinging_strike	14	Swinging Strike	R	L	R	ATL	CIN	S		nan	0	0	2025	0.24	0.67	0.197498	1.27333	671739	2	9	Bot	4.82683	-129.612	-5.84193	1.62425	27.9346	-23.5849	3.35	1.62				90	2465	6.9	778023	663886	668715	680574	669289	682829	694362	670770	677956	53.55						70	1	Cutter	3	4	3	4	4	3	3	4	Infield shade	Strategic	195	0	-0.042	70.4	8.7	0.042		-1	-1	0.099	0.099	34	33	34	34	1	1	7	3	2.21	-0.24	0.24	40.9	29.90112031496792	-38.31602940321417	35.061709811554415	36.5660666665002	54.539916242397666

Changing data types

Code

library(dplyr)

# Subset the data down to a few columns to work with.
# Then convert the player_name and description variables to factors.
df.subset <- df |>
  select(player_name, 
          launch_speed, 
          launch_angle, 
          launch_speed_angle, 
          description) |>
  mutate(across(c(player_name,description), factor))

# View the data frame in a friendly format for Quarto
knitr::kable(head(df.subset), format="html")

player_name	launch_speed	launch_angle	launch_speed_angle	description
Chapman, Matt	NA	NA	NA	swinging_strike_blocked
Chapman, Matt	NA	NA	NA	ball
Chapman, Matt	NA	NA	NA	ball
Capra, Vinny	65.5	-44	2	hit_into_play
Maldonado, Martín	103.7	31	6	hit_into_play
Capra, Vinny	NA	NA	NA	ball

Code

# Subset the data down to a few columns to work with.
# Then convert the player_name and description variables to categorical variables.
df_subset = df[['player_name', 'description', 'launch_speed', 'launch_angle', 'launch_speed_angle']].copy()
categorical_cols = ['player_name', 'description']
df_subset[categorical_cols] = df_subset[categorical_cols].astype('category')

# View the data frame in a friendly format for Quarto
print(df_subset.head().to_markdown(index=False))

player_name	description	launch_speed	launch_angle
Pagán, Emilio	swinging_strike
Pagán, Emilio	foul	80.9	62
Pagán, Emilio	blocked_ball
Pagán, Emilio	swinging_strike
Pagán, Emilio	swinging_strike

Exploratory data analysis

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.¹

Four primary types of EDA:

Univariate non-graphical: Describe the data and find patterns that exist within a variable.
Univariate graphical: For a single variable, explore the values visually using graphs like box plots or histograms.
Multivariate non-graphical: Describe the relationships between two or more variables in the data.
Multivariate graphical: Visualize the relationships between two or more variables through graphs like scatter plots or heat maps.
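
The two non-graphical types can be sketched on a small made-up table. The column names mirror the Statcast subset used in this deck, but the values are invented for illustration, and the choice of `describe`, `value_counts`, and `corr` is just one reasonable starting point.

```python
import pandas as pd

# Hypothetical batted-ball records; values are invented for illustration.
toy = pd.DataFrame({
    "launch_speed": [70.0, 85.0, 90.0, 95.0, 100.0, 105.0],
    "launch_angle": [-30, -5, 10, 20, 25, 35],
    "description": ["hit_into_play"] * 4 + ["foul"] * 2,
})

# Univariate non-graphical: summarize one variable at a time.
speed_summary = toy["launch_speed"].describe()
description_counts = toy["description"].value_counts()

# Multivariate non-graphical: quantify how two variables move together.
speed_angle_corr = toy["launch_speed"].corr(toy["launch_angle"])

print(speed_summary)
print(description_counts)
print(f"Pearson correlation: {speed_angle_corr:.2f}")
```

The graphical types answer the same questions with a histogram or a scatter plot instead of a number.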

Applying EDA: Univariate

R
Python

Code

# Use the skimr package to examine the dataset. 
# Produces a high level summary (# of rows/columns, column types)
# For each column type, it produces details about each variable.

library(skimr)
skim(df.subset)

Data summary
Name	df.subset
Number of rows	25000
Number of columns	5
Key	NULL
_______________________
Column type frequency:
factor	2
numeric	3
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
player_name	0	1	FALSE	401	Alo: 140, Kwa: 139, Sot: 138, Ols: 130
description	0	1	FALSE	15	bal: 8409, fou: 4532, hit: 4457, cal: 4087

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
launch_speed	16607	0.34	83.33	15.32	11.1	73.6	83.2	95.3	118.4	▁▁▅▇▃
launch_angle	16588	0.34	17.35	32.77	-87.0	-5.0	20.0	42.0	89.0	▁▃▇▇▃
launch_speed_angle	20553	0.18	3.23	1.30	1.0	2.0	3.0	4.0	6.0	▇▆▆▂▂

Code

# Use the pyskim package to examine the dataset. 
# Produces a high level summary (# of rows/columns, column types)
# For each column type, it produces details about each variable.

from pyskim import skim
skim(df_subset)

── Data Summary ────────────────────────────────────────────────────────────────────────────────────
type                 value
-----------------  -------
Number of rows       27706
Number of columns        5
──────────────────────────────────────────────────
Column type frequency:
            Count
--------  -------
Int64           2
category        1
category        1
Float64         1

── Variable type: number ───────────────────────────────────────────────────────────────────────────
    name                  na_count    mean    sd     p0    p25    p50    p75    p100  hist
--  ------------------  ----------  ------  ----  -----  -----  -----  -----  ------  ----------
 0  launch_speed             18407   83.3   15.3   11.1   73.5   83.2   95.3     118  ▁▁▁▁▂▆▇▆▆▁
 1  launch_angle             18388   17.5   32.6  -87     -4     20     42        89  ▁▁▂▄▄▇▇▆▄▁
 2  launch_speed_angle       22751    3.23   1.3    1      2      3      4         6  ▁▁▇▁▆▁▆▁▂▂

── Variable type: category ─────────────────────────────────────────────────────────────────────────
    name           na_count    n_unique  top_counts
--  -----------  ----------  ----------  ------------------------------------------------------
 0  player_name           0         424  Allen, Logan: 195, Patrick, Chad: 194, Fried, Max: 192
 1  description           0          15  ball: 9327, foul: 4993, hit_into_play: 4966

Applying EDA: Multivariate

R
Python

Code

library(dplyr)    # filter()
library(ggplot2)
library(ggExtra)
library(ggthemes)
library(scales)   # label_number(), cut_short_scale()

# Create a scatter plot with the launch speed and launch angle.
g <- ggplot(df.subset |>
              filter(description=="hit_into_play"), 
            aes(x=launch_speed, 
                y=launch_angle)) + 
  geom_point() +
  theme(legend.position="none") +
  labs(x="Launch Speed",
       y="Launch Angle") +
  scale_x_continuous(labels=label_number(scale_cut=cut_short_scale())) +
  scale_y_continuous(labels=label_number(scale_cut=cut_short_scale())) +
  theme_hc()

ggMarginal(g, type="histogram")

Code

import seaborn as sns
import matplotlib.pyplot as plt

# Create a scatter plot with the launch speed and launch angle.
g = sns.jointplot(
  data=df_subset,
  x='launch_speed',
  y='launch_angle',
  kind='scatter')
g.set_axis_labels("Launch Speed", "Launch Angle")

plt.show()

Working with missing data

Questions to ask when working with missing data

Does “missing” mean something different from “0”?
- If you have data on the amount of candy sold per day, does a missing value mean no candy was sold, or that the amount sold is unknown?
Is “missing” captured in another way?
- Sometimes a negative value or a code such as “99” indicates that a value is missing.
Was there a change in how data was being captured?
- For long-standing data capture initiatives (e.g., surveys), the data collection methods can change without notice to the analysts.
  - Did the way the data was captured change?
  - Did the range of values change?
  - Do the values represent something different?
Does it make sense to replace missing values?
- If a variable is mostly missing, replacing its values by any method could lead to false conclusions.
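
A minimal sketch of the first two questions, using the candy example. The data and the sentinel codes (99 and -1) are assumptions made up for illustration; in practice the codes should come from the data dictionary or the people who collected the data.

```python
import pandas as pd

# Hypothetical daily candy sales; 99 and -1 are assumed "missing" codes here.
sales = pd.DataFrame({
    "day": ["Mon", "Tue", "Wed", "Thu", "Fri"],
    "candy_sold": [12, 0, 99, -1, 7],
})

SENTINELS = [99, -1]  # assumption: confirm the real codes before recoding

# Recode sentinel values as true missing values; a real 0 stays a 0.
is_sentinel = sales["candy_sold"].isin(SENTINELS)
sales["candy_sold_clean"] = sales["candy_sold"].mask(is_sentinel)

n_missing = int(sales["candy_sold_clean"].isna().sum())
n_zero = int((sales["candy_sold_clean"] == 0).sum())
mean_raw = sales["candy_sold"].mean()
mean_clean = sales["candy_sold_clean"].mean()

print(f"missing: {n_missing}, true zeros: {n_zero}")
print(f"mean with sentinels: {mean_raw:.2f}, mean without: {mean_clean:.2f}")
```

Leaving the sentinel codes in place inflates the average, while treating the genuine zero as missing would hide a real day with no sales.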

Imputing missing data

There are two general approaches:

Overly simple approach: replace missing values with the mean, median, or mode of the variable.
Sophisticated approach: replace missing values by building a model for each variable with missing data, using the rest of the dataset as predictors.
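
One way to see why the simple approach is "overly simple" is to look at what it does to the spread of a variable. The sketch below uses invented launch speeds and fills the gaps with the median.

```python
import pandas as pd

# Hypothetical launch speeds with gaps; values are invented for illustration.
speeds = pd.Series([70.0, None, 85.0, 90.0, None, 95.0, None, 110.0])

# Simple approach: fill every gap with the same median value.
fill_value = speeds.median()
median_filled = speeds.fillna(fill_value)

std_observed = speeds.std()        # spread of the observed values only
std_filled = median_filled.std()   # spread after filling the gaps

print(f"median used for filling: {fill_value}")
print(f"std before: {std_observed:.2f}, after: {std_filled:.2f}")
```

Every filled value sits exactly at the center of the distribution, so the standard deviation shrinks and any relationship with other variables is weakened. Model-based approaches try to avoid this by predicting each missing value from the other columns.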

Multivariate Imputation by Chained Equations (MICE)

R
Python

Code

library(mice)

# Filter down to balls put in play that should have values.
df.to.impute <- df.subset |> filter(description=="hit_into_play")

# Impute missing data using predictive mean matching
imputed <- df.to.impute |>
    mice(m=1, maxit=10, seed=42, method="pmm", print=FALSE)

# Show the complete dataset including imputed values
complete(imputed) |> 
  head() |> 
  knitr::kable(format="html")

player_name	launch_speed	launch_angle	launch_speed_angle	description
Capra, Vinny	65.5	-44	2	hit_into_play
Maldonado, Martín	103.7	31	6	hit_into_play
Turner, Justin	98.6	33	5	hit_into_play
McNeil, Jeff	101.2	12	4	hit_into_play
Durbin, Caleb	92.2	-6	2	hit_into_play
Cameron, Daz	94.3	37	3	hit_into_play

Code

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Filter down to balls put in play that should have values.
df_train = df_subset[df_subset['description'] == 'hit_into_play'].copy()

numeric_cols = ['launch_speed', 'launch_angle']
df_numeric = df_train[numeric_cols]

# Impute missing data with chained regressions. Note: IterativeImputer's
#   default estimator is BayesianRidge, not predictive mean matching.
imputer = IterativeImputer(random_state=42, max_iter=10)
imputed_data = imputer.fit_transform(df_numeric)

df_imputed = pd.DataFrame(imputed_data, columns=numeric_cols, index=df_train.index)

df_final = pd.concat([df_train[['player_name', 'description']], df_imputed], axis=1)

print(df_final.head(20).to_markdown(index=False))

player_name	description	launch_speed	launch_angle
Pagán, Emilio	hit_into_play	108.2	6
Pagán, Emilio	hit_into_play	85.8	39
De Los Santos, Enyel	hit_into_play	102.2	15
Santillan, Tony	hit_into_play	84.5	48
Santillan, Tony	hit_into_play	84.2	45
Lee, Dylan	hit_into_play	64.2	-33
Lee, Dylan	hit_into_play	93.1	56
Mey, Luis	hit_into_play	106.2	-18
Mey, Luis	hit_into_play	72.8	-23
Bummer, Aaron	hit_into_play	95.1	28
Bummer, Aaron	hit_into_play	100.7	-19
Rogers, Taylor	hit_into_play	84.6	-8
Rogers, Taylor	hit_into_play	95.9	26
Rogers, Taylor	hit_into_play	109.5	-4
Barlow, Scott	hit_into_play	85.9	13
Barlow, Scott	hit_into_play	87.8	48
Bummer, Aaron	hit_into_play	91.8	3
Holmes, Grant	hit_into_play	103.1	14
Holmes, Grant	hit_into_play	101.2	4
Suter, Brent	hit_into_play	62.3	-42

Cautionary tales in working with data

Data drift

Data can change over time. This “data drift” can degrade model performance or undermine other decisions, and it is easy to overlook if you only look at near-term changes.
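
A minimal sketch of a drift check, assuming you keep a reference window of older data to compare against. The data is synthetic, and both the metric (shift in the mean, measured in reference standard deviations) and the 0.25 cutoff are illustrative assumptions rather than a standard.

```python
import random
import statistics

random.seed(42)  # fixed seed so the synthetic example is repeatable

# Synthetic data: a reference window and a later window whose mean has shifted.
reference = [random.gauss(83.0, 15.0) for _ in range(2000)]
recent = [random.gauss(90.0, 15.0) for _ in range(2000)]


def mean_shift_in_sd(reference, recent):
    """Difference in means, expressed in reference standard deviations."""
    ref_mean = statistics.fmean(reference)
    ref_sd = statistics.stdev(reference)
    return (statistics.fmean(recent) - ref_mean) / ref_sd


DRIFT_THRESHOLD = 0.25  # hypothetical cutoff; tune for your own data

shift = mean_shift_in_sd(reference, recent)
drifted = abs(shift) > DRIFT_THRESHOLD
print(f"shift = {shift:.2f} reference SDs, drift flagged: {drifted}")
```

Comparing against a fixed, older reference window is the point: comparing each week only to the previous week can miss a slow drift that adds up over months.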

Cognitive biases

Cognitive biases are systematic patterns of deviation from norm and/or rationality in judgment. They are often studied in psychology, sociology and behavioral economics.¹

Some biases to be aware of:

Survivorship bias: Analyzing just the data that is available without analyzing the larger situation.
False causality: Assuming that a correlation between two variables means one causes the other.²
Availability bias: Drawing conclusions on limited data.
Confirmation bias: Manipulating data to confirm your own hypothesis.

Simpson’s paradox

Simpson’s paradox occurs when groups of data show one particular trend, but this trend is reversed when the groups are combined together. Understanding and identifying this paradox is important for correctly interpreting data.¹

A baseball player can have a higher batting average than another player in each of two years, but a lower average when the two years are combined. In one case, David Justice had a higher batting average than Derek Jeter in both 1995 and 1996, but across the two years combined, Jeter’s average was higher.²
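
The arithmetic behind this example can be checked directly. The hit and at-bat totals below are the figures commonly cited for this case; treat them as illustrative and confirm them against an official source before reusing them.

```python
# (hits, at_bats) per season, as commonly cited for this example.
seasons = {
    "Derek Jeter": {1995: (12, 48), 1996: (183, 582)},
    "David Justice": {1995: (104, 411), 1996: (45, 140)},
}


def batting_average(hits, at_bats):
    return hits / at_bats


def combined_average(player):
    """Batting average over all seasons: total hits over total at-bats."""
    total_hits = sum(h for h, _ in seasons[player].values())
    total_at_bats = sum(ab for _, ab in seasons[player].values())
    return batting_average(total_hits, total_at_bats)


for player, by_year in seasons.items():
    per_year = ", ".join(
        f"{year}: {batting_average(*counts):.3f}" for year, counts in by_year.items()
    )
    print(f"{player}: {per_year}, combined: {combined_average(player):.3f}")
```

The reversal comes from the unequal weights: most of Jeter's at-bats fall in 1996, the better year for both players, while most of Justice's fall in 1995.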

Goodhart’s law

Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.¹

Or, put more simply:

When a measure becomes a target, it ceases to be a good measure.²

Sketchplanations cartoon explaining Goodhart's Law

Sharing your data with others

Reproducibility

Reproducibility is the ability of independent investigators to draw the same conclusions from an experiment by following the documentation shared by the original investigators.¹

Easy tips for enabling others to reproduce your work:

Use a static random seed
- in R: set.seed(42)
- in Python: random.seed(42)
Document your environment
- in R: renv::snapshot()
- in Python: pip freeze > requirements.txt
Use version control (e.g., GitHub, Bitbucket)
Use notebooks
- in R: R Markdown
- in Python: Jupyter
- in either: Quarto
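
A short sketch of what a static seed buys you, using only Python's built-in `random` module. The helper `draw_sample` is hypothetical and exists only for this illustration. Note that `random.seed` affects only the built-in module; libraries such as NumPy and scikit-learn take their own seeds (for example, the `random_state=42` argument used with `IterativeImputer` earlier).

```python
import random


def draw_sample(seed, n=5):
    """Return n pseudo-random integers from a generator seeded with `seed`."""
    rng = random.Random(seed)  # local generator; avoids touching global state
    return [rng.randint(1, 100) for _ in range(n)]


first_run = draw_sample(42)
second_run = draw_sample(42)

# The same seed reproduces the same sequence of draws.
print(first_run)
print(second_run)
print(first_run == second_run)
```

Using a local `random.Random(seed)` instead of the global `random.seed(seed)` is a design choice: it keeps one part of an analysis from silently changing the random draws in another.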

Using pins

The pins package publishes data, models, and other R objects, making it easy to share them across projects and with your colleagues. You can pin objects to a variety of pin boards, including:

folders (to share on a networked drive or with services like Dropbox)
Posit Connect
Amazon S3
Google Cloud Storage
Azure storage
Microsoft 365 (OneDrive and SharePoint).

Pins can be automatically versioned, making it straightforward to track changes, re-run analyses on historical data, and undo mistakes.¹ Pins is available in R and Python.

How might AI start to impact working with messy data?

Shifting from manual scripting to “ambient” analysis

Historically, Exploratory Data Analysis (EDA) required a context switch: writing code to see what the data looked like. Tools like Databot represent a shift toward Ambient EDA, where the data’s profile is always visible alongside the code.

Key Concepts for the Future of Data Quality

Zero-Latency Profiling: Instead of “Ask and Wait,” Databot provides “Always-On” insights. The moment a dataframe is created or modified, quality metrics (distributions, missing values, and types) are updated instantly.
The “Sidecar” UI Pattern: The future of data science IDEs is moving toward a hybrid approach: using the console for transformation and a dedicated, persistent UI for observation.
Integrated Data Health Monitoring: Future tools will likely treat data quality as a “linter” for datasets. Just as code editors highlight syntax errors, Databot highlights “data errors” as you work.
Language-Agnostic Exploration: Databot works across Python and R, pointing to a future where EDA tools are tied to the environment rather than specific language libraries, ensuring a consistent quality-check workflow regardless of the stack.
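
To make the “linter for datasets” idea concrete, here is a minimal sketch of what such a check could look like in plain pandas. It is not how Databot works; the function name `lint_dataframe`, the two checks, and the 50% threshold are all assumptions chosen for illustration.

```python
import pandas as pd


def lint_dataframe(df, max_missing_rate=0.5):
    """Return a list of human-readable warnings about a DataFrame.

    The checks and the default threshold are illustrative assumptions.
    """
    warnings = []
    for column in df.columns:
        missing_rate = df[column].isna().mean()
        if missing_rate > max_missing_rate:
            warnings.append(f"{column}: {missing_rate:.0%} missing")
        if df[column].nunique(dropna=True) <= 1:
            warnings.append(f"{column}: constant or empty")
    return warnings


# Hypothetical data loosely modeled on the Statcast subset used earlier.
toy = pd.DataFrame({
    "launch_speed": [None, None, None, 65.5, 103.7],
    "launch_angle": [10.0, 25.0, None, -44.0, 31.0],
    "game_year": [2025, 2025, 2025, 2025, 2025],
})

for warning in lint_dataframe(toy):
    print(warning)
```

An ambient tool would run checks like these automatically whenever a data frame changes, rather than waiting for the analyst to call them.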

Discussion / Contact Info

Christopher Teixeira

christopherteixeira.com

chris@christopherteixeira.com

in/christopherteixeira

christopher-teixeira
