Outlier Analysis

Rob Wiederstein

Overview

Illustration

Princess Fiona:	“What kind of knight are you?”
Shrek:	“One of a kind.”

Based Upon

Article[1]

Roadmap

Basics
Distributions
Models (KNN)
Part B Claims Data
Scagnostics
Interactive Display

Also Known As

“outliers, novelty, faults, deviants, discordant observations, extreme values/cases, change points, rare events, intrusions, misuses, exceptions, aberrations, surprises, peculiarities, odd values and contaminants”[1]

Definitions

Kurtosis – a measure of the tailedness of a distribution, that is, how often outliers occur.
Outlier – “An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.”[2]
Skewness – a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
Standardize – to scale all of the values in the dataset so that the mean is 0 and the standard deviation is 1.
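
As a rough guide to the definitions above, one common moment-based convention for skewness and (excess) kurtosis is sketched below. The symbols \(x_i\) (an observation) and \(n\) (the number of observations) are assumptions introduced for illustration; statistical packages differ in their small-sample corrections, so reported values may not match this form exactly.

\[ \text{skewness} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^3}{\sigma^3}, \qquad \text{excess kurtosis} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^4}{\sigma^4} - 3 \]

Subtracting 3 makes a normal distribution score 0, so positive values indicate heavier tails than the normal and negative values indicate lighter tails.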

Symbols

Symbol	Short	Meaning
\(\mu\)	“mew”	mean
\(\sigma\)	“sigma”	std. dev.

Basics

Outliers Classified

\(c_1\) and \(c_2\) are clusters; \(x_1\) and \(x_2\) are global anomalies; \(x_3\) is a local anomaly; and \(c_3\) is potentially ambiguous. [3]

Continuum of Outlierness

Univariate Outliers

The detection of outliers in the observed distribution of a single variable spans the entire history of outlier detection. It spans this history not only because it is the simplest formulation of the problem, but also because it is deceptively simple.[4]

\[ \{1, 2, 3, 4, 50, 97, 98, 99\} \]

Distance from the Center Rule

“The word outlier implies lying at an extreme end of a set of ordered values – far away from the center of those values. The modern history of outlier detection emerged with methods that depend on a measure of centrality and a distance from that measure of centrality.” [4]

\[ \{1, 47, 47, 49, 51, 52, 55, 100\} \]

Common Outlier Definitions

More than 1.5 x the interquartile range below the first quartile or above the third quartile - Tukey
More than 3.0 x the standard deviation from the mean
Percentile?
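
The first two rules can be compared on the second example set shown earlier. This is a minimal base-R sketch, assuming R's default quantile method (type 7) and the sample standard deviation; the variable names are illustrative only.

```r
# Second example set from the slides: most values sit near 50.
x <- c(1, 47, 47, 49, 51, 52, 55, 100)

# Tukey rule: flag points more than 1.5 x IQR outside the quartiles.
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)  # default type 7 quantiles
iqr <- q[2] - q[1]
tukey_outliers <- x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]

# 3-sd rule: flag points more than 3 sample standard deviations from the mean.
z <- abs(x - mean(x)) / sd(x)
sd_outliers <- x[z > 3]

print(tukey_outliers)
print(sd_outliers)
```

On this set the Tukey rule flags 1 and 100, while the 3-sd rule flags nothing: the two extreme values inflate the standard deviation (about 26.6), so neither lies even 2 standard deviations from the mean. The rules need not agree, especially on small samples.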

Four Methods to Identify Outliers

Extreme Value Analysis
Probabilistic and Statistical Models
Linear Models
Proximity-Based Models
- Cluster
- Density
- Distance <==(We are here!)

EVA: “The most basic form of outlier detection is extreme-value analysis of 1-dimensional data. These are very specific types of outliers in which it is assumed that the values that are either too large or too small are outliers.” (Singh and Upadhyaya 2012) “The key is to determine the statistical tails of the underlying distribution.”

PSM: “In probabilistic and statistical models, the data is modeled in the form of a closed-form probability distribution, and the parameters of this model are learned.”

LM: These methods model the data along lower-dimensional subspaces with the use of linear correlations.

PB: “Proximity-based methods are among the most popular class of methods used in outlier analysis. Proximity-based methods may be applied in one of three ways, which are clustering methods, density-based methods”

Proximity-based: “Proximity-based techniques define a data point as an outlier when its locality (or proximity) is sparsely populated.”[5] These methods are cluster-, distance- or density-based. Distance-based: “The distance of a data point to its k-nearest neighbor (or other variant) is used in order to define proximity.”[5]
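
The distance-based idea can be sketched on the first univariate example set, \(\{1, 2, 3, 4, 50, 97, 98, 99\}\). This is a brute-force base-R illustration, not a production implementation; the choice \(k = 1\) and all variable names are assumptions made for the example.

```r
x <- c(1, 2, 3, 4, 50, 97, 98, 99)
k <- 1  # number of neighbors; an illustrative choice

# Distance-from-center view: how far each value lies from the mean.
center_dist <- abs(x - mean(x))

# Proximity view: distance from each point to its k-th nearest neighbor.
d <- as.matrix(dist(x))  # pairwise absolute differences
# After sorting a row, position 1 is the point itself (distance 0),
# so position k + 1 holds the k-th nearest neighbor.
knn_dist <- unname(apply(d, 1, function(row) sort(row)[k + 1]))

print(data.frame(x, center_dist, knn_dist))
```

Here 50 is the value closest to the mean, so a distance-from-center rule would never flag it, yet its nearest neighbor is 46 units away while every other point has a neighbor 1 unit away. A larger score means a sparser neighborhood.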

Tools Density Plot

Tools Histogram (Binning)

Tools Boxplots

Distributions

Normal

	vars	n	mean	sd	median	min	max	skew	kurtosis	se
y1	1	100	0.11	0.90	0.11	-2.21	2.40	-0.07	-0.05	0.09
y2	2	100	-0.08	1.92	-0.35	-3.83	4.62	0.44	-0.31	0.19
y3	3	100	0.09	3.10	0.00	-8.67	7.95	-0.24	0.26	0.31

Zipf

	vars	n	mean	sd	median	min	max	skew	kurtosis	se
y1	1	100	139.14	198.89	30.5	1	951	1.83	3.07	19.89
y2	2	100	237.34	270.50	123.0	1	976	1.23	0.40	27.05
y3	3	100	274.25	287.68	154.0	3	965	1.08	-0.11	28.77

Log

	vars	n	mean	sd	median	min	max	skew	kurtosis	se
y1	1	100	1.65	1.72	1.12	0.11	11.04	2.86	10.49	0.17
y2	2	100	6.25	16.53	0.70	0.02	101.08	3.89	15.74	1.65
y3	3	100	60.71	333.89	1.00	0.00	2828.50	7.10	51.51	33.39
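
To see how tail weight affects a fixed cutoff, the sketch below counts how often the 3-standard-deviation rule fires on simulated data. It assumes that "Log" refers to a log-normal distribution and uses parameters (`meanlog = 0`, `sdlog = 1`) and a sample size chosen for illustration; none of these are taken from the slides, whose tables use 100 observations per variable.

```r
set.seed(1)
n <- 10000  # larger than the slides' n = 100, to make the comparison stable

samples <- list(
  normal    = rnorm(n),                          # light, symmetric tails
  lognormal = rlnorm(n, meanlog = 0, sdlog = 1)  # assumed heavy right tail
)

# Share of observations more than 3 sample standard deviations from the mean.
frac_beyond_3sd <- sapply(samples, function(v) mean(abs(v - mean(v)) / sd(v) > 3))

print(frac_beyond_3sd)
```

Under a normal distribution roughly 0.3% of observations fall beyond 3 standard deviations, whereas the heavy-tailed sample produces such points far more often, so the same rule labels very different shares of the data as outliers.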

Z-value Test

\[ Z_1 = \frac{| X_1 - \mu |}{\sigma} \]

where \(X_1\) = observation, \(\mu\) = mean, and \(\sigma\) = standard deviation

# in R: absolute z-score for each observation, as in the formula above
df$z <- abs(df$points - mean(df$points)) / sd(df$points)

Normalized/Standardized

	vars	n	mean	sd	median	min	max	skew	kurtosis	se
uniform	1	100	25.97	13.94	25.94	0.59	48.78	-0.03	-1.17	1.39
transformed	2	100	0.00	1.00	0.00	-1.82	1.64	-0.03	-1.17	0.10
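
The table shows the same skew and kurtosis before and after the transformation. A minimal base-R sketch of why: standardizing is a shift and a rescale, which changes location and spread but not shape. The uniform range (0 to 50), the seed, and the helper functions `skew` and `kurt` are assumptions for illustration, using a simple moment-based convention.

```r
set.seed(1)
x <- runif(100, min = 0, max = 50)  # assumed range, chosen to resemble the table
z <- (x - mean(x)) / sd(x)          # standardize: mean 0, standard deviation 1

# Moment-based shape measures (hypothetical helpers, not from a package).
skew <- function(v) mean((v - mean(v))^3) / mean((v - mean(v))^2)^1.5
kurt <- function(v) mean((v - mean(v))^4) / mean((v - mean(v))^2)^2 - 3

print(round(c(mean = mean(z), sd = sd(z)), 2))
print(c(skew_raw = skew(x), skew_std = skew(z)))
print(c(kurt_raw = kurt(x), kurt_std = kurt(z)))
```

Because shape is unchanged, standardizing makes variables comparable in scale but does not make a skewed or heavy-tailed variable any more normal.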

Boxplots

Zipf Box Plots

Normal Box Plot

Models

KNN

Best Unsupervised Learning Method

“By using a diverse collection of datasets, several evaluation measures, and a broad range of parameter settings, we argue here that it is typically pointless and unjustified to state the superior behavior of any method for the general case.”[6]

Evaluation of KNN

(a) is unnormalized data, aggregated over all datasets.
(b) is normalized data, aggregated over all datasets.[6]

What Works

“The gist of our findings is that, when considering the totality of results . . . the seminal methods kNN, kNNW, and LOF still remain the state of the art; none of the more recent methods tested offer any comprehensive improvement over those classics, while two methods in particular (LDF and KDEOS) have been found to be noticeably less robust to parameter choices.”[6]

How many neighbors?

Proof

“A key question arises as to how the effectiveness of an outlier detection algorithm should be evaluated. Unfortunately, this is often a difficult task, because . . . ground-truth labeling of data points as outliers or non-outliers is often not available. Therefore, much of the research literature uses case studies to provide an intuitive and qualitative evaluation of the underlying outliers in unsupervised scenarios.”[5]

Data

Medicare Claims

Structure

	example
year	2021
rndrng_prvdr_ruca	1
rndrng_npi	1003000423
rndrng_prvdr_last_org_name	Velotta
rndrng_prvdr_type	Obstetrics & Gynecology
rndrng_prvdr_state_abrvtn	OH
rndrng_prvdr_city	Cleveland
tot_benes_mean	28.33333
tot_srvcs_mean	29
avg_mdcr_stdzd_amt_mean	28.49333

RUCA

USDA Economic Research Service.[7]

Medicare Top 25 Metros & Specialties - Outliers

	vars	n	mean	sd	median	min	max	range	skew	kurtosis	se
tot_benes_mean	1	132369	53.4	313.8	36.5	11	86733.0	86722.0	214.8	53786.7	0.9
tot_srvcs_mean	2	132369	110.3	422.5	56.0	11	86843.5	86832.5	120.8	22744.6	1.2

Medicare Top 25 Metros & Specialties - No Outliers

	vars	n	mean	sd	median	min	max	range	skew	kurtosis	se
tot_benes_mean	1	131404	49.3	46.5	36.3	11	1831.0	1820.0	6.1	97.6	0.1
tot_srvcs_mean	2	131404	101.2	194.4	55.7	11	12256.1	12245.1	14.1	414.8	0.5

Medicare Orthopedic Claims

	vars	n	mean	sd	median	min	max	range	skew	kurtosis	se
tot_benes_mean	1	430	47.0	27.3	41.8	11	205.9	194.9	1.6	4.3	1.3
tot_srvcs_mean	2	430	99.7	202.4	58.6	11	2185.3	2174.3	7.2	59.4	9.8

Scagnostics

Scag

Nostics

A statistical description of a bivariate plot using 9 metrics:

outlying
skewed
clumpy
sparse
striated
convex
skinny
stringy
monotonic

For more information, see Wilkinson.[8]

Mtcars

scagnostics::scagnostics(mtcars[, c(1:3)])

             mpg * cyl mpg * disp cyl * disp
Outlying  0.0000000000 0.00000000  0.0000000
Skewed    0.7440304272 0.78121701  0.9476808
Clumpy    0.4571866855 0.09834646  0.6989270
Sparse    0.1349678932 0.16478294  0.4984704
Striated  0.5000000000 0.03333333  0.6000000
Convex    0.0003757415 0.20208117  0.0000000
Skinny    0.7556553584 0.67252782  1.0000000
Stringy   0.6141250000 0.49616609  1.0000000
Monotonic 0.4771277625 0.58381611  0.8182267
attr(,"class")
[1] "scagnostics"

Interactive Dashboard

Review

Do Medicare claims resemble a logarithmic or normal distribution? Why?
In describing the tails of a distribution, would skew or kurtosis be more appropriate?
Is the KNN algorithm a “seminal” approach that remains “state of the art” or an “avant-garde” approach?
What diagnostics were created to describe the “point cloud” of a bivariate plot?

Thank You!

Bibliography

[1]

P. D. Talagala, R. J. Hyndman, and K. Smith-Miles, “Anomaly Detection in High-Dimensional Data,” Journal of Computational and Graphical Statistics, vol. 30, no. 2, pp. 360–374, Jun. 2021, doi: 10.1080/10618600.2020.1807997.

[2]

D. M. Hawkins, Identification of outliers, vol. 11. Springer, 1980.

[3]

M. Goldstein and S. Uchida, “A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data,” PLOS ONE, vol. 11, no. 4, p. e0152173, Apr. 2016, doi: 10.1371/journal.pone.0152173.

[4]

L. Wilkinson, “Visualizing Big Data Outliers Through Distributed Aggregation,” IEEE Transactions on Visualization and Computer Graphics, vol. 24, no. 1, pp. 256–266, Jan. 2018, doi: 10.1109/TVCG.2017.2744685.

[5]

C. C. Aggarwal, Outlier Analysis. Cham: Springer International Publishing, 2017. doi: 10.1007/978-3-319-47578-3.

[6]

G. O. Campos et al., “On the evaluation of unsupervised outlier detection: Measures, datasets, and an empirical study,” Data Mining and Knowledge Discovery, vol. 30, no. 4, pp. 891–927, Jul. 2016, doi: 10.1007/s10618-015-0444-8.

[7]

“USDA ERS - Rural-Urban Commuting Area Codes.” Accessed: Mar. 21, 2024. [Online]. Available: https://www.ers.usda.gov/data-products/rural-urban-commuting-area-codes/

[8]

L. Wilkinson and G. Wills, “Scagnostics distributions,” Journal of Computational and Graphical Statistics, vol. 17, no. 2, pp. 473–491, 2008.