Outlier Analysis

Rob Wiederstein

Overview

Illustration

Princess Fiona: “What kind of knight are you?”
Shrek: “One of a kind.”

Based Upon

Article[1]

Roadmap

  • Basics
  • Distributions
  • Models (KNN)
  • Part B Claims Data
  • Scagnostics
  • Interactive Display

Also Known As

“outliers, novelty, faults, deviants, discordant observations, extreme values/cases, change points, rare events, intrusions, misuses, exceptions, aberrations, surprises, peculiarities, odd values and contaminants”[1]

Definitions

  • Kurtosis – a measure of the tailedness of a distribution, that is, how often extreme values (outliers) occur.

  • Outlier – “An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.”[2]

  • Skewness – a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.

  • Standardize – scale all of the values in the dataset so that the mean is 0 and the standard deviation is 1.
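
The skewness and kurtosis definitions above can be made concrete with a minimal base-R sketch. The helper names `sample_skewness` and `sample_excess_kurtosis` are hypothetical (they are not from a package), and they use simple moment-based formulas; packaged summary functions may use slightly different estimators, so their values can differ a little.

```r
# Hypothetical helpers (not from a package): moment-based estimators.
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: 0 for a normal distribution, > 0 for heavier tails.
sample_excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

symmetric    <- c(1, 2, 3, 4, 5)   # toy data, made up for illustration
right_tailed <- c(1, 2, 3, 4, 50)  # same values with one extreme point

sample_skewness(symmetric)         # 0: no asymmetry
sample_skewness(right_tailed)      # positive: long right tail
sample_excess_kurtosis(symmetric)  # negative: lighter tails than a normal
```

A single extreme value is enough to turn a symmetric sample into a strongly right-skewed one, which is why both statistics are useful first checks for outliers.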

Symbols

Symbol Short Meaning
\(\mu\) “mew” mean
\(\sigma\) “sigma” std. dev.

Basics

Outliers Classified

\(c_1\) and \(c_2\) are clusters; \(x_1\) and \(x_2\) are global anomalies; \(x_3\) is a local anomaly; and \(c_3\) is potentially ambiguous. [3]

Continuum of Outlierness

Univariate Outliers

The detection of outliers in the observed distribution of a single variable spans the entire history of outlier detection. It spans this history not only because it is the simplest formulation of the problem, but also because it is deceptively simple.[4]

\[ \{1, 2, 3, 4, 50, 97, 98, 99\} \]

Distance from the Center Rule

“The word outlier implies lying at an extreme end of a set of ordered values – far away from the center of those values. The modern history of outlier detection emerged with methods that depend on a measure of centrality and a distance from that measure of centrality.” [4]

\[ \{1, 47, 47, 49, 51, 52, 55, 100\} \]

Common Outlier Definitions

  • More than 1.5 x the interquartile range (IQR) below the first quartile or above the third quartile (Tukey)

  • More than 3.0 standard deviations from the mean

  • Percentile?
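
As a rough sketch, the Tukey and standard-deviation rules can be applied to the eight example values shown under the Distance from the Center Rule. The variable names are illustrative, and `quantile()` is used with its default settings; other quartile conventions would shift the fences slightly.

```r
x <- c(1, 47, 47, 49, 51, 52, 55, 100)  # example values from the slide above

# Tukey rule: beyond 1.5 x IQR from the first or third quartile
q <- quantile(x, c(0.25, 0.75), names = FALSE)
fence <- 1.5 * (q[2] - q[1])
tukey_outliers <- x[x < q[1] - fence | x > q[2] + fence]

# Standard deviation rule: more than 3 standard deviations from the mean
sd_outliers <- x[abs(x - mean(x)) > 3 * sd(x)]

tukey_outliers  # 1 100
sd_outliers     # numeric(0)
```

The two rules disagree here: the extreme values 1 and 100 inflate the standard deviation so much that neither lies 3 standard deviations from the mean, while the quartile-based fences are barely affected by them.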

Four Methods to Identify Outliers

  1. Extreme Value Analysis
  2. Probabilistic and Statistical Models
  3. Linear Models
  4. Proximity-Based Models
    • Cluster
    • Density
    • Distance <==(We are here!)

Tools: Density Plot

Tools: Histogram (Binning)

Tools: Boxplots

Distributions

Normal

vars n mean sd median min max skew kurtosis se
y1 1 100 0.11 0.90 0.11 -2.21 2.40 -0.07 -0.05 0.09
y2 2 100 -0.08 1.92 -0.35 -3.83 4.62 0.44 -0.31 0.19
y3 3 100 0.09 3.10 0.00 -8.67 7.95 -0.24 0.26 0.31

Zipf

vars n mean sd median min max skew kurtosis se
y1 1 100 139.14 198.89 30.5 1 951 1.83 3.07 19.89
y2 2 100 237.34 270.50 123.0 1 976 1.23 0.40 27.05
y3 3 100 274.25 287.68 154.0 3 965 1.08 -0.11 28.77

Log

vars n mean sd median min max skew kurtosis se
y1 1 100 1.65 1.72 1.12 0.11 11.04 2.86 10.49 0.17
y2 2 100 6.25 16.53 0.70 0.02 101.08 3.89 15.74 1.65
y3 3 100 60.71 333.89 1.00 0.00 2828.50 7.10 51.51 33.39

Z-value Test

\[ Z_1 = \frac{|X_1 - \mu|}{\sigma} \]

where \(X_1\) = observation, \(\mu\) = mean, and \(\sigma\) = standard deviation

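# in R: absolute z-value of each observation, matching the formula above
# (assumes a data frame `df` with a numeric column `points`)
df$z <- abs(df$points - mean(df$points)) / sd(df$points)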

Normalized/Standardized

vars n mean sd median min max skew kurtosis se
uniform 1 100 25.97 13.94 25.94 0.59 48.78 -0.03 -1.17 1.39
transformed 2 100 0.00 1.00 0.00 -1.82 1.64 -0.03 -1.17 0.10
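
A minimal sketch of the standardization behind a table like this one, using base R's `scale()`. The seed and the uniform range are assumptions chosen only to resemble the summary above; they are not the values used to produce it.

```r
set.seed(1)                               # assumed seed, for reproducibility only
uniform <- runif(100, min = 0, max = 50)  # assumed range, for illustration

# scale() subtracts the mean and divides by the standard deviation;
# it returns a one-column matrix, so convert back to a plain vector
transformed <- as.numeric(scale(uniform))

mean(transformed)              # approximately 0
sd(transformed)                # 1, up to floating-point error
cor(uniform, transformed)      # 1: a linear rescaling keeps the shape
```

Because standardizing only shifts and rescales the values, shape statistics such as skew and kurtosis are unchanged, as the two rows of the table show.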

Boxplots

Zipf Box Plots

Normal Box Plot

Models

KNN

Best Unsupervised Learning Method

“By using a diverse collection of datasets, several evaluation measures, and a broad range of parameter settings, we argue here that it is typically pointless and unjustified to state the superior behavior of any method for the general case.”[6]

Evaluation of KNN

(a) is unnormalized data, aggregated over all datasets.
(b) is normalized data, aggregated over all datasets.[6]

What Works

“The gist of our findings is that, when considering the totality of results . . . the seminal methods kNN, kNNW, and LOF still remain the state of the art – none of the more recent methods tested offer any comprehensive improvement over those classics, while two methods in particular (LDF and KDEOS) have been found to be noticeably less robust to parameter choices.”[6]
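
For intuition, here is a minimal base-R sketch of a kNN outlier score, taken here as the distance from each point to its k-th nearest neighbor. The helper `knn_score` and the toy data are hypothetical, written for illustration and not taken from the cited study; the data are standardized first so that each variable contributes on the same scale.

```r
# Hypothetical helper: kNN outlier score = distance to the k-th nearest neighbor.
knn_score <- function(X, k) {
  D <- as.matrix(dist(X))  # pairwise Euclidean distances
  # Sort each row; position 1 is the point itself (distance 0),
  # so the k-th nearest neighbor sits at position k + 1.
  apply(D, 1, function(row) sort(row)[k + 1])
}

# Toy data (made up for illustration): a tight cluster plus one distant point
X <- rbind(c(0, 0), c(0, 1), c(1, 0), c(1, 1), c(10, 10))

scores <- knn_score(scale(X), k = 2)
which.max(scores)  # 5: the distant point has the largest score
```

The full distance matrix grows with the square of the number of rows, so this approach suits small examples only; larger datasets would need a nearest-neighbor index.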

How many neighbors?

Proof

“A key question arises as to how the effectiveness of an outlier detection algorithm should be evaluated. Unfortunately, this is often a difficult task, because . . . ground-truth labeling of data points as outliers or non-outliers is often not available. Therefore, much of the research literature uses case studies to provide an intuitive and qualitative evaluation of the underlying outliers in unsupervised scenarios.”[5]

Data

Medicare Claims

Structure

example
year 2021
rndrng_prvdr_ruca 1
rndrng_npi 1003000423
rndrng_prvdr_last_org_name Velotta
rndrng_prvdr_type Obstetrics & Gynecology
rndrng_prvdr_state_abrvtn OH
rndrng_prvdr_city Cleveland
tot_benes_mean 28.33333
tot_srvcs_mean 29
avg_mdcr_stdzd_amt_mean 28.49333

RUCA

USDA Economic Research Service.[7]

Medicare Top 25 Metros & Specialties - Outliers

vars n mean sd median min max range skew kurtosis se
tot_benes_mean 1 132369 53.4 313.8 36.5 11 86733.0 86722.0 214.8 53786.7 0.9
tot_srvcs_mean 2 132369 110.3 422.5 56.0 11 86843.5 86832.5 120.8 22744.6 1.2

Medicare Top 25 Metros & Specialties - No Outliers

vars n mean sd median min max range skew kurtosis se
tot_benes_mean 1 131404 49.3 46.5 36.3 11 1831.0 1820.0 6.1 97.6 0.1
tot_srvcs_mean 2 131404 101.2 194.4 55.7 11 12256.1 12245.1 14.1 414.8 0.5

Medicare Orthopedic Claims

vars n mean sd median min max range skew kurtosis se
tot_benes_mean 1 430 47.0 27.3 41.8 11 205.9 194.9 1.6 4.3 1.3
tot_srvcs_mean 2 430 99.7 202.4 58.6 11 2185.3 2174.3 7.2 59.4 9.8

Scagnostics

Scag

Nostics

A statistical description of a bivariate plot using 9 metrics:

  1. outlying
  2. skewed
  3. clumpy
  4. sparse
  5. striated
  6. convex
  7. skinny
  8. stringy
  9. monotonic

For more information, see Wilkinson.[8]

Mtcars


scagnostics::scagnostics(mtcars[, c(1:3)])
             mpg * cyl mpg * disp cyl * disp
Outlying  0.0000000000 0.00000000  0.0000000
Skewed    0.7440304272 0.78121701  0.9476808
Clumpy    0.4571866855 0.09834646  0.6989270
Sparse    0.1349678932 0.16478294  0.4984704
Striated  0.5000000000 0.03333333  0.6000000
Convex    0.0003757415 0.20208117  0.0000000
Skinny    0.7556553584 0.67252782  1.0000000
Stringy   0.6141250000 0.49616609  1.0000000
Monotonic 0.4771277625 0.58381611  0.8182267
attr(,"class")
[1] "scagnostics"

Interactive Dashboard

Review

  • Do Medicare claims resemble a logarithmic or normal distribution? Why?

  • In describing the tails of a distribution, would skew or kurtosis be more appropriate?

  • Is the KNN algorithm a “seminal” approach that remains “state of the art” or an “avant-garde” approach?

  • What diagnostics were created to describe the “point cloud” of a bivariate plot?

Thank You!

Bibliography

[1]
P. D. Talagala, R. J. Hyndman, and K. Smith-Miles, “Anomaly Detection in High-Dimensional Data,” Journal of Computational and Graphical Statistics, vol. 30, no. 2, pp. 360–374, Jun. 2021, doi: 10.1080/10618600.2020.1807997.
[2]
D. M. Hawkins, Identification of outliers, vol. 11. Springer, 1980.
[3]
M. Goldstein and S. Uchida, “A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data,” PLOS ONE, vol. 11, no. 4, p. e0152173, Apr. 2016, doi: 10.1371/journal.pone.0152173.
[4]
L. Wilkinson, “Visualizing Big Data Outliers Through Distributed Aggregation,” IEEE Transactions on Visualization and Computer Graphics, vol. 24, no. 1, pp. 256–266, Jan. 2018, doi: 10.1109/TVCG.2017.2744685.
[5]
C. C. Aggarwal, Outlier Analysis. Cham: Springer International Publishing, 2017. doi: 10.1007/978-3-319-47578-3.
[6]
G. O. Campos et al., “On the evaluation of unsupervised outlier detection: Measures, datasets, and an empirical study,” Data Mining and Knowledge Discovery, vol. 30, no. 4, pp. 891–927, Jul. 2016, doi: 10.1007/s10618-015-0444-8.
[7]
“USDA ERS - Rural-Urban Commuting Area Codes.” Accessed: Mar. 21, 2024. [Online]. Available: https://www.ers.usda.gov/data-products/rural-urban-commuting-area-codes/
[8]
L. Wilkinson and G. Wills, “Scagnostics distributions,” Journal of Computational and Graphical Statistics, vol. 17, no. 2, pp. 473–491, 2008.