  | vars | n | mean | sd | median | min | max | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|
y1 | 1 | 100 | 0.11 | 0.90 | 0.11 | -2.21 | 2.40 | -0.07 | -0.05 | 0.09 |
y2 | 2 | 100 | -0.08 | 1.92 | -0.35 | -3.83 | 4.62 | 0.44 | -0.31 | 0.19 |
y3 | 3 | 100 | 0.09 | 3.10 | 0.00 | -8.67 | 7.95 | -0.24 | 0.26 | 0.31 |
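Tables like the one above appear to come from R's `psych::describe`. A minimal sketch of computing the same statistics in Python with NumPy/SciPy (an assumed equivalent; the sampled values will differ from the table):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def describe(x):
    """Summary statistics in the style of the tables above."""
    return {
        "n": len(x),
        "mean": float(np.mean(x)),
        "sd": float(np.std(x, ddof=1)),
        "median": float(np.median(x)),
        "min": float(np.min(x)),
        "max": float(np.max(x)),
        "skew": float(stats.skew(x)),
        "kurtosis": float(stats.kurtosis(x)),  # excess kurtosis, ~0 for a normal
        "se": float(np.std(x, ddof=1) / np.sqrt(len(x))),
    }

# three normal samples with increasing spread, like y1-y3 above
summary = {f"y{i}": describe(rng.normal(0, i, 100)) for i in (1, 2, 3)}
```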
Princess Fiona: “What kind of knight are you?”

Shrek: “One of a kind.”
“outliers, novelty, faults, deviants, discordant observations, extreme values/cases, change points, rare events, intrusions, misuses, exceptions, aberrations, surprises, peculiarities, odd values and contaminants”[1]
Kurtosis – a measure of the tailedness of a distribution; tailedness is how often outliers occur.
Outlier – “An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.”[2]
Skewness – a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
Standardize – rescale all of the values in the dataset so that the mean is 0 and the standard deviation is 1.
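To make the skewness and kurtosis definitions concrete, a quick sketch (Python/SciPy assumed; `scipy.stats.kurtosis` returns *excess* kurtosis, so a normal sample scores near 0):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
symmetric = rng.normal(size=10_000)        # symmetric, light tails
right_skewed = rng.lognormal(size=10_000)  # long right tail

# skewness ~ 0 and excess kurtosis ~ 0 for the normal sample;
# both are strongly positive for the heavy-tailed sample
print(stats.skew(symmetric), stats.kurtosis(symmetric))
print(stats.skew(right_skewed), stats.kurtosis(right_skewed))
```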
Symbol | Short | Meaning |
---|---|---|
\(\mu\) | “mew” | mean |
\(\sigma\) | “sigma” | std. dev. |
The detection of outliers in the observed distribution of a single variable spans the entire history of outlier detection. It spans this history not only because it is the simplest formulation of the problem, but also because it is deceptively simple.[4]
\[ \{1, 2, 3, 4, 50, 97, 98, 99\} \]
“The word outlier implies lying at an extreme end of a set of ordered values – far away from the center of those values. The modern history of outlier detection emerged with methods that depend on a measure of centrality and a distance from that measure of centrality.” [4]
\[ \{1, 47, 47, 49, 51, 52, 55, 100\} \]
1.5 × the interquartile range – Tukey
3.0 × the standard deviation
Percentile?
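Applied to the second eight-point example above, these rules can disagree: Tukey's fences flag 1 and 100, while the 3-sigma rule flags nothing, because the extremes inflate the standard deviation. A Python sketch (the cutoffs are the usual conventions, not prescriptions):

```python
import numpy as np

data = np.array([1, 47, 47, 49, 51, 52, 55, 100])

# Tukey's rule: flag points beyond 1.5 * IQR outside the quartiles
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
tukey_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

# 3-sigma rule: flag points more than 3 standard deviations from the mean
z = np.abs(data - data.mean()) / data.std(ddof=1)
sigma_outliers = data[z > 3]

print(tukey_outliers)   # flags 1 and 100
print(sigma_outliers)   # empty: the extremes inflate the sd, masking themselves
```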
  | vars | n | mean | sd | median | min | max | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|
y1 | 1 | 100 | 0.11 | 0.90 | 0.11 | -2.21 | 2.40 | -0.07 | -0.05 | 0.09 |
y2 | 2 | 100 | -0.08 | 1.92 | -0.35 | -3.83 | 4.62 | 0.44 | -0.31 | 0.19 |
y3 | 3 | 100 | 0.09 | 3.10 | 0.00 | -8.67 | 7.95 | -0.24 | 0.26 | 0.31 |
  | vars | n | mean | sd | median | min | max | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|
y1 | 1 | 100 | 139.14 | 198.89 | 30.5 | 1 | 951 | 1.83 | 3.07 | 19.89 |
y2 | 2 | 100 | 237.34 | 270.50 | 123.0 | 1 | 976 | 1.23 | 0.40 | 27.05 |
y3 | 3 | 100 | 274.25 | 287.68 | 154.0 | 3 | 965 | 1.08 | -0.11 | 28.77 |
  | vars | n | mean | sd | median | min | max | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|
y1 | 1 | 100 | 1.65 | 1.72 | 1.12 | 0.11 | 11.04 | 2.86 | 10.49 | 0.17 |
y2 | 2 | 100 | 6.25 | 16.53 | 0.70 | 0.02 | 101.08 | 3.89 | 15.74 | 1.65 |
y3 | 3 | 100 | 60.71 | 333.89 | 1.00 | 0.00 | 2828.50 | 7.10 | 51.51 | 33.39 |
\[ Z_1 = \frac{| X_1 - \mu |}{\sigma} \]
where \(X_1\) = observation, \(\mu\) = mean, and \(\sigma\) = standard deviation
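Applying this formula to the first example set shows why single-variable detection is deceptively simple: the interior point 50 lies far from both clusters yet closest to the mean, so it receives the *lowest* score. Distance-from-center methods assume a single center. A Python sketch:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 50, 97, 98, 99], dtype=float)
mu = x.mean()
sigma = x.std(ddof=1)
z = np.abs(x - mu) / sigma  # Z_i = |X_i - mu| / sigma

# 50 sits far from both clusters, yet closest to the mean,
# so it gets the smallest z-score of the whole set
print(dict(zip(x.tolist(), z.round(2).tolist())))
```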
  | vars | n | mean | sd | median | min | max | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|
uniform | 1 | 100 | 25.97 | 13.94 | 25.94 | 0.59 | 48.78 | -0.03 | -1.17 | 1.39 |
transformed | 2 | 100 | 0.00 | 1.00 | 0.00 | -1.82 | 1.64 | -0.03 | -1.17 | 0.10 |
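The transformation in the table above is easy to reproduce. Note that standardizing shifts and rescales the values but leaves skew and kurtosis unchanged, exactly as the two rows show (Python sketch; the uniform sample here is an assumption, not the original data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
uniform = rng.uniform(0, 50, 100)

# standardize: subtract the mean, divide by the standard deviation
transformed = (uniform - uniform.mean()) / uniform.std(ddof=1)

# mean ~ 0, sd ~ 1 after the transform;
# shape statistics are unaffected by a linear rescaling
print(transformed.mean(), transformed.std(ddof=1))
print(stats.skew(uniform), stats.skew(transformed))
```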
“By using a diverse collection of datasets, several evaluation measures, and a broad range of parameter settings, we argue here that it is typically pointless and unjustified to state the superior behavior of any method for the general case.”[6]
“The gist of our findings is that, when considering the totality of results . . . the seminal methods kNN, kNNW, and LOF still remain the state of the art – none of the more recent methods tested offer any comprehensive improvement over those classics, while two methods in particular (LDF and KDEOS) have been found to be noticeably less robust to parameter choices.”[6]
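Of those “classics”, the kNN outlier score is the simplest: a point's score is its distance to its k-th nearest neighbor. A minimal sketch in plain NumPy (O(n²) pairwise distances, not the optimized implementations benchmarked in [6]):

```python
import numpy as np

def knn_outlier_scores(X, k=3):
    """Score each point by the distance to its k-th nearest neighbor."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    d.sort(axis=1)   # row i now holds point i's distances, ascending
    return d[:, k]   # column 0 is the zero distance to itself

rng = np.random.default_rng(1)
cluster = rng.normal(0, 1, size=(50, 2))
X = np.vstack([cluster, [[10.0, 10.0]]])  # plant one obvious outlier

scores = knn_outlier_scores(X)
# the planted point (index 50) receives the largest score
```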
“A key question arises as to how the effectiveness of an outlier detection algorithm should be evaluated. Unfortunately, this is often a difficult task, because . . . ground-truth labeling of data points as outliers or non-outliers is often not available. Therefore, much of the research literature uses case studies to provide an intuitive and qualitative evaluation of the underlying outliers in unsupervised scenarios.”[5]
  | example |
---|---|
year | 2021 |
rndrng_prvdr_ruca | 1 |
rndrng_npi | 1003000423 |
rndrng_prvdr_last_org_name | Velotta |
rndrng_prvdr_type | Obstetrics & Gynecology |
rndrng_prvdr_state_abrvtn | OH |
rndrng_prvdr_city | Cleveland |
tot_benes_mean | 28.33333 |
tot_srvcs_mean | 29 |
avg_mdcr_stdzd_amt_mean | 28.49333 |
  | vars | n | mean | sd | median | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|
tot_benes_mean | 1 | 132369 | 53.4 | 313.8 | 36.5 | 11 | 86733.0 | 86722.0 | 214.8 | 53786.7 | 0.9 |
tot_srvcs_mean | 2 | 132369 | 110.3 | 422.5 | 56.0 | 11 | 86843.5 | 86832.5 | 120.8 | 22744.6 | 1.2 |
  | vars | n | mean | sd | median | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|
tot_benes_mean | 1 | 131404 | 49.3 | 46.5 | 36.3 | 11 | 1831.0 | 1820.0 | 6.1 | 97.6 | 0.1 |
tot_srvcs_mean | 2 | 131404 | 101.2 | 194.4 | 55.7 | 11 | 12256.1 | 12245.1 | 14.1 | 414.8 | 0.5 |
  | vars | n | mean | sd | median | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|
tot_benes_mean | 1 | 430 | 47.0 | 27.3 | 41.8 | 11 | 205.9 | 194.9 | 1.6 | 4.3 | 1.3 |
tot_srvcs_mean | 2 | 430 | 99.7 | 202.4 | 58.6 | 11 | 2185.3 | 2174.3 | 7.2 | 59.4 | 9.8 |
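The progression across the three Medicare tables (skew and kurtosis collapsing as the sample is successively restricted) can be illustrated on synthetic heavy-tailed data. A sketch only: the actual filtering behind the tables above is not specified here, and the lognormal sample is an assumption:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
claims = rng.lognormal(mean=4, sigma=1, size=100_000)  # heavy right tail

cutoff = np.percentile(claims, 99)   # drop the top 1% of values
trimmed = claims[claims <= cutoff]

# trimming the extreme tail sharply reduces skew and kurtosis
print(stats.skew(claims), stats.kurtosis(claims))
print(stats.skew(trimmed), stats.kurtosis(trimmed))
```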
A statistical description of a bivariate plot using 9 metrics:
For more information, see Wilkinson.[8]
scagnostic | mpg * cyl | mpg * disp | cyl * disp |
---|---|---|---|
Outlying | 0.0000000000 | 0.00000000 | 0.0000000 |
Skewed | 0.7440304272 | 0.78121701 | 0.9476808 |
Clumpy | 0.4571866855 | 0.09834646 | 0.6989270 |
Sparse | 0.1349678932 | 0.16478294 | 0.4984704 |
Striated | 0.5000000000 | 0.03333333 | 0.6000000 |
Convex | 0.0003757415 | 0.20208117 | 0.0000000 |
Skinny | 0.7556553584 | 0.67252782 | 1.0000000 |
Stringy | 0.6141250000 | 0.49616609 | 1.0000000 |
Monotonic | 0.4771277625 | 0.58381611 | 0.8182267 |
Do Medicare claims resemble a log-normal or a normal distribution? Why?
In describing the tails of a distribution, would skew or kurtosis be most appropriate?
Is the kNN algorithm a “seminal” approach that remains “state of the art”, or an “avant-garde” approach?
What diagnostics were created to describe the “point cloud” of a bivariate plot?