The LEIE Database

Author

Rob Wiederstein

Abstract

U.S. Healthcare costs exceed 20% of GDP.
Healthcare fraud is estimated at 10% of total healthcare spending.
Dollars lost to fraud reduce resources available for treatment.
Supervised machine learning models frequently use the LEIE database as the source of the outcome variable.
Only a small percentage of LEIE records include an NPI, resulting in highly imbalanced classes.
Few, if any, machine learning studies have taken into account that an exclusion may not be the result of fraudulent activity.
Omitting the exclusions that are not indicative of fraud would make the classes even more imbalanced, jeopardizing the LEIE as a viable outcome variable.

1 Healthcare Fraud

Estimates of healthcare fraud vary dramatically. A 2015 article cited the statistic that, “[i]n the United States, roughly one-third of all healthcare expenses are caused by fraud, waste, and abuse.” (Thornton et al., 2015, p. 713)[1] While other articles have also found fraud prevalent, they estimate it at about 10% of U.S. medical claims.[2]–[4] Because much healthcare fraud goes undiscovered, there is “no record of these activities. The exact size of annual theft is unknown and is the subject of debate, for which healthcare fraud likely costs tens of billions of dollars a year.” (Herland et al., 2020, p. 6)[2]

Fraud is prevalent in healthcare due to “the difficulty of measuring performance output, the variable and complex nature of the work, little tolerance for ambiguity or error, the highly specialized nature of health services, management’s lack of control over the individuals doing the work, third party reimbursement” and the American “win-at-all-costs mentality.”[5]

Multiple organizations are involved in the prevention of healthcare fraud. Prominent among them are the Office of Inspector General (OIG), the Department of Justice (DOJ), and the Center for Program Integrity (CPI). The OIG, maintainer of the LEIE database, supports Medicare Strike Force Teams, which, along with federal, state, and local law enforcement, seek out those defrauding Medicare. As of August 2022, the Strike Force website lists 2,688 criminal actions, 3,483 indictments, and $4.7B in receivables.[6]

In July of 2022, the DOJ announced “criminal charges against 36 defendants in 13 federal districts across the United States for more than $1.2 billion in alleged fraudulent telemedicine, cardiovascular and cancer genetic testing, and durable medical equipment (DME) schemes.”[7]

Additionally, the Centers for Medicare & Medicaid Services (CMS) houses the Center for Program Integrity (CPI), whose mission “is to detect and combat fraud, waste and abuse of the Medicare and Medicaid programs.”[8] CPI reframes that mission as “making sure CMS is paying the right provider the right amount for services covered under our programs.”[8] CPI reports annually to Congress and reported an improper payment rate of 6.26% for 2021.[9]

2 Machine Learning

“[I]n the U.S. alone, the application of machine learning and data mining approaches has the potential to save the healthcare industry up to $450 billion each year.”[2] Due to the volume of healthcare claims and the expense associated with fraud, machine learning is an economical way to detect and identify bad actors. Healthcare fraud is described as a classic “big data” problem in that it meets the 5 V’s: volume, variety, velocity, veracity and value.[10], [11]

“The most common and well-accepted categorization that is used by machine learning experts divides data mining methods into ‘supervised’ and ‘unsupervised’ methods.”[12] “Supervised methods attempt to discover the relationship between input variables (attributes or features) and an output (dependent) variable (or target attribute). Unsupervised learning methods are applied when no prior information of the dependent variable is available for use.”[12] “Supervised methods are usually used for classification and prediction objectives including traditional statistical methods such as regression analysis.”[12] “Unsupervised methods are usually used for description including association rules extraction such as Apriori algorithm and segmentation methods such as clustering and anomaly detection.”[12] “[T]he studies demonstrate that both supervised and unsupervised techniques have important merits in discovering different fraud strategies and schemes.”[12]

In 2015, Thornton et al. noted that part of the challenge was the lack of publicly available and accessible datasets. “For the health insurance industry to succeed in combatting fraudsters, it must also know itself – its systems and how data mining and analytic techniques can be applied within them to detect fraudulent activity.”[1] “Based on practical experience, we expect the lack of training data (structured datasets containing health care fraud cases) and a lack of useful open data available as the main causes for the relative small amount of research into the technological aspect of health insurance fraud.”[1]

Machine learning can train a model to identify fraudulent claim patterns by learning the training set’s characteristics. However, machine learning encounters a host of difficulties including:

the lack of an explicit rule to identify fraudulent claims from non-fraudulent claims;
the ratio of fraudulent claims to non-fraudulent claims is small, a problem referred to as “class imbalance”;
claim variability can be extreme in that many variables are involved like disease, patient characteristics, doctor preferences;
fraudulent actors change their behaviors and methods over time in response to the compliance environment; and
frequent regulatory changes like changing drug lists and slow responses in fraud detection algorithms.[13]
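The class imbalance item in the list above can be made concrete with a little arithmetic. The following is a minimal sketch, not drawn from any study cited here: the label counts are hypothetical, and the weighting rule shown (inverse class frequency) is only one common heuristic for compensating for a rare class.

```python
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical toy labels: 3 fraudulent providers among 1,000.
labels = ["fraud"] * 3 + ["non_fraud"] * 997

# Share of the minority (fraud) class in the toy data.
minority_share = labels.count("fraud") / len(labels)

# The rare class receives a much larger weight than the common class.
weights = class_weights(labels)
```

With labels this skewed, a model that always predicts "non_fraud" would be 99.7% accurate on the toy data while detecting nothing, which is why accuracy alone is a poor yardstick under class imbalance.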

Machine learning holds promise for the efficient detection of healthcare fraud, but one study noted its preoccupation with technical methods as opposed to practical advice to managers and policy makers.[12]


4 Methodology

4.1 The LEIE database

“LEIE” is an acronym for the List of Excluded Individuals and Entities. The Department of Health and Human Services Office of Inspector General (OIG) maintains the list. The list includes individuals and entities who cannot be reimbursed by Medicare because of previous misconduct. “Medicare is a U.S. government program that provides healthcare insurance and financial support for the elderly population, ages 65 and older, and other select groups of beneficiaries.”[2] Examples of misconduct include a felony drug conviction or the fraudulent submission of a Medicare claim. Additionally, employing an individual or entity on the LEIE may expose the employer or contractor to civil monetary penalties (CMPs).

The exclusion list is the result of a series of Congressional initiatives to reduce healthcare fraud. “In 1977, in the Medicare-Medicaid Anti-Fraud and Abuse Amendments . . . Congress first mandated the exclusion of physicians and other practitioners convicted of program-related crimes from participation in Medicare and Medicaid.”[14]

Then, in 1981, Congress followed “with enactment of the Civil Monetary Penalties Law (CMPL) to further address health care fraud and abuse. The CMPL authorizes the Department and OIG to impose CMPs, assessments, and program exclusions against any person that submits false or fraudulent or certain other types of improper claims for Medicare or Medicaid payment.”[14] The enactment of the “Health Insurance Portability and Accountability Act (HIPAA) . . . and the Balanced Budget Act (BBA)” in 1996 and 1997 expanded OIG’s sanctioning authority and extended its exclusion authority to all “Federal health care programs.”[14]

“The effect of an OIG exclusion is that no Federal health care program payment may be made for any items or services furnished (1) by an excluded person or (2) at the medical direction or on the prescription of an excluded person. The exclusion and the payment prohibition continue to apply to an individual even if he or she changes from one health care profession to another while excluded.”[14] Employers have an ongoing obligation to know whether their employees or contractors are on the LEIE. Since the OIG updates the LEIE monthly, 2013 guidance suggested that a monthly check would minimize the probability of a violation. In 2011, CMS issued final regulations requiring states to screen all enrolled providers monthly.[14]
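A monthly screening check of the kind described above can be sketched in a few lines. This is an illustration under stated assumptions, not OIG's or any employer's procedure: the function name `screen_staff`, the record layout, and the sample records are all hypothetical, and the match is on NPI only.

```python
def screen_staff(staff, leie_rows):
    """Return staff records whose NPI appears on the exclusion list."""
    # Collect the NPIs present on the list, skipping blank or missing values.
    excluded_npis = {row["npi"] for row in leie_rows if row.get("npi")}
    return [person for person in staff if person.get("npi") in excluded_npis]


# Hypothetical records for illustration only.
staff = [
    {"name": "A. Example", "npi": "1111111111"},
    {"name": "B. Example", "npi": "4444444444"},
    {"name": "C. Example", "npi": None},  # staff member without an NPI
]
leie_rows = [
    {"npi": "1111111111"},
    {"npi": ""},  # excluded party with no NPI on file
]

flagged = screen_staff(staff, leie_rows)
```

Because many LEIE records carry no NPI, an NPI-only match like this one would miss most excluded parties; a real check would also need to compare names and other identifiers.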

The list is not comprehensive because many who commit fraud are not included. “For example, providers who are accused of overcharging insurers or Medicare often relinquish the overpayments without any public acknowledgement or notice.”[15] Many machine learning efforts use the LEIE to label providers as fraudulent or non-fraudulent.[2], [15] As of the date of download, August 11, 2023, the LEIE dataset included 77,942 individuals who have been excluded from participating in Medicare. Two other sanctions databases are sometimes mentioned in connection with the LEIE: the National Practitioner Data Bank (NPDB) and the Healthcare Integrity and Protection Data Bank (HIPDB). The HIPDB data was subsumed by the NPDB in 2013.[16]

4.1.1 Unique Identifiers for the Excluded

The list provides basic identifiers for those who are excluded from participating in Medicare. The information includes name, last location, and date of birth. However, social security numbers are omitted in deference to prevailing federal law. Providers may be identified by either the National Provider Identifier (NPI) or the Unique Physician Identification Number (UPIN).

According to the LEIE website, “the NPI (National Provider Identifier) has replaced the UPIN as the unique number used to identify health care providers. The Centers for Medicaid & Medicare Services first began assigning NPIs in 2006, and providers were required to use NPIs as of mid-2008.”

According to the LEIE website, “the UPIN (Unique Physician Identification Number) was established by the Centers for Medicare & Medicaid Services as a unique provider identifier in lieu of the SSN. UPINs were assigned to physicians as well as certain non-physician practitioners and medical group practices. CMS no longer maintains the UPIN registry.”

“Many individuals and entities that are excluded by OIG do not have NPIs to include in the LEIE. For those individuals and entities that have NPIs, OIG has added that information to records starting in 2008 and has included NPIs in the LEIE since that time.”

4.1.2 Missing NPIs

The share of excluded individuals and entities that can be identified by an NPI has been increasing since NPIs were first added to LEIE records in 2008. While the overall rate is 8.7%, 25% of newly added individuals and entities have an NPI included. This trend significantly improves the quality and usability of the database.
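The trend in NPI coverage can be checked by grouping records by exclusion year. The sketch below is illustrative only: it assumes each record is a dictionary with an `excldate` string in `YYYY-MM-DD` form and an `npi` string that is empty when missing (field names follow Table 1; the raw download may encode dates and missing values differently), and the sample rows are hypothetical.

```python
from collections import defaultdict


def npi_rate_by_year(rows):
    """Share of records with an NPI, keyed by exclusion year."""
    totals = defaultdict(int)
    with_npi = defaultdict(int)
    for row in rows:
        year = row["excldate"][:4]  # assumes an ISO-style date string
        totals[year] += 1
        if row["npi"]:  # empty string or None counts as missing
            with_npi[year] += 1
    return {year: with_npi[year] / totals[year] for year in sorted(totals)}


# Hypothetical rows for illustration only.
rows = [
    {"excldate": "2005-03-20", "npi": ""},
    {"excldate": "2005-09-20", "npi": ""},
    {"excldate": "2015-01-20", "npi": "1234567890"},
    {"excldate": "2015-06-18", "npi": ""},
    {"excldate": "2022-04-20", "npi": "1098765432"},
]

rates = npi_rate_by_year(rows)
overall_rate = sum(1 for r in rows if r["npi"]) / len(rows)
```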

4.1.3 Statutory Exclusions

The table below lists each statutory exclusion provision with the number of LEIE records citing it (n_count).

statute	agency	exclusion	description	n_count
1128b4	Social Security	Permissive	License revocation or suspension	31084
1128a1	Social Security	Mandatory	Conviction of program related crimes	23109
1128a2	Social Security	Mandatory	Conviction for patient abuse	7449
1128a3	Social Security	Mandatory	Felony conviction for healthcare fraud	4987
1128a4	Social Security	Mandatory	Felony conviction for controlled substance	3140
1128b14	Social Security	Permissive	Default on student loans or scholarships	2227
1128b8	Social Security	Permissive	Entities controlled by sanctioned individual	1494
1128b1	Social Security	Permissive	Conviction for misdemeanor fraud	861
1128b5	Social Security	Permissive	Suspension under federal or state health care program	802
1128b7	Social Security	Permissive	Fraud or kickbacks or other prohibited	683
1128b3	Social Security	Permissive	Conviction for misdemeanor controlled substance	316
1128b6	Social Security	Permissive	Claim for excessive charges	67
1128b2	Social Security	Permissive	Conviction for obstruction of an investigation	59
1156	Social Security	unknown	Services are not economical & medically necessary	56
1128b15	Social Security	Permissive	Individuals controlled by a sanctioned entity	34
1128b11	Social Security	Permissive	Failure to supply payment information	11
1128b16	Social Security	Permissive	Making a false statement of material fact	2
1128b12	Social Security	Permissive	Failure to grant immediate access	1
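The counts in the table can be totaled by exclusion type to see how the records divide between mandatory and permissive provisions. The following sketch copies the statute, exclusion, and n_count columns from the table; the variable and function names are illustrative.

```python
from collections import defaultdict

# (statute, exclusion type, n_count) copied from the table above.
EXCLUSIONS = [
    ("1128b4", "Permissive", 31084), ("1128a1", "Mandatory", 23109),
    ("1128a2", "Mandatory", 7449), ("1128a3", "Mandatory", 4987),
    ("1128a4", "Mandatory", 3140), ("1128b14", "Permissive", 2227),
    ("1128b8", "Permissive", 1494), ("1128b1", "Permissive", 861),
    ("1128b5", "Permissive", 802), ("1128b7", "Permissive", 683),
    ("1128b3", "Permissive", 316), ("1128b6", "Permissive", 67),
    ("1128b2", "Permissive", 59), ("1156", "unknown", 56),
    ("1128b15", "Permissive", 34), ("1128b11", "Permissive", 11),
    ("1128b16", "Permissive", 2), ("1128b12", "Permissive", 1),
]


def counts_by_type(rows):
    """Sum n_count for each exclusion type."""
    totals = defaultdict(int)
    for _statute, excl_type, n in rows:
        totals[excl_type] += n
    return dict(totals)


totals = counts_by_type(EXCLUSIONS)
grand_total = sum(totals.values())
shares = {excl_type: n / grand_total for excl_type, n in totals.items()}

# Share of all records under section 1128b4 (license revocation or suspension).
license_share = 31084 / grand_total
```

Mandatory exclusions account for 38,685 of the 76,382 records, or roughly half. License revocations or suspensions under section 1128b4 alone make up about 41% of the list.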

4.1.4 Previous Uses of the LEIE dataset

Herland 2020

Herland et al. retained only physicians with mandatory exclusions under section 1128. “The LEIE database does not include NPI numbers for all physicians and after preliminary analysis, we found that combining first name, last name, and address is not 100% reliable in determining identity.”[2] Only physicians whose NPI in the Part B data matched an NPI in the LEIE database were used. In total, 1,310 physicians were labeled fraudulent.
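The labeling approach described above amounts to a join on NPI restricted to mandatory exclusions. The sketch below is not Herland et al.'s code. It assumes LEIE records are dictionaries with `npi` and `excltype` fields, that `excltype` uses the statute codes shown in the statutory exclusions table, and that the Part B side is simply a list of NPIs; all sample values are hypothetical.

```python
# Mandatory exclusion provisions as coded in the statutory exclusions table.
MANDATORY = {"1128a1", "1128a2", "1128a3", "1128a4"}


def label_providers(part_b_npis, leie_rows):
    """Label each Part B NPI as fraud or non_fraud using mandatory exclusions."""
    fraud_npis = {
        row["npi"]
        for row in leie_rows
        if row["npi"] and row["excltype"] in MANDATORY
    }
    return {
        npi: ("fraud" if npi in fraud_npis else "non_fraud")
        for npi in part_b_npis
    }


# Hypothetical data for illustration only.
part_b_npis = ["1111111111", "2222222222", "3333333333"]
leie_rows = [
    {"npi": "1111111111", "excltype": "1128a1"},  # mandatory, matched
    {"npi": "2222222222", "excltype": "1128b4"},  # permissive, ignored
    {"npi": "", "excltype": "1128a3"},            # no NPI, cannot be matched
]

labels = label_providers(part_b_npis, leie_rows)
```

The third LEIE record shows why the resulting classes are so imbalanced: an excluded party without an NPI never reaches the positive class, regardless of the reason for the exclusion.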

4.1.5 Mandatory vs. Permissive Exclusions

4.1.6 Average Monthly Exclusions

4.1.7 Exclusions by Statutory Provision

4.1.8 Variable Importance

A random forest model was fit to the LEIE data to assess variable importance. An outcome variable was constructed that was “present” for an observation containing either an NPI or a UPIN and “absent” for an observation containing neither. Several variables were dropped, including first name, middle name, last name, street address, and the NPI and UPIN columns. The most important variable was the date of the exclusion. Since the presence of the NPI has been increasing over time, the results matched expectations. See Table 1.
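The random forest fit itself is not reproduced here. As a minimal sketch, the code below builds the "present"/"absent" outcome and computes the Gini impurity decrease from a single split on exclusion year, the quantity that impurity-based variable importance accumulates across the splits and trees of a forest. The field names follow Table 1, the date format (`YYYY-MM-DD`) and the sample rows are assumptions, and a single split is only a stand-in for a full model.

```python
def has_identifier(row):
    """Outcome: 'present' if the record has an NPI or a UPIN, else 'absent'."""
    return "present" if (row.get("npi") or row.get("upin")) else "absent"


def gini(labels):
    """Gini impurity of a binary label list: 2 * p * (1 - p)."""
    if not labels:
        return 0.0
    p = sum(1 for y in labels if y == "present") / len(labels)
    return 2 * p * (1 - p)


def gini_decrease(rows, year_threshold):
    """Impurity decrease from splitting records at an exclusion year."""
    labels = [has_identifier(r) for r in rows]
    years = [int(r["excldate"][:4]) for r in rows]
    left = [y for y, yr in zip(labels, years) if yr < year_threshold]
    right = [y for y, yr in zip(labels, years) if yr >= year_threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted


# Hypothetical rows for illustration only.
rows = [
    {"excldate": "1995-03-20", "npi": "", "upin": ""},
    {"excldate": "2001-07-19", "npi": "", "upin": ""},
    {"excldate": "2003-02-20", "npi": "", "upin": "A12345"},
    {"excldate": "2012-05-20", "npi": "1234567890", "upin": ""},
    {"excldate": "2016-08-18", "npi": "1098765432", "upin": ""},
    {"excldate": "2020-01-20", "npi": "", "upin": ""},
]

outcomes = [has_identifier(r) for r in rows]
decrease = gini_decrease(rows, 2008)
```

A larger decrease means the split separates "present" from "absent" records more cleanly.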

4.2 CMS Part B Claims Data

CMS furnishes “Part B” data. The data is a series of datasets that “provide information on services and procedures provided to Original Medicare (fee-for-service) Part B (Medical Insurance) beneficiaries by physicians and other healthcare professionals. These datasets contain information on use, payments, and submitted charges organized by National Provider Identifier (NPI), Healthcare Common Procedure Coding System (HCPCS) code, and geography.”[17] CMS groups the “unique National Provider Identification (NPI) numbers, Healthcare Common Procedure Coding System (HCPCS) code, and place of service (e.g. office or hospital).” [18]
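The grouping by NPI, HCPCS code, and place of service can be illustrated with a small aggregation. This is a sketch under assumptions: the field names (`npi`, `hcpcs`, `place_of_service`, `services`, `payment`) and the sample values are hypothetical and do not reproduce the column names of the CMS files.

```python
from collections import defaultdict


def group_claims(rows):
    """Sum services and payments for each (NPI, HCPCS, place of service)."""
    totals = defaultdict(lambda: {"services": 0, "payment": 0.0})
    for row in rows:
        key = (row["npi"], row["hcpcs"], row["place_of_service"])
        totals[key]["services"] += row["services"]
        totals[key]["payment"] += row["payment"]
    return dict(totals)


# Hypothetical rows for illustration only.
claims = [
    {"npi": "1111111111", "hcpcs": "99213", "place_of_service": "O",
     "services": 10, "payment": 500.0},
    {"npi": "1111111111", "hcpcs": "99213", "place_of_service": "O",
     "services": 5, "payment": 250.0},
    {"npi": "1111111111", "hcpcs": "99213", "place_of_service": "F",
     "services": 2, "payment": 80.0},
]

grouped = group_claims(claims)
```

Keying on NPI is what makes the LEIE join possible: each grouped row can be labeled only if the provider's NPI also appears in the exclusion list.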

5 Results

6 Conclusions

References

[1]

D. Thornton, M. Brinkhuis, C. Amrit, and R. Aly, “Categorizing and Describing the Types of Fraud in Healthcare,” Procedia Computer Science, vol. 64, pp. 713–720, Jan. 2015, doi: 10.1016/j.procs.2015.08.594.

[2]

M. Herland, R. A. Bauder, and T. M. Khoshgoftaar, “Approaches for identifying US medicare fraud in provider claims data,” Health care management science, vol. 23, pp. 2–19, 2020.

[3]

H. Joudaki et al., “Improving Fraud and Abuse Detection in General Physician Claims: A Data Mining Study,” International Journal of Health Policy and Management, vol. 5, no. 3, pp. 165–172, Nov. 2015, doi: 10.15171/ijhpm.2015.196.

[4]

M. P. Pawar, “Review on data mining techniques for fraud detection in health insurance,” IJETT, vol. 3, no. 2, pp. 1128–1131, 2016.

[5]

J. Byrd, P. Powell, and D. Smith, “Health Care Fraud: An Introduction to a Major Cost Issue.” Rochester, NY, Jun. 2013. Accessed: Mar. 25, 2023. [Online]. Available: https://papers.ssrn.com/abstract=2285860

[6]

“Medicare Fraud Strike Force,” Office of Inspector General | Government Oversight | U.S. Department of Health and Human Services. Mar. 2021. Accessed: Aug. 17, 2023. [Online]. Available: https://oig.hhs.gov/fraud/strike-force/

[7]

Department of Justice, “Office of Public Affairs | Justice Department Charges Dozens for $1.2 Billion in Health Care Fraud | United States Department of Justice.” Jul. 2022. Accessed: Aug. 17, 2023. [Online]. Available: https://www.justice.gov/opa/pr/justice-department-charges-dozens-12-billion-health-care-fraud

[8]

CMS, “CMS Center for Program Integrity.” 2023. Accessed: Aug. 17, 2023. [Online]. Available: https://www.cms.gov/About-CMS/Components/CPI

[9]

U.S. Department of Health and Human Services Centers for Medicare & Medicaid Services, “2021 Report to Congress Medicare and Medicaid Program Integrity.” May 2023. Accessed: Aug. 17, 2023. [Online]. Available: https://www.cms.gov/files/document/fy2021-medicare-and-medicaid-annual-report-congress.pdf

[10]

J. M. Johnson and T. M. Khoshgoftaar, “Deep Learning and Data Sampling with Imbalanced Big Data,” pp. 175–183, Jul. 2019, doi: 10.1109/iri.2019.00038.

[11]

M. Herland, T. M. Khoshgoftaar, and R. A. Bauder, “Big Data fraud detection using multiple medicare data sources,” Journal of Big Data, vol. 5, no. 1, pp. 1–21, Dec. 2018, doi: 10.1186/s40537-018-0138-3.

[12]

H. Joudaki et al., “Using Data Mining to Detect Health Care Fraud and Abuse: A Review of Literature,” Global Journal of Health Science, vol. 7, no. 1, pp. 194–202, Jan. 2015, doi: 10.5539/gjhs.v7n1p194.

[13]

C. Zhang, X. Xiao, and C. Wu, “Medical Fraud and Abuse Detection System Based on Machine Learning,” International Journal of Environmental Research and Public Health, vol. 17, no. 19, p. 7265, Oct. 2020, doi: 10.3390/ijerph17197265.

[14]

Department of Health and Human Services Office of Inspector General, “Special Advisory Bulletin on the Effect of Exclusion from Participation in Federal Health Care Programs.” Jun. 2020. Accessed: Sep. 02, 2023. [Online]. Available: https://www.hhs.gov/guidance/document/updated-special-advisory-bulletin-effect-exclusion-participation-federal-health-care

[15]

L. K. Branting, F. Reeder, J. Gold, and T. Champney, “Graph analytics for healthcare fraud risk estimation,” in 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Aug. 2016, pp. 845–851. doi: 10.1109/ASONAM.2016.7752336.

[16]

“The NPDB - Home Page.” Accessed: Sep. 03, 2023. [Online]. Available: https://www.npdb.hrsa.gov/index.jsp

[17]

Centers for Medicare and Medicaid Services, “Medicare Physician & Other Practitioners - Centers for Medicare & Medicaid Services Data.” 2023. Accessed: Aug. 29, 2023. [Online]. Available: https://data.cms.gov/provider-summary-by-type-of-service/medicare-physician-other-practitioners

[18]

R. A. Bauder and T. M. Khoshgoftaar, “A Survey of Medicare Data Processing and Integration for Fraud Detection,” pp. 9–14, Jul. 2018, doi: 10.1109/iri.2018.00010.

[19]

“Explainable Machine Learning Models for Medicare Fraud Detection.” Jun. 2023. doi: 10.21203/rs.3.rs-3076353/v1.

[20]

“Fraud Costs Medicare Billions of Dollars Every Year,” AARP. May 2023. Accessed: Aug. 17, 2023. [Online]. Available: https://www.aarp.org/money/scams-fraud/info-2019/medicare.html

[21]

R. A. Bauder and T. M. Khoshgoftaar, “Medicare Fraud Detection Using Machine Learning Methods,” pp. 858–865, Dec. 2017, doi: 10.1109/icmla.2017.00-48.

[22]

R. A. Bauder and T. M. Khoshgoftaar, “Medicare Fraud Detection Using Random Forest with Class Imbalanced Big Data,” pp. 80–87, Jul. 2018, doi: 10.1109/iri.2018.00019.

[23]

R. A. Bauder and T. M. Khoshgoftaar, “The detection of medicare fraud using machine learning methods with excluded provider labels,” in The Thirty-First International Flairs Conference, 2018.

[24]

R. Bauder, R. da Rosa, and T. Khoshgoftaar, “Identifying Medicare Provider Fraud with Unsupervised Machine Learning,” in 2018 IEEE International Conference on Information Reuse and Integration (IRI), Jul. 2018, pp. 285–292. doi: 10.1109/IRI.2018.00051.

[25]

R. A. Bauder and T. M. Khoshgoftaar, “The effects of varying class distribution on learner behavior for medicare fraud detection with imbalanced big data,” vol. 6, no. 1, pp. 9–9, Sep. 2018, doi: 10.1007/s13755-018-0051-3.

[26]

D. M. Berwick and A. D. Hackbarth, “Eliminating Waste in US Health Care,” JAMA, vol. 307, no. 14, pp. 1513–1516, Apr. 2012, doi: 10.1001/jama.2012.362.

[27]

V. Chandola, S. R. Sukumar, and J. C. Schryver, “Knowledge discovery from massive healthcare claims data,” pp. 1312–1320, Aug. 2013, doi: 10.1145/2487575.2488205.

[28]

M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim, “Do we need hundreds of classifiers to solve real world classification problems?” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 3133–3181, Jan. 2014.

[29]

P. J. García, “Corruption in global health: The open secret,” The Lancet, vol. 394, no. 10214, pp. 2119–2124, Dec. 2019, doi: 10.1016/S0140-6736(19)32527-9.

[30]

M. Herland, R. A. Bauder, and T. M. Khoshgoftaar, “The effects of class rarity on the evaluation of supervised healthcare fraud detection models,” Journal of Big Data, vol. 6, no. 1, pp. 1–33, Feb. 2019, doi: 10.1186/s40537-019-0181-8.

[31]

J. M. Johnson and T. M. Khoshgoftaar, “Medicare fraud detection using neural networks,” Journal of Big Data, vol. 6, no. 1, pp. 1–35, Jul. 2019, doi: 10.1186/s40537-019-0225-0.

[32]

OIG Texas HHS, “Exclusions | Office of Inspector General.” Accessed: Sep. 01, 2023. [Online]. Available: https://oig.hhs.texas.gov/exclusions

[33]

B. Peluso et al., “Health care fraud,” vol. 60, p. 937, 2023.

[34]

S. Shekhar, J. Leder-Luis, and L. Akoglu, “Unsupervised Machine Learning for Explainable Health Care Fraud Detection.” Rochester, NY, Feb. 2023. Accessed: Aug. 09, 2023. [Online]. Available: https://papers.ssrn.com/abstract=4356230

[35]

S. S. Waghade, “A Comprehensive Study of Healthcare Fraud Detection based on Machine Learning,” 2018. Accessed: Mar. 25, 2023. [Online]. Available: https://www.semanticscholar.org/paper/A-Comprehensive-Study-of-Healthcare-Fraud-Detection-Waghade/ca2337f7216a879b8595613adee14246dd4fead0

[36]

“The $272 billion swindle: Why thieves love America’s health-care system,” The Economist, Accessed: Aug. 22, 2023. [Online]. Available: https://www.economist.com/united-states/2014/05/31/the-272-billion-swindle

[37]

“Exclusions | Office of Inspector General | U.S. Department of Health and Human Services.” Accessed: Aug. 30, 2023. [Online]. Available: https://www.oig.hhs.gov/exclusions/index.asp

[38]

“Medicaid Exclusions | Office of the Medicaid Inspector General.” Accessed: Sep. 01, 2023. [Online]. Available: https://omig.ny.gov/medicaid-fraud/medicaid-exclusions

[39]

“Provider Suspended and Ineligible List (S&I List) - California Health and Human Services Open Data Portal.” Accessed: Sep. 01, 2023. [Online]. Available: https://data.chhs.ca.gov/dataset/provider-suspended-and-ineligible-list-s-i-list

[40]

R Core Team, R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2023. Available: https://www.R-project.org/

[41]

A. Liaw and M. Wiener, “Classification and regression by randomForest,” R News, vol. 2, no. 3, pp. 18–22, 2002, Available: https://CRAN.R-project.org/doc/Rnews/

Tables

Table 1: LEIE Variable Summary
skim_type	skim_variable	n_missing	complete_rate	numeric.mean	Date.n_unique	character.n_unique
Date	dob	4003	0.9475924	NA	21092	NA
Date	excldate	0	1.0000000	NA	2258	NA
character	lastname	3153	0.9587206	NA	NA	29340
character	firstname	3153	0.9587206	NA	NA	11900
character	midname	22445	0.7061480	NA	NA	8664
character	busname	73229	0.0412794	NA	NA	3100
character	general	0	1.0000000	NA	NA	87
character	specialty	4088	0.9464795	NA	NA	199
character	upin	70248	0.0803069	NA	NA	5957
character	npi	69963	0.0840381	NA	NA	6293
character	address	9	0.9998822	NA	NA	72452
character	city	1	0.9999869	NA	NA	9919
character	state	5	0.9999345	NA	NA	60
character	zip	0	1.0000000	NA	NA	17408
character	excltype	0	1.0000000	NA	NA	18
character	reindate	76382	0.0000000	NA	NA	0
character	waiverdate	76382	0.0000000	NA	NA	0
character	wvrstate	76381	0.0000131	NA	NA	1
character	exclusion	0	1.0000000	NA	NA	3
factor	type	0	1.0000000	NA	NA	NA
numeric	age_at_excl	4003	0.9475924	45.64651	NA	NA