The LEIE Database

Author

Rob Wiederstein

Abstract

  • U.S. Healthcare costs exceed 20% of GDP.
  • Healthcare fraud is estimated at 10% of total healthcare spending.
  • Dollars lost to fraud reduce resources available for treatment.
  • To date, supervised machine learning models frequently use the LEIE database for the outcome variable.
  • Only a small percentage of LEIE records include NPIs, resulting in highly unbalanced classes.
  • Few, if any, machine learning studies have taken into account that the exclusions may not be the result of fraudulent activity.
  • Some exclusions are not indicative of fraud. Omitting them would create even more highly unbalanced classes, jeopardizing the LEIE as a viable outcome variable.

1 Healthcare Fraud

Estimates of healthcare fraud vary dramatically. A 2015 article cited the statistic that, “[i]n the United States, roughly one-third of all healthcare expenses are caused by fraud, waste, and abuse” (Thornton et al., 2015, p. 713).[1] Other articles also find fraud prevalent but estimate it at about 10% of U.S. medical claims.[2][4] Because much healthcare fraud goes undiscovered, there is “no record of these activities. The exact size of annual theft is unknown and is the subject of debate, for which healthcare fraud likely costs tens of billions of dollars a year” (Herland et al., 2020, p. 6).[2]

Fraud is prevalent in healthcare due to “the difficulty of measuring performance output, the variable and complex nature of the work, little tolerance for ambiguity or error, the highly specialized nature of health services, management’s lack of control over the individuals doing the work, third party reimbursement” and the American “win-at-all-costs mentality.”[5]

Multiple organizations are involved in preventing healthcare fraud. Prominent among them are the Office of Inspector General (OIG) of the Department of Health and Human Services, the Department of Justice (DOJ), and the Center for Program Integrity (CPI). The OIG, which maintains the LEIE database, supports Medicare Strike Force Teams, which work with federal, state, and local law enforcement to pursue those defrauding Medicare. As of August 2022, the Strike Force website lists 2,688 criminal actions, 3,483 indictments, and $4.7B in receivables.[6]

In July of 2022, the DOJ announced “criminal charges against 36 defendants in 13 federal districts across the United States for more than $1.2 billion in alleged fraudulent telemedicine, cardiovascular and cancer genetic testing, and durable medical equipment (DME) schemes.”[7]

Additionally, the Centers for Medicare & Medicaid Services (CMS) houses the Center for Program Integrity (CPI), whose mission “is to detect and combat fraud, waste and abuse of the Medicare and Medicaid programs.”[8] CPI reframes that mission as “making sure CMS is paying the right provider the right amount for services covered under our programs.”[8] CPI reports annually to Congress and reported an improper payment rate of 6.26% in 2021.[9]

2 Machine Learning

“[I]n the U.S. alone, the application of machine learning and data mining approaches has the potential to save the healthcare industry up to $450 billion each year.”[2] Given the volume of healthcare claims and the expense associated with fraud, machine learning is an economical way to detect and identify bad actors. Healthcare fraud is described as a classic “big data” problem in that it meets the five V’s: volume, variety, velocity, veracity, and value.[10][11]

“The most common and well-accepted categorization that is used by machine learning experts divides data mining methods into ‘supervised’ and ‘unsupervised’ methods.”[12] “Supervised methods attempt to discover the relationship between input variables (attributes or features) and an output (dependent) variable (or target attribute). Unsupervised learning methods are applied when no prior information of the dependent variable is available for use.”[12] “Supervised methods are usually used for classification and prediction objectives including traditional statistical methods such as regression analysis,”[12] while “[u]nsupervised methods are usually used for description including association rules extraction such as Apriori algorithm and segmentation methods such as clustering and anomaly detection.”[12] Overall, “the studies demonstrate that both supervised and unsupervised techniques have important merits in discovering different fraud strategies and schemes.”[12]

In 2015, Thornton et al. noted that part of the challenge was the lack of publicly available and accessible datasets. “For the health insurance industry to succeed in combatting fraudsters, it must also know itself – its systems and how data mining and analytic techniques can be applied within them to detect fraudulent activity.”[1] “Based on practical experience, we expect the lack of training data (structured datasets containing health care fraud cases) and a lack of useful open data available as the main causes for the relative small amount of research into the technological aspect of health insurance fraud.”[1]

Machine learning can train a model to identify fraudulent claim patterns by learning the training set’s characteristics. However, machine learning encounters a host of difficulties including:

  1. the lack of an explicit rule to distinguish fraudulent claims from non-fraudulent claims;

  2. the small ratio of fraudulent to non-fraudulent claims, a problem referred to as “class imbalance”;

  3. extreme claim variability, since many variables are involved, such as disease, patient characteristics, and doctor preferences;

  4. fraudulent actors who change their behaviors and methods over time in response to the compliance environment; and

  5. frequent regulatory changes, such as changing drug lists, and slow responses in fraud detection algorithms.[13]
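
The class imbalance problem in item 2 can be made concrete with a short calculation. The Python sketch below uses made-up label counts (not figures from any study) and one common reweighting heuristic, n_samples / (n_classes * n_in_class); other weighting or resampling schemes are equally plausible.

```python
from collections import Counter


def imbalance_summary(labels):
    """Return the minority-class share and inverse-frequency class weights."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    minority_share = min(counts.values()) / n_samples
    # Heuristic (an assumption, not a prescribed method): weight each class
    # inversely to its frequency so both classes contribute equally overall.
    weights = {cls: n_samples / (n_classes * n) for cls, n in counts.items()}
    return minority_share, weights


# Hypothetical labels: 10 fraudulent providers among 1,000.
labels = ["fraud"] * 10 + ["non_fraud"] * 990
share, weights = imbalance_summary(labels)
print(f"minority share: {share:.3f}")
print(weights)
```

With these hypothetical counts, the rare class receives a weight of 50 while the common class receives a weight of about 0.5, which illustrates how heavily a learner must be pushed to attend to the fraud class.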

Machine learning holds promise for the efficient detection of healthcare fraud, but one review noted the literature’s preoccupation with technical methods as opposed to practical advice for managers and policy makers.[12]


4 Methodology

4.1 The LEIE database

“LEIE” is an acronym for the List of Excluded Individuals and Entities. The Office of Inspector General (OIG) of the U.S. Department of Health and Human Services maintains the list, which includes individuals and entities who cannot be reimbursed by Medicare because of previous misconduct. “Medicare is a U.S. government program that provides healthcare insurance and financial support for the elderly population, ages 65 and older, and other select groups of beneficiaries.”[2] Examples of misconduct include a felony drug conviction or the fraudulent submission of a Medicare claim. Additionally, employing an individual or entity on the LEIE may expose the employer or contractor to civil monetary penalties (CMPs).

The exclusion list is the result of a series of Congressional initiatives to reduce healthcare fraud. “In 1977, in the Medicare-Medicaid Anti-Fraud and Abuse Amendments . . . Congress first mandated the exclusion of physicians and other practitioners convicted of program-related crimes from participation in Medicare and Medicaid.”[14]

In 1981, this was followed “with enactment of the Civil Monetary Penalties Law (CMPL) to further address health care fraud and abuse. The CMPL authorizes the Department and OIG to impose CMPs, assessments, and program exclusions against any person that submits false or fraudulent or certain other types of improper claims for Medicare or Medicaid payment.”[14] The “Health Insurance Portability and Accountability Act (HIPAA) . . . and the Balanced Budget Act (BBA),” enacted in 1996 and 1997, expanded OIG’s sanctioning authority, extending its exclusion authority to all “Federal health care programs.”[14]

“The effect of an OIG exclusion is that no Federal health care program payment may be made for any items or services furnished (1) by an excluded person or (2) at the medical direction or on the prescription of an excluded person. The exclusion and the payment prohibition continue to apply to an individual even if he or she changes from one health care profession to another while excluded.”[14] Employers have an ongoing obligation to know whether their employees or contractors are on the LEIE. Since the OIG updates the LEIE monthly, 2013 guidance suggested that a monthly check would minimize the probability of a violation. In 2011, CMS issued final regulations requiring states to screen all enrolled providers monthly.[14]

The list is not comprehensive because many who commit fraud are not included. “For example, providers who are accused of overcharging insurers or Medicare often relinquish the overpayments without any public acknowledgement or notice.”[15] Many machine learning efforts use the LEIE to label providers as fraudulent or non-fraudulent.[2][15] As of the date of download, August 11, 2023, the LEIE dataset included 77,942 individuals and entities who have been excluded from participating in Medicare. Two other sanctions databases are sometimes mentioned in connection with the LEIE: the National Practitioner Data Bank (NPDB) and the Healthcare Integrity and Protection Data Bank (HIPDB). The HIPDB data was subsumed by the NPDB in 2013.[16]

4.1.1 Unique Identifiers for the Excluded

The list provides basic identifiers for those who are excluded from participating in Medicare. The information includes name, last location, and date of birth. However, social security numbers are omitted in deference to prevailing federal law. Providers may be identified by either the National Provider Identifier (NPI) or the Unique Physician Identification Number (UPIN).

According to the LEIE website, “the NPI (National Provider Identifier) has replaced the UPIN as the unique number used to identify health care providers. The Centers for Medicare & Medicaid Services first began assigning NPIs in 2006, and providers were required to use NPIs as of mid-2008.”

According to the LEIE website, “the UPIN (Unique Physician Identification Number) was established by the Centers for Medicare & Medicaid Services as a unique provider identifier in lieu of the SSN. UPINs were assigned to physicians as well as certain non-physician practitioners and medical group practices. CMS no longer maintains the UPIN registry.”

“Many individuals and entities that are excluded by OIG do not have NPIs to include in the LEIE. For those individuals and entities that have NPIs, OIG has added that information to records starting in 2008 and has included NPIs in the LEIE since that time.”

4.1.2 Missing NPIs

The share of excluded individuals and entities that can be identified by an NPI has been increasing since NPIs were adopted in 2009. While the overall rate is 8.7%, 25% of newly added individuals and entities have an NPI included. This trend improves the quality and usability of the database.
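
One way to check this trend is to compute the share of records carrying an NPI for each exclusion year. The Python sketch below is illustrative only: it assumes the field names npi and excldate shown in Table 1, an excldate stored as a YYYYMMDD string, and an all-zero placeholder for a missing NPI. The sample rows are made up.

```python
from collections import defaultdict


def has_npi(value):
    """Treat empty values and all-zero placeholders as missing (assumed convention)."""
    value = (value or "").strip()
    return bool(value) and set(value) != {"0"}


def npi_rate_by_year(records):
    """Map exclusion year -> share of records that carry an NPI."""
    totals = defaultdict(int)
    with_npi = defaultdict(int)
    for rec in records:
        year = rec["excldate"][:4]  # assumes YYYYMMDD strings
        totals[year] += 1
        if has_npi(rec["npi"]):
            with_npi[year] += 1
    return {year: with_npi[year] / totals[year] for year in sorted(totals)}


# Made-up rows for illustration only.
records = [
    {"npi": "0000000000", "excldate": "20050120"},
    {"npi": "0000000000", "excldate": "20050620"},
    {"npi": "1234567893", "excldate": "20220120"},
    {"npi": "", "excldate": "20220320"},
]
rates = npi_rate_by_year(records)
print(rates)
```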

4.1.3 Statutory Exclusions


Statute  Act              Exclusion type  Description                                            Count
1128b4   Social Security  Permissive      License revocation or suspension                       31084
1128a1   Social Security  Mandatory       Conviction of program related crimes                   23109
1128a2   Social Security  Mandatory       Conviction for patient abuse                           7449
1128a3   Social Security  Mandatory       Felony conviction for healthcare fraud                 4987
1128a4   Social Security  Mandatory       Felony conviction for controlled substance             3140
1128b14  Social Security  Permissive      Default on student loans or scholarships               2227
1128b8   Social Security  Permissive      Entities controlled by sanctioned individual           1494
1128b1   Social Security  Permissive      Conviction for misdemeanor fraud                       861
1128b5   Social Security  Permissive      Suspension under federal or state health care program  802
1128b7   Social Security  Permissive      Fraud or kickbacks or other prohibited                 683
1128b3   Social Security  Permissive      Conviction for misdemeanor controlled substance        316
1128b6   Social Security  Permissive      Claim for excessive charges                            67
1128b2   Social Security  Permissive      Conviction for obstruction of an investigation         59
1156     Social Security  unknown         Services are not economical & medically necessary      56
1128b15  Social Security  Permissive      Individuals controlled by a sanctioned entity          34
1128b11  Social Security  Permissive      Failure to supply payment information                  11
1128b16  Social Security  Permissive      Making a false statement of material fact              2
1128b12  Social Security  Permissive      Failure to grant immediate access                      1
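
The counts in the table above can be tallied by exclusion type. The Python sketch below copies the counts from the table; the grouping rule (codes beginning with 1128a are mandatory, codes beginning with 1128b are permissive, anything else is unknown) simply mirrors the table's own labels.

```python
# Counts copied from the statutory exclusions table.
counts = {
    "1128b4": 31084, "1128a1": 23109, "1128a2": 7449, "1128a3": 4987,
    "1128a4": 3140, "1128b14": 2227, "1128b8": 1494, "1128b1": 861,
    "1128b5": 802, "1128b7": 683, "1128b3": 316, "1128b6": 67,
    "1128b2": 59, "1156": 56, "1128b15": 34, "1128b11": 11,
    "1128b16": 2, "1128b12": 1,
}


def exclusion_type(code):
    """Classify a statute code using the prefixes seen in the table."""
    if code.startswith("1128a"):
        return "mandatory"
    if code.startswith("1128b"):
        return "permissive"
    return "unknown"


totals = {}
for code, n in counts.items():
    kind = exclusion_type(code)
    totals[kind] = totals.get(kind, 0) + n

grand_total = sum(totals.values())
for kind, n in totals.items():
    print(f"{kind}: {n} ({n / grand_total:.1%})")
```

With the counts as tabulated, mandatory exclusions total 38,685 and permissive exclusions 37,641 of 76,382 records, with 56 records under section 1156.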

4.1.4 Previous Uses of the LEIE dataset

  • Herland 2020

Herland et al. retained only physicians with mandatory exclusions under section 1128. “The LEIE database does not include NPI numbers for all physicians and after preliminary analysis, we found that combining first name, last name, and address is not 100% reliable in determining identity.”[2] Only physicians whose NPI numbers in the Part B data matched the LEIE database were used. In total, 1,310 physicians were deemed fraudulent.
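
The NPI-matching step described above can be sketched as a set-membership join. This is an illustration in Python, not Herland et al.'s code: the field names, the rule for selecting mandatory exclusions (codes beginning with 1128a), the all-zero placeholder for a missing NPI, and the sample identifiers are all assumptions.

```python
def fraud_labels(part_b_npis, leie_records):
    """Label each Part B NPI 1 if it matches a mandatory LEIE exclusion, else 0."""
    excluded = {
        rec["npi"]
        for rec in leie_records
        # Assumption: mandatory exclusions are the codes starting with "1128a",
        # and an all-zero NPI means the identifier is missing.
        if rec["excltype"].startswith("1128a") and rec["npi"].strip("0")
    }
    return {npi: int(npi in excluded) for npi in part_b_npis}


# Made-up identifiers for illustration only.
leie = [
    {"npi": "1111111111", "excltype": "1128a1"},  # mandatory, has NPI
    {"npi": "2222222222", "excltype": "1128b4"},  # permissive, ignored
    {"npi": "0000000000", "excltype": "1128a3"},  # mandatory, NPI missing
]
part_b = ["1111111111", "2222222222", "3333333333"]
labels = fraud_labels(part_b, leie)
print(labels)
```

Matching on the NPI alone avoids the name-and-address ambiguity noted in the quotation, at the cost of discarding every LEIE record that lacks an NPI.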

4.1.5 Mandatory vs. Permissive Exclusions

4.1.6 Average Monthly Exclusions

4.1.7 Exclusions by Statutory Provision

4.1.8 Variable Importance

A random forest model was applied to the LEIE data to check for variable importance. An outcome variable was constructed that was “present” for an observation containing either an NPI or a UPIN and “absent” for an observation containing neither. Several variables were dropped, including first name, middle name, last name, street address, and the npi and upin columns. The most important variable was the date of the exclusion. Since the presence of the NPI has been increasing over time, the results matched expectations. See Table 1.
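
The outcome variable described above can be derived as in the following Python sketch. It only builds the “present”/“absent” label and drops the identifying columns; it does not fit the random forest. Column names follow Table 1, while the all-zero placeholder for a missing identifier and the sample record are assumptions.

```python
# Identifying columns removed before modeling (names as in Table 1).
DROP_COLUMNS = {"firstname", "midname", "lastname", "address", "npi", "upin"}


def is_missing(value):
    """Empty, None, or all-zero identifiers count as missing (assumed convention)."""
    value = (value or "").strip()
    return not value or set(value) == {"0"}


def make_example(record):
    """Return (features, outcome) for one LEIE record."""
    has_id = not is_missing(record.get("npi")) or not is_missing(record.get("upin"))
    outcome = "present" if has_id else "absent"
    features = {k: v for k, v in record.items() if k not in DROP_COLUMNS}
    return features, outcome


# Made-up record for illustration only.
record = {
    "lastname": "DOE", "firstname": "JANE", "midname": "", "address": "1 MAIN ST",
    "state": "KY", "excltype": "1128b4", "excldate": "20150820",
    "npi": "0000000000", "upin": "A12345",
}
features, outcome = make_example(record)
print(outcome, features)
```

Dropping the npi and upin columns matters here because the outcome is defined from them; leaving them in would let the model recover the label trivially.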

4.2 CMS Part B Claims Data

CMS furnishes “Part B” data, a series of datasets that “provide information on services and procedures provided to Original Medicare (fee-for-service) Part B (Medical Insurance) beneficiaries by physicians and other healthcare professionals. These datasets contain information on use, payments, and submitted charges organized by National Provider Identifier (NPI), Healthcare Common Procedure Coding System (HCPCS) code, and geography.”[17] CMS aggregates the data by “unique National Provider Identification (NPI) numbers, Healthcare Common Procedure Coding System (HCPCS) code, and place of service (e.g. office or hospital).”[18]

5 Results

6 Conclusions

References

[1]
D. Thornton, M. Brinkhuis, C. Amrit, and R. Aly, “Categorizing and Describing the Types of Fraud in Healthcare,” Procedia Computer Science, vol. 64, pp. 713–720, Jan. 2015, doi: 10.1016/j.procs.2015.08.594.
[2]
M. Herland, R. A. Bauder, and T. M. Khoshgoftaar, “Approaches for identifying US medicare fraud in provider claims data,” Health care management science, vol. 23, pp. 2–19, 2020.
[3]
H. Joudaki et al., “Improving Fraud and Abuse Detection in General Physician Claims: A Data Mining Study,” International Journal of Health Policy and Management, vol. 5, no. 3, pp. 165–172, Nov. 2015, doi: 10.15171/ijhpm.2015.196.
[4]
M. P. Pawar, “Review on data mining techniques for fraud detection in health insurance,” IJETT, vol. 3, no. 2, pp. 1128–1131, 2016.
[5]
J. Byrd, P. Powell, and D. Smith, “Health Care Fraud: An Introduction to a Major Cost Issue.” Rochester, NY, Jun. 2013. Accessed: Mar. 25, 2023. [Online]. Available: https://papers.ssrn.com/abstract=2285860
[6]
“Medicare Fraud Strike Force,” Office of Inspector General | Government Oversight | U.S. Department of Health and Human Services. Mar. 2021. Accessed: Aug. 17, 2023. [Online]. Available: https://oig.hhs.gov/fraud/strike-force/
[7]
Department of Justice, “Office of Public Affairs | Justice Department Charges Dozens for $1.2 Billion in Health Care Fraud | United States Department of Justice.” Jul. 2022. Accessed: Aug. 17, 2023. [Online]. Available: https://www.justice.gov/opa/pr/justice-department-charges-dozens-12-billion-health-care-fraud
[8]
CMS, “CMS Center for Program Integrity.” 2023. Accessed: Aug. 17, 2023. [Online]. Available: https://www.cms.gov/About-CMS/Components/CPI
[9]
U.S. Department of Health and Human Services Centers for Medicare & Medicaid Services, “2021 Report to Congress Medicare and Medicaid Program Integrity.” May 2023. Accessed: Aug. 17, 2023. [Online]. Available: https://www.cms.gov/files/document/fy2021-medicare-and-medicaid-annual-report-congress.pdf
[10]
J. M. Johnson and T. M. Khoshgoftaar, “Deep Learning and Data Sampling with Imbalanced Big Data,” pp. 175–183, Jul. 2019, doi: 10.1109/iri.2019.00038.
[11]
M. Herland, T. M. Khoshgoftaar, and R. A. Bauder, “Big Data fraud detection using multiple medicare data sources,” Journal of Big Data, vol. 5, no. 1, pp. 1–21, Dec. 2018, doi: 10.1186/s40537-018-0138-3.
[12]
H. Joudaki et al., “Using Data Mining to Detect Health Care Fraud and Abuse: A Review of Literature,” Global Journal of Health Science, vol. 7, no. 1, pp. 194–202, Jan. 2015, doi: 10.5539/gjhs.v7n1p194.
[13]
C. Zhang, X. Xiao, and C. Wu, “Medical Fraud and Abuse Detection System Based on Machine Learning,” International Journal of Environmental Research and Public Health, vol. 17, no. 19, p. 7265, Oct. 2020, doi: 10.3390/ijerph17197265.
[14]
Department of Health and Human Services Office of Inspector General, “Special Advisory Bulletin on the Effect of Exclusion from Participation in Federal Health Care Programs.” Jun. 2020. Accessed: Sep. 02, 2023. [Online]. Available: https://www.hhs.gov/guidance/document/updated-special-advisory-bulletin-effect-exclusion-participation-federal-health-care
[15]
L. K. Branting, F. Reeder, J. Gold, and T. Champney, “Graph analytics for healthcare fraud risk estimation,” in 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Aug. 2016, pp. 845–851. doi: 10.1109/ASONAM.2016.7752336.
[16]
“The NPDB - Home Page.” Accessed: Sep. 03, 2023. [Online]. Available: https://www.npdb.hrsa.gov/index.jsp
[17]
Centers for Medicare and Medicaid Services, “Medicare Physician & Other Practitioners - Centers for Medicare & Medicaid Services Data.” 2023. Accessed: Aug. 29, 2023. [Online]. Available: https://data.cms.gov/provider-summary-by-type-of-service/medicare-physician-other-practitioners
[18]
R. A. Bauder and T. M. Khoshgoftaar, “A Survey of Medicare Data Processing and Integration for Fraud Detection,” pp. 9–14, Jul. 2018, doi: 10.1109/iri.2018.00010.
[19]
“Explainable Machine Learning Models for Medicare Fraud Detection.” Jun. 2023. doi: 10.21203/rs.3.rs-3076353/v1.
[20]
“Fraud Costs Medicare Billions of Dollars Every Year,” AARP. May 2023. Accessed: Aug. 17, 2023. [Online]. Available: https://www.aarp.org/money/scams-fraud/info-2019/medicare.html
[21]
R. A. Bauder and T. M. Khoshgoftaar, “Medicare Fraud Detection Using Machine Learning Methods,” pp. 858–865, Dec. 2017, doi: 10.1109/icmla.2017.00-48.
[22]
R. A. Bauder and T. M. Khoshgoftaar, “Medicare Fraud Detection Using Random Forest with Class Imbalanced Big Data,” pp. 80–87, Jul. 2018, doi: 10.1109/iri.2018.00019.
[23]
R. A. Bauder and T. M. Khoshgoftaar, “The detection of medicare fraud using machine learning methods with excluded provider labels,” in The Thirty-First International Flairs Conference, 2018.
[24]
R. Bauder, R. da Rosa, and T. Khoshgoftaar, “Identifying Medicare Provider Fraud with Unsupervised Machine Learning,” in 2018 IEEE International Conference on Information Reuse and Integration (IRI), Jul. 2018, pp. 285–292. doi: 10.1109/IRI.2018.00051.
[25]
R. A. Bauder and T. M. Khoshgoftaar, “The effects of varying class distribution on learner behavior for medicare fraud detection with imbalanced big data,” vol. 6, no. 1, pp. 9–9, Sep. 2018, doi: 10.1007/s13755-018-0051-3.
[26]
D. M. Berwick and A. D. Hackbarth, “Eliminating Waste in US Health Care,” JAMA, vol. 307, no. 14, pp. 1513–1516, Apr. 2012, doi: 10.1001/jama.2012.362.
[27]
V. Chandola, S. R. Sukumar, and J. C. Schryver, “Knowledge discovery from massive healthcare claims data,” pp. 1312–1320, Aug. 2013, doi: 10.1145/2487575.2488205.
[28]
M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim, “Do we need hundreds of classifiers to solve real world classification problems?” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 3133–3181, Jan. 2014.
[29]
P. J. García, “Corruption in global health: The open secret,” The Lancet, vol. 394, no. 10214, pp. 2119–2124, Dec. 2019, doi: 10.1016/S0140-6736(19)32527-9.
[30]
M. Herland, R. A. Bauder, and T. M. Khoshgoftaar, “The effects of class rarity on the evaluation of supervised healthcare fraud detection models,” Journal of Big Data, vol. 6, no. 1, pp. 1–33, Feb. 2019, doi: 10.1186/s40537-019-0181-8.
[31]
J. M. Johnson and T. M. Khoshgoftaar, “Medicare fraud detection using neural networks,” Journal of Big Data, vol. 6, no. 1, pp. 1–35, Jul. 2019, doi: 10.1186/s40537-019-0225-0.
[32]
OIG Texas HHS, “Exclusions | Office of Inspector General.” Accessed: Sep. 01, 2023. [Online]. Available: https://oig.hhs.texas.gov/exclusions
[33]
B. Peluso et al., “Health care fraud,” vol. 60, p. 937, 2023.
[34]
S. Shekhar, J. Leder-Luis, and L. Akoglu, “Unsupervised Machine Learning for Explainable Health Care Fraud Detection.” Rochester, NY, Feb. 2023. Accessed: Aug. 09, 2023. [Online]. Available: https://papers.ssrn.com/abstract=4356230
[35]
S. S. Waghade, “A Comprehensive Study of Healthcare Fraud Detection based on Machine Learning,” 2018. Accessed: Mar. 25, 2023. [Online]. Available: https://www.semanticscholar.org/paper/A-Comprehensive-Study-of-Healthcare-Fraud-Detection-Waghade/ca2337f7216a879b8595613adee14246dd4fead0
[36]
“The $272 billion swindle: Why thieves love America’s health-care system,” The Economist. Accessed: Aug. 22, 2023. [Online]. Available: https://www.economist.com/united-states/2014/05/31/the-272-billion-swindle
[37]
“Exclusions | Office of Inspector General | U.S. Department of Health and Human Services.” Accessed: Aug. 30, 2023. [Online]. Available: https://www.oig.hhs.gov/exclusions/index.asp
[38]
“Medicaid Exclusions | Office of the Medicaid Inspector General.” Accessed: Sep. 01, 2023. [Online]. Available: https://omig.ny.gov/medicaid-fraud/medicaid-exclusions
[39]
“Provider Suspended and Ineligible List (S&I List) - California Health and Human Services Open Data Portal.” Accessed: Sep. 01, 2023. [Online]. Available: https://data.chhs.ca.gov/dataset/provider-suspended-and-ineligible-list-s-i-list
[40]
R Core Team, R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2023. Available: https://www.R-project.org/
[41]
A. Liaw and M. Wiener, “Classification and regression by randomForest,” R News, vol. 2, no. 3, pp. 18–22, 2002, Available: https://CRAN.R-project.org/doc/Rnews/

Tables

Table 1: LEIE Variable Summary
skim_type  skim_variable  n_missing  complete_rate  numeric.mean  Date.n_unique  character.n_unique
Date       dob            4003       0.9475924      NA            21092          NA
Date       excldate       0          1.0000000      NA            2258           NA
character  lastname       3153       0.9587206      NA            NA             29340
character  firstname      3153       0.9587206      NA            NA             11900
character  midname        22445      0.7061480      NA            NA             8664
character  busname        73229      0.0412794      NA            NA             3100
character  general        0          1.0000000      NA            NA             87
character  specialty      4088       0.9464795      NA            NA             199
character  upin           70248      0.0803069      NA            NA             5957
character  npi            69963      0.0840381      NA            NA             6293
character  address        9          0.9998822      NA            NA             72452
character  city           1          0.9999869      NA            NA             9919
character  state          5          0.9999345      NA            NA             60
character  zip            0          1.0000000      NA            NA             17408
character  excltype       0          1.0000000      NA            NA             18
character  reindate       76382      0.0000000      NA            NA             0
character  waiverdate     76382      0.0000000      NA            NA             0
character  wvrstate       76381      0.0000131      NA            NA             1
character  exclusion      0          1.0000000      NA            NA             3
factor     type           0          1.0000000      NA            NA             NA
numeric    age_at_excl    4003       0.9475924      45.64651      NA             NA

Figures

LEIE

Figure 1: Elephant

Part B