The LEIE Database
Abstract
- U.S. Healthcare costs exceed 20% of GDP.
- Healthcare fraud is estimated at 10% of total healthcare spending.
- Dollars lost to fraud reduce resources available for treatment.
- To date, supervised machine learning models frequently use the LEIE database for the outcome variable.
- Only a small percentage of LEIE records have NPIs, resulting in highly unbalanced classes.
- Few, if any, machine learning studies have taken into account that the exclusions may not be the result of fraudulent activity.
- Some exclusions are not indicative of fraud. Omitting them would create even more highly unbalanced classes, jeopardizing the LEIE as a viable outcome variable.
1 Healthcare Fraud
Estimates of healthcare fraud vary dramatically. A 2015 article cited the statistic that, “[i]n the United States, roughly one-third of all healthcare expenses are caused by fraud, waste, and abuse.” (Thornton et al., 2015, p. 713) [1] Other articles also find fraud prevalent but estimate it at about 10% of U.S. medical claims. [2]–[4] Because much healthcare fraud goes undiscovered, there is “no record of these activities. The exact size of annual theft is unknown and is the subject of debate, for which healthcare fraud likely costs tens of billions of dollars a year.” (Herland et al., 2020, p. 6) [2]
Fraud is prevalent in healthcare due to “the difficulty of measuring performance output, the variable and complex nature of the work, little tolerance for ambiguity or error, the highly specialized nature of health services, management’s lack of control over the individuals doing the work, third party reimbursement” and the American “win-at-all-costs mentality.” [5]
Multiple organizations are involved in preventing healthcare fraud. Prominent among them are the Office of Inspector General (OIG), the Department of Justice (DOJ), and the Center for Program Integrity (CPI). The OIG, which maintains the LEIE database, supports Medicare Strike Force Teams that, along with federal, state, and local law enforcement, seek out those defrauding Medicare. As of August 2022, the website lists 2,688 criminal actions, 3,483 indictments, and $4.7B in receivables. [6]
In July of 2022, the DOJ announced “criminal charges against 36 defendants in 13 federal districts across the United States for more than $1.2 billion in alleged fraudulent telemedicine, cardiovascular and cancer genetic testing, and durable medical equipment (DME) schemes.”[7]
Additionally, the Centers for Medicare & Medicaid Services (CMS) houses the Center for Program Integrity (CPI), whose mission “is to detect and combat fraud, waste and abuse of the Medicare and Medicaid programs.” [8] CPI reframes that mission as “making sure CMS is paying the right provider the right amount for services covered under our programs.” [8] CPI reports annually to Congress and reported an improper payment rate of 6.26% in 2021. [9]
2 Machine Learning
“[I]n the U.S. alone, the application of machine learning and data mining approaches has the potential to save the healthcare industry up to $450 billion each year.” [2] Given the volume of healthcare claims and the expense associated with fraud, machine learning is an economical way to detect and identify bad actors. Healthcare fraud is described as a classic “big data” problem in that it meets the 5 V’s: volume, variety, velocity, veracity, and value. [10], [11]
“The most common and well-accepted categorization that is used by machine learning experts divides data mining methods into ‘supervised’ and ‘unsupervised’ methods.” [12] “Supervised methods attempt to discover the relationship between input variables (attributes or features) and an output (dependent) variable (or target attribute). Unsupervised learning methods are applied when no prior information of the dependent variable is available for use.” [12] “Supervised methods are usually used for classification and prediction objectives including traditional statistical methods such as regression analysis.” [12] “Unsupervised methods are usually used for description including association rules extraction such as Apriori algorithm and segmentation methods such as clustering and anomaly detection.” [12] “[T]he studies demonstrate that both supervised and unsupervised techniques have important merits in discovering different fraud strategies and schemes.” [12]
In 2015, Thornton et al. noted that part of the challenge was the lack of publicly available and accessible datasets. “For the health insurance industry to succeed in combatting fraudsters, it must also know itself – its systems and how data mining and analytic techniques can be applied within them to detect fraudulent activity.” [1] “Based on practical experience, we expect the lack of training data (structured datasets containing health care fraud cases) and a lack of useful open data available as the main causes for the relative small amount of research into the technological aspect of health insurance fraud.” [1]
Machine learning can train a model to identify fraudulent claim patterns by learning the training set’s characteristics. However, machine learning encounters a host of difficulties, including:
- the lack of an explicit rule to distinguish fraudulent claims from non-fraudulent claims;
- the small ratio of fraudulent claims to non-fraudulent claims, a problem referred to as “class imbalance”;
- extreme claim variability, because many variables are involved, such as disease, patient characteristics, and doctor preferences;
- changes in fraudulent actors’ behaviors and methods over time in response to the compliance environment; and
- frequent regulatory changes, such as changing drug lists, and slow responses in fraud detection algorithms. [13]
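The class imbalance noted in the list above can be quantified before any model is trained. The following is a minimal sketch, assuming made-up label counts that are not taken from any study cited here. It computes the minority share and inverse-frequency class weights, which is one common way (among several) to reweight a rare class. The function name `class_weights` and the label strings are hypothetical.

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical labels: 5 fraud cases among 1,000 providers (illustrative only).
labels = ["fraud"] * 5 + ["non_fraud"] * 995

minority_share = labels.count("fraud") / len(labels)
weights = class_weights(labels)

print(f"minority share: {minority_share:.3%}")
print(weights)
```

With weights like these, each rare fraud example counts for far more than each non-fraud example during training, which partly offsets the imbalance without discarding data.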
Machine learning holds promise for the efficient detection of healthcare fraud, but one study noted its preoccupation with technical methods as opposed to practical advice for managers and policy makers. [12]
Bring current to 2023 for close
4 Methodology
4.1 The LEIE database
“LEIE” is an acronym for the List of Excluded Individuals/Entities. The U.S. Department of Health and Human Services Office of Inspector General (OIG) maintains the list. The list includes individuals and entities who cannot be reimbursed by Medicare because of previous misconduct. “Medicare is a U.S. government program that provides healthcare insurance and financial support for the elderly population, ages 65 and older, and other select groups of beneficiaries.” [2] Examples of misconduct include a felony drug conviction or the fraudulent submission of a Medicare claim. Additionally, employing an individual or entity on the LEIE may expose the employer or contractor to civil monetary penalties (CMPs).
The exclusion list is the result of a series of Congressional initiatives to reduce healthcare fraud. “In 1977, in the Medicare-Medicaid Anti-Fraud and Abuse Amendments . . . Congress first mandated the exclusion of physicians and other practitioners convicted of program-related crimes from participation in Medicare and Medicaid.” [14]
Then, in 1981, Congress followed “with enactment of the Civil Monetary Penalties Law (CMPL) to further address health care fraud and abuse. The CMPL authorizes the Department and OIG to impose CMPs, assessments, and program exclusions against any person that submits false or fraudulent or certain other types of improper claims for Medicare or Medicaid payment.” [14] The enactment of the “Health Insurance Portability and Accountability Act (HIPAA) . . . and the Balanced Budget Act (BBA)” in 1996 and 1997, respectively, expanded OIG’s sanctioning authority, extending its exclusion authority to all “Federal health care programs.” [14]
“The effect of an OIG exclusion is that no Federal health care program payment may be made for any items or services furnished (1) by an excluded person or (2) at the medical direction or on the prescription of an excluded person. The exclusion and the payment prohibition continue to apply to an individual even if he or she changes from one health care profession to another while excluded.” [14] Employers have an ongoing obligation to know whether their employees or contractors are on the LEIE. Because the OIG updates the LEIE monthly, 2013 guidance suggested that a monthly check would minimize the probability of a violation. In 2011, CMS issued final regulations requiring states to screen all enrolled providers monthly. [14]
The list is not comprehensive because many who commit fraud are not included. “For example, providers who are accused of overcharging insurers or Medicare often relinquish the overpayments without any public acknowledgement or notice.” [15] Many machine learning efforts use the LEIE to label records as fraud or not fraud. [2], [15] As of the date of download, August 11, 2023, the LEIE dataset included 77,942 individuals who have been excluded from participating in Medicare. Two other sanctions databases are sometimes mentioned in connection with the LEIE: the National Practitioner Data Bank (NPDB) and the Healthcare Integrity and Protection Data Bank (HIPDB). The HIPDB data was subsumed by the NPDB in 2013. [16]
4.1.1 Unique Identifiers for the Excluded
The list provides basic identifiers for those who are excluded from participating in Medicare. The information includes name, last location, and date of birth. Social security numbers are omitted in deference to prevailing federal law. Providers may be identified by either the National Provider Identifier (NPI) or the Unique Physician Identification Number (UPIN).
According to the LEIE website, “the NPI (National Provider Identifier) has replaced the UPIN as the unique number used to identify health care providers. The Centers for Medicaid & Medicare Services first began assigning NPIs in 2006, and providers were required to use NPIs as of mid-2008.”
According to the LEIE website, “the UPIN (Unique Physician Identification Number) was established by the Centers for Medicare & Medicaid Services as a unique provider identifier in lieu of the SSN. UPINs were assigned to physicians as well as certain non-physician practitioners and medical group practices. CMS no longer maintains the UPIN registry.”
“Many individuals and entities that are excluded by OIG do not have NPIs to include in the LEIE. For those individuals and entities that have NPIs, OIG has added that information to records starting in 2008 and has included NPIs in the LEIE since that time.”
4.1.2 Missing NPIs
The share of LEIE records that can be identified by an NPI has been increasing since NPIs were adopted in 2009. While the overall rate is 8.7%, 25% of newly added individuals and entities have an NPI included. This trend significantly improves the quality and usability of the database.
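The NPI coverage rates described above could be computed along the following lines. This is a minimal sketch under stated assumptions, not the code used for this analysis: records are assumed to be dictionaries with the fields `npi` and `excldate` (names borrowed from the summary table in Tables), and a missing NPI is assumed to appear as `None`, a blank, or an all-zero placeholder. The rows are made up for illustration.

```python
from collections import defaultdict
from datetime import date

def has_npi(value):
    """Treat None, blanks, and an all-zero placeholder as missing (assumption)."""
    if value is None:
        return False
    text = str(value).strip()
    return bool(text) and set(text) != {"0"}

def npi_rate_by_year(records):
    """Return {exclusion year: share of records that carry an NPI}."""
    totals = defaultdict(int)
    with_npi = defaultdict(int)
    for rec in records:
        year = rec["excldate"].year
        totals[year] += 1
        if has_npi(rec["npi"]):
            with_npi[year] += 1
    return {year: with_npi[year] / totals[year] for year in sorted(totals)}

# Made-up rows for illustration only.
records = [
    {"npi": None, "excldate": date(2005, 3, 20)},
    {"npi": "0000000000", "excldate": date(2005, 7, 20)},
    {"npi": "1234567890", "excldate": date(2022, 1, 20)},
    {"npi": "", "excldate": date(2022, 2, 20)},
]

overall_rate = sum(has_npi(r["npi"]) for r in records) / len(records)
rates = npi_rate_by_year(records)
print(overall_rate, rates)
```

Grouping by exclusion year separates the low overall rate from the higher rate among recent additions.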
4.1.3 Statutory Exclusions
statute | agency | exclusion | description | n_count |
---|---|---|---|---|
1128b4 | Social Security | Permissive | License revocation or suspension | 31084 |
1128a1 | Social Security | Mandatory | Conviction of program related crimes | 23109 |
1128a2 | Social Security | Mandatory | Conviction for patient abuse | 7449 |
1128a3 | Social Security | Mandatory | Felony conviction for healthcare fraud | 4987 |
1128a4 | Social Security | Mandatory | Felony conviction for controlled substance | 3140 |
1128b14 | Social Security | Permissive | Default on student loans or scholarships | 2227 |
1128b8 | Social Security | Permissive | Entities controlled by sanctioned individual | 1494 |
1128b1 | Social Security | Permissive | Conviction for misdemeanor fraud | 861 |
1128b5 | Social Security | Permissive | Suspension under federal or state health care program | 802 |
1128b7 | Social Security | Permissive | Fraud or kickbacks or other prohibited | 683 |
1128b3 | Social Security | Permissive | Conviction for misdemeanor controlled substance | 316 |
1128b6 | Social Security | Permissive | Claim for excessive charges | 67 |
1128b2 | Social Security | Permissive | Conviction for obstruction of an investigation | 59 |
1156 | Social Security | unknown | Services are not economical & medically necessary | 56 |
1128b15 | Social Security | Permissive | Individuals controlled by a sanctioned entity | 34 |
1128b11 | Social Security | Permissive | Failure to supply payment information | 11 |
1128b16 | Social Security | Permissive | Making a false statement of material fact | 2 |
1128b12 | Social Security | Permissive | Failure to grant immediate access | 1 |
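
The counts in the table above can be rolled up by exclusion category to see how the list divides between mandatory and permissive exclusions. The sketch below copies the `statute`, `exclusion`, and `n_count` columns from the table. The helper name `totals_by_category` is hypothetical.

```python
# (statute, exclusion category, n_count), copied from the table above.
EXCLUSIONS = [
    ("1128b4", "Permissive", 31084),
    ("1128a1", "Mandatory", 23109),
    ("1128a2", "Mandatory", 7449),
    ("1128a3", "Mandatory", 4987),
    ("1128a4", "Mandatory", 3140),
    ("1128b14", "Permissive", 2227),
    ("1128b8", "Permissive", 1494),
    ("1128b1", "Permissive", 861),
    ("1128b5", "Permissive", 802),
    ("1128b7", "Permissive", 683),
    ("1128b3", "Permissive", 316),
    ("1128b6", "Permissive", 67),
    ("1128b2", "Permissive", 59),
    ("1156", "unknown", 56),
    ("1128b15", "Permissive", 34),
    ("1128b11", "Permissive", 11),
    ("1128b16", "Permissive", 2),
    ("1128b12", "Permissive", 1),
]

def totals_by_category(rows):
    """Sum n_count within each exclusion category."""
    totals = {}
    for _statute, category, n_count in rows:
        totals[category] = totals.get(category, 0) + n_count
    return totals

totals = totals_by_category(EXCLUSIONS)
grand_total = sum(totals.values())

# Print each category with its count and share of all exclusions.
for category, n_count in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{category:<10} {n_count:>6} {n_count / grand_total:6.1%}")
```

On these counts, mandatory exclusions make up roughly half of the list. Restricting the outcome variable to mandatory exclusions would therefore discard about half of the already scarce positive labels.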
4.1.4 Previous Uses of the LEIE dataset
- Herland 2020
Herland retained only physicians with mandatory exclusions under section 1128. “The LEIE database does not include NPI numbers for all physicians and after preliminary analysis, we found that combining first name, last name, and address is not 100% reliable in determining identity.” Only physicians with NPI numbers that matched in the Part B data to the LEIE database were used. 1,310 physicians were deemed fraudulent.
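The matching rule described above (keep only mandatory exclusions, and match on NPI alone) can be sketched as follows. This is not Herland et al.'s code. It is a minimal illustration that assumes LEIE records are dictionaries with hypothetical fields `npi` and `excltype`, and that the mandatory provisions are the four marked "Mandatory" in the statutory exclusions table. All NPIs shown are made up.

```python
# Statute codes marked "Mandatory" in the statutory exclusions table.
MANDATORY_CODES = {"1128a1", "1128a2", "1128a3", "1128a4"}

def label_by_npi(part_b_npis, leie_records, codes=MANDATORY_CODES):
    """Label each Part B NPI 1 if it matches a qualifying LEIE record, else 0."""
    excluded_npis = {
        rec["npi"]
        for rec in leie_records
        if rec.get("npi") and rec.get("excltype") in codes
    }
    return {npi: int(npi in excluded_npis) for npi in part_b_npis}

# Made-up records for illustration only.
leie_records = [
    {"npi": "1111111111", "excltype": "1128a1"},  # mandatory, has an NPI
    {"npi": "2222222222", "excltype": "1128b4"},  # permissive, so not labeled
    {"npi": None, "excltype": "1128a3"},          # mandatory, but no NPI to match
]
part_b_npis = ["1111111111", "2222222222", "3333333333"]

labels = label_by_npi(part_b_npis, leie_records)
print(labels)
```

Matching on NPI alone avoids the unreliable name-and-address match, but every excluded provider without an NPI silently drops out of the positive class.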
4.1.5 Mandatory vs. Permissive Exclusions
4.1.6 Average Monthly Exclusions
4.1.7 Exclusions by Statutory Provision
4.1.8 Variable Importance
A random forest model was applied to the LEIE data to check for variable importance. An outcome variable was constructed that was “present” for an observation that contained either an NPI or a UPIN and “absent” for an observation that contained neither. Several variables were dropped, including first name, middle name, last name, street address, and the NPI and UPIN columns. The most important variable was the date of the exclusion. Because the presence of the NPI has been increasing over time, the results matched expectations. See Table 1.
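The outcome construction described above can be sketched as follows. The example covers only the data preparation step, not the random forest itself. It assumes each LEIE record is a dictionary keyed by the column names in the summary table, and that a missing identifier appears as `None` or a blank. The function names and the sample record are hypothetical.

```python
from datetime import date

# Columns dropped before modeling: names, street address, and both identifiers.
ID_COLUMNS = {"firstname", "midname", "lastname", "address", "npi", "upin"}

def is_present(value):
    """True when the identifier is neither None nor blank (assumption)."""
    return value is not None and str(value).strip() != ""

def to_features_and_outcome(record):
    """Derive the outcome from NPI/UPIN, then drop the identifying columns."""
    has_id = is_present(record.get("npi")) or is_present(record.get("upin"))
    outcome = "present" if has_id else "absent"
    features = {k: v for k, v in record.items() if k not in ID_COLUMNS}
    return features, outcome

# Made-up record for illustration only.
record = {
    "firstname": "JANE", "midname": None, "lastname": "DOE",
    "address": "1 MAIN ST", "state": "TX", "excltype": "1128a1",
    "excldate": date(2021, 5, 20), "npi": None, "upin": "A12345",
}

features, outcome = to_features_and_outcome(record)
print(outcome, sorted(features))
```

The outcome is derived before the NPI and UPIN columns are dropped. Leaving either column among the features would leak the outcome into the model.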
4.2 CMS Part B Claims Data
CMS furnishes “Part B” data. The data is a series of datasets that “provide information on services and procedures provided to Original Medicare (fee-for-service) Part B (Medical Insurance) beneficiaries by physicians and other healthcare professionals. These datasets contain information on use, payments, and submitted charges organized by National Provider Identifier (NPI), Healthcare Common Procedure Coding System (HCPCS) code, and geography.”[17] CMS groups the “unique National Provider Identification (NPI) numbers, Healthcare Common Procedure Coding System (HCPCS) code, and place of service (e.g. office or hospital).” [18]
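Because the Part B data is organized by NPI, HCPCS code, and place of service, a provider-level view requires aggregating across a provider's rows. The following is a minimal sketch under stated assumptions. The field names `npi`, `hcpcs`, `place_of_service`, `services`, and `avg_submitted_charge` are hypothetical stand-ins and not the column names of the CMS files, and the rows are made up.

```python
from collections import defaultdict

def aggregate_by_npi(rows):
    """Collapse NPI/HCPCS/place-of-service rows into one summary per NPI."""
    summary = defaultdict(
        lambda: {"n_rows": 0, "total_services": 0, "total_submitted": 0.0}
    )
    for row in rows:
        agg = summary[row["npi"]]
        agg["n_rows"] += 1
        agg["total_services"] += row["services"]
        # Total submitted charges = number of services * average charge per service.
        agg["total_submitted"] += row["services"] * row["avg_submitted_charge"]
    return dict(summary)

# Made-up rows for illustration only.
rows = [
    {"npi": "1111111111", "hcpcs": "99213", "place_of_service": "O",
     "services": 10, "avg_submitted_charge": 100.0},
    {"npi": "1111111111", "hcpcs": "99214", "place_of_service": "F",
     "services": 5, "avg_submitted_charge": 50.0},
    {"npi": "2222222222", "hcpcs": "99213", "place_of_service": "O",
     "services": 2, "avg_submitted_charge": 80.0},
]

by_npi = aggregate_by_npi(rows)
print(by_npi["1111111111"])
```

A provider-level summary of this kind is the natural unit to join against the LEIE, since exclusions attach to providers and not to individual procedure codes.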
5 Results
6 Conclusions
References
Tables
skim_type | skim_variable | n_missing | complete_rate | numeric.mean | Date.n_unique | character.n_unique |
---|---|---|---|---|---|---|
Date | dob | 4003 | 0.9475924 | NA | 21092 | NA |
Date | excldate | 0 | 1.0000000 | NA | 2258 | NA |
character | lastname | 3153 | 0.9587206 | NA | NA | 29340 |
character | firstname | 3153 | 0.9587206 | NA | NA | 11900 |
character | midname | 22445 | 0.7061480 | NA | NA | 8664 |
character | busname | 73229 | 0.0412794 | NA | NA | 3100 |
character | general | 0 | 1.0000000 | NA | NA | 87 |
character | specialty | 4088 | 0.9464795 | NA | NA | 199 |
character | upin | 70248 | 0.0803069 | NA | NA | 5957 |
character | npi | 69963 | 0.0840381 | NA | NA | 6293 |
character | address | 9 | 0.9998822 | NA | NA | 72452 |
character | city | 1 | 0.9999869 | NA | NA | 9919 |
character | state | 5 | 0.9999345 | NA | NA | 60 |
character | zip | 0 | 1.0000000 | NA | NA | 17408 |
character | excltype | 0 | 1.0000000 | NA | NA | 18 |
character | reindate | 76382 | 0.0000000 | NA | NA | 0 |
character | waiverdate | 76382 | 0.0000000 | NA | NA | 0 |
character | wvrstate | 76381 | 0.0000131 | NA | NA | 1 |
character | exclusion | 0 | 1.0000000 | NA | NA | 3 |
factor | type | 0 | 1.0000000 | NA | NA | NA |
numeric | age_at_excl | 4003 | 0.9475924 | 45.64651 | NA | NA |