Enhanced Data - Mortality Status, Tumor Table and Natural Language Processing

Mortality data is a key outcome for clinical research, yet most healthcare data sets include it only when the patient died in a hospital setting. Mortality data can inform research and outcomes analysis, and it helps clinical trial teams avoid contacting the families of deceased individuals about clinical trials and studies.

GPC recognizes the value of mortality data and integrates Social Security Death Master File data with the GPC site CDMs. An additional data source is available for the University of Missouri data set.

Death Data from the EHR (death as disposition)

Death as a patient's disposition is recorded in the patient's EHR.


The Death Master File (DMF) from the Social Security Administration (SSA) is a data source created from internal SSA records of deceased persons who possessed Social Security numbers and whose deaths were reported to the SSA.

Obituary data is sourced from funeral homes, newspapers, and other online obituary sources.
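
As an illustration only, the sketch below shows one way death records from several sources might be consolidated into a single record per patient. The source labels, the precedence order, and the `DeathRecord` and `consolidate_deaths` names are all hypothetical; the text above does not describe how GPC actually reconciles conflicting sources.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical source labels and precedence (earlier = preferred).
# The real ordering used by GPC sites is not stated in this document.
SOURCE_PRIORITY = ["ehr", "ssa_dmf", "obituary"]


@dataclass(frozen=True)
class DeathRecord:
    patient_id: str
    death_date: date
    source: str  # expected to be one of SOURCE_PRIORITY


def consolidate_deaths(records: list[DeathRecord]) -> dict[str, DeathRecord]:
    """Keep one record per patient, preferring sources listed earlier in SOURCE_PRIORITY."""
    rank = {name: i for i, name in enumerate(SOURCE_PRIORITY)}
    best: dict[str, DeathRecord] = {}
    for rec in records:
        current = best.get(rec.patient_id)
        # Replace the stored record only when the new source outranks it.
        if current is None or rank[rec.source] < rank[current.source]:
            best[rec.patient_id] = rec
    return best


# Made-up example rows, not real patient data.
records = [
    DeathRecord("p1", date(2020, 3, 1), "obituary"),
    DeathRecord("p1", date(2020, 2, 28), "ssa_dmf"),
    DeathRecord("p2", date(2019, 7, 4), "ehr"),
]
consolidated = consolidate_deaths(records)
```

Keeping the `source` field on the consolidated record lets an analyst see where each death date came from and apply a different precedence if a study requires it.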

Enriched site-level data with Tumor Table Linkage

In addition to the standard CDM tables, GPC is spearheading the effort to integrate specialty cancer data. We have linked cancer-specific data from the North American Association of Central Cancer Registries (NAACCR) to populate a tumor table in the site CDM, which has been used to support proposals. The table includes high-quality data in structured fields for demographic, clinical, and treatment observations.

Natural Language Processing (NLP)

GPC is committed to standardizing the extraction and population of textual data. Our top-ranked NLP development team specializes in clinical text extraction and tailors, tests, validates, and refines the pipelines to support NLP deployment.


All GPC sites have geocoded patients’ addresses, which can be used to link to multiple publicly available community-level social determinants of health data sets. Based on ZIP+4 information, we have also geocoded all Medicare and Medicaid beneficiaries obtained for the GROUSE project and linked them to a curated set of American Community Survey variables, Rural-Urban Commuting Area (RUCA) codes, the Area Deprivation Index, the Bird Index, and others.
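
The linkage described above can be pictured as a left join from geocoded patients to area-level variables. The following is a minimal sketch under stated assumptions: the `geo_id` key, the variable names, and all values are hypothetical, and the geographic unit actually used for each community-level measure is not specified here.

```python
# Hypothetical area-level lookup keyed by a generic geographic identifier.
# Identifiers and values are invented for illustration.
AREA_VARIABLES = {
    "area-001": {"ruca_code": 1, "deprivation_rank": 35},
    "area-002": {"ruca_code": 4, "deprivation_rank": 62},
}


def link_area_variables(patients, area_variables):
    """Left-join each patient row to area-level variables by geo_id.

    Patients whose geo_id has no match are kept, with None for each variable.
    """
    linked = []
    for patient in patients:
        area = area_variables.get(patient["geo_id"], {})
        row = dict(patient)  # copy so the input rows are not modified
        row["ruca_code"] = area.get("ruca_code")
        row["deprivation_rank"] = area.get("deprivation_rank")
        linked.append(row)
    return linked


# Made-up example rows, not real patient data.
patients = [
    {"patient_id": "p1", "geo_id": "area-001"},
    {"patient_id": "p2", "geo_id": "area-999"},  # no matching area
]
linked = link_area_variables(patients, AREA_VARIABLES)
```

A left join is used so that patients without a matching area remain in the result with missing values rather than being dropped silently.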

Clinical Observable Data

Multiple GPC partners have extracted an extensive list of structured clinical observable data from source EHR systems, including but not limited to flowsheet data and patient-reported outcomes.