December 8, 2023
This document describes the Complete PDSR Curated Data Set structure and provides example SQL statements you can use to query the data.
The Complete Patient Data Science Repository Curated Data Set is a repository of curated clinical data at Mass General Brigham (MGB). You can use the data for exploratory research or hypothesis-generating studies, cohort studies, pattern analysis, and more.
- You have reviewed the Complete PDSR Curated Data Set Dashboard in Collibra and the Complete PDSR Curated Data Set Data Dictionary.
- Access to the data repository has been approved and provisioned.
- Access to the Enclave Platform has been provisioned and you are logged into the virtual desktop environment.
Please allow 2-3 hours after provisioning before accessing the Complete PDSR Curated Data Set.
Complete PDSR Curated Data Set Structure
The data is available for direct access in two SQL server databases:
- The PDSR_Monthly is a monthly copy of PDSR.
- The PDSR_Snapshot [yyyymm] is a static quarterly snapshot.
The data within both databases is organized into three schemas - Analytics, Dimension, and i2b2.
Please refer to the Complete PDSR Curated Data Set Data Dictionary in Collibra for the complete data dictionary.
There are twenty (20) ontologies that organize concepts. These hierarchies correspond to the folder structure in the PDSR i2b2 Query Tool (and the RPDR Query Tool). Concept types are found in the Dimension.Concepts view.
List of concept types:
- Molecular Medicine
The Dimension and Analytics data are curated in a non-i2b2 format to make the data more easily understood and querying faster.
Examples of SQL Statements
To execute the example queries in SQL Server Management Studio (SSMS), either:
- Include the command ‘use PDSR_MONTHLY’ or ‘use PDSR_Snapshot [yyyymm]’ before the queries OR
- Append the actual database PDSR_MONTHLY or PDSR_Snapshot[yyyymm] within each query.
- These SQL queries are provided as examples only. You will need to modify components of the statements or queries below to better align with the objectives of your specific research and sharpen the query results.
- For long running queries, please submit a ServiceNow Issue/Inquiry Request to kill the query job.
- Avoid running select top 1 * statements when querying the data, as it may harm query performance. Use select top 10 * or greater statements instead.
- Please avoid using tempdb in your query. We can provision your project team an empty User Work Database for you to save and analyze your query results. Please refer to the following KB article on how to request a User Work Datamart.
List available Providers (Note: Some providers have multiple specialties)
select TOP 1000 * from i2b2.provider_dimension
where PROVIDER_ID in (select PROVIDER_ID from i2b2.provider_dimension group by PROVIDER_ID
order by PROVIDER_ID
List non-demographic concepts that appear multiple times in the i2b2.concept_dimension table (they belong to different folders and/or have symptoms)
select TOP 1000 * from i2b2.concept_dimension
where CONCEPT_CD in
(select CONCEPT_CD from i2b2.concept_dimension
where CONCEPT_PATH not like '%I2B2MetaData\Demographics%'
group by CONCEPT_CD having
order by CONCEPT_CD
Sample SQL for one patient
- The observation_fact view contains 5 billion facts for the patients, ignore rows where CONCEPT_CD=@ or empty string
- Other i2b2 views are joined to i2b2.observation_fact
- Join CONCEPT_CD to concept_dimension to find medical information about the patient
- Join PROVIDER_ID to provider_dimension to find information on the provider
- Join PATIENT_NUM to patient_dimension to find information on the patient
- Join ENCOUNTER_NUM to visit_dimension to find information on the encounter
Execute/run the below SQL to list all information for a given PATIENT_NUM, including encounter, provider, concept, and visit information.
Specifying certain columns will help to filter output information leaving just the columns that are useful for the task.
A patient may have multiple encounters with different providers, each of those with multiple specialties, output includes multiple concepts, depending on the nature of the visit
select TOP 1000 * from i2b2.observation_fact a
left join i2b2.provider_dimension b on a.PROVIDER_ID=b.PROVIDER_ID -- one provider can have multiple specialties
left join i2b2.patient_dimension c on a.PATIENT_NUM=c.PATIENT_NUM -- one row per patient
left join i2b2.visit_dimension d on a.ENCOUNTER_NUM=d.ENCOUNTER_NUM -- one row per encounter
left join i2b2.concept_dimension e on a.CONCEPT_CD=e.CONCEPT_CD -- one concept can have multiple hierarchies
where a.PATIENT_NUM=9869928 and
a.CONCEPT_CD not in ('','@') and
e.CONCEPT_PATH not like '%I2B2MetaData\Demographics%'
order by a.START_DATE desc
Sample SQL for patient population
Identify a list of patients with the same concept type - using CONCEPT_PATH value in concept_dimension view helps to find patients with specific concepts (e.g. the following query outputs all patients who have a transfusion concept):
select distinct TOP 1000 PATIENT_NUM from i2b2.observation_fact a
where exists (select * from i2b2.concept_dimension b
where CONCEPT_PATH like '%I2B2MetaData\Transfusion%' and
Sample SQL for mean, min, and max age and count of patients with HBA1C measured lab test by patient sex:
Query computes the mean, min and max age and count of patients with an HBA1C measured lab test by patient sex. The query includes all patients in the patient_dimension table (full PDSR LDS population), joins to the observation_fact table and includes all the CONCEPT_CDs corresponding to an HBA1C lab test by looking in the concept_dimension table. You can change the CONCEPT_PATH (or add additional CONCEPT_PATHS using an OR statement) to look for a different set of concepts:
(Execution completes after ~5 min.)
from [PDSR_Monthly].[i2b2].[patient_dimension] p
inner join PDSR_Monthly.i2b2.observation_fact f on p.PATIENT_NUM = f.PATIENT_NUM
and f.CONCEPT_CD IN (select CONCEPT_CD
where CONCEPT_PATH LIKE '\I2B2MetaData\LabTests\LAB\Chemistry\GHBA1C\%')
group by SEX_CD
Other SQL Queries
Analytics Labs and Vitals views take longer to return results. Since there are indexes on both Encounter_Num and Patient_Num, in order to return rows quickly, please add a ‘where clause’ on either Encounter_Num or Patient_Num. Examples are below:
select top 10 * from Analytics.Labs where Encounter_Num <10000000
select top 10 * from Analytics.Vitals where Patient_Num <10000000
List showing the highest level of folder grouping of the concepts:
select TOP 1000 HierarchyLevel1NM from Dimension.Concepts
group by HierarchyLevel1NM
order by HierarchyLevel1NM
Use the Dimension.Concepts table to identify which concepts you are interested in and use that list of concepts to filter data from the Analytics tables
For example, identify all medications under the Valacyclovir folder:
select TOP 1000 * from Dimension.Concepts
There are thirty five (35) concepts grouped under the Valacyclovir category/folder/hierarchy level. To identify these concepts, run the below query:
select TOP 1000 * from Analytics.Medications m
where exists (select 1 from Dimension.Concepts c
where m.ConceptCD =c.ConceptCD
The Medications ontology organizes medication concepts into hierarchies/folders similarly for Diagnoses, LabTests, HealthHistory, and Procedures.
List patient set for a specific clinic in vitals by the patient’s encounter num:
(Execution completes after ~2 min.)
select TOP 1000 * from Dimension.EncountersCurrentPatients e
left join Analytics.VitalsCurrentPatients v on e.Encounter_Num = v.Encounter_Num
where v.ClinicNM = 'MGH Cancer Center (232)'
order by e.Patient_Num, e.Encounter_Num