March 31, 2022
This document describes (1) the data available in the Azure Enclave and (2) the Azure Enclave database.
For assistance with questions, please write to the Azure Enclave Team at MGBAzureEnclave@partners.org.
Table of Contents
- About the Data
- Azure SQL Database
About the Data
The initial release of the Azure Enclave pulls data from PDSR (Patient Data Science Repository). In the future, data from other sources can be pulled and stored in the Azure SQL database, in flat files on the Project Workspace (VM), or in ADLS (Gen2), where other tools (e.g. Databricks) can access it from the workspace.
➤ PDSR data in i2b2 format
The Azure Enclave uses PDSR data as input. PDSR is a repository of patient data that contains the same data as RPDR (Research Patient Data Repository), converted to i2b2 format. PDSR patient information is updated hourly; all other data is updated monthly. In the future, labs will be updated more often. i2b2 is an industry-standard format for storing patients' medical information. You can find information on i2b2 at the following websites: i2b2 website and i2b2 user community website (database model).
➤ Identification of patient cohort
As part of the Project Workspace onboarding process, project researchers provide the Azure Enclave Team with the patient cohort for which data is required. This could be a list of MRNs (with MRN type), a list of EMPIs, an RPDR query or a PDSR query that identifies the patient cohort.
NOTE: The RPDR patient cohort is updated monthly. The PDSR patient cohort is updated hourly. If an RPDR query is used to identify the patient cohort, the same query may return a slightly different cohort when run in PDSR.
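To check how far a cohort has drifted between the two sources, the two ID lists can be compared with set operations. The following is a minimal sketch, assuming the cohort identifiers have already been exported as plain strings. The function name and the sample IDs are hypothetical and are not part of the Azure Enclave.

```python
def compare_cohorts(rpdr_ids, pdsr_ids):
    """Split two cohort ID lists into RPDR-only, PDSR-only, and shared IDs."""
    rpdr, pdsr = set(rpdr_ids), set(pdsr_ids)
    return {
        "only_rpdr": sorted(rpdr - pdsr),  # in the monthly RPDR cohort only
        "only_pdsr": sorted(pdsr - rpdr),  # in the hourly PDSR cohort only
        "both": sorted(rpdr & pdsr),       # present in both cohorts
    }


# Made-up identifiers, for illustration only.
diff = compare_cohorts(["P001", "P002", "P003"], ["P002", "P003", "P004"])
print(diff)
```

A non-empty "only_rpdr" or "only_pdsr" list is expected from time to time because of the different refresh schedules.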
➤ Data is pushed to an Azure SQL database attached to a Project Workspace (VM)
All PDSR data for the patient cohort is pushed to an Azure SQL database attached to a data science Project Workspace (VM) in Azure.
As part of the Project Workspace onboarding process, researchers provide their data cleanup rule requirements. These rules are applied during the ETL (Extract, Transform, Load) process. Examples of data cleanup rules are:
- Set VIP_CD to NULL (mandatory for Industry Sponsored Research data marts) - standard cleanup
- Remove patients with no facts - standard cleanup
- Remove patients with no visits - standard cleanup
- Remove facts that are older than a certain date - custom cleanup
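The rules listed above can be illustrated on small in-memory records. This is only a sketch of what the rules mean, not the Azure Enclave Team's ETL code. The record layout (dictionaries with patient_num, vip_cd, and start_date keys), the function name, and the order in which the rules run are all assumptions made for the example.

```python
from datetime import date


def apply_cleanup(patients, facts, visits, cutoff=None):
    """Apply the example cleanup rules to lists of dict records (hypothetical layout)."""
    # Custom rule: remove facts older than the cutoff date, if one is given.
    if cutoff is not None:
        facts = [f for f in facts if f["start_date"] >= cutoff]

    with_facts = {f["patient_num"] for f in facts}
    with_visits = {v["patient_num"] for v in visits}

    kept = []
    for p in patients:
        # Standard rules: remove patients with no facts or no visits.
        if p["patient_num"] in with_facts and p["patient_num"] in with_visits:
            # Standard rule: set VIP_CD to NULL (None in Python).
            kept.append({**p, "vip_cd": None})

    # Drop facts and visits that belong to removed patients.
    kept_ids = {p["patient_num"] for p in kept}
    facts = [f for f in facts if f["patient_num"] in kept_ids]
    visits = [v for v in visits if v["patient_num"] in kept_ids]
    return kept, facts, visits


# Made-up records, for illustration only.
patients = [
    {"patient_num": 1, "vip_cd": "Y"},
    {"patient_num": 2, "vip_cd": None},
    {"patient_num": 3, "vip_cd": None},
]
facts = [
    {"patient_num": 1, "start_date": date(2020, 5, 1)},
    {"patient_num": 2, "start_date": date(2001, 1, 1)},
]
visits = [{"patient_num": 1}, {"patient_num": 2}]

kept, kept_facts, kept_visits = apply_cleanup(
    patients, facts, visits, cutoff=date(2010, 1, 1)
)
print(kept)
```

In this example the date filter runs first, so patient 2 loses its only fact and is then removed along with patient 3, which never had any facts or visits.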
All PDSR data pushed to the Azure SQL database by the Azure Enclave Team is locked down (read-only for researchers).
Metadata (data dictionary) for all PDSR data pushed to the Azure Enclave is located in Collibra at the following link.
NOTE: Patient and encounter IDs are not included in the initial release of the Azure Enclave. They will be included in a future release, and access to a patient's actual MRNs/EMPIs will require appropriate approval (IRB).
Azure SQL Database
The database name is workspaceDb. The database server name is mgb-risc-wrkspce-prod-use2-<workspace#>-sql.database.windows.net. This is the unique Project Workspace Data Server Name provided to researchers when their Project Workspace access is provisioned.
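As an illustration of how the server and database names fit together, the sketch below builds the server name from a workspace number and places it in an ODBC-style connection string. The server name pattern and the database name come from this document. Everything else is an assumption: the helper names, the workspace number "42", the ODBC driver name, and port 1433. Authentication settings are deliberately left out because they depend on how your workspace access was provisioned.

```python
DATABASE_NAME = "workspaceDb"  # database name given in this document


def server_name(workspace):
    """Fill the <workspace#> slot of the documented server name pattern."""
    return f"mgb-risc-wrkspce-prod-use2-{workspace}-sql.database.windows.net"


def odbc_connection_string(workspace, driver="ODBC Driver 18 for SQL Server"):
    """Build an ODBC-style connection string (driver and port are assumptions)."""
    return (
        f"Driver={{{driver}}};"
        f"Server=tcp:{server_name(workspace)},1433;"
        f"Database={DATABASE_NAME};"
        "Encrypt=yes;"
    )


# "42" is a placeholder; use the workspace number provided at provisioning.
conn_str = odbc_connection_string("42")
print(conn_str)
```

The resulting string can be passed to an ODBC client library once the appropriate authentication keywords have been added.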
➤ i2b2 schema
The data in i2b2 format for the patient cohort is stored in the i2b2 schema and is read-only. The list of i2b2 tables is:
- encounter_mapping (not populated currently, to be included in a future release)
- patient_mapping (not populated currently, to be included in a future release)
➤ Dimension and Analytics schemas
Optionally, some data types are available in a normalized format that is more user-friendly. This data will be locked down (read-only for users). The list of data types that can be provided in normalized (user-friendly) format is:
Dimension data types (stored in Dimension schema):
- PatientMRN (not populated currently, to be included in a future release)
Fact data types (stored in Analytics schema):
➤ Shared workspace schema (scratchpad)
A shared workspace schema is provided in the Azure SQL database for researchers to use as they wish, for example to store code, tables, and functions for any data transformation or analysis that is needed. Researchers have full control of this schema.
➤ Sizing and compute
The Azure SQL database is configured during the onboarding process to suit the size of the patient cohort and the project's planned usage. If access to the database is only needed once or twice a week, a serverless configuration may be the most cost-effective option. If planned usage is daily, a provisioned configuration may be more cost-effective. With a serverless configuration, charges apply only when the database is in use; a provisioned configuration has a fixed cost. All costs are charged back to the project.
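The trade-off can be made concrete with a simple comparison of usage-based and fixed monthly charges. The sketch below is a deliberately simplified model: it assumes serverless cost is just hours of use multiplied by an hourly rate and ignores storage and any other charges. The function name and all rates are made-up numbers, not Azure prices; the Azure Enclave Team can advise on actual costs.

```python
def monthly_costs(hours_used, serverless_rate_per_hour, provisioned_monthly_cost):
    """Compare a usage-based monthly charge with a fixed monthly charge."""
    serverless = hours_used * serverless_rate_per_hour
    if serverless < provisioned_monthly_cost:
        cheaper = "serverless"
    else:
        cheaper = "provisioned"
    return serverless, provisioned_monthly_cost, cheaper


# Hypothetical rates: 2.0 per hour of serverless use, 300.0 per month provisioned.
light = monthly_costs(16, 2.0, 300.0)    # about two short sessions a week
heavy = monthly_costs(176, 2.0, 300.0)   # about eight hours every working day
print(light)
print(heavy)
```

With these made-up rates, the break-even point is 150 hours of use per month: below that the serverless option costs less, above it the provisioned option does.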
➤ Data Science Project Workspace (VM)
The Azure Enclave Project Workspace is an Azure data science VM (Windows Server 2019 DSVM). There is a baseline offering of common data analytics software programs and applications such as SSMS, Python, VSCode, Azure Data Studio, RStudio, Jupyter, Weka, and Node.js. You can customize your environment with additional tools during the onboarding process.
For a complete list of tools and additional details, please refer to the following link.
The VM is configured with a permanent storage drive with 128 GB of space by default; more space can be provided at additional cost.