Complete Patient Data Science Repository (PDSR) Curated Data Set

NEW! Introducing the Complete PDSR Curated Data Set

The new Complete Patient Data Science Repository (PDSR) Curated Data Set is now available to researchers across Mass General Brigham.

This remarkably large data repository gives the research community the opportunity to access and analyze over 5 billion observation facts on the entire Mass General Brigham patient population. Connecting to this repository within the secure MGB Analytics Enclave platform protects patients’ information by prohibiting the export of any raw data.

Data includes, but is not limited to, Encounters, Lab tests, Medications, Procedures, Problems, Immunizations, Vitals, Reason for Visit, and others, as obtained from RPDR with more up-to-date patient Demographics, Lab results, and Diagnoses. In addition, a set of Pulmonary X-ray Severity Score data for COVID-tested patients is derived from a deep learning-based algorithm using chest radiograph DICOM images.

Who can access the data set?

  • Available to all Mass General Brigham RPDR Faculty Sponsors and faculty-approved workgroup members.
  • No IRB required. Because the data set contains limited PHI (Protected Health Information) with identifiers removed per MGB policy, an approved IRB Protocol is not required.
  • Upon requesting access, users will attest to a Data Use Agreement (DUA) for the Complete PDSR Curated Data Set.

Important! The Complete PDSR Curated Data Set is stored in a SQL database, so researchers should have intermediate to advanced knowledge of SQL and clinical databases.

What are the benefits of the data set?

‘One-stop shopping!’ A singular, secure place to access a vast amount of data, perform analysis, and share work with your research team.

  • Access to the complete set of data facilitates the study or exploration of extensive populations of patients and clinical data without compromising patient privacy.
  • Ability to easily identify cohorts, generate hypotheses, and develop models.
  • Accelerates research discovery and enables machine learning (ML) and artificial intelligence (AI) applications.
  • Broadens opportunities for understanding, evaluating, and analyzing clinical data without requiring an approved IRB protocol.

Are there restrictions on the data set?

Yes, some restrictions are applied to keep our patient data safe.

  • The data is available through the MGB Analytics Enclave. (more info below)
  • This enclave environment provides a secure environment with some data-sharing rules applied:
    • Data can only be shared with other researchers who have ALSO been provisioned access to the same project workspace.
    • Researchers can send data, documents, etc. INTO their Enclave "computer" but removing information is regulated. For example, summarized outcomes of research studies can be removed with permission whereas raw data is not permitted to be removed.

What types of data are included in PDSR?

  • The data set contains all Mass General Brigham patients from ACE (Affiliated Covered Entity) institutions and DFCI.
  • To view data information contained in the PDSR, please refer to the Complete PDSR Curated Data Set Data Dictionary.
  • In addition to the i2b2 and curated formats, data is provided in OMOP (Observational Medical Outcomes Partnership) Common Data Model (CDM) format v5.2. The OMOP CDM standardizes the organization and content of observational data so that standardized applications, tools, and methods can be applied across the data set, allowing for quicker and more efficient querying. The format follows the specification published on the Observational Health Data Sciences and Informatics (OHDSI) OMOP website. Follow the link for more information.
  • Two PDSR databases are available - one is updated monthly and the other is a quarterly snapshot for those who need to work with a more static data set.
  • Reference the grid at the bottom of this page for features of the PDSR and RPDR.
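
As a rough illustration of the kind of cohort query the OMOP CDM format supports, the sketch below counts distinct patients with a given condition who were born before a given year. It is only a sketch under stated assumptions: an in-memory SQLite database with synthetic rows stands in for the real PDSR SQL database, the table and column names follow general OMOP CDM v5 conventions and may not match the PDSR schema exactly (check the Data Dictionary), the concept IDs are made-up placeholders, and the helper name count_patients_with_condition is hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the real PDSR SQL database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    year_of_birth INTEGER
);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER,
    condition_concept_id INTEGER,
    condition_start_date TEXT
);
""")

# Synthetic rows for illustration only; 999001 and 999002 are placeholder concept IDs.
conn.executemany("INSERT INTO person VALUES (?, ?)",
                 [(1, 1950), (2, 1985), (3, 1972)])
conn.executemany("INSERT INTO condition_occurrence VALUES (?, ?, ?, ?)", [
    (10, 1, 999001, "2021-03-01"),
    (11, 1, 999001, "2022-01-15"),  # same patient, second occurrence
    (12, 2, 999002, "2021-07-09"),
    (13, 3, 999001, "2020-11-30"),
])


def count_patients_with_condition(conn, concept_id, born_before):
    """Count distinct patients with the condition who were born before a year."""
    row = conn.execute(
        """
        SELECT COUNT(DISTINCT p.person_id)
        FROM person AS p
        JOIN condition_occurrence AS co ON co.person_id = p.person_id
        WHERE co.condition_concept_id = ?
          AND p.year_of_birth < ?
        """,
        (concept_id, born_before),
    ).fetchone()
    return row[0]


cohort_size = count_patients_with_condition(conn, 999001, 1980)
print(cohort_size)
```

COUNT(DISTINCT ...) matters here because a patient can have several occurrences of the same condition; counting rows instead would overstate the cohort. Only an aggregate count like this, not the underlying rows, is the kind of summarized outcome that may be removed from the Enclave with permission.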

Where do I start?

  • You must be an RPDR Faculty Sponsor (i.e. RPDR Workgroup Leader) or a workgroup member approved by your workgroup leader. (Please see RPDR Confidentiality for more information or RPDR Registration to register as a Faculty Sponsor)
  • You will need to submit a ServiceNow form to request access to the Complete PDSR Curated Data Set.
  • To view the instructions on how to submit your request, please review the Request Access section in the Complete PDSR Curated Data Set Dashboard.
  • Please note - you must be on the MGB network or use VPN to be able to view the access request form. If asked to authenticate, please provide your MGB credentials.

How do I get to the data set?

Once access is approved and provisioned, researchers access the data set within the MGB Analytics Enclave.

The MGB Analytics Enclave is where researchers will find the project workspace that has been provisioned for them. The project workspace is where data querying and analysis are done.

The MGB Analytics Enclave project workspace is essentially a desktop computer accessed remotely from a researcher's own computer. This Enclave "computer" appears and behaves just like any other computer in most instances. It includes: 

  • Personal drive to save your work.
  • Shared drive to share with the members of your workgroup.
  • Multiple analytic tools such as:
    • SQL Server Management Studio (SSMS)
    • R, RStudio
    • Python 
    • SAS (bring your own license)
    • [Coming Soon] PDSR i2b2 Query Tool

Accessing the data within your project workspace is very easy. Researchers are guided to download and open VMware software, enter their credentials, and select their assigned project workspace. A new Enclave project workspace "computer" will open on their desktop, and from within it the researcher can open one of the many analytic tools to access the data.

[Coming Soon] Similar to the Biobank and RPDR query tools, the new PDSR i2b2 Query Tool will be available to researchers. The tool is a web-based application accessed from within your Enclave project workspace. It provides a true count of the patients in the Complete PDSR Curated Data Set who meet user-defined characteristics and criteria such as diagnoses, procedures, medications, and/or laboratory results. For more details on how to access the tool, query patients, and run temporal queries, along with tips and tricks, please visit the PDSR i2b2 Query Tool page.
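
To make the idea of a patient count over user-defined criteria concrete, here is a minimal sketch of the equivalent logic written directly in SQL from Python. It is an assumption-laden illustration, not the query tool's actual implementation: an in-memory SQLite table with synthetic rows stands in for an i2b2-style observation_fact table, the concept codes are invented placeholders, and the helper name count_patients_matching_all is hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for an i2b2-style fact table.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE observation_fact (
    patient_num INTEGER,
    concept_cd TEXT,
    start_date TEXT
)
""")

# Synthetic facts; the concept codes are placeholders, not real codes.
conn.executemany("INSERT INTO observation_fact VALUES (?, ?, ?)", [
    (1, "DX:EXAMPLE_A", "2021-01-05"),
    (1, "MED:EXAMPLE_B", "2021-02-10"),
    (2, "DX:EXAMPLE_A", "2021-03-12"),
    (3, "MED:EXAMPLE_B", "2021-04-01"),
    (4, "DX:EXAMPLE_A", "2022-06-20"),
    (4, "MED:EXAMPLE_B", "2022-06-21"),
    (4, "MED:EXAMPLE_B", "2022-09-02"),  # repeat fact for the same patient
])


def count_patients_matching_all(conn, concept_codes):
    """Count patients who have at least one fact for EVERY given concept code."""
    codes = sorted(set(concept_codes))
    if not codes:
        return 0
    placeholders = ", ".join("?" for _ in codes)
    sql = f"""
        SELECT COUNT(*) FROM (
            SELECT patient_num
            FROM observation_fact
            WHERE concept_cd IN ({placeholders})
            GROUP BY patient_num
            HAVING COUNT(DISTINCT concept_cd) = ?
        ) AS matched
    """
    return conn.execute(sql, (*codes, len(codes))).fetchone()[0]


both = count_patients_matching_all(conn, ["DX:EXAMPLE_A", "MED:EXAMPLE_B"])
diagnosis_only = count_patients_matching_all(conn, ["DX:EXAMPLE_A"])
print(both, diagnosis_only)
```

The GROUP BY with HAVING COUNT(DISTINCT concept_cd) keeps only patients who satisfy all criteria, and repeated facts for the same patient do not inflate the count.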

MGB Analytics Enclave Project Workspace Example

[Image: Enclave Project Workspace]

Visit our Complete PDSR Curated Data Set FAQs for additional details.

If you have any questions, please contact MGBAnalyticsEnclaveSupport@mgb.org.

What data source best aligns with my research needs?

| When to use PDSR vs RPDR | PDSR | RPDR |
|---|---|---|
| Access includes | Direct hands-on access to raw and curated patient and clinical data | Access to a user-friendly interface that provides researchers aggregate counts of patients with user-defined characteristics. Identified data can be requested and obtained with an IRB protocol |
| Data types available | Complete PDSR Curated Data Set | RPDR Data Dictionary |
| Minimum skill set to analyze data | SQL (Structured Query Language) is needed to analyze data | No SQL experience is needed |
| Access data behind MGB firewall | Yes | Yes |
| Various Clinical notes available | No | Yes |
| Available for feasibility & cohort discovery | Yes | Yes |
| Direct access to the full patient database | Yes | No |
| Computational tools available with the data | Yes | No |
| Located in secure environment allowing collaboration with project members | Yes | No |

Researchers may request access to, and use, both!