Complete Patient Data Science Repository (PDSR) Curated Data Set

NEW! Introducing the Complete PDSR Curated Data Set

The new Complete Patient Data Science Repository (PDSR) Curated Data Set is now available to researchers across Mass General Brigham.

This remarkably large data repository provides the research community the opportunity to access and analyze over 5 billion observation facts on the entire Mass General Brigham patient population. Connecting to this repository within the secure MGB Data Enclave platform protects patients’ information by prohibiting the export of any raw data.

Data includes, but is not limited to, Encounters, Lab tests, Medications, Procedures, Problems, Immunizations, Vitals, and Reason for Visit, as obtained from RPDR, with more up-to-date patient Demographics, Lab results, and Diagnoses. In addition, a set of Pulmonary X-ray Severity Score data for COVID-tested patients is derived from a deep learning-based algorithm applied to chest radiograph DICOM images.

Who can access the data set?

  • Available to all Mass General Brigham RPDR Faculty Sponsors and faculty-approved workgroup members.
  • No IRB required! Because the data set contains limited PHI (Protected Health Information) with identifiers removed per MGB policy, an approved IRB Protocol IS NOT required.
  • Upon requesting access, users will attest to a Data Use Agreement (DUA) for the Complete PDSR Curated Data Set.

Important! The Complete PDSR Curated Data Set is stored in a SQL database, so researchers should have intermediate to advanced knowledge of working with SQL and clinical databases.
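
As a rough gauge of the SQL skill level involved, the sketch below joins two tables and filters on a lab value to identify a small cohort. It is only an illustration: the table names (`patients`, `labs`), the column names, and all values are hypothetical and are not the PDSR schema, and Python's built-in `sqlite3` in-memory database stands in for the actual SQL database.

```python
import sqlite3

# In-memory SQLite database as a stand-in; every table, column, and row
# below is hypothetical and does not reflect the real PDSR schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, birth_year INTEGER);
    CREATE TABLE labs (
        patient_id INTEGER REFERENCES patients(patient_id),
        test_name  TEXT,
        result     REAL
    );
    INSERT INTO patients VALUES (1, 1950), (2, 1985), (3, 1962);
    INSERT INTO labs VALUES
        (1, 'HbA1c', 7.2),
        (1, 'HbA1c', 6.9),
        (2, 'HbA1c', 5.4),
        (3, 'HbA1c', 8.1),
        (3, 'LDL', 130.0);
""")

# Cohort: patients born before 1970 with at least one HbA1c result >= 6.5,
# plus the number of qualifying results and the highest value per patient.
COHORT_QUERY = """
    SELECT p.patient_id,
           COUNT(*)      AS n_qualifying_results,
           MAX(l.result) AS max_result
    FROM patients AS p
    JOIN labs AS l ON l.patient_id = p.patient_id
    WHERE l.test_name = ?
      AND l.result >= ?
      AND p.birth_year < ?
    GROUP BY p.patient_id
    ORDER BY p.patient_id
"""

# Parameters are bound with placeholders rather than string formatting.
cohort = conn.execute(COHORT_QUERY, ("HbA1c", 6.5, 1970)).fetchall()
for row in cohort:
    print(row)
```

Researchers who can comfortably read and adapt a query of this shape (joins, filters, grouping, and aggregates) are likely at the level this data set assumes.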

What are the benefits of the data set?

‘One-stop shopping!’ A single, secure place to access a vast amount of data, perform analysis, and share work with your research team.

  • Access to the complete set of data facilitates the study or exploration of extensive populations of patients and clinical data without compromising patient privacy.
  • Ability to easily identify cohorts, generate hypotheses, and develop models.
  • Speeds up research discovery and enables machine learning (ML) and artificial intelligence (AI) work.
  • Broadens opportunities for understanding, evaluating, and analyzing clinical data without requiring an approved IRB protocol.

Are there restrictions on the data set?

Yes, some restrictions are applied to keep our patient data safe.

  • The data is available through the MGB Analytics Enclave. (more info below)
  • The Enclave is a secure environment with some data-sharing rules applied:
    • Data can only be shared with other researchers who have ALSO been provisioned access to the same project workspace.
    • Researchers can send data, documents, etc. INTO their Enclave "computer", but removing information is regulated. For example, summarized outcomes of research studies can be removed with permission, whereas raw data cannot be removed.
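
To illustrate the difference between raw data and summarized outcomes, the sketch below reduces row-level records to per-group counts and means. The records, field names, and the `summarize` helper are hypothetical and chosen only for illustration; whether a particular summary may leave the Enclave is still decided through the permission process described above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical row-level records (one row per patient). Data of this kind
# is raw data and stays inside the Enclave.
raw_rows = [
    {"patient_id": 1, "age_group": "18-39", "length_of_stay": 2},
    {"patient_id": 2, "age_group": "18-39", "length_of_stay": 4},
    {"patient_id": 3, "age_group": "40-64", "length_of_stay": 3},
    {"patient_id": 4, "age_group": "40-64", "length_of_stay": 5},
    {"patient_id": 5, "age_group": "40-64", "length_of_stay": 7},
    {"patient_id": 6, "age_group": "65+", "length_of_stay": 6},
]


def summarize(rows):
    """Collapse row-level records into per-group counts and mean length of stay."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["age_group"]].append(row["length_of_stay"])
    # The result holds only aggregates; no patient-level fields are carried over.
    return {
        group: {"n_patients": len(values), "mean_length_of_stay": mean(values)}
        for group, values in sorted(groups.items())
    }


summary = summarize(raw_rows)
for group, stats in summary.items():
    print(group, stats)
```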

What types of data are included in PDSR?

    Where do I start?

    • You must be an RPDR Faculty Sponsor (i.e. RPDR Workgroup Leader) or a workgroup member approved by your workgroup leader. (Please see RPDR Confidentiality for more information or RPDR Registration to register as a Faculty Sponsor)
    • You will need to submit a ServiceNow form to request access to the Complete PDSR Curated Data Set.
    • To view the instructions on how to submit your request, please review the Request Access section in the Complete PDSR Curated Data Set Dashboard.
    • Please note: you must be on the MGB network or connected through VPN to view the access request form. If asked to authenticate, please provide your MGB credentials.

    How do I get to the data set?

    Once access is approved and provisioned, researchers access the data set within the MGB Analytics Enclave.

    The MGB Analytics Enclave is where researchers find the project workspace that has been provisioned for them. The project workspace is where data querying and analysis are done.

    The MGB Analytics Enclave project workspace is essentially a desktop computer accessed remotely from a researcher's own computer. This Enclave "computer" appears and behaves just like any other computer in most instances. It includes: 

    • Personal drive to save your work.
    • Shared drive to share with the members of your workgroup.
    • Multiple analytic tools, such as:
      • SQL Server Management Studio (SSMS)
      • R, RStudio
      • Python 
      • SAS (bring your own license)

    Accessing the data within your project workspace is straightforward. Researchers are guided to download and open VMware software, enter their credentials, and select their assigned project workspace. A new Enclave project workspace "computer" will open on their desktop, and from there the researcher can open one of the many analytic tools to access the data.

    MGB Analytics Enclave Project Workspace Example


    Visit our Complete PDSR Curated Data Set FAQs for additional details.

    If you have any questions, please contact

    What data source best aligns with my research needs?

    When to use PDSR vs RPDR | PDSR | RPDR
    Access includes | Direct hands-on access to raw and curated patient and clinical data | Access to a user-friendly interface that provides researchers aggregate counts of patients with user-defined characteristics. Identified data can be requested and obtained with an IRB protocol
    Data types available | Complete PDSR Curated Data Set | RPDR Data Dictionary
    Minimum skill set to analyze data | SQL (Structured Query Language) is needed to analyze data | No SQL experience is needed
    Access data behind MGB firewall | Yes | Yes
    Various Clinical notes available | No | Yes
    Available for feasibility & cohort discovery | Yes | Yes
    Direct access to the full patient database | Yes | No
    Computational tools available with the data | Yes | No
    Located in secure environment allowing collaboration with project members | Yes | No
    Researchers may request access to, and use, both PDSR and RPDR!