Complete PDSR Curated Data Set FAQs

About the data

If I do not have SQL experience - how can I get help?
What is i2b2?
Is there any way to link this data to identifiers?
What identifiers are excluded and included?
When should I use the RPDR and when should I use the PDSR?
Can I publish based on findings from this data set?
How do I request access?

About the MGB Analytics Enclave, the "Enclave Platform"

What is the Enclave?
What is a project workspace?
What tools will be available to me in the Enclave?
Where can I get help with the computational tools?
Who can I share this data/my analysis with?
What if my Enclave project workspace needs extra data storage?
I have found a cohort of interest. What can I do now?
Can I export or download data?
Can I add my own data to my project workspace?

• If I do not have SQL experience - how can I get help?
We recommend that researchers accessing the PDSR have intermediate to advanced experience working with SQL and clinical databases. Researchers with little or no such experience should work with a data analyst or access the data via the RPDR.

• What is i2b2?
The Informatics for Integrating Biology and the Bedside (i2b2) data model of the clinical research chart (CRC) is based on the “star schema”. The star schema has a central “fact” table in which each row represents a single fact. In i2b2, the facts are observations about a patient. Each observation is recorded by a specific observer, within a specific time range (defined by start and end dates), about a specific concept, such as a lab test or disease, in the context of an encounter, such as an outpatient or inpatient visit. The concept can be any coded attribute about the patient, such as a code for diabetes, a medication the patient is on, or a specific test result. Expressing a concept as an attribute in a row, rather than designating it as a column, is based on prior work known as the entity-attribute-value (EAV) model. Data arranged in a star schema and represented in EAV format is very efficient to query, because the concepts recorded in EAV format allow one very large index to be built that cuts across all patients' data in the fact table.
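As a rough illustration of the fact-table idea described above, the following sketch builds a tiny in-memory fact table with Python's built-in sqlite3 module and queries it by concept. The table and column names follow common i2b2 conventions, but the concept codes, sample rows, and helper functions are made up for illustration; the actual PDSR tables, codes, and SQL dialect may differ, and the dimension tables of the star schema are omitted for brevity.

```python
import sqlite3

# Illustrative i2b2-style fact table; NOT the actual PDSR schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE observation_fact (
        patient_num   INTEGER,  -- who the fact is about
        encounter_num INTEGER,  -- visit in which it was recorded
        concept_cd    TEXT,     -- what was observed (the EAV "attribute")
        provider_id   TEXT,     -- who observed it
        start_date    TEXT,
        end_date      TEXT,
        nval_num      REAL      -- numeric value, if any (the EAV "value")
    );
    -- A single index on the concept column covers every kind of fact.
    CREATE INDEX idx_fact_concept ON observation_fact (concept_cd, patient_num);
    """
)

# Made-up sample facts; the concept codes are hypothetical.
conn.executemany(
    "INSERT INTO observation_fact VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 100, "DX:DIABETES", "PROV1", "2020-01-05", "2020-01-05", None),
        (1, 100, "LAB:HBA1C", "PROV1", "2020-01-05", "2020-01-05", 8.1),
        (2, 200, "DX:DIABETES", "PROV2", "2021-03-10", "2021-03-10", None),
        (2, 201, "LAB:HBA1C", "PROV2", "2021-06-01", "2021-06-01", 6.4),
        (3, 300, "LAB:HBA1C", "PROV1", "2022-02-14", "2022-02-14", 5.2),
    ],
)


def patients_with_concept(conn, concept_cd):
    """Return the sorted patient numbers that have at least one fact for a concept."""
    rows = conn.execute(
        "SELECT DISTINCT patient_num FROM observation_fact "
        "WHERE concept_cd = ? ORDER BY patient_num",
        (concept_cd,),
    )
    return [r[0] for r in rows]


def patients_with_concept_and_value_above(conn, dx_cd, lab_cd, threshold):
    """Return patients who have concept dx_cd and a lab_cd value above threshold."""
    rows = conn.execute(
        """
        SELECT DISTINCT dx.patient_num
        FROM observation_fact AS dx
        JOIN observation_fact AS lab ON lab.patient_num = dx.patient_num
        WHERE dx.concept_cd = ? AND lab.concept_cd = ? AND lab.nval_num > ?
        ORDER BY dx.patient_num
        """,
        (dx_cd, lab_cd, threshold),
    )
    return [r[0] for r in rows]


diabetics = patients_with_concept(conn, "DX:DIABETES")
high_a1c = patients_with_concept_and_value_above(
    conn, "DX:DIABETES", "LAB:HBA1C", 7.0
)
print(diabetics, high_a1c)
```

Because diagnoses and lab results live in the same table as rows rather than as separate columns, both queries filter on the same concept_cd column and can use the same index.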

• Is there any way to link this data to identifiers?
Not at the moment. In the future, we will provide the direct identifier patient mapping for a select cohort of patients, as approved by the IRB for a particular study or research project.

• What identifiers are excluded and included?
The Complete PDSR Curated Data Set complies with the “Limited Data Set” format as defined by HIPAA (HHS.gov HIPAA link) and MGB policy.

  • Excluded: A limited data set is protected health information that excludes the following direct identifiers of the individual or of relatives, employers, or household members of the individual:
    • names (including initials)
    • postal address information, other than town or city, state, and zip code
    • telephone numbers
    • fax numbers
    • e-mail addresses
    • social security numbers
    • medical record numbers
    • health plan beneficiary numbers
    • account numbers
    • certificate/license numbers
    • vehicle identifiers and serial numbers, including license plate numbers
    • device identifiers and serial numbers
    • web universal resource locators (URLs)
    • internet protocol (IP) addresses
    • biometric identifiers, including finger and voice prints
    • full face photographs and comparable images
  • Included: The Complete PDSR Curated Data Set includes the following PHI data:
    • geographic data: town, city, state, and zip code, but not street address
    • dates: any dates relating to an individual (e.g., birth dates, admission dates, discharge dates, procedure dates, and dates of death)
    • Patient_Num: unique static identifier for patient

• When should I use the RPDR and when should I use the PDSR?
See the “What data source best aligns with my research needs?” table on the Complete PDSR Curated Data Set Home Page.

• Can I publish based on findings from this data set?
Yes, you can publish your findings with the appropriate PDSR acknowledgement. Please check back to see recommended citation language.

• How do I request access?
Please see the Request Access section on our Complete PDSR Curated Data Set Dashboard for project access requirements, links to access documentation, and a link to the access request form.

• What is the Enclave?
The Enclave Platform is a centralized, highly secure, virtual one-stop platform with strict data security requirements. You can use it to remotely access data, perform data analysis, run machine learning or AI workloads, and collaborate on sensitive, confidential data, all in a project-specific remote desktop workspace. Please reference the MGB Analytics Enclave Platform FAQ for additional information.

• What is a project workspace?
A project defines a user's dedicated workspace within the Enclave for working with data. The project workspace includes personal and shared drives for sharing work, and multiple analytic tools, such as SQL Server Management Studio (SSMS), R, Python, and SAS, for accessing and analyzing data.

• What tools will be available to me in the Enclave?
The Enclave provides a baseline offering of common data analytics software programs and applications, such as MATLAB, R, RStudio, SAS, SPSS, and STATA, along with development tools (GitLab, Python, IntelliJ IDEA, Java Development Kit, Java Runtime Analysis Toolkit). The software programs and R repository packages are updated regularly. The in-platform repository (including a CRAN mirror) contains the most commonly used packages. A full list of available Enclave tools is found on the MGB Analytics Enclave Platform Offerings page.

You can also customize your environment with additional software. To do so, please submit a project workspace Digital Research ServiceNow Issue/Inquiry Request.

• Where can I get help with the computational tools?
The Support Team can assist with access, configuration, and environment issues. We are not able to provide advanced technical support for specific tools. Researchers using these tools are expected to have experience and expertise with them.

For assistance with troubleshooting or technical support with your Enclave project workspace environment, please submit a project workspace RISC ServiceNow Issue/Inquiry Request.

• Who can I share this data/my analysis with?
Data and analyses are shared among team members who have access to the same project workspace in the Enclave via a shared (P:) drive. The P: drive belongs to the team, not to individual members. The project’s PI or Project Lead has complete control over access to the P: drive at the folder level via the Partners PAS system and oversees the data and files shared by their team.

• What if my Enclave project workspace needs extra data storage?
The storage size of your project workspace shared (P:) drive can be increased from its default size of 128 GB. Please submit a project workspace Digital Research ServiceNow Issue/Inquiry Request.

• I have found a cohort of interest. What can I do now?
Within your project workspace, you can share your data with your team and import the data into one of the many tools available in the Enclave for further analysis.
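One simple way to hand a cohort to teammates or to another tool is to save its patient numbers as a CSV file, which R, SAS, Python, and most other analysis tools can read. The sketch below is only an illustration under stated assumptions: the helper names and file name are hypothetical, and it writes to a temporary folder, whereas in practice you would choose a folder on your project's shared drive.

```python
import csv
import tempfile
from pathlib import Path


def save_cohort(patient_nums, out_dir, filename="cohort.csv"):
    """Write de-duplicated, sorted patient numbers to a one-column CSV file."""
    out_path = Path(out_dir) / filename
    with out_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["patient_num"])
        for num in sorted(set(patient_nums)):
            writer.writerow([num])
    return out_path


def load_cohort(path):
    """Read the patient numbers back from a CSV written by save_cohort."""
    with Path(path).open(newline="") as f:
        return [int(row["patient_num"]) for row in csv.DictReader(f)]


# A temporary folder stands in for a shared project folder (an assumption).
with tempfile.TemporaryDirectory() as tmp:
    cohort_file = save_cohort([3, 1, 2, 3], tmp)
    cohort = load_cohort(cohort_file)

print(cohort)
```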

• Can I export or download data?
Exporting or downloading data from your Enclave project workspace is regulated.
• Summarized outcomes of research studies can be removed with permission. For example, a researcher may create tables containing a higher-level summary of the analyzed data, such as the distribution of particular diseases by age or ethnic group. This data is considered summarized because it is no longer specific to a particular patient and is the outcome (the result) of the analysis.
• Raw database data, including screenshots, must remain in the Enclave and may not be removed.
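To make the distinction between raw and summarized data concrete, the sketch below turns a few made-up patient-level rows into a table of counts by diagnosis and age group. The rows, age bands, and function names are illustrative assumptions. Whether a particular table may leave the Enclave is decided through the permission process described above, not by this code.

```python
from collections import Counter

# Hypothetical patient-level rows: (patient_num, age in years, diagnosis).
rows = [
    (1, 34, "diabetes"),
    (2, 67, "diabetes"),
    (3, 71, "hypertension"),
    (4, 45, "hypertension"),
    (5, 69, "diabetes"),
]


def age_group(age):
    """Bucket an age into a coarse band (band edges are arbitrary here)."""
    if age < 40:
        return "<40"
    if age < 65:
        return "40-64"
    return "65+"


def summarize(rows):
    """Count patients per (diagnosis, age group); patient numbers are dropped."""
    return Counter((dx, age_group(age)) for _, age, dx in rows)


summary = summarize(rows)
for (dx, band), n in sorted(summary.items()):
    print(f"{dx:<13} {band:<6} {n}")
```

The resulting table contains only group-level counts, with no row that refers to an individual patient.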

• Can I add my own data to my project workspace?
Yes. All researchers provisioned with an Enclave project workspace are permitted to import data and files into their workspace.