Big Data Analytics Bootcamp on IDEA

May 22, 2017 9:00 am to May 26, 2017 12:00 pm

399 Revolution Drive, Somerville, MA 02145

The Big Data Analytics Bootcamp on IDEA is a hands-on workshop led by Dell Data Scientists and supported by the ERIS Scientific Computing team for Partners research groups. The training starts with a session on Research Methodology that will be applied to the student cases. Next comes an introduction to loading data into the IDEA Platform. Afterward, we will focus on specific tools, such as Python, PySpark, and R, and their use in Data Science projects and Big Data applications. Time is reserved at the end for specific questions and participant support.

You can select which training sessions you attend. Please note that this training is in high demand and we can offer only a few spots. Please register here to save your spot!

Contact ideasupport@partners.org if you have questions.

Thank you!

Scientific Computing Team

	Monday 05/22	Tuesday 05/23	Wednesday 05/24	Thursday 05/25	Friday 05/26
9:00 - 12:00	Data Science Research Methodology	Data Science with Python + Hands-on	Data Science with PySpark + Hands-on	Data Science with R + Hands-on	Office hours
1:00 - 5:00	Data Loading: Sqoop + Hive	Data Science Research Methodology Group exercise (continuation)	Data Science with PySpark + Hands-on (continuation)	Data Science with R + Hands-on (continuation)

Requirements

Because of the wide variety of topics covered, participants are required to have a basic knowledge of Python and R coding: basic operations, data structures, flow control, and loops. If you need introductory material to prepare, please take a look at the Python and R websites.
Bring a laptop, either Windows or Mac.
Request an IDEA account in advance. Please fill out the form at https://rc.partners.org/idea-platform-account-request

Agenda:

Data Science Research Methodology: This course describes the DEPP methodology (Description, Exploration, Predictive, Prescriptive). For each of these steps, we will show the students how to apply the methodology. We will focus on tools for statistical inference, such as Design of Experiments and hypothesis testing.
1. (Monday AM)
  - Big data maturity curve
  - An overview about DEPP for Big Data
  - Define actionable business objective
  - DEPP Exercise
  - DEPP Adaption
  - Hypothesis generation using tree algorithms with existing data. Exercise.
  - Mutually exclusive and collectively exhaustive (MECE) for non-existent data. Exercise.
  - Analytics case study and discussion
  - Use case brainstorming: workgroup exercise
2. (Tuesday PM)
  - All groups will present the analytics case study they developed in the first session.
Data Loading (Monday PM): This course covers in detail how to load data into the IDEA Platform, specifically using Sqoop and Hive.
- Sqoop: Extracting from MySQL and Postgres databases as examples.
- Hive:
  - Architecture overview
  - External Tables
  - Internal Tables
  - Security in IDEA. Schema access
  - Demo - SQuirreL, TOAD
  - Loading files
  - Querying Unstructured Data
  - Loading strategy
Data Science with Python (Tuesday AM): The objective of this course is to understand how to use selected Python packages for machine learning, data management, and medical image processing. Basic knowledge of how to code in Python is required for this part of the training.
1. The Python programming language:
  - Lambda functions
  - scikit-learn, pandas, and medical images using the SimpleITK package
  - Main libraries in Python (scikit-learn, matplotlib, pandas, NLTK).
  - Demo of TensorFlow
Data Science with PySpark (Wednesday): The objective of this course is to provide the students with the main concepts of PySpark. The students will gain a broad understanding of what Spark is and how to use Spark with the Python API.
1. (Wednesday Morning)
  - Spark Introduction
  - Spark Ecosystem
  - How Spark Works
  - RDD – Main engine for Spark
  - Parallel Programming in Spark
  - Caching and Persistence
2. (Wednesday Afternoon)
  - How to write Spark applications
  - MLlib – Spark Library for Data Science
  - Other Libraries in Spark (Spark SQL, DataFrame, GraphX, Spark Streaming)
Data Science with R (Thursday): The objective of this course is to provide the students with the main concepts of R and some statistical and machine learning techniques.
- R Packages:
  - Time Series
  - ANOVA
  - Dimensionality Reduction: PCA/SVD
  - Outlier Detection
  - Text Mining
  - Classification
  - Clustering
  - Regression
- R Spark
Office hours (Friday AM): The Dell team and the Partners Scientific Computing team will be available for follow-up questions.
