
Research Data Management

Background and links to more information about data management issues.

Winter 2023 Reproducible Research Training

Interested in learning new tools and skills to create more organized and reproducible research?

During the winter term, the Library and ITC will host a series of workshops on computational research and data management best practices. The workshops will explore tools, concepts, and strategies to make your research more efficient and reproducible.

Join us for workshops on Text Analysis, Machine Learning, GIS, Introduction to Data Science, HPC, RDM, and Motion Data!

Please see the tabs below for more information on the sessions in each track. Tracks group related sessions, but each session is designed to stand alone.

Registration for all events: https://dartgo.org/RRADworkshops

Workshop Tracks

An Introduction to Text Analysis with Python: Uncovering Hidden Patterns in Texts

How can we use computational techniques to analyze texts and then visualize patterns buried within them? In this six-lesson series, you will learn how to get started with the Python programming language and how to apply Python to perform digital text analysis. You will practice identifying and visualizing patterns within individual texts and across large collections or corpora of texts.


6 sessions: 1/18, 2/1, 2/8, 2/15, 2/22, 3/1 at 12 pm


Part 1: The State of the Union is ___? Working with texts in Python, 1/18 @ 12pm

What can we learn about texts by applying text analysis in Python? How do we get started?


In this session, participants will:

  • Learn how to write basic scripts in Python using Jupyter Notebooks
  • Work with and modify strings and text files using Python
  • Iterate through a corpus of texts, extract basic information from each, and create a colorful bar chart showing the changing lengths of State of the Union speeches (a minimal sketch follows this list)
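
To give a flavor of that final exercise, here is a minimal sketch, not the workshop's actual materials: it assumes a hypothetical folder sotu/ of plain-text speeches named by year (e.g. 1990.txt) and charts their word counts with matplotlib.

    # Minimal sketch: bar chart of speech lengths (folder and file names are hypothetical)
    from pathlib import Path
    import matplotlib.pyplot as plt

    corpus_dir = Path("sotu")

    years, lengths = [], []
    for path in sorted(corpus_dir.glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        years.append(path.stem)                # the year, taken from the file name
        lengths.append(len(text.split()))      # word count as a rough length measure

    plt.bar(years, lengths, color=plt.cm.viridis([n / max(lengths) for n in lengths]))
    plt.xticks(rotation=90, fontsize=6)
    plt.ylabel("Words per speech")
    plt.title("Length of State of the Union speeches over time")
    plt.tight_layout()
    plt.show()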

Part 2: The United States is / are ____? Counting Words & Terms, 2/1 @ 12pm

Exploring the frequency of words and phrases in texts: what can these counts tell us about a text?


In this session, participants will:

  • Use Python (and the NLTK package) to read individual text files and apply essential pre-processing techniques, e.g. dividing each text into a list of words or tokens, lower-casing all words, removing punctuation, and lemmatizing each word (sketched after this list)
  • Create frequency lists identifying the most common words or ngrams (multi-word terms) in a text or corpus
  • Create graphs, charts, and word clouds visually representing word and term frequency patterns
  • Identify some ways the language of State of the Union speeches has changed over time and discuss how this method could be applied to other texts and questions
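
As a preview of the pre-processing and counting steps, here is a minimal sketch with NLTK, not the workshop's actual materials; the sample file path is hypothetical.

    # Minimal sketch: NLTK pre-processing and frequency counts (file path is hypothetical)
    from collections import Counter
    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download("punkt")      # tokenizer models
    nltk.download("wordnet")    # lemmatizer data

    text = open("sotu/1990.txt", encoding="utf-8").read()

    tokens = nltk.word_tokenize(text.lower())                # tokenize and lower-case
    words = [t for t in tokens if t.isalpha()]               # drop punctuation and numbers
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(w) for w in words]        # lemmatize each word

    print(Counter(lemmas).most_common(10))                   # most frequent words
    print(Counter(nltk.ngrams(lemmas, 2)).most_common(10))   # most frequent bigrams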

Part 3: From Government to Dreams: Comparing Texts across Time Using TFIDF, 2/8 @ 12pm

How does the language of a select group of texts change over time? How can we compare texts using the frequency counts of words?


In this session, participants will:

  • Apply Python (and the scikit-learn library) to calculate term frequency-inverse document frequency (TFIDF) scores for words in a corpus; TFIDF is a way to identify what makes a text unique compared to a larger corpus (a minimal sketch follows this list)
  • Search for the TFIDF scores of individual words within a corpus (a deductive approach)
  • Create a graphic showing the words that appear with the most unusual frequency in each text (an inductive approach)
  • Discuss how this method could be adapted to other texts or questions
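
Here is a minimal TFIDF sketch with scikit-learn, not the workshop's actual materials; the corpus folder, the queried word, and the speech name are all hypothetical examples.

    # Minimal sketch: TFIDF scores for a folder of plain-text files (paths are hypothetical)
    from pathlib import Path
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    paths = sorted(Path("sotu").glob("*.txt"))
    docs = [p.read_text(encoding="utf-8") for p in paths]

    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(docs)        # documents x vocabulary matrix

    scores = pd.DataFrame(tfidf.toarray(),
                          index=[p.stem for p in paths],
                          columns=vectorizer.get_feature_names_out())

    # Deductive: follow one word's score across the corpus (assumes the word occurs in it)
    print(scores["government"].sort_values(ascending=False).head())

    # Inductive: the most distinctive words in one speech (assumes a file named 1990.txt)
    print(scores.loc["1990"].sort_values(ascending=False).head(10))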

Part 4: Topics & Emotions: Topic Modeling and Sentiment Analysis, 2/15 @ 12pm

How can we sort and classify texts by emotional register (anger, sadness, joy, etc.) and topic?


In this session, participants will:

  • Run and evaluate topic modeling code that sorts a collection of texts into groups organized by topic
  • Run and evaluate code that assigns sentiment analysis scores to each segment of text (a small sketch follows this list)
  • Discuss the potential promise and problems inherent in topic modeling and sentiment analysis
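
For the sentiment half, here is a minimal sketch using NLTK's VADER scorer, not the workshop's actual materials; the example sentences are made up.

    # Minimal sketch: sentiment scores with NLTK's VADER (sentences are made up)
    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon")
    sia = SentimentIntensityAnalyzer()

    for sentence in ["The state of our union is strong.",
                     "We face grave and gathering dangers."]:
        print(sentence, sia.polarity_scores(sentence))   # neg/neu/pos/compound scores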

Part 5: Text Classification, 2/22 @ 12 pm

In previous sessions of the Text Analysis series, we have learned how to describe texts in terms of, e.g., word counts, topics, or expressed sentiments. In this session, we will try to identify, visualize, and exploit patterns in these features: Which texts are more similar? Which are different? Can we use these features to classify texts into categories? 


In this session, we will dig deeper into the extracted features and use dimensionality reduction techniques to visualize emerging patterns. Using the State of the Union dataset, we will practice what we have learned by trying to automatically guess if a speech was delivered by a Democratic or a Republican president.
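
To make the idea concrete, here is a minimal sketch, not the workshop's actual materials: a tiny made-up corpus with hypothetical party labels stands in for the real dataset, TFIDF features are reduced to two dimensions for plotting, and a simple classifier guesses the label of a new sentence.

    # Minimal sketch: dimensionality reduction plus classification (data are made up)
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.linear_model import LogisticRegression

    docs = ["We must cut taxes and shrink government.",
            "We will invest in health care and education.",
            "Lower taxes will unleash the economy.",
            "Affordable health care is a right for all."]
    labels = ["R", "D", "R", "D"]                          # hypothetical party labels

    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)

    print(TruncatedSVD(n_components=2).fit_transform(X))   # 2-D coordinates for plotting

    clf = LogisticRegression().fit(X, labels)
    print(clf.predict(vec.transform(["Taxes should be lower."])))  # guess a label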


While not strictly required, attending the previous sessions in the series and “Intro to Machine Learning with scikit-learn” is highly recommended. 

Part 6: Beyond the Union: Working with Other Corpora, 3/1 @ 12 pm

How can we find and construct our own corpora?


In this session, participants will:

  • Examine other readily available text corpora (from NLTK, Constellate, ProQuest, etc.) and how to import them into Python (a short sketch follows this list)
  • Practice with and modify code that imports a corpus of plain-text files from your own computer into Python
  • Examine examples of text analysis in a diverse set of fields and brainstorm potential applications
  • Discuss next steps to learn more about and practice text analysis in Python
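
As a preview, here is a minimal sketch of both approaches using NLTK, not the workshop's actual materials; the local folder name is hypothetical.

    # Minimal sketch: a ready-made NLTK corpus and a folder of your own .txt files
    import nltk
    from nltk.corpus import PlaintextCorpusReader

    nltk.download("inaugural")              # a bundled corpus of inaugural addresses
    from nltk.corpus import inaugural

    print(inaugural.fileids()[:3])          # e.g. '1789-Washington.txt', ...
    print(len(inaugural.words("2009-Obama.txt")))

    # Your own texts: point a corpus reader at a folder of .txt files (folder is hypothetical)
    my_corpus = PlaintextCorpusReader("my_texts", r".*\.txt")
    print(my_corpus.fileids())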

Intro to Machine Learning with scikit-learn, 1/25 @ 12 pm

Scikit-learn (also known as sklearn) is a machine learning library written in Python. It features models and methods for supervised and unsupervised learning, dimensionality reduction, model selection and evaluation, and even some techniques for the visualization of results. Because of its efficient implementations and accessible interface, scikit-learn is very popular in educational, research, and production environments, and runs “under the hood” of many other, more streamlined libraries (e.g. NLTK, auto-sklearn, or PyCaret).


In this code-along workshop, we will introduce various components of scikit-learn. By the end of the session, you will be able to implement a typical machine learning workflow.
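
As a flavor of what such a workflow looks like, here is a minimal sketch on a built-in scikit-learn dataset, not the workshop's actual materials.

    # Minimal sketch: a typical scikit-learn workflow on the bundled iris dataset
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)                     # load data
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
    model.fit(X_train, y_train)                           # train
    print(accuracy_score(y_test, model.predict(X_test)))  # evaluate on held-out data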

Intro to PyTorch, 2/14 @ 2 pm

Deep Learning for everyone! PyTorch is a free and open source machine learning framework for the rapid development of neural networks for applications in computer vision, natural language processing, or speech recognition. It provides a simple Python interface, which makes it equally popular in education, research, and production environments. If a problem falls into fairly standard categories, powerful pre-trained models are available out-of-the-box. If a custom model is required, PyTorch makes it easy to define, train, and test neural networks using state-of-the-art algorithms and components from simple feed-forward networks to convolutional networks to LSTMs, transformers and more.


In this session, you will get a brief overview of the components provided by PyTorch. We will apply a pre-trained model to a problem with just a few lines of code, and we will define our own neural network! Finally, we will introduce the concept of transfer learning, which allows you to benefit from pre-trained models even if your particular problem is different from what the model was originally trained on! 
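
As a taste of defining your own network, here is a minimal sketch, not the workshop's actual materials; the data are random stand-ins, purely illustrative.

    # Minimal sketch: define and train a small feed-forward network (data are random)
    import torch
    from torch import nn

    model = nn.Sequential(            # tiny network: 4 inputs -> 3 classes
        nn.Linear(4, 16),
        nn.ReLU(),
        nn.Linear(16, 3),
    )

    X = torch.randn(100, 4)           # random stand-in features
    y = torch.randint(0, 3, (100,))   # random stand-in class labels

    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

    for step in range(10):            # a few training steps
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()

    print(loss.item())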

Introduction to the R package 'caret', 2/23 @ 11 am

The caret package (short for Classification And Regression Training) is one of the most popular R packages for statistics and machine learning problems. It contains functions that streamline the model training process for complex regression and classification problems, making training, tuning, and evaluating machine learning models in R consistent and easy.

During this session, you will be introduced to the basic functionalities of the caret package.

Basic knowledge of R and linear regression is helpful for understanding the content of this webinar.

Introduction to PyCaret, 3/2 @ 11 am

PyCaret is an open-source Python machine learning library inspired by the popular R package caret. The goal of caret is to automate the major steps of evaluating and comparing machine learning algorithms for classification and regression, and its main benefit is that one can achieve a lot with only a few lines of code and little manual configuration. The PyCaret library brings these capabilities to Python. It is well suited for seasoned data scientists who want to increase the productivity of their machine learning experiments, as well as for citizen data scientists and those new to data science with little or no coding background.

This is a suitable training session for people who already have basic knowledge of Python and are interested in learning more to perform high-level data analysis.

Note: Google Colab will be used for all demos.
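
To show the few-lines-of-code style, here is a minimal sketch using one of PyCaret's bundled sample datasets; the dataset name and target column follow PyCaret's own examples, and this is a sketch rather than the workshop's actual materials.

    # Minimal sketch: an automated model comparison with PyCaret
    from pycaret.datasets import get_data
    from pycaret.classification import setup, compare_models

    data = get_data("juice")                      # a sample dataset shipped with PyCaret
    setup(data, target="Purchase", session_id=0)  # one-line experiment configuration
    best = compare_models()                       # train and rank many models automatically
    print(best)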

Up and Running with GIS, 1/10 @ 12pm 

We'll introduce the concepts of Geographic Information Systems (GIS) for making maps, analyzing geographic data, and creating new geographic data. We'll discuss and use some of the many different types of GIS data.

GIS for the Humanities and Social Sciences, 1/26 @ 12 pm   

This workshop will examine the use of mapping and geographic information systems in the humanities and social sciences, and teach some basic tools and techniques to create and edit GIS data as well as to query existing geospatial datasets for information.

R and Python scripting for Geospatial Data Science, 2/9 @ 12pm 

We'll use R and Python to show how to work with geospatial data and create reproducible workflows, results, and maps.
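
As a small taste of the Python side, here is a minimal sketch with the GeoPandas library, not the workshop's actual materials; the shapefile path is hypothetical.

    # Minimal sketch: load, reproject, and map vector data (file path is hypothetical)
    import geopandas as gpd

    world = gpd.read_file("countries.shp")       # an assumed vector dataset
    print(world.crs, len(world))                 # coordinate system and feature count

    # Compute areas in an equal-area projection (EPSG:6933), then map them
    world["area_km2"] = world.to_crs(epsg=6933).area / 1e6
    world.plot(column="area_km2", legend=True).figure.savefig("map.png")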

Getting Started with the Discovery Cluster (SLURM), 1/17 @ 10-11:30 am

This class is for users new to the Discovery cluster. It covers how to set up your environment, submit jobs, transfer files to and from the cluster, use available storage, and monitor your jobs.

Getting Started with the Discovery Cluster (SLURM), 2/21 @ 10-11:30 am

This class is for users new to the Discovery cluster. It covers how to set up your environment, submit jobs, transfer files to and from the cluster, use available storage, and monitor your jobs.

Massively parallel computing with MPI in Python, 3/1 @ 2 pm

Most modern programming libraries for computational work make it easy to parallelize your code and thus leverage the power of all CPU cores on your machine. But what if even that is not enough? How can we truly unleash the power of a High Performance Cluster like Dartmouth’s Discovery and use hundreds of CPUs distributed across multiple nodes?

The answer: MPI, the Message Passing Interface, a standard that is implemented by various libraries. It allows the nodes within a cluster to communicate. By sending status messages and data back and forth between nodes, the computational load can be distributed across any number of available nodes.

In this session, we will introduce MPI in Python, covering basic concepts and one-to-one, one-to-many, and many-to-one communications, and we will close with a few notes on pitfalls and good practices.
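
For illustration, here is a minimal sketch with the mpi4py library, not the workshop's actual materials; the script and launch command names are illustrative.

    # Minimal sketch: each process does some work; rank 0 gathers the results.
    # Launch with something like:  mpirun -n 4 python hello_mpi.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()          # this process's id
    size = comm.Get_size()          # total number of processes

    local_result = rank ** 2        # stand-in for real per-process work
    results = comm.gather(local_result, root=0)   # many-to-one communication

    if rank == 0:
        print(f"Gathered from {size} processes:", results)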


Introduction to the UNIX Shell, 1/9 @ 12 pm - 1:30 pm 

UNIX is a command-line platform that is a highly powerful and flexible tool for data management and analysis. It helps users automate repetitive tasks and easily combine smaller tasks into larger, more powerful workflows. Use of the shell is fundamental to a wide range of advanced computing tasks, including high-performance computing. This workshop introduces the basic concepts of the UNIX operating system and shell scripting, and explores the essential hands-on skills you need to confidently use the command-line interface.

Getting Started with R, 1/11 @ 12 pm 

R is a free, open-source programming language known for its approachability and its growing popularity as a tool for data analysis and visualization. In this basic, hands-on, 60-minute session, we will introduce fundamental programming concepts in R, such as dataframes and plots, and show how they can save you time and increase the reproducibility of your research.


Introduction to R Shiny, 1/25 @ 2 pm

Shiny is an R package that makes it easy to build interactive web applications straight from R. The Shiny package lets researchers transform any piece of R analysis code into an interactive app that a broad audience can use, without requiring additional coding or web-development skills.

It is strongly suggested that you have experience in R; if not, please sign up for "Getting Started with R" (1/11 @ 12 pm). In this session, we will get you started building Shiny apps right away.

Click the link (https://rstudio-connect.dartmouth.edu/connect/#/apps/389dc504-a863-4b4f-bb3c-68673c82c79a/access) for recommended installation instructions prior to the workshop.

Statistical Data Analysis with R, 2/2 @ 11am

R is a free, open-source programming language known for its approachability and its growing popularity as a tool for data analysis and visualization. In this hands-on session, you will learn how to use R to conduct basic statistical data analysis, saving you time and increasing the reproducibility of your research.

Data Visualization for Health and Biology Research, 1/20 @ 1-5 pm

Research Computing, in partnership with the Reproducible Research Group and the Dartmouth Library, invites researchers of all levels to participate in a workshop focused on developing publication-ready data visualizations for health and biology research using R. Some prior experience with R is expected; we recommend participating in Getting Started with R on January 11.

Workshop Goals:

  • Build traditional data visualizations such as boxplots, scatter graphs, and line graphs using ggplot in R
  • Develop custom color palettes that can be used inside and outside of R
  • Plot hundreds of figures in seconds using loops
  • Add statistical tests and significance values to figures
  • Learn to interpret and build PCAs, t-SNEs, and UMAPs
  • Learn to interpret and build both static and interactive heatmaps


Attendees will be provided lecture slides and R notebooks. Publicly available experimental data will be provided, but attendees may choose to bring their own data for additional interpretation. 

NIH Data Management Plans and Practices, 1/17 @ 2pm 

Effective January 25, 2023, the National Institutes of Health (NIH) is implementing its Policy for Data Management and Sharing (DMS) to promote the management and sharing of scientific data generated from NIH-funded or conducted research.

This workshop will introduce the DMS policy, including required elements of the plan and institutional resources. We will also explore the DMPTool, an online application that helps researchers create data management plans by providing funder and institutional guidance, as well as basic concepts for data management and sharing, such as file naming, organization, and documenting your process.

Representatives from the Office of Sponsored Projects, the Library, and ITC will be available to answer your questions.

NIH Data Management Plans and Practices, 2/28 @ 2pm 

Effective January 25, 2023, the National Institutes of Health (NIH) is implementing its Policy for Data Management and Sharing (DMS) to promote the management and sharing of scientific data generated from NIH-funded or conducted research.

This workshop will introduce the DMS policy, including required elements of the plan and institutional resources. We will also explore the DMPTool, an online application that helps researchers create data management plans by providing funder and institutional guidance, as well as basic concepts for data management and sharing, such as file naming, organization, and documenting your process.

Representatives from the Office of Sponsored Projects, the Library, and ITC will be available to answer your questions.

Intro to Character Design in Maya, 2/1 @ 3:30pm

This workshop is for participants with basic knowledge of Autodesk Maya and will cover the basics of character design and modeling.

In this workshop participants will:

  • Learn how to design humanoid characters in Maya
  • Learn industry standard modeling techniques for character design 
  • Add geometry and textures to pre-made models 
  • Learn how to optimize the design of models for simplified animation

Animation from Motion Capture, 2/8 @ 3:30pm

This workshop is for participants with a basic knowledge of Autodesk Maya. In this workshop participants will:

  • Learn the basics of rigging and skinning humanoid characters 
  • Follow a simplified rigging and skinning workflow on pre-made character models
  • Learn about the motion capture system in the DEV Studio
  • Practice using motion capture data to animate characters in Maya