Student work

ARCC is looking for dedicated students with a strong work ethic and a professional attitude to engage in a range of research, software development and system administration projects. We pay well, send students to conferences, aim to expose them to advanced technologies, and help with internship opportunities in industry.

Potential projects for Fall 2022

See also the projects that ran at ARCC during Summer 2022, listed further down this page: some of them could still use additional help.

Inverse Reinforcement learning to predict cancer progression

Collaborators:

We are building an AI medical software prototype to predict cancer progression using genomic data. Our prior implementation of the Pop-Up Restaurant Inverse Reinforcement Learning has been tested successfully on colorectal cancer, but it takes too long to train and uses impractical amounts of computational resources. We are now introducing an efficient DNA encoding method based on bidirectional encoder representations from transformers, which should significantly reduce the computational load and permit the use of larger training datasets, enabling higher-quality predictions. This approach will be applicable to other cancers, COVID variant prediction, and any other process of mutation-driven cellular evolution.

Bioinformatics workflow automation

We will continue working on various projects automating bioinformatics workflows in collaboration with UW faculty and external entities. This will be similar to the microbial pipeline workflow below, as well as these past works:

Automate the ARCC cluster utilization report system

This project builds on the “utilization analysis” work we did in Summer 2022 (see below). Now that we know which measures we want to track, we need an automated system to track them. This will involve building a database of cluster jobs and storage utilization. The current Python scripts will then need to be adapted to mine this database for a specific time frame and produce reports that can be shared with UW faculty and administration, posted on the ARCC website for public consumption, or reported to the State governor’s office.
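As a rough illustration of what such a job database and time-frame query could look like, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, column names, and sample jobs are assumptions made for this example only; the real system would be populated from the cluster scheduler's accounting data.

```python
import sqlite3


def build_job_db(rows):
    """Create an in-memory SQLite database of finished cluster jobs.

    The schema is hypothetical. Timestamps are stored as ISO 8601 text,
    so plain string comparison matches chronological order.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE jobs (
               job_id INTEGER PRIMARY KEY,
               user TEXT NOT NULL,
               department TEXT NOT NULL,
               start_time TEXT NOT NULL,
               end_time TEXT NOT NULL,
               cores INTEGER NOT NULL
           )"""
    )
    conn.executemany("INSERT INTO jobs VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def core_hours_by_department(conn, window_start, window_end):
    """Sum core-hours per department for jobs that ended inside the window."""
    query = """
        SELECT department,
               SUM(cores * (julianday(end_time) - julianday(start_time)) * 24.0)
        FROM jobs
        WHERE end_time >= ? AND end_time < ?
        GROUP BY department
        ORDER BY department
    """
    cursor = conn.execute(query, (window_start, window_end))
    return [(dept, round(hours, 2)) for dept, hours in cursor]


# Made-up sample jobs, for illustration only.
SAMPLE_JOBS = [
    (1, "alice", "Botany", "2022-06-01T08:00:00", "2022-06-01T10:00:00", 4),
    (2, "bob", "Physics", "2022-06-15T00:00:00", "2022-06-15T12:00:00", 32),
    (3, "alice", "Botany", "2022-07-02T09:00:00", "2022-07-02T10:30:00", 16),
]

if __name__ == "__main__":
    db = build_job_db(SAMPLE_JOBS)
    for dept, hours in core_hours_by_department(
        db, "2022-06-01T00:00:00", "2022-07-01T00:00:00"
    ):
        print(f"{dept}: {hours} core-hours in June 2022")
```

Restricting the report to a time frame is then only a matter of changing the two window arguments, which is what makes a database more convenient than re-parsing raw logs for every report.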

Assist with system administration

We are looking for technically minded students who would like to engage in HPC cluster administration tasks and help us out with hardware installation, wiring, testing; software installation, configuration and testing; user support, etc. Training will be provided.

Code improvements for scientific web applications

ARCC administers a number of research web applications that were written a few years ago and could use code improvements, upgrades to current security standards, and incorporation of best practices. Work will involve investigating how a particular web server is set up with regard to the interaction between HTML and PHP code, database querying, Apache configuration, etc. Once the setup is understood, improvement or re-implementation will proceed under the supervision of ARCC staff and the faculty stakeholders.

VM development for standardized research and training environments

ARCC is engaged with a number of training initiatives on campus that require standardized software environments for teaching coding, such as specific IDEs, conda environments, and specific versions of installed software. We need to build a range of virtual machines that support such environments for each class or workshop, so that the VMs can be supplied on WyoLearn for asynchronous learning.

Engagement with Pacific Research Platform

The Pacific Research Platform (PRP, https://pacificresearchplatform.org ) gives researchers access to compute resources for very large parallel GPU-based computations. We would like to be able to support UW faculty, staff and students in deploying their workloads on this platform. However, it has a somewhat unusual setup. We need a brave student or three to deploy a project on this platform and figure out how to use it, so as to expand ARCC's capacity to support users on PRP.

 

Projects that ran in Summer 2022

Automating a workflow for high-throughput genomic analysis of wildlife pathogens in Wyoming

 

An automated workflow management system makes it possible to orchestrate multistep, complex, time-consuming processes in a well-organized, parallelized, reproducible fashion. In this study, we developed an automated genome analysis workflow on the Nextflow platform to identify bacterial isolates from infected wildlife samples. Individual bioinformatics programs were chained together in a single pipeline deployed on the Teton HPC cluster at the University of Wyoming. The workflow was optimized and benchmarked to run in parallel on very large numbers of samples, using a large portion of the cluster for a short amount of time. Whole genome sequencing technologies are becoming robust and inexpensive, yet the cost of computational analysis and the human effort of deploying and maintaining the code are still very significant. Our objective was to develop a data analysis pipeline that can process very large datasets in a rapid, efficient, standardized manner using Nextflow. This enables the discovery of the microbial groups linked to the wildlife diseases researched at the Wyoming State Vet Lab.

 

Strategies for correcting and extracting fields from optical character recognition products

 

Radiocarbon dating was invented nearly 70 years ago and continues to be a crucial method for determining the age of historical objects, fossils and geological sites. Early records were compiled on notched 5×8-inch cards, which still contain information valuable to modern researchers. Fred Johnson (1904-1994), an archaeologist at the Peabody Museum of Andover Academy, compiled 45,000 such cards for the dates 1959-1972 from all over the world, based on the reports and data published in the journal Radiocarbon. To make this information accessible to scientists today, the University of Wyoming Libraries digitized the cards and applied Optical Character Recognition (OCR) to the output. Our project focused on correcting and extracting the relevant fields from these records and organizing them for upload to the Canadian Archaeological Radiocarbon Database (CARD). Our Python code automates this process and can be reused for other batches of similar cards.
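To give a flavor of the correction and extraction step, here is a minimal sketch that pulls a lab number and a radiocarbon age out of OCR text while repairing letters that OCR commonly confuses with digits. The card layout, field labels, regular expressions, and character mapping are all assumptions made for illustration; they are not the project's actual rules.

```python
import re

# Assumed mapping of letters that OCR often substitutes for digits.
DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})

# Hypothetical card layout: a lab number such as "W-1204" and an age such as
# "3,450 ± 120 B.P.". Digit positions also accept the confusable letters.
LAB_RE = re.compile(
    r"Lab(?:oratory)?\s*No\.?\s*:?\s*([A-Za-z]{1,4})\s*-\s*([0-9OolIS]+)"
)
AGE_RE = re.compile(
    r"([0-9OolIS][0-9OolIS,]*)\s*(?:±|\+/-|\+-)\s*([0-9OolIS]+)\s*B\.?\s*P\.?"
)


def fix_digits(text):
    """Drop thousands separators and map confusable letters back to digits."""
    return text.replace(",", "").translate(DIGIT_FIXES)


def extract_fields(card_text):
    """Return a dict of the fields found on one OCR'd card (None if missing)."""
    record = {"lab_code": None, "age_bp": None, "error_years": None}
    lab = LAB_RE.search(card_text)
    if lab:
        record["lab_code"] = f"{lab.group(1).upper()}-{fix_digits(lab.group(2))}"
    age = AGE_RE.search(card_text)
    if age:
        record["age_bp"] = int(fix_digits(age.group(1)))
        record["error_years"] = int(fix_digits(age.group(2)))
    return record


if __name__ == "__main__":
    # Made-up card text containing typical OCR digit errors.
    sample = "Lab No.: W-l2O4\nSample: charcoal, hearth\nAge: 3,45O ± l2O B.P."
    print(extract_fields(sample))
```

Applying the digit corrections only inside fields that are expected to be numeric avoids damaging ordinary words such as "Sample" or "charcoal" elsewhere on the card.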

 

Optical character recognition and sentiment analysis of the Beatles as a cultural phenomenon

 

  • Students: Milana Wolff

  • Faculty: Prof. Kent Drummond

  • Collaborating campus units: The Libraries and the English Department

This project uses large-scale optical character recognition and sentiment analysis to digitize text found in historical newspapers and extract information about a particular topic: public attitudes towards the Beatles, the highly popular British music group of the 1960s. Key steps so far have included:

  • obtaining raw data: PDFs of historical newspapers mentioning the Beatles via ProQuest (accessible through the Coe Library proxy), and pre-processed popular culture archives via collaborators at the Coe Library;

  • using the open-source program Tesseract to perform optical character recognition (OCR) on the PDF documents in order to extract their text;

  • extensive data cleaning to minimize errors and ensure accurate dates on articles;

  • conducting sentiment analyses using the Python packages VADER, TextBlob, and SentiWordNet to determine the overall positive or negative emotion expressed in each newspaper article;

  • visualizing the changes in sentiment over time using the Matplotlib and Seaborn Python libraries.

Future steps include enhanced statistical analysis and expansion of the underlying dataset beyond articles from the New York Times and the Adam Matthew Popular Culture in Britain and America, 1950-1975 sources. Preliminary results are included below. The graph shows variations in sentiment expressed in articles related to the Beatles from the Popular Culture dataset over the duration of their careers, with notable dates indicated by red and pink lines. Positive values indicate positive language (such as “good”, “excellent music”, etc.), while negative values indicate the opposite.
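The idea of scoring each article and then tracking sentiment over time can be sketched with a toy lexicon-based scorer. This is a deliberately simplified stand-in, not VADER, TextBlob, or SentiWordNet: the six-word lexicon, its weights, and the sample articles are invented for illustration, and real tools also handle negation, intensifiers, and much larger vocabularies.

```python
import re
from collections import defaultdict
from statistics import mean

# Tiny made-up lexicon: positive words score above zero, negative below.
LEXICON = {
    "good": 1.0,
    "excellent": 2.0,
    "brilliant": 2.0,
    "bad": -1.0,
    "awful": -2.0,
    "noisy": -1.0,
}


def polarity(text):
    """Average lexicon score of the scored words in a text (0.0 if none match)."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [LEXICON[word] for word in words if word in LEXICON]
    return mean(scores) if scores else 0.0


def sentiment_by_year(articles):
    """Map year -> mean article polarity, given (year, text) pairs."""
    by_year = defaultdict(list)
    for year, text in articles:
        by_year[year].append(polarity(text))
    return {year: mean(values) for year, values in sorted(by_year.items())}


if __name__ == "__main__":
    # Invented snippets standing in for OCR'd newspaper articles.
    sample = [
        (1964, "An excellent show, good songs"),
        (1964, "A noisy crowd"),
        (1966, "Awful remarks, bad press"),
    ]
    print(sentiment_by_year(sample))
```

The per-year averages produced this way are the kind of series that would then be plotted against time.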

 

Image recognition and coordinate tracking of mice during behavioral experiments

 

We are improving the performance of image recognition software that tracks the physical position of mice during behavioral experiments. The experimenter simultaneously records large-scale neural activity in vivo through a miniscope mounted on the animal’s skull. The scope, wires and mounting bar obscure the recording camera's field of view, which prevents the software from identifying the mouse correctly. We are pursuing two parallel approaches to remedy this. We are improving the current Matlab code to help it better recognize mice, and we are also experimenting with Python-based code that is purpose-built to recognize behaving mice in experimental videos. Once the recognition issues are resolved, the resulting behavioral trajectories will be combined with the simultaneous recordings of neuronal activity to establish the mechanisms by which the activity of individual neurons or neural ensembles encodes the animal’s behavior. This will aid our understanding of the neural circuit mechanisms of depression, autism, and dementia in the medial prefrontal cortex (mPFC), all of which produce deficits in social behavior.
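One simple way to cope with frames in which the mouse is not detected is to interpolate its position between the nearest frames where detection succeeded. The sketch below illustrates that idea only; it is an assumed post-processing step, not a description of the Matlab or Python code used in the project, and the trajectory format (a list of (x, y) tuples with None for missed frames) is hypothetical.

```python
def fill_gaps(track):
    """Linearly interpolate None entries that lie between two known positions.

    `track` is a list with one entry per video frame: an (x, y) tuple when
    the mouse was detected, or None when detection failed. Gaps at the very
    start or end of the track are left as None, because there is no second
    known position to interpolate towards.
    """
    filled = list(track)
    known = [i for i, pos in enumerate(track) if pos is not None]
    for left, right in zip(known, known[1:]):
        (x0, y0), (x1, y1) = track[left], track[right]
        span = right - left
        for i in range(left + 1, right):
            t = (i - left) / span  # fraction of the way through the gap
            filled[i] = (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
    return filled


if __name__ == "__main__":
    # Made-up trajectory: frames 1-3 and frame 5 have no detection.
    raw = [(0.0, 0.0), None, None, None, (4.0, 8.0), None]
    print(fill_gaps(raw))
```

Linear interpolation is only reasonable for short gaps; for long occlusions the animal may have moved along a very different path, so such frames are better flagged than filled.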

 

GPU benchmarking

In this project we are learning how to benchmark the performance of AI workloads at a very deep level, including I/O movement between CPU and GPU, transfers to and from memory on either chip, CPU and GPU cycles, and disk I/O.

ARCC cluster utilization analysis

In this project we are building a software toolkit to gain insights into ARCC cluster utilization: cycle usage per user and per job, disk utilization over time, and any discernible patterns such as utilization by department or variation of cycle usage as a function of proposal deadlines or time of year (Summer vs academic year).
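The kinds of breakdowns described above can be sketched as a single grouping helper that sums core-hours under any chosen key, such as user or calendar month. The job record format and the sample jobs are assumptions made for this example; the toolkit's actual inputs and metrics may differ.

```python
from collections import defaultdict
from datetime import datetime


def core_hours(job):
    """Core-hours consumed by one job: cores multiplied by wall-clock hours."""
    start = datetime.fromisoformat(job["start"])
    end = datetime.fromisoformat(job["end"])
    return job["cores"] * (end - start).total_seconds() / 3600.0


def usage_by(jobs, key):
    """Total core-hours grouped by key(job), e.g. by user or by month."""
    totals = defaultdict(float)
    for job in jobs:
        totals[key(job)] += core_hours(job)
    return dict(totals)


# Made-up job records, for illustration only.
SAMPLE_USAGE = [
    {"user": "alice", "start": "2022-06-01T08:00:00", "end": "2022-06-01T10:00:00", "cores": 4},
    {"user": "bob", "start": "2022-06-15T00:00:00", "end": "2022-06-15T12:00:00", "cores": 32},
    {"user": "alice", "start": "2022-07-02T09:00:00", "end": "2022-07-02T10:30:00", "cores": 16},
]

if __name__ == "__main__":
    print(usage_by(SAMPLE_USAGE, key=lambda job: job["user"]))
    # The first seven characters of an ISO timestamp are "YYYY-MM".
    print(usage_by(SAMPLE_USAGE, key=lambda job: job["start"][:7]))
```

Passing the grouping key as a function keeps the helper small while still covering per-user, per-department, and seasonal views of the same data.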