Read below about current and past projects that have involved partnerships with UW researchers as well as with outside organizations and industry. Projects involve a wide array of computational tools and methodologies and span a diverse range of fields, from athletics to medicine to oil & gas to the humanities.
Artificial Intelligence/Computer Vision/Machine Learning
...
Inverse Reinforcement Learning to Predict Cancer Progression
...
EXTERNAL PROJECT COLLABORATORS:
Dr. Nicholas Chia, Mayo Clinic
Dr. Christina Fliege, University of Illinois, National Center for Supercomputing Applications
...
Optical character recognition and sentiment analysis of the Beatles as a cultural phenomenon
...
UW PROJECT COLLABORATORS:
PhD Candidate in Computer Science: Milana Wolff
English Professor Kent Drummond & the UW English department
The Libraries of UW
This project uses large-scale optical character recognition (OCR) and sentiment analysis to digitize text from historical newspapers and measure public attitudes towards the Beatles, the highly popular British music group of the 1960s. Key steps completed so far include: obtaining raw data, such as PDFs of historical newspapers mentioning the Beatles, via ProQuest (accessible through the Coe Library proxy), along with pre-processed popular culture archives from collaborators at the Coe Library; running the open-source program Tesseract on the PDF documents to extract their text; extensive data cleaning to minimize OCR errors and ensure that each article carries an accurate date; scoring the overall positive or negative emotion expressed in each newspaper article with the VADER, TextBlob, and SentiWordNet sentiment tools in Python; and visualizing the changes in sentiment over time with the Matplotlib and Seaborn Python libraries. Future steps include enhanced statistical analysis and expansion of the underlying dataset beyond articles from the New York Times and the Adam Matthew Popular Culture in Britain and America, 1950-1975 sources. Preliminary results are included below. The graph shows variations in sentiment expressed in articles related to the Beatles from the Popular Culture dataset over the duration of their careers, with notable dates indicated by red and pink lines. Positive values indicate positive language (such as “good” or “excellent music”), while negative values indicate negative language.
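As a rough illustration of the scoring and aggregation steps described above, the sketch below averages per-article sentiment scores by month. It is not the project's code: the tiny word list, the function names `toy_score` and `sentiment_by_month`, and the sample articles are all hypothetical stand-ins. In practice a real scorer would replace the toy one, for example the `compound` value returned by VADER's `SentimentIntensityAnalyzer().polarity_scores(text)`.

```python
import re
from collections import defaultdict
from datetime import date
from statistics import mean

# Toy lexicon: a hypothetical stand-in for a real scorer such as VADER or TextBlob.
TOY_LEXICON = {"good": 1.0, "excellent": 1.0, "love": 1.0,
               "bad": -1.0, "awful": -1.0, "noise": -1.0}


def toy_score(text):
    """Return a score in [-1, 1]: the mean polarity of the lexicon words found."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return mean(hits) if hits else 0.0


def sentiment_by_month(articles, score=toy_score):
    """Average the per-article scores within each (year, month) bucket."""
    buckets = defaultdict(list)
    for published, text in articles:
        buckets[(published.year, published.month)].append(score(text))
    return {month: mean(scores) for month, scores in sorted(buckets.items())}


# Invented sample articles: (publication date, OCR text).
articles = [
    (date(1964, 2, 10), "Excellent music and a good show."),
    (date(1964, 2, 12), "Awful noise."),
    (date(1966, 8, 5), "A bad, bad evening."),
]
monthly = sentiment_by_month(articles)
```

Passing the scorer as a parameter keeps the aggregation independent of the sentiment tool, so the same monthly averages can be compared across VADER, TextBlob, and SentiWordNet before plotting.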
Strategies for Correcting and Extracting Fields from Optical Character Recognition Products
...
Students: Carver Bray, Collin Dixon
...
Faculty: Prof. Robert Kelly
...
Oil Well Cards Parsing for the Enhanced Oil Recovery Institute (EORI)
...
PROJECT COLLABORATORS:
Enhanced Oil Recovery Institute (EORI)
UW Libraries & Digital Scholarship Center
The purpose of this project is to use Optical Character Recognition (OCR) to extract text from a collection of more than 100,000 digitized oil well cards, stored as PDF documents, and to organize that text into spreadsheet files. The spreadsheets will be sent to the EORI so that it can keep track of the oil well information recorded on the cards. Tools utilized include Kitty Ranger and Tesseract.
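The step from OCR text to a spreadsheet could look roughly like the sketch below, which assumes the OCR text for each card has already been produced (for example by Tesseract). The field labels ("Operator", "Well Name", "Total Depth"), the column names, and the sample card are hypothetical; the real cards may be laid out quite differently.

```python
import csv
import io
import re

# Hypothetical field labels; real well cards may use different ones.
FIELD_PATTERNS = {
    "operator": re.compile(r"Operator\s*[:.]\s*(.+)", re.IGNORECASE),
    "well_name": re.compile(r"Well\s*Name\s*[:.]\s*(.+)", re.IGNORECASE),
    "total_depth_ft": re.compile(r"Total\s*Depth\s*[:.]\s*([\d,]+)", re.IGNORECASE),
}


def parse_card(ocr_text):
    """Pull each known field out of one card's OCR text; missing fields stay empty."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        row[name] = match.group(1).strip() if match else ""
    # Drop thousands separators so the depth loads as a number in a spreadsheet.
    row["total_depth_ft"] = row["total_depth_ft"].replace(",", "")
    return row


def cards_to_csv(ocr_texts, out_file):
    """Write one CSV row per card, with a header row first."""
    writer = csv.DictWriter(out_file, fieldnames=list(FIELD_PATTERNS))
    writer.writeheader()
    for text in ocr_texts:
        writer.writerow(parse_card(text))


# Invented sample card text, plus one card where no field could be read.
sample = "Operator: Example Oil Co.\nWell Name: Sample 1\nTotal Depth: 4,250 ft\n"
buffer = io.StringIO()
cards_to_csv([sample, "illegible card"], buffer)
```

Cards that yield no recognizable fields still produce an (empty) row, which makes it easy to find the files that need manual review.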
Extracting Data from Radiocarbon Dating Cards of Anthropological Records
...
PROJECT COLLABORATORS:
Anthropology Professor Robert Kelly & the UW Anthropology Department
UW Libraries
Radiocarbon dating was invented nearly 70 years ago and remains a crucial method for determining the age of historical objects, fossils, and geological sites. Early records were compiled on notched 5×8-inch cards, which still contain valuable information for modern researchers. Fred Johnson (1904-1994), an archaeologist at the Peabody Museum of Andover Academy, compiled 45,000 such cards covering dates from 1959 to 1972 from all over the world, based on the reports and data published in the journal Radiocarbon. To make this information accessible to today's researchers in digital form, the University of Wyoming Libraries digitized the cards and applied Optical Character Recognition (OCR) to the scans. Our project focused on correcting the OCR output, extracting the relevant fields from these records, and organizing them for upload to the Canadian Archaeological Radiocarbon Database (CARD). Our Python code automates this process and can be reused for other batches of similar cards.
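One kind of correction and extraction described above can be sketched as follows: repairing letters that OCR commonly substitutes for digits, then pulling out a radiocarbon age and its uncertainty. This is only an illustration under assumptions. The pattern, the substitution table, the function names, and the sample line are hypothetical and are not taken from the project's code or from CARD's field definitions.

```python
import re

# Letters that OCR commonly confuses with digits (an assumed, incomplete table).
DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})

# An age such as "12,300 ± 150"; OCR may render "±" as "+/-" or "+-".
AGE_PATTERN = re.compile(
    r"\b([0-9OolIS][0-9OolIS,]*)\s*(?:±|\+/-|\+-)\s*([0-9OolIS]+)"
)


def clean_number(token):
    """Repair digit look-alikes, drop thousands separators, and convert to int."""
    return int(token.translate(DIGIT_FIXES).replace(",", ""))


def extract_age(ocr_line):
    """Return (age, uncertainty) in years, or None if no age is found."""
    match = AGE_PATTERN.search(ocr_line)
    if match is None:
        return None
    return clean_number(match.group(1)), clean_number(match.group(2))


# Invented OCR line with typical digit confusions.
result = extract_age("Age: l2,3OO +/- 15O B.P.")
```

Restricting the look-alike substitutions to tokens that sit next to a "±" sign keeps the repair from altering ordinary words elsewhere on the card.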
...