Projects

This section is where theory meets practice. Each project here represents a different slice of what I bring to data work: collecting messy data from the web, building and comparing machine learning models, applying cutting-edge NLP techniques, creating interactive tools to make my partner happy, and making complex methods accessible to others.

Some of these grew out of pure curiosity (what patterns emerge in song lyrics? can we detect “eras”?), others from practical needs (how do I make my teaching materials easily searchable for my students?), and a few from research challenges (what is the best classifier for a job? how good are LLMs actually? is political polarization really so hot?). Together, they show how I approach problems: I start with a question, pick the right tools and data for the job, and don’t stop until I’ve found an answer worth sharing.

You’ll see a mix of techniques here: classic machine learning, modern transformer models, web scraping, interactive dashboards, and RAG applications. But the thread that runs through all of them is the same: how can I turn raw data into something people can actually understand and use?

Polarization terms

This emerged as part of writing the introductory chapter for my dissertation on political polarization. Merriam-Webster named “polarization” its Word of the Year for 2024, reflecting the search traffic on their page. I wondered whether there were other ways to document such a surge in attention. So I turned to Google Trends (“are people interested in political polarization?”), Google Scholar (“are researchers interested in political polarization?”), and the New York Times API (“are journalists interested in political polarization?”). Turns out, yes, there’s been an overall increase in attention. This summary can serve as an example of how to scrape valuable data sources and navigate obstacles such as CAPTCHAs. READ MORE HERE.

Training and comparing ML classifiers

One topic of a graduate class I taught on Computational Social Science was supervised classification of text. I showed the students different approaches for doing this (simple, dictionary-based; more advanced, using bag-of-words-based models; and advanced, using BERT). To show the students how capable these different models are, I decided to train and compare several machine learning classifiers. Due to time constraints, I resorted to a pre-labeled data set that was available on Kaggle, containing IMDb reviews of movies. Here’s a little report on my results, with an emphasis on the impact of preprocessing and the number of training examples. READ MORE HERE.
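To make the contrast between a dictionary-based and a bag-of-words classifier concrete, here is a minimal sketch in plain Python. It is not the code from the report: the four training reviews, the word lists, and all function names are made up for illustration, and a small multinomial Naive Bayes stands in for the bag-of-words models.

```python
import math
from collections import Counter

# Toy, made-up reviews for illustration only (not the Kaggle IMDb data).
TRAIN = [
    ("a wonderful moving film with great acting", "pos"),
    ("great story and wonderful cast", "pos"),
    ("boring plot and terrible acting", "neg"),
    ("a terrible boring waste of time", "neg"),
]
TEST = [
    ("great acting and a moving story", "pos"),
    ("terrible film a waste of time", "neg"),
]

# Hypothetical sentiment dictionary for the dictionary-based baseline.
POSITIVE = {"wonderful", "great", "moving"}
NEGATIVE = {"boring", "terrible", "waste"}


def tokenize(text):
    """Lowercase and split on whitespace (the only preprocessing here)."""
    return text.lower().split()


def dictionary_classify(text):
    """Count positive minus negative dictionary hits; ties go to 'pos'."""
    tokens = tokenize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "pos" if score >= 0 else "neg"


def train_naive_bayes(examples):
    """Collect per-label word counts, document counts, and the vocabulary."""
    word_counts = {}
    doc_counts = Counter()
    vocab = set()
    for text, label in examples:
        tokens = tokenize(text)
        word_counts.setdefault(label, Counter()).update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def naive_bayes_classify(text, model):
    """Multinomial Naive Bayes with add-one smoothing; unseen words are skipped."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_logp = None, -math.inf
    for label, counts in word_counts.items():
        logp = math.log(doc_counts[label] / total_docs)
        denom = sum(counts.values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:
                logp += math.log((counts[token] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


def accuracy(classify, examples):
    """Share of examples for which the classifier returns the gold label."""
    return sum(classify(text) == label for text, label in examples) / len(examples)


model = train_naive_bayes(TRAIN)
print("dictionary:", accuracy(dictionary_classify, TEST))
print("naive bayes:", accuracy(lambda t: naive_bayes_classify(t, model), TEST))
```

Varying the size of TRAIN or the tokenize step is the toy version of the two factors the report emphasizes: number of training examples and preprocessing.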

BERTopic on survey responses

For a research paper, I needed to analyze open-ended survey responses. They came with particular challenges: (a) they were very short, rendering “classic” mixed membership models useless; (b) they came in three different languages (English, German, Swedish). To classify the responses into different “topics”, I decided to use BERTopic, a modern topic modeling approach based on transformer embeddings. This approach also allowed me to pre-specify topics based on prior research, making it particularly useful for theory-guided research where you already have an idea of what’s in the text. READ MORE HERE.

Shiny advent calendar

Ph.D. students do not have tremendous purchasing power, and my partner and I had to live in different places for a year. However, I knew that I would come back to Durham, NC (where she lived at the time), and that we would spend spring and summer together. To make the distance and wait more bearable, I created a Shiny app that served as an advent calendar for us. Each day, a new suggestion for a shared “date activity” would pop up, like a boat rental or a nice restaurant. It was a fun way to combine my coding skills with a personal touch, and it made the holiday season special despite the distance. READ MORE HERE.

Taylor Swift lyrics

I can’t say that I have been a Taylor Swift fan since her early days. However, I will readily admit that her songwriting does a tremendous job of capturing the lived experience of a Millennial, and that her songs have matured as she has aged (and there are all these fun Shakespeare references). I was curious whether I could find patterns in her lyrics that correspond to different “eras” of her music career. The data were readily available, neatly wrapped in an R package, so I processed the text and went to town with various NLP techniques to see whether distinct themes or styles emerged over time. READ MORE HERE.

RAG of teaching materials

To help my students prepare for their final papers, I wanted to give them a “more targeted” GPT: a Retrieval-Augmented Generation (RAG) system that could answer their questions based on the materials I provided throughout the semester. This involved collecting lecture notes, slides, and reading materials, casting them into a machine-readable format, and then setting up a RAG pipeline that retrieves relevant information and passes it to a local LLM to generate coherent answers. The goal was to make studying more interactive and efficient for my students. READ MORE HERE.
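The retrieve-then-generate idea can be sketched in a few lines of plain Python. This is an illustration under loud assumptions, not the actual pipeline: the three course snippets are invented, word-count cosine similarity stands in for whatever embedding model a real setup would use, and the final call to a local LLM is left out (the sketch stops at the prompt).

```python
import math
import re
from collections import Counter

# Hypothetical snippets standing in for chunked lecture notes and slides.
CHUNKS = [
    "Supervised classification assigns labels to documents using labeled training data.",
    "Topic models discover latent themes in a collection of documents without labels.",
    "Web scraping collects data from websites for later analysis.",
]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=1):
    """Return the k chunks most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True
    )
    return ranked[:k]


def build_prompt(question, chunks, k=1):
    """Assemble the prompt that would be sent to the (local) LLM."""
    context = "\n".join(f"- {c}" for c in retrieve(question, chunks, k))
    return (
        "Answer using only the course material below.\n\n"
        f"Material:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


print(build_prompt("How do topic models find themes?", CHUNKS))
```

Restricting the prompt to retrieved material is what makes the assistant “more targeted”: the model is asked to answer from the course content rather than from whatever it learned during training.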

Transcription tutorial

Social scientists use plenty of text data for their research because it’s readily available and easily analyzable. However, in my experience, audio and video recordings are an often overlooked source of text data. Interviews, focus groups, speeches, and even podcasts can be treasure troves of data. To help fellow researchers get started with transcribing audio data into text, I created a tutorial that walks through the process using popular tools and services. The tutorial covers everything from speaker diarization with pyannote (for instance, if you have interview or focus group recordings) to using a large model in the background (OpenAI Whisper) to ensure high accuracy in your transcriptions. Unlike tools such as Zoom’s transcription software, my approach is fully free and the data will not be used for training the providers’ models, making it well-suited for sensitive data. READ MORE HERE.
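One step that tends to need a few lines of glue code is combining the two outputs: the transcript segments and the speaker turns. The sketch below is not taken from the tutorial. It assumes transcript segments with start, end, and text fields (as Whisper returns them) and speaker turns with start, end, and speaker fields (my own naming for what a diarization step yields), and it assigns each segment to the speaker whose turn overlaps it the most. The example data are invented.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the overlap between two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker of the best-overlapping turn."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn in turns:
            shared = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = turn["speaker"], shared
        labeled.append({**seg, "speaker": best_speaker})
    return labeled


def format_transcript(labeled):
    """Render labeled segments as 'SPEAKER: text' lines."""
    return "\n".join(f"{seg['speaker']}: {seg['text']}" for seg in labeled)


# Invented example data: two interview segments plus one outside any turn.
segments = [
    {"start": 0.0, "end": 4.0, "text": "How did you hear about the study?"},
    {"start": 4.2, "end": 9.0, "text": "A friend sent me the flyer."},
    {"start": 20.0, "end": 22.0, "text": "Thanks for your time."},
]
turns = [
    {"start": 0.0, "end": 4.1, "speaker": "SPEAKER_00"},
    {"start": 4.1, "end": 9.5, "speaker": "SPEAKER_01"},
]

print(format_transcript(assign_speakers(segments, turns)))
```

Segments that overlap no speaker turn are marked UNKNOWN rather than guessed, which makes gaps in the diarization easy to spot when proofreading the transcript.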

Using LLMs for coding qualitative data

Qualitative data analysis often involves coding text data into different categories or themes. This can be a time-consuming and subjective process. To explore how large language models (LLMs) can assist in this task, I set up an experiment where I used an LLM to code a set of qualitative data and compared its performance to a human coder (me). The goal was to assess the accuracy, consistency, and efficiency of using LLMs for qualitative coding, and to identify potential benefits and limitations of this approach. The approach I suggest works across languages (depending on LLM choice) and is fairly quick and robust. It also runs locally (at least on my 2022 MacBook Pro M1 Pro), so the data are never shared with outside companies (e.g., OpenAI, Anthropic). READ MORE HERE.
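Comparing an LLM's codes with a human coder's usually comes down to an agreement statistic. As an illustration, here is Cohen's kappa in plain Python; it is one common choice, not necessarily the metric used in the write-up, and the category labels and code lists below are invented.

```python
from collections import Counter


def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa: agreement between two coders, corrected for chance."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("Both coders must code the same, non-empty set of items.")
    n = len(codes_a)
    # Observed agreement: share of items with identical codes.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Expected agreement if both coders assigned codes independently
    # according to their own marginal category frequencies.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single, identical category throughout
    return (observed - expected) / (1 - expected)


# Invented codes for ten open-ended responses.
human = ["trust"] * 5 + ["distrust"] * 5
llm = ["trust"] * 4 + ["distrust"] * 5 + ["trust"]

print(round(cohens_kappa(human, llm), 2))
```

In this toy case the coders agree on 8 of 10 items, but half of that agreement would be expected by chance given the 50/50 category split, so kappa is noticeably lower than the raw agreement rate.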