What does it actually mean for a machine to understand how someone feels? That was the question sitting at the center of this project — and as it turns out, it’s a harder one than it looks.
Over the course of eight weeks, our team of three built a full end-to-end NLP pipeline for the Content Intelligence Agency, a company that develops AI tools to help media makers analyse the emotional content of their videos at scale. The goal: take raw video input, transcribe it, and automatically classify each sentence as one of Paul Ekman’s six core emotions — happiness, sadness, anger, surprise, fear, or disgust — or as neutral, for seven classes in all.
The Pipeline #
The pipeline was trained and tested on English-language unscripted TV show data — 23,696 sentences for training, and 1,210 sentences from Kitchen Nightmares as the test set. Right from the start, the class distribution told a complicated story:
Neutral (32%) and Happiness (23%) dominated the dataset, while Fear (5%) and Disgust (6%) were severely underrepresented — a challenge that shaped every modelling decision we made and ultimately showed up clearly in the final per-class results.
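One standard way to keep that imbalance from drowning out the rare classes is inverse-frequency class weighting at training time. A minimal sketch, using the rounded proportions above (the anger, sadness, and surprise proportions are illustrative fill-ins, not our exact figures):

```python
from collections import Counter

# Hypothetical label distribution mirroring the proportions described above
labels = (["neutral"] * 32 + ["happiness"] * 23 + ["anger"] * 12
          + ["sadness"] * 11 + ["surprise"] * 11 + ["disgust"] * 6
          + ["fear"] * 5)

counts = Counter(labels)
n, k = len(labels), len(counts)

# Balanced weighting: n_samples / (n_classes * class_count),
# so rare classes get proportionally larger loss weights
weights = {c: n / (k * count) for c, count in counts.items()}

for c, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{c:10s} weight={w:.2f}")
```

With this scheme, Fear ends up weighted roughly six times as heavily as Neutral, which is exactly the kind of correction an imbalanced loss needs.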
My Role: From Baselines to Transformers #
My main contribution was model training — running the full progression of experiments from classical baselines through to fine-tuned transformer models, and documenting every iteration in the team’s model log.
Starting with the classics #
I began with Linear SVM and Naïve Bayes, testing a range of feature combinations: TF-IDF, POS tags, sentiment scores, pretrained word embeddings, n-grams, and NER. The goal wasn’t just to get numbers — it was to understand what each feature actually contributed.
The clearest finding came early: sentiment polarity alone is almost useless for emotion classification. An F1 of 0.091 when using sentiment scores as the only SVM feature made that point bluntly. Positive/negative polarity is far too coarse to distinguish between, say, happiness and surprise. TF-IDF remained the strongest traditional signal throughout, with the best SVM iteration — TF-IDF combined with POS tags and sentiment scores — reaching an F1 of 0.490.
Naïve Bayes told a similar story. The most feature-rich combination (TF-IDF + Bag of Words + n-grams + sentiment + NER) peaked at F1 0.475. Useful baselines, but clearly a ceiling.
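For illustration, the core of such a baseline is only a few lines in scikit-learn. This sketch uses TF-IDF with word unigrams and bigrams feeding a Linear SVM; the corpus is a toy stand-in, and the real models also stacked on POS, sentiment, and NER features:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny illustrative corpus; the real training set had 23,696 sentences
texts = ["I am so happy today", "this is disgusting", "I can't believe it",
         "that makes me furious", "what a wonderful surprise", "this food is vile"]
labels = ["happiness", "disgust", "surprise", "anger", "surprise", "disgust"]

# TF-IDF over unigrams and bigrams, then a linear-kernel SVM
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(texts, labels)

print(clf.predict(["absolutely disgusting"]))
```

The appeal of this setup is that it trains in seconds and its learned weights are directly inspectable, which made it easy to see what each added feature was (or wasn't) contributing.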
The transformer jump #
The gap between classical models and transformers was stark. My first DistilBERT runs, fine-tuned for 4–6 epochs, reached F1 scores in the 0.534–0.567 range — already a big step up, even with conservative training settings. A BERT-base run hit 0.516, and its model file was corrupted partway through training, a memorable lesson in the value of checkpointing.
The real breakthrough came with RoBERTa. The architecture’s more robust pretraining paid off immediately: the first clean RoBERTa run hit F1 0.721, and after cleaning the test set and stabilising the training configuration, it settled at F1 0.727 — well above anything the classical models could reach.
I then moved to DeBERTa, which uses a disentangled attention mechanism to separately encode content and position. The first run edged out RoBERTa with F1 0.729 and near-perfect precision-recall balance (73.0% vs 73.1%). A second run with a cosine learning rate scheduler and stronger regularisation didn’t move the needle further — some architectures simply plateau, and knowing when to stop tuning is itself a useful conclusion.
The team’s final pipeline used a DeBERTa model reaching 73.6% accuracy and F1 0.734.
What the numbers don’t show #
The per-class picture was more nuanced. Neutral and Happiness performed well (F1 ~0.789 each). Fear was the weak point across every model — only 35.6% recall, meaning the model missed nearly two out of every three fearful sentences. Disgust and Sadness also struggled consistently, a direct consequence of how underrepresented they were in the training data. Emotion classification in spoken TV language is genuinely hard: the data is informal, interrupted, and stripped of tone of voice — cues a human listener would rely on heavily.
Error Analysis & XAI #
The error analysis revealed consistent confusion patterns: Happiness frequently collapsed into Neutral, and Fear was routinely misclassified as Sadness or Surprise. Short sentences (under five words) were disproportionately misclassified, as was any sentence containing sarcasm or irony — neither of which the model has any mechanism to detect.
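Confusion patterns like these are easiest to read off a confusion matrix. A toy sketch, with hypothetical gold and predicted labels chosen to mirror the patterns described above:

```python
from sklearn.metrics import confusion_matrix

emotions = ["neutral", "happiness", "fear", "sadness", "surprise"]

# Hypothetical labels illustrating the confusions we observed:
# happiness collapsing into neutral, fear drifting to sadness/surprise
y_true = ["happiness", "happiness", "fear", "fear", "neutral", "surprise"]
y_pred = ["neutral", "happiness", "sadness", "surprise", "neutral", "surprise"]

cm = confusion_matrix(y_true, y_pred, labels=emotions)
print(cm)  # rows = gold label, columns = predicted label
```

Reading along the Fear row immediately shows its probability mass leaking into neighbouring negative emotions, which is what made the aggregate recall number so low.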
For the XAI component, we applied Layer-wise Relevance Propagation (LRP) with Conservative Propagation to our transformer model, following the method from Ali et al. (2022). The gradient × input baseline confirmed that emotionally charged words — “terrified”, “furious”, “love” — were weighted heavily. The improved LRP method distributed relevance more evenly across context, catching cases where the emotion was carried by sentence structure rather than a single keyword. Input perturbation experiments showed that confidence dropped sharply after removing just two or three key tokens for most emotions, suggesting the model relies on a small number of anchor words rather than a holistic understanding of context.
Reflections #
Running 24 model iterations across classical and transformer model families taught me more about the gap between the two approaches than any lecture could. The jump from TF-IDF + SVM (F1 ~0.49) to fine-tuned DeBERTa (F1 ~0.73) isn’t magic — it’s the difference between surface-level pattern matching and contextual understanding of language. But 73% on a 7-class emotion problem with imbalanced, spoken-language data is genuinely difficult territory, and the remaining 27% error rate is a useful reminder that language understanding is still an open problem.
Human oversight remains essential for any real deployment — especially for rare emotions like Fear, where even the best model in our suite performs poorly.