The system watches the movie frame by frame. For every face that appears, it detects the face, recognizes the identity, and classifies the emotion. For every line spoken, it first separates vocals from background music (Demucs), reduces noise (DeepFilterNet), then transcribes every word with millisecond-precision timestamps (WhisperX). It identifies who is speaking (pyannote diarization) and measures vocal emotion along three dimensions — arousal, valence, dominance (Wav2Vec2 SER). The transcribed text is then run through a DistilRoBERTa emotion classifier to get a text-level emotion label and confidence score — a third independent signal alongside face and voice. Three channels, each telling a potentially different story about the same moment.
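Conceptually, every spoken line ends up as one record carrying all three signals. A minimal sketch of that shape — the field names, ranges, and conflict rule here are illustrative, not the pipeline's actual schema:

```python
from dataclasses import dataclass


@dataclass
class UtteranceSignals:
    """One dialogue line with its three independent emotion signals.

    Field names are illustrative; the real pipeline's schema may differ.
    """
    character: str          # speaker, via pyannote diarization + face recognition
    dialogue: str           # WhisperX transcript
    text_emotion: str       # DistilRoBERTa 7-class label
    text_confidence: float  # DistilRoBERTa confidence score
    audio_arousal: float    # Wav2Vec2 SER, roughly 0..1 here
    audio_valence: float    # Wav2Vec2 SER, roughly -1..1 here
    face_emotion: str       # DeepFace label for the speaker's face

    def channels_conflict(self) -> bool:
        # Crude illustrative check: a smiling face paired with a
        # negative vocal valence or a negative text label.
        negative_text = self.text_emotion in {"anger", "sadness", "fear", "disgust"}
        return self.face_emotion == "happy" and (self.audio_valence < 0 or negative_text)


row = UtteranceSignals("Scarlett", "Wearing as if I waited for the poor...",
                       "anger", 0.82, 0.814, -0.312, "happy")
print(row.channels_conflict())  # → True
```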
Three Perception Channels
- 👀 **Visual (FER):** RetinaFace face detection → Facenet512 identity recognition → DeepFace emotion classification
- 🔊 **Audio (SER + ASR):** WhisperX transcript; Wav2Vec2 arousal, valence & dominance
- 📄 **Text (NLP):** DistilRoBERTa 7-class emotion + confidence score (anger / joy / sadness / fear / surprise / disgust / neutral)
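The audio channel is continuous (arousal/valence scores) while the other two are categorical, so comparing them requires some mapping. One common approach is to bucket arousal/valence into coarse circumplex quadrants; a sketch, with thresholds and quadrant names chosen purely for illustration:

```python
def av_quadrant(arousal: float, valence: float) -> str:
    """Map continuous arousal/valence onto a coarse circumplex quadrant.

    Assumes arousal in [0, 1] split at 0.5 and valence in [-1, 1]
    split at 0 — illustrative conventions, not the pipeline's calibration.
    """
    high = arousal >= 0.5
    positive = valence >= 0.0
    if high and positive:
        return "excited/joyful"   # maps loosely to joy, surprise
    if high and not positive:
        return "distressed"       # maps loosely to anger, fear
    if not high and positive:
        return "calm/content"     # maps loosely to neutral
    return "depressed/sad"        # maps loosely to sadness


print(av_quadrant(0.814, -0.312))  # → distressed
```

High arousal with negative valence lands in the "distressed" quadrant, which is how a line can sound angry or fearful even when the face reads happy.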
```csv
character,dialogue,text_emotion,audio_arousal,audio_valence,face_emotion
Brent,"Why do we care if we were expelled?",anger,0.762,0.649,neutral
Scarlett,"Wearing as if I waited for the poor...",anger,0.814,-0.312,happy   ← CONFLICT
```
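A flat CSV like this lends itself to simple downstream checks — for instance, automatically flagging rows like Scarlett's where the face contradicts the voice and text. A sketch using only the standard library; the column names come from the sample above, but the disagreement rule is illustrative:

```python
import csv
import io

SAMPLE = """\
character,dialogue,text_emotion,audio_arousal,audio_valence,face_emotion
Brent,"Why do we care if we were expelled?",anger,0.762,0.649,neutral
Scarlett,"Wearing as if I waited for the poor...",anger,0.814,-0.312,happy
"""

NEGATIVE_TEXT = {"anger", "sadness", "fear", "disgust"}


def flag_conflicts(csv_text: str) -> list[str]:
    """Return characters whose face emotion contradicts voice or text."""
    flagged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        face_positive = row["face_emotion"] == "happy"
        voice_negative = float(row["audio_valence"]) < 0
        text_negative = row["text_emotion"] in NEGATIVE_TEXT
        if face_positive and (voice_negative or text_negative):
            flagged.append(row["character"])
    return flagged


print(flag_conflicts(SAMPLE))  # → ['Scarlett']
```

Brent's neutral face passes; Scarlett's smiling face against negative valence and an anger text label gets flagged.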
"Sees Scarlett smiling (happy, confidence 85%), hears her voice cracking (arousal 0.82, valence -0.31), reads dialogue tagged as sadness. Three signals — three different answers."
Arousal and valence across the entire movie timeline — automatically extracted by the pipeline