WhissleAI/lie_detection_binary

Binary Lie Detector: Real-life Trial Deception

A self-contained, reproducible pipeline that predicts deceptive vs. truthful from short courtroom video clips, fusing three modalities:

| Lane | Source | Signals |
| --- | --- | --- |
| Text | Whissle STT (gateway /video/analyze) | transcript + per-segment emotion / intent / age / gender metadata, entities, diarization, word timing |
| Visual | Audio-visual hybrid intelligence (same gateway call) | per-frame emotion, head pose, gaze, blink, mouth, attention + hand gestures |
| Audio | local prosody (librosa) | pitch (F0), jitter/shimmer, pauses, voice quality |

The text + visual features come from a single Whissle gateway call: POST /video/analyze runs Whissle ASR (with metadata tags) and the audio-visual lane, then fuses them. Prosody is a complementary local lane. Everything else (feature engineering, speaker-independent evaluation, and the classifier) lives in this repo.

⚠️ External dependency: the Whissle gateway is NOT in this repo. The STT (transcript + metadata) and visual feature extraction run on the Whissle gateway Docker (whissleasr/whissle-gateway, port 9000). This repo only calls it over HTTP and parses the result. You must have the gateway running and a wh_ token to do the real extraction. See docs/GATEWAY.md for how to run it, the full request/response contract, and troubleshooting. Only the audio-prosody lane runs locally here (needs ffmpeg).

Dataset: Real-life Trial Deception Detection (Pérez-Rosas et al., 2015, Univ. of Michigan): 121 clips (61 deceptive / 60 truthful) from real trials.

🎓 Taking this forward? Start with docs/NEXT_STEPS.md: current status, the immediate to-do (real gateway pass), and research ideas.


Why this is harder than it looks (and how we handle it)

The 121 clips come from only ~33 unique speakers: one defendant (Jodi Arias) accounts for 32 clips, and 7 speakers appear in both classes. A random train/test split lets a model memorise who is speaking instead of whether they are lying, producing inflated, meaningless accuracy.

We evaluate with Leave-One-Speaker-Out (LOSO) cross-validation: every clip from a given person is held out together. Speaker identity is parsed from the dataset README and used as the CV grouping key. This is the only honest estimate of generalisation to an unseen person, and it is the headline methodology of this project.
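The grouping logic can be sketched in a few lines. This is a minimal illustration, not the repo's implementation in modeling/train.py: the function name and the toy speaker IDs are assumptions. scikit-learn's `LeaveOneGroupOut` provides the same split when passed speaker IDs as `groups`.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def leave_one_speaker_out(
    speakers: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_speaker, train_idx, test_idx), one fold per speaker."""
    by_speaker: Dict[str, List[int]] = defaultdict(list)
    for idx, speaker in enumerate(speakers):
        by_speaker[speaker].append(idx)
    for speaker in sorted(by_speaker):
        test_idx = by_speaker[speaker]
        held_out = set(test_idx)
        train_idx = [i for i in range(len(speakers)) if i not in held_out]
        yield speaker, train_idx, test_idx


# Toy manifest: these speaker IDs are made up for illustration.
speakers = ["spk_a", "spk_a", "spk_b", "spk_c", "spk_c", "spk_c"]
folds = list(leave_one_speaker_out(speakers))
for speaker, train_idx, test_idx in folds:
    # No speaker ever appears on both sides of a fold.
    assert not {speakers[i] for i in train_idx} & {speakers[i] for i in test_idx}
    print(speaker, "train:", train_idx, "test:", test_idx)
```

Each fold's test set is one person's complete set of clips, so a model cannot score well by recognising a voice or face it saw during training.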


Architecture

                    Real-life trial clip (.mp4)
                              │
        ┌─────────────────────┼───────────────────────┐
        ▼                     ▼                       ▼
 gateway /asr/transcribe  gateway /video/analyze  ffmpeg → 16k wav
 (wav: transcript +       (mp4: visual_timeline)      │
  metadata + pauses +          │                      ▼
  word conf + probs)           │               prosody (librosa)
        │                      │                      │
        ▼                      ▼                      ▼
   text_features          visual_features        audio_features
   (lexical + STT          (gaze/pose/emotion/    (F0/jitter/pauses/
    metadata probs)         blink/gestures)        voice quality)
        └──────────┬───────────┴───────────┬──────────┘
                   ▼                        ▼
             multimodal feature matrix (one row / clip)
                   │
                   ▼
   Leave-One-Speaker-Out CV  →  LogReg / SVM / RandomForest / HistGBM
                   │
                   ▼
   metrics (acc / balanced-acc / AUC / F1) + per-modality ablations
   + permutation feature importance  →  best_model.joblib

Step 02 makes two gateway calls per clip: /asr/transcribe for the rich text + metadata lane and /video/analyze for the visual timeline. (The video endpoint also runs ASR internally, but its fuser only forwards a segments field this model doesn't emit, so the metadata would be lost; hence the dedicated /asr/transcribe call.) See docs/GATEWAY.md.


Setup

Prerequisites: Python 3.10+, ffmpeg on PATH (brew install ffmpeg / apt install ffmpeg), and access to a Whissle gateway (the local docker whissle-gateway on :9000, or https://api.whissle.ai).

cd lie_detection_binary
./setup.sh                      # creates .venv, installs deps, installs this package
source .venv/bin/activate

cp .env.example .env            # then edit .env:
#   WHISSLE_API_TOKEN=wh_...    (required for the gateway STT + visual step)
#   WHISSLE_GATEWAY_URL=http://localhost:9000
#   DECEPTION_DATASET_DIR=/path/to/Real-life_Deception_Detection_2016

The gateway requires Authorization: Bearer wh_.... Create a token at https://lulu.whissle.ai/access.
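As a rough sketch of how the `.env` settings turn into an authenticated request, the snippet below builds (but does not send) a POST with the Bearer header. The helper name, the placeholder token, and the placeholder content type are assumptions. The real body encoding and field names are defined in docs/GATEWAY.md and are not reproduced here.

```python
import os
import urllib.request


def build_gateway_request(path: str, body: bytes, content_type: str) -> urllib.request.Request:
    """Build, without sending, an authenticated POST to the gateway."""
    base_url = os.environ.get("WHISSLE_GATEWAY_URL", "http://localhost:9000").rstrip("/")
    token = os.environ.get("WHISSLE_API_TOKEN", "")
    if not token.startswith("wh_"):
        raise ValueError("WHISSLE_API_TOKEN must be set to a wh_ token")
    return urllib.request.Request(
        url=f"{base_url}/{path.lstrip('/')}",
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {token}", "Content-Type": content_type},
    )


# Placeholder values for illustration only; a real run reads them from .env.
os.environ["WHISSLE_API_TOKEN"] = "wh_example"
os.environ["WHISSLE_GATEWAY_URL"] = "http://localhost:9000"
req = build_gateway_request("/video/analyze", b"", "application/octet-stream")
print(req.get_method(), req.full_url)
```

Failing fast on a missing or malformed token gives a clearer error than an HTTP 401 halfway through a batch of clips.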


Usage

Run the whole pipeline:

python scripts/run_all.py                 # real: gateway STT + audio-visual + prosody
python scripts/run_all.py --limit 5       # quick smoke run on 5 clips
python scripts/run_all.py --bootstrap     # offline: bundled transcripts (text+audio only)

…or step by step:

python scripts/01_build_manifest.py       # clips → labels + speaker groups  (no token)
python scripts/02_extract_av.py           # gateway /video/analyze → STT + visual  (token)
python scripts/03_extract_audio.py        # librosa prosody                  (no token)
python scripts/04_build_features.py       # assemble feature matrix          (no token)
python scripts/05_train.py                # LOSO CV, ablations, importance    (no token)
python scripts/06_paper_comparison.py     # paper protocol vs ours + manual gestures (no token)

Each extraction step is resumable (skips clips already done; --overwrite to force) and accepts --limit N for quick tests.
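The skip-if-cached pattern behind resumable steps might look like the sketch below. It is an assumption-laden illustration, not the repo's code: `run_resumable` and `fake_extract` are hypothetical names, and here `limit` truncates the clip list before the skip check, which may differ from how the real scripts interpret `--limit N`.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def run_resumable(
    clip_ids: Iterable[str],
    out_dir: Path,
    extract: Callable[[str], dict],
    overwrite: bool = False,
    limit: Optional[int] = None,
) -> List[str]:
    """Run `extract` per clip, skipping clips whose JSON output already exists."""
    out_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for clip_id in list(clip_ids)[:limit]:
        out_path = out_dir / f"{clip_id}.json"
        if out_path.exists() and not overwrite:
            continue  # already done on a previous run
        out_path.write_text(json.dumps(extract(clip_id)))
        processed.append(clip_id)
    return processed


def fake_extract(clip_id: str) -> dict:
    """Stand-in for a real extractor such as a gateway call."""
    return {"clip": clip_id}


with tempfile.TemporaryDirectory() as tmp:
    clips = ["clip_001", "clip_002", "clip_003"]
    first = run_resumable(clips, Path(tmp), fake_extract, limit=2)
    second = run_resumable(clips, Path(tmp), fake_extract)
    forced = run_resumable(clips, Path(tmp), fake_extract, overwrite=True)
print(first, second, forced)
```

The second call only processes the clip the smoke run left out, and the third reprocesses everything, which mirrors the `--overwrite` behaviour described above.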

Bootstrap mode (no token yet)

--bootstrap builds text-only records from the dataset's bundled transcripts so you can exercise the text + audio pipeline immediately. Swap in your WHISSLE_API_TOKEN and rerun 02_extract_av.py --overwrite to get the real metadata-rich transcripts and the visual lane.


Outputs

data/
  manifest.csv                 clip → label, speaker, role
  wav/<clip>.wav               16 kHz mono audio (for prosody)
  av/<clip>.json               fused gateway response (transcript + segments + visual_timeline)
  audio/<clip>.json            prosody features
  features/features.parquet    the multimodal feature matrix (+ .csv)
  reports/cv_results.csv        model × modality → LOSO metrics
  reports/feature_importance.csv
  reports/paper_comparison.csv  video-out vs speaker-out, incl. manual gestures
  reports/summary.json
  models/best_model.joblib      refit best pipeline + metadata

05_train.py prints a table like (real run, 169 features, LOSO CV):

            model    modality  n_features  accuracy  balanced_accuracy  roc_auc     f1
          svm_rbf        text         102     0.570              0.571    0.655  0.527
          svm_rbf  text+audio         124     0.603              0.604    0.650  0.586
         hist_gbm         all         169     0.603              0.604    0.615  0.556
    random_forest      visual          45     0.562              0.563    0.616  0.531
majority_baseline           -           0     0.504              0.500    0.500  0.671

Honest, speaker-independent numbers land around AUC 0.62–0.66 / accuracy ~0.60: clearly above the 0.50 base rate but far from "solved" (and lower than papers that leak speaker identity via random splits). The Whissle STT metadata probability features (behavior/age/emotion distributions) and a few psycholinguistic rates (third-person, negation, neg-emotion) carry most of the signal; the visual lane adds a modest independent ~0.6 AUC on its own.

⚠️ Confound: the model's audio gender read correlates with the label (corr ≈ −0.35) because the deceptive set is dominated by a few female speakers (Jodi Arias, Amanda Hayes, Crystal Mangum). So meta_gender_* / meta_age_* partly encode demographics, not deception. See docs/NEXT_STEPS.md: re-run with demographics dropped to measure the genuine signal.
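A demographics-dropped re-run starts by filtering the feature columns. The sketch below shows one way to do that, assuming the prefixes named in this README. The helper name and the example column names are hypothetical, and the actual columns in features.parquet may differ.

```python
from typing import List, Sequence

# Prefixes follow the feature families named in this README; the exact
# column names in features.parquet are an assumption.
DEMOGRAPHIC_PREFIXES = ("meta_gender_", "meta_age_", "metaprob_gender", "metaprob_age")


def drop_demographic_columns(columns: Sequence[str]) -> List[str]:
    """Keep only columns that do not encode age or gender."""
    return [c for c in columns if not c.startswith(DEMOGRAPHIC_PREFIXES)]


# Hypothetical column names for illustration.
columns = [
    "txt_negation_rate",
    "meta_gender_female",
    "meta_age_30_45",
    "metaprob_gender_male",
    "metaprob_emotion_fear",
    "aud_f0_mean",
]
kept = drop_demographic_columns(columns)
print(kept)
```

Retraining on the kept columns under the same LOSO folds and comparing AUC against the full feature set indicates how much of the score came from demographics.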


Results & comparison (Pérez-Rosas et al., 2015 + LLM baselines)

All numbers below use the whissle-large ASR model (transcript + emotion/age/gender/intent probabilities). The paper reports up to 75.2% accuracy; our honest, speaker-independent headline is lower, and that gap is the CV protocol, not a modelling flaw. 06_paper_comparison.py runs every feature set under both protocols (pooled out-of-fold accuracy):

| feature set | model | leave-1-video-out (paper) | leave-1-speaker-out (honest) | leakage gap |
| --- | --- | --- | --- | --- |
| our_text (auto) | RandomForest | 0.752 | 0.587 | +0.165 |
| our_visual (auto) | RandomForest | 0.719 | 0.612 | +0.107 |
| our_audio (auto) | DecisionTree | 0.719 | 0.570 | +0.149 |
| our_all (text+audio+visual) | RandomForest | 0.752 | 0.529 | +0.223 |
| manual_gestures (paper's CSV) | RandomForest | 0.769 | 0.686 | +0.083 |
| gemini_features (LLM video scores) | RandomForest | 0.694 | 0.678 | +0.017 |
| gemini+our_all | RandomForest | 0.777 | 0.620 | +0.157 |
| majority baseline | - | 0.504 | 0.504 | - |

LLM zero-shot baselines (Gemini 2.5 Pro, no training → no CV, no leakage):

| approach | accuracy | balanced acc | AUC | deceptive-call rate |
| --- | --- | --- | --- | --- |
| Gemini direct VIDEO | 0.669 | 0.669 | 0.749 | 55% (calibrated) |
| Gemini over-features (v1 forensic prompt) | 0.554 | 0.550 | 0.631 | 93% (biased) |
| Gemini over-features (v2 neutral prompt) | 0.512 | 0.511 | 0.516 | 73% |
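To see how accuracy, balanced accuracy, and deceptive-call rate relate, the sketch below computes all three from confusion counts. The counts are an assumption: they are reconstructed to be consistent with the v2 row and with the 17 of 60 truthful figure reported in this README, and are not taken from the repo's outputs.

```python
from typing import Dict


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Metrics with 'deceptive' as the positive class."""
    total = tp + fn + tn + fp
    recall_pos = tp / (tp + fn)  # deceptive clips called deceptive
    recall_neg = tn / (tn + fp)  # truthful clips called truthful
    return {
        "accuracy": (tp + tn) / total,
        "balanced_accuracy": (recall_pos + recall_neg) / 2,
        "deceptive_call_rate": (tp + fp) / total,
    }


# Reconstructed counts (61 deceptive / 60 truthful clips), for illustration only.
m = binary_metrics(tp=45, fn=16, tn=17, fp=43)
print({name: round(value, 3) for name, value in m.items()})
```

With these counts the function gives accuracy 0.512, balanced accuracy 0.511, and a deceptive-call rate of about 73%. A classifier that calls most clips deceptive catches most deceptive clips but sits near chance once both classes are weighted equally.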

Our trained models under leave-one-speaker-out (step 05, full sweep) peak at AUC 0.670 (whissle-large, up from 0.655 on the small ASR model).

Takeaways:

  1. The paper's protocol leaks speaker identity. Leave-one-video-out keeps 31 of Jodi Arias's 32 clips in training when testing the 32nd, so the model learns the person. The "leakage gap" column is the inflation it buys (+0.02 to +0.22 in the table above). Under the paper's own protocol we match it (our_text/our_all 0.752) and beat it when we add the LLM's video reads (gemini+our_all 0.777).
  2. Best honest result = Gemini watching the raw video (zero-shot AUC 0.749, balanced 0.669, no training, no leakage). It's well-calibrated (55% deceptive calls vs the 50% base rate).
  3. Reasoning over our feature digest fails (AUC 0.52–0.63, chance-level) even though the same features train to AUC 0.67. Summarising the clip into a list of cues both loses information the video carries and primes the LLM toward "deceptive." A neutral, base-rate-anchored prompt (v2) cuts the bias (deceptive-calls 93%→73%, truthful 7→17/60) but can't manufacture signal the digest doesn't hold. Watching beats reading our digest, and a trained model beats the LLM at reading it.
  4. Manual gold gestures generalise best of the feature sets (speaker-out 0.686); our automatic MediaPipe visual lane is noisier (0.612). Gemini's video-derived feature scores are a close second (0.678) and the most leakage-robust (gap +0.02).
  5. whissle-large helped: better transcripts + intent lifted the trained models (AUC 0.655→0.670), but did not rescue the feature-digest LLM (still chance).
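The leakage in takeaway 1 can be made concrete by counting, for each held-out clip, how many clips from the same speaker remain in training. This is an illustrative sketch: the function name is hypothetical, and only the 32-clip speaker reflects the README. The other speakers are placeholders.

```python
from collections import Counter
from typing import List


def same_speaker_clips_in_training(speakers: List[str], protocol: str) -> List[int]:
    """For each held-out clip, count training clips from the same speaker."""
    counts = Counter(speakers)
    if protocol == "leave_one_video_out":
        # Only the test clip is removed; the speaker's other clips stay in training.
        return [counts[s] - 1 for s in speakers]
    if protocol == "leave_one_speaker_out":
        # Every clip from the test speaker is removed together.
        return [0] * len(speakers)
    raise ValueError(f"unknown protocol: {protocol}")


# 32 clips for one speaker matches the README; the rest are placeholders.
speakers = ["jodi_arias"] * 32 + ["other_a"] * 3 + ["other_b"]
lovo = same_speaker_clips_in_training(speakers, "leave_one_video_out")
loso = same_speaker_clips_in_training(speakers, "leave_one_speaker_out")
leaky_share = sum(n > 0 for n in lovo) / len(speakers)
print(lovo[0], loso[0], round(leaky_share, 3))
```

Under leave-one-video-out, every test clip from the 32-clip speaker is scored by a model that trained on 31 clips of the same person. Under leave-one-speaker-out that count is zero by construction.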

Bottom line: 75% on this dataset is a leave-one-video-out (speaker-leaky) number. The honest, speaker-independent ceiling here is ~0.65–0.69 accuracy / ~0.67–0.75 AUC, and the single best honest result is Gemini reading the raw video (AUC 0.749), not any feature-engineered pipeline.

Best systems: with vs. without the LLM (09_best_fusion.py)

Concatenating all 178 features hurts (gemini_features alone beats gemini+our_all): 121 clips can't support 178 dims. Feature selection + late fusion fix it. Two deployable configurations, both leave-one-speaker-out (honest):

| config | best method | accuracy | AUC |
| --- | --- | --- | --- |
| A: with Gemini | late-fusion: our model ⊕ Gemini's video prob | 0.678 | 0.752 |
| B: self-hosted, no LLM | hist_gbm on all our features | 0.678 | 0.741 |

Both are now near or above Gemini-video (0.749). Two feature improvements got us here: (1) the text lane's speech_analysis (fluency/grammar/pitch/rhythm) + filtered deception-intents (intent_labels = DENIAL, CONFESSION, JUSTIFICATION, AVOIDANCE, CONTRADICTION, …); and (2) lowering the gateway's face-detection confidence (face-detect rate 0.50→0.80, see docs/GATEWAY.md), which lifted the visual lane 0.61→0.674 and the self-hosted system 0.670→0.741.

The striking result: the fully self-hosted, no-LLM, no-raw-media-leaves system reaches AUC 0.741 / acc 0.678, competitive with Gemini watching the raw video (0.749), and adding Gemini on top gives only a marginal lift to 0.752. For a privacy-sensitive deployment, the self-hosted pipeline is now the better trade.

  • Config A matches Gemini-video-alone (0.747) but as a trained, calibratable classifier. Its top features are Gemini's holistic reads (defensiveness, overall_credibility, story_specificity, microexpression_leakage) plus our head-pitch, vocal F0, and negation rate.
  • Config B sends no raw audio/video to any external LLM: features are extracted only by the (self-hostable) Whissle gateway + local prosody, and a trained model predicts. Honest AUC 0.670 / accuracy ~0.645 (the naive 0.562 was a 0.5-threshold artifact; a weighted per-modality late-fusion is calibrated to 0.645 out of the box; see 10_improve_selfhosted.py). Top cues: head-pitch (looking down), vocal pitch, negations, fear expression. AUC is capped ~0.67 by the features: ensembling/stacking/late-fusion can't beat it; only better features (visual face-detection, temporal cues) would.
  • Naive concat: AUC 0.640 → SelectKBest(k=10): 0.747. The lesson is selection, not concatenation.

So we can show both: a stronger result with Gemini (AUC 0.747 / acc 0.686), and a respectable fully self-hosted result without any LLM and without raw media leaving the box (AUC 0.670).
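Late fusion, as used in both configurations, combines per-modality probabilities instead of concatenating raw features. The sketch below shows a plain weighted average. It is an assumption about the general technique, not the logic of 09_best_fusion.py or 10_improve_selfhosted.py, and the probabilities and weights are made up.

```python
from typing import Dict, List


def late_fusion(
    probs_by_modality: Dict[str, List[float]], weights: Dict[str, float]
) -> List[float]:
    """Weighted average of per-modality P(deceptive), one value per clip."""
    total = sum(weights[name] for name in probs_by_modality)
    n_clips = len(next(iter(probs_by_modality.values())))
    return [
        sum(weights[name] * probs[i] for name, probs in probs_by_modality.items()) / total
        for i in range(n_clips)
    ]


# Made-up out-of-fold probabilities for two clips, and made-up weights.
probs = {"text": [0.8, 0.3], "audio": [0.6, 0.5], "visual": [0.4, 0.1]}
weights = {"text": 2.0, "audio": 1.0, "visual": 1.0}
fused = late_fusion(probs, weights)
print([round(p, 3) for p in fused])
```

For an honest estimate, both the per-modality probabilities and the weights would need to come from training folds only. Tuning weights on the held-out speaker would reintroduce leakage.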

Feature reference

Text (txt_*), two groups:

  • Psycholinguistic markers (Newman & Pennebaker; Vrij): first-person-singular vs. plural pronoun rates, negations, tentative/certainty/cognitive/exclusive/motion word rates, negative−positive emotion, type-token ratio, disfluency.
  • Whissle STT metadata from /asr/transcribe: speech rate (WPM, articulation rate, filler/pause ratios), pause statistics (count, mean/max duration, long-pause fraction), per-word confidence + filler rates, overall ASR confidence, uncertain-word rate, entity count, and the full per-token probability distributions for every metadata category (metaprob_<cat>_<tok> for emotion / age / gender / behavior / eval / role) plus each category's entropy and an expected-age scalar, i.e. the model's soft read, not just the top-1 label.
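Both text groups reduce to simple per-clip statistics. The sketch below computes a few lexical rates and the entropy of one metadata distribution. The word lists are tiny illustrative stand-ins for a real lexicon, and the feature names are hypothetical, not the ones text_features.py emits.

```python
import math
import re
from typing import Dict

# Tiny illustrative word lists; a real lexicon would be much larger.
FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine", "myself"}
NEGATIONS = {"no", "not", "never", "none", "nothing"}


def lexical_rates(transcript: str) -> Dict[str, float]:
    """Per-token rates for a few psycholinguistic markers."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    n = len(tokens) or 1  # avoid division by zero on empty transcripts
    return {
        "txt_fps_rate": sum(t in FIRST_PERSON_SINGULAR for t in tokens) / n,
        "txt_negation_rate": sum(t in NEGATIONS or t.endswith("n't") for t in tokens) / n,
        "txt_type_token_ratio": len(set(tokens)) / n,
    }


def entropy_bits(probs: Dict[str, float]) -> float:
    """Shannon entropy (bits) of one metadata category's distribution."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


rates = lexical_rates("I did not see him. I never went there.")
emotion_entropy = entropy_bits({"neutral": 0.5, "fear": 0.5})
print(rates, emotion_entropy)
```

Entropy captures how undecided the model is about a category: a confident top-1 label gives a value near zero, while an even split across two labels gives one bit.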

Visual (vis_*): aggregated over sampled frames where the speaker's face is detected: emotion fractions + intensities + entropy, gaze aversion, head-pose mean/spread and frame-to-frame motion (fidgeting), blink rate, attention (engaged) fraction, mouth-openness, speaking fraction, hand-gesture presence/diversity, and face_detect_rate for coverage.
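Aggregating a per-frame timeline into clip-level features could look like the sketch below. The frame keys (`face_detected`, `head_pitch`, `gaze_averted`) and the output names are assumptions for illustration; the gateway's actual visual_timeline schema is documented in docs/GATEWAY.md.

```python
from statistics import mean, pstdev
from typing import Any, Dict, List


def aggregate_frames(frames: List[Dict[str, Any]]) -> Dict[str, float]:
    """Collapse per-frame records into clip-level visual features."""
    faces = [f for f in frames if f.get("face_detected")]
    feats = {"vis_face_detect_rate": len(faces) / len(frames) if frames else 0.0}
    if not faces:
        return feats
    pitch = [f["head_pitch"] for f in faces]
    feats["vis_head_pitch_mean"] = mean(pitch)
    feats["vis_head_pitch_std"] = pstdev(pitch)
    # Mean absolute change between consecutive detected frames: a crude motion proxy.
    deltas = [abs(b - a) for a, b in zip(pitch, pitch[1:])]
    feats["vis_head_pitch_motion"] = mean(deltas) if deltas else 0.0
    feats["vis_gaze_averted_frac"] = mean(1.0 if f["gaze_averted"] else 0.0 for f in faces)
    return feats


# Hypothetical frames; the second one has no detected face.
frames = [
    {"face_detected": True, "head_pitch": -10.0, "gaze_averted": True},
    {"face_detected": False},
    {"face_detected": True, "head_pitch": -4.0, "gaze_averted": False},
    {"face_detected": True, "head_pitch": -7.0, "gaze_averted": True},
]
feats = aggregate_frames(frames)
print(feats)
```

Reporting the face-detect rate alongside the other features lets a downstream model, or a reader, discount clips where the aggregates rest on only a handful of frames.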

Audio (aud_*): F0 mean/std/range/voiced-fraction + jitter proxy, RMS loudness + shimmer proxy, silence ratio / pause count / mean pause length / pause density, ZCR and spectral centroid/bandwidth/rolloff.
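The pause-related features can be derived from a frame-level loudness envelope, such as the one `librosa.feature.rms` returns. The sketch below works on a plain list so it stays dependency-free. The threshold, minimum pause length, and output names are assumptions, not the values used in audio_prosody.py.

```python
from typing import Dict, List


def pause_stats(
    rms: List[float],
    frame_sec: float,
    silence_thresh: float,
    min_pause_sec: float = 0.2,
) -> Dict[str, float]:
    """Silence and pause features from a frame-level RMS envelope."""
    silent = [value < silence_thresh for value in rms]
    pauses: List[float] = []
    run = 0
    for is_silent in silent + [False]:  # sentinel flushes a trailing silent run
        if is_silent:
            run += 1
        else:
            if run * frame_sec >= min_pause_sec:
                pauses.append(run * frame_sec)
            run = 0
    duration = len(rms) * frame_sec
    return {
        "aud_silence_ratio": sum(silent) / len(rms) if rms else 0.0,
        "aud_pause_count": float(len(pauses)),
        "aud_mean_pause_sec": sum(pauses) / len(pauses) if pauses else 0.0,
        "aud_pause_density": len(pauses) / duration if duration else 0.0,
    }


# Toy envelope: one 0.3 s silent run and one 0.1 s dip that is too short to count.
rms = [0.5, 0.5, 0.01, 0.01, 0.01, 0.5, 0.01, 0.5, 0.5, 0.5]
stats = pause_stats(rms, frame_sec=0.1, silence_thresh=0.05)
print(stats)
```

Separating the silence ratio from the pause count matters: the short dip contributes to the former but is not counted as a pause.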


Project layout

lie_detector/
  config.py                 env-driven paths + gateway settings
  dataset.py                manifest + speaker parsing from the README
  media.py                  ffmpeg audio extract / probe
  io_utils.py               json + cache helpers
  extraction/
    gateway.py              POST /video/analyze (STT + audio-visual)  ← step 02
    audio_prosody.py        librosa prosody                            ← step 03
  features/
    text_features.py        txt_*   (transcript + STT metadata)
    visual_features.py      vis_*   (visual_timeline aggregation)
    audio_features.py       aud_*   (prosody passthrough + derived)
    assemble.py             join → multimodal matrix
  modeling/
    metrics.py              binary metrics
    train.py                LOSO CV, models, ablations, importance
scripts/                    01…05 + run_all.py
tests/                      smoke tests
docs/
  GATEWAY.md                the external Whissle gateway: how to run it + contract
  NEXT_STEPS.md             handoff: status + research ideas (read this first)

Notes, limitations, and ethics

  • Small, biased sample. 121 clips / ~33 speakers from US trials. Results are a research signal, not a courtroom tool. Expect LOSO accuracy in the ~60–75% range: well above the ~50% base rate, far from "proof".
  • Deception detection is not solved. No model here infers guilt; it predicts a dataset label derived from verdicts/exonerations. Do not deploy this to judge real people. Treat outputs as probabilistic and contestable.
  • Demographic confound. A handful of female defendants dominate the deceptive class, so age/gender metadata correlate with the label. Some apparent "accuracy" is demographics, not deception; audit by dropping meta_*/metaprob_age*/metaprob_gender* and re-checking (see docs/NEXT_STEPS.md).
  • Reproducibility. Fixed seed, deterministic LOSO folds, resumable caches.
  • The bundled Annotation/All_Gestures_*.csv (human-annotated gestures) is a reference baseline from the original paper; we extract our own features and do not train on those labels.

Citation

Pérez-Rosas, Abouelenien, Mihalcea, Burzo. Deception Detection using Real-life Trial Data. ICMI 2015.
