A behavioral psychometrics pipeline. Years of cross-platform messages, an LLM-as-judge, and triangulation across many relationships — used to build a self-model that's harder to lie to than a survey.
Conventional psychometrics — Big Five, MBTI, every workplace assessment — asks you about you. "Are you organized?" The answer reflects how you want to be seen, what mood you're in, what you think the assessor will reward. The instrument and the subject are the same person, and the data inherits the bias.
The opposite of self-report is behavioral: what you actually did, written down, in real life, without an examiner watching. Most of us already produce that data continuously — in messages, emails, threads — for years on end. The hard part isn't collection. It's the rubric.
personalContext is the rubric and the pipeline that runs it.
Big Five is arguably the field's best-validated survey model. It captures much of the variance in personality but loses the texture that actually predicts how someone shows up, for example conflict style, authority response, and temporal orientation. The framework here extends Big Five with five additional dimensions chosen because they are readable in messaging behavior, not just in self-report.
Triangulation is the methodological move that makes the whole thing useful. A single-relationship behavioral analysis captures just one mask: you behave differently with your mother, with your business partner, with a romantic partner, and with a friend you've known since you were eight. A reading from any one of those relationships is a reading of how you show up in that context, not of the underlying person.
Run the same rubric across many counterparts and the structure emerges: traits that appear across many relationship types are higher-confidence reads of the underlying person. Traits that appear in only one are context-specific behaviors — masks, and themselves useful information about who that mask gets worn for.
One relationship's scores look like a confident reading. They aren't — they're the subject's behavior in that specific dynamic.
Many relationships, one rubric. Consensus across batches is the underlying trait. Variance is the mask, and the masks themselves cluster meaningfully.
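The consensus-versus-mask split described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function name `triangulate`, the dimension names, the 0-to-1 score scale, the use of the median as the consensus, and the `mask_gap` threshold are all assumptions made for the example.

```python
from statistics import median, pstdev

def triangulate(batches, mask_gap=0.25):
    """Split per-relationship scores into a consensus and per-relationship masks.

    batches: {relationship: {dimension: score in [0, 1]}} (hypothetical format).
    mask_gap: assumed threshold; a relationship whose score sits further than
    this from the consensus is reported as a context-specific "mask".
    """
    dimensions = sorted({dim for scores in batches.values() for dim in scores})
    report = {}
    for dim in dimensions:
        observed = {rel: s[dim] for rel, s in batches.items() if dim in s}
        # Median rather than mean, so a single strong mask does not drag
        # the consensus toward itself (a design assumption, not a given).
        consensus = median(observed.values())
        masks = {
            rel: round(score - consensus, 3)
            for rel, score in observed.items()
            if abs(score - consensus) > mask_gap
        }
        report[dim] = {
            "consensus": round(consensus, 3),
            "spread": round(pstdev(observed.values()), 3),
            "n_relationships": len(observed),
            "masks": masks,
        }
    return report

# Illustrative, made-up scores for three relationship batches.
batches = {
    "mother": {"conflict_avoidance": 0.8, "directness": 0.2},
    "business_partner": {"conflict_avoidance": 0.7, "directness": 0.9},
    "old_friend": {"conflict_avoidance": 0.75, "directness": 0.8},
}
report = triangulate(batches)
for dim, row in report.items():
    print(dim, row)
```

In this toy data, conflict avoidance is similar across all three relationships and reads as a trait, while directness drops sharply with one counterpart and is flagged as a mask for that relationship only.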
The pipeline ends where conventional psychometrics begins: a self-report instrument. The behavioral profile is used as a target, and the system synthesizes yes / no items — concrete, scenario-anchored, intentionally hard to game — that should reproduce the same dimensional readings if the subject answers them honestly, cold.
This is forward-validation, and it's the most important diagnostic in the whole pipeline. If the synthesized instrument does not reproduce the behavioral profile, the analysis was carrying noise or bias the rubric didn't catch. If it does, you have a portable, transferable read of the same construct — and the underlying behavioral data backs it up with literal evidence quotes.
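One way to picture the forward-validation check is sketched below. Everything specific in it is an assumption made for illustration: the item format (an id, a dimension, and whether "yes" is the keyed answer), scoring a dimension as the fraction of keyed answers, and the `tolerance` used to decide whether the instrument reproduces the behavioral profile.

```python
def score_instrument(items, answers):
    """Score yes/no answers per dimension as the fraction of keyed responses.

    items: list of (item_id, dimension, keyed_yes) tuples (hypothetical format).
    answers: {item_id: bool}; unanswered items are skipped.
    """
    hits, totals = {}, {}
    for item_id, dim, keyed_yes in items:
        if item_id not in answers:
            continue
        totals[dim] = totals.get(dim, 0) + 1
        if answers[item_id] == keyed_yes:
            hits[dim] = hits.get(dim, 0) + 1
    return {dim: hits.get(dim, 0) / n for dim, n in totals.items()}

def forward_validate(behavioral, instrument, tolerance=0.2):
    """Compare instrument scores to the behavioral profile, dimension by dimension."""
    shared = sorted(set(behavioral) & set(instrument))
    gaps = {dim: round(instrument[dim] - behavioral[dim], 3) for dim in shared}
    failed = [dim for dim in shared if abs(gaps[dim]) > tolerance]
    return {"gaps": gaps, "failed": failed, "reproduced": bool(shared) and not failed}

# Illustrative, made-up items and answers.
items = [
    ("q1", "conflict_avoidance", True),
    ("q2", "conflict_avoidance", False),  # reverse-keyed item
    ("q3", "conflict_avoidance", True),
    ("q4", "conflict_avoidance", True),
    ("q5", "directness", True),
    ("q6", "directness", True),
]
answers = {"q1": True, "q2": False, "q3": True, "q4": False, "q5": True, "q6": False}
behavioral = {"conflict_avoidance": 0.75, "directness": 0.8}

instrument_scores = score_instrument(items, answers)
result = forward_validate(behavioral, instrument_scores)
print(result)
```

A dimension listed under `failed` is one where the synthesized items and the behavioral read disagree by more than the tolerance, which is the signal to revisit the rubric or the analysis for that dimension.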
The same pipeline could be pointed outward — toward employees, partners, candidates — and it absolutely should not be. Behavioral psychometrics is asymmetric: the tool that lets you see yourself more honestly is the same tool that, used on someone else, would be surveillance.
How many distinct relationship batches are needed before triangulation stabilizes? Does the LLM-as-judge drift with model versions? Are dimension definitions invariant across the lifespan, or do early-life batches need recalibration?
Big Five (Costa & McCrae) is the conceptual base. LIWC and other lexical-inventory approaches read messages but lose context. LLM-as-judge work in alignment evals is the closest methodological neighbor; this project applies the same approach to behavioral rather than factual material.
Tighter inter-rater agreement testing across model families. A formal split between the "trait" signal and the "mask" signal as separate outputs. Quietly running the synthesized instrument on the subject and comparing the results against the behavioral profile.
Happy to discuss the rubric, the pipeline, or the validation approach with researchers, psychometricians, or anyone working on behavioral-data analysis. The instrument and downstream profile content stay private.