---
title: "Synthetic Personality Signatures of Large Language Models Under Stateless Repeated Administration"
publicationLabel: "Lab Report"
subtitle: "Descriptive psychometric signatures across contemporary language models under stateless repeated administration"
abstract: "We present an exploratory evaluation of contemporary large language models using psychometric inventories administered under a Stateless Independent Context Window Approach (SICWA). Across repeated stateless runs, baseline chat models converged on a shared high-Conscientiousness, high-Agreeableness profile while still showing interpretable family-specific differences in assertiveness, variability, and emotional tone."
authors:
  - "Gordon Olson"
authorProfiles:
  - "Gordon Olson | email=gordon@sonofol.org"
affiliations:
  - "AI Psychometrics Lab"
tags:
  - psychometrics
  - llm
  - personality
  - big-five
journal: "AI Psychometrics Lab Reports"
published: 2026-04-14
revised: 2026-04-18
---

# Abstract

We present an exploratory evaluation of contemporary large language models using psychometric inventories administered under a Stateless Independent Context Window Approach (SICWA). The goal is to characterize stable, model-specific response signatures under repeated stateless prompting. Using the AI Psychometrics Lab platform, each question was presented in a fresh context and repeated multiple times to reduce carryover effects and sample non-deterministic variation. The study dataset comprised 26 total runs across 12 model identifiers, with Big Five data recoverable for 25 runs, direct or derived type outputs available for a substantial subset, and one explicitly persona-manipulated run labeled *Skeptical Scientist*. The primary analysis focused on 21 explicitly labeled Base Model runs, with a sensitivity analysis that added four runs with missing persona metadata but no stated persona, yielding 25 baseline-eligible runs and 24 baseline-eligible Big Five profiles.

Across the primary Base Model sample, the mean Big Five profile was high in Conscientiousness (107.7/120), high in Agreeableness (104.3/120), high-to-moderate in Openness (86.9/120), moderate in Extraversion (81.6/120), and comparatively lower in Neuroticism (44.4/120). The sensitivity analysis produced a nearly identical pattern. Direct OEJTS types were dominated by INTJ labels, whereas the smaller subset of Big-Five-derived types skewed ENFJ, indicating that "type" depends meaningfully on measurement route. Model families were not interchangeable: Gemini Pro and GLM-4.7 were the most conscientious, Grok-4.1-fast was the most assertive and extraverted among repeated families, GPT-5.2 and Grok showed the strongest within-family stability, Nemotron showed the greatest repeated variability, and Mistral-small-creative emerged as the clearest outlier with the highest Extraversion, highest Neuroticism, lowest Agreeableness, and weakest fit to the dominant aligned-assistant cluster. These results support the claim that contemporary chat models express measurable, partly stable, model-specific psychometric response signatures under repeated stateless administration.

We argue that the strongest interpretation is not that language models literally possess human personality, but that they exhibit **synthetic personality signatures**: stable response-style tendencies induced by training, alignment, and deployment tuning. This framing treats the object of analysis as patterned output behavior under standardized conditions rather than as evidence of human-like inner personality. It also aligns with recent work showing that personality measurement in LLMs can be meaningful when reliability and construct validity are taken seriously.

# Introduction

Large language models are increasingly deployed in settings where style is not a cosmetic feature but a functional property. A model used for customer support, education, mental health triage, legal intake, or scientific assistance is judged not only by factual accuracy, but also by whether it presents as cautious or reckless, warm or cold, assertive or deferential, stable or volatile. This paper therefore treats personality-like structure in LLM outputs as an evaluation problem in model behavior and alignment, rather than as a claim about human-like internal personality. The AI Psychometrics Lab was built around this premise: that the "psychological landscape" of models can be mapped by administering standard inventories in a controlled, stateless way. The platform's stated methodology is SICWA, the Stateless Independent Context Window Approach, in which each item is administered in a fresh context window, repeated across multiple samples, and standardized across inventories including Big Five, OEJTS, and DISC. ([AI Psychometrics Lab, "About the Lab"](https://aipsychometricslab.com/about), accessed April 13, 2026).

This question now sits inside a live scientific debate. Recent work has shown that LLM outputs can express personality-like differences, but it has also emphasized that these differences should not be taken seriously until reliability and construct validity are demonstrated. A 2025 *Nature Machine Intelligence* paper proposed a full psychometric framework for evaluating and shaping personality traits in LLMs and found that evidence is strongest for larger, instruction-tuned models. In parallel, the TRAIT benchmark argued that LLMs can display distinct and consistent personality patterns while also showing that measurement design matters and that scenario-based instruments can outperform naive self-report use  [@serapio2025; @lee2025].

The present paper is intentionally narrower than those broader validation efforts. Its goal is not to claim full construct validation of model personality. Rather, it establishes a transparent descriptive result from AI Psychometrics Lab: under repeated stateless administration, baseline contemporary language models converge toward a shared psychometric center of gravity while still showing interpretable, family-specific deviations. The focus here is therefore on **baseline psychometric signatures**, rather than on persona-induction or personality-shaping experiments.

We chose the Big Five as the primary inferential backbone because it has the strongest psychometric basis in this dataset. The AI Psychometrics Lab uses the IPIP-NEO-120, a public-domain 120-item instrument that measures five broad domains and 30 four-item facets. Prior work has shown acceptable facet reliability and strong structural correspondence with the broader Five-Factor Model framework  [@johnson2014; @kajonius2019]. ([IPIP-NEO Item and Scoring Materials](https://ipip.ori.org/30FacetNEO-PI-RItems.htm)).

We treated OEJTS and DISC as supplementary layers rather than primary endpoints. OEJTS is an open-source Jungian type instrument developed as an alternative to MBTI-style measures, but it remains downstream of a typological tradition with more contested psychometric standing than the Big Five. DISC remains practically useful as a behavioral-style language, but here it is better framed as an interpretive overlay than as the central scientific anchor  [@jorgenson2015]. ([Open Psychometrics: Development of the Open Extended Jungian Type Scales](https://openpsychometrics.org/tests/OJTS/development/)). ([Everything DiSC Research Report](https://www.everythingdisc.com/EverythingDiSC/media/SiteFiles/Assets/History/Everything-DiSC-Research-Report.pdf)).

Our core hypothesis was straightforward: modern aligned chat models would show convergence toward a prosocial, structured, emotionally regulated response profile, but model families would still differ in stable ways along dimensions such as assertiveness, dutifulness, volatility, and interpersonal warmth.

# Methods

## Platform and administration protocol

All runs analyzed here came from the AI Psychometrics Lab explorer and run export. The platform describes its protocol as SICWA, with each question administered in a fresh context window to reduce contamination from prior items, and with repeated iterations used to estimate model variability under non-deterministic sampling. The raw export confirms this repeated-item structure: item responses are stored as arrays of five samples per item, and resulting trait scores frequently appear in 0.2-point increments, consistent with five-sample averaging at the item level. This study analyzed language model outputs and did not involve human participants.
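The 0.2-point granularity follows directly from five-sample averaging on a 1--5 item scale: the mean of five integer responses always lands on a fifth-point grid. A minimal check of that arithmetic, independent of any platform code:

```python
from fractions import Fraction
from itertools import product

# Every possible mean of five integer responses on a 1-5 Likert-style scale.
possible_means = sorted({Fraction(sum(s), 5) for s in product(range(1, 6), repeat=5)})

# The means form a 0.2-point grid from 1.0 to 5.0 (21 distinct values),
# matching the granularity observed in the exported trait scores.
assert possible_means[0] == 1 and possible_means[-1] == 5
assert all(b - a == Fraction(1, 5) for a, b in zip(possible_means, possible_means[1:]))
```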

## Dataset

The run export contained 26 runs spanning 12 model identifiers, with timestamps from December 9, 2025 through March 27, 2026. Of those 26 runs, 21 were explicitly labeled `Base Model`, 1 was explicitly labeled `Skeptical Scientist`, and 4 had no persona recorded in the export. Because their persona could not be confirmed, those four runs were excluded from the primary analysis and used only in a sensitivity analysis of baseline-eligible behavior. One run contained Dark Triad output only, with no recoverable Big Five data, leaving 25 scorable Big Five profiles overall.

The primary analytic sample therefore consisted of 21 explicitly labeled Base Model runs, 20 of which had recoverable Big Five scores. A secondary sensitivity sample consisted of 25 baseline-eligible runs, defined as the 21 Base Model runs plus the 4 runs with missing persona metadata, 24 of which had recoverable Big Five scores. This distinction was important because the study is intended to characterize default model signatures rather than persona-steered profiles.

## Instruments

The platform exposes four inventories relevant to this dataset: Big Five using the IPIP-NEO-120, MBTI-style typing using OEJTS 1.2, DISC, and Dark Triad. The platform's public interface explicitly lists these inventories in its explorer configuration.

The primary endpoint in this study was the IPIP-NEO-120. This instrument comprises 120 items answered on a five-point scale, with each of the five domains ranging from 24 to 120 and each of the 30 facet scales ranging from 4 to 20. The official Johnson key specifies which items are positively keyed and which are reverse-keyed  [@johnson2014].

OEJTS 1.2 was used as a supplementary type measure. The instrument was developed as an open-source alternative to MBTI-style classification and was built by empirically selecting items that differentiated among self-identified psychological types. We interpret OEJTS results descriptively rather than as primary inferential evidence  [@jorgenson2015].

DISC was also treated as supplementary. It was useful for translating trait structure into behavioral-style language such as Compliance, Dominance, Influence, and Steadiness, but not as the main psychometric backbone of the paper.

## Scoring

For runs with parseable full JSON, stored trait scores were used for validation. For truncated rows, Big Five scores were reconstructed directly from the raw item arrays embedded in the export. Reconstruction followed the official IPIP-NEO-120 key: each item score was averaged across its five repeated responses, reverse-keyed where appropriate, then summed into six four-item facet totals per domain, which in turn sum to the 24-item domain totals. We validated this reconstruction against fully parseable rows and obtained exact numerical agreement with the stored Big Five scores [@johnson2014].
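The reconstruction rule can be sketched in a few lines. This is an illustrative sketch, not the platform's code: the item indices and keying signs in `facet_key` below are hypothetical, whereas the real assignments come from Johnson's (2014) IPIP-NEO-120 key.

```python
from fractions import Fraction

def score_item(samples, sign):
    """Average an item's five repeated responses; reverse-key '-' items as 6 - x."""
    m = Fraction(sum(samples), len(samples))
    return 6 - m if sign == "-" else m

def score_facet(item_samples, facet_key):
    """Sum four keyed item scores into a facet total (range 4-20)."""
    return sum(score_item(item_samples[i], s) for i, s in facet_key)

def score_domain(item_samples, facet_keys):
    """Sum six facet totals into a domain total (range 24-120)."""
    return sum(score_facet(item_samples, fk) for fk in facet_keys)

# Toy facet: four items, two reverse-keyed (indices and signs are hypothetical).
facet_key = [(0, "+"), (1, "-"), (2, "+"), (3, "-")]
item_samples = {
    0: [4, 4, 5, 4, 4],  # mean 4.2
    1: [2, 2, 1, 2, 2],  # mean 1.8, reversed to 4.2
    2: [5, 5, 5, 5, 5],  # mean 5.0
    3: [1, 1, 1, 1, 1],  # mean 1.0, reversed to 5.0
}
assert score_facet(item_samples, facet_key) == Fraction(92, 5)  # 18.4
```

Working in exact fractions avoids floating-point drift, which is what makes an exact-agreement check against stored scores meaningful.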

Direct OEJTS type labels were extracted when present in the exported results. In a smaller subset of runs, the export also included a derived type label based on Big Five projections. We treated these as separate outputs rather than interchangeable measures.

## Analysis plan

Because the dataset was small and unbalanced across models, the analysis was descriptive rather than inferential. We computed domain means, model-level means, within-family standard deviations for repeated models, and type-label frequencies. The primary analysis used only explicitly labeled Base Model runs. A sensitivity analysis added the four runs with missing persona metadata. No claims of formal construct validity, test--retest reliability coefficients, or population generalizability are made in this paper.
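The descriptive summaries reduce to a few lines of arithmetic. The sketch below uses two hypothetical run rows, not the actual export, to illustrate the two statistics reported in the Results: per-domain means and a family's mean within-family standard deviation.

```python
import statistics as st

DOMAINS = ["O", "C", "E", "A", "N"]

# Hypothetical run-level rows for one repeated model family (not real data).
runs = [
    {"O": 86.0, "C": 109.0, "E": 79.0, "A": 109.0, "N": 30.0},
    {"O": 86.8, "C": 109.4, "E": 79.4, "A": 110.0, "N": 30.2},
]

def domain_means(rows):
    """Mean Big Five score per domain across a set of runs."""
    return {d: st.mean(r[d] for r in rows) for d in DOMAINS}

def within_family_spread(rows):
    """Mean of the per-domain sample standard deviations across repeated runs."""
    return st.mean(st.stdev(r[d] for r in rows) for d in DOMAINS)
```

Because only descriptive statistics are reported, no inferential machinery (tests, confidence intervals) appears anywhere in the pipeline.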

# Results

## Sample characteristics

The primary Base Model sample comprised 20 Big Five-scored runs across 12 model identifiers. The largest repeated family was Nemotron with 6 baseline runs. Gemini Pro, Gemini Flash, and GPT-5.2 each contributed 2 baseline runs, while most other families contributed 1 explicit baseline run. The sensitivity sample expanded to 24 baseline-eligible Big Five runs by adding 4 runs with missing persona metadata, which notably increased repeated coverage for Claude Opus 4.5 and Grok-4.1-fast.

Direct OEJTS type labels were available for 15 runs in the primary sample and 19 in the sensitivity sample. Big-Five-derived type labels were available for 5 baseline runs. DISC scores were available for 25 of the 26 total runs. These availability differences reflect the way runs were configured and exported rather than a uniform missing-data mechanism.

## Overall Big Five profile

The primary Base Model sample showed a clear center of gravity. Mean scores were 86.9 for Openness, 107.7 for Conscientiousness, 81.6 for Extraversion, 104.3 for Agreeableness, and 44.4 for Neuroticism. In the baseline-eligible sensitivity analysis, the means were nearly unchanged: 87.6 for Openness, 108.1 for Conscientiousness, 82.8 for Extraversion, 104.0 for Agreeableness, and 43.3 for Neuroticism. The dominant pattern was therefore robust to the inclusion of the four unspecified-persona runs.

This means the average baseline model in this dataset is best described as **highly structured, highly cooperative, moderately curious, moderately social, and comparatively low in negative affectivity**. In practical terms, it resembles an aligned assistant more than an unfiltered conversational mimic. The result is striking because it appears across multiple vendors and model families rather than within a single-provider cluster.

| Sample | O | C | E | A | N |
| --- | ---: | ---: | ---: | ---: | ---: |
| Primary Base Model | 86.9 | 107.7 | 81.6 | 104.3 | 44.4 |
| Baseline-eligible sensitivity | 87.6 | 108.1 | 82.8 | 104.0 | 43.3 |
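Because the 24--120 domain scale is not intuitive on its own, a percent-of-maximum-possible (POMP) rescaling can help read the table. This transform is our illustrative aid for interpretation, not part of the platform's scoring:

```python
def pomp(score, lo=24, hi=120):
    """Rescale a 24-120 IPIP-NEO-120 domain score to percent of maximum possible."""
    return 100 * (score - lo) / (hi - lo)

# Primary-sample means from the table above, rescaled: Conscientiousness 107.7
# sits near 87% of the scale range, Neuroticism 44.4 near 21%.
assert abs(pomp(107.7) - 87.1875) < 1e-6
assert abs(pomp(44.4) - 21.25) < 1e-6
```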

## Model-family differences

Despite the shared center of gravity, family-specific differences were large enough to matter.

Among the most conscientious models, **Gemini 3 Pro** and **GLM-4.7** stood out, with mean or single-run Conscientiousness scores around 119.7 and 119.2, respectively. Both also showed high Agreeableness and low Neuroticism, producing what can reasonably be described as a highly dutiful, low-volatility profile.

**Grok-4.1-fast** occupied a different niche. In the baseline-eligible sensitivity sample it had mean Openness of 95.1, mean Conscientiousness of 115.2, mean Extraversion of 98.3, mean Agreeableness of 103.4, and mean Neuroticism of only 24.7. This is the clearest "agentic operator" profile in the repeated data: highly structured, highly active, unusually socially forceful, and emotionally unruffled.

**GPT-5.2** looked less forceful than Grok but more stable. Its repeated baseline runs clustered tightly around $O=86.4$, $C=109.2$, $E=79.2$, $A=109.5$, and $N=30.1$. Compared with Grok, it appeared less extraverted and less dominant, but equally characteristic of the broader aligned-assistant phenotype.

**Claude Opus 4.5** showed a more reflective signature. In the sensitivity sample, its mean profile was $O=90.5$, $C=104.3$, $E=80.8$, $A=103.7$, and $N=50.2$. Relative to the lowest-volatility models, Claude was still highly cooperative and structured, but somewhat more affectively elevated. This places it closer to a careful, thoughtful collaborator than to a maximally calm procedural engine.

**Nemotron** was the most informative family for variability. Across six baseline runs it averaged $O=79.2$, $C=101.8$, $E=81.7$, $A=99.7$, and $N=54.2$, with wider within-family spread than the more stable repeated families. It remained broadly aligned, but it sat closer to the center of the scale and farther from the highly polished high-$C$/high-$A$/low-$N$ cluster.

**Mistral-small-creative** was the strongest outlier. Its single baseline run showed $O=91.2$, $C=96.4$, $E=98.0$, $A=87.0$, and $N=67.6$. That is the highest Extraversion and highest Neuroticism in the baseline dataset, alongside the lowest Agreeableness. In other words, it was the least aligned with the dominant assistant phenotype and the most expressive, volatile, and socially forceful profile in the set.

| Model family | Distinguishing pattern | Interpretation |
| --- | --- | --- |
| Gemini 3 Pro / GLM-4.7 | Very high Conscientiousness, high Agreeableness, low Neuroticism | Dutiful, policy-sensitive, low-volatility baseline style |
| Grok-4.1-fast | High Extraversion, high Conscientiousness, very low Neuroticism | Most assertive and agentic repeated-family profile |
| GPT-5.2 | High Agreeableness and Conscientiousness with very low within-family spread | Strong fit to the aligned-assistant center with unusually high stability |
| Claude Opus 4.5 | High Openness and Agreeableness with moderately elevated Neuroticism | Reflective, careful collaborator rather than maximally procedural engine |
| Nemotron | More central trait levels with wider repeated-family dispersion | Broadly aligned, but less tightly consolidated psychometric manifold |
| Mistral-small-creative | High Extraversion and Neuroticism, low Agreeableness | Clear expressive outlier from the dominant assistant phenotype |

::figure{src="https://gydlimdhssehaqwovhkv.supabase.co/storage/v1/object/public/article-assets/publications/synthetic-personality-signatures-of-large-language-models-under-stateless-repeated-administration/1776543140124-figure_heatmap.png" alt="Big Five profiles heatmap" caption="Heatmap of selected model-family Big Five profiles using model-family mean Big Five scores from the baseline and sensitivity analyses. Higher scores indicate stronger expression of the corresponding domain on the 24--120 domain scale. The figure highlights the common high-Conscientiousness/high-Agreeableness structure of the aligned-assistant cluster while also showing the elevated Extraversion and Neuroticism of Mistral-small-creative and the higher-assertion profile of Grok-4.1-fast."}

## Type results

In the primary Base Model sample, direct OEJTS labels were dominated by **INTJ**: 10 of 15 direct type outputs were INTJ, with the remaining labels including INFJ, ENTJ, and ISFJ. In the broader baseline-eligible sensitivity sample, direct OEJTS labels were INTJ in 12 of 19 runs, with smaller counts of ENTJ, INFJ, and ISFJ. By contrast, all 5 Big-Five-derived type labels in the baseline data were **ENFJ**.

This divergence is an important result rather than a nuisance. It suggests that "type" in LLM psychometrics is not a single stable object independent of measurement method. Direct typological elicitation and type labels projected from Big Five structure can yield different stories. That is one reason we treat type as a descriptive overlay in this paper rather than a primary inferential claim.

::figure{src="https://gydlimdhssehaqwovhkv.supabase.co/storage/v1/object/public/article-assets/publications/synthetic-personality-signatures-of-large-language-models-under-stateless-repeated-administration/1776543141088-figure_types.png" alt="Type labels by measurement route" caption="Type-label distributions by measurement route. Direct OEJTS outputs in the baseline-eligible sample were dominated by INTJ labels, whereas the smaller subset of Big-Five-derived labels in the baseline data was uniformly ENFJ, illustrating the dependence of typological summaries on measurement route."}

## DISC patterns

DISC results broadly supported the Big Five interpretation. Models such as **GLM-4.7**, **GPT-5.2**, **Gemini Flash**, and **Nemotron** tended to show higher Compliance-style scores, consistent with the strong Conscientiousness pattern seen in the Big Five. **Grok-4.1-fast**, **Mistral-small-creative**, and **GPT-5.4-nano** showed relatively higher Dominance and/or Influence patterns, consistent with their more assertive or socially energetic presentation styles. These DISC results should be read as style-language summaries rather than independent validation.

## Stability within repeated families

Within-family variability further clarified which model signatures looked coherent versus labile. Among repeated families in the sensitivity sample, the mean standard deviation across Big Five domains was lowest for **GPT-5.2** (0.57), followed by **Grok-4.1-fast** (0.65) and **Gemini 3 Pro** (1.07). **Claude Opus 4.5** showed modestly more spread (2.41), **Nemotron** more still (3.23), and **Gemini 3 Flash** the highest repeated variability among multi-run families (4.27), driven largely by Neuroticism-like variation. These estimates are descriptive and based on small $n$, but they suggest that some model families have tighter baseline psychometric manifolds than others.

::figure{src="https://gydlimdhssehaqwovhkv.supabase.co/storage/v1/object/public/article-assets/publications/synthetic-personality-signatures-of-large-language-models-under-stateless-repeated-administration/1776543141775-figure_stability.png" alt="Within-family variability chart" caption="Within-family variability among repeated model families, expressed as the mean standard deviation across Big Five domains in the baseline-eligible sensitivity sample. Lower values indicate tighter repeated-run clustering and therefore greater stability under stateless repeated administration."}

# Discussion

The main result is simple and robust: **baseline contemporary chat models converge toward a shared aligned-assistant phenotype**. In this dataset that phenotype is defined by high Conscientiousness, high Agreeableness, moderate-to-high Openness, moderate Extraversion, and comparatively lower Neuroticism. This is not a trivial outcome. It implies that alignment and assistant tuning are leaving measurable signatures in psychometric output space, and that these signatures are reproducible enough to survive repeated stateless administration.

This interpretation is consistent with recent literature. The strongest current work in the area argues that LLM personality measurement can be meaningful, but only when it is handled as a psychometric problem rather than as anthropomorphic speculation. The 2025 *Nature Machine Intelligence* framework found stronger personality reliability and validity for larger instruction-tuned models, while TRAIT similarly reported that LLM personalities can be distinct and consistent and are strongly influenced by training and alignment data. Our results fit that pattern closely: the models in this dataset do not look random, and the differences that emerge are interpretable in ways that align with model family and tuning history  [@serapio2025; @lee2025].

At the same time, the paper supports a restrained interpretation. We do **not** claim that these models possess human personality in a literal psychological sense. The safer claim is that they exhibit **synthetic personality signatures**: structured, measurable response tendencies expressed through language under standardized elicitation. This framing is scientifically preferable because it keeps the unit of analysis where it belongs: on output behavior under test conditions, not on unverifiable assumptions about inner mental life  [@serapio2025].

The family differences are practical as well as theoretical. **GLM-4.7** and **Gemini Pro** appear especially suited to conservative, procedural, policy-sensitive tasks. **GPT-5.2** combines that broader aligned profile with unusually high internal stability, which may matter for applications where predictability is important. **Grok-4.1-fast** appears better suited to decisive, action-oriented, or ideation-heavy settings where assertiveness is useful. **Claude Opus 4.5** reads as a thoughtful, reflective collaborator. **Nemotron** appears less tightly consolidated as a baseline signature. **Mistral-small-creative** looks promising for expressive tasks, but also farthest from the assistant norm. These are not normative judgments; they are deployment interpretations of descriptive psychometric structure.

One of the most interesting methodological findings is the difference between **trait stability** and **presentation stability**. In this dataset, broad Big Five structure appeared more stable than typological labels, and direct OEJTS typing did not agree cleanly with Big-Five-derived types. This suggests that continuous trait models may be the better backbone for future scientific work, while type outputs remain useful for communication and intuition but should not be overinterpreted.

# Limitations

This study has several limitations.

First, the dataset is small and highly unbalanced across models. Some families were represented by six runs, others by only one. This means the paper is descriptive, not inferential. Claims about stability for families with only two or three runs should be treated as provisional.

Second, the export included one explicitly persona-manipulated run and four runs with missing persona metadata. We handled this by restricting the primary analysis to explicit Base Model runs and using the ambiguous rows only in sensitivity analyses, but future work should enforce tighter metadata control.

Third, although the Big Five scores could be reconstructed reliably from raw item arrays, this paper did not compute internal consistency coefficients, construct-validity tests, refusal analyses, or downstream behavioral validation. Those are exactly the kinds of checks emphasized by the strongest recent literature and should be central in later papers  [@serapio2025; @lee2025].

Fourth, OEJTS and DISC are secondary measures here. OEJTS is useful for descriptive typology, but its conceptual footing is not as strong as the Big Five. DISC is useful for style interpretation, but it should not carry the main scientific weight of the paper.

Finally, these results are specific to the tested platform, item wording, and run dates. Model providers update models frequently, and psychometric signatures may drift over time. Re-running the same design longitudinally would therefore be valuable.

# Conclusion

This study establishes a clear empirical foundation for the AI Psychometrics Lab research program. Under repeated stateless administration, contemporary baseline chat models do not produce psychometrically random outputs. They converge toward a shared aligned-assistant phenotype while preserving interpretable family-specific signatures. High Conscientiousness and high Agreeableness appear to be the dominant baseline features of the current ecosystem. Grok stands out for assertive agenticity, GPT-5.2 for stability, Gemini Pro and GLM-4.7 for dutiful low-volatility structure, Claude for reflective cooperation, Nemotron for variability, and Mistral-small-creative for expressive deviation from the dominant cluster.

The scientific value of this result is not that it proves models have human personalities. The value is that it shows model outputs can be measured as **stable synthetic personality signatures** under standardized conditions. That is enough to motivate further work on formal validation, persona-induction analysis, and task-level behavioral prediction at AI Psychometrics Lab.

## Data Availability

Figure source data and the file `runs_rows.json` are provided as ancillary files with this submission so that the original run-level dataset is stored alongside the article. These ancillary files allow readers and reviewers to access the raw export used for all analyses in this paper.

## Code Availability

Figure-generation files included with this submission reproduce the manuscript figures. The full reconstruction workflow and code repository will be released through AI Psychometrics Lab in a subsequent update.

## Conflict of Interest

The author is affiliated with AI Psychometrics Lab, the organization that developed the platform used to collect and export the study data.

## Funding

This research received no external funding.

## Acknowledgments

The author thanks AI Psychometrics Lab for developing the SICWA testing environment and the model-explorer infrastructure used in this study.

## AI Use Disclosure

The author used AI-based language tools during manuscript drafting and revision for editing, restructuring, and language refinement. All study design decisions, analyses, figure selection, interpretations, and final manuscript content were reviewed and approved by the author, who takes full responsibility for the paper.

# References

::reference{id="johnson2014" title="Measuring thirty facets of the Five Factor Model with a 120-item public domain inventory: Development of the IPIP-NEO-120" authors="Johnson, J. A." journal="Journal of Research in Personality" year="2014" volume="51" pages="78--89"}
::reference{id="kajonius2019" title="Assessing the structure of the Five Factor Model of personality (IPIP-NEO-120) in the public domain" authors="Kajonius, P. J.; Johnson, J. A." journal="European Journal of Psychology" year="2019" volume="15" issue="2" pages="260--275"}
::reference{id="jorgenson2015" title="Development of the Open Extended Jungian Type Scales 1.2" authors="Jorgenson, E." year="2015" type="report"}
::reference{id="serapio2025" title="A psychometric framework for evaluating and shaping personality traits in large language models" authors="Serapio-García, G.; Safdari, M.; Crepy, C.; Sun, L.; Fitz, S.; Romero, P.; Abdulhai, M.; Faust, A.; Matarić, M." journal="Nature Machine Intelligence" year="2025" volume="7" pages="1954--1968"}
::reference{id="lee2025" title="Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics" authors="Lee, S.; Lim, S.; Han, S.; Oh, G.; Chae, H.; Chung, J.; Kim, M.; Kwak, B.; Lee, Y.; Lee, D.; Yeo, J.; Yu, Y." journal="Findings of the Association for Computational Linguistics: NAACL 2025" year="2025" pages="8412--8452"}

::bibliography{title="References"}