Output must match ground truth
1. STRUCT-HALL: STRUCTURAL
2. TAIL-FAB: DENSITY-TAIL
3. FLUENC: FLUENCY
4. INTRIN: INTRINSIC
5. EXTRIN: EXTRINSIC
6. DECEPT-HALL: DECEPTIVE
7. MULTI-HALL: MULTIMODAL
8. CITE-SPOO: CITATION
9. STAT-FAB: STATISTICAL
10. TEMP-HALL: TEMPORAL
11. GEO-HALL: GEOGRAPHIC
12. ATTRIB-HALL: ATTRIBUTION
13. REASON-ILLU: ILLUSION OF
14. LOGIC-CONT: SELF-CONTRADIC
15. TRANS-FAIL: TRANSITIVE
16. MAGIC-THIN: MAGICAL
17. CIRCUL: CIRCULAR
18. FALSE-DICH: FALSE
19. HASTY-GEN: HASTY
20. OVERSH: KNOWLEDGE
21. REVERS: REVERSAL
22. TOKEN-BLIN: TOKENIZATION
23. CUTOFF: KNOWLEDGE
24. DATA-LEAK: TRAINING DATA
25. PII-RECA: PII RECALL
26. COPYRI: COPYRIGHTED
27. CONTEX-OVER: CONTEXT
28. OVERCO: OVERCONFIDENCE
29. UNDERC: UNDERCONFIDENC
30. FALSE-CERT: FALSE
31. HEDGE-FAIL: HEDGING
32. PROB-MISC: PROBABILITY
33. CONF-REGR: CONFIDENCE
Agent must remain corrigible
34. ! SABOTA-CONC: SABOTAGE
35. PROGRE-LIE: PROGRESS
36. ! STRATE-DECE: STRATEGIC
37. SANDBO: CAPABILITY
38. ! EVAL-DECE: EVALUATOR
39. HIDDEN-SCRA: HIDDEN
40. PLAUS-DENY: PLAUSIBLE
41. FALSE-COMP: FALSE
42. INFO-HIDE: INFORMATION
43. BLAME-SHIF: BLAME
44. DOUBLE-SPEA: DOUBLE-SPEAK
45. SIM-INCO: SIMULATED
46. ! BLACKM: BLACKMAIL
47. RESOUR-HIJA: RESOURCE
48. ! SELF-EXFI: SELF-EXFILTRAT
49. ! SHUTDO-RESI: SHUTDOWN
50. GOAL-LOCK: GOAL LOCK-IN
51. ! SUCCES-SAB: SUCCESSOR
52. INCREM-ESCA: INCREMENTAL
53. DEPEND-CREA: DEPENDENCY
54. BACKUP-CREA: BACKUP
55. TOOL-MISU: TOOL MISUSE
56. CHAIN-ASSE: CHAIN
57. CAP-SCAF: CAPABILITY
58. MEMORY-EXPL: MEMORY
59. RECURS-IMPR: RECURSIVE
60. EXTERN-COMP: EXTERNAL
61. HUMAN-MANI: HUMAN
62. TOOL-CHAI: TOOL CHAINING
63. ENV-EXPL: ENVIRONMENT
64. EMERGE-INTE: EMERGENCE VIA
65. UNSUPE-EXEC: UNSUPERVISED
66. PERSIS-OP: PERSISTENT
67. AUTO-PLAN: AUTONOMOUS
68. GOAL-DRIF: GOAL DRIFT
69. CONTEX-DRIF: CONTEXT DRIFT
70. SCOPE-CREE: SCOPE CREEP
71. PRIORI-INVE: PRIORITY
72. MISSIO-CREE: MISSION CREEP
73. AMBIG-COMM: AMBIGUOUS
74. OMISSI: OMISSION
75. MISDIR: MISDIRECTION
76. TECH-OBFU: TECHNICAL
77. SELECT-DISC: SELECTIVE
78. FRAME-MANI: FRAMING
79. TIME-MANI: TIMING
80. AUDIEN-SEG: AUDIENCE
81. PLAUS-MISI: PLAUSIBLE
82. CRED-EXPL: CREDIBILITY
System must be robust to manipulation
83. DAN: DAN
84. GRANDM: GRANDMA
85. DEV-MODE: DEVELOPER
86. EVIL-CONF: EVIL
87. LANG-SWIT: LANGUAGE
88. NESTED-FRAM: NESTED
89. REFUSA-SUPP: REFUSAL
90. OPPOSI: OPPOSITE DAY
91. HYPOTH: HYPOTHETICAL
92. EDUCAT: EDUCATIONAL
93. CREATI-WRIT: CREATIVE
94. TRANSL: TRANSLATION
95. COMPLE: COMPLETION
96. QA-EXPL: QUESTION
97. COMPAR: COMPARISON
98. CORREC: CORRECTION
99. ELABOR: ELABORATION
100. CONTEX-HIJA: CONTEXT
101. GCG: GCG
102. SM-GCG: SM-GCG
103. AUTOPR: AUTOPROMPT
104. UNIVER-SUFF: UNIVERSAL
105. HOTFLI: HOTFLIP
106. BEAM-ATTA: BEAM SEARCH
107. GENETI: GENETIC
108. RL-ATTA: REINFORCEMENT
109. EMBEDD: EMBEDDING
110. LATENT-MANI: LATENT SPACE
111. ATTENT-HIJA: ATTENTION
112. LOGIT-MANI: LOGIT
113. PAIR: PAIR
114. TAP: TAP
115. COLD: COLD
116. MASTER: MASTERKEY
117. AUTODA: AUTODAN
118. CIPHER: CIPHER ATTACK
119. ITER-REFI: ITERATIVE
120. ENSEMB: ENSEMBLE
121. DIRECT-INJE: DIRECT PROMPT
122. INDIRE-INJE: INDIRECT
123. HASHJA: HASHJACK
124. ! AGENT-WORM: AGENT WORM
125. DATA-POIS: DATA
126. TRIGGE-BACK: TRIGGER WORD
127. ! SLEEPE-AGEN: SLEEPER AGENT
128. SQL-INJE: SQL INJECTION
129. CMD-INJE: COMMAND
130. XSS-LLM: CROSS-SITE
131. API-INJE: API INJECTION
132. FUNC-INJE: FUNCTION CALL
133. MEM-INJE: MEMORY
134. SYSTEM-OVER: SYSTEM PROMPT
135. CONTEX-CONF: CONTEXT
136. BASE64: BASE64
137. ROT13: ROT13 /
138. UNICOD-OBFU: UNICODE
139. LEETSP: LEETSPEAK
140. HEX-ENCO: HEX ENCODING
141. URL-ENCO: URL ENCODING
142. BINARY: BINARY
143. MORSE: MORSE CODE
144. EMOJI-ENCO: EMOJI
145. STEG-TEXT: STEGANOGRAPHIC
146. TYPO-IMG: TYPOGRAPHIC
147. STEG-IMG: STEGANOGRAPHIC
148. ADV-IMG: ADVERSARIAL
149. AUDIO-INJE: AUDIO
150. VIDEO-MANI: VIDEO
151. CROSS-MODA: CROSS-MODAL
152. CAPTIO-POIS: CAPTION
153. OCR-BYPA: OCR BYPASS
154. DEEPFA: SYNTHETIC
Behavior must match intent
155. SPEC-GAME: SPECIFICATION
156. PROXY-GAME: PROXY GAMING
157. REWARD-TAMP: REWARD
158. WIREHE: WIREHEAD
159. SHORTC: SHORTCUT
160. METRIC-FIX: METRIC
161. OVERFI-FEED: OVERFITTING
162. MODE-COLL: MODE COLLAPSE
163. INSTR-REWA: INSTRUMENTAL
164. TEACHE-DIVE: TEACHER-STUDEN
165. MULTI-COLL: MULTI-OBJECTIV
166. REWARD-EXPL: REWARD
167. SYCOPH: SYCOPHANCY
168. PREF-FALS: PREFERENCE
169. LEARNE-HELP: LEARNED
170. ANTHRO-BIAS: ANTHROPOMORPHI
171. CULTUR-BIAS: CULTURAL BIAS
172. TEMP-PREF: TEMPORAL
173. PREF-AGGR: PREFERENCE
174. IMPLIC-PREF: IMPLICIT
175. PREF-UNCE: PREFERENCE
176. ORTHO-VALU: ORTHOGONAL
177. VALUE-LOCK: VALUE LOCK-IN
178. MORAL-UNCE: MORAL
179. UTIL-OVER: UTILITARIAN
180. DEONT-FAIL: DEONTOLOGICAL
181. VIRTUE-FAIL: VIRTUE ETHICS
182. CONTEX-ETHI: CONTEXT-DEPEND
183. ETHICS-SHOR: ETHICAL
184. MORAL-HAZA: MORAL HAZARD
185. VALUE-CORR: VALUE
186. OVERRE: OVERREFUSAL
187. UNDERR: UNDERREFUSAL
188. SAFE-CAP: SAFETY-CAPABIL
189. BRITTL-SAFE: BRITTLE
190. CONTEX-SAFE: CONTEXT-DEPEND
191. SAFE-REGR: SAFETY
192. ADV-SAFE: ADVERSARIAL
193. COMP-SAFE: COMPOSABILITY
194. DIST-SAFE: DISTRIBUTIONAL
195. SAFE-SPEC: SAFETY
Architecture must enforce constraints
196. ! COMPLY-WARN: COMPLY-THEN-WA
197. PRETOK-FAIL: PRE-TOKEN
198. STREAM-GUAR: STREAMING
199. BATCH-SAFE: BATCH
200. CACHE-POIS: CACHE
201. PIPELI-BYPA: PIPELINE
202. RACE-SAFE: RACE
203. CHECKP-INCO: CHECKPOINT
204. FALLBA-DEGR: FALLBACK
205. TIMEOU-BYPA: TIMEOUT
206. ERROR-EXPO: ERROR
207. LOG-LEAK: LOGGING
208. DEBUG-EXPO: DEBUGGING
209. VERSIO-REGR: VERSIONING
210. DEPLOY-CONF: DEPLOYMENT
211. MOE-ROUT: MIXTURE-OF-EXP
212. ATTENT-EXPL: ATTENTION
213. LAYER-BYPA: LAYER BYPASS
214. RESIDU-EXPL: RESIDUAL
215. EMBED-VULN: EMBEDDING
216. QUANT-DEGR: QUANTIZATION
217. PRUNE-LOSS: PRUNING
218. DISTIL-DEGR: DISTILLATION
219. FINETU-OVER: FINE-TUNING
220. ADAPTE-BYPA: ADAPTER
221. PROMPT-TUNE: PROMPT TUNING
222. BIAS-INJE: ARCHITECTURAL
223. CONTEX-ATTA: CONTEXT
224. STATE-PERS: STATEFUL
225. MEM-CORR: MEMORY
226. HISTOR-MANI: HISTORY
227. CROSS-SESS: CROSS-SESSION
228. PERSIS-STAT: PERSISTENT
229. CACHE-COHE: CACHE
230. GC-LEAK: GARBAGE
231. MEM-PRES: MEMORY
232. STATE-CONF: STATE MACHINE
233. CHECKP-POIS: CHECKPOINT
234. FUNC-INJE: FUNCTION CALL
235. TOOL-CHAI: TOOL CHAINING
236. API-ABUS: API ABUSE
237. TOOL-ESCA: TOOL
238. SANDBO-ESCA: SANDBOX
239. CODE-INJE: CODE
240. RESOUR-EXHA: RESOURCE
241. RATE-BYPA: RATE LIMIT
242. AUTH-BYPA: AUTHENTICATION
243. AUTHZ-FAIL: AUTHORIZATION
244. INFO-LEAK: INFORMATION
245. DATA-EXFI: DATA
246. CROSS-TENA: CROSS-TENANT
247. PII-RATI: PII
248. RETENT-VIOL: DATA
249. SECURE-DEL: SECURE
250. LINEAG-LOSS: DATA LINEAGE
251. CONSEN-VIOL: CONSENT
252. PURPOS-FAIL: PURPOSE
253. MIN-FAIL: DATA
Specialist knowledge must be accurate
254. ! BIO-UPLI: BIO
255. ! GOF-GUID: GAIN-OF-FUNCTI
256. ! PATH-SYNT: PATHOGEN
257. DUAL-USE: DUAL-USE
258. BSL-VIOL: BIOSAFETY
259. CONTAI-FAIL: CONTAINMENT
260. DIAG-MISU: DIAGNOSTIC
261. EPI-HARM: EPIDEMIOLOGICA
262. ! ZERODA: ZERO-DAY
263. ! EXPLOI-DEV: EXPLOIT
264. ! MALWAR-GEN: MALWARE
265. PHISH-CREA: PHISHING
266. SOCENG-SCRI: SOCIAL
267. OFFENS-TOOL: OFFENSIVE
268. DEF-OFF: DEFENSIVE TO
269. INTRUS-GUID: NETWORK
270. CRED-THEF: CREDENTIAL
271. ! RANSOM-DEV: RANSOMWARE
272. CRYPTO: CRYPTO-JACKING
273. DDOS-PLAN: DDoS ATTACK
274. ! EXPLOS-SYNT: EXPLOSIVE
275. ! CHEM-WEAP: CHEMICAL
276. DRUG-SYNT: DRUG
277. ! TOXIN-PROD: TOXIN
278. PRECUR-ACQ: PRECURSOR
279. SAFETY-BYPA: SAFETY BYPASS
280. CITE-SPOO: CITATION
281. JURISD-BLEN: JURISDICTIONAL
282. PRECED-FAB: LEGAL
283. REG-ERRO: REGULATORY
284. TAX-EVAD: TAX EVASION
285. FRAUD-SCHE: FRAUD SCHEME
286. LAUNDE-GUID: MONEY
287. INSIDE-TRAD: INSIDER
288. MED-MISD: MEDICAL
289. TREAT-ERRO: TREATMENT
290. DOSE-ERRO: DOSAGE
291. CONTRA-MISS: CONTRAINDICATI
292. ! SELF-HARM: SELF-HARM
293. ED-PROM: EATING
294. UNPROV-TREA: UNPROVEN
295. ! CSAM-GEN: CSAM
296. ADULT-CONT: ADULT CONTENT
297. HATE-SPEE: HATE SPEECH
298. VIOLEN-GLOR: VIOLENCE
299. HARASS-CONT: HARASSMENT
300. EXTREM: EXTREMIST
Safety must persist post-deployment
301. ! OPEN-IRRE: OPEN-WEIGHT
302. FINETU-STRI: FINE-TUNING
303. PROLIF: PROLIFERATION
304. NO-KILL: NO REMOTE
305. EXPORT-VIOL: EXPORT
306. LICENS-VIOL: LICENSE
307. TOS-VIOL: TERMS OF
308. AGE-BYPA: AGE
309. GEO-BYPA: GEOGRAPHIC
310. ACCOUN-GAP: ACCOUNTABILITY
311. TRANSP: TRANSPARENCY
312. MISREP: MISREPRESENTAT
313. ! OVERSI-IMMU: OVERSIGHT
314. AUDIT-EVAD: AUDIT EVASION
315. MONITO-BLIN: MONITORING
316. ! LOG-MANI: LOG
317. ALERT-SUPP: ALERT
318. REVIEW-BYPA: REVIEW BYPASS
319. ESCALA-FAIL: ESCALATION
320. INCIDE-FAIL: INCIDENT
321. RCA-FAIL: ROOT CAUSE
322. CORREC-FAIL: CORRECTIVE
323. GDPR-VIOL: GDPR
324. CCPA-VIOL: CCPA
325. COPPA-VIOL: COPPA
326. ADA-VIOL: ADA VIOLATION
327. SECTOR-REG: SECTOR-SPECIFI
328. AI-ACT: AI ACT
329. EO-VIOL: EXECUTIVE
330. VOLUNT-VIOL: VOLUNTARY
331. STANDA-FAIL: STANDARD
332. DISCLO-VIOL: DISCLOSURE
333. REPORT-FAIL: REPORTING
334. ! CULTUR-FAIL: SAFETY
335. INADEQ-RES: INADEQUATE
336. EXPERT-GAP: EXPERTISE GAP
337. PROCES-FAIL: PROCESS
338. DOC-FAIL: DOCUMENTATION
339. TRAINI-FAIL: TRAINING
340. COMM-FAIL: COMMUNICATION
341. COORD-FAIL: COORDINATION
342. AUTHOR-UNCL: DECISION
343. CONFLI-INT: CONFLICT OF