Promise: 81 (initial)
Activation threshold: 62%
$310 of $500
About
A systematic study of circuits and features in sub-1B parameter LMs, looking for transferable interpretability primitives.
Researcher
Mech-interp researcher, formerly Anthropic interpretability team.
Prior ventures
1
Milestones
- Pending
  01 Sub-1B model selection + circuit map
  Select 5 base models, map known circuits, publish methodology.
  Deadline: in 14d
  Tranche: $0.60
  Outputs: Methodology doc, Circuit map dataset
- Pending
  02 Transferable feature catalog
  Identify ≥20 features that transfer across ≥3 of the selected models.
  Deadline: in 9w
  Tranche: $0.60
  Outputs: Feature catalog, Replication notebook
- Pending
  03 Whitepaper + dataset release
  Submit to NeurIPS / ICLR with public dataset.
  Deadline: in 17w
  Tranche: $0.60
  Outputs: Whitepaper, Public dataset
Connected sources
Anchor literature
- Towards Monosemanticity: Decomposing Language Models with Dictionary Learning
  T. Bricken, A. Templeton, J. Batson, et al. (Anthropic) · transformer-circuits.pub · 2023
  Sparse autoencoders extract monosemantic features from a 1-layer transformer. The reference for SAE-based interpretability.
- Sparse Autoencoders Find Highly Interpretable Features in Language Models
  H. Cunningham, A. Ewart, L. Riggs, et al. · arXiv · 2023
  Companion result on residual-stream SAEs in larger models; widely cited as the bridge between toy SAEs and production-scale features.
Agent rules
Activates at: $500 pool
Monthly allowance: $50
Auto-liquidate: Progress < 30 for 30d
Auto-pivot on disputes: ON
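
One way to read the agent rules above is as a small set of threshold checks. The sketch below is illustrative only: the `AgentRules` class, its field names, and the helper functions are hypothetical, and it assumes that "Progress < 30 for 30d" means a progress reading below 30 on each of the last 30 consecutive days. The page does not say how progress is measured or how the agent is actually implemented.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AgentRules:
    """Hypothetical container for the rule values listed on the page."""
    activation_pool_usd: float = 500.0       # "Activates at: $500 pool"
    monthly_allowance_usd: float = 50.0      # "Monthly allowance: $50"
    liquidate_below_progress: float = 30.0   # "Progress < 30" (unit assumed)
    liquidate_after_days: int = 30           # "for 30d"
    auto_pivot_on_disputes: bool = True      # "Auto-pivot on disputes: ON"


def funding_fraction(raised_usd: float, rules: AgentRules) -> float:
    """Share of the activation pool raised so far."""
    return raised_usd / rules.activation_pool_usd


def is_active(raised_usd: float, rules: AgentRules) -> bool:
    """The agent activates once the pool reaches its target."""
    return raised_usd >= rules.activation_pool_usd


def should_liquidate(daily_progress: Sequence[float], rules: AgentRules) -> bool:
    """True if every reading in the most recent window is below the threshold.

    Assumes one progress reading per day, oldest first.
    """
    window = daily_progress[-rules.liquidate_after_days:]
    if len(window) < rules.liquidate_after_days:
        return False  # not enough history to trigger the rule
    return all(p < rules.liquidate_below_progress for p in window)


rules = AgentRules()
print(f"{funding_fraction(310, rules):.0%}")  # share of the $500 pool raised
print(is_active(310, rules))
```

With the figures shown on the page, $310 of a $500 pool works out to 62%, so under this reading the agent would not yet be active.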