Continuous Thought Machines
tl;dr
Neurons in biological brains use timing and synchronization as part of how they compute. This property seems essential for the flexibility and adaptability of biological intelligence. Modern AI systems discard this fundamental property in favor of efficiency and simplicity. We found a way of bridging the gap between the powerful, scalable implementations of modern AI and the biologically plausible paradigm in which neuron timing matters. The results have been surprising and encouraging.
This is the Continuous Thought Machine (CTM): a new type of neural network that uses the synchronization of neural activity over time as its representation for taking actions in a world.
This maze solving demo runs a real CTM in your browser: it tries to find a path (up to 150 steps) from the red pixel to the green pixel.
Hit the teleport button to move the start location to the last step the CTM predicts, and have fun teleporting to solve longer paths!
You can also hit the run button to watch it solve again. You can move the start or end positions by clicking on different maze locations. Toggle the move button to switch between moving start or end positions.
As it pays attention to the maze (shown below the maze and overlaid brightly), it unfolds Neural Dynamics: the complex, time-dependent patterns of neural activity. The way that neurons synchronize is how the CTM interacts with the world. Watch how it builds a route as it thinks. Have fun trying to solve these mazes quickly with the CTM!
Note that this is a much smaller model than what we trained for the full results, and it might not always perform as consistently well as what those results show.
Introduction
Neural networks (NNs) were originally inspired by biological brains, yet they remain significantly distinct from their biological counterparts. Brains demonstrate complex neural dynamics that evolve over time, but modern NNs intentionally abstract away such temporal dynamics in order to facilitate large-scale deep learning. For instance, the activation functions of standard NNs can be seen as an intentional abstraction of a neuron’s firing rate, replacing the temporal dynamics of biological processes with a single, static value. Such simplifications, though enabling significant advancements in large-scale machine learning [1, 2, 3], have resulted in a departure from the fundamental principles that govern biological neural computation.
Over hundreds of millions of years, evolution has endowed biological brains with rich neural dynamics, including spike-timing-dependent plasticity (STDP) [4] and neuronal oscillations. Emulating these mechanisms, particularly the temporal coding inherent in spike timing and synchrony, presents a significant challenge. Consequently, modern neural networks do not rely on temporal dynamics to perform compute, but rather prioritize simplicity and computational efficiency. This abstraction, while boosting performance on specific tasks, contributes to a recognized gap between the flexible, general nature of human cognition and current AI capabilities, suggesting fundamental components, potentially related to temporal processing, are missing from our current models [5, 6, 7].
Why do this research?
Admittedly, the notably high performance of modern AI across many fields might suggest that emulating neural dynamics is unwarranted. However, the gap between the highly flexible and general nature of human cognition and the current state of modern AI suggests that components are missing from our current models.
For these reasons, we argue that time should be a central component of artificial intelligence in order for it to eventually achieve levels of competency that rival or surpass human brains [8, 9]. Therefore, in this work, we address the strong limitation imposed by overlooking neural activity as a central aspect of intelligence. We introduce the Continuous Thought Machine (CTM), a novel neural network architecture designed to explicitly incorporate neural timing as a foundational element. Our contributions are as follows:
We introduce a decoupled internal dimension, a novel approach to modeling the temporal evolution of neural activity. We view this dimension as that over which thought can unfold in an artificial neural system, hence the choice of nomenclature.
We provide a mid-level abstraction for neurons, which we call neuron-level models (NLMs), where every neuron has its own internal weights that process a history of incoming signals (i.e., pre-activations) to activate (as opposed to a static ReLU, for example).
We use neural synchronization directly as the latent representation with which the CTM observes (e.g., through an attention query) and predicts (e.g., via a projection to logits). This biologically-inspired design choice puts forward neural activity as the crucial element for any manifestation of intelligence the CTM might demonstrate.
Reasoning models and recurrence
The frontier of artificial intelligence faces a critical juncture: moving beyond simple input-output mappings towards genuine reasoning capabilities. While scaling existing models has yielded remarkable advancements, the associated computational cost and data demands are unsustainable and raise questions about the long-term viability of this approach. For sequential data, longstanding recurrent architectures [10, 11, 12] have largely been superseded by transformer-based approaches [13]. Nevertheless, recurrence is re-emerging as a natural avenue for extending model complexity. Recurrence is promising because it enables iterative processing and the accumulation of information over time. Modern text generation models (sometimes referred to as ‘reasoning models’) use intermediate generations as a form of recurrence that enables additional compute during test-time. Recently, other works have demonstrated the benefits of the recurrent application of latent layers [14, 15, 16]. While such methods bring us closer to the recurrent structure of biological brains, a fundamental gap nevertheless remains. We posit that recurrence, while essential, is merely one piece of the puzzle. The temporal dynamics unlocked by recurrence — the precise timing and interplay of neural activity — are equally crucial. The CTM differs from existing approaches in three ways: (1) the decoupled internal dimension enables sequential thought on any conceivable data modality; (2) private neuron-level models enable the consideration of precise neural timing; and (3) neural synchronization is used directly as the representation for solving tasks.
Method
Fig 1. The Continuous Thought Machine: a single step in its internal recurrent process.
The CTM unfolds neural activity internally as it thinks about data. At each step (one of which is demonstrated above), a truncated history of ‘pre-activations’ is collected and used by the Neuron-Level Models (NLMs). The history of ‘post-activations’ produced by all NLMs over time is kept and used to compute neuron-to-neuron synchronization over time. The result is a synchronization representation: a new, parameter-efficient, and evidently powerful representation that the CTM uses to observe (via attention) and predict.
The Continuous Thought Machine (CTM) is a neural network architecture that enables a novel approach to thinking about data. It departs from conventional feed-forward models by explicitly incorporating the concept of Neural Dynamics as the central component to its functionality. The video above gives a pictorial overview of the internal workings of the CTM. We give all technical details, including additional figures and verbose explanations in our Technical Report. A GitHub repository is also available. We will provide links to relevant parts of the repository as we explain the model below.
Fig 2. CTM architecture: The 1 synapse model (weights depicted as blue lines) models the cross-neuron interactions to produce pre-activations. For each neuron, a 2 history of pre-activations is kept, the most recent of which are used by the 3 neuron-level models (weights depicted as red lines) to produce 4 post-activations. A 5 history of post-activations is also kept and used to 6 compute a synchronization matrix. Neuron pairs are 7 selected from the synchronization matrix, yielding the 8 latent representations with which the CTM 9 produces outputs and modulates data through cross-attention. Modulated data (e.g., attention outputs) are 10 concatenated with post-activations for the next internal tick.
Variable glossary:
$z^t$: Post-activations at internal tick $t$, after the neuron-level models have been applied.
$\theta_{\text{syn}}$: Recurrent (synapse) model weights; U-NET-like architecture that connects neurons at a given internal tick, $t$.
$a^t$: Pre-activations at internal tick $t$.
$A^t$: History of the $M$ most recent pre-activations, designed as a FIFO list so that it is always length $M$; inputs to the neuron-level models.
$\theta_d$: Weights of a single neuron-level model (one of $D$); MLP architecture, unique weights per neuron.
$Z^t$: History of all post-activations up to this internal tick, variable length; used as input for the synchronization dot products.
$S^t$: Synchronization matrix at internal tick $t$. In practice we use far fewer neurons than $D$ for the separate $S^t_{\text{out}}$ and $S^t_{\text{action}}$ synchronization representations.
$W_{\text{in}}$, $W_{\text{out}}$: Linear weight matrices that project from $S^t_{\text{action}}$ and $S^t_{\text{out}}$ to attention queries and predictions, respectively.
$o^t$: Cross-attention output.
The CTM consists of three main ideas:
The use of internal recurrence, enabling a dimension over which a concept analogous to thought can occur. The entire process visualised in the video above is a single tick; the interactive maze demo at the top of the page uses 75 ticks. This recurrence is completely decoupled from any data dimensions.
Neuron-level models, that compute post-activations by applying private (i.e., on a per-neuron basis) MLP models to a history of incoming pre-activations.
Synchronization as a representation, where the neural activity over time is tracked and used to compute how pairs of neurons synchronize with one another over time. This measure of synchronization is the representation with which the CTM takes action and makes predictions. Listing 3 in the Technical Report shows the logic for this, and Appendix K details how we use a recursive computation for efficiency.
But what about data?
While data is undoubtedly crucial for any modeling, the CTM is designed around the idea of internal recurrence and synchronization, where the role of data is somewhat secondary to the internal process itself.
Input data is attended to and ingested at each internal tick based on the current synchronisation, and predictions are produced from it in the same way.
Fig 3. Neural Dynamics when thinking about ImageNet: Each subplot is the activity of a single neuron over time. It is the synchronization between these that forms the representation used by the CTM.
Internal ticks: the ‘thought’ dimension
We start by introducing the continuous internal dimension: t∈{1,…,T}. Unlike conventional sequential models — such as RNNs or Transformers — that process inputs step-by-step according to the sequence inherent in the data (e.g., words in a sentence or frames in a video), the CTM operates along a self-generated timeline of internal thought steps. This internal unfolding allows the model to iteratively build and refine its representations, even when processing static or non-sequential data such as images or mazes. To conform with existing nomenclature used in related works [17, 18, 19, 20], we refer to these thought steps as ‘internal ticks’ from here on.
A dimension over which thought can unfold.
The CTM’s internal dimension is that over which the dynamics of neural activity can unfold. We believe that such dynamics are likely a cornerstone of intelligent thought.
Recurrent weights: synapses
A recurrent multi-layer perceptron (MLP structured in a U-NET fashion [21]) acts as a synapse model for the CTM. At any internal tick t, the synapse model produces what we consider pre-activations:
$a^t = f_{\theta_{\text{syn}}}(\mathrm{concat}(z^t, o^t)) \in \mathbb{R}^D,$
where $o^t$ is from input data. The $M$ most recent pre-activations are then collected into a pre-activation ‘history’:

$A^t = [a^{t-M+1} \; \cdots \; a^t] \in \mathbb{R}^{D \times M}.$
$M$ effectively defines the length of the history of pre-activations that each neuron-level model works with. Each neuron, $d \in \{1, \ldots, D\}$, is then given its own privately parameterized MLP that produces what we consider post-activations:
$z_d^{t+1} = g_{\theta_d}(A_d^t),$
where $\theta_d$ are the unique parameters for neuron $d$, and $z_d^{t+1}$ is a single unit in the vector that contains all post-activations. $A_d^t$ is an $M$-dimensional vector (a time series). The full set of neuron post-activations is then concatenated with the attention output and fed recurrently into $f_{\theta_{\text{syn}}}$ to produce pre-activations for the next step, $t+1$, in the unfolding thought process.
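To make one internal tick concrete, here is a minimal, unofficial sketch in PyTorch, assuming a plain two-layer MLP in place of the U-NET-style synapse model and vectorising the $D$ private neuron-level models with a single einsum over per-neuron weight tensors; all names and sizes are illustrative rather than taken from the reference implementation.

```python
# Minimal sketch of one CTM internal tick (not the reference implementation).
import torch
import torch.nn as nn

D, M, H, D_INPUT = 512, 25, 16, 256   # neurons, history length, NLM hidden size, attention-output size

synapse = nn.Sequential(              # stand-in for f_theta_syn: cross-neuron interactions -> pre-activations
    nn.Linear(D + D_INPUT, D), nn.GELU(), nn.Linear(D, D),
)

# Private per-neuron MLP weights: neuron d maps its length-M pre-activation
# history A_d^t to a single post-activation z_d^{t+1}.
W1 = nn.Parameter(torch.randn(D, M, H) * M ** -0.5)
b1 = nn.Parameter(torch.zeros(D, H))
W2 = nn.Parameter(torch.randn(D, H) * H ** -0.5)
b2 = nn.Parameter(torch.zeros(D))

def internal_tick(z, o, history):
    """z: (B, D) post-activations, o: (B, D_INPUT) attention output,
    history: (B, D, M) FIFO of pre-activations. Returns new z and history."""
    a = synapse(torch.cat([z, o], dim=-1))                              # (B, D) pre-activations
    history = torch.cat([history[:, :, 1:], a.unsqueeze(-1)], dim=-1)   # slide the FIFO window
    h = torch.einsum("bdm,dmh->bdh", history, W1) + b1                  # per-neuron hidden layer
    z_next = torch.einsum("bdh,dh->bd", torch.relu(h), W2) + b2         # per-neuron scalar output
    return z_next, history
```

Storing the per-neuron weights as (D, M, H) tensors lets all $D$ private MLPs run in a single batched operation while still giving every neuron its own parameters.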
Synchronization as a representation: modulating data
How should the CTM interact with the outside world? Specifically, how should the CTM consume inputs and produce outputs? We introduced a timing dimension over which something akin to thought can unfold. We also want the CTM’s relationship with data (its interaction, so to speak) to depend not on a snapshot of the state of neurons (at some t), but rather on the ongoing temporal dynamics of neuron activities. By way of solution, we turn again to natural brains for inspiration and find the concept of neural synchronization [22] both fitting and powerful. For synchronization we start by collecting the post-activations into a post-activation ‘history’:
$Z^t = [z^1 \; z^2 \; \cdots \; z^t] \in \mathbb{R}^{D \times t}.$
The length of $Z^t$ is equal to the current internal tick, meaning that this dimension is not fixed and can be arbitrarily large. We define neural synchronization as the matrix yielded by the inner product between post-activation histories:

$S^t = Z^t \cdot (Z^t)^\intercal \in \mathbb{R}^{D \times D}.$

Since this matrix scales in $O(D^2)$, it makes practical sense to subsample $(i, j)$ row-column pairs, each of which captures the synchronization between neurons $i$ and $j$. To do so we randomly select $D_{\text{out}}$ and $D_{\text{action}}$ such $(i, j)$ pairs from $S^t$, thus collecting two synchronization representations, $S^t_{\text{out}} \in \mathbb{R}^{D_{\text{out}}}$ and $S^t_{\text{action}} \in \mathbb{R}^{D_{\text{action}}}$. $S^t_{\text{out}}$ can then be projected to an output space as:

$y^t = W_{\text{out}} \cdot S^t_{\text{out}}.$
Synchronization enables a very large representation.
As the model width, $D$, grows, the synchronization representation grows with $(D \times (D+1))/2$, offering opportunities for improved expressiveness without the need for more parameters in order to project a latent space to this size.
Modulating input data
$S^t_{\text{action}}$ can be used to take actions in the world (e.g., via attention, as in our setup):

$q^t = W_{\text{in}} \cdot S^t_{\text{action}},$

where $W_{\text{out}}$ and $W_{\text{in}}$ are learned weight matrices that project synchronization into vectors for observation (e.g., attention queries, $q^t$) or outputs (e.g., logits, $y^t$). Even though there are $(D \times (D+1))/2$ unique pairings in $S^t$, $D_{\text{out}}$ and $D_{\text{action}}$ can be orders of magnitude smaller than this. That said, the full synchronization matrix is a large representation that has high future potential.
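As a rough sketch of how these representations can be computed, the code below accumulates $S^t$ recursively (using $Z^t (Z^t)^\intercal = \sum_{\tau \le t} z^\tau (z^\tau)^\intercal$), samples fixed $(i, j)$ neuron pairs, and applies the two learned projections. In practice one would track only the sampled pairs rather than the full $D \times D$ matrix (the Technical Report's Appendix K describes the recursion actually used); pair counts, shapes, and names here are illustrative.

```python
# Unofficial sketch of the synchronization representation and its projections.
import torch
import torch.nn as nn

D, D_OUT, D_ACTION, C, D_MODEL = 512, 64, 64, 10, 64   # neurons, sampled pairs, classes, query size

# Fixed, randomly chosen neuron-pair indices for the two representations.
out_i, out_j = torch.randint(0, D, (2, D_OUT))
act_i, act_j = torch.randint(0, D, (2, D_ACTION))

W_out = nn.Linear(D_OUT, C)           # projects S_out^t to logits y^t
W_in = nn.Linear(D_ACTION, D_MODEL)   # projects S_action^t to an attention query q^t

def update_synchronization(S_prev, z):
    """S^t = S^{t-1} + z^t (z^t)^T; S_prev: (B, D, D) (zeros at t=0), z: (B, D)."""
    return S_prev + z.unsqueeze(-1) * z.unsqueeze(-2)

def representations(S):
    S_out = S[:, out_i, out_j]              # (B, D_OUT) sampled synchronization values
    S_action = S[:, act_i, act_j]           # (B, D_ACTION)
    return W_out(S_out), W_in(S_action)     # logits y^t and attention query q^t
```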
In most of our experiments we used standard cross attention [13]:

$o^t = \mathrm{Attention}(Q = q^t,\; KV = \mathrm{FeatureExtractor}(\mathrm{data})),$

where a ‘FeatureExtractor’ model, e.g., a ResNet [23], is first used to build useful local features for the keys and values. $o^t$ is concatenated with $z^{t+1}$ for the next cycle of recurrence.
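A minimal sketch of this attention step, assuming PyTorch's built-in multi-head attention and a small convolutional stand-in for the ResNet feature extractor (the real backbone, head count, and dimensions differ):

```python
# Unofficial sketch of the data-modulation step via cross-attention.
import torch
import torch.nn as nn

D_MODEL, N_HEADS = 64, 8

feature_extractor = nn.Sequential(          # stand-in for a ResNet backbone
    nn.Conv2d(3, D_MODEL, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(D_MODEL, D_MODEL, kernel_size=3, stride=2, padding=1),
)
cross_attn = nn.MultiheadAttention(D_MODEL, N_HEADS, batch_first=True)

def modulate(image, q):
    """image: (B, 3, H, W); q: (B, D_MODEL) query built from S_action^t."""
    kv = feature_extractor(image).flatten(2).transpose(1, 2)   # (B, H'*W', D_MODEL) keys/values
    o, attn_weights = cross_attn(q.unsqueeze(1), kv, kv)       # o: (B, 1, D_MODEL)
    return o.squeeze(1)                                        # concatenated with z^{t+1} next tick
```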
Loss function: optimizing across internal ticks
The CTM produces outputs at each internal tick, $t$. A key question arises: how do we optimize the model across this internal temporal dimension? Let $y^t \in \mathbb{R}^C$ be the prediction vector (e.g., probabilities of classes) at internal tick $t$, where $C$ is the number of classes. Let $y_{\text{true}}$ be the ground-truth target. We can compute a loss at each internal tick using a standard loss function, such as cross-entropy:

$\mathcal{L}_t = \mathrm{CrossEntropy}(y^t, y_{\text{true}}),$

and a corresponding certainty measure, $\mathcal{C}_t$. We compute certainty simply as 1 minus the normalised entropy. We compute $\mathcal{L}_t$ and $\mathcal{C}_t$ for all $t \in \{1, \ldots, T\}$, yielding losses and certainties per internal tick, $\mathcal{L} \in \mathbb{R}^T$ and $\mathcal{C} \in \mathbb{R}^T$.
A natural question arises: how should we reduce L into a scalar loss for learning? Our loss function is designed to optimize CTM performance across the internal thought dimension. Instead of relying on a single step (e.g., the last step), which can incentivize the model to only output at that specific step, we dynamically aggregate information from two internal ticks: the point of minimum loss and the point of maximum certainty:
the point of minimum loss: $t_1 = \operatorname{argmin}(\mathcal{L})$; and
the point of maximum certainty: $t_2 = \operatorname{argmax}(\mathcal{C})$.
This approach is advantageous because it means that the CTM can perform meaningful computations across multiple internal ticks, naturally facilitates a curriculum effect, and enables the CTM to tailor computation based on problem difficulty. The final loss is computed as:
$\mathcal{L} = \dfrac{\mathcal{L}_{t_1} + \mathcal{L}_{t_2}}{2}.$
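As a concrete reading of this aggregation, the sketch below (unofficial; shapes and the entropy-based certainty are our interpretation of the description above) computes per-tick cross-entropy losses, certainty as 1 minus normalised entropy, and averages the losses at the minimum-loss and maximum-certainty ticks:

```python
# Unofficial sketch of the tick-wise CTM loss aggregation.
import torch
import torch.nn.functional as F

def ctm_loss(logits, target):
    """logits: (B, C, T) predictions at every internal tick; target: (B,) class labels."""
    B, C, T = logits.shape
    losses = torch.stack([F.cross_entropy(logits[:, :, t], target, reduction="none")
                          for t in range(T)], dim=-1)              # (B, T) per-tick losses
    probs = logits.softmax(dim=1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=1)    # (B, T)
    certainty = 1 - entropy / torch.log(torch.tensor(float(C)))    # normalised to [0, 1]
    t1 = losses.argmin(dim=-1)                                     # minimum-loss tick per example
    t2 = certainty.argmax(dim=-1)                                  # maximum-certainty tick per example
    idx = torch.arange(B)
    return 0.5 * (losses[idx, t1] + losses[idx, t2]).mean()
```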
More information in our Technical Report.
Please take a look at our Technical Report for more information.
Specifically, it includes additional information on how we enable the CTM to learn both short and long time dependencies.
Experiment: ImageNet
Demonstrations
Fig 4. Thinking about Images: Top left is the average attention weighting (of the 16 heads shown) when the CTM observes the image on the right. Class predictions are shown on the bottom left and the certainty is shown on the bottom right (green denotes a correct prediction). The small images at the bottom are buttons to load other examples, showing a diversity of certainties and correctness.
Results
Fig 5a. Top-5 Accuracies: using different mechanisms for predictions, the CTM achieves different levels of accuracy per internal tick (thought step). At about 15 ticks it makes sense to account for certainty.
Fig 5b. Calibration: often considered an important measure of how well a model fits the underlying data distribution, the CTM has remarkably good calibration.
Fig 5c. Certainty threshold=0.5: top-5 accuracy at this certainty threshold (black line, bottom right in the videos to the left).
Fig 5d. Certainty threshold=0.9: top-5 accuracy at this certainty threshold (black line, bottom right in the videos to the left).
This is a subset of results from our ImageNet experiments (see the Technical Report for more). Crucially, the CTM enables Adaptive Compute, where the internal steps (how much thought the CTM is putting into the problem) can be cut short. These figures show what can be expected in terms of accuracy when cutting thinking short. Only marginal gains are had past a certain point, but gains nonetheless.
Fig 4. shows where the CTM looks as it reasons about the data. We show the Attention Weights for all 16 heads and mark where the model is looking for each (and on average at the top). The predictions are shown on the bottom left and certainty over time on the bottom right. Fig 6. shows a visualization of Neural Activity as the CTM thinks about a single image: note the multi-scale structure and how activity seems to ‘flow’.
Fig 6. Neural activity: visualised in 2D using a UMAP projection. Each neuron is shown as an individual dot, scaling in size with absolute magnitude, and color with value (blue for negative, red for positive). We show similar visualizations inside later demonstrations.
Discussion
We never set out to train a model that achieved some remarkable new state-of-the-art performance on ImageNet. AI researchers already expect high performance on ImageNet after over a decade of research that uses it. Instead, we wanted to show just how different and interesting the CTM’s interaction with data can be. The videos on the left/above demonstrate the thought process the CTM undertakes and the figures show its benefits.
Let’s contextualize just what’s going on here: the CTM is looking around these images, all the while building up its prediction, all by using the synchronization of neural activity directly as a representation. The neural dynamics we showed earlier are actually examples of dynamics from a CTM observing ImageNet! The paths output by the CTM in the maze demo are akin to the class predictions made here.
The missing ingredient: TIME
Biological intelligence is still superior to AI in many cases [24, 25, 5, 26]. Biological brains solve tasks very differently to conventional neural networks, which might explain why this is the case. It might be that biological intelligence pays heed to time in ways that modern AI simply does not. In this work, we aimed to develop a model that approaches problem-solving in a manner more aligned with biological brains, emphasizing the central role of the precise timing and interplay of neural dynamics. The interpretable and intuitive outcome we point at in the video demonstrations is very exciting as it suggests that the CTM is indeed leveraging time to its advantage, in order to reason about data.
The details on model hyper-parameters can be found in the Technical Report.
Experiment: Solving 2D Mazes - doing it the hard way
The why and the how
Solving mazes is a challenging task for machines [27, 20, 28], where only the current bleeding-edge models perform well, and even then only on fairly simple mazes. Even so, existing methods either require careful design of the data/objective (e.g., outputs are images instead of a solution) or extensive tool use (e.g., the LLMs that perform well at this), indicating that the underlying intelligent reasoning required to solve a maze, step-by-step, is not evidenced by these approaches.
We trained a CTM on a new setup, requiring it to directly predict a path (truncated for simplicity) from start to finish in the form of steps: Left, Right, Up, Down, or Wait. A small version of the resultant model can be explored in the interactive demo at the top of this page. We show a demonstration of a larger model here. Remarkably, the attention pattern is intuitive and follows the solution, all while using neural synchronization as a representation. It even generalizes beyond the truncated path! See the Technical Report.
Demonstration
Fig 7. Thinking about mazes: each animation segment shows 75 internal ticks of the CTM when it is provided with the input maze. We show the route as it is constructed through the internal ‘thought process’, showing only the valid route (i.e., ignoring predictions through walls; see the associated toggle on the demo). The weights of 16 attention heads are shown at the bottom and the average is overlaid on the maze to show where the CTM is focusing. We ‘teleport’ the CTM to its resultant predicted location until it lands on the target and then load a new maze.
Results
Fig 8a. Accuracy during training: versus the best baselines we could get working. The CTM, shown in pink, gets nearly perfect validation accuracy.
Fig 8b. Accuracy versus path length: the baselines are certainly learning, but the CTM far outperforms them for longer paths.
Generalization
Each video below shows how well the CTM generalizes to bigger and more complex mazes, while retaining its reasoning prowess. To generate these we used a CTM trained to solve a path up to length 100 on 39 x 39 mazes, but the mazes shown here are of size 99 x 99 and the full paths are roughly 6x as long.
Discussion
Why run these experiments? We know that neural networks can be tailored to solve 2D mazes if we present the data in the “right” way. But, when presented in a fashion that requires a clear process through which the model must progress, existing methods fall short. Even current SoTA LLMs rely on tool use, which is impressive in its own right, but somewhat unsatisfying: an intelligent machine should be demonstrably intelligent, and humans don’t require tools to solve these mazes.
We set out to show that the CTM has the capacity to learn when complex reasoning is required, unlike the most comparable baseline methods. We also show how the CTM generalizes to larger and more complex mazes, indicating that its internal reasoning is not merely memorization, but rather a more natural and correct way to solve the underlying maze problem. Importantly, we made no specific structural changes to the model compared to the CTM we trained for ImageNet; the only meaningful structural change was to output the solution as a 2D class space, applying cross entropy for each step.
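As an illustration of that output head, here is a minimal sketch assuming a linear readout from the synchronization representation to per-step logits over the five movement classes, with cross-entropy applied at every step; the head shape, path length, and readout are assumptions based on the description above, not the reference implementation.

```python
# Unofficial sketch of a per-step classification head for maze paths.
import torch
import torch.nn as nn
import torch.nn.functional as F

ACTIONS = ["left", "right", "up", "down", "wait"]
PATH_LEN, D_OUT = 100, 64                          # truncated path length; synchronization pairs

head = nn.Linear(D_OUT, PATH_LEN * len(ACTIONS))   # readout from S_out^t to per-step action logits

def maze_loss(sync_out, target_path):
    """sync_out: (B, D_OUT) synchronization representation; target_path: (B, PATH_LEN) action ids."""
    logits = head(sync_out).view(-1, PATH_LEN, len(ACTIONS))
    return F.cross_entropy(logits.flatten(0, 1), target_path.flatten())
```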
A World Model
We chose our setup carefully: (1) we used no positional embedding for attention; and (2) we required that the models predict the routes directly as a string of classes (e.g., go left, left, right, up, etc.). By forgoing positional embedding the CTM must build an internal world model in order to query the data and navigate the maze. The fact that it does so in such a convincing fashion is remarkable.
Where to go from here?
We have some strong evidence that the CTM is capable of solving challenging problems, and it does so in intuitive and interesting ways. The fact that it can solve mazes by building an internal world model “on the fly” without any positional embedding opens up avenues for future research. For instance, we would like to see how the CTM finds its way around more complex environments (e.g., games or videos) without any explicit positional encodings.
Experiment: Parity
Sequential data, non-sequentially
The parity of a binary sequence, given by the sign of the product of its elements, can reasonably be predicted by an RNN when the data is fed sequentially - the model need only maintain an internal state, flipping a ‘switch’ whenever a negative number is encountered. When the entire sequence is provided at once, however, the task is significantly more challenging [29].
We trained CTMs to solve a variant of this parity task: the model is input with a 64-length binary vector, and must predict the cumulative parity at each of the 64 positions.
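The task data itself is simple to construct; the sketch below generates a batch of length-64 sequences of ±1 values and their cumulative-parity targets (encoding negative cumulative parity as class 1 is our arbitrary choice):

```python
# Sketch of the cumulative-parity task data described above.
import torch

def parity_batch(batch_size=32, length=64):
    x = torch.randint(0, 2, (batch_size, length)) * 2 - 1   # entries in {-1, +1}
    cumulative = torch.cumprod(x, dim=1)                    # running product up to each position
    y = (cumulative < 0).long()                             # 1 where the cumulative parity is negative
    return x.float(), y

x, y = parity_batch()
print(x.shape, y.shape)   # torch.Size([32, 64]) torch.Size([32, 64])
```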
Demonstration
Fig 9. Determining the cumulative parity of a sequence: shown are the movements of the attention weights from each of the 8 heads. Overlaid on the input sequences is the trajectory of the attention weight argmax. The larger sequences depict the model’s predictions and the targets.
Results
Fig 10a. Accuracy during training: versus the LSTMs, averaged over 3 training runs. The best model, using 75 internal ticks, achieves perfect accuracy in some runs.
Fig 10b. Accuracy versus thinking time: more internal ticks leads to higher accuracy.
We compare the accuracy of CTMs trained with different numbers of internal ticks to parameter matched LSTMs. We found that CTMs with over 75 internal ticks could reliably solve this task, with some runs achieving 100% accuracy. The LSTMs, on the other hand, struggled to learn with over 10 internal ticks, suggesting that LSTMs are not well suited to unfolding an internal thought dimension.
The left/above demonstration shows the solving process of the CTM: the movement of the attention weights, as well as their argmax overlaid on the inputs, the model’s predictions, the target, and the neuron activations. Notice how the attention moves backwards through the data and determines the solution after observing the entire input. Some attention heads display interpretable behavior, such as the first attention head, which attends only to negative-parity positions.
Learning sequential algorithms
We visualise the learned algorithms by plotting the accuracy (top) and attention weights (bottom) over the 75 internal ticks for each position in the 64-length sequence, at different points during training. One model (left) attends to the data in reverse order before predicting the cumulative parity at once; the other attends forward, predicting parity incrementally. Both achieve perfect accuracy.
The ability of the CTM to search through the data in reverse order suggests that the CTM is carrying out some form of planning, building up its understanding of the data before making a final decision: the CTM is capable of forming and following a strategy.
Fig 11a. 75-Internal Tick CTM 1: learns to attend to the data in reverse order, predicting the parity at the end of the reasoning process.
Fig 11b. 75-Internal Tick CTM 2: learns to attend from beginning to end, and with it, increasing its certainty in each prediction.
Experiment: Q&A MNIST
Memory via Synchronization
To assess the CTM’s ability to memorise and recall information, we design a Question and Answering (Q&A) MNIST task. In this task, the model first observes a sequence of MNIST digits, followed by a series of interleaved index and operator embeddings that specify which digits should be recalled and which modular operation should be applied. Once all digits and index/operator embeddings have been presented, a zero-tensor flag signals the model to produce its final answer. An example is shown below.
Fig 12. Q&A MNIST example: a typical sequence observed by the model.
In our experiments, the memory length of the CTMs is such that the MNIST digits will always lie outside of the activation history window used by the neuron-level models. In this way, the CTM must organize its activations such that it can recall digits at later timesteps.
Demonstration
Fig 13. Observing digits and answering questions: the model is shown MNIST digits followed by operator and index embeddings, which specify the modular operation shown at the top. Also shown are the attention weights for the digits and the model’s predictions.
Results
Fig 14. Accuracy during training: for both CTMs and LSTMs trained with 1 internal tick per input and 10 internal ticks per input.
Our results show that, while the LSTM outperforms the CTM when only a single internal tick is used to process each input, the LSTM becomes more unstable when more internal ticks are used. The CTM, on the other hand, exhibits stronger performance with increasing internal ticks, achieving over 95% accuracy in the most challenging in-distribution task.
Furthermore, we highlight the ability of the CTM to recall digit values observed many timesteps in the past, arising purely from the organization and synchronization of neurons. This strong performance suggests that processing timing information through the synchronization of neuron activations may be a powerful mechanism for memorization and recall.
Generalization
We examine the generalization capabilities of the CTM by measuring the accuracy of the model when input with more digits or index-operator embeddings than observed during training, depicted below, with the training regime marked in red. We find that both the CTM and the LSTM baseline can generalize to an increased number of operations. Empirically, we find that this generalization arises from the model’s approach to solving the task: each time a new index embedding is presented, the model computes and stores the result of the specified operation, regardless of whether the answer flag has been given. This enables it to continue processing a stream of index and operator embeddings without needing to wait for a final signal.
Fig 15. Generalization: accuracy of the CTM and LSTM for different numbers of input digits and operations. The red line indicates the training regime. For the CTM, performance scales with the number of internal ticks, while the converse is true for the LSTM.
Additional experiments
CTM versus humans
In this section we test the CTM on CIFAR-10, contextualizing its performance alongside human performance, a standard feed-forward baseline, and an LSTM baseline that can also use internal ticks for reasoning. We used a restricted backbone to highlight the differences between models (details in the Technical Report).
We used two datasets of human labels for CIFAR-10; we call these CIFAR-10D [30], owing to its calibration of difficulty levels, and CIFAR-10H [31], originally used to quantify human uncertainty. CIFAR-10D can be found here and CIFAR-10H can be found here.
Fig 16a. Accuracy curves during training: using parameter-matched models, the CTM generalizes best. One of the seeds had lower accuracy initially but it recovered and, interestingly, outperformed all others.
Fig 16b. Calibration plots: for all models and humans. We show calibration at each step of thought for the CTM, taking the average probability up to that step for computing these. The CTM even shows better calibration than the humans.
Fig 16c. CIFAR-10D difficulty plots: showing how the CTM performs best at predicting difficult classes, perhaps benefiting from additional “time to think”.
Fig 16d. LSTM pseudo "reaction times": computed as (1 - the average certainty) over internal ticks, measured against real human reaction times from CIFAR-10H.
Fig 16e. CTM pseudo "reaction times": while not any ‘better’ than the LSTM, this shows an interesting pattern where the CTM reacts more ‘quickly’ to challenging data.
For the human calibration we used the probabilities provided in CIFAR-10H, which were computed from the guesses of multiple human annotators. We computed calibration (Fig 16b.) as we did for ImageNet: we compute the predictive probability as the average probability for the chosen class over all internal ticks (for both the CTM and the LSTM). The CTM demonstrates the best calibration, even when compared to humans.
Fig 17. CTM (left) and LSTM (right) neural dynamics: over 50 internal ticks. We show dynamics from other data points in the background to show how diverse these can be for the CTM. The dot products between pairs of vectors like these (not necessarily exactly these ones) are the synchronization, and that is the representation the CTM uses to predict the classes.
Fig 17. shows the neural activities for the CTM and the LSTM baseline. The CTM yields rich, diverse, and complex dynamics with multiple interesting features, including periodic behavior (there is no periodic driving function). The distinct difference between the CTM and LSTM neural activities is evidence that the two novel elements of the CTM (neuron-level models and synchronization as a representation) enable neural dynamics as a fundamental computational mechanic.
CIFAR-100, ablation studies
Fig 18. shows what happens when we vary the number of neurons (i.e., the model width) while keeping all else constant, including the training time. As with other models, a wider network could evidently benefit from a longer training time or different training hyper-parameters, hence the reduction in accuracy in Fig 18a. For Fig 18b. and Fig 18c. we set out to understand how unique the neuron-level models tend to be, and how this relates to model width, as measured by the cosine similarity between the dynamics of different neurons. Fig 18b. shows that with a wider model (i.e., more neurons), we see more diversity instead of less. One might expect that with more neurons there is less ‘space’ for diversity, but we observed the opposite.
Fig 18a. Accuracy versus model width: when trained on CIFAR-100. Each model had equal training, indicating that the wider models could benefit from more training.
Fig 18b. Neuron similarity across data: averaged over all neurons, showing how a wider model yields more diverse neurons instead of more overlap (which might be expected).
Fig 18c. Neuron similarity across neurons: averaged over data, showing a slightly reduced similarity for wider models.
Fig 19. shows the relationship between predictions and the number of internal ticks used by the CTM. We trained several CTMs (again keeping all other variables constant). In Fig 19b. we plot the distributions of the data over which steps the CTM is most certain (i.e., t2 in the loss function). What this shows is that the CTM uses a wide range of steps to become most certain about the data it observes. For each setup (25, 50 and 100 internal ticks), there are two concentrated areas in the distributions, indicating that the CTM is following separate internal processes depending on the data.
Fig 19a. Accuracy versus internal ticks: suggesting that models with more internal ticks might benefit from longer training.
Fig 19b. Histogram of most certain indices: for models trained using 25, 50, and 100 internal ticks. In each case there is a double ‘hump’ in the distributions of certainties, meaning that the CTM might be following two different internal processes depending on the data.
Sorting real numbers
For these experiments we trained a CTM to sort 30 real numbers drawn from N(0, I30). The purpose of this experiment was twofold: (1) to understand if and when the CTM applies more or less compute in a controlled environment; and (2) to see whether we could train the CTM to output a sequence in sequential order using the CTC loss. This CTM could correctly sort a list of 30 real numbers approximately 80% of the time.
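As a hedged sketch of how such a CTC objective can be set up with PyTorch's nn.CTCLoss, the code below assumes the model emits, at every internal tick, a distribution over the 30 item indices plus a blank token that we interpret as the ‘wait’ action; the actual output parameterisation used in the experiments may differ.

```python
# Sketch of a CTC objective for emitting a sorted order over internal ticks
# (the blank-as-wait interpretation is an assumption).
import torch
import torch.nn as nn

N_ITEMS, T, BATCH = 30, 75, 8
ctc = nn.CTCLoss(blank=N_ITEMS)                                   # class N_ITEMS acts as the wait/blank token

log_probs = torch.randn(T, BATCH, N_ITEMS + 1).log_softmax(-1)    # (T, B, C): per-tick emissions (random stand-in)
values = torch.randn(BATCH, N_ITEMS)
targets = values.argsort(dim=1)                                   # sorted order as a sequence of item indices
input_lengths = torch.full((BATCH,), T, dtype=torch.long)
target_lengths = torch.full((BATCH,), N_ITEMS, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
```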
Fig 20a. Mean wait times per sequence index: measured as internal ticks, showing an interesting emergent behavior where the CTM first waits (i.e., does internal compute), then outputs consistently, before waiting again near the end.
Fig 20b. Wait times versus gap to previous item: showing the relationship between how much compute the CTM applies compared to the gap between sorted items.
Fig 20c. Generalizing beyond training distribution: showing sorting performance for different Gaussian distributions (it was trained using a Normal distribution).
Fig 20d. Sorting demonstration: showing the delta from mean of wait times for each item (plotted in sorted order, color denoting original order using a rainbow colormap). The CTM tends to require more compute when there is a larger gap between points.
Reinforcement Learning
We have shown that the CTM can process non-sequential data via a continuous thought dimension. Here, we extend the CTM to tasks involving interaction with an external environment, training CTMs with proximal policy optimization [32] to solve a navigation task and partially observable variants of CartPole and Acrobot [33, 34]. In this setting, the CTM receives an observation, processes it using a fixed number of internal thought steps, and outputs the next action. The history of activations is continuous across environment steps, such that activations from past environment steps can affect the present decision-making process.
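As a rough sketch of that interaction loop, the code below carries the recurrent state (post-activations plus the pre-activation history) across environment steps, reusing internal_tick and the sizes D, M, D_INPUT from the earlier sketch; encode_obs and policy_head are illustrative stand-ins (the real model reads its outputs from the synchronization representation), so this is a structural illustration rather than the experimental setup.

```python
# Sketch of a CTM policy whose activation history persists across env steps.
import torch
import torch.nn as nn

N_ACTIONS, OBS_DIM = 3, 7 * 7 * 3                   # e.g., a flattened MiniGrid-style observation
encode_obs = nn.Linear(OBS_DIM, D_INPUT)            # stand-in observation encoder
policy_head = nn.Linear(D, N_ACTIONS)               # stand-in action readout

def act(obs, state, ticks_per_step=5):
    """obs: (B, OBS_DIM); state = (z, history) carried over from the previous env step."""
    z, history = state
    o = encode_obs(obs)
    for _ in range(ticks_per_step):                 # fixed number of internal ticks per observation
        z, history = internal_tick(z, o, history)
    action = torch.distributions.Categorical(logits=policy_head(z)).sample()
    return action, (z, history)                     # the activation history persists across env steps
```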
Fig 21a. CTM solving the MiniGrid Four Rooms task: evidencing that the CTM can leverage a continuous history of activations to interact with the world.
Fig 21b. Training curves: for this navigation task (episode length during training). Although the LSTM learns slightly faster, both solve the task and converge to the same average episode length.
Although our results show that the CTM achieves comparable performance to the LSTM baseline, the central goal of this section is to provide evidence that the CTM can learn in a continuous environment.
Conclusion
The Continuous Thought Machine (CTM) represents a novel step towards bridging computational efficiency with biological plausibility in artificial intelligence. By moving beyond traditional pointwise activation functions to private neuron-level models, the CTM cultivates far richer neuron dynamics. Crucially, it leverages neural synchronization as a powerful and fundamentally new type of representation - distinct from the activation vectors prevalent since the early days of neural networks. This direct use of neuron dynamics as a first-class representational citizen allows the CTM to exhibit behaviors qualitatively different from contemporary models.
Our research demonstrates the tangible benefits of this approach. The CTM can dynamically build representations over time for tasks like image classification, form rich internal maps to attend to specific input data without positional embeddings, and naturally exhibit adaptive computation. Furthermore, it learns to synchronize neural dynamics to store and retrieve memories beyond its immediate activation history. This internal processing also lends itself to greater interpretability, as seen in its methodical solving of mazes and parity tasks.
Remarkably, the core CTM architecture remained largely consistent across a diverse range of challenging tasks, requiring only input/output module adjustments. This versatility and trainability were particularly evident in complex scenarios like maze navigation. The CTM succeeded with minimal tuning, where a traditional model like the LSTM still struggled even after significant tuning efforts.
This work underscores a vital, yet often underexplored, synergy between neuroscience and machine learning. While modern AI is ostensibly brain-inspired, the two fields often operate in surprising isolation. The CTM serves as a testament to the power of drawing inspiration from biological principles. By starting with such inspiration and iteratively following the emergent, interesting behaviors, we developed a model with unexpected capabilities, such as its surprisingly strong calibration in classification tasks, a feature that was not explicitly designed for.
It is crucial to note that our approach advocates for borrowing concepts from biology rather than insisting on strict, literal plausibility; real neurons may not access their activation history as modeled in the CTM, yet emergent phenomena like traveling waves still manifest. This nuanced balance between practicality and biological inspiration opens a landscape of new research directions, which may hold the key to unlocking capabilities currently missing in AI, potentially leading to systems that exhibit more human-like intelligence and address its current limitations.
When we initially asked, “why do this research?”, we hoped the journey of the CTM would provide compelling answers. By embracing light biological inspiration and pursuing the novel behaviors observed, we have arrived at a model with emergent capabilities that exceeded our initial designs. We are committed to continuing this exploration, borrowing further concepts to discover what new and exciting behaviors will emerge, pushing the boundaries of what AI can achieve.
Acknowledgements
Citation
For attribution in academic contexts, please cite this work as
Luke Darlow, Ciaran Regan, Sebastian Risi, Jeffrey Seely, and Llion Jones. (2025). Continuous Thought Machines. Sakana AI Technical Report.
BibTeX citation
@techreport{darlow2025ctm,
author = {Luke Darlow and Ciaran Regan and Sebastian Risi and Jeffrey Seely and Llion Jones},
title = {{Continuous Thought Machines}},
institution = {Sakana AI},
year = {2025},
month = {April},
note = {Technical Report}
}
Please view the PDF version of the paper for the appendix, which contains additional details and experiments.
References
Deep learning LeCun, Y., Bengio, Y. and Hinton, G., 2015. nature, Vol 521(7553), pp. 436—444. Nature Publishing Group UK London.
Deep learning Goodfellow, I., Bengio, Y., Courville, A. and Bengio, Y., 2016. , Vol 1(2). MIT press Cambridge.
Emergent abilities of large language models Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D. and others, ., 2022. arXiv preprint arXiv:2206.07682.
Spike timing—dependent plasticity: a Hebbian learning rule Caporale, N. and Dan, Y., 2008. Annu. Rev. Neurosci., Vol 31(1), pp. 25—46. Annual Reviews.
Building machines that learn and think like people Lake, B.M., Ullman, T.D., Tenenbaum, J.B. and Gershman, S.J., 2017. Behavioral and brain sciences, Vol 40, pp. e253. Cambridge University Press.
Deep learning: A critical appraisal Marcus, G., 2018. arXiv preprint arXiv:1801.00631.
On the measure of intelligence Chollet, F., 2019. arXiv preprint arXiv:1911.01547.
Time is of the essence: neural codes, synchronies, oscillations, architectures Cariani, P. and Baker, J.M., 2022. Frontiers in Computational Neuroscience, Vol 16, pp. 898829. Frontiers Media SA.
On the relevance of time in neural computation and learning Maass, W., 2001. Theoretical Computer Science, Vol 261(1), pp. 157—178. Elsevier.
Long short-term memory Hochreiter, S. and Schmidhuber, J., 1997. Neural computation, Vol 9(8), pp. 1735—1780. MIT press.
Gate-variants of gated recurrent unit (GRU) neural networks Dey, R. and Salem, F.M., 2017. 2017 IEEE 60th international midwest symposium on circuits and systems (MWSCAS), pp. 1597—1600.
Recurrent neural networks: design and applications Medsker, L. and Jain, L.C., 1999. CRC press.
Attention is all you need Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L. and Polosukhin, I., 2017. Advances in neural information processing systems, Vol 30.
Perceiver: General perception with iterative attention Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A. and Carreira, J., 2021. International conference on machine learning, pp. 4651—4664.
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach Geiping, J., McLeish, S., Jain, N., Kirchenbauer, J., Singh, S., Bartoldson, B.R., Kailkhura, B., Bhatele, A. and Goldstein, T., 2025. arXiv preprint arXiv:2502.05171.
Looped transformers are better at learning learning algorithms Yang, L., Lee, K., Nowak, R. and Papailiopoulos, D., 2023. arXiv preprint arXiv:2311.12424.
Meta learning backpropagation and improving it Kirsch, L. and Schmidhuber, J., 2021. Advances in Neural Information Processing Systems, Vol 34, pp. 14122—14134.
Structurally Flexible Neural Networks: Evolving the Building Blocks for General Agents Pedersen, J., Plantec, E., Nisioti, E., Montero, M. and Risi, S., 2024. Proceedings of the Genetic and Evolutionary Computation Conference, pp. 1119—1127.
Introducing symmetries to black box meta reinforcement learning Kirsch, L., Flennerhag, S., Van Hasselt, H., Friesen, A., Oh, J. and Chen, Y., 2022. Proceedings of the AAAI Conference on Artificial Intelligence, Vol 36(7), pp. 7202—7210.
Can you learn an algorithm? generalizing from easy to hard problems with recurrent networks Schwarzschild, A., Borgnia, E., Gupta, A., Huang, F., Vishkin, U., Goldblum, M. and Goldstein, T., 2021. Advances in Neural Information Processing Systems, Vol 34, pp. 6695—6706.
U-net: Convolutional networks for biomedical image segmentation Ronneberger, O., Fischer, P. and Brox, T., 2015. Medical image computing and computer-assisted intervention—MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pp. 234—241.
Neural synchrony in cortical networks: history, concept and current status Uhlhaas, P., Pipa, G., Lima, B., Melloni, L., Neuenschwander, S., Nikolic, D. and Singer, W., 2009. Frontiers in integrative neuroscience, Vol 3, pp. 543. Frontiers.
Deep residual learning for image recognition He, K., Zhang, X., Ren, S. and Sun, J., 2016. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770—778.
Arc prize 2024: Technical report Chollet, F., Knoop, M., Kamradt, G. and Landers, B., 2024. arXiv preprint arXiv:2412.04604.
Humanity’s Last Exam[link] Phan, L., Gatti, A., Han, Z., Li, N., Hu, J., Zhang, H., Zhang, C.B.C., Shaaban, M. and others, ., 2025.
Brain-inspired Artificial Intelligence: A Comprehensive Review Ren, J. and Xia, F., 2024. arXiv preprint arXiv:2408.14811.
T-SCEND: Test-time Scalable MCTS-enhanced Diffusion Model Zhang, T., Pan, J., Feng, R. and Wu, T., 2025. arXiv preprint arXiv:2502.01989.
End-to-end algorithm synthesis with recurrent networks: Extrapolation without overthinking Bansal, A., Schwarzschild, A., Borgnia, E., Emam, Z., Huang, F., Goldblum, M. and Goldstein, T., 2022. Advances in Neural Information Processing Systems, Vol 35, pp. 20232—20242.
Adaptive computation time for recurrent neural networks Graves, A., 2016. arXiv preprint arXiv:1603.08983.
CIFAR10 to compare visual recognition performance between deep neural networks and humans Ho-Phuoc, T., 2018. arXiv preprint arXiv:1811.07270.
Human uncertainty makes classification more robust Peterson, J.C., Battleday, R.M., Griffiths, T.L. and Russakovsky, O., 2019. Proceedings of the IEEE/CVF international conference on computer vision, pp. 9617—9626.
Proximal policy optimization algorithms Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O., 2017. arXiv preprint arXiv:1707.06347.
Minigrid & Miniworld: Modular & Customizable Reinforcement Learning Environments for Goal-Oriented Tasks Chevalier-Boisvert, M., Dai, B., Towers, M., Lazcano, R.d., Willems, L., Lahlou, S., Pal, S., Castro, P.S. and Terry, J., 2023. CoRR, Vol abs/2306.13831.
Gymnasium: A Standard Interface for Reinforcement Learning Environments[link] Towers, M., Kwiatkowski, A., Terry, J., Balis, J.U., Cola, G.D., Deleu, T., Goulão, M., Kallinteris, A., Krimmel, M., KG, A., Perez-Vicente, R., Pierré, A., Schulhoff, S., Tai, J.J., Tan, H. and Younis, O.G., 2024.