hi, i’m benjamin
I’m a researcher focused on building AI that’s aligned with human values. To further that goal, I’m currently a contract researcher at OpenAI where I work on model evaluations and monitoring. I’m also a Research Fellow in the Transportation and Land Use program at the Marron Institute at New York University, where I apply machine learning to study urban governance issues related to transportation, government responsiveness, policing and use of public space.
You can e-mail me if you’d like to chat about AI safety, new projects, urbanism, music or ways to make the world a better place for all humans.
AI
I was the lead author on a paper studying chain-of-thought monitoring in AI Control. It has inspired some follow-up work and has been cited in some cool papers. I continued this work at OpenAI with research on chain-of-thought monitorability and how it could be an important part of safety cases for highly capable AI systems.
I’ve recently written about how to accurately track model capabilities and improve evaluations. I care a lot about building the field of AI safety, and I mentor researchers when I can. Recent projects included researching dangerous capability evals and methods to improve AI Control.
Urbanism
Read my latest article, where I highlight a chronic issue in New York City and how it’s emblematic of the challenges of crafting effective policies when the underlying data is tainted. This piece is largely based on this research paper, where I used machine learning to study how police respond to illegal parking complaints. It paints a dire picture in which the majority of cases are ignored, and it offers concrete solutions to make streets safer.
This built on earlier work I did using computer vision and official data to understand the chaotic nature of New York City streets.
I like learning
I had a great experience attending a programming residency at the Recurse Center, where I worked on projects related to technical AI safety. If you love programming, you should probably apply.
I like tinkering
I was tired of checking my phone for subway departures, so I built an LED arrival board for my living room. I spent a night at NYC’s Museum of Modern Art watching a film that made me think a lot about time.
Published Work
Why SWE-bench Verified no longer measures frontier coding capabilities
CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring. In Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Unfaithful Reasoning Can Fool Chain-of-Thought Monitoring
Why Does the NYPD Ignore So Many Parking Complaints?
Eye in the Sky: Harnessing AI to Monitor Police Response to Illegal Parking Complaints