Articles by mustaphah
166

SkillsBench: Benchmarking how well agent skills work across diverse tasks (arxiv.org)

61

Evaluating AGENTS.md: are they helpful for coding agents? (arxiv.org)

3

Cursor: Expanding our long-running agents research preview (cursor.com)

1

METR: A simpler AI timelines model predicts 99% AI R&D automation in ~2032 (metr.org)

1

Measuring Time Horizon Using Claude Code and Codex (metr.org)

1

SWE-ContextBench: context learning benchmark in coding (arxiv.org)

2

SWE-AGI: benchmarking spec-driven software construction (arxiv.org)

1

Code Formatting Silently Consumes Your LLM Budget (arxiv.org)

1

Agent Trace by Cursor: open spec for tracking AI-generated code (agent-trace.dev)

1

METR releases Time Horizon 1.1 with 34% more tasks (metr.org)

2

Coffee timing isn't one-size-fits-all (examine.com)

1

ChatGPT subscription support in Kilo Code (kilo.ai)

2

Imposter Syndrome Predicts Perfectionism (psypost.org)

2

Motivation acts as a camera lens that shapes how memories form (psypost.org)

3

Claude Code: Merging Slash Commands into Skills (x.com)

2

The visual feedback tool for coding agents (agentation.dev)

1

Agent Skills to help developers using AI agents with Supabase (github.com/supabase)

2

METR AI Benchmark: Clarifying Limitations of Time Horizon (metr.org)

87

Scaling PostgreSQL to power 800M ChatGPT users (openai.com)

2

Claude Code plugin that rings your phone when a run needs you (github.com/zeframlou)

200

Exercise can be nearly as effective as therapy for depression (sciencedaily.com)

1

Party of One for Code Review (tidyfirst.substack.com)

4

FrontierScience Benchmark by OpenAI (openai.com)

1

Open Scouts: AI-driven web monitoring (firecrawl.dev)

2

A Rosetta Stone for AI Benchmarks (epoch.ai)

1

Study: Effects of LLMs versus Web Search on Depth of Learning (ssrn.com)

1

Learning via ChatGPT leads to shallower knowledge than using Google search (psypost.org)

5

Stop Hiring for Languages. Start Hiring Great Engineers (medium.com/jonathans-musings)

2

Artificial Analysis: Claude Opus 4.5 is the #2 most intelligent model (artificialanalysis.ai)

2

Large-scale trial finds 4-day workweek improves employee well-being (psypost.org)

1

mgrep: searching codebases with embeddings (github.com/mixedbread-ai)

2

Benchmarking LLMs at the Frontier of Physics (artificialanalysis.ai)

1

ChatGPT's social trait judgments align with human impressions (psypost.org)

1

Show HN: Visual GraphQL Query Builder (hadid.dev)

2

The Learning Loop and LLMs (martinfowler.com)

1

Is OpenAI becoming too big to fail? (msn.com)

2

AI and reverse Dunning-Kruger effect (sciencedirect.com)

1

MiniMax M2: open model for agents and code (minimaxi.com)

3

Study finds a shift toward liberal politics after leaving religion (psypost.org)

1

Hunting the body's hidden sixth sense (sciencedaily.com)

1

METR review of OpenAI's GPT-OSS fine-tuning safety methodology (metr.org)

5

Study: poor sleep linked to faster brain aging (psypost.org)

34

Researchers complete first human trial on viability of enteral ventilation (newatlas.com)

2

Gemini CLI Tips and Tricks (github.com/addyosmani)

2

Curl: which host, which protocol (haxx.se)

4

Prosper data breach impacts 17.6M accounts (bleepingcomputer.com)

3

The Pragmatic Engineer 2025 Survey: What's in your tech stack? Part 3 (pragmaticengineer.com)

3

Can AI replace junior workers? (economist.com)

1

Summary of Google's latest AI news (blog.google)

5

Free CDN for open-source projects (bunny.net)

10

When money is abundant, knowledge is the real wealth (2020) (lesswrong.com)

2

WindowMode: Webcam-based 3D window illusion using head tracking (github.com/true3dlabs)

8

The Czech Trump wins an election, again (economist.com)

2

Good Books (goodbooks.io)

1

What LLMs teach us about intelligence (every.to/chain-of-thought)

2

Building AI for Cyber Defenders (anthropic.com)

2

Vercel Domains (vercel.com)

1

Instruct: Build AI workflow agents with prompts (powerful-audience-885319.framer.app)

2

Anthropic releases official Claude Agent SDK for Python (github.com/anthropics)

2

Karpathy: LLMs are "ghosts," not "animals" (karpathy.bearblog.dev)

1

Glazed: AI turns Figma designs into tracking code for analytics (glazedanalytics.com)

4

Musk says xAI building "Grokipedia" after criticizing Wikipedia (thehill.com)

2

Fight for Open (ma.tt)

2

Claude and Slack (anthropic.com)

2

Vibe Coding Award (vibecodingaward.com)

104

Cursor 1.7 (cursor.com)

1

Miniffi: Call Rust code from JavaScript/Swift/C++ with minimal setup (github.com/evanw)

2

Imagine with Claude: build working software and UI on the fly [video] (youtube.com)

3

Your web images are probably oversized (reasonunderpressure.com)

1

Sidekick.nvim: AI CLI and Copilot edits inside Neovim (github.com/folke)

1

Sidekick.nvim: AI CLI and Copilot Edits Inside Neovim (github.com/folke)

4

Autism may be the price of human intelligence (sciencedaily.com)

3

Linkie: Free, unbranded Linktree alternative with unlimited links (linkie.bio)

1

Store passwords in one table with foreign keys (satire) (twitter.com/imsh4yy)

3

Do pushups... or we'll block your commits (gitpushups.com)

1

Every URL is RSSible (hadid.dev)

5

GitHub Copilot CLI: The Copilot coding agent in the terminal [public preview] (github.com/github)

1

Off-peak GPU hours give EU/AU engineers an edge with AI tools (seangoedecke.com)

1

Tiny JSON parsing library – 150 lines of C99 (github.com/rxi)

45

Feedmaker: URL + CSS selectors = RSS feed (feedmaker.fly.dev)

1

Facebook Research releases MapAnything, 3D reconstruction from images (github.com/facebookresearch)

3

Google releases AP2: open protocol for AI-driven payments (cloud.google.com)

3

GitHub Action to catch unsafe NPM package updates in lockfiles (github.com/danielroe)

2

Spec Kit by GitHub: turn natural-language specs into actionable dev steps (github.com/github)

2

Learn Your Way: transform content into interactive lessons by Google (withgoogle.com)

2

Transparency done right: Buttondown's OSS stack and donations (buttondown.com/open-source)

1

IBM's watsonx explores using LLMs to judge other LLMs [video] (youtube.com)

2

Building MCP servers with Docker: NetworkChuck's tutorial and starter kit (github.com/thenetworkchuck)

2

ROMA: Meta-agents with task decomposition, backed by benchmark wins (github.com/sentient-agi)

162

Top model scores may be skewed by Git history leaks in SWE-bench (github.com/swe-bench)

1

AI cut new hire onboarding from 91 to 49 days (getdx.com)

3

Browser extension gives Claude the ability to think step by step (github.com/richards199999)

1

CLI-based multi-agent trading system using LLMs (github.com/tauricresearch)

2

Liquid Glass Component for React Native (github.com/callstack)

1

Dev3000 – The browser for AI-based development by Vercel (vercel.sh)

1

Unofficial fork of Microsoft's VibeVoice after repo withdrawal (github.com/vibevoice-community)

1

A lightweight, browser-based Ethernet cable connection manager (github.com/bijomaru78)

1

An open-source and self-hostable alternative to Vercel (devpu.sh)

3

Ora: Fast, secure, and beautiful browser built for macOS (early beta) (github.com/the-ora)

21

A desktop environment without graphics (tmux-like) (github.com/julien-cpsn)