yjyddq/RiOSWorld
[NeurIPS 2025] Official repository of RiOSWorld: Benchmarking the Risk of Multimodal Computer-Use Agents
RiOSWorld is a benchmark for evaluating safety risks in multimodal computer-use agents through realistic desktop environment interactions; its evaluation trajectories are released on Hugging Face. It uses virtualized desktop environments (VMware or Docker) as execution sandboxes and integrates with OSWorld's infrastructure for standardized task setup and metrics collection. It also includes attack simulation utilities and automated risk evaluation pipelines that assess how agents respond to phishing, credential theft, and other adversarial scenarios.
Stars: 117
Forks: 6
Language: HTML
License: (not listed)
Category: (not listed)
Last pushed: Dec 02, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/agents/yjyddq/RiOSWorld"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
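The same request can be made from a script. The following is a minimal Python sketch using only the standard library. It assumes the endpoint returns a JSON body, which the page does not confirm, and the helper names `build_url` and `fetch_quality` are hypothetical, chosen for illustration. How an API key is supplied is not documented here, so the sketch covers only the keyless tier.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/agents"


def build_url(repo):
    """Return the endpoint URL for an 'owner/name' repository slug."""
    owner, name = repo.split("/", 1)
    # Percent-encode each path segment so unusual characters stay valid.
    return f"{BASE_URL}/{quote(owner, safe='')}/{quote(name, safe='')}"


def fetch_quality(repo, timeout=10.0):
    """Fetch the record for `repo` (hypothetical helper).

    Assumption: the response body is JSON; its fields are not documented
    on this page, so the parsed value is returned as-is.
    """
    with urllib.request.urlopen(build_url(repo), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Each call counts against the 100 requests/day keyless limit.
    print(json.dumps(fetch_quality("yjyddq/RiOSWorld"), indent=2))
```

Because the keyless tier is capped at 100 requests/day, caching the parsed result locally is a sensible choice for anything that polls more than occasionally.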
Higher-rated alternatives
StonyBrookNLP/appworld
🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and...
qualifire-dev/rogue
AI Agent Evaluator & Red Team Platform
future-agi/ai-evaluation
Evaluation Framework for all your AI related Workflows
microsoft/WindowsAgentArena
Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of...
agentscope-ai/OpenJudge
OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards