yjyddq/RiOSWorld

[NeurIPS 2025] Official repository of RiOSWorld: Benchmarking the Risk of Multimodal Computer-Use Agents

/ 100

Emerging

Provides a comprehensive benchmark for evaluating safety risks in multimodal computer-use agents through realistic desktop environment interactions, with evaluation trajectories released on HuggingFace. Uses virtualized desktop environments (VMware or Docker) as execution sandboxes and integrates with OSWorld's infrastructure for standardized task setup and metrics collection. Includes attack simulation utilities and automated risk evaluation pipelines to assess how agents respond to phishing, credential theft, and other adversarial scenarios.

117 stars.

No License · No Package · No Dependents

Maintenance 6 / 25

Adoption 10 / 25

Maturity 7 / 25

Community 9 / 25

Stars: 117

Forks:

Language: HTML

License: —

Featured in

You're Shipping AI You Can't Measure

Higher-rated alternatives

StonyBrookNLP/appworld

🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and...

qualifire-dev/rogue

AI Agent Evaluator & Red Team Platform

future-agi/ai-evaluation

Evaluation Framework for all your AI related Workflows

microsoft/WindowsAgentArena

Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of...

agentscope-ai/OpenJudge

OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards
