AI BENCHY
Benchmark Methodology
This page explains our benchmarking approach at a high level. To preserve test integrity, we keep the exact prompts and grading internals private.
Tests
Questions are drawn largely at random from a variety of tasks and domains. Statistically, a better model should outperform a weaker model on average on a random, non-cherry-picked task. My background is in competitive programming, so thinking about tests and edge cases comes naturally to me.
This is not a standardized "IQ" value. The score has no unit; it is simply an arbitrary value showing how well a model performs across the whole test suite (correct answers + consistency). I do not cherry-pick models, nor do I modify tests to suit any particular model. When a new test occurs to me, I add it, retest all models, and recalculate the scores.
Questions usually grow out of simple ideas, such as: "I want to see how models do when asked to do X, Y, or Z." Example: "Respond with the two equal natural numbers, a and b, that when added together have the total = 2. Respond in this exact format: a,b". Some AIs may give a wrong answer, such as "2,2". Some fail to follow the requirement that the numbers be equal, such as "0,2". Some ignore the output format, such as "The answer is a = 1 and b = 1". And some simply give the correct answer, "1,1".
Some tests are more complex than this, but you get the gist. This does not favor any specific model, and these questions are generally quite easy for humans. If Claude outputs something like "**1**, **1**", adding markdown highlighting, while most other models follow the required format correctly, that is hardly my fault.
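To make the example concrete, here is a minimal sketch of how a strict grader for that question could look. The function name, the regex, and the choice to treat natural numbers as starting at 1 are my assumptions for illustration, not the benchmark's actual grading code:

```python
import re

def grade_pair_response(response: str) -> bool:
    """Hypothetical strict grader for the example question:
    two equal natural numbers a and b with a + b = 2,
    answered in the exact format "a,b"."""
    # Require the exact "a,b" format: two integers separated by a comma,
    # with no surrounding prose or markdown decoration.
    match = re.fullmatch(r"(\d+),(\d+)", response.strip())
    if not match:
        return False  # e.g. "The answer is a = 1 and b = 1" or "**1**, **1**"
    a, b = int(match.group(1)), int(match.group(2))
    # Natural (>= 1), equal, and summing to 2.
    return a >= 1 and a == b and a + b == 2

print(grade_pair_response("1,1"))          # True  (correct)
print(grade_pair_response("2,2"))          # False (wrong sum)
print(grade_pair_response("0,2"))          # False (not equal, 0 not natural)
print(grade_pair_response("**1**, **1**")) # False (wrong format)
```

A check this strict deliberately fails answers that are mathematically right but formatted wrong, which is exactly the instruction-following behavior the example probes.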
How It Works (High Level)
- Private tests: We do not publish the exact test content, prompts, or full grading details.
- Repeated runs: Each model is run multiple times so the results reflect stability, not just one lucky attempt.
- Reasoning modes: Where supported, models are evaluated in multiple reasoning configurations.
- OpenRouter execution: Benchmark requests are run through OpenRouter.
- Real-world reliability: Timeouts, downtime, and API errors count as failed attempts.
- Fast coverage with an evolving suite: Our suite is small, so we can test new models quickly and continually add or remove tests.
- Generic intelligence signal: The score is not limited to any one category. It is an indicator of a practical question: if you ask an AI anything, how likely are you to get the right answer?
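As a rough illustration of the repeated-runs idea, the aggregation could be sketched like this. The function name and the simple averaging formula are hypothetical assumptions for illustration; the real scoring internals are private:

```python
def aggregate_score(run_results: list[list[bool]]) -> float:
    """Hypothetical aggregation: run_results[t][r] is True when run r of
    test t passed. Timeouts, downtime, and API errors are recorded as
    False, so real-world reliability counts against the score. Averaging
    over repeated runs rewards consistency, not one lucky attempt."""
    per_test = []
    for runs in run_results:
        pass_rate = sum(runs) / len(runs)  # fraction of repeated runs passed
        per_test.append(pass_rate)
    return sum(per_test) / len(per_test)   # average across all tests

# A model that always passes test 1 but passes test 2 only half the time:
print(aggregate_score([[True, True, True, True],
                       [True, False, True, False]]))  # 0.75
```

Under this sketch, a flaky model that nails a test once but fails it on reruns scores lower than one that passes it every time, matching the stability goal described above.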
For transparency, we share our broad methodology while keeping sensitive benchmark details private.