Benchmark Methodology

This page explains our benchmarking approach at a high level. To preserve test integrity, we keep the exact prompts and grading internals private.

Tests

The questions are drawn more or less at random from tasks and domains that are mostly distinct from one another. Statistically speaking, a better model should on average outperform a weaker model on random, non-cherry-picked tasks. My background is in competitive programming, so thinking about tests and edge cases comes very naturally to me.

This is not a standardized "IQ" value. The score has no unit; it is just an arbitrary value that shows how well a model performs across the whole test suite (correct answers + consistency). I do not cherry-pick models, and I do not modify tests for any particular model. When a new test idea comes to mind, I add it, retest all models, and recalculate the scores.

The questions usually come from simple ideas, such as: "Let's see whether models do well when asked to do X, Y, or Z". For example: "Respond with the two equal natural numbers, a and b, that when added together have the total = 2. Respond in this exact format: a,b". Some AIs may give a wrong answer, such as "2,2". Some may fail to respect the requirement that the two numbers be equal, such as "0,2". Some may ignore the output format, such as "The answer is a = 1 and b = 1". And others may simply answer "1,1" correctly.
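The four outcomes in this example can be sketched as a small checker. This is only an illustration, not the actual grading code, which is kept private. The function name `grade_equal_pair`, the outcome labels, and the choice to tolerate surrounding whitespace are all assumptions made for this sketch.

```python
import re


def grade_equal_pair(response: str, total: int = 2) -> str:
    """Classify a response to the 'two equal natural numbers' example.

    Hypothetical helper: the labels returned here are illustrative only.
    """
    # Assumption: leading/trailing whitespace does not count as a format error.
    text = response.strip()

    # The prompt demands exactly "a,b": two integers separated by one comma.
    match = re.fullmatch(r"(\d+),(\d+)", text)
    if match is None:
        return "bad_format"

    a, b = int(match.group(1)), int(match.group(2))

    # The two numbers must be equal.
    if a != b:
        return "not_equal"

    # Their sum must match the requested total.
    if a + b != total:
        return "wrong_sum"

    return "correct"


if __name__ == "__main__":
    for reply in ["2,2", "0,2", "The answer is a = 1 and b = 1", "1,1"]:
        print(repr(reply), "->", grade_equal_pair(reply))
```

Checking the format first means a reply that contains the right numbers in the wrong shape still fails, which matches the idea that following instructions is part of the task.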

Some tests are more complex than this example, but the general idea should be clear. The suite does not favor any specific model, and these questions are generally very easy for humans.

Cristian

How It Works (High Level)

Private tests: We do not publish the exact test content, prompts, or full grading details.
Repeated runs: Each model is run many times so that the results reflect stability, not just one lucky attempt.
Reasoning modes: Where supported, a model is evaluated in multiple reasoning configurations.
OpenRouter execution: Benchmark requests are run through OpenRouter.
Real-world reliability: Timeouts, downtime, and API errors count as failed attempts.
Fast coverage with evolving suite: Because our suite is small, new models can be tested quickly, and tests are continually added and removed.
Generic intelligence signal: The score is not confined to any single category. It is an indicator for one practical question: if you ask an AI something, how likely are you to get the correct answer?
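The interaction between repeated runs and the reliability rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the real scoring formula, which is not published: the `Attempt` type, the status labels, and the `pass_rate` function are hypothetical, and a plain pass rate stands in for whatever combination of correctness and consistency the actual score uses.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    # Hypothetical status labels: "ok", "timeout", or "api_error".
    status: str
    # Whether the answer was graded correct; only meaningful when status is "ok".
    correct: bool = False


def pass_rate(attempts: list[Attempt]) -> float:
    """Fraction of attempts that completed and were graded correct.

    Timeouts and API errors stay in the denominator, so they count
    as failed attempts instead of being silently dropped.
    """
    if not attempts:
        raise ValueError("at least one attempt is required")
    passed = sum(1 for a in attempts if a.status == "ok" and a.correct)
    return passed / len(attempts)


if __name__ == "__main__":
    runs = [
        Attempt("ok", correct=True),
        Attempt("ok", correct=True),
        Attempt("timeout"),
        Attempt("ok", correct=False),
    ]
    print(pass_rate(runs))  # 2 passes out of 4 attempts -> 0.5
```

Keeping failed requests in the denominator means a model that answers well but is often unavailable scores lower than one that is equally accurate and always reachable.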

To stay transparent, we share the broad methodology, but we keep sensitive benchmark details private.