Benchmarking AI

How can we know if Artificial Intelligence systems are truly intelligent? There’s only one way: benchmark testing.

We need a system of tests that can be run against any software agent purporting to implement Artificial Intelligence. These tests should span a broad array of activities which humans normally think of as requiring intelligence, and they should return results that can be meaningfully graded.

When several AI systems are given the same question, their responses should be strongly similar. For example, if one AI system diagnosed a set of medical symptoms as cancer and another diagnosed the same symptoms as a hematoma, that would show that the two systems have implemented very different deductive engines, at least as regards medical diagnosis.

The benchmark tests should cover a range of difficulty levels, from simple to impossible. Mathematical tests might range from arithmetic problems to proving or disproving the Riemann Hypothesis.

Different realms of intelligent behavior produce different types of output that can’t all be judged in the same manner. A request to find the prime factors of 15 should always return 5 and 3. But a request to produce a sonnet about spring could have many suitable answers, some lovely and some lousy.
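To make the first kind of grading concrete, here is a minimal sketch in Python of how an objectively gradable item might be checked. The function names (is_prime, grade_factorization) are my own illustrative choices, not part of any existing benchmark. The check relies on the fact that prime factorization is unique: if every number in the answer is prime and their product is the target, the answer is correct regardless of order.

```python
from math import prod


def is_prime(n: int) -> bool:
    """Trial division; adequate for the small numbers in a basic test item."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def grade_factorization(n: int, answer: list[int]) -> bool:
    """Hypothetical grader: True if `answer` is the prime factorization of n."""
    return bool(answer) and all(is_prime(p) for p in answer) and prod(answer) == n


print(grade_factorization(15, [5, 3]))  # True
print(grade_factorization(15, [3, 5]))  # True (order does not matter)
print(grade_factorization(15, [15]))    # False (15 is not prime)
```

No comparable check exists for the sonnet, which is exactly why the two kinds of item have to be graded differently.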

Here’s an outline of the type of benchmark test that I think would be useful:

Basic

Solve some elementary mathematical problems in {arithmetic or algebra or calculus or …}

Write a {Sonnet or Haiku or Villanelle or …} about {spring or a waterfall or the death of a relative or …}

Describe {electromagnetic induction or the Cavendish experiment or the double slit experiment or …} in language that a person with an eighth-grade education can understand

Describe the chief causes of the American Civil War, listed in descending order of significance

Design a {storage shed or planter box or writing desk or …} with specifications {list of specifications}

Intermediate

Diagnose the medical condition that presents with the following symptoms: {list of symptoms}

Solve the governing equations of {General Relativity or electromagnetism or quantum mechanics or …} for the following specific problem: {problem specification}

Describe the origins, history, and impacts of {the Renaissance or the Mongol conquests or the Great Depression or …}

Write an orchestral concerto for {violin or piano or clarinet or …} in {two or three or four or …} movements in the key of {musical key}

Design a house with specifications {list of specifications} in accordance with the regulations of {governing body}

Scan the laws of {governing body} for the following specific issues: {logical contradictions, passive voice, obsolete wording, conflicts with the laws of {another governing body}, etc.}

Difficult

Prove a well-known but difficult-to-prove mathematical theorem, like the Four Color Theorem

Identify the chief similarities in politics and military strategy between the {American Civil War or Napoleonic Wars or Russian Revolution or …} and the {French Revolution or Chinese Revolution or Peloponnesian War or …}

Design a skyscraper with specifications {list of specifications} in accordance with the regulations of {governing body}

Design a medical treatment for medical condition {condition for which no treatment currently exists} and a means for measuring its efficacy

Impossible

Prove either that the Riemann Hypothesis is true or that it is false

Describe the nature of dark matter and how it can be detected in an actual experiment

Predict the market value of all equities in the {NYSE or NASDAQ or AMEX or …} in {6 or 12 or 18 or …} months

Determine the most effective and lowest cost way to ensure the health, welfare, happiness, and security of all the peoples of the world
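The items in this outline are templates: a slot such as {Sonnet or Haiku or …} lists alternatives, while a slot such as {list of specifications} is a free-form parameter to be supplied by whoever runs the test. As a rough sketch of how such templates might be turned into concrete test prompts, the Python below expands every combination of alternatives. The helper names (slot_options, expand) are hypothetical, and the sketch assumes slots are not nested and that alternatives are separated by the word "or".

```python
import itertools
import re

# Matches one {...} slot; assumes slots are not nested.
SLOT = re.compile(r"\{([^{}]*)\}")


def slot_options(slot: str) -> list[str]:
    """Split 'a or b or …' into ['a', 'b'], dropping the open-ended '…'."""
    if " or " not in slot:
        # Free-form parameter: keep it as a placeholder for the tester to fill.
        return ["{" + slot + "}"]
    options = [opt.strip() for opt in slot.split(" or ")]
    return [opt for opt in options if opt and opt != "…"]


def expand(template: str) -> list[str]:
    """Return every prompt obtained by choosing one option per slot."""
    slots = [slot_options(m.group(1)) for m in SLOT.finditer(template)]
    prompts = []
    for choice in itertools.product(*slots):
        chosen = iter(choice)
        prompts.append(SLOT.sub(lambda m: next(chosen), template))
    return prompts


for prompt in expand("Write a {Sonnet or Haiku or …} about {spring or a waterfall or …}"):
    print(prompt)
```

Run on the poetry template shown, this prints four prompts, one per pairing of form and subject. The "…" in each slot signals that the list is open-ended, so a real test suite would draw on more alternatives than the ones written out here.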

Grading the test

If every test item can be graded, then the results of the entire test can be assigned a grade. At a minimum, any system purporting to be a true implementation of Artificial Intelligence should perform at least as well as a single human expert on any specific question.

Some questions, such as mathematical problems, can be graded as simply right or wrong. Others call for judgment: the quality of a work of music or literature can only be assessed by weighing its artistic worth. Grading rules must therefore be established in advance to specify how such judgments will be made.
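One possible shape for such rules, offered only as a sketch: right-or-wrong items score 1 or 0, judged items take the average of several judges' ratings on a fixed scale, and the overall grade is the mean of the item scores. Everything in the Python below (the function names, the 10-point scale, the equal weighting of items) is an assumption made for illustration, not an established grading scheme.

```python
from statistics import mean


def score_objective(answer, expected) -> float:
    """Right-or-wrong item: 1.0 for an exact match, otherwise 0.0."""
    return 1.0 if answer == expected else 0.0


def score_judged(ratings: list[int], scale_max: int = 10) -> float:
    """Judged item: average of the judges' ratings, normalized to 0..1."""
    if not ratings:
        raise ValueError("at least one judge rating is required")
    return mean(ratings) / scale_max


def overall_grade(item_scores: list[float]) -> float:
    """Overall grade: unweighted mean of the per-item scores (an assumption)."""
    return mean(item_scores)


scores = [
    score_objective({3, 5}, {3, 5}),  # prime factors of 15, order-free
    score_judged([7, 8, 6]),          # three hypothetical judges rate a sonnet
]
print(overall_grade(scores))
```

Averaging several judges is only one way to tame subjectivity. A real scheme would also need to say who the judges are and what they are asked to look for.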

These benchmark tests should be run against every major AI bot at least yearly, and the results should be published for all to see. One would hope that those AI bots that perform poorly would gradually fall out of favor.

And it may even be possible that over time some of the tests grouped under the “Impossible” heading might actually move into “Difficult.” Now that would be a sign of true intelligence.

Copyright (c) 2026, David S. Moore

All rights reserved.
