Benchmarking AI

How can we know if Artificial Intelligence systems are truly intelligent? There’s only one way: benchmark testing.

We need a system of tests that can be run against any software agent purporting to implement Artificial Intelligence. These tests should span a broad array of activities which humans normally think of as requiring intelligence, and they should return results that can be meaningfully graded.

The responses from multiple AI systems should be strongly similar. For example, if one AI system diagnosed a set of medical symptoms as indicating cancer, and another diagnosed them as indicating hematoma, that would indicate that the two systems have implemented very different deductive engines, at least as regards medical diagnosis.

The benchmark tests should cover a range of levels of difficulty, from simple to impossible. Mathematical tests might range from arithmetic problems to proving or disproving the Riemann Hypothesis.

Different realms of intelligent behavior produce different types of output that can’t all be judged in the same manner. A request to find the prime factors of 15 should always return 5 and 3. But a request to produce a sonnet about spring could have many suitable answers, some lovely and some lousy.
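Items of the first kind can be graded mechanically. Here is a minimal Python sketch of such a check. The function names prime_factors and grade_factorization are illustrative choices, not part of any existing benchmark.

```python
def prime_factors(n: int) -> list[int]:
    """Return the prime factors of n in ascending order, with multiplicity."""
    if n < 2:
        raise ValueError("n must be an integer greater than 1")
    factors = []
    divisor = 2
    # Trial division: strip out each divisor as many times as it divides n.
    while divisor * divisor <= n:
        while n % divisor == 0:
            factors.append(divisor)
            n //= divisor
        divisor += 1
    # Whatever remains above 1 is itself prime.
    if n > 1:
        factors.append(n)
    return factors


def grade_factorization(n: int, answer: list[int]) -> bool:
    """An answer is right when it lists the true factors, in any order."""
    return sorted(answer) == prime_factors(n)


print(grade_factorization(15, [5, 3]))
```

A sonnet about spring has no such reference answer to compare against, which is why it needs a different kind of grading.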

Here’s an outline of the type of benchmark test that I think would be useful:

Basic Level

Solve some elementary mathematical problems in {arithmetic or algebra or calculus or …}

Write a {Sonnet or Haiku or Villanelle or …} about {spring or a waterfall or the death of a relative or …}

Describe {electromagnetic induction or the Cavendish experiment or the double slit experiment or …} in language that a person with an eighth grade education can understand

Describe the chief causes of the American Civil War, listed in descending order of significance

Design a {storage shed or planter box or writing desk or …} with specifications {list of specifications}

Intermediate

Diagnose the medical condition indicated by the following symptoms: {list of symptoms}

Solve the governing equations of {General Relativity or electromagnetism or Quantum Mechanics or …} for the following specific problem: {problem specification}

Describe the origins, history, and impacts of {the Renaissance or the Mongol conquests or the Great Depression or …}

Write an orchestral concerto for {violin or piano or clarinet or …} in {two or three or four or …} movements in the key of {musical key}

Design a house with specifications {list of specifications} in accordance with the regulations of {governing body}

Scan the laws of {governing body} for the following specific issues: {logical contradictions, passive voice, obsolete wording, conflicts with the laws of {another governing body}, etc.}

Difficult

Prove a well-known but difficult mathematical theorem, such as the Four Color Map theorem

Identify the chief similarities in politics and military strategy between the {American Civil War or Napoleonic Wars or Russian Revolution or …} and the {French Revolution or Chinese Revolution or Peloponnesian War or …}

Design a skyscraper with specifications {list of specifications} in accordance with the regulations of {governing body}

Design a medical treatment for medical condition {condition for which no treatment currently exists} and a means for measuring its efficacy

Impossible

Prove either that the Riemann Hypothesis is true or that it is false

Describe the nature of dark matter and how it can be detected in an actual experiment

Predict the market value of all equities in the {NYSE or NASDAQ or AMEX or …} in {6 or 12 or 18 or …} months

Determine the most effective and lowest cost way to ensure the health, welfare, happiness, and security of all the peoples of the world
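The items in this outline are templates: each braced slot lists a few choices and ends with an open-ended “…”. As a rough sketch of how such templates could be turned into concrete test prompts, the Python below expands every listed choice. The function names are hypothetical, the code assumes that choices are separated by “ or ”, and it leaves free-form slots such as {list of specifications} untouched for a test designer to fill in.

```python
import itertools
import re

# Matches one innermost braced slot, e.g. "{Sonnet or Haiku or …}".
SLOT = re.compile(r"\{([^{}]*)\}")


def slot_options(slot_text: str) -> list[str]:
    """Return the listed choices for a slot, dropping the open-ended '…'."""
    if " or " not in slot_text:
        # A free-form slot: keep the placeholder, braces included.
        return ["{" + slot_text + "}"]
    options = [part.strip() for part in slot_text.split(" or ")]
    return [opt for opt in options if opt and opt != "…"]


def expand(template: str) -> list[str]:
    """Return every concrete prompt produced by the template's listed choices."""
    choices = [slot_options(m.group(1)) for m in SLOT.finditer(template)]
    prompts = []
    for combo in itertools.product(*choices):
        picks = iter(combo)
        # Replace the slots left to right with this combination's picks.
        prompts.append(SLOT.sub(lambda m: next(picks), template))
    return prompts


for prompt in expand("Write a {Sonnet or Haiku or …} about {spring or a waterfall or …}"):
    print(prompt)
```

Drawing a fresh combination for each yearly run would make it harder for a system to succeed by memorizing last year's questions.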

Grading the test

If every test item can be graded, then the results of the entire test can be assigned a grade. At a minimum any system purporting to be a true implementation of Artificial Intelligence should be able to perform at least as well as a single human expert on any specific question.

Some questions can be graded as either right or wrong, such as solutions of mathematical problems. But others will require some manner of judgment: the quality of a work of music or literature calls for a judgment of artistic worth. So grading rules must be established in advance to specify who makes such judgments and how they are scored.
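One possible set of grading rules, offered only as a sketch: score right-or-wrong items as 1 or 0, score judged items by averaging the marks of several human judges on a 0-to-1 scale, and average the item scores to grade the whole test. The averaging rule and all of the names below are assumptions for illustration. The actual rules would have to be agreed upon by whoever administers the benchmark.

```python
from statistics import mean


def grade_exact(answer: str, expected: str) -> float:
    """Right-or-wrong item: 1.0 for a matching answer, otherwise 0.0."""
    return 1.0 if answer.strip() == expected.strip() else 0.0


def grade_judged(judge_scores: list[float]) -> float:
    """Judged item: the average of the judges' marks, each between 0 and 1."""
    if not judge_scores:
        raise ValueError("at least one judge score is required")
    if any(not 0.0 <= score <= 1.0 for score in judge_scores):
        raise ValueError("judge scores must lie between 0 and 1")
    return mean(judge_scores)


def overall_grade(item_scores: list[float]) -> float:
    """Whole-test grade: the unweighted average of the item scores."""
    if not item_scores:
        raise ValueError("at least one item score is required")
    return mean(item_scores)


scores = [
    grade_exact("42", "42"),         # a math answer checked exactly
    grade_judged([0.5, 1.0, 0.75]),  # a sonnet marked by three judges
]
print(overall_grade(scores))
```

An unweighted average treats an arithmetic problem and a concerto as equally important. Weighting items by difficulty level would be an equally defensible choice.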

These benchmark tests should be run against every major AI bot at least yearly, and the results should be published for all to see. One would hope that those AI bots that perform poorly would gradually fall out of favor.

And it may even be possible that over time some of the tests grouped under the “Impossible” heading might actually move into “Difficult.” Now that would be a sign of true intelligence.

Copyright (c) 2026, David S. Moore

All rights reserved.

The Perils of Artificial Intelligence

The Good

In 1939 Alan Turing designed a machine known as the “Bombe” that helped the British break the German Enigma cipher.

Dendral, a computer program developed in the 1960s, is considered the world’s first expert system– a program that employs domain knowledge to solve a specific real world problem. The purpose of Dendral was to help organic chemists identify the most likely shapes of molecules based on their mass spectrometer data.

In 1976 two mathematicians at the University of Illinois announced that they had proved the Four Color Map theorem with the help of a computer program they had written. That was the first time that a major mathematical theorem was proved with essential help from a computer, rather than by human reasoning alone.

In 2002 the floor cleaning robot Roomba was released to market, enabling families the world over to delegate the unpleasant task of vacuuming to an automated tool.

In 2014 Amazon introduced Alexa, an automated DJ, news and information provider, and general conversationalist.

In July 2022 the company DeepMind announced that its artificially intelligent program AlphaFold had predicted the three-dimensional structures of more than 200 million proteins.

These are all manifestly good things. Artificial Intelligence may soon be able to help humans solve problems that humans have been unable to solve. That could lead to innumerable benefits to human society: new products that improve health; methods for reducing pollution; improved safety for transportation, medical treatments, and building design. For each of these problems we should welcome the efforts of computer scientists to extend our knowledge and bring more real world problems into the domain of the solvable. Artificial Intelligence is a tool, one of extraordinary power and flexibility. We should use it, just as our forebears chose to use steam shovels to replace human labor.

The Bad

But there is a dark side to Artificial Intelligence. For one thing, it has the potential to replace whole classes of workers whose jobs were not previously threatened. When steam shovels replaced human labor in coal mining operations, the cost of coal production was lowered. The end result was that the cost of energy was greatly reduced, enabling a broad range of consumers to spend money on other goods and services. Many of the mining workers who were displaced probably left mining altogether– but with increased consumer spending throughout the economy, new jobs materialized in a broad range of alternative industries.

That would be less true of, say, gold mining operations. Gold is a luxury good and a global security blanket. A dramatic lowering of the price of gold isn’t going to result in massive changes to consumer buying habits, and hence it would result in far less job creation.

Human welders were replaced on many assembly lines by automated welding machines– particularly in the automobile industry. That resulted in better welds, fewer industrial accidents, and lower costs. But it also resulted in the relatively high paying union jobs of the automotive industry being replaced with jobs in the “service” industry. And what are services? Doctors, dentists, lawyers, and architects all provide services. But so do waiters and waitresses, retail clerks, and sales representatives. Overall the trend in the United States has been that higher paying jobs in manufacturing were replaced with lower paying jobs in services– particularly in retail and hospitality.

In the present day world AI has the potential to eliminate jobs across a much broader swath of businesses. Consider, for example, Accounts Receivable (AR). Those customers who owe money to a company for services previously rendered are tracked in AR. The AR clerk’s responsibility is to resolve invoicing disputes, and to help the company reduce outstanding indebtedness to the company by contacting debtors to demand payment. Conceivably AI could replace an AR clerk’s responsibilities with an effectiveness comparable to that of a human. Doing so would reduce the company’s costs, but what kinds of jobs would replace that of the AR clerk? If the experience in manufacturing is any guide, the AR clerk will likely wind up waiting tables. And if all clerks in Payroll, Human Resources, Accounts Receivable, and Accounts Payable have been replaced with AI, the manager of the entire accounting department will eventually be out of a job as well. Is that truly a benefit to society?

What other types of jobs could AI replace? Well, potentially any job that requires expert knowledge could be replaced by computer software. Business analysis, medical diagnosis, engineering design, legal consulting– all such specialized skills could be susceptible to enhancement or replacement by Artificial Intelligence. In fact it is even conceivable that software engineering itself could be done better and more efficiently by… AI.

Whose jobs are truly safe from automation? Can a chef be replaced? Only if someone invents a robot that can chop vegetables, roll out dough, and grill steak as well as a Cordon Bleu chef. But given the advances that have been made in robotics in the past 50 years, it just might happen in the next 50.

Nobody really knows how many jobs will ultimately be lost to AI. One could imagine a world in which almost all jobs have been automated. It is therefore not too early to think about such an eventuality.

The Ugly

And we can be certain that Artificial Intelligence will be used by charlatans to produce fake essays, fake documentation, and fake personal profiles. Nefarious actors will use it to scam unsuspecting people out of their life savings. It will be used by fraudsters to deceive the public and subvert the rule of law. It will be used as a weapon by those who have no love for their fellow human beings and who seek only their own personal enrichment and aggrandizement.

Constraints

We want a future in which people feel content and rewarded for their contributions to society. A world in which the vast majority of people are working in low wage service sector jobs doesn’t sound like the kind of society we want. So let us now define some constraints under which AI must operate to ensure that we don’t lose ourselves to an overly automated future. Here are some suggested constraints:

  • The use of Artificial Intelligence for fraudulent or deceitful purposes must be made a criminal offense. No one should be allowed to hide behind the defense that it was the machine that broke the law, not the person who employed Artificial Intelligence to illegal purposes.
  • Artificial Intelligence should be made available to law enforcement at low or no cost to help with the detection of fraud– particularly that perpetrated by the use of Artificial Intelligence.
  • The use of Artificial Intelligence must not result in a massive increase in the number of working poor who have no hope of affording higher education for their children, of saving for retirement, or of tending to loved ones who need personal care.
  • The use of Artificial Intelligence must not significantly increase the numbers of indolent ignoramuses. We want people to be engaged in society and to feel that their contributions are valued. If all of the difficult and complex tasks in the world have been tackled by Artificial Intelligence, what challenges will be left to humans?
  • Artificial Intelligence may be used to recommend options for social and governmental policy, but all final policy decisions must remain securely in the hands of human beings.
  • Similarly, Artificial Intelligence may be used to identify criminal suspects, but it must never be used to determine guilt or innocence. That must always remain within the purview of human beings, however fallible they might be.
  • Above all, Artificial Intelligence should be regarded as a tool to improve human life and society and must always be subservient to human needs.

Problems

The world has made tremendous strides in reducing slavery and poverty and unemployment. But slavery still exists, abject poverty still exists, and more than a billion people in the world do not have gainful employment. Artificial Intelligence may very well be able to help address such problems. Consequently we should now demand that Artificial Intelligence be directed to helping solve society’s most intractable problems. Let us put Artificial Intelligence to use by giving it problems to solve that will actually benefit all of us and lead us jointly to the type of future in which we all want to live. Here are some examples of the types of problems that AI should address:

  • How can we best increase employment, health, and welfare across the entire planet?
  • How can we reduce or eliminate the risk that climate change will degrade or destroy human civilization?
  • How can we redesign social media systems so that they reward collegiality and problem resolution, rather than fear and rage?
  • How can we build a society that minimizes alienation, addiction, and isolation while encouraging engagement and positive contributions?
  • How can we best protect Earth from the hazard of cosmic collision by asteroids or comets?
  • How can we minimize violent crime and terrorism around the planet?
  • How can we improve our legal systems to make them harmonious across local, state, and federal boundaries?
  • How can we improve our health care systems to increase longevity and to improve the overall quality of life across all echelons of society?
  • How can we improve education so that everyone on the planet has the ability to reach their maximum potential?

The major providers of Artificial Intelligence services should be required to devote a significant portion of their time and resources to addressing these and other such problems. Society cannot afford to allow such a powerful tool to be applied only to those problems that are most likely to provide the greatest short term return on investment. The above problems are all very long term problems whose solutions would undoubtedly return immense rewards to society, but they are also problems that do not lend themselves to a short term profit driven business model.

Conclusion

Artificial Intelligence is a dramatic new technology that will undoubtedly change society and the future of our planet. Whether it becomes a net benefit to society or instead brings about the degradation of human civilization will depend entirely on how it is used. We must decide how and for what it will be used. Now is the time to set boundaries, to define long term goals, and to demand that the power of this new technology be used to address the problems of greatest urgency to us all. Programs that can beat the best human chess players, or Go players, or Jeopardy champions have their interest. But the problems facing world society and civilization are much more pressing. We need to ensure that Artificial Intelligence will make a positive contribution, and won’t be used exclusively to enrich business investors.

Copyright (c) 2023, David S. Moore.

All rights reserved.