The Numismatic Bibliomania Society


The E-Sylum: Volume 27, Number 48, December 1, 2024, Article 9

A CONVERSATION ON AI COIN GRADING

Late last month an article submitted by Justin Hinh led to a healthy back-and-forth discussion between Justin and Bill Eckberg, who had earlier commented on Justin's reports on his work with artificial intelligence and numismatics. While both were willing to share their discussion with E-Sylum readers, I was at a bit of a loss on how to edit it for publication. Justin offered to help - with a little more artificial intelligence! He provided the emails to ChatGPT with a prompt like: "Summarize this discussion into a concise format that would be most engaging for The E-Sylum readers."

Here's Justin's intro and the results of that prompt, which Bill and Justin reviewed for publication. Thanks. -Editor


AI Coin Grading: A Conversation with Bill Eckberg

After the article "The Good and Bad of AI in Numismatics" (October 13, 2024), Bill Eckberg, President of Early American Coppers (EAC), and I engaged in a thoughtful exchange about the progress and challenges of AI in coin grading that readers may enjoy.

Below is a summary of our discussion:


Bill Eckberg:

Hi Justin,

I have always enjoyed your work on AI grading. It was fascinating to see the improvement in the accuracy of the AI model over time. I have a few questions:

  • Data Sample: How many coins did your model analyze at each grade level when creating the graph, and which series were they drawn from?

  • Quality Variation: Since coin quality varies within each grade level, how does your system account for that?

  • Human vs. AI Variability: How does the variability within each grade level compare to the variability in grading the same coins by humans?

  • Luster and MS Grades: Since differences between Mint State (MS) grades largely depend on luster, how do your photos account for that? If the AI grades are "better," why and how are they better?

Also, regarding your graph, the horizontal axis cannot be quantitative as portrayed. The numbers represent categories, not quantities of grade. The number of grade steps between VF-20 and EF-40 isn't the same as from Po-1 to VF-20, yet the spacing on the graph is equal. There are only three grade steps between VF-20 and EF-40, but at least five between Po-1 and VF-20. Similarly, between MS60 and MS65, there are four steps (MS-61, 62, 63, and 64), so the spacing should reflect that. A bar graph might avoid implying a quantitative horizontal axis.

  AI coin grading test results chart
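Bill's step-counting argument can be made concrete. In the sketch below, the grade ladder is the commonly cited Sheldon-style sequence of grade steps (exact tiers vary by grading service and are an assumption here); treating grades as positions on that ladder, rather than as the numbers in their labels, shows why equal spacing on a numeric axis misleads:

```python
# Assumed Sheldon-style grade ladder; the exact set of intermediate
# tiers varies by grading service, so this list is illustrative only.
LADDER = ["PO-1", "FR-2", "AG-3", "G-4", "G-6", "VG-8", "VG-10",
          "F-12", "F-15", "VF-20", "VF-25", "VF-30", "VF-35",
          "EF-40", "EF-45", "AU-50", "AU-53", "AU-55", "AU-58"] + \
         [f"MS-{n}" for n in range(60, 71)]

def steps_between(lo, hi):
    """Count the grade steps strictly between two grades on the ladder."""
    return LADDER.index(hi) - LADDER.index(lo) - 1

print(steps_between("VF-20", "EF-40"))  # 3  (VF-25, VF-30, VF-35)
print(steps_between("PO-1", "VF-20"))   # 8  -- far more than VF-20 to EF-40
print(steps_between("MS-60", "MS-65"))  # 4  (MS-61 through MS-64)
```

The numeric labels (20, 40) differ by 20 in both comparisons, yet the ordinal distances differ, which is why a categorical bar chart represents the data more honestly than a numeric axis.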


Justin Hinh:

Thanks for your thoughtful questions, Bill. Here are my thoughts:

  • Data Sample: The graph in my update shows results from testing 12 different U.S. coins across 12 grade levels using ChatGPT models earlier this year. You can find a breakdown of the series I used in this Google Sheet. I included the graph to give a general sense of AI grading progress, but should have added:
    • "AI grading is heading in a positive direction based on my results. However, my testing has limitations. I only tested 12 U.S. coins across 12 grade levels, and many grades remain untested. I also didn't test any world or ancient coins."

  • Quality Variation: I didn't train or run my own AI model; I tested publicly available models like ChatGPT and Google Gemini. These AI models are black boxes—even their creators can't fully explain how they work. We know they tend to improve with more data, likely using internet sources, including high-quality coin photos from PCGS and NGC.

  • Human vs. AI Variability: This is the million-dollar question. The only way to find out is to test several coins alongside an expert grader and AI. I have yet to do this since AI models can't fully mimic human grading conditions, but I aim to conduct such tests when AI can analyze coins in real time.

  • Luster and MS Grades: This touches on the heart of technical vs. market grading. My tests show AI models can detect luster, but how they quantify it remains unclear. Because the training data is unknown, it's also hard to tell whether their grading leans more toward technical or market grading. Instructing the AI to use specific standards like the ANA's Official Grading Standards is possible, but even technical grading uses subjective terms like "attractive" and "original."

Regarding the graph, a bar graph would be more accurate in representing each grade definitively. I'll use that from now on.

Additionally, I should have included a link to a video of Google Gemini analyzing a video recording of a coin. You can check out that video here. I'm excited about this development because, previously, AI could only analyze a coin with a few photos. But a video captures hundreds of frames and gives the AI much more data to work with.
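For readers curious what "providing photos to ChatGPT" looks like under the hood, here is a minimal sketch of an OpenAI-style multimodal chat request pairing a text prompt with a coin photo. The model name, prompt wording, and image URL are illustrative assumptions, not Justin's actual test harness; the sketch only builds the request payload and does not call any API:

```python
# Hedged sketch: constructs an OpenAI-style chat payload that pairs a
# grading prompt with a coin photo. Model name, prompt text, and URL
# are placeholders, not the actual setup used in Justin's tests.
def build_grading_request(image_url,
                          model="gpt-4o",
                          prompt=("Grade this U.S. coin on the Sheldon "
                                  "scale and explain your reasoning.")):
    """Return a chat-completion request body with text plus one image."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }

req = build_grading_request("https://example.com/obverse.jpg")
print(req["messages"][0]["content"][1]["image_url"]["url"])
```

Adding more photos means appending more `image_url` entries to the same `content` list, which is why photo count is a natural experimental variable; a video, as Justin notes, effectively multiplies that into hundreds of frames.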


Bill Eckberg:

After reviewing your data, I don't see how one could claim an optimal number of photos for consistent grading. The variations were all over the place. If the machine were learning, it should become more consistent with more images.

As a long-time collector, I believe there's no consistent difference between MS-69 and MS-70, except the price some are willing to pay. Testing those grades isn't useful, especially since your AI could tell the difference only about as well as I can.

I'd like to see your AI grade a series, like Lincoln cents, in grades from Good to Uncirculated. Can it differentiate VG from G, F from XF, XF from AU, or AU from MS? Those insights would be very useful. If that works, you could try different levels of VF, and so on.

Keep at it.


Justin Hinh:

I should clarify my goals and the results in the data file. I was testing two hypotheses with my limited set of coins:

  • Does providing current AI models with more photos yield more accurate results? As you observed, that wasn't consistently the case.

  • Can the recently released advanced AI models achieve the same accuracy as previous models with fewer photos? Based on my results, these new models require fewer photos than the previously tested models for coins graded VF-20, though grades below VF-15 still needed more photos.

The differences between grades like MS-69 and MS-70 are minimal and often subjective, so testing them is of limited value. I intended to see how the AI handles the full spectrum of grades, but focusing on more commonly traded grades, from Good to Uncirculated, would be more practical. I'll consider testing series like Lincoln cents to see if the AI can differentiate between those grades.

Regarding the impact on third-party graders (TPGs), I've considered whether they're incentivized to adopt AI grading. While TPGs might eventually use AI for efficiencies like counterfeit detection, after exploring AI grading these past 14 months, I've concluded that AI won't be replacing human graders soon for several reasons:

  • Reputation and Liquidity: TPGs have billion-dollar valuations due to their reputation and the liquidity they provide, not just grading accuracy.

  • No Incentive to Normalize AI Grading: If AI grading becomes standard, the focus might shift to algorithm accuracy, commoditizing their services. TPGs aren't tech companies and may avoid the arms race for the perfect model.

  • Collector Resistance: Collectors might resist AI grading due to past experiences like Compugrade in the 1990s.

  • Financial Risk: The savings on grader salaries aren't enough to justify the reputational risk to TPGs.

  • Market vs. Technical Grading: TPGs perform market grading that requires a deep understanding of each coin series. AI excels at technical grading but lacks the "gut feeling" for market acceptance.

These considerations are driven more by market forces than technical limitations.


Bill Eckberg:

Thanks, Justin.

I agree with much of what you say but strongly disagree that TPGs deeply understand most series. That's true for commodity coins like Morgans and Saints, but not for pre-1836 issues struck on a screw press. Many are rare enough that TPGs don't see them often enough to grade them well.


Justin Hinh:

I knew I should have used a non-EAC example!


Bill Eckberg:

Ha!


This article still required effort on my part, but it was simple formatting work, not requiring a detailed reading and understanding. AI seemed to do a fine job of neatly summarizing the email exchange, and given that it was reviewed by both human parties (who made a few updates), I'm happy to publish it. Thanks, everyone, human and artificial. -Editor

To read the earlier E-Sylum article, see:
THE GOOD AND BAD OF AI IN NUMISMATICS (https://www.coinbooks.org/v27/esylum_v27n41a21.html)




Wayne Homren, Editor


The Numismatic Bibliomania Society is a non-profit organization promoting numismatic literature. See our web site at coinbooks.org.

To submit items for publication in The E-Sylum, write to the Editor at this address: whomren@gmail.com

To subscribe go to: https://my.binhost.com/lists/listinfo/esylum


Copyright © 1998 - 2023 The Numismatic Bibliomania Society (NBS)
All Rights Reserved.
