

This topic is now archived and is closed to further replies.

An interesting statistical analysis - and a new way to grade

30 posts in this topic

It's interesting and many of the things written are things I also have worked out here at home. However, there are some major flaws in the study that Coin World undertook and additional flaws in the assumptions/statements made by the author of this web site. At this time I will not go into it as I have to be at a conference early tomorrow morning. A good person to also comment on this would be Hoot.


I lean toward that idea that they never thought about what they were doing.

 

I suppose by this you mean they were not clear on the objective of the study. Was it an analysis of how coins are graded, or was it a comparison of the grading services?

 

I did find it funny that the study showed what I've always thought: PCGS is the most conservative but one of the more inconsistent. That has been my impression for years....

 

jom


As a statistics weenie, I thought this article was excellently done and conceived. And I think it would be great for the ANA to sponsor a test that is set up correctly for all the various grading services that has a sufficient population of coins to deal with.

 

The only flaw in the article here, though, was that the author didn't account for the NGs. I'm not sure what the best way is, but it can be a big issue in determining consistency.

 

Neil


IMHO there are some problems with this analysis, based on sample size (degrees of freedom, which is admitted) and on the number of samples from each service (only one sample coin of each type, graded once by each service).

 

The only intuitive feeling that I have from this is that NGC is more consistent and PCGS is more conservative, but even that is not proven by this data. I think to be relevant, you would need more sample coins of each type, with each coin being blindly submitted to all services several times each. IMHO, this would disclose more reliable grading variability within a service and more significant correlation of grades between services.


First, I'll agree with Neil that the post-hoc analysis done by stat-matics is quite good. They did what we have to do in the real world all the time: they worked with an incomplete data set. I like the fact that they blanketly state that "Drawing conclusions from one small experiment is dangerous." This could not be more true, and we tend to be swayed too greatly by affirmations, although they are small, of any preconceptions we may have. That said, I confess that I was a bit surprised by the analysis.

 

1) I really liked their first graph: consistent vs. scattered, liberal vs. conservative (I'd prefer high vs. low on this latter axis). This is a very telling conception, at least theoretically. Personally, I don't want a grading service to be out of the middle of either axis and they all are. A larger sample would be more telling, but I find it a bit disturbing that there are observable deviations on either axis. What would be more satisfying is seeing this plot for each service, each coin. This would provide a cloud of data points that would fit somewhere in that chart. I believe that it would provide much more information than a summary.

 

2) Their linear transform of the grading scale is good, but a bit theoretical. Have any of you ever seen a grading service provide a VG-7 or F-18 grade? I haven't, and I think those were placed in the scale for convenience more than rigor. This is not a big criticism... hell, I'd have done the same. One disturbing conclusion, however, is this: "The MLG is computed as the average across all the grading services. It represents the best statistical estimate of the true grade, since it summarizes the consensus of all services on the same coin." I simply do not buy this. Averages have all kinds of bad qualities and since the distribution of outcomes (grades) cannot be summarized with a normal curve (or even a truncated Student's t at this point), I think this was a bad conclusion. Personally, I'd have opted for the most frequently observed outcome (grade) as the most likely for any given coin. With a small sample size, this is the highest form of rigor that can be reasonably applied and it meets sufficiency criteria.

 

3) I know that the authors were trying to provide a simple definition, but this one missed the boat: "In statistical terms, the true grade of a coin is a random variable. A random variable is a number that is unknown until you look at it." First, we could argue all day about the "true grade of a coin" but that's not productive. What disturbs me are the ideas that the true grade of a coin is a random variable (RV) and their definition provided for an RV. A variable is an actual property or character to which an observed measurement is applied. Thus, a coin's grade can conceivably be called a (discrete) variable. A random variable, however, is one generated from random measurement or selection. No single coin possesses this quality. I may be splitting hairs, but here's how it's important: The random variable in this "experiment" is the deviation of the individual scores from the most likely score. This variable exhibits stochastic qualities, much too detailed to get into here, but nevertheless identifiable. This is why it's so important to properly identify the most likely score (MLS). Though the authors found an adequate MLS, they did not find the most statistically sufficient. Still, I think they had the right idea nailed pretty well.

 

4) Their "analysis of variance" was true to form in a definitional sense but not a classic ANOVA. No big deal, but I think they could have clarified this point from the get-go.
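The difference between the average and the most frequently observed grade in point 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the eight grades below are invented values on the linear scale, not data from the Coin World study, and `most_likely_grade` is a hypothetical helper name.

```python
from statistics import mean, multimode

def most_likely_grade(grades):
    """Return (mean, list of modes) for one coin's grades on the linear scale."""
    return mean(grades), multimode(grades)

# Hypothetical grades from eight services for a single coin (invented for illustration).
grades = [14, 14, 14, 13, 14, 15, 14, 20]

avg, modes = most_likely_grade(grades)
print(f"average grade: {avg}")          # pulled upward by the single high grade
print(f"most frequent grade(s): {modes}")
```

With these made-up numbers a single high grade drags the average away from the grade that five of the eight services agreed on, while the mode stays put. `multimode` returns every tied value, so a tie between grades would still need a rule of its own.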

 

All-in-all, I like the treatment. I especially like the graphical analysis. I am bothered by the outcome and agree that a larger sample should be pursued. In order to perform basic t-statistics on the data, at least 20 data points (20 coins) should be used. A sample of 100 would be excellent and a statistician could have a field day.

 

My 2 cents. Hoot


The biggest thing I like is their normalization of the scale to eliminate the false jumps between grades, taking each "step" of the grade as a single point. This reduces the numerical bias we each have from the grading scale. And translating it back into a Sheldon number is not too difficult.

 

I think, though, that ultimately we can never do comparisons quite this way. What I would like to see is a process control experiment. Take a "representative" set of coins. Say, 100. And then have each service grade them 100 times. Then see how many fall out of control at 2 or 3 sigma. That would give us a good idea of the process control of each company. Then we can begin to make conclusions of the consistency of the process for determining the grade. The variance demonstrated within the experiment would be enlightening, too (for precision). The SPC would also help us see the proportion of NGs and all. And then we can compare these values against other companies in the same industry.

 

I think that apples to apples comparison would be very interesting. I think all the services would be out of control, but to varying degrees.
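A rough Python sketch of the sigma check described above, assuming one service grades the same coin ten times. The grades are invented, `out_of_control` is a hypothetical helper, and the limits are computed from the same data being tested, which is a simplification: a real control chart would normally take its limits from a separate baseline run.

```python
from statistics import mean, stdev

def out_of_control(grades, k):
    """Return the grades lying more than k sample standard deviations from the mean."""
    center = mean(grades)
    sigma = stdev(grades)
    return [g for g in grades if abs(g - center) > k * sigma]

# Hypothetical: one coin graded ten times by one service, on the linear scale.
repeat_grades = [26, 26, 27, 26, 25, 26, 26, 26, 26, 30]

flagged_2 = out_of_control(repeat_grades, 2)
flagged_3 = out_of_control(repeat_grades, 3)
print("outside 2 sigma:", flagged_2)
print("outside 3 sigma:", flagged_3)
```

In this made-up run the lone high grade is caught at 2 sigma but not at 3, because the outlier itself inflates the standard deviation. That is one reason a large number of repeat gradings, or limits taken from a baseline, would be needed before drawing conclusions.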

 

Neil


Neil - I like the idea of repetition for handling no grades. It would be the only way to adequately handle them given the discretization of the grading scale. I still don't think it would require a huge sample (100 coins) but the more the merrier! One would have to relinquish control for biases arising from recognition of a single coin or set of coins from any individual grader. Time dependencies and experience are other matters that simply could not be controlled well. Still, it's a nice idea.

 

Hoot


Good thread - the responses have been well thought out and are thought provoking.

 

One thing to consider that would skew the results: changes to the coins that would result from all the handling needed for so many submissions. Not only would the grading services need to handle them to grade and slab, but whoever is doing the submitting would need to crack out every coin that wasn't body bagged. That being the case, I really wonder if we could actually submit the coins enough times to arrive at a sufficient sample size, since they would be altered in the process.


About the only way to thwart the difficulties listed in the last few posts would be to pseudoreplicate with coins that are "about the same" initially, and adjust for actual differences between pseudoreplicates in the final analysis. Bear in mind that the actual random variable is the deviation from expected grade, so pseudoreplication would not be quite as non-robust as it may initially seem. What any experimenter would have to control strongly for are problem coins, i.e., coins that should get body bagged. This could be done by a series of blind trials before any submission to grading services. Any coin that did not pass a single "inspector" in the blind trial process could not be used in the final implementation. What would be a sufficient number of "inspectors" and their qualifications would have to be worked out. Then, any coin that was bb'd by the grading services would have to be analyzed by the simple criterion that if one grading service bb'd it, then it's an anomaly and that score is replaced by the most likely score. However, if two or more grading services bb'd the coin, then it's no longer an anomaly and would be considered a rank observation. In that case, the coin would simply be thrown out of the study. (The more I think about this, the more possible it seems!)
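The body-bag rule above can be sketched in Python as follows. This is a minimal sketch under assumptions: `None` marks a body-bagged result, the service names are placeholders, the most likely score is taken as the most frequent numeric grade, and ties are broken toward the lower grade (the tie rule is not specified in this thread).

```python
from statistics import multimode

def resolve_body_bags(grades):
    """grades maps service name -> linear grade, or None if body bagged.

    Returns a cleaned dict, or None when the coin is dropped from the study.
    """
    bagged = [service for service, g in grades.items() if g is None]
    numeric = [g for g in grades.values() if g is not None]
    if len(bagged) >= 2 or not numeric:
        return None                      # two or more bags: throw the coin out
    if not bagged:
        return dict(grades)              # nothing to repair
    most_likely = min(multimode(numeric))  # assumption: lowest grade wins a tie
    return {s: (most_likely if g is None else g) for s, g in grades.items()}

# Hypothetical results for two coins.
one_bag = resolve_body_bags({"A": 14, "B": 14, "C": None, "D": 15})
two_bags = resolve_body_bags({"A": 14, "B": None, "C": None, "D": 15})
print(one_bag)
print(two_bags)
```

The first coin keeps its place in the study with the single body bag replaced by the most frequent grade; the second is dropped because two services bagged it.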

 

Hoot


Man you guys are giving me a headache.

 

Clem asks Abner, "Ain't statistics wonderful?"

"How so?" says Abner.

"Well, according to statistics, there's 42 million alligator eggs laid every year. Of those only about half get hatched. Of those that hatch, three-fourths of them get eaten by predators in the first 36 days. And of the rest, only 5 percent get to be a year old because of one thing or another. Ain't statistics wonderful?"

Abner asks, "What's so wonderful about statistics?"

"Why, if it wasn't for statistics, we'd be up to our asses in baby alligators!"

 


Okay, I have the time now to describe what it is that I do not like about the initial test and the subsequent analysis. Before I start, however, Hoot has hit on one or two things that I didn't like very much but that I could not have explained nearly as well as he has already so I will leave them alone. As for the other things that I would have either changed or simply do not like, they are as follows.

 

1) The sample pool does not realistically reflect how the market uses the services. That is, the market does not send in common date, circulated Buffalo nickels for slabbing. Neither does the market send in approximately half of its coins with problems. The original article, which I do not have in front of me, listed the coins sent in and listed the problems associated with each coin. If I remember correctly, it was something like six or seven coins that Coin World thought had major problems such as PVC, cleaning or recoloring. This grouping of coins did not accurately reflect the state of preservation that is normally sent to the services. Think about it: how many submissions do you think approximately reflect the submission that Coin World used? They should have thought about what they were doing a little bit more before they started, and then they could have chosen some more appropriate coins: pieces such as State quarters and Ike dollars in high grade PF or MS, circulated key and semi-key dates from popular series, MS Morgan dollars and classic commems, and even some borderline contrived coins (borderline FS Jefferson nickels, borderline FBL Franklins, borderline FH SLQs and borderline FB Mercs). A submission, or several submissions, such as those mentioned would much more accurately reflect the types of coins that are typically submitted by dealers and the public and would have made the results that much more relevant.

 

2) The sample size should have been a bit larger. Not only for the total number of coins, but also for problem-free coins. The inclusion of so many coins with problems makes the analysis much more complex than it otherwise might have been. Since Coin World designed the experiment, they should have designed it so that the results could have more meaning.

 

3) Ignoring the NG coins altogether, four coins out of fifteen, most definitely changes the pool of coins and the resulting conclusions. Do you think that a company that does not grade most coins due to a perceived problem is more conservative than one that grades anything, even without mentioning the problem? It is likely we would all agree that the company that issued more bags for problem coins would be considered more conservative. Those results should have either been addressed much more thoroughly or every coin that received even one NG should have been ignored.

 

4) The linear scale is something I had thought of when this article first appeared; however, the author of the analysis leaves out the AU53 grade. Yes, that grade is given out with regularity and should have been included.

 

5) The linear scale that was used also creates its own bias in that each grade range is treated equally. This is simply not the case. How often have we heard that an AU58 is really an MS63 with rub? If true, then we cannot say that the AU58 and MS63 grades are separated by four discrete, linear points. They should likely be sequential. What does that then do to the MS60-MS62 range? Where does it fit in relation to the AU58/MS63 quandary? Additionally, most coins have relatively little value difference from AU53 to AU58, yet this is worth two discrete points in the linear scale. So, is the AU53 to AU58 as severe a change as the MS64 to MS66 change? Likely not.

 

6) I have problems with apparently using the grade assigned by a company in the average of the grades and then also using that grade to determine how far off the average the company was in each instance. As far as I understood the article, the grades for each coin were averaged. So, in an extreme example, a single coin might have a linear grade of 1 given by seven companies and a grade of 17 given by the eighth company. In such an instance we would have an average of 3 for the eight companies even though we would likely believe that the coin deserved a 1. Using the comparison method that is subsequently employed, the eighth company would be 14 points too high (17-3=+14) and every other company would be 2 points too low (1-3=-2). I think a better way of doing this would have been to take the average grade exclusive of the company one is comparing it to and then to employ the difference. In this way one would compare the grade given by a specific company to the grade given by the rest of the market, exclusive of that company. Was all that clear?

 

7) Lastly (actually, there are more points, but I cannot recall them at the moment), the linear grade scale doesn't tell us how the grading company values the coins. What would have been better would have been to use a single source of pricing information such as PCGS, the Greysheet or Trends and decide on the source before the experiment. Then, when all the grades were in, add up the valuation that each company gave to the coins, en masse, and compare the totals with each other. This gives a much better idea of how each company values the coins submitted to it. Take an extreme example: if seven companies graded a 1963 Lincoln cent PF69DCAM and one company gave it PF70DCAM, then seven companies value it at approximately $1,000 while the eighth values it at approximately $35,000. The linear average method used would say the eighth company valued the coin by a fraction of a point more than the other companies and that that meant it was equivalent, while the dollar valuations would tell an entirely different story. After all, we buy and sell these coins with money, not with linear points.
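The extreme case in point 6 can be worked through in Python, using the numbers given there (seven companies at linear grade 1, one at 17). The function names are hypothetical, and this is only a sketch of the two comparison methods as described in this post, not the code used in the original analysis.

```python
def deviations_from_overall_mean(grades):
    """Each grade minus the average of ALL grades, its own included."""
    avg = sum(grades) / len(grades)
    return [g - avg for g in grades]

def deviations_leave_one_out(grades):
    """Each grade minus the average of the OTHER companies' grades."""
    total, n = sum(grades), len(grades)
    return [g - (total - g) / (n - 1) for g in grades]

grades = [1] * 7 + [17]   # the extreme example from point 6

overall = deviations_from_overall_mean(grades)
loo = deviations_leave_one_out(grades)
print("vs. overall mean:  ", overall)
print("vs. rest of market:", [round(d, 2) for d in loo])
```

Against the overall mean the eighth company is +14 and the rest are -2, as stated above. Excluding each company from its own benchmark puts the eighth company at +16. The other seven still come out about 2.29 low, because the outlier remains in their benchmark, so this change alone does not remove the pull of one extreme grade.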


Nice points, Tom. I think that you are absolutely correct in assessing that the pool of coins was not representative of the lots of coins typically provided to the services. In that sense, this is what the pre-experimental control is all about. The experiment was not well initiated, although it had merit in its initial conception. However, as with most sloppy science, too much was made of the results. I think that Stat-matics made good use of a poor data set, nonetheless, and stated up front that conclusions drawn could be erroneous. I think that what they tried to do was form a potential framework for future study.

 

I don't have as much of a problem with the grades from the grading companies forming the basis of an expected or most likely grade. The reason is simply that we really don't know in advance what the distribution of grades will be for any given coin. Thus, the distribution is updated by data, a Bayesian precept. This works if we use the right method to identify the expected/most likely grade (MLG). That's what Stat-matics did not do. It does not suffice to identify a central measure as an expectation with such a dearth of data. If we want information on the MLG, then we need more information on its probability density. This has two basic measures: within a grading service and between grading services. In a way, by deriving the MLG from the scores between services, we are conditioning on a standard that is created by the services. There's nothing wrong with that, if it's what we're after, and it seems we are. The identification of the MLG is onerous since there are so few grading services, but we might come close if we consider the likelihood of a number of probable distributions. This is crude, but it will get us in the ballpark. We can likely settle on a single distribution for all MLGs for all coins and that would make our job a lot easier. There's no mystery in this; we just have to have better rigor in the experiment. (I should say that we don't know what the measures of central tendency are until we define the likelihood function.)

 

As for values, I'm not sure I follow you there. Seems like an ancillary point, albeit valid.

 

Again, I think that Stat-matics did a nice job at setting all of this up for a more rigorous future treatment.

 

Hoot


And that leads to my final thought. This article was a nice step into the foray. Are we to expect more of this? Was it just a one time deal or are we going to see more research and investigation into the process of coin grading and not just grading and varieties? I think the vacuum of developing a truly CONSISTENT AND REPEATABLE PROCESS for grading a coin is just begging to be filled.

 

Neil


Hoot, I agree with you completely in that one of the banes of science is poorly done science. In my opinion, poorly done science is worse than misinterpreted science. That's just my opinion.

 

You know, the "blah blah" graemlin may be the most appropriate graemlin of all for many of my posts!

 

The problems I have with the MLG lie more in the minutiae and my own preferences than with any quantitative, robust handling of the data. In other words, it rubs me the wrong way but I can't state it is necessarily incorrect.

 

The valuations that I mention, probably from my point seven, are subtle yet they have a profound impact on how one would describe the service of a company. They remove the bias of a linear grading scale and replace it with predetermined values derived from a price guide. This illustrates better where a company leans on its grading continuum. Let's take an example from a series that you might be familiar with: Buffalo nickels. A pair of 1927-S nickels is sent in and the pair is graded VF25 and MS64 by company A, while the same pair of coins is graded VF20 and MS65 by company B. Using the linear scale method we come up with company A giving out 40 grade points for the pair of Buffalo nickels (14+26=40) and company B giving out 40 grade points (13+27=40). So, one might assume that the grading between the two companies was identical given they score the same points. However, the value of that pair of Buffalo nickels is waaaaaaaaay different. So, which is the more conservative company? The valuation method will give you a better clue as to the product.
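The contrast between points and dollars can be sketched in Python. No prices are quoted in this thread for the 1927-S pair, so the sketch reuses the approximate proof cent figures from point seven ($1,000 for PF69DCAM, $35,000 for PF70DCAM). The one assumption is that the two grades sit one step apart on the linear scale; their absolute positions on that scale are not needed for the comparison.

```python
# Approximate values quoted earlier in this thread for a 1963 Lincoln cent.
PRICE = {"PF69DCAM": 1_000, "PF70DCAM": 35_000}

# Assumption: adjacent grades are one step apart on the linear scale.
STEP = {"PF69DCAM": 0, "PF70DCAM": 1}

def deviation_from_average(values):
    """Each value minus the average of all values."""
    avg = sum(values) / len(values)
    return [v - avg for v in values]

grades = ["PF69DCAM"] * 7 + ["PF70DCAM"]   # seven companies vs. the eighth

point_dev = deviation_from_average([STEP[g] for g in grades])
dollar_dev = deviation_from_average([PRICE[g] for g in grades])

print("eighth company, linear points above average:", point_dev[-1])
print("eighth company, dollars above average:      ", dollar_dev[-1])
```

On the linear scale the eighth company is less than one point above the average; in dollars it is tens of thousands above it. The same comparison for the Buffalo nickel pair would need a price guide chosen before the experiment, as suggested in point seven.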


Tom, I think that calculating what you are describing requires a model of all the factors that each company uses to determine its scores.

 

For example:

 

Co A might use Luster + Strike + Year + Volume = Grade

Co B might use Mint Mark + Luster + Strike = Grade

Co C might use Luster + Strike + Year - Scratches = Grade

 

So in our model, Luster + Strike + Year + Volume + Mint Mark - Scratches = Grade,

 

Co A would be 1 + 1 + 1 + 1 + 0 - 0 = Grade

 

Co B would be 1 + 1 + 0 + 0 + 1 - 0 = Grade

 

Co C would be 1 + 1 + 1 + 0 + 0 - 1 = Grade

 

Then you can determine the r-squared value for each and every variable to determine significance. Although you would have to run the model through the r-squared analysis to get the actual values, you can see the r-squared for Luster and Strike would be high because all companies use them to factor Grade. However, Mint Mark would be less significant because only one company places value on it. When you line up all the factors used by all the major TPGs you can begin to calculate the significance of each variable, thus weeding out some of the biases.

 

It's been a long time since I had to recall my stats textbook, but I think that is a very simplistic way of saying it.


And that leads to my final thought. This article was a nice step into the foray. Are we to expect more of this? Was it just a one time deal or are we going to see more research and investigation into the process of coin grading and not just grading and varieties? I think the vacuum of developing a truly CONSISTENT AND REPEATABLE PROCESS for grading a coin is just begging to be filled.

 

I agree. It is my hope that those of you who understand this science (I certainly don't!) may take it upon yourselves to perhaps contact the Stat-matics folks with the comments and observations made here. They could benefit from your expertise, and possibly provide the groundwork for another, more valid study of this issue.


Of course, there is the one factor I alluded to (and others hinted at) that ultimately must be realized: An individual company may not grade using the same criteria as another company. An individual company's process for grading is not necessarily comparable to another company's process for grading (compare the 1 NTC grader and the 3+1 graders at NGC). An individual company's grader(s) skill level isn't easily comparable between companies or even within the same company.

 

With all these uncertainties, unless there is a repeatable and consistent process for determining a grade, the most you can really ever test for is the process control of grading per company and do comparisons based on degrees of variance between companies.

 

Neil


If you're reading an internal analysis of the grade variables into my posts, then I'm not doing a sufficient job in describing my ideas.

 

I actually don't care how each company arrives at a grade; I care as to how the market values a correctly graded coin of any grade. They are two very different things. Since most companies will not divulge how they come to a grade, and those that do don't always actually follow their written guidelines, the grade, whether linear or Sheldon scale, doesn't matter as much as the price.


OK let me rethink this after the clarification. However isn't the "market" supposed to have no bearing on the "neutral" professionals grading the coin? Or are you saying that if MS 66 is the flavor of the month then unconsciously graders may crank out more MS 66?


I guess what I'm trying to write is, which service comes closer to grading the coins, to a pre-published valuation, that is most in line with what two experienced numismatists would agree to do the deal on the coins for?


After reading about half of what everyone wrote, the bottom line is.........it's impossible to print out in black and white how every GC grades coins. It's impossible to sort out all the variables a coin has and assign a double digit number to it. Do the three graders actually consult one another before they come up with a grade? What if one or two of the graders are too conservative or too liberal, and then the next trio are all liberal or conservative? Who decides where an XF45 or an MS65 grade exists? Who keeps it all in retrospect? This is the reason why they call it an opinion, for there isn't an exact science for grading coins. If there were then we would not have any GC's (period)

 

Leo


I guess what I'm trying to write is, which service comes closer to grading the coins, to a pre-published valuation, that is most in line with what two experienced numismatists would agree to do the deal on the coins for?

 

Tom

 

Yeesss, BUT the 2 numismatists (if they are experienced)

1. Will need to agree on the grade

2. Will also need to agree that the assigned grade is the same grade they agree on, otherwise the "value" they pick will be different than the pre-assigned value.

 

In other words, the value is dependent upon the grade,

(the grade should not be dependent upon the value)

 

Because the graders should be evaluating the coin (not a price sheet), this would be correct in a utopian world, but I keep hearing rumors that the graders (or the finalizers) also look at the value of the coin before making the final determination as to the grade. The coins submitted in the survey really were not value dependent (a 1 grade/pt difference wouldn't make much difference in the value of the coins chosen for grading).

 

Before we get too carried away with this analysis, keep this in mind when dealing with statistics (particularly when there's not enough of them to draw valid conclusions) - A man with 1 foot in a bucket of ice and the other foot in a bucket of very hot water is statistically under "normal" conditions - BUT he is damned uncomfortable.

 

Here's what I think the CW "experiment" shows: Grading is not quantifiable, it's not math - It's subjective, it's subject to human interpretation of a series of different aspects of condition, strike, luster, toning, impairments, marks, etc. that affect the overall conclusion of the "number" grade to be assigned to the coin. This can be affected by the particular grader(s), the time of day (fresh v. tired), the amount of time spent assessing the coin, etc. It can also be affected by who defines what constitutes a particular grade for a particular coin type. One hit on a dime seems to have more of an effect than many hits on a Morgan dollar. The different criteria that determine grade are almost never the same (exclude Moderns here where they practically all look the same & the top pop grades seem more like a lottery system).

 

Here's the real test: pick say 50 coins, break them into packets of 5-10 coins, send them in, get the grades, crack 'em out, send them back to the same grading service in different mixed packets - do this 3-4 times - then send the same coins 3-4 times again in random order to the other services (they could be round-robined - 1 pack to TPGS A, 2nd pack to TPGS B, etc.) and keep track of all the results. Make sure no one knows which coins are going where in advance. THEN you'll be able to "grade" the grading services, because then we'll see how consistent they are in grading the same coins at different times in different groups. AND make sure no one knows what's happening, so no one is tipped off as to this type of "double blind" submission. Record all results, use lots of different coins - make sure you don't throw in easily identifiable coins (like a low pop 18/17-S FH) - it would stick out like a sore thumb, especially if the pop reports suddenly showed 20 being graded in a 2 month period.
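The submission schedule described above could be laid out ahead of time with a short Python sketch. Everything named here is an assumption for illustration: the coin IDs, the service labels, the `submission_plan` helper, and the choice of 3 rounds with packets of 10. The round-robin ordering between services is not modeled; each service simply receives every coin once per round in freshly shuffled packets.

```python
import random

def make_packets(coin_ids, packet_size, rng):
    """Shuffle the coins and split them into packets of at most packet_size."""
    shuffled = list(coin_ids)
    rng.shuffle(shuffled)
    return [shuffled[i:i + packet_size] for i in range(0, len(shuffled), packet_size)]

def submission_plan(coin_ids, services, rounds, packet_size, seed=0):
    """Return (service, round, packet) tuples with a fresh shuffle every round."""
    rng = random.Random(seed)   # fixed seed so the plan can be reproduced later
    plan = []
    for service in services:
        for round_no in range(1, rounds + 1):
            for packet in make_packets(coin_ids, packet_size, rng):
                plan.append((service, round_no, packet))
    return plan

coins = [f"coin-{i:02d}" for i in range(1, 51)]          # 50 hypothetical coins
plan = submission_plan(coins, ["TPGS A", "TPGS B"], rounds=3, packet_size=10)
print(len(plan), "packets to submit")
print(plan[0])
```

Keeping the plan (and the seed) with whoever runs the test, and away from anyone handling the submissions, is one way to keep the mixing of coins blind.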

 

I'm certain someone could develop a basket of coins for grading submission that would allow us to develop a statistical profile of how well the TPGS are able to consistently and accurately grade coins.

 

My guess - the results would show that even the best TPGSs are only about 80% (or thereabouts) consistent and accurate. Having seen many mis-grades by all of the TPGS, I can safely state: The grading services are not 100% accurate 100% of the time AND it's in their economic interest not to be - If they all got it right the 1st time, no one would ever have to re-submit a coin for re-grading, or crack out a coin, or heaven forbid, request that it be downgraded to take advantage of the "grade guarantee".

 

But then anyone who has been to a major show and watched coins being cracked out and resubmitted already knows this.

 

 

 

 


I agree with what you have written. The only reason I was still posting to this thread was because someone else seemed to be interpreting something I wrote in a way that I did not intend, therefore, I was trying to clarify it for him.


I believe the Stat-matics approach provides a good framework in which one can adequately describe any service's consistency (variability) and degree of conservatism (high or low in comparison with the expected grade (MLG)). This is all the study was trying to get at. Attempts to interpret more than that are in error. Our criticisms of the Stat-matics approach have been in the vein of advancing the science of describing these qualities that we care about. If, in advance, a person knows what to expect from a grading service, it serves the hobby well in terms of the ultimate translation into dollars and cents. This is all eminently possible, no matter whether a person understands it or not.

 

And just to be perfectly clear, none of this has anything to do with methodologies of grading, only the ultimate outcome. To think that the Stat-matics analysis or CW study has anything to do with methods of grading is entirely erroneous and verges on presumptuous.

 

Hoot


Now that I understand a little better, I think my other formulas would still work if you moved "grade" from the right side of the equation to the left and then put $xxx.xx on the right side of the equals sign. Still a very simplistic example of an extremely complex model. But the r-squared will determine significance towards the $$$.

 

However, given enough time some nerdy geek like me would try and figure it out. Maybe that is the main benefit of that article.


That is why I wanted greater submission populations of each specific coin to each service, in order to establish population distributions service to service. You can then use vertical line graphs from each service to compare like populations of each service's grade range, mean and standard deviation for each coin type. This will give a good visualization of grade differences on any specific series between services. This adds yet another graph to the analysis, but one that I feel is relevant. This of course assumes the samples are limited to coins that would normally be submission material.
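The numbers behind such a graph (grade range, mean and standard deviation per service) can be tabulated with a short Python sketch. The service names and grades below are invented for illustration, and `summarize` is a hypothetical helper; real use would need many repeat gradings of each coin type.

```python
from statistics import mean, stdev

def summarize(grades_by_service):
    """Map each service to (lowest, highest, mean, sample std dev) of its grades."""
    summary = {}
    for service, grades in grades_by_service.items():
        summary[service] = (min(grades), max(grades), mean(grades), stdev(grades))
    return summary

# Hypothetical repeat gradings of one coin type on the linear scale.
sample = {
    "Service A": [25, 26, 26, 27],
    "Service B": [24, 26, 28],
}

summary = summarize(sample)
for service, (low, high, avg, sd) in summary.items():
    print(f"{service}: range {low}-{high}, mean {avg}, std dev {sd:.2f}")
```

In this made-up case both services average the same grade, but the second has a wider range and a larger standard deviation, which is exactly the kind of difference a vertical line graph per service would show.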

 

It is interesting to see the differences in approach between statisticians, engineers and scientists. Same statistics, different approaches.
