Cliff Grays Podcast with Aaron Davidson

Do they have their own ballistic software / optics line? Are they really good at using someone’s blank, or their own, chambering it to their own in-house manufactured action, and putting it in a stock they designed for the hunter with an explicit purpose?

Just answer yes or no.


I think the smooth brains really miss the point, sound poor, then hurl insults when this whole thing started about a podcast and a drop test.
YES
 
How about you enlighten me.

And, how about you provide me something better. Because if you don’t have anything better, it doesn’t matter how bad it is. It’s still the best there is. Again, I DO NOT CARE if good scopes fail. As long as bad scopes don’t pass, it has value. What is my alternative?
I got you bby.

“Clearly, Aaron did not read the notes on the scope testing…many variables are either addressed or semi-controlled.”

  • Claim without specifics. Saying variables are “addressed” or “semi-controlled” isn’t the same as demonstrating control. Which variables? How were they measured, bounded, and audited? Without a written protocol, tolerances, and QC checks, this is assertion, not evidence.
  • “Semi-controlled” invites bias. Partial control often shifts variance from random to systematic (operator, setup, environment). That tends to make results look repeatable while actually reflecting a hidden bias in the rig or method.

“Three failures in a row says something different than three passes in a row from small samples.”
  • Only under independence and identical conditions. Run-length in a Bernoulli process is meaningful if trials are i.i.d. If the same test setup systematically induces failure (e.g., impact angle, turret orientation, a stressed ring stack), those three “independent” failures may be three reads of the same bias.
  • Asymmetric inference. Three passes don’t prove reliability; agreed. But three fails don’t cleanly estimate population failure rate either—especially with convenience sampling, no randomization, and operator effects.
  • Math is conditional on unknowns. If a scope truly “passes” with probability p, then three fails in a row occur with (1-p)^3. Example: if p=0.9, (1-0.9)^3=0.1^3=0.001 (0.1%). If p=0.7, it’s 0.3^3=0.027 (2.7%). The point: without a credible estimate of p from unbiased, controlled data, the “wow” factor of a fail-run is hard to interpret.
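For anyone who wants to play with that arithmetic, here is a minimal sketch in plain Python (the pass probabilities are illustrative, and the calculation only holds under the i.i.d. assumption the bullets above question):

```python
# Probability of n consecutive failures when each unit independently
# "passes" with probability p (i.i.d. Bernoulli trials). Illustrative only:
# if the rig itself biases outcomes, the trials aren't independent and this
# arithmetic no longer applies.
def prob_fail_run(p_pass: float, n_fails: int = 3) -> float:
    return (1.0 - p_pass) ** n_fails

for p in (0.9, 0.7, 0.5):
    print(f"p = {p:.1f}: P(3 fails in a row) = {prob_fail_run(p):.3f}")
# -> 0.001, 0.027, 0.125
```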


“The drop tests aren’t scientific… but the open ‘available to anyone’ aspect has massive value.”
  • Availability ≠ validity. Openness is great, but decision value comes from measurement quality: calibrated height, measured impact energy (see the sketch after this list), controlled surface durometer, defined orientation, pre-registered pass/fail criteria, and blinded scoring.
  • Construct validity gap. Does this test replicate field-relevant loads? Mixed, unmeasured impact vectors may overweight turret-first impacts and underweight recoil & vibration—skewing failure modes away from what most users experience.
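This is the sketch referenced in the bullet above: idealized free-fall numbers only, ignoring rotation, glancing contact, and surface compliance. The mass and drop heights are hypothetical, chosen just to show how strongly the applied energy depends on a height that isn't measured or fixed.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

def free_fall_impact(mass_kg: float, height_m: float) -> tuple[float, float]:
    """Idealized impact velocity (m/s) and kinetic energy (J) for a free-fall
    drop; ignores rotation, glancing hits, and surface compliance."""
    v = math.sqrt(2.0 * G * height_m)
    return v, 0.5 * mass_kg * v ** 2

rig_mass_kg = 4.0                  # hypothetical rifle-plus-scope mass
for h in (0.5, 0.9, 1.2):          # hypothetical drop heights, meters
    v, e = free_fall_impact(rig_mass_kg, h)
    print(f"{h:.1f} m drop: ~{v:.1f} m/s, ~{e:.0f} J at impact")
```

Between a 0.5 m and a 1.2 m drop the idealized impact energy more than doubles, which is exactly the kind of spread an unmeasured height hides.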


“If a test is pretty repeatable with similar results…it has some validity.”


  • Repeatable ≠ correct. A biased bathroom scale is “repeatable.” Validity requires accuracy against a traceable reference (e.g., instrumented drop, collimator-based zero shift, tall-target tracking with error bounds), not just consistency.

“Manufacturers’ proprietary tests don’t help me; I need the same test across brands.”

  • False dilemma. It’s not “this open test or nothing.” The real bar is standardized, audited third-party methods (documented rigs, instrumented impacts, blind labeling). Proprietary data can still be probative if independently verified; open data can still mislead if poorly controlled.

“Unless critics replace it with something better that’s available, they’re blowing hot air.”

  • Burden of proof is on the test. Critique doesn’t require offering a turnkey replacement; it requires showing threats to validity (confounding, bias, poor reliability). “Use it until something better exists” is a policy stance, not a scientific defense.

“This is the ONLY option other than sticking my head in the sand.”

  • Availability bias. Claims of uniqueness ignore other reliability evidence (e.g., warranty/RMA rates, controlled tracking tests, recoil/vibration standards, multi-lab ring-down/box tests). If those aren’t consolidated, that’s a curation problem, not proof they don’t exist.

“Worrying about throwing out good scopes doesn’t matter to me; I just want a higher chance of a reliable one.”
  • Screening math cuts both ways. A harsh, noisy test with unknown specificity can spike false rejects—filtering out many good units and preferentially selecting designs robust to this impact profile, not necessarily to real-world use. Decision quality depends on sensitivity/specificity and the cost ratio of false fail vs. false pass, none of which are quantified.
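A minimal sketch of that screening arithmetic, using made-up numbers for sensitivity, specificity, and the share of truly unreliable scopes (none of these values come from the eval), just to show how false rejects and the meaning of a “fail” verdict move with specificity:

```python
# Toy screening model. All inputs are hypothetical and only illustrate the
# sensitivity/specificity tradeoff described above.
def screening_outcomes(n_units: int, bad_rate: float,
                       sensitivity: float, specificity: float):
    bad = n_units * bad_rate
    good = n_units - bad
    true_fails = bad * sensitivity           # truly bad units that fail the test
    false_fails = good * (1 - specificity)   # good units wrongly failed
    total_fails = true_fails + false_fails
    ppv = true_fails / total_fails if total_fails else 0.0
    return false_fails, ppv                  # ppv = P(truly bad | test says fail)

for spec in (0.95, 0.80, 0.60):
    ff, ppv = screening_outcomes(100, bad_rate=0.10, sensitivity=0.90,
                                 specificity=spec)
    print(f"specificity {spec:.2f}: ~{ff:.0f} good scopes rejected per 100, "
          f"P(bad | fail) = {ppv:.2f}")
```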
“I had 3 of 4 scopes from one maker fail; swapping only the scope fixed it.”
  • Confounding remains. Identical torque ≠ identical clamping force (lubricity, screw stretch, torque wrench calibration). Tube OD variances, wall thickness, and ring ovalization cause different stress states for each scope in the same rings. Without ABAB crossover (fail scope → good scope → fail scope again) and independent verification (collimator), you risk mistaking interaction effects for unit defects.
  • Selection & survivorship bias. Four units isn’t a population study. Batch effects, early production, or retailer pre-screening can skew your sample. Your experience is valid for you, but it doesn’t estimate brand-level failure rates.
“It isn’t rocket surgery to narrow it down when swapping scopes flips the result.”
  • Post hoc flip isn’t isolation. Flips can stem from small shifts in eye position, parallax, mounting tension release/re-clamp, rail stress relief, or ring seating. Isolation needs blinded mounting, fixture-based aim (no shooter influence), order randomization, and test-retest to rule out regression to the mean.
“Variables like angle, surface, landing point aren’t controlled—but that’s fine because the test is accessible.”
  • Uncontrolled inputs change the outcome distribution. Without fixed orientation (e.g., turret-first vs eyepiece-first), you’re not comparing like-for-like across designs. A scope robust to side impacts may look “bad” if the test over-represents turret-down hits. Accessibility doesn’t excuse mixing apples and anvils.

“It seems pretty repeatable with similar results.”

  • Where’s the reliability stat? “Seems” needs numbers: intra-rater agreement, test–retest variance, effect sizes with confidence intervals, and inter-lab reproducibility. If two operators can’t reproduce each other’s results under the same protocol, repeatability is illusory.
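One concrete example of “a reliability stat” is Cohen’s kappa for two operators scoring the same scopes pass/fail. The counts below are entirely hypothetical; the only point is that “seems repeatable” can be turned into a number:

```python
# Cohen's kappa for two raters with binary pass/fail scores.
# Counts are hypothetical, for illustration only.
def cohens_kappa(both_pass: int, a_pass_b_fail: int,
                 a_fail_b_pass: int, both_fail: int) -> float:
    n = both_pass + a_pass_b_fail + a_fail_b_pass + both_fail
    p_observed = (both_pass + both_fail) / n          # raw agreement
    p_a = (both_pass + a_pass_b_fail) / n             # rater A pass rate
    p_b = (both_pass + a_fail_b_pass) / n             # rater B pass rate
    p_chance = p_a * p_b + (1 - p_a) * (1 - p_b)      # agreement expected by chance
    return (p_observed - p_chance) / (1 - p_chance)

# e.g. 20 scopes scored independently by two operators (hypothetical counts):
print(f"kappa = {cohens_kappa(9, 3, 2, 6):.2f}")      # ~0.49: moderate agreement
```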
Core methodological gaps (the “why it’s vulnerable” list)

  • No pre-registered protocol: Without a frozen playbook (heights, surfaces, orientations, pass/fail thresholds, sample sizes), it’s easy to unconsciously tune conditions.
  • No instrumentation: Lack of measured acceleration/energy means you don’t know what you actually applied.
  • No blinding/randomization: Brand knowledge and order effects can influence setup, inspection, and interpretation.
  • Small-n with convenience sampling: Results are fragile and prone to runs, selection bias, and overinterpretation.
  • Outcome measure muddiness: Group shift can be shooter-, ammo-, or condition-driven; optical collimation or tall-target tracking would isolate the scope.
  • Unknown error rates: Sensitivity/specificity of the test to true mechanical failure modes are unquantified.
Constructive upgrades (minimal overhead, big payoff)
  • Fix three orientations (turret-down, ocular-down, side-impact) with a simple jig; photograph each setup.
  • Use one standard surface (documented durometer) and a measured drop height.
  • Blind the brand/model (tape the markings); randomize test order.
  • Pre-register pass/fail thresholds (e.g., ≥1.0 MOA zero shift after N drops) and publish all results, not just notable ones.
  • Verify zero shift with a collimator (no shooter noise) and add a tall-target tracking check pre/post.
  • Report CIs for shifts and a simple power analysis for planned sample sizes.
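On that last bullet, here is a minimal power-analysis sketch (normal approximation for comparing two failure proportions). The failure rates, alpha, and power are assumptions for illustration, not numbers from the eval:

```python
import math

# Rough per-group sample size to distinguish two failure rates with a
# two-proportion z-test (normal approximation). Illustrative inputs only.
def n_per_group(p1: float, p2: float,
                z_alpha: float = 1.96,   # two-sided alpha = 0.05
                z_beta: float = 0.84) -> int:  # power = 0.80
    pooled_var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * pooled_var / (p1 - p2) ** 2)

print(n_per_group(0.10, 0.40))   # ~29 scopes per brand for a large difference
print(n_per_group(0.10, 0.20))   # ~196 per brand for a modest difference
```

Even with generous assumptions, separating modest differences in failure rate takes far more units than the two- or three-sample comparisons that drive most of the arguing.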


Bottom line: your policy argument (open, comparable, better than nothing) is understandable. But the scientific argument hinges on control, measurement, and error rates. Until those are nailed down, consecutive failures, personal flip-tests, and “seems repeatable” carry less evidentiary weight than they appear.
 
It’s so interesting to continually hear people talk about something with so much assurance when they haven’t read what is actually done, don’t understand it, and have never attempted to replicate it.

You know, something like the scientific method.
It’s easy to dismiss something when you’ve never actually tried it. Especially with internet confirmation bias from others who also have never tried it.

Honestly makes a guy want to just delete the little bit of internet he actually has (Rokslide).
 
How about get off the internet and go shoot? 🤓
 
Dude, not here to argue with you. You can drink Aaron’s KOOLAID. I still have my opinion of Aaron and his marketing, and his proclaimed know-it-all status, as nothing but a Carnival Barker. His arrogance precedes him and he offends many. I don’t have the patience to teach you; you are obviously a GW fanboy. Every gun builder seeks out Aaron’s knowledge and expertise. TFF. By the way, your boy Aaron does not own a scope manufacturing company. Also, the builders I mentioned do have proprietary stocks. Most of these builders use refined CRF Mod 70 actions and Granite Mtn Arms Mausers. D’Arcy also has his own action. Again, you can be bedazzled by Aaron’s actions; I’m not. By the way, I checked with D’Arcy and he said he didn’t get any assistance on his design and engineering or consultation from Aaron. LOL
No hard feelings, I will hunt with my Echols, Simillion, Penrod, Buehler, Heilman rifles and you have fun with your GW’s and whatever else he’s marketing.
 
He asked for links to all the stuff you said the other manufacturers make. Not to blather on and throw out the names of all the manufacturers you can think of. Talk about coming across arrogant.
 
I listened to the entire podcast. It is clear that Aaron knows a LOT about shooting. And I too was interested in his spin on lighter calibers, negative comb stocks, and the $50k he spent on a bench to conduct drop testing.

My issue is most all of my rifles are either Tikka, Savage, or Weatherby Vanguard, and they all have factory barrels and Vortex Viper scopes. Someone like Aaron Davidson would laugh me out of his shop. I looked at his website and there was a scope for $2300. I am sure it is worth it too, but that is just so FAR out of my class.
Same here!
 
Often it's not what you say but how you say it, to really get the message across.

 

I want to start by saying I own no Gunwerks rifles. It is unlikely I ever will. I roll my own. So not a fanboy, no horse in the race.

I am really curious about the vitriol from a guy that hunts with 20K rifles. Davidson must have deeply offended you in some way. Perhaps you feel like you need to defend/justify the money you have spent on your rifles? You shouldn't worry about it, they are beautiful works of art and you are a lucky guy to own them. Enjoy them, you don't need to justify your choices.

The best analogy I can come up with is a guy who owns a bunch of really expensive Swiss watches and belittles the Apple Watch. This guy says, "But these watches were expensive, they took hundreds of hours to make, they are beautiful, therefore they are better." Well no, the Apple Watch is better at the job of keeping time than the $50k Swiss watch; it just is. Your rifles shoot well and are beautiful; that doesn't mean they are better at the job of killing animals than a GW gun. In fact, I am guessing that side by side in an accuracy test the GW rifles are more accurate; heck, I am guessing the rifles I put together are more accurate. Couple that with wood stocks in harsh environments… problems abound.

The hunting world is a big tent, there is room for everybody. As far as I know Davidson hasn't committed any heinous crimes. Perhaps the harsh words are unwarranted and it is time for some introspection.
 
Holy hell. Are you really going to take my shorthand, colloquial explanation and use a computer to pick every nuance apart, as if my brief post was the actual eval protocol?

Address what I said for what it is (without the logical leaps that I did not make), and then address the eval separately based on the actual documentation that is already linked on this thread. I’m happy to participate in a different thread and talk about the points that are actually valid for the eval here, but there’s so much noise in that AI response that doesn’t apply here that I’m not going to bother going through the points. I do not agree with some of the assertions, because I think many of the arguments against what I said aren’t actually present or relevant in this specific case.

This part below I appreciate though. I think you may be one of the first handful of people I’ve seen actually make specific suggestions on improving the evals, rather than just tearing them apart and walking away. Your points 1 and 2 have +/- been my suggestion all along. Have you actually read the eval explanation and process? It’s written down on this site for everyone to see, but apparently very few people actually read it. Your list reads as if you have not read through the eval process, so I would be curious to hear what your specific suggestions for a crowdsource-able eval are after you read it. Then we can have a conversation about practicality.

  • Fix three orientations (turret-down, ocular-down, side-impact) with a simple jig; photograph each setup.
  • Use one standard surface (documented durometer) and a measured drop height.
  • Blind the brand/model (tape the markings); randomize test order.
  • Pre-register pass/fail thresholds (e.g., ≥1.0 MOA zero shift after N drops) and publish all results, not just notable ones.
  • Verify zero shift with a collimator (no shooter noise) and add a tall-target tracking check pre/post.
  • Report CIs for shifts and a simple power analysis for planned sample sizes
“This is the ONLY option other than sticking my head in the sand.”

  • Availability bias. Claims of uniqueness ignore other reliability evidence (e.g., warranty/RMA rates, controlled tracking tests, recoil/vibration standards, multi-lab ring-down/box tests). If those aren’t consolidated, that’s a curation problem, not proof they don’t exist.
This is super helpful; obviously this other measured and very specific evidence is what we’ve really wanted all along. I had trouble finding any of this just now, though. Any chance you could post the link? (Practicality matters. Not available = doesn’t exist, in practical terms.)
 
I’ve had rifles built by a gentleman in the benchrest world whom others consider a hall-of-fame gunsmith. I then sent rifles to AZ Ammo for evaluation and development of several different loads that best fit each rifle. AZ Ammo then sent me 30 pages of detailed notes on each rifle and the load development, along with pictures. The results were great. This was over a decade ago. Recently I decided to try a Gunwerks product. The entire package was delivered, and my first cold-bore 5-shot group was .295. After zeroing, my next 3-shot group was .29. Later I tuned it in and shot clay targets at 1000 yards. That was the gun shooting over my capabilities! In my dealings with Gunwerks I have spoken to 4 of their dealers, customer service, 2 salespeople, and Aaron. They have all been very professional, with no hints of superiority. During discussions I have brought up competitive products and never heard them disparage the competition. It’s been a great experience to actually have someone promptly return calls and offer help. I posted this just as my personal experience in dealing with GW as a customer.
 
I wish Ryan could block AI generated posts.

The whole thread has devolved into smooth brains hurling insults because they are Reddit dudes who hate GW. My point was only to show there are variables, exactly as Aaron stated and as you agreed as well.

Drop tests are good, and the process between lab and field could be the best future marriage. I used RS drop tests to narrow my scope selection for 3 separate rifles. (Went NF.)

I’m not a worshipper of GW. I actually called BS on the glass being that close to the big guys, so I just ordered a set of the new binos to compare against my Leica Pros with AB. Both will be joining me in north Idaho on an elk/deer hunt, along with my eye-doctor wife, who has great eyes, to help with some testing.
 
This is the stated ethos and goal of Gunwerks. We strive to live up to the expectations of our customers. To own each aspect of the shooting system is a nearly impossible task. We will continue the pursuit of excellence and innovation!
 

It can’t possibly be true if my “scientific” tests can’t replicate it… Can see it, can repeat it, but can’t figure out a controlled test for it, so it must not be real. It is a failure of the test protocols… Arrogance? Willful ignorance? Schizophrenia (sees an alternative reality)?

Please, for Leupold’s sake, find a way the average guy can assemble a rifle system and have it hold zero with a Leupold scope.
 