A/B testing favors B over A

mmerrill99 · June 16, 2017

1 hour ago, esldude said:

I think the same effect is occurring with break in. It takes time to trust your feeling. For it to feel right. However, in neither case do I think a change in sound is really behind reaching this point of feeling comfortable.

Why not face up to the flaws in blind testing rather than look for excuses to denigrate all other results - it's one of the reasons why you loose believability as it's seen that something other than the truth is being sought

pkane2001 · June 16, 2017

1 hour ago, mmerrill99 said:

Of course it depends on what the test is being used for - if the objective is other than testing whether a difference is heard then it's a perfect stealth weapon.

It was a fake test before fake news became a popular meme

Assuming this bias towards the second test is true, it's a flaw only if you do a small number of A/B tests. It in no way invalidates blind testing as a methodology.

I usually switch between components for days while trying to evaluate differences. I am very well aware of my attention changing and often revealing differences that are not there. That's why I always try to confirm what I hear with repeated tests. I try to focus on something very specific in a familiar sound track, played for a very short time (less than a minute) before switching. I have a few favorite recordings I know very well, and I use specific portions of those recordings to test for different sound qualities. I do this blind, if I can set up such a test (and I always try to do it this way, if at all possible).

jabbr · June 16, 2017

1 hour ago, mmerrill99 said:

And I'm saying - what's the point in this "randomizing" - as I said if we have a source of error which is likely masking any small differences what's the point of randomising it - it only hides the effect - we would be better being aware of the effect in the results & discounting the results.

What is "likely" being masked and how did you determine this? If the "internal validity" of an experiment does not allow very very small differences to be detected then that has nothing to do with the experiment ability to detect significant differences.

The typical situation is that an attempt to disprove the null hyp either shows a difference or not -- if not and you know the resolving ability of the experiment, then you know that at best the difference is less than the resolving ability of the experiment.

mmerrill99 · June 16, 2017

13 minutes ago, pkane2001 said:

Assuming this bias towards the second test is true, it's a flaw only if you do a small number of A/B tests. It in no way invalidates blind testing as a methodology.

It invalidates ABX testing of the type called for on audio forums which represent the majority of the blind testing spoken about in this hobby. Blind testing in sensory perceptual testing is a very difficult test to do correctly!

15 minutes ago, pkane2001 said:

I usually switch between components for days while trying to evaluate differences. I am very well aware of my attention changing and often revealing differences that are not there. That's why I always try to confirm what I hear with repeated tests. I try to focus on something very specific in a familiar sound track, played for a very short time (less than a minute) before switching. I have a few favorite recordings I know very well, and I use specific portions of those recordings to test for different sound qualities. I do this blind, if I can set up such a test (and I always try to do it this way, if at all possible).

Yes, you are really doing what the majority of people do - listen over a long time to gain familiarity with the sound & when satisfied with the overall 'feel' drilling down into particular aspects which may be examined in short term tests.

But this short term A/B listening does not always result in identifiable differences & yet the 'feel' is different between two devices. Have you encountered this?

mmerrill99 · June 16, 2017

23 minutes ago, jabbr said:

....

The typical situation is that an attempt to disprove the null hyp either shows a difference or not -- if not and you know the resolving ability of the experiment, then you know that at best the difference is less than the resolving ability of the experiment.

Well isn't that the point - the resolving ability of the 'particular' experiment (each home administered experiment is different) is unknown as there is often no control, no calibration to show this resolving ability - I'm talking about the usual blind tests called for on audio forums, not laboratory organised blind tests.

pkane2001 · June 16, 2017

4 minutes ago, mmerrill99 said:

Well isn't that the point - the resolving ability of the 'particular' experiment (each home administered experiment is different) is unknown as there is often no control, no calibration to show this resolving ability - I'm talking about the usual blind tests called for on audio forums, not laboratory organised blind tests.

There's a huge difference between a sighted, long term A/B comparison with one or two attempts to switch components and a blind A/B test, repeated sufficient number of times to achieve statistical significance. While both can contain biases and other flaws, the blind test controls for many more variables and is much more objective and reproducible by others.

mmerrill99 · June 16, 2017

23 minutes ago, jabbr said:

What is "likely" being masked and how did you determine this? If the "internal validity" of an experiment does not allow very very small differences to be detected then that has nothing to do with the experiment ability to detect significant differences.

The differences perceived in normal listening are likely being masked.

mmerrill99 · June 16, 2017

4 minutes ago, pkane2001 said:

There's a huge difference between a sighted, long term A/B comparison with one or two attempts to switch components and a blind A/B test, repeated sufficient number of times to achieve statistical significance. While both can contain biases and other flaws, the blind test controls for many more variables and is much more objective and reproducible by others.

A test which has consistent & inherent flaws is by definition "reproducible by others".

That doesn't make it objective

By that definition sighted testing is objective as it is reproducible by others

pkane2001 · June 16, 2017

Just now, mmerrill99 said:

A test which has consistent & inherent flaws is by definition "reproducible by others".

That doesn't make it objective

What makes it objective is eliminating subjective biases that are known and well-studied.

jabbr · June 16, 2017

14 minutes ago, mmerrill99 said:

Well isn't that the point - the resolving ability of the 'particular' experiment (each home administered experiment is different) is unknown as there is often no control, no calibration to show this resolving ability - I'm talking about the usual blind tests called for on audio forums, not laboratory organised blind tests.

I was careful to discuss the need for randomization in the setting of multiple subjects. If the test isn't calibrated to start with then there is no way to determine its validity. As I said the fact that randomization doesn't solve every problem doesn't mean it doesn't solve any problems.

At home I do my own pseudo-blinded, casual listening impressions that I wouldn't describe as a formal experiment ... doing ABX correctly is work

mmerrill99 · June 16, 2017

2 minutes ago, pkane2001 said:

What makes it objective is eliminating subjective biases that are known and well-studied.

Eliminating some biases does not make it "objective" - if you said eliminating all biases, you would be able to make that claim but I would ask you to prove this!

pkane2001 · June 16, 2017

3 minutes ago, mmerrill99 said:

Eliminating some biases does not make it "objective" - if you said eliminating all biases, you would be able to make that claim but I would ask you to prove this!

My claim, as quoted below, is that a blind test is significantly more objective:

9 minutes ago, pkane2001 said:

blind test controls for many more variables and is much more objective and reproducible by others.

mmerrill99 · June 16, 2017

4 minutes ago, jabbr said:

I was careful to discuss the need for randomization in the setting of multiple subjects. If the test isn't calibrated to start with then there is no way to determine its validity. As I said the fact that randomization doesn't solve every problem doesn't mean it doesn't solve any problems.

Yes & I was hopefully careful to just point out that randomization was not the answer to the listening order bias - it seemed to be suggested by you & others that this was the case until I pointed out the flaw in your logic.

6 minutes ago, jabbr said:

doing ABX correctly is work

Yes & more than the average hobbyist wishes to engage in. Why should they ruin the their hobby?

jabbr · June 16, 2017

5 minutes ago, mmerrill99 said:

The differences perceived in normal listening are likely being masked.

The likelihood greatly depends on the calibration done eg was the volume carefully calibrated? If reasonable calibration is done then reasonable differences are not likely masked.

What this heightens however is the real need to correlate measurements with impressions -- in the simplest case the volume needs to be measured.

mmerrill99 · June 16, 2017

2 minutes ago, pkane2001 said:

My claim, as quoted below, is that a blind test is significantly more objective:

This is like comparing fatal diseases - they both result in the same outcome

I can't see how flaws can be 'objectively' compared, can you?

pkane2001 · June 16, 2017

Just now, mmerrill99 said:

This is like comparing fatal diseases - they both result in the same outcome

I can't see how flaws can be 'objectively' compared, can you?

No, it's like comparing a plague and a cold. Very different outcomes in most cases.

mmerrill99 · June 16, 2017

15 minutes ago, pkane2001 said:

No, it's like comparing a plague and a cold. Very different outcomes in most cases.

I disagree - sighted listening is biased towards false positives (hearing differences) ; blind testing is biased towards false negatives (not hearing differences)

The outcomes of both are very different so my analogy was flawed - one results in possibly always chasing what is better sounding - the other results in possibly not hearing what is better.

Which one is this hobby mainly about?

mmerrill99 · June 16, 2017

53 minutes ago, mmerrill99 said:

But this short term A/B listening does not always result in identifiable differences & yet the 'feel' is different between two devices. Have you encountered this?

@pkane2001, you never answered this - have you encountered this situation?

Jud · June 16, 2017

31 minutes ago, pkane2001 said:

Assuming this bias towards the second test is true, it's a flaw only if you do a small number of A/B tests. It in no way invalidates blind testing as a methodology.

I usually switch between components for days while trying to evaluate differences. I am very well aware of my attention changing and often revealing differences that are not there. That's why I always try to confirm what I hear with repeated tests. I try to focus on something very specific in a familiar sound track, played for a very short time (less than a minute) before switching. I have a few favorite recordings I know very well, and I use specific portions of those recordings to test for different sound qualities. I do this blind, if I can set up such a test (and I always try to do it this way, if at all possible).

A few thoughts here:

- The "less than a minute" time frame isn't nearly short enough. Scientific research shows echoic memory for everything except loudness lasts maybe 4-10 seconds.

- "[V]ery specific in a familiar sound track" is a really good thing. It means you may not actually be trying to compare your memories of the two sounds (see the first point; after a few seconds this is fruitless, *especially* if there's intervening music - I can give you a web link if you'd like), but rather doing "pattern matching," at which people are very, very good. (In fact we're so good we can sense patterns where none exist, which is responsible for optical and aural illusions).

- Regarding loudness, we are *really* good at detecting and remembering this, but really bad (as I mentioned before) at remembering other acoustic qualities. So very often, when we think we are comparing two musical passages as a whole, I think it is very possible what we are in fact doing is comparing loudness of the end of passage A with the beginning of passage B. We like louder (thus the loudness wars). I think if the end of music sample A is softer than the beginning of music sample B, that alone might easily account for a preference of B over A.

mmerrill99 · June 16, 2017

4 minutes ago, Jud said:

Regarding loudness, we are *really* good at detecting and remembering this, but really bad (as I mentioned before) at remembering other acoustic qualities. So very often, when we think we are comparing two musical passages as a whole, I think it is very possible what we are in fact doing is comparing loudness of the end of passage A with the beginning of passage B. We like louder (thus the loudness wars). I think if the end of music sample A is softer than the beginning of music sample B, that alone might easily account for a preference of B over A.

Good point among a series of good points

pkane2001 · June 16, 2017

2 minutes ago, Jud said:

- The "less than a minute" time frame isn't nearly short enough. Scientific research shows echoic memory for everything except loudness lasts maybe 4-10 seconds.

Agreed. I mentioned less than one minute to highlight the difference with days and weeks recommended by those who believe in long term evaluation

mmerrill99 · June 16, 2017

One point about patterns, depending on how they are being codified & stored, they have the potential to be a very discriminating tool at the level of overall differences but not at the specific differences level

jabbr · June 16, 2017

3 minutes ago, Jud said:

The "less than a minute" time frame isn't nearly short enough. Scientific research shows echoic memory for everything except loudness lasts maybe 4-10 seconds.

Yes but... when I compare two sounds, be they components or instruments or recording I "feature extract" that is commit my impression to memory be that smoothness of a string in a certain octave or extension of bass or position of instruments on soundstage, and these features can be remembered longer than 10sec

pkane2001 · June 16, 2017

7 minutes ago, mmerrill99 said:

@pkane2001, you never answered this - have you encountered this situation?

While I've encountered (many times) differences in sighted tests that I could swear were very obvious, I frequently could not make the same distinction in a blind test. To me, this is an indication of a failure of the sighted test to control for subjective variables, and not a failure of a blind test to discover differences.

Daudio · June 16, 2017

5 minutes ago, mmerrill99 said:

sighted listening is biased towards false positives (hearing differences) ; blind testing is biased towards false negatives (not hearing differences)

The outcomes of both are very different so my analogy was flawed - one results in possibly always chasing what is better sounding - the other results in possibly not hearing what is better.

Which one is this hobby mainly about?

Merrill,

Thanks, I think that is the best, simple, 10,000 ft up, analysis of this whole fervor over audiophile testing methods I've yet seen, cutting through the bias, obscuring details, and bullshit ! It should be copied to a pinned thread to guide all of us.

And if one isn't in this hobby to chase better sound, then what the hell are they doing it for ? Waste money, exercise their oscilloscopes, or Online Armored Combat ?

A/B testing favors B over A

Recommended Posts

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Create an account or sign in to comment

Create an account

Sign in