mmerrill99 Posted June 17, 2017 Share Posted June 17, 2017 So rather than admit that randomization does not eliminate listening order bias in A/B testing, we see here an attempt at obfuscation: changing the test from the A/B testing that is being discussed. Why am I not surprised? All these red herring posts about randomization, & now we see that randomization is not the answer to eliminating listening order bias; it requires a totally different test design. Link to comment
Daudio Posted June 17, 2017 Share Posted June 17, 2017 51 minutes ago, Teresa said: Let's assume that Amp A sounds better than Amp B when... Teresa, I thought you said you had a memory problem ??? Good work girl Teresa 1 Link to comment
WMW Posted June 17, 2017 Share Posted June 17, 2017 jabbr's "list" is about a differently designed experiment than the OP detailed, designed to eliminate the propensity for error owing to experimental design rather than true difference between variables studied. Do you not see? Bill Walker PS I have no dog in this race. As a physician however, I submit that if randomized, double blind testing were not embraced in medicine we might still be studying voodoo and astrology as vectors in human disease. Link to comment
WMW Posted June 17, 2017 Share Posted June 17, 2017 And you, ralf11 - I VERY much enjoy your posts and your humor. Thanks most kindly! Bill Walker Link to comment
mmerrill99 Posted June 17, 2017 Share Posted June 17, 2017 12 minutes ago, WMW said: jabbr's "list" is about a differently designed experiment than the OP detailed, designed to eliminate the propensity for error owing to experimental design rather than true difference between variables studied. Do you not see? Bill Walker PS I have no dog in this race. As a physician however, I submit that if randomized, double blind testing were not embraced in medicine we might still be studying voodoo and astrology as vectors in human disease. Yes, & the whole discussion was about A/B listening & listening order bias. He made this statement in his post on the first page: Quote The purpose of randomization is to reduce/eliminate systemic errors assuming sufficient sample size. (The preference for first vs second would cancel as roughly equal numbers of Amp A and Amp B would be listened first vs second.) ... but you need to have enough different people listening There he specifically stated that randomization in A/B testing would cancel any bias preferences, & he has argued with me all through this thread along the same lines. It's only now, 6 pages in, that we find he has moved the goalposts. What a waste of everyone's time & energy!! Link to comment
pkane2001 Posted June 17, 2017 Share Posted June 17, 2017 58 minutes ago, Teresa said: If the last played sample is always chosen as better, and there is close to the same numbers of both A and B being randomly selected as the last sample, then it hides any real audible differences. I think there's a misunderstanding here. It's not the last sample played, it is the second sample played that seems to get the preference. I don't know how the last sample in a long random length sequence could be possibly preferred, as long as the experimenter does not warn the test subjects before the last sample. The Stereophile article that started this thread is explicit that this bias was found only after the first two samples: Quote As much as this says about the limits of an A/B comparisons based on listening to short passages of music without the opportunity to at least return to A after having heard A and B, it also produced some extremely revealing commentary. By repeating the test more than two times and by randomizing A and B samples one reduces the effect of this identification bias. The more tests you run, the lower the effect of the bias will be on the overall result. You can also throw out the results of the first two samples to eliminate all effects of this 'second sample preference' bias. jabbr 1 -Paul DeltaWave, DISTORT, Earful, PKHarmonic, new: Multitone Analyzer Link to comment
jabbr Posted June 17, 2017 Share Posted June 17, 2017 On 6/15/2017 at 9:07 AM, jabbr said: This is an example of study bias. ... The study designer needs to correct. One way is to randomize -- another is to do multiple tests of each amp combo with order mixed. etc etc This is from my first post. Let me clarify that I would not use one technique to the exclusion of the other, but rather both together. So: 3-6 listening episodes, with A/B randomized at each episode. With 3 episodes, that means 1/8 would get A-A-A (1/2 x 1/2 x 1/2), 1/8 would get A-A-B, and so on to B-B-B. That's what the lists mean. This would be one example of a study design that didn't exhibit listening order bias. There are other ways to eliminate bias. It should be obvious that if we repeated the same study it might very well show the same listening order bias, and that's why I explained that we would change the study design in order to eliminate the bias. Custom room treatments for headphone users. Link to comment
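The randomized-episode design described above can be sketched in a few lines of Python. This is a minimal simulation, not anything from the thread: it just confirms that with 3 independently randomized episodes, each of the 8 orderings (A-A-A through B-B-B) turns up with probability 1/2 x 1/2 x 1/2 = 1/8.

```python
import random
from collections import Counter

def assign_sequence(episodes=3):
    """Randomly assign amp A or B to each listening episode."""
    return tuple(random.choice("AB") for _ in range(episodes))

# Simulate many subjects: each of the 8 orderings (A-A-A ... B-B-B)
# should occur with frequency close to 1/8.
random.seed(1)
counts = Counter(assign_sequence() for _ in range(80000))
for seq, n in sorted(counts.items()):
    print("-".join(seq), round(n / 80000, 3))
```

Running it shows all eight sequences near 0.125, which is the "list" the post refers to.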
mmerrill99 Posted June 17, 2017 Share Posted June 17, 2017 16 minutes ago, pkane2001 said: I think there's a misunderstanding here. It's not the last sample played, it is the second sample played that seems to get the preference. I don't know how the last sample in a long random length sequence could be possibly preferred, as long as the experimenter does not warn the test subjects before the last sample. The Stereophile article that started this thread is explicit that this bias was found only after the first two samples: By repeating the test more than two times and by randomizing A and B samples one reduces the effect of this identification bias. The more tests you run, the lower the effect of the bias will be on the overall result. You can also throw out the results of the first two samples to eliminate all effects of this 'second sample preference' bias. Again, what you don't seem to understand - maybe willfully so - is that what Teresa means by last sample is perfectly clear: the second listened-to sample in every pair of samples. All the randomization is doing is spreading the bias between A & B so that it is no longer evident in the results, but it is not eliminated from the test. Again, let's say there is no randomization, and B is always listened to 'last' in the pair - the results, when analyzed, will show a clear bias towards a preference for B. When we randomize, if A is heard last then it will be preferred (irrespective of the real difference between them); if B is heard last then it will be preferred (irrespective of the real difference between them) - we are assuming subtle differences. All this randomization is doing is returning a null result, masking any chance of discriminating real differences & hiding the fact that there is real bias in operation, which would have been evident in the results if no randomization had been done. Link to comment
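The masking effect argued for here is easy to check with a toy Monte Carlo. All the numbers below (a hypothetical 70/30 true preference for B, and an order bias that completely dominates the pick) are invented for illustration; the point is only that when "second sample wins" decides every trial, randomizing the order returns a 50/50 split no matter how large the real difference is.

```python
import random

def ab_trial(true_pref_b=0.7, order_bias=1.0):
    """One forced-choice A/B trial with the two samples in random order.

    With probability `order_bias` the listener simply picks whichever
    sample came second; otherwise the pick reflects genuine preference
    (B preferred with probability `true_pref_b`). All figures here are
    made up for illustration."""
    order = random.sample(["A", "B"], 2)
    if random.random() < order_bias:
        return order[1]                      # second sample wins
    return "B" if random.random() < true_pref_b else "A"

random.seed(2)
n = 50000
biased = sum(ab_trial(order_bias=1.0) == "B" for _ in range(n)) / n
honest = sum(ab_trial(order_bias=0.0) == "B" for _ in range(n)) / n
print(biased)   # hovers near 0.5: the real 70/30 preference is masked
print(honest)   # hovers near 0.7 when the order bias is absent
```

So under this (extreme) assumption the randomized result is a null, exactly as the post argues; the counter-argument later in the thread is that the bias is not total and fades after the first trials.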
Teresa Posted June 17, 2017 Share Posted June 17, 2017 22 minutes ago, pkane2001 said: I think there's a misunderstanding here. It's not the last sample played, it is the second sample played that seems to get the preference. I don't know how the last sample in a long random length sequence could be possibly preferred, as long as the experimenter does not warn the test subjects before the last sample. No misunderstanding: the second of two samples is the last sample. There is no long random length sequence played, just either A-B or B-A. The last (second) sample played always sounded the best. If you tested 100 people and the first and last (second) samples were each played randomly an equal number of times, you would nullify any sonic differences. I have brain problems so I have an excuse; why is this so hard for you to understand? 22 minutes ago, pkane2001 said: The Stereophile article that started this thread is explicit that this bias was found only after the first two samples: Quote As much as this says about the limits of an A/B comparisons based on listening to short passages of music without the opportunity to at least return to A after having heard A and B, it also produced some extremely revealing commentary. Correct, there were only two samples; sometimes A was first, sometimes B was first, however the first played sample was never chosen as the best. 22 minutes ago, pkane2001 said: By repeating the test more than two times and by randomizing A and B samples one reduces the effect of this identification bias. The more tests you run, the lower the effect of the bias will be on the overall result. You can also throw out the results of the first two samples to eliminate all effects of this 'second sample preference' bias. There is no way to get more than 50% correct if the second (last) sample is always preferred to the first. Any sonic differences would be hidden in the "A/B testing favors B over A" scenario. I have dementia. 
I save all my posts in a text file I call Forums. I do a search in that file to find out what I said or did in the past. I still love music. Teresa Link to comment
Ralf11 Posted June 17, 2017 Share Posted June 17, 2017 46 minutes ago, WMW said: And you, ralf11 - I VERY much enjoy your posts and your humor. Thanks most kindly! Bill Walker Thx! Your comment re testing is spot on. Cue the anti-vaxers... Link to comment
Teresa Posted June 17, 2017 Share Posted June 17, 2017 48 minutes ago, WMW said: ...PS I have no dog in this race. As a physician however, I submit that if randomized, double blind testing were not embraced in medicine we might still be studying voodoo and astrology as vectors in human disease. Bill, I agree with regards to medical double-blind studies. Double-blind tests work in drug testing because the human subjects don't have to make any decisions whatsoever. The subjects are given either the real medicine or a sugar pill. Those who get well taking the sugar pill do so because they unconsciously believe the medicine might be real, and thus their immune system manages to fight off the disease; this is known as the placebo effect. If considerably more people get well with the new drug than with the sugar pill, the drug is considered effective. None of our senses come into play in this type of test. Audio is different, as it requires human beings to consciously make choices, and we are not too good at that. Link to comment
pkane2001 Posted June 17, 2017 Share Posted June 17, 2017 1 minute ago, Teresa said: No misunderstanding the second of two samples is the last sample. There is no random length sequence played. The last (second) sample played aways sounded the best. If you tested a 100 people and the first and last (second) sample was played was randomly equally you would nullified any sonic differences. I have brain problems so I have an excuse, why is this so hard for you to understand? Correct, there was only two samples, sometimes A was first, sometimes B was first, however the first played sample was never chosen as the best. There is no way to get more than 50% correct if the second (last) sample was always preferred to the first. Any sonic differences would be hidden in the "A/B testing favors B over A" scenario. A blind A/B test consists of two samples, A and B, tested in a random sequence of a variable length. Saying that the A/B test must only consist of two tests unnecessarily cripples it, reducing its statistical validity. So, take my suggestion and run the A/B test 10 times with each test subject and throw out the first two results from each sequence of 10. Where is this 'second sample' bias in such a blind test? -Paul DeltaWave, DISTORT, Earful, PKHarmonic, new: Multitone Analyzer Link to comment
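The "run 10 trials, discard the first two" proposal can also be sketched as a simulation. The hedged assumption here, following the Stereophile observation quoted earlier in the thread, is that the "second sample wins" effect shows up only in the first couple of trials, after which picks reflect real preference; the 70/30 preference figure is invented for illustration. Under that assumption, discarding the first two picks recovers the true preference rate.

```python
import random

def session(trials=10, discard=2, true_pref_b=0.7):
    """One subject's blind A/B session of `trials` forced choices.

    Assumed model (illustration only): the 'second sample wins'
    order bias decides the first two trials; later picks reflect
    the subject's genuine preference."""
    picks = []
    for t in range(trials):
        order = random.sample(["A", "B"], 2)
        if t < 2:
            picks.append(order[1])           # early trials: order bias
        else:
            picks.append("B" if random.random() < true_pref_b else "A")
    return picks[discard:]                   # throw out the first two picks

random.seed(3)
kept = [p for _ in range(5000) for p in session()]
b_share = sum(p == "B" for p in kept) / len(kept)
print(round(b_share, 2))
```

Whether the order bias really is confined to the early trials is exactly what the two sides of this thread disagree about; if it persisted through all 10 trials, the masking described in the previous posts would remain.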
jabbr Posted June 17, 2017 Share Posted June 17, 2017 4 minutes ago, pkane2001 said: So, take my suggestion and run the A/B test 10 times with each test subject and throw out the first two results from each sequence of 10. Where is this 'second sample' bias in such a blind test? This is a good way to remove "training bias" Custom room treatments for headphone users. Link to comment
mmerrill99 Posted June 17, 2017 Share Posted June 17, 2017 6 minutes ago, pkane2001 said: A blind A/B test consists of two samples, A and B, tested in a random sequence of a variable length. Saying that the A/B test must only consist of two tests unnecessarily cripples it, reducing its statistical validity. So, take my suggestion and run the A/B test 10 times with each test subject and throw out the first two results from each sequence of 10. Where is this 'second sample' bias in such a blind test? Oh dear - massive fail in understanding Teresa 1 Link to comment
pkane2001 Posted June 17, 2017 Share Posted June 17, 2017 2 minutes ago, mmerrill99 said: Oh dear - massive fail in understanding I'm glad you've finally admitted it -Paul DeltaWave, DISTORT, Earful, PKHarmonic, new: Multitone Analyzer Link to comment
Teresa Posted June 17, 2017 Share Posted June 17, 2017 5 hours ago, pkane2001 said: ...So, take my suggestion and run the A/B test 10 times with each test subject and throw out the first two results from each sequence of 10. Where is this 'second sample' bias in such a blind test? That wouldn't work either, due to how the human brain works. Repeated A/B'ing runs into trouble with both cognitive bias and listener fatigue, meaning you would have to also throw out samples 3-10 in your proposed test. With cognitive bias, your brain will either fill in missing information or remove it, thus making each sample sound the same on repeated switching back and forth. And listener fatigue will guarantee that after just a few switches back and forth both A and B music samples will sound like crap. There is no quick shortcut; long-term listening to music one loves is the only way to discover what one likes. Link to comment
pkane2001 Posted June 17, 2017 Share Posted June 17, 2017 2 hours ago, Teresa said: That wouldn't work either due to how the human brain works. Repeated A/B'ing runs into trouble with both cognitive bias and listener fatigue, meaning you would have to also throw out samples 3-10 in your proposed test. With cognitive bias your brain will either fill in missing information or remove it thus making each sample sound the same on repeated switching back and forth. And listener fatigue will guarantee that after just a few switches back and forth both A and B music samples will sound like crap. There is no quick shortcut, long-term listening to music one loves is the only way to discover what one likes. You are really determined to make a blind A/B test into a bad thing. Cognitive bias is exactly what you get in a long term listening test. In such a test you are not evaluating the equipment, you are evaluating the ability of your brain to adjust to your equipment. If you conduct it sighted, then you are also adding in the much more powerful selection bias into your test. Listener fatigue is something that might occur after a prolonged exposure to some sound. Show me a study that demonstrates listener fatigue after a total of a few minutes of repeated listening to the same track in an A/B test. -Paul DeltaWave, DISTORT, Earful, PKHarmonic, new: Multitone Analyzer Link to comment
mmerrill99 Posted June 17, 2017 Share Posted June 17, 2017 18 minutes ago, pkane2001 said: Show me a study that demonstrates listener fatigue It's a well-known issue with blind testing, & catering for it (among many other impediments in testing) is recommended in the ITU BS standards. Recognizing the weaknesses/faultlines in any test is the objective approach, don't you think? Unless of course you don't want to face these issues & instead stay happy in your belief system. Teresa 1 Link to comment
AJ Soundfield Posted June 17, 2017 Share Posted June 17, 2017 Audiophile arguments from scientific ignorance against (audio) perceptual tests are always very amusing. Luckily, the world of science stopped listening to audiophiles over a hundred years ago, when they insisted horses could count, because their perceptions are physical reality. Now for example, all major orchestras bar audiophiles from making any audio decisions and have been blind audio testing all potential players. The result has been a remarkable turnaround in the diversity of orchestras, especially gender. When misogynist biases are no longer allowed to affect results, this is of course expected. Blind tests are not used to 'cure' the maladies of audiophiles, only remove those maladies from the results. If the result desired is to include maladies, then no blind tests are required. The only dichotomy in these "arguments" is never "objective" vs "subjective" (folks don't know the meaning of the word "subjective"), but knowledge vs ignorance. Thankfully... for orchestras at least, the latter no longer dictates. Link to comment
AJ Soundfield Posted June 17, 2017 Share Posted June 17, 2017 12 minutes ago, mmerrill99 said: It's a well known issue with blind testing & is recommended in ITU BS standards Right, those ITU blind standards exist because science has known the uselessness of uncontrolled 'horse counting' perception for over a hundred years, something audiophiles don't. Quote Recognizing the weaknesses/faultlines in any test is the objective approach, don't you think? Unless of course you don't want to face these issues & instead stay happy in your belief system. The irony after mentioning ITU blind tests.... Link to comment
Popular Post Jud Posted June 17, 2017 Popular Post Share Posted June 17, 2017 I would like to stay with the specific rather than the general here. There are many potential problems with blind tests, as with any kind of test, and various ways of dealing with these potential problems. We can have (and I sometimes have had) discussions about some of these problems. But the discussion up to now has been about one particular potential problem, preference for a second sample. Randomization has been suggested as one of several possible ways of dealing with the potential problem, as has throwing out the first two samples. One specific objection that has been raised is that this does not account for other potential problems and is therefore not a useful solution. It seems to me obvious that (1) if I fix my car it will not deal with any potential problems with the dishwasher; and (2) this isn’t a reason not to fix the car. If there are yet other potential problems with a given test, they of course should also be minimized insofar as possible. From my reading in the scientific literature so far, there are a few of these that I think might be difficult to resolve with sequential A/B/X testing. But I would suppose if good solutions exist, the way I might find them is by further reading, not performing “thought experiments” on the possibly faulty basis of my current knowledge. Teresa and jabbr 2 One never knows, do one? - Fats Waller The fairest thing we can experience is the mysterious. It is the fundamental emotion which stands at the cradle of true art and true science. - Einstein Computer, Audirvana -> optical Ethernet to Fitlet3 -> Fibbr Alpha Optical USB -> iFi NEO iDSD DAC -> Apollon Audio 1ET400A Mini (Purifi based) -> Vandersteen 3A Signature. Link to comment
AJ Soundfield Posted June 17, 2017 Share Posted June 17, 2017 6 minutes ago, Jud said: I would like to stay with the specific rather than the general here. There are many potential problems with blind tests, as with any kind of test, and various ways of dealing with these potential problems. We can (and I sometimes have) had discussions about some of these problems. But the discussion up to now has been about one particular potential problem, preference for a second sample. You've lost me here in this paragraph Jud. This "test" wasn't blind (nor was any DUT). That is the "particular problem". Nicely highlighted too I may add. Link to comment
jabbr Posted June 17, 2017 Share Posted June 17, 2017 34 minutes ago, Jud said: It seems to me obvious that (1) if I fix my car it will not deal with any potential problems with the dishwasher; and (2) this isn’t a reason not to fix the car. Indeed! There is hope for your car. You might need a new dishwasher though. 29 minutes ago, AJ Soundfield said: You've lost me here in this paragraph Jud. This "test" wasn't blind (nor was any DUT). That is the "particular problem". Nicely highlighted too I may add. That would be a different problem: brand-name bias, expectation bias, etc. Removal of all types of bias is important for an objective study. Expectation bias would be removed with a different technique than listening order bias. Jud 1 Custom room treatments for headphone users. Link to comment
Popular Post jabbr Posted June 17, 2017 Popular Post Share Posted June 17, 2017 10 hours ago, mmerrill99 said: Where he specifically stated that randomization in A/B testing would cancel any bias preferences & has argued with me all through this thread along the same lines. So study design involving multiple complicated variables is complicated, and back to my first post: real work. I never meant to suggest that randomization alone would solve every problem, but in trying to explain this we've become caught up in the discussion of randomization alone. Randomization is necessary, but not sufficient. There are Type I and Type II errors. Both situations need to be corrected for. Without randomization, bias can lead to a Type I error (improper rejection of a true null hypothesis). With randomization, internal invalidity can still lead to a Type II error (failure to reject a false null hypothesis). You've correctly pointed out the problem that occurs when a strong bias error masks differences in a weaker signal. Mathematically this is a type of quantization error. In many cases it can be handled by drastically increasing the number of samples (think sampling rate of DSD64 vs PCM44 -- @mansr: this is just a very vague analogy). But that isn't necessarily the best option. The "problem" here depends on the measurement units. "A is better" vs "B is better" is binary, and small statistical differences in A vs B are quantized away by a strong bias. A better technique (which I've already suggested but am explaining here) is to increase the resolution of the measurement. That's why we move to an analogue scale, perhaps a visual analogue scale (VAS) in which the track is rated from 1-10 without reference to other tracks. Given enough samples, even small differences are statistically resolved. Note that as the difference becomes smaller, the number of samples required for statistical significance greatly rises. 
Consider that both amplifiers A and B are "high end"; that means that, bias aside, the number of subjects needed to validly distinguish a difference in "quality" using a binary measure would be very large -- 10 subjects is probably way too small. Jud and pkane2001 2 Custom room treatments for headphone users. Link to comment
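The sample-size point above can be made concrete with the standard normal-approximation formula for a one-proportion test against the 50% null. This is a generic back-of-the-envelope sketch, not anything the posters computed; the preference rates are invented, and the z-values correspond to the conventional two-sided alpha = 0.05 and 80% power.

```python
from math import ceil, sqrt

def n_binary(p=0.55, alpha_z=1.96, power_z=0.84):
    """Approximate number of subjects needed for a binary
    'A better / B better' vote to detect a true preference rate p
    against the 50% null (normal approximation, two-sided
    alpha=0.05, 80% power)."""
    d = p - 0.5
    # sqrt(0.5 * 0.5) = 0.5 is the null-hypothesis standard deviation term
    return ceil(((alpha_z * 0.5 + power_z * sqrt(p * (1 - p))) / d) ** 2)

# Smaller real differences need much larger panels with a binary measure:
for p in (0.75, 0.6, 0.55):
    print(p, n_binary(p))
```

For a subtle 55/45 split the required panel runs into the hundreds, which is why the post argues that 10 subjects with a binary measure cannot resolve small differences, and why a higher-resolution rating scale helps.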
mmerrill99 Posted June 17, 2017 Share Posted June 17, 2017 5 hours ago, jabbr said: I never meant to suggest that randomization alone would solve every problem No, you suggested that randomization would solve the problem of listening order bias - that was very clear, & you clearly argued this to be the case until you started to realize you were wrong & moved the goalposts, as you are doing now. But it's good that you now recognize that randomization alone is not the solution & that a test redesign is necessary to address this bias. Please point out this realization to ralf11, mansr & pkane2001, who maintain that I am a fool with no knowledge of the subject - indeed even Jud seems to say the same. Teresa 1 Link to comment