
A/B testing favors B over A



So rather than admit that randomization does not eliminate listening order bias in A/B testing, we see here an attempt at obfuscation - changing the test away from the A/B testing that is being discussed.

 

Why am I not surprised? All these red-herring posts about randomization, & now we see that randomization is not the answer to eliminating listening order bias - it requires a totally different test design.


jabbr's "list" is about a differently designed experiment than the OP detailed, designed to eliminate the propensity for error owing to experimental design rather than true difference between variables studied.  Do you not see?

 

Bill Walker

 

PS  I have no dog in this race.  As a physician however, I submit that if randomized, double blind testing were not embraced in medicine we might still be studying voodoo and astrology as vectors in human disease.  

 

12 minutes ago, WMW said:

jabbr's "list" is about a differently designed experiment than the OP detailed, designed to eliminate the propensity for error owing to experimental design rather than true difference between variables studied.  Do you not see?

 

Bill Walker

 

PS  I have no dog in this race.  As a physician however, I submit that if randomized, double blind testing were not embraced in medicine we might still be studying voodoo and astrology as vectors in human disease.  

 

Yes, & the whole discussion was about A/B listening & listening order bias.

He made this statement in his post on the first page:

Quote

The purpose of randomization is to reduce/eliminate systemic errors assuming sufficient sample size. (The preference for first vs second would cancel as roughly equal numbers of Amp A and Amp B would be listened first vs second.) ... but you need to have enough different people listening

There he specifically stated that randomization in A/B testing would cancel any order-bias preference, & he has argued with me along the same lines all through this thread.

 

It's only now, 6 pages in, that we find him moving the goalposts - what a waste of everyone's time & energy!

58 minutes ago, Teresa said:

If the last played sample is always chosen as better, and there is close to the same numbers of both A and B being randomly selected as the last sample, then it hides any real audible differences.

 

I think there's a misunderstanding here. It's not the last sample played, it is the second sample played that seems to get the preference. I don't know how the last sample in a long random-length sequence could possibly be preferred, as long as the experimenter does not warn the test subjects before the last sample.

 

The Stereophile article that started this thread is explicit that this bias was found only after the first two samples:

 

Quote

As much as this says about the limits of an A/B comparisons based on listening to short passages of music without the opportunity to at least return to A after having heard A and B, it also produced some extremely revealing commentary.

 

By repeating the test more than two times and by randomizing A and B samples one reduces the effect of this identification bias. The more tests you run, the lower the effect of the bias will be on the overall result. You can also throw out the results of the first two samples to eliminate all effects of this 'second sample preference' bias.
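The "throw out the first two samples" idea can be illustrated with a toy simulation. Everything here is an assumption for illustration only: a hypothetical listener who blindly picks the second-played amp on the first two trials (the order bias) and otherwise reports a true 80% preference for B. Discarding the first two trials then recovers that true preference:

```python
import random

random.seed(1)

def run_session(n_trials=10, true_pref="B", bias_trials=2):
    """Hypothetical listener: on the first `bias_trials` trials they
    simply pick whichever amp was played second (order bias); on the
    remaining trials they report a true preference 80% of the time."""
    other = "A" if true_pref == "B" else "B"
    choices = []
    for t in range(n_trials):
        order = random.sample(["A", "B"], 2)  # randomized play order
        if t < bias_trials:
            choices.append(order[1])          # second-sample bias wins
        else:
            choices.append(true_pref if random.random() < 0.8 else other)
    return choices

sessions = [run_session() for _ in range(1000)]
kept = [c for s in sessions for c in s[2:]]   # discard the first two trials
print(round(kept.count("B") / len(kept), 2))  # close to the true 80% preference
```

Under this (assumed) model the biased trials never enter the tally, so the surviving trials reflect only the listener's real preference.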

On 6/15/2017 at 9:07 AM, jabbr said:

This is an example of study bias. ...

The study designer needs to correct. One way is to randomize -- another is to do multiple tests of each amp combo with order mixed. etc etc

 

This is from my first post. Let me clarify that I would not use one technique to the exclusion of the other, but rather both. So 3-6 listening episodes with A/B randomized at each episode. That means 1/8 would get A-A-A (1/2 x 1/2 x 1/2), 1/8 would get A-A-B, and so on to B-B-B. That's what the lists mean. This would be one example of a study design that didn't exhibit listening order bias. There are other ways to eliminate bias.
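The 1/8-per-pattern arithmetic can be checked by enumerating the order patterns directly - a minimal sketch, assuming three independently randomized episodes as in the example above:

```python
from itertools import product

# Three listening episodes, each independently randomizing the amp order,
# give 2**3 = 8 equally likely patterns: AAA, AAB, ..., BBB.
patterns = ["".join(p) for p in product("AB", repeat=3)]
for pat in patterns:
    print(pat)
print(1 / len(patterns))  # 0.125, i.e. the 1/8 per pattern
```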

 

It should be obvious that if we repeated the same study it might very well show the same listening order bias and that's why I explained that we would change the study design in order to eliminate the bias.

Custom room treatments for headphone users.

16 minutes ago, pkane2001 said:

 

I think there's a misunderstanding here. It's not the last sample played, it is the second sample played that seems to get the preference. I don't know how the last sample in a long random length sequence could be possibly preferred, as long as the experimenter does not warn the test subjects before the last sample.

 

The Stereophile article that started this thread is explicit that this bias was found only after the first two samples:

 

 

By repeating the test more than two times and by randomizing A and B samples one reduces the effect of this identification bias. The more tests you run, the lower the effect of the bias will be on the overall result. You can also throw out the results of the first two samples to eliminate all effects of this 'second sample preference' bias.

Again, you don't seem to understand - maybe willfully so. Is it so difficult? What Teresa means by 'last sample' is perfectly clear: the second-listened-to sample in every pair of samples.

 

All you are doing in the randomizing is spreading the bias between A & B so that it is no longer evident in the results, but it is not eliminated from the test. Again, let's say there is no randomization and B is always listened to 'last' in the pair - the results, when analyzed, will show a clear bias towards a preference for B.

 

When we randomise, if A is last heard then it will be preferred (irrespective of the real difference between them), and if B is last heard then it will be preferred (irrespective of the real difference between them) - we are assuming subtle differences. All this randomisation is doing is returning a null result, masking any chance of discriminating real differences & hiding the fact that there is real bias in operation - bias which would be evident in the results if no randomisation had been done.
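Taking that premise at face value - the second-played sample always wins, regardless of any real difference - a quick sketch shows both outcomes described above: a fixed order produces a spurious 100% "preference", while randomisation turns the same bias into a ~50/50 null result. (Toy model; the absolute second-sample bias is an assumption for illustration.)

```python
import random

random.seed(0)

def preference(order):
    # Premise under test: the second-played sample always wins,
    # regardless of any real difference between the amps.
    return order[1]

# Fixed order: B always second -> the bias shows up clearly in the results.
fixed = [preference(("A", "B")) for _ in range(1000)]

# Randomized order: the same bias is split evenly between A and B.
randomized = [preference(tuple(random.sample(["A", "B"], 2)))
              for _ in range(1000)]

print(fixed.count("B") / 1000)       # 1.0 -- looks like a strong preference for B
print(randomized.count("B") / 1000)  # ~0.5 -- the bias is hidden as a null result
```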

22 minutes ago, pkane2001 said:

I think there's a misunderstanding here. It's not the last sample played, it is the second sample played that seems to get the preference. I don't know how the last sample in a long random-length sequence could possibly be preferred, as long as the experimenter does not warn the test subjects before the last sample.

 

No misunderstanding: the second of two samples is the last sample. There is no long random-length sequence played, just either A-B or B-A. The last (second) sample played always sounded the best. If you tested 100 people and the first and last (second) samples were played randomly an equal number of times, you would nullify any sonic differences. I have brain problems so I have an excuse; why is this so hard for you to understand?

 

22 minutes ago, pkane2001 said:

The Stereophile article that started this thread is explicit that this bias was found only after the first two samples:

 

Quote

As much as this says about the limits of an A/B comparisons based on listening to short passages of music without the opportunity to at least return to A after having heard A and B, it also produced some extremely revealing commentary.

 

 

Correct, there were only two samples; sometimes A was first, sometimes B was first, however the first-played sample was never chosen as the best.

 

22 minutes ago, pkane2001 said:

By repeating the test more than two times and by randomizing A and B samples one reduces the effect of this identification bias. The more tests you run, the lower the effect of the bias will be on the overall result. You can also throw out the results of the first two samples to eliminate all effects of this 'second sample preference' bias.

 

There is no way to get more than 50% correct if the second (last) sample was always preferred to the first. Any sonic differences would be hidden in the "A/B testing favors B over A" scenario.

I have dementia. I save all my posts in a text file I call Forums.  I do a search in that file to find out what I said or did in the past.

 

I still love music.

 

Teresa

48 minutes ago, WMW said:

...PS  I have no dog in this race.  As a physician however, I submit that if randomized, double blind testing were not embraced in medicine we might still be studying voodoo and astrology as vectors in human disease.  

 

 

Bill, I agree with regard to medical double-blind studies. Double-blind tests work in drug testing as the human subjects don't have to make any decisions whatsoever. The subjects are given either the real medicine or a sugar pill. Those who get well taking the sugar pill do so because unconsciously they believe the medicine might be real, and thus their immune system manages to fight off the disease; this is known as the placebo effect. If considerably more people get well with the new drug than with the sugar pill, the drug is considered effective. None of our senses come into play in this type of test.

 

Audio is different as it requires human beings to consciously make choices, and we are not too good at that.

1 minute ago, Teresa said:

 

No misunderstanding: the second of two samples is the last sample. There is no random-length sequence played. The last (second) sample played always sounded the best. If you tested 100 people and the first and last (second) samples were played randomly an equal number of times, you would nullify any sonic differences. I have brain problems so I have an excuse; why is this so hard for you to understand?

 

 

Correct, there were only two samples; sometimes A was first, sometimes B was first, however the first-played sample was never chosen as the best.

 

 

There is no way to get more than 50% correct if the second (last) sample was always preferred to the first. Any sonic differences would be hidden in the "A/B testing favors B over A" scenario.

 

A blind A/B test consists of two samples, A and B, tested in a random sequence of variable length. Saying that the A/B test must consist of only two trials unnecessarily cripples it, reducing its statistical validity.

 

So, take my suggestion and run the A/B test 10 times with each test subject and throw out the first two results from each sequence of 10. Where is this 'second sample' bias in such a blind test?
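The statistical-validity point can be made concrete with a binomial calculation. (This frames the test as correct/incorrect discrimination rather than preference - an assumption for illustration.) Under pure guessing, the chance of scoring k or more correct out of n trials shrinks rapidly as n grows, which is why 8 scored trials (10 minus the two discarded) are far more informative than a single pair:

```python
from math import comb

def p_at_least(k, n, p=0.5):
    """Probability of k or more correct out of n trials under pure guessing."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# A single forced-choice pair tells you almost nothing...
print(p_at_least(1, 1))            # 0.5
# ...while 7 of 8 scored trials is unlikely to happen by luck alone.
print(round(p_at_least(7, 8), 3))  # 0.035
```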

4 minutes ago, pkane2001 said:

So, take my suggestion and run the A/B test 10 times with each test subject and throw out the first two results from each sequence of 10. Where is this 'second sample' bias in such a blind test?

This is a good way to remove "training bias"

6 minutes ago, pkane2001 said:

 

A blind A/B test consists of two samples, A and B, tested in a random sequence of variable length. Saying that the A/B test must consist of only two trials unnecessarily cripples it, reducing its statistical validity.

 

So, take my suggestion and run the A/B test 10 times with each test subject and throw out the first two results from each sequence of 10. Where is this 'second sample' bias in such a blind test?

Oh dear - massive fail in understanding

5 hours ago, pkane2001 said:

...So, take my suggestion and run the A/B test 10 times with each test subject and throw out the first two results from each sequence of 10. Where is this 'second sample' bias in such a blind test?

 

That wouldn't work either due to how the human brain works. Repeated A/B'ing runs into trouble with both cognitive bias and listener fatigue, meaning you would have to also throw out samples 3-10 in your proposed test. 

 

With cognitive bias your brain will either fill in missing information or remove it thus making each sample sound the same on repeated switching back and forth. And listener fatigue will guarantee that after just a few switches back and forth both A and B music samples will sound like crap.

 

There is no quick shortcut, long-term listening to music one loves is the only way to discover what one likes.

2 hours ago, Teresa said:

 

That wouldn't work either due to how the human brain works. Repeated A/B'ing runs into trouble with both cognitive bias and listener fatigue, meaning you would have to also throw out samples 3-10 in your proposed test. 

 

With cognitive bias your brain will either fill in missing information or remove it thus making each sample sound the same on repeated switching back and forth. And listener fatigue will guarantee that after just a few switches back and forth both A and B music samples will sound like crap.

 

There is no quick shortcut, long-term listening to music one loves is the only way to discover what one likes.

 

You are really determined to make a blind A/B test into a bad thing. Cognitive bias is exactly what you get in a long term listening test. In such a test you are not evaluating the equipment, you are evaluating the ability of your brain to adjust to your equipment. If you conduct it sighted, then you are also adding in the much more powerful selection bias into your test.

 

Listener fatigue is something that might occur after a prolonged exposure to some sound. Show me a study that demonstrates listener fatigue after a total of a few minutes of repeated listening to the same track in an A/B test.

18 minutes ago, pkane2001 said:

Show me a study that demonstrates listener fatigue

It's a well-known issue with blind testing, & the ITU BS standards recommend catering for it, among many other impediments in testing.

 

Recognizing the weaknesses/faultlines in any test is the objective approach, don't you think? Unless of course you don't want to face these issues & instead stay happy in your belief system.


Audiophile arguments from scientific ignorance against (audio) perceptual tests are always very amusing.

Luckily, the world of science stopped listening to audiophiles over a hundred years ago, when they insisted horses could count, because their perceptions are physical reality :).

Now for example, all major orchestras bar audiophiles from making any audio decisions and have been blind audio testing all potential players. The result has been a remarkable turnaround in the diversity of orchestras, especially gender.

When misogynist biases are no longer allowed to affect results, this is of course expected. Blind tests are not used to 'cure' the maladies of audiophiles, only remove those maladies from the results. If the result desired is to include maladies, then no blind tests are required.

The only dichotomy in these "arguments" is never "objective" vs "subjective" (folks don't know the meaning of the word "subjective"), but knowledge vs ignorance.

Thankfully...for orchestras at least, the latter no longer dictates.

12 minutes ago, mmerrill99 said:

It's a well known issue with blind testing & is recommended in ITU BS standards

Right, those ITU blind standards exist because science has known the uselessness of uncontrolled 'horse counting' perception for over a hundred years - something audiophiles don't.

 

Quote

Recognizing the weaknesses/faultlines in any test is the objective approach, don't you think? Unless of course you don't want to face these issues & instead stay happy in your belief system.

The irony, after mentioning ITU blind tests...

6 minutes ago, Jud said:

I would like to stay with the specific rather than the general here.  There are many potential problems with blind tests, as with any kind of test, and various ways of dealing with these potential problems.  We can (and I sometimes have) had discussions about some of these problems.  But the discussion up to now has been about one particular potential problem, preference for a second sample.

You've lost me here in this paragraph, Jud. This "test" wasn't blind (nor was any DUT).

That is the "particular problem". Nicely highlighted too I may add.

34 minutes ago, Jud said:

It seems to me obvious that (1) if I fix my car it will not deal with any potential problems with the dishwasher; and (2) this isn’t a reason not to fix the car.

 

 

Indeed! There is hope for your car. You might need a new dishwasher though ;) 

29 minutes ago, AJ Soundfield said:

You've lost me here in this paragraph Jud. This "test" wasn't blind (nor was any DUT).

That is the "particular problem". Nicely highlighted too I may add.

That would be a different problem ;) Brand name bias, expectation bias etc.

 

Removal of all types of bias is important for an objective study. Expectation bias would be removed with a different technique than listening order bias. 

5 hours ago, jabbr said:

I never meant to suggest that randomization alone would solve every problem

No, you suggested that randomization would solve the problem of listening order bias - that was very clear, & you clearly argued this to be the case until you started to realize you were wrong & moved the goalposts, as you are doing now.

 

But it's good that you now recognize that randomization is not the solution & a test redesign is necessary to address this bias.

 

Please point out this realization to ralf11, mansr & pkane2001 who maintain that I am a fool with no knowledge of the subject - indeed even Jud seems to say the same.
