mmerrill99 Posted June 17, 2017 Share Posted June 17, 2017 So rather than admit that randomization does not eliminate listening order bias in A/B testing, we see here an attempt at obfuscation: changing the test from the A/B testing that is being discussed. Why am I not surprised? All these red herring posts about randomization, & now we see that randomization is not the answer to eliminating listening order bias; it requires a totally different test design. Link to comment
Daudio Posted June 17, 2017 Share Posted June 17, 2017 51 minutes ago, Teresa said: Let's assume that Amp A sounds better than Amp B when... Teresa, I thought you said you had a memory problem ??? Good work girl Teresa 1 Link to comment
WMW Posted June 17, 2017 Share Posted June 17, 2017 jabbr's "list" is about a differently designed experiment than the OP detailed, designed to eliminate the propensity for error owing to experimental design rather than true difference between variables studied. Do you not see? Bill Walker PS I have no dog in this race. As a physician however, I submit that if randomized, double blind testing were not embraced in medicine we might still be studying voodoo and astrology as vectors in human disease. Link to comment
WMW Posted June 17, 2017 Share Posted June 17, 2017 And you, ralf11 - I VERY much enjoy your posts and your humor. Thanks most kindly! Bill Walker Link to comment
mmerrill99 Posted June 17, 2017 Share Posted June 17, 2017 12 minutes ago, WMW said: jabbr's "list" is about a differently designed experiment than the OP detailed, designed to eliminate the propensity for error owing to experimental design rather than true difference between variables studied. Do you not see? Bill Walker PS I have no dog in this race. As a physician however, I submit that if randomized, double blind testing were not embraced in medicine we might still be studying voodoo and astrology as vectors in human disease. Yes, & the whole discussion was about A/B listening & listening order bias. He made this statement in his post on the first page: Quote The purpose of randomization is to reduce/eliminate systemic errors assuming sufficient sample size. (The preference for first vs second would cancel as roughly equal numbers of Amp A and Amp B would be listened first vs second.) ... but you need to have enough different people listening There he specifically stated that randomization in A/B testing would cancel any bias preferences, & he has argued with me all through this thread along the same lines. It's only now, 6 pages in, that we find he has moved the goalposts. What a waste of everyone's time & energy!! Link to comment
pkane2001 Posted June 17, 2017 Share Posted June 17, 2017 58 minutes ago, Teresa said: If the last played sample is always chosen as better, and there is close to the same numbers of both A and B being randomly selected as the last sample, then it hides any real audible differences. I think there's a misunderstanding here. It's not the last sample played, it is the second sample played that seems to get the preference. I don't know how the last sample in a long random length sequence could be possibly preferred, as long as the experimenter does not warn the test subjects before the last sample. The Stereophile article that started this thread is explicit that this bias was found only after the first two samples: Quote As much as this says about the limits of an A/B comparisons based on listening to short passages of music without the opportunity to at least return to A after having heard A and B, it also produced some extremely revealing commentary. By repeating the test more than two times and by randomizing A and B samples one reduces the effect of this identification bias. The more tests you run, the lower the effect of the bias will be on the overall result. You can also throw out the results of the first two samples to eliminate all effects of this 'second sample preference' bias. jabbr 1 -Paul DeltaWave, DISTORT, Earful, PKHarmonic, new: Multitone Analyzer Link to comment
jabbr Posted June 17, 2017 Share Posted June 17, 2017 On 6/15/2017 at 9:07 AM, jabbr said: This is an example of study bias. ... The study designer needs to correct. One way is to randomize -- another is to do multiple tests of each amp combo with order mixed. etc etc This is from my first post. Let me clarify that I would not use one technique to the exclusion of the other, but rather both together. So: 3-6 listening episodes, with A/B randomized at each episode. With 3 episodes, that means 1/8 would get A-A-A (1/2 x 1/2 x 1/2), 1/8 would get A-A-B, and so on to B-B-B. That's what the lists mean. This would be one example of a study design that didn't exhibit listening order bias. There are other ways to eliminate bias. It should be obvious that if we repeated the same study it might very well show the same listening order bias, and that's why I explained that we would change the study design in order to eliminate the bias. Custom room treatments for headphone users. Link to comment
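The randomized-episode design described above can be sketched in a few lines of Python. This is a minimal simulation, not anything from the thread: it just confirms that with 3 independently randomized episodes, each of the 8 orderings (A-A-A through B-B-B) turns up with probability 1/2 x 1/2 x 1/2 = 1/8.

```python
import random
from collections import Counter

def assign_sequence(episodes=3):
    """Randomly assign amp A or B to each listening episode."""
    return tuple(random.choice("AB") for _ in range(episodes))

# Simulate many subjects: each of the 8 orderings (A-A-A ... B-B-B)
# should occur with frequency close to 1/8.
random.seed(1)
counts = Counter(assign_sequence() for _ in range(80000))
for seq, n in sorted(counts.items()):
    print("-".join(seq), round(n / 80000, 3))
```

Running it shows all eight sequences near 0.125, which is the "list" the post refers to.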
mmerrill99 Posted June 17, 2017 Share Posted June 17, 2017 16 minutes ago, pkane2001 said: I think there's a misunderstanding here. It's not the last sample played, it is the second sample played that seems to get the preference. I don't know how the last sample in a long random length sequence could be possibly preferred, as long as the experimenter does not warn the test subjects before the last sample. The Stereophile article that started this thread is explicit that this bias was found only after the first two samples: By repeating the test more than two times and by randomizing A and B samples one reduces the effect of this identification bias. The more tests you run, the lower the effect of the bias will be on the overall result. You can also throw out the results of the first two samples to eliminate all effects of this 'second sample preference' bias. Again, what you don't seem to understand - maybe willfully so - is that what Teresa means by last sample is perfectly clear: the second listened-to sample in every pair of samples. All the randomization is doing is spreading the bias between A & B so that it is no longer evident in the results, but it is not eliminated from the test. Again, let's say there is no randomization, and B is always listened to 'last' in the pair - the results, when analyzed, will show a clear bias towards a preference for B. When we randomize, if A is heard last then it will be preferred (irrespective of the real difference between them); if B is heard last then it will be preferred (irrespective of the real difference between them) - we are assuming subtle differences. All this randomization is doing is returning a null result, masking any chance of discriminating real differences & hiding the fact that there is real bias in operation, which would have been evident in the results if no randomization had been done. Link to comment
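The masking effect argued for here is easy to check with a toy Monte Carlo. All the numbers below (a hypothetical 70/30 true preference for B, and an order bias that completely dominates the pick) are invented for illustration; the point is only that when "second sample wins" decides every trial, randomizing the order returns a 50/50 split no matter how large the real difference is.

```python
import random

def ab_trial(true_pref_b=0.7, order_bias=1.0):
    """One forced-choice A/B trial with the two samples in random order.

    With probability `order_bias` the listener simply picks whichever
    sample came second; otherwise the pick reflects genuine preference
    (B preferred with probability `true_pref_b`). All figures here are
    made up for illustration."""
    order = random.sample(["A", "B"], 2)
    if random.random() < order_bias:
        return order[1]                      # second sample wins
    return "B" if random.random() < true_pref_b else "A"

random.seed(2)
n = 50000
biased = sum(ab_trial(order_bias=1.0) == "B" for _ in range(n)) / n
honest = sum(ab_trial(order_bias=0.0) == "B" for _ in range(n)) / n
print(biased)   # hovers near 0.5: the real 70/30 preference is masked
print(honest)   # hovers near 0.7 when the order bias is absent
```

So under this (extreme) assumption the randomized result is a null, exactly as the post argues; the counter-argument later in the thread is that the bias is not total and fades after the first trials.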
Teresa Posted June 17, 2017 Share Posted June 17, 2017 22 minutes ago, pkane2001 said: I think there's a misunderstanding here. It's not the last sample played, it is the second sample played that seems to get the preference. I don't know how the last sample in a long random length sequence could be possibly preferred, as long as the experimenter does not warn the test subjects before the last sample. No misunderstanding: the second of two samples is the last sample. There is no long random length sequence played, just either A-B or B-A. The last (second) sample played always sounded the best. If you tested 100 people and the first and last (second) samples were each played randomly an equal number of times, you would nullify any sonic differences. I have brain problems so I have an excuse; why is this so hard for you to understand? 22 minutes ago, pkane2001 said: The Stereophile article that started this thread is explicit that this bias was found only after the first two samples: Quote As much as this says about the limits of an A/B comparisons based on listening to short passages of music without the opportunity to at least return to A after having heard A and B, it also produced some extremely revealing commentary. Correct, there were only two samples; sometimes A was first, sometimes B was first, however the first played sample was never chosen as the best. 22 minutes ago, pkane2001 said: By repeating the test more than two times and by randomizing A and B samples one reduces the effect of this identification bias. The more tests you run, the lower the effect of the bias will be on the overall result. You can also throw out the results of the first two samples to eliminate all effects of this 'second sample preference' bias. There is no way to get more than 50% correct if the second (last) sample is always preferred to the first. Any sonic differences would be hidden in the "A/B testing favors B over A" scenario. I have dementia. 
I save all my posts in a text file I call Forums. I do a search in that file to find out what I said or did in the past. I still love music. Teresa Link to comment
Ralf11 Posted June 17, 2017 Share Posted June 17, 2017 46 minutes ago, WMW said: And you, ralf11 - I VERY much enjoy your posts and your humor. Thanks most kindly! Bill Walker Thx! Your comment re testing is spot on. Cue the anti-vaxers... Link to comment
Teresa Posted June 17, 2017 Share Posted June 17, 2017 48 minutes ago, WMW said: ...PS I have no dog in this race. As a physician however, I submit that if randomized, double blind testing were not embraced in medicine we might still be studying voodoo and astrology as vectors in human disease. Bill, I agree with regards to medical double-blind studies. Double-blind tests work in drug testing because the human subjects don't have to make any decisions whatsoever. The subjects are given either the real medicine or a sugar pill. Those who get well taking the sugar pill do so because they unconsciously believe the medicine might be real, and thus their immune system manages to fight off the disease; this is known as the placebo effect. If considerably more people get well with the new drug than with the sugar pill, the drug is considered effective. None of our senses come into play in this type of test. Audio is different, as it requires human beings to consciously make choices, and we are not too good at that. Link to comment
pkane2001 Posted June 17, 2017 Share Posted June 17, 2017 1 minute ago, Teresa said: No misunderstanding the second of two samples is the last sample. There is no random length sequence played. The last (second) sample played aways sounded the best. If you tested a 100 people and the first and last (second) sample was played was randomly equally you would nullified any sonic differences. I have brain problems so I have an excuse, why is this so hard for you to understand? Correct, there was only two samples, sometimes A was first, sometimes B was first, however the first played sample was never chosen as the best. There is no way to get more than 50% correct if the second (last) sample was always preferred to the first. Any sonic differences would be hidden in the "A/B testing favors B over A" scenario. A blind A/B test consists of two samples, A and B, tested in a random sequence of a variable length. Saying that the A/B test must only consist of two tests unnecessarily cripples it, reducing its statistical validity. So, take my suggestion and run the A/B test 10 times with each test subject and throw out the first two results from each sequence of 10. Where is this 'second sample' bias in such a blind test? -Paul DeltaWave, DISTORT, Earful, PKHarmonic, new: Multitone Analyzer Link to comment
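The "run 10 trials, discard the first two" proposal can also be sketched as a simulation. The hedged assumption here, following the Stereophile observation quoted earlier in the thread, is that the "second sample wins" effect shows up only in the first couple of trials, after which picks reflect real preference; the 70/30 preference figure is invented for illustration. Under that assumption, discarding the first two picks recovers the true preference rate.

```python
import random

def session(trials=10, discard=2, true_pref_b=0.7):
    """One subject's blind A/B session of `trials` forced choices.

    Assumed model (illustration only): the 'second sample wins'
    order bias decides the first two trials; later picks reflect
    the subject's genuine preference."""
    picks = []
    for t in range(trials):
        order = random.sample(["A", "B"], 2)
        if t < 2:
            picks.append(order[1])           # early trials: order bias
        else:
            picks.append("B" if random.random() < true_pref_b else "A")
    return picks[discard:]                   # throw out the first two picks

random.seed(3)
kept = [p for _ in range(5000) for p in session()]
b_share = sum(p == "B" for p in kept) / len(kept)
print(round(b_share, 2))
```

Whether the order bias really is confined to the early trials is exactly what the two sides of this thread disagree about; if it persisted through all 10 trials, the masking described in the previous posts would remain.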
jabbr Posted June 17, 2017 Share Posted June 17, 2017 4 minutes ago, pkane2001 said: So, take my suggestion and run the A/B test 10 times with each test subject and throw out the first two results from each sequence of 10. Where is this 'second sample' bias in such a blind test? This is a good way to remove "training bias" Custom room treatments for headphone users. Link to comment
mmerrill99 Posted June 17, 2017 Share Posted June 17, 2017 6 minutes ago, pkane2001 said: A blind A/B test consists of two samples, A and B, tested in a random sequence of a variable length. Saying that the A/B test must only consist of two tests unnecessarily cripples it, reducing its statistical validity. So, take my suggestion and run the A/B test 10 times with each test subject and throw out the first two results from each sequence of 10. Where is this 'second sample' bias in such a blind test? Oh dear - massive fail in understanding Teresa 1 Link to comment
pkane2001 Posted June 17, 2017 Share Posted June 17, 2017 2 minutes ago, mmerrill99 said: Oh dear - massive fail in understanding I'm glad you've finally admitted it -Paul DeltaWave, DISTORT, Earful, PKHarmonic, new: Multitone Analyzer Link to comment
Teresa Posted June 17, 2017 Share Posted June 17, 2017 5 hours ago, pkane2001 said: ...So, take my suggestion and run the A/B test 10 times with each test subject and throw out the first two results from each sequence of 10. Where is this 'second sample' bias in such a blind test? That wouldn't work either, due to how the human brain works. Repeated A/B'ing runs into trouble with both cognitive bias and listener fatigue, meaning you would have to also throw out samples 3-10 in your proposed test. With cognitive bias, your brain will either fill in missing information or remove it, thus making each sample sound the same on repeated switching back and forth. And listener fatigue will guarantee that after just a few switches back and forth both A and B music samples will sound like crap. There is no quick shortcut; long-term listening to music one loves is the only way to discover what one likes. Link to comment
pkane2001 Posted June 17, 2017 Share Posted June 17, 2017 2 hours ago, Teresa said: That wouldn't work either due to how the human brain works. Repeated A/B'ing runs into trouble with both cognitive bias and listener fatigue, meaning you would have to also throw out samples 3-10 in your proposed test. With cognitive bias your brain will either fill in missing information or remove it thus making each sample sound the same on repeated switching back and forth. And listener fatigue will guarantee that after just a few switches back and forth both A and B music samples will sound like crap. There is no quick shortcut, long-term listening to music one loves is the only way to discover what one likes. You are really determined to make a blind A/B test into a bad thing. Cognitive bias is exactly what you get in a long term listening test. In such a test you are not evaluating the equipment, you are evaluating the ability of your brain to adjust to your equipment. If you conduct it sighted, then you are also adding in the much more powerful selection bias into your test. Listener fatigue is something that might occur after a prolonged exposure to some sound. Show me a study that demonstrates listener fatigue after a total of a few minutes of repeated listening to the same track in an A/B test. -Paul DeltaWave, DISTORT, Earful, PKHarmonic, new: Multitone Analyzer Link to comment
mmerrill99 Posted June 17, 2017 Share Posted June 17, 2017 18 minutes ago, pkane2001 said: Show me a study that demonstrates listener fatigue It's a well-known issue with blind testing, & catering for it (among many other impediments in testing) is recommended in the ITU BS standards. Recognizing the weaknesses/faultlines in any test is the objective approach, don't you think? Unless of course you don't want to face these issues & instead stay happy in your belief system. Teresa 1 Link to comment
AJ Soundfield Posted June 17, 2017 Share Posted June 17, 2017 Audiophile arguments from scientific ignorance against (audio) perceptual tests are always very amusing. Luckily, the world of science stopped listening to audiophiles over a hundred years ago, when they insisted horses could count, because their perceptions are physical reality. Now for example, all major orchestras bar audiophiles from making any audio decisions and have been blind audio testing all potential players. The result has been a remarkable turnaround in the diversity of orchestras, especially gender. When misogynist biases are no longer allowed to affect results, this is of course expected. Blind tests are not used to 'cure' the maladies of audiophiles, only remove those maladies from the results. If the result desired is to include maladies, then no blind tests are required. The only dichotomy in these "arguments" is never "objective" vs "subjective" (folks don't know the meaning of the word "subjective"), but knowledge vs ignorance. Thankfully... for orchestras at least, the latter no longer dictates. Link to comment
AJ Soundfield Posted June 17, 2017 Share Posted June 17, 2017 12 minutes ago, mmerrill99 said: It's a well known issue with blind testing & is recommended in ITU BS standards Right, those ITU blind standards exist because science has known the uselessness of uncontrolled 'horse counting' perception for over a hundred years, something audiophiles don't. Quote Recognizing the weaknesses/faultlines in any test is the objective approach, don't you think? Unless of course you don't want to face these issues & instead stay happy in your belief system. The irony after mentioning ITU blind tests.... Link to comment
Popular Post Jud Posted June 17, 2017 Popular Post Share Posted June 17, 2017 I would like to stay with the specific rather than the general here. There are many potential problems with blind tests, as with any kind of test, and various ways of dealing with these potential problems. We can have (and I sometimes have had) discussions about some of these problems. But the discussion up to now has been about one particular potential problem, preference for a second sample. Randomization has been suggested as one of several possible ways of dealing with the potential problem, as has throwing out the first two samples. One specific objection that has been raised is that this does not account for other potential problems and is therefore not a useful solution. It seems to me obvious that (1) if I fix my car it will not deal with any potential problems with the dishwasher; and (2) this isn’t a reason not to fix the car. If there are yet other potential problems with a given test, they of course should also be minimized insofar as possible. From my reading in the scientific literature so far, there are a few of these that I think might be difficult to resolve with sequential A/B/X testing. But I would suppose if good solutions exist, the way I might find them is by further reading, not performing “thought experiments” on the possibly faulty basis of my current knowledge. Teresa and jabbr 2 One never knows, do one? - Fats Waller The fairest thing we can experience is the mysterious. It is the fundamental emotion which stands at the cradle of true art and true science. - Einstein Computer, Audirvana -> optical Ethernet to Fitlet3 -> Fibbr Alpha Optical USB -> iFi NEO iDSD DAC -> Apollon Audio 1ET400A Mini (Purifi based) -> Vandersteen 3A Signature. Link to comment
AJ Soundfield Posted June 17, 2017 Share Posted June 17, 2017 6 minutes ago, Jud said: I would like to stay with the specific rather than the general here. There are many potential problems with blind tests, as with any kind of test, and various ways of dealing with these potential problems. We can (and I sometimes have) had discussions about some of these problems. But the discussion up to now has been about one particular potential problem, preference for a second sample. You've lost me here in this paragraph Jud. This "test" wasn't blind (nor was any DUT). That is the "particular problem". Nicely highlighted too I may add. Link to comment
jabbr Posted June 17, 2017 Share Posted June 17, 2017 34 minutes ago, Jud said: It seems to me obvious that (1) if I fix my car it will not deal with any potential problems with the dishwasher; and (2) this isn’t a reason not to fix the car. Indeed! There is hope for your car. You might need a new dishwasher though. 29 minutes ago, AJ Soundfield said: You've lost me here in this paragraph Jud. This "test" wasn't blind (nor was any DUT). That is the "particular problem". Nicely highlighted too I may add. That would be a different problem: brand-name bias, expectation bias, etc. Removal of all types of bias is important for an objective study. Expectation bias would be removed with a different technique than listening order bias. Jud 1 Custom room treatments for headphone users. Link to comment
Popular Post jabbr Posted June 17, 2017 Popular Post Share Posted June 17, 2017 10 hours ago, mmerrill99 said: Where he specifically stated that randomization in A/B testing would cancel any bias preferences & has argued with me all through this thread along the same lines. So study design involving multiple complicated variables is complicated, and back to my first post: real work. I never meant to suggest that randomization alone would solve every problem, but in trying to explain this we've become caught up in the discussion of randomization alone. Randomization is necessary, but not sufficient. There are Type I and Type II errors. Both situations need to be corrected for. Without randomization, bias can lead to a Type I error (improper rejection of a true null hypothesis). With randomization, internal invalidity can still lead to a Type II error (failure to reject a false null hypothesis). You've correctly pointed out the problem that occurs when a strong bias error masks differences in a weaker signal. Mathematically this is a type of quantization error. In many cases it can be handled by drastically increasing the number of samples (think sampling rate of DSD64 vs PCM44 -- @mansr: this is just a very vague analogy). But that isn't necessarily the best option. The "problem" here depends on the measurement units. "A is better" vs "B is better" is binary, and small statistical differences in A vs B are quantized away by a strong bias. A better technique (which I've already suggested but am explaining here) is to increase the resolution of the measurement. That's why we move to an analogue scale, perhaps a visual analogue scale (VAS) in which the track is rated from 1-10 without reference to other tracks. Given enough samples, even small differences are statistically resolved. Note that as the difference becomes smaller, the number of samples required for statistical significance greatly rises. 
Consider that both amplifiers A and B are "high end"; that means that, bias aside, the number of subjects needed to validly distinguish a difference in "quality" using a binary measure would be very large -- 10 subjects is probably way too small. Jud and pkane2001 2 Custom room treatments for headphone users. Link to comment
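The sample-size point above can be made concrete with the standard normal-approximation formula for a one-proportion test against the 50% null. This is a generic back-of-the-envelope sketch, not anything the posters computed; the preference rates are invented, and the z-values correspond to the conventional two-sided alpha = 0.05 and 80% power.

```python
from math import ceil, sqrt

def n_binary(p=0.55, alpha_z=1.96, power_z=0.84):
    """Approximate number of subjects needed for a binary
    'A better / B better' vote to detect a true preference rate p
    against the 50% null (normal approximation, two-sided
    alpha=0.05, 80% power)."""
    d = p - 0.5
    # sqrt(0.5 * 0.5) = 0.5 is the null-hypothesis standard deviation term
    return ceil(((alpha_z * 0.5 + power_z * sqrt(p * (1 - p))) / d) ** 2)

# Smaller real differences need much larger panels with a binary measure:
for p in (0.75, 0.6, 0.55):
    print(p, n_binary(p))
```

For a subtle 55/45 split the required panel runs into the hundreds, which is why the post argues that 10 subjects with a binary measure cannot resolve small differences, and why a higher-resolution rating scale helps.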
mmerrill99 Posted June 17, 2017 Share Posted June 17, 2017 5 hours ago, jabbr said: I never meant to suggest that randomization alone would solve every problem No, you suggested that randomization would solve the problem of listening order bias - that was very clear, & you clearly argued this to be the case until you started to realize you were wrong & moved the goalposts, as you are doing now. But it's good that you now recognize that randomization alone is not the solution & that a test redesign is necessary to address this bias. Please point out this realization to ralf11, mansr & pkane2001, who maintain that I am a fool with no knowledge of the subject - indeed even Jud seems to say the same. Teresa 1 Link to comment