27 votes

Extraverted introverts, cautious risk-takers, and selfless narcissists: A demonstration of why you can’t trust data collected on MTurk

11 comments

  1. [3]
    unkz
    Link

    This would be obvious to anyone who spent any time on the worker side of turk. I once did, as part of my process for building apps on the turk platform, and the culture of turk workers is pretty actively hostile to the "employers". If you get into the private discords and forums, you'll find a number of people building and distributing browser plugins to automate and sometimes actively subvert the tasks that are supposed to be done by humans. Most discussion is around the topic of finding tasks that pay well and don't block users based on low quality work. There's also quite a lot of discussion about how to detect "gold standard" questions that are designed to detect workers that aren't paying attention, so that they can be answered correctly while not paying any attention at all to the real questions.

    20 votes
    1. [2]
      bitshift
      Link Parent

      That's wild that there's active subversion going on, as in people putting in significant effort to cheat the system, and to cheat more effectively. I shouldn't be surprised, but I had assumed cheating was more low effort, such as mashing keyboards.

      You said you built apps on the platform? Were you (or your clients) in the position of "employer", then? If so, I'm curious what your approach was for combatting cheating.

      3 votes
      1. unkz
        Link Parent

        The basic plan is:

        • Set up a bunch of “gold standard” questions that have known good answers, and mix them in with the regular questions.
        • Always submit questions to multiple workers, and use the quality score of the workers to determine how many agreeing results are necessary for a result to be considered good.
        • Use the results of the gold standard questions, as well as the inter-rater agreement ratio with the other workers, to both rate the quality of a worker, and scale the percentage of gold questions, but never eliminate them entirely as workers have a tendency to degrade later on.
        • Have a system for automatically feeding clearly scored questions into the gold standard pool so it doesn’t go stale and people can’t easily game that element.
        • Make sure workers get feedback about their work quality so they can adjust rather than getting blocked, which is really quite bad for their account.

        There’s a lot that goes into making a good on-boarding experience that helps with this as well. A clear paid training system that is transparent about whether they are performing well and makes it clear that there are fair but effective ongoing quality checks means fewer workers go to the effort of trying to game it.

        There will always be a number of people who will go through the training just to get the risk-free money and then stop before doing any actual work, or in some cases go ahead and just submit garbage work until they get blocked. But over time, if your app is priced fairly and isn’t too mind-numbingly boring, you should be able to collect enough regular workers that you can close it to new applicants. This can be an arduous task though, and it needs a pretty long-term, large-scale project to make it worthwhile rather than just hiring a regular employee pool.
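
        Roughly, the scoring logic might look something like this sketch (simplified, with made-up weights and thresholds rather than anything production-ready):

        ```python
        # Illustrative sketch of worker quality scoring from gold-standard
        # questions plus inter-rater agreement. Weights and thresholds are made up.

        def worker_quality(gold_results, peer_agreement):
            """gold_results: list of bools (worker matched the known-good answer).
            peer_agreement: fraction of regular questions where the worker agreed
            with the majority of other workers, between 0.0 and 1.0."""
            if not gold_results:
                return 0.0
            gold_accuracy = sum(gold_results) / len(gold_results)
            # Weight gold accuracy more heavily than peer agreement.
            return 0.7 * gold_accuracy + 0.3 * peer_agreement

        def gold_ratio(quality):
            """Scale the share of gold questions with quality, but never drop it
            to zero, since worker quality tends to degrade later on."""
            return max(0.05, 0.30 * (1.0 - quality))

        def agreeing_answers_needed(quality):
            """Higher-quality workers need fewer agreeing submissions per question."""
            return 2 if quality >= 0.9 else 3 if quality >= 0.7 else 5
        ```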

        6 votes
  2. [5]
    palimpsest
    Link

    Absolutely not surprised. I don't trust studies using MTurk - I think it's lazy research.

    However, I do want to say that 'I like order' and 'I crave chaos' can coexist. I personally like order, but I also crave chaos. :D

    19 votes
    1. em-dash
      Link Parent

      However, I do want to say that 'I like order' and 'I crave chaos' can coexist. I personally like order, but I also crave chaos.

      I wouldn't be surprised if this was true for most people, in different areas of their lives. I like order in my work and varying amounts of chaos elsewhere.

      4 votes
    2. [3]
      AnEarlyMartyr
      Link Parent

      I mean, I think people are often more scenario-specific than anything else. For me personally, in some scenarios I rarely talk, and in others I talk a lot. It wouldn't be much of a stretch to call me an extroverted introvert or a cautious risk taker. I sometimes make big, life-changing bets on relatively quick timelines, but I generally spend time thinking them through and am careful about where exactly I take big risks. All just to say that while I'm sure MTurk has plenty of mediocre data, I almost find this paper to say more about the shortfalls of trying to neatly categorize people into personality types and behaviors than it does about anything else. People tend to be big contradictory messes that don't really fit into simple descriptors.

      2 votes
      1. Minori
        Link Parent

        I'm not sure you read the paper if you thought it was about categorizing personality types. The fact of the matter is, most people would not say both "I talk a lot" and "I rarely talk". Those are extremely contradictory statements. This isn't some complex psychoanalysis. It's a simple "which one of these better describes you?"

        In psychometrics, researchers try really hard to measure and account for internal consistency. In simple terms: did the respondent answer "Yes, I love ice cream" and "Yes, I hate ice cream"? Psychologists are supposed to be trained to look for consistency and evasiveness in assessments and psych panels. If related questions have contradictory answers, there's an issue.

        This study says MTurk respondents are totally inconsistent on semantic antonyms and, in other studies, nearly identical questions. So, we shouldn't trust MTurk surveys.
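
        To make that concrete, the check amounts to correlating responses to each antonym pair and looking at the sign. A toy sketch (illustrative numbers, not the paper's data):

        ```python
        import numpy as np

        # Each row is one respondent's 1-5 agreement with an antonym pair,
        # e.g. ("I talk a lot", "I rarely talk"). Values are made up.
        responses = np.array([
            [5, 1],  # consistent: talks a lot, disagrees with "I rarely talk"
            [4, 2],
            [2, 4],
            [5, 5],  # inconsistent: agrees strongly with both statements
            [4, 4],
        ])

        r = np.corrcoef(responses[:, 0], responses[:, 1])[0, 1]
        print(f"correlation between antonym items: {r:.2f}")
        # An attentive sample should give a clearly negative correlation;
        # the paper found positive correlations for over 96% of pairs on MTurk.
        ```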

        5 votes
      2. palimpsest
        Link Parent

        All just to say that while I'm sure MTurk has plenty of mediocre data, I almost find this paper to say more about the shortfalls of trying to neatly categorize people into personality types and behaviors than it does about anything else.

        I have to disagree with you there. If you read the paper, you can see that MTurk results had positive correlations for almost every 'opposite' pair, as opposed to the other platform, which had negative correlations (as would be expected). The authors present a very good case why it's likely that MTurk participants just picked 'yes' (or 'no') for everything instead of bothering to answer the questions. (You can also see that a bunch of them spent less than 2 seconds answering each question, meaning they likely didn't read it at all.)
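
        For reference, the exclusion they describe is straightforward to replicate on any dataset that records timing; a rough sketch with hypothetical column names (the paper's actual data layout may differ):

        ```python
        import pandas as pd

        # Hypothetical survey export: total completion time and item count per respondent.
        df = pd.DataFrame({
            "respondent": ["r1", "r2", "r3"],
            "total_seconds": [310, 45, 120],
            "n_items": [54, 54, 54],
        })

        df["sec_per_item"] = df["total_seconds"] / df["n_items"]
        # Keep only respondents who averaged at least 2 seconds per item.
        attentive = df[df["sec_per_item"] >= 2.0]
        print(attentive[["respondent", "sec_per_item"]])
        ```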

        4 votes
  3. skybrian
    Link

    Here is the abstract:

    Over the last several years, a number of studies have used advanced statistical and methodological techniques to demonstrate that there is an issue with the quality of data on Amazon’s Mechanical Turk (MTurk). The current preregistered study aims to provide an accessible demonstration of this issue using a face-valid indicator of data quality: Do items that assess clearly contradictory content show positive correlations on the platform? We administered 27 semantic antonyms—pairs of items that assess incompatible beliefs or behaviours (e.g., “I am an extrovert” and “I am an introvert”)—to a sample of MTurk participants (N = 400). Over 96% of the semantic antonyms were positively correlated in the sample. For example, “I talk a lot” was positively correlated with “I rarely talk”; “I am narcissistic” was positively correlated with “I am a selfless person”; and “I like order” was positively correlated with “I crave chaos.” Moreover, 67% of the correlations remained positive even after we excluded nearly half of the sample for failing common attention check measures. These findings provide clear evidence that data collected on MTurk cannot be trusted, at least without a considerable amount of screening.

    17 votes
  4. JCPhoenix
    Link

    Interesting paper. I'm not surprised that MTurk participants are just running through surveys and tasks. As low as payments typically are, such as a penny, people have to minimize time in order to maximize revenue. Which means speedrunning surveys and other tasks. But I didn't expect it to be that bad.

    I wonder which is worse: using primarily college students as lab rats (I had to do a lot of these when I was in Psych 100) or using MTurk?

    That said, I definitely signed up to be a participant on those other two platforms mentioned: Cloud Research Connect and Prolific. $8/hr doing surveys? Sure, why not?

    16 votes
  5. Noox
    Link

    It's an endless game of cat and mouse. I worked for a survey company (being very vague on purpose) and there were considerable resources spent on detecting faulty participants - available to our clients for an upcharge, of course.

    If you did your survey through the company and got your participants from MTurk, you'd definitely have an uphill battle, and it could take considerable time to root out the bad-quality respondents.

    I do believe that there might have been an option to only allow certain ranked participants from MTurk to take your survey though? Like only those who have had a high amount of survey responses accepted. I may be thinking of another participant-recruitment site though.
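
    If it was MTurk I'm thinking of, that would be its built-in qualification requirements, which let requesters gate HITs on things like approval rate. A rough boto3 sketch (treat the system qualification ID and parameter details as assumptions to verify against the MTurk docs):

    ```python
    import boto3

    mturk = boto3.client("mturk", region_name="us-east-1")

    # System qualification for "percent of assignments approved"; only workers
    # at or above 98% can accept the HIT. Double-check the ID and values.
    qualification_requirements = [{
        "QualificationTypeId": "000000000000000000L0",  # PercentAssignmentsApproved
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [98],
        "ActionsGuarded": "Accept",
    }]

    hit = mturk.create_hit(
        Title="Short research survey",
        Description="A 5-minute questionnaire",
        Reward="0.75",
        MaxAssignments=3,
        LifetimeInSeconds=86400,
        AssignmentDurationInSeconds=1800,
        Question=open("survey_question.xml").read(),
        QualificationRequirements=qualification_requirements,
    )
    ```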

    9 votes