Google’s contractors told to evaluate Gemini responses outside their expertise

Behind the responses from genAI models are testers who evaluate those answers for accuracy, but a report released this week casts doubt on the process.

According to a TechCrunch story published on Wednesday, contractors working on Google Gemini are now being directed to evaluate AI prompts and responses in areas in which they have no background, rather than being allowed to skip them as before.

This flies in the face of the “Building responsibly” section of the Gemini 2.0 announcement, which said, “As we develop these new technologies, we recognize the responsibility it entails, and the many questions AI agents open up for safety and security. That is why we are taking an exploratory and gradual approach to development, conducting research on multiple prototypes, iteratively implementing safety training, working with trusted testers and external experts and performing extensive risk assessments and safety and assurance evaluations.”

Mismatch raises questions

According to TechCrunch, “a new internal guideline passed down from Google to contractors working on Gemini has led to concerns that Gemini could be more prone to spouting out inaccurate information on highly sensitive topics, like healthcare, to regular people.”

It said that the new guideline reads: “You should not skip prompts that require specialized domain knowledge.” Contractors are instead instructed to rate the parts they understand and add a note that they lack the necessary domain knowledge for the rest.

A blog post that appeared on Artificial Intelligence+ on Thursday noted that, while “contractors hired by Google to support Gemini are key players in the evaluation process … one of the challenges is that [they] are often required to evaluate responses that might lie outside their own areas of expertise. For instance, while some may come from technical backgrounds, the AI can produce outputs related to literature, finance, healthcare, or even scientific research.”

It said, “this mismatch raises questions about how effectively human oversight can serve in validating AI-generated content across diverse fields.”

However, Google pointed out in a later statement to TechCrunch that the “raters” don’t only review content; they also “provide valuable feedback on style, format, and other factors.”

‘Hidden component’ of genAI

When organizations are looking to leverage an AI model, it is important to reflect on responsible AI principles, Thomas Randall, research lead at Info-Tech Research Group, said Thursday.

He said that there is “a hidden component to the generative AI market landscape: companies that fall under the guise of ‘reinforcement learning from human feedback (RLHF)’. These companies, such as Appen, Scale AI, and Clickworker, rely on a gig economy of millions of crowd workers for data production and training the AI algorithms that we find with OpenAI, Anthropic, Google, and others. RLHF companies pose issues for fair labor practices, and are scored poorly by Fairwork.”

Last year, Fairwork, which defines itself as an “action-research project that aims to shed light on how technological changes affect working conditions around the world,” released a set of AI principles that, it said, “assess the working conditions behind the development and deployment of AI systems in the context of an employment relation.”

There is, it stated at the time, “nothing ‘artificial’ about the immense amount of human labor that builds, supports, and maintains AI products and services. Many workers interact with AI systems in the workplace, and many others perform the critical data work that underpins the development of AI systems.”

Questions to ask

The executive team of an organization looking to leverage an AI model, said Randall, needs to ask itself an assortment of questions, such as: “does the AI model you’re using rely on or use an RLHF company? If so, was the crowd worker pool diverse enough and provided sufficient expertise? How opaque was the training process for the models you are using? Can you trace data production? If the AI vendor does not know the answers to these questions, the organization needs to be prepared to take on accountability for any outputs the AI models provide.”

Paul Smith-Goodson, VP and principal analyst at Moor Insights & Strategy, added that it is vitally important that retrieval-augmented generation (RAG) be implemented, “because AI models do hallucinate and it is one way to make sure that language models are putting out the right information.”

He echoed Rick Villars, IDC group vice president of worldwide research, who earlier this year said, “more and more the solutions around RAG — and enabling people to use that more effectively — are going to focus on tying into the right data that has business value, as opposed to just the raw productivity improvements.”

A ‘corrosive effect’ on workers

Ryan Clarkson, managing partner at the Clarkson Law Firm, based in Malibu, California, said that the rapid growth of generative AI as a business has had corrosive effects on tech workers around the world.

For example, last week, workers filed a class action lawsuit through his firm against AI data processing company Scale AI, whose services include providing the human labor to label the data used in training AI models and in shaping their responses to queries.

The Scale AI lawsuit alleges poor working conditions and exploitative behavior by Scale, and says that workers responsible for generating much of its product were mischaracterized by the company as independent contractors instead of employees.
