LLM Evaluator (Model Response Analyst)

Anywhere in Cameroon | Full-time | Remote | 11 June 2026

Job description

We are seeking a detail-oriented and analytical LLM Evaluator to assess, analyze, and improve the performance of large language models (LLMs). In this role, you will evaluate AI-generated content for accuracy, coherence, factual reliability, bias, safety, and alignment with defined guidelines.

Responsibilities:
- Evaluate and rank model-generated text against complex rubrics covering dimensions such as factuality, coherence, safety, instruction-following, and creativity.
- Review multiple model responses to the same prompt, determine which output a human would prefer, and justify your choices.
- Provide clear, concise feedback to the modeling and training teams on recurring failure modes observed during evaluation sessions.
- Attempt to "break" the model by crafting prompts designed to elicit biased, harmful, or insecure outputs, to help patch safety vulnerabilities.
- Collaborate with the quality assurance team to suggest improvements to evaluation guidelines when you encounter ambiguous or unclassifiable edge cases.
- Participate in regular "cross-checking" sessions with other evaluators to calibrate scoring standards and ensure inter-rater reliability across the global team.
- When a model underperforms, dig deeper than the surface score to hypothesize why the model made a specific error.
- Identify and flag novel or unexpected model behaviors to the research team, contributing to a living library of unique model outputs and failure modes.

Candidate profile

- Minimum of 2 years of professional experience in a relevant field such as Computational Linguistics, Data Analysis, Technical Writing, Quality Assurance (specifically for NLP/AI), or Cognitive Science.
- Bachelor's degree in Computer Science or a related field.
- Deep understanding of how to craft prompts that elicit specific behaviors and test model limits.
- Ability to look at a text output and explain why it is "good" or "bad" based on logic, tone, factuality, and instruction adherence.
- Experience working with Reinforcement Learning from Human Feedback (RLHF) data collection.
- Proven experience monitoring and improving consistency among evaluation teams.
- Experience sourcing, cleaning, and annotating datasets specifically for fine-tuning or evaluating LLMs.