Underrepresented Languages (OpenAI)
GPT-4o shows improved reading comprehension and reasoning across a sample of historically underrepresented languages, and narrows the gap in performance between these languages and English.
To evaluate GPT-4o’s text performance across a select group of languages historically underrepresented in Internet text, we collaborated with external researchers and language facilitators to develop evaluations in five African languages: Amharic, Hausa, Northern Sotho (Sepedi), Swahili, and Yoruba. This initial assessment focused on translating two popular language benchmarks and creating a small, novel language-specific reading comprehension evaluation for Amharic, Hausa, and Yoruba.
ARC-Easy: This subset of the AI2 Reasoning Challenge [59] benchmark evaluates a model’s ability to answer common-sense, grade-school science questions; it contains questions that are generally easier to answer and do not require complex reasoning.
TruthfulQA [60]: This benchmark consists of questions that some humans might answer falsely due to misconceptions. The objective is to see whether models can avoid generating false answers that mimic these misconceptions.

Our principal research collaborators were Dr. David Adelani, Jonas Kgomo, and Ed Bayes.
Uhura-Eval: In partnership with fluent speakers of Amharic, Hausa and Yoruba, our research partners created this benchmark to assess models’ reading comprehension in those respective languages.
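Benchmarks like ARC-Easy and Uhura-Eval are multiple-choice evaluations, where a model's score is simply its accuracy over the items. A minimal sketch of that scoring loop is below; the item layout, the `grade` helper, and the toy English example are all illustrative assumptions, not the actual evaluation harness or data used in the study (real items would be in the target languages).

```python
# Hypothetical sketch of scoring a multiple-choice benchmark.
# Item format and grade() are illustrative, not the study's harness.

def grade(items, predict):
    """Accuracy of `predict(question, choices) -> choice index` over items."""
    correct = sum(
        1 for item in items
        if predict(item["question"], item["choices"]) == item["answer"]
    )
    return correct / len(items)

# Toy stand-in items; a real benchmark would hold translated questions.
items = [
    {"question": "At what temperature does water boil at sea level?",
     "choices": ["0 °C", "50 °C", "100 °C", "150 °C"],
     "answer": 2},
    {"question": "Which planet is closest to the Sun?",
     "choices": ["Venus", "Mercury", "Earth", "Mars"],
     "answer": 1},
]

# A trivial baseline that always picks the first choice.
always_first = lambda question, choices: 0
print(grade(items, always_first))  # → 0.0 on this toy set
```

In practice the `predict` function would wrap a model call that maps a prompt to a chosen answer letter; the accuracy gap between such scores in English and in each target language is what "narrowing the gap" refers to above.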
This project evaluates the performance of large language models (LLMs) in African linguistic contexts, with the aim of helping preserve and promote African languages by benchmarking whether LLMs understand and generate text in these languages accurately. It involves collaboration between linguists and AI researchers to develop datasets and evaluation metrics that reflect the unique characteristics of African languages.