
GSoC Week 3: Pushing the boundary of LLMs for Hindi Relation Extraction

Published:  at  11:42 AM

It may be that today’s large neural networks are slightly conscious. — Ilya Sutskever

Introduction

Week 3 of my GSoC ’25 journey was all about scaling up: moving from setup and debugging to running experiments with LLMs on Hindi information extraction. This week, I evaluated the performance of current open-source LLMs on Open Information Extraction (OIE) tasks. We built a framework to test the performance of an LLM served through Ollama on the full Hindi-BenchIE dataset, then systematically compared models and prompting strategies to figure out what works (and what doesn’t) for extracting structured knowledge from Hindi text.

This post explains our setup, methodology, and results.
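To make the setup concrete, here is a minimal sketch of how a framework like this can query a model through Ollama’s local REST API and parse SPO triples out of the raw completion. The prompt wording, the `(subject; predicate; object)` output convention, and the function names are illustrative assumptions, not the exact code we used:

```python
import json
import re
import urllib.request

# Assumes an Ollama server running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"

def extract_triples(sentence: str, model: str = "llama3") -> list[tuple[str, str, str]]:
    """Ask an Ollama-served model for SPO triples from one Hindi sentence."""
    prompt = (
        "Extract all subject-predicate-object triples from the following Hindi "
        "sentence. Output one triple per line as (subject; predicate; object).\n\n"
        + sentence
    )
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return parse_triples(data["response"])

def parse_triples(text: str) -> list[tuple[str, str, str]]:
    """Parse '(S; P; O)' lines from raw model output, skipping malformed ones."""
    triples = []
    for match in re.finditer(r"\(([^;()]+);([^;()]+);([^;()]+)\)", text):
        triples.append(tuple(part.strip() for part in match.groups()))
    return triples
```

Keeping generation (`extract_triples`) and parsing (`parse_triples`) separate makes it easy to re-score saved model outputs without re-running the LLM.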


Framework Overview

Our framework is simple, with three main components:


The Dataset: Hindi-BenchIE

The evaluation is done on the Hindi-BenchIE dataset, a gold-standard collection of 112 Hindi sentences, each annotated with clusters (sets of acceptable alternatives) of subject-predicate-object (SPO) triples. The dataset is challenging: sentences are diverse, often ambiguous, and require nuanced understanding to extract the correct relations.

Each sentence comes with multiple valid clusters of extractions, reflecting the inherent ambiguity and richness of the language. Our goal: see how close LLMs can get to human-level extraction on this benchmark.


Models and Prompting Strategies

We evaluated three LLMs:

For each model, we tested a suite of prompting strategies:

This matrix of models × strategies lets us probe not just which LLM is best, but how to talk to it for optimal extraction.
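The full sweep over this matrix can be sketched as a simple nested loop. The model and strategy names below are hypothetical placeholders (the actual ones are in the linked repo); only the 3 × 6 shape reflects the experiments described here:

```python
from itertools import product

# Hypothetical placeholders -- the real model/strategy names live in the repo.
MODELS = ["model_a", "model_b", "model_c"]
STRATEGIES = ["zero_shot", "few_shot", "chain_of_thought",
              "role_based", "structured_output", "translated_prompt"]

def run_all(extract, sentences):
    """Run every model x strategy combination over the benchmark sentences.

    `extract(sentence, model=..., strategy=...)` is any callable that
    returns a list of SPO triples for one sentence.
    """
    results = {}
    for model, strategy in product(MODELS, STRATEGIES):
        results[(model, strategy)] = [
            extract(s, model=model, strategy=strategy) for s in sentences
        ]
    return results
```

Caching the raw extractions per (model, strategy) pair means the (cheap) evaluation step can be re-run without repeating the (slow) LLM calls.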


Evaluation: How We Measure Extraction Quality

Extracting triples is only half the battle. To measure how well our models perform, we compare their outputs to the gold standard using three core metrics:

From these, we compute:

Intuition:

Our evaluation script does a detailed, cluster-aware comparison for each sentence, ensuring that partial matches and alternative valid extractions are handled fairly.
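A minimal sketch of what such cluster-aware scoring can look like, assuming exact matching on triples (the actual script’s matching rules may be more lenient, e.g. token-level overlap): a gold cluster counts as recalled if *any* of its alternative triples is predicted, and a prediction counts as correct if it appears in *any* cluster.

```python
def score_sentence(predicted, gold_clusters):
    """Cluster-aware counts for one sentence.

    predicted     -- list of (S, P, O) tuples from the model
    gold_clusters -- list of clusters; each cluster is a list of
                     alternative (S, P, O) tuples that are all acceptable
    """
    gold_all = {t for cluster in gold_clusters for t in cluster}
    tp = sum(1 for t in predicted if t in gold_all)      # correct predictions
    fp = len(predicted) - tp                             # spurious predictions
    recalled = sum(1 for cluster in gold_clusters
                   if any(t in predicted for t in cluster))
    fn = len(gold_clusters) - recalled                   # missed clusters
    return tp, fp, fn

def precision_recall_f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

The key design point is that a model is never penalized for picking one valid alternative over another within a cluster.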


Results: What Did We Learn?

We ran 18 experiments (3 models × 6 strategies), generating thousands of extractions. Here are the highlights:

Key Insights:

You can find the detailed code and results here: code, results.


Reflections and Next Steps

This week was a crash course in the realities of LLM-based information extraction for low-resource languages. The models are powerful but far from perfect, and prompting matters a lot.

Next up:

Stay tuned for more experiments and insights as I try to push the boundaries of LLM-assisted OIE on Hindi text.

