GSoC Week 8-9: Paper Review, SPARQL Endpoint Demonstration, and Synthetic Data Generation

Introduction

Over the past two weeks, I worked on a diverse set of activities, ranging from an academic paper review to synthetic data generation. Week 8 focused on reviewing a significant paper on multilingual question answering over knowledge graphs, while Week 9 was dedicated to demonstrating the capabilities of our Hindi DBpedia SPARQL endpoint and generating synthetic data for model finetuning.

Paper Review: Multilingual Text-to-SPARQL

Paper Overview

I conducted a comprehensive review of the paper “Text-to-SPARQL Goes Beyond English: Multilingual Question Answering Over Knowledge Graphs through Human-Inspired Reasoning” during our weekly meeting. This paper presents a novel approach to multilingual question answering over knowledge graphs using a human-inspired reasoning framework.

Strengths Analysis

Comprehensive Experimental Setup

The paper’s experimental methodology was particularly impressive, featuring:

Evaluation across 10 languages including low-resource ones
Use of both commercial models (OpenAI GPT-3.5 and GPT-4) and open-source models (Qwen and Llama)
Detailed cost analysis for practical deployment considerations

Valuable Ablation Study

The ablation study was especially noteworthy as it systematically broke down the contributions of different components:

Experience pool effectiveness
Feedback step improvements
Quantified gains from each architectural component

This granular analysis provides clear insights into which parts of the system contribute most to performance improvements.

Areas for Improvement

Lack of Error Analysis

The most significant gap identified was the absence of detailed error analysis. While the paper shows where the agent performs well, it doesn’t explain:

Specific failure conditions and patterns
When the agent creates flawed execution plans
Whether NEL (Named Entity Linking) tool failures contribute to errors
Performance degradation with increasing query complexity (constraints, counting, ordering)

This analysis would be invaluable for determining future work directions and improvement areas.

Limited Feedback Loop

The feedback mechanism was identified as overly simplistic:

Only single-pass correction implemented
Authors mention avoiding infinite loops as justification
Potential for multi-step self-correction or debate-based refinement
Could explore iterative refinement strategies
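
To make the multi-step idea concrete, here is a minimal sketch of a correction loop with a fixed round budget, which avoids infinite loops without limiting the agent to a single pass. This is not the paper's method: `refine_with_budget`, `toy_generate`, and `toy_validate` are hypothetical names, and the toy functions only stand in for a real query generator and validator.

```python
from typing import Callable, Optional


def refine_with_budget(
    question: str,
    generate: Callable[[str, Optional[str]], str],
    validate: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> tuple[str, int]:
    """Regenerate a query until validation passes or the budget runs out.

    generate(question, feedback) returns a candidate query.
    validate(query) returns None if the query is acceptable, else an error message.
    Returns the last candidate and the number of correction rounds used.
    """
    candidate = generate(question, None)
    for used in range(max_rounds):
        feedback = validate(candidate)
        if feedback is None:
            return candidate, used
        # Feed the error message back into the next generation attempt.
        candidate = generate(question, feedback)
    # Budget exhausted: return the last attempt, which was not re-validated.
    return candidate, max_rounds


# Toy stand-ins (hypothetical): the first attempt misses a closing brace.
def toy_generate(question: str, feedback: Optional[str]) -> str:
    if feedback is None:
        return "SELECT ?s WHERE { ?s ?p ?o"
    return "SELECT ?s WHERE { ?s ?p ?o }"


def toy_validate(query: str) -> Optional[str]:
    return None if query.count("{") == query.count("}") else "unbalanced braces"


query, rounds = refine_with_budget("any question", toy_generate, toy_validate)
```

The hard cap on rounds bounds cost and latency, which addresses the infinite-loop concern while still allowing more than one correction.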

Inconsistent Language Prompting

The paper used a mix of native speaker-written prompts and machine-translated ones without discussing:

How this inconsistency affected performance
Comparative analysis of native vs. translated prompts
Impact on low-resource language performance

Hindi DBpedia SPARQL Endpoint Demonstration

Motivation and Benefits

The deployment of a dedicated Hindi DBpedia chapter represents a significant milestone in making structured knowledge accessible to the Hindi-speaking community. The primary motivations include:

Unlocking Hindi Wikipedia

Hindi Wikipedia contains vast amounts of information, but as unstructured prose intended for human readers. DBpedia transforms this repository into a structured, queryable knowledge graph by extracting information from infoboxes, tables, and categories.

Centralized Query Endpoint

Instead of requiring developers to build complex web scrapers for individual Wikipedia pages, the Hindi DBpedia provides a single, stable SPARQL endpoint. This centralized graph acts as a unified source of truth, dramatically simplifying data access for applications.

Empowering Localized Applications

By providing structured data in Hindi, this project enables developers to build culturally relevant applications for one of the world’s largest language communities. Potential applications include:

Intelligent chatbots with cultural context
Localized search engine knowledge panels
Academic research tools
Recommendation systems tailored to Hindi-speaking audiences

Case Study: Entity Exploration (Amitabh Bachchan)

I created a comprehensive demonstration focusing on a single well-defined entity to showcase fundamental query patterns:

Entity Location and Retrieval

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?person ?name
WHERE {
  ?person a dbo:Person .
  ?person foaf:name ?name .
  FILTER(LANG(?name) = "hi")
  FILTER(CONTAINS(?name, "अमिताभ बच्चन"))
}
LIMIT 10
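
For readers who want to run a query like this from a script, here is a minimal sketch using only the Python standard library. It follows the SPARQL 1.1 Protocol convention of passing the query as a `query` URL parameter and asking for JSON results. The endpoint URL is a placeholder assumption, not the actual address of the Hindi DBpedia endpoint, and `build_sparql_request` is an illustrative helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: placeholder URL; replace it with the real SPARQL endpoint.
ENDPOINT = "https://example.org/sparql"

QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?person ?name
WHERE {
  ?person a dbo:Person .
  ?person foaf:name ?name .
  FILTER(LANG(?name) = "hi")
  FILTER(CONTAINS(?name, "अमिताभ बच्चन"))
}
LIMIT 10
"""


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    # urlencode percent-encodes the Devanagari literal as UTF-8.
    url = f"{endpoint}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)

# To execute against a live endpoint (not run here):
#   import json, urllib.request
#   with urllib.request.urlopen(request) as response:
#       bindings = json.load(response)["results"]["bindings"]
```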

Property Traversal and Relationship Following

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr-hi: <http://hi.dbpedia.org/resource/>

SELECT ?parentDeathDate
WHERE {
  dbr-hi:अमिताभ_बच्चन dbo:parent ?parentURI .
  ?parentURI dbo:deathDate ?parentDeathDate .
}

Advanced Query Capabilities

Categorical and Aggregate Queries

Demonstrated counting instances, assessing property availability, and performing complex aggregations:

PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT (COUNT(?s) as ?numberOfCities)
WHERE {
  ?s a dbo:City .
}

Complex Relationship Traversal

Showcased multi-step queries involving actors, movies, and directors:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr-hi: <http://hi.dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?movieName ?directorName
WHERE {
  ?movie dbo:starring dbr-hi:अमिताभ_बच्चन .
  ?movie foaf:name ?movieName .
  OPTIONAL {
    ?movie dbo:director ?director .
    ?director foaf:name ?directorName .
  }
}
LIMIT 20
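
One practical detail of the OPTIONAL block: in the SPARQL 1.1 JSON results format, a variable that stays unbound is simply missing from that row's binding, so client code has to handle the gap. The sketch below flattens such a result into rows with `None` for unbound variables. The payload is made up purely for illustration and is not actual endpoint output; `rows_from_bindings` is a hypothetical helper name.

```python
from typing import Optional


def rows_from_bindings(payload: dict) -> list[dict[str, Optional[str]]]:
    """Flatten a SPARQL JSON result into one dict per row.

    Variables left unbound by OPTIONAL are absent from a binding,
    so they are filled in with None here.
    """
    variables = payload["head"]["vars"]
    rows = []
    for binding in payload["results"]["bindings"]:
        rows.append(
            {var: binding[var]["value"] if var in binding else None for var in variables}
        )
    return rows


# Made-up payload for illustration only (placeholder names, not real data).
sample = {
    "head": {"vars": ["movieName", "directorName"]},
    "results": {
        "bindings": [
            {
                "movieName": {"type": "literal", "value": "Film A"},
                "directorName": {"type": "literal", "value": "Director A"},
            },
            # No director triple matched, so directorName is absent.
            {"movieName": {"type": "literal", "value": "Film B"}},
        ]
    },
}

rows = rows_from_bindings(sample)
```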

Advanced Filtering and Data Types

Demonstrated date filtering, regex pattern matching, and geographic coordinate retrieval. The query below shows the date-filtering case:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?personName ?birthDate
WHERE {
  ?person a dbo:Person ;
          dbo:birthDate ?birthDate ;
          foaf:name ?personName .
  FILTER (?birthDate > "1950-01-01"^^xsd:date)
}
ORDER BY ?birthDate
LIMIT 50

Synthetic Data Generation and Model Benchmarking

Data Generation Strategy

Building on the synthetic data generation work from Week 7, I adapted the script for OpenAI models and ran experiments with the 2.5 Flash model. The generation process yielded approximately 800 high-quality data points, using the following strategies and quality controls:

Multi-Strategy Approach

Structure-first generation (50% weight)
Multi-relation generation (30% weight)
Targeted relation generation (20% weight)
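
The weighting above can be read as a sampling distribution over strategies. Here is a minimal sketch of that idea, assuming each data point independently picks its strategy according to the listed weights. The strategy identifiers and the `pick_strategies` helper are hypothetical, and the actual script may allocate strategies differently.

```python
import random
from collections import Counter

# Weights from the list above; the identifiers are hypothetical names.
STRATEGY_WEIGHTS = {
    "structure_first": 0.5,
    "multi_relation": 0.3,
    "targeted_relation": 0.2,
}


def pick_strategies(n: int, seed: int = 0) -> list[str]:
    """Sample one generation strategy per data point, proportional to its weight."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    names = list(STRATEGY_WEIGHTS)
    weights = list(STRATEGY_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=n)


# Roughly how 800 data points would split across the three strategies.
counts = Counter(pick_strategies(800))
```

Sampling yields only an approximate 50/30/20 split; if exact proportions matter, fixed quotas per strategy would be the simpler choice.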

Quality Control Measures

Semantic concept grouping from the BenchIE dataset
Template-based sentence generation
Programmatic validation and filtering
Thought process inclusion for LM learning

Model Benchmarking Results

2.5 Flash Lite Performance

After generating the synthetic data, I benchmarked the 2.5-flash-lite model on the Hindi BenchIE dataset:

{
  "precision": 0.43,
  "recall": 0.43,
  "f1_score": 0.42999999999999994,
  "total_true_positives": 86,
  "total_false_positives": 114,
  "total_false_negatives": 114
}
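
As a sanity check, the scores can be recomputed from the raw counts. The sketch below assumes the metrics are micro-averaged, that is, computed directly from the total true positives, false positives, and false negatives. The function name `prf1` is illustrative and not taken from the benchmarking script.

```python
def prf1(tp: int, fp: int, fn: int) -> dict[str, float]:
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1_score": f1}


# Counts taken from the benchmark output above.
metrics = prf1(tp=86, fp=114, fn=114)
```

Because false positives and false negatives are both 114, precision and recall are both 86 / 200 = 0.43, and F1 equals the same value. The long decimal in f1_score is floating-point rounding of 0.43.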

Key Observations

Balanced precision and recall (both at 43%)
Significant improvement over previous baseline models
Room for further optimization through better data quality
Need for more diverse training examples

Technical Challenges and Solutions

Programmatic Validation Issues

I attempted to implement programmatic validation and filtering mechanisms, but these only reduced true positives without improving overall performance. This suggests that the current validation rules may be too restrictive for the task.

Data Quality Improvements

Adapted script for OpenAI models successfully
Generated approximately 800 high-quality data points
Implemented chunk integrity maintenance
Added comprehensive annotation guidelines

Next Steps

Immediate Actions

DBpedia Mappings Access: Requested access for updating DBpedia mappings via UI to improve data quality
Model Optimization: Continue refining synthetic data generation for better model performance
Error Analysis: Implement detailed error analysis for the 2.5 Flash model to identify improvement areas

Future Directions

Enhanced Data Generation: Explore more sophisticated data generation strategies
Model Finetuning: Use the generated synthetic data for finetuning smaller models
Performance Optimization: Focus on improving precision while maintaining recall
Integration Work: Integrate improved models into the existing pipeline

Conclusion

These two weeks have been highly productive, spanning academic research review, practical system demonstration, and data generation work. The paper review provided valuable insights into multilingual question answering approaches, while the SPARQL endpoint demonstration showcased the practical capabilities of our Hindi DBpedia chapter. The synthetic data generation work represents a significant step forward in improving model performance for Hindi information extraction tasks.

The balanced performance metrics from the 2.5 Flash Lite model (43% precision and recall) show promising progress, though there’s still room for improvement through better data quality and model optimization strategies.