
GSoC Week 7: Synthetic Data Generation and Performance Testing of Hindi Chapter

Published:  at  07:46 AM

Introduction

As last week’s results from the finetuned model were unsatisfactory, we decided to rewrite the synthetic data generation script to get better data that aligns with our goals, i.e., data representative of the BenchIE dataset. In addition, we got temporary server access from the DBpedia team for testing and benchmarking the Hindi SPARQL endpoint.


Synthetic Data Generation

The primary problem with our previously generated data was that it was not representative of the BenchIE dataset. That is why, when we finetuned the base model (gemma3-4b), it performed well on the test split (a loss of 0.484 and a test perplexity of 1.623) but poorly on the BenchIE dataset (24% recall).

Data Generation Strategy

  1. I extracted all relations from the BenchIE dataset and grouped them into semantic concepts, as follows:
SEMANTIC_CONCEPTS = {
    "Becoming & Appointment": ["बने", "बन चुके हैं", "नियुक्त हुई", "निदेशक बने", "घोषित किया गया", "एक प्रमुख हस्ती बन चुके हैं", "आजाद हुआ"],
    "Initiation & Beginning": ["शुरू की", "शुरू हुई", "आरम्भ हुई", "सुनवाई शुरू हुई", "आधारशिला रखी गयी थी", "बोलीवुड करियर की शुरुआत की"],
    ...
}
  2. We also extracted sentence-structure templates from all the BenchIE sentences, such as:

“A sentence stating a contribution to a field (which is a list of two items joined by ‘और’), using a comparison with another person (‘…के समान’).”
“A sentence explaining the usage of a word (‘… प्रयोग में आता है’), where the entity it refers to is described by a participle phrase (‘… बनाने वाली’).”
…

Now we have three sub-strategies: structure_first, multi_relation, and targeted_relation.

The script runs in a loop and, on each iteration, picks which strategy to execute based on the weights assigned to them: 50% for structure_first, 30% for multi_relation, and 20% for targeted_relation. This keeps the generated data representative of our final Hindi BenchIE dataset. We also added instructions for the LLM to output a “thought_process” field, which our small LM can learn from when extracting relations, and the prompt includes specific annotation guidelines so the output adheres to BenchIE standards.
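A minimal sketch of this weighted selection, assuming the strategies are looked up by name (the names and structure here are illustrative, not the exact ones from the script):

import random

# Strategy weights as described above (illustrative constants).
STRATEGY_WEIGHTS = {
    "structure_first": 0.5,
    "multi_relation": 0.3,
    "targeted_relation": 0.2,
}

def pick_strategy() -> str:
    """Sample one generation strategy per loop iteration, proportional to its weight."""
    return random.choices(
        population=list(STRATEGY_WEIGHTS),
        weights=list(STRATEGY_WEIGHTS.values()),
        k=1,
    )[0]

# Each iteration of the generation loop picks a strategy and produces one example.
for _ in range(5):
    print(pick_strategy())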

You can find the whole script here.


Test Deployment and Testing

Server Setup

First things first: the server was set up with Ubuntu 24.x. SSH access was configured, and then I installed the required dependencies such as Docker, Python, and pip.

Creating the Docker Image

Once the server was set up, I made the changes needed for simple deployment of the Hindi chapter via a Docker image. The Dockerfile sets the environment variables, copies the RDF data into the container, copies the data-loading and startup script, and finally exposes the ports.
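A minimal sketch of such a Dockerfile, assuming the stock openlink/virtuoso-opensource-7 base image (the file names, paths, and password below are placeholders, not the actual values):

FROM openlink/virtuoso-opensource-7
# Environment variables for the Virtuoso instance (placeholder password).
ENV DBA_PASSWORD=dba
# Copy the Hindi RDF dumps into the container.
COPY data/ /data/
# Copy the data-loading and startup script.
COPY load_and_start.sh /load_and_start.sh
# Virtuoso's SPARQL/HTTP port (8890) and ISQL port (1111).
EXPOSE 8890 1111
CMD ["/load_and_start.sh"]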

Then I built and ran the image locally for testing.

docker build -t hindi-dbpedia-sparql:latest .
docker run -d -p 8890:8890 -p 1111:1111 --name hindi-sparql-test hindi-dbpedia-sparql:latest

A few hiccups with the data-loading script were resolved, and the image was then pushed to my Docker Hub using a multi-platform build to support both amd64 and arm64 architectures.

docker buildx build --platform linux/amd64,linux/arm64 \
  -t 42bitstogo/hindi-dbpedia-sparql:latest \
  --push .

On the server I deployed this image using:

docker run -d -p 8890:8890 -p 1111:1111 --name hindi-sparql 42bitstogo/hindi-dbpedia-sparql:latest

Performance Testing

The server’s performance was tested under varying levels of load by hitting it with multiple simultaneous requests at different levels of concurrency. The tests fell into the following categories, detailed below:

- Simple query performance (three different simple queries)
- Complex query performance
- A quick load test

Comprehensive Python Performance Testing

queries = [
    "Total Triples Count",
    "Distinct Subjects", 
    "Distinct Properties",
    "Person Entities",
    "Geographic Data"
]

Purpose: Test fundamental data access patterns
Results: 0.02s - 1.12s response times
Coverage: Count queries, entity queries, geographic data
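For context, a minimal sketch of how such per-query timings can be measured against the endpoint (the endpoint URL and the sample query are assumptions; the real script maps the names above to full SPARQL queries):

import time
import requests

# Assumed local deployment; Virtuoso serves SPARQL over HTTP on port 8890.
ENDPOINT = "http://localhost:8890/sparql"

def time_query(query: str) -> float:
    """Run a SPARQL query against the endpoint and return the elapsed seconds."""
    start = time.perf_counter()
    response = requests.get(
        ENDPOINT,
        params={"query": query, "format": "application/sparql-results+json"},
        timeout=30,
    )
    response.raise_for_status()
    return time.perf_counter() - start

# Example: a "Total Triples Count"-style query.
elapsed = time_query("SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }")
print(f"Total Triples Count: {elapsed:.2f}s")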

complex_queries = [
    "Multi-Join Query",
    "Aggregation Query", 
    "Text Search"
]

Purpose: Test advanced SPARQL features
Results: 0.01s - 0.04s response times
Coverage: Joins, aggregations, text search with Hindi filters
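As an illustration, the text-search case might use a query along these lines (an assumed example, not the exact query from the script); the string can be fed to the time_query helper from the earlier sketch:

# Illustrative SPARQL text search with a Hindi filter (assumed query).
HINDI_TEXT_SEARCH = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?entity ?label
WHERE {
  ?entity rdfs:label ?label .
  FILTER(CONTAINS(STR(?label), "भारत"))
}
LIMIT 10
"""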

Concurrent Load Testing

def test_concurrent_load(self, num_threads=10, duration=60):
    # 10 threads running for 60 seconds
    # Continuous query execution

Purpose: Measure sustained throughput under load
Results: 23,515 queries, 391.92 QPS, 25.5ms average response time
Coverage: Multi-threaded sustained load
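A self-contained sketch of what such a loop can look like (the endpoint URL and query are assumptions, and the real test script records more detailed metrics):

import time
import threading
import requests

ENDPOINT = "http://localhost:8890/sparql"  # assumed local deployment
QUERY = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"

def concurrent_load(num_threads: int = 10, duration: int = 60) -> float:
    """Run continuous queries from num_threads workers; return overall QPS."""
    deadline = time.time() + duration
    counts = [0] * num_threads

    def worker(i: int) -> None:
        while time.time() < deadline:
            requests.get(ENDPOINT, params={"query": QUERY}, timeout=30)
            counts[i] += 1

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(counts) / duration

print(f"{concurrent_load():.2f} QPS")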

Memory Usage Testing

def test_memory_usage(self):
    # Monitor memory before/after queries
    # Test with different query sizes

Purpose: Measure memory efficiency and stability
Results: 1.95GB baseline, 1.97GB peak, 0.01GB increase
Coverage: Memory leak detection, resource efficiency
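A rough sketch of the idea, using psutil to compare system memory before and after a batch of queries (the endpoint, query, and batch size are assumptions; the actual script may track the container's process instead):

import psutil
import requests

ENDPOINT = "http://localhost:8890/sparql"  # assumed local deployment

def used_gb() -> float:
    """System memory currently in use, in gigabytes."""
    return psutil.virtual_memory().used / 1024**3

baseline = used_gb()
for _ in range(100):
    requests.get(
        ENDPOINT,
        params={"query": "SELECT * WHERE { ?s ?p ?o } LIMIT 1000"},
        timeout=30,
    )
peak = used_gb()
print(f"baseline {baseline:.2f} GB, after {peak:.2f} GB, delta {peak - baseline:.2f} GB")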

Advanced Load Testing (load_test.sh)

Progressive Concurrency Testing

for concurrency in [1, 5, 10, 20, 50]:
    # 100 requests per concurrency level
    # Measure success rate, response times

Purpose: Progressive concurrency scaling analysis
Results: 100% success rate across all levels
Response Times: 12ms → 16ms → 25ms → 45ms → 64ms
Coverage: Concurrency limits and scaling characteristics
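In Python terms, the sweep looks roughly like this (a sketch under the same assumed endpoint; load_test.sh itself drives the requests from the shell):

import time
from concurrent.futures import ThreadPoolExecutor
import requests

ENDPOINT = "http://localhost:8890/sparql"  # assumed local deployment
QUERY = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"

def timed_request(_: int) -> float:
    """Issue one query and return its response time in seconds."""
    start = time.perf_counter()
    requests.get(ENDPOINT, params={"query": QUERY}, timeout=30).raise_for_status()
    return time.perf_counter() - start

for concurrency in [1, 5, 10, 20, 50]:
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        times = list(pool.map(timed_request, range(100)))  # 100 requests per level
    print(f"concurrency {concurrency}: avg {sum(times) / len(times) * 1000:.0f} ms")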

Resource Monitoring During Load

Background monitoring ran during a 60-second heavy-load test, with 20 workers issuing continuous requests.

Purpose: Monitor system resources under sustained load
Results: 23,488 requests in 60s, 391.47 QPS
Coverage: CPU, memory, network usage during peak load
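A sketch of the monitoring side, sampling CPU and memory once per second in a background thread while the load runs (the actual load_test.sh collects this from the shell; the interval and duration here are illustrative):

import threading
import time
import psutil

samples: list[tuple[float, float]] = []

def monitor(duration: int = 60, interval: float = 1.0) -> None:
    """Record (CPU %, memory %) once per interval for the test duration."""
    end = time.time() + duration
    while time.time() < end:
        samples.append((psutil.cpu_percent(interval=None), psutil.virtual_memory().percent))
        time.sleep(interval)

threading.Thread(target=monitor, daemon=True).start()
# ... run the heavy load test here, then inspect `samples` ...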

Final Results

Comprehensive Test Coverage Matrix

| Test Category | Test Type | Duration | Concurrency | Metrics Measured |
| --- | --- | --- | --- | --- |
| Connectivity | Single Request | <1s | 1 | Response time, HTTP status |
| Basic Performance | Simple Queries | <5s | 1 | Query response times |
| Complex Performance | Advanced Queries | <10s | 1 | Join/aggregation performance |
| Quick Load | Concurrent Requests | <30s | 5 | Success rate, response times |
| Sustained Load | Continuous Load | 60s | 10 | Throughput, QPS, response times |
| Concurrency Scaling | Progressive Load | Variable | 1-50 | Scaling characteristics |
| Resource Monitoring | System Resources | 60s | 20 | CPU, memory, I/O usage |
| Memory Testing | Memory Usage | <30s | 1 | Memory efficiency, leaks |

Key Performance Metrics Collected

- Response time metrics: 12-64ms across concurrency levels, 25.5ms average under sustained load
- Throughput metrics: ~391 QPS sustained (23,515 queries in 60 seconds)
- Resource utilization: 1.95GB baseline memory, 1.97GB peak, only a 0.01GB increase under load
- Scalability metrics: 100% success rate from 1 to 50 concurrent clients

All of the code and detailed results are available on GitHub.


Next Steps

