Introduction
Since last week's results from the fine-tuned model were unsatisfactory, we decided to rewrite the synthetic data generation script to get better data that aligns with our goals (i.e., data representative of the BenchIE dataset). In addition, we got temporary server access from the DBpedia team for testing and benchmarking the Hindi SPARQL endpoint.
Synthetic Data Generation
The primary problem with our previously generated data was that it was not representative of the BenchIE dataset. As a result, when we fine-tuned the base model (gemma3-4b), it learned well and performed well on the test split (loss of 0.484, test perplexity of 1.623), but it performed poorly on the BenchIE dataset (24% recall).
Data Generation Strategy
- I extracted all relations from the BenchIE dataset and grouped them into semantic concepts, for example:
SEMANTIC_CONCEPTS = {
    "Becoming & Appointment": ["बने", "बन चुके हैं", "नियुक्त हुई", "निदेशक बने", "घोषित किया गया", "एक प्रमुख हस्ती बन चुके हैं", "आजाद हुआ"],
    "Initiation & Beginning": ["शुरू की", "शुरू हुई", "आरम्भ हुई", "सुनवाई शुरू हुई", "आधारशिला रखी गयी थी", "बोलीवुड करियर की शुरुआत की"],
    ...
}
- We also extracted sentence structure templates from all the BenchIE sentences, for example: [“A sentence stating a contribution to a field (which is a list of two items joined by ‘और’), using a comparison with another person (’…के समान’).”, “A sentence explaining the usage of a word (’… प्रयोग में आता है’), where the entity it refers to is described by a participle phrase (’… बनाने वाली’).”, …]
Now we have three sub-strategies: structure_first, multi_relation, and targeted_relation.
- In structure_first, we prompt the LLM to generate a sentence that follows a randomly chosen structure template and incorporates one randomly chosen relation.
- In multi_relation, we prompt the LLM to use two random relations drawn from two different semantic concepts.
- In targeted_relation, we prompt the LLM to generate a sentence around a single randomly chosen relation.
The script runs in a loop and picks which strategy to execute based on the weight assigned to each: 50% for structure_first, 30% for multi_relation, and 20% for targeted_relation.
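For illustration, a minimal sketch of this weighted strategy selection is below; the function name and the dispatch loop are hypothetical stand-ins, not the actual script.

```python
import random

# Weights as described above; they must only be proportional, not sum to 1.
STRATEGY_WEIGHTS = {
    "structure_first": 0.5,
    "multi_relation": 0.3,
    "targeted_relation": 0.2,
}

def pick_strategy():
    # random.choices draws one strategy according to the assigned weights.
    strategies = list(STRATEGY_WEIGHTS)
    weights = list(STRATEGY_WEIGHTS.values())
    return random.choices(strategies, weights=weights, k=1)[0]

# Hypothetical generation loop: each iteration dispatches to one strategy.
for _ in range(10):
    strategy = pick_strategy()
    print(strategy)  # e.g. "structure_first" roughly half the time
```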
This ensures the generated data is representative of our final Hindi BenchIE dataset. We also added instructions to output a “thought_process” field, which our small LM can learn from when extracting relations. The prompt also includes specific annotation guidelines so the LLM adheres to BenchIE standards.
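To make the output format concrete, a generated record might look roughly like the following; the field names and the example sentence are hypothetical illustrations based on the description above, not the script's actual schema.

```python
# Hypothetical shape of one generated record (schema and example assumed):
example_record = {
    "sentence": "राहुल ने 2005 में एक कंपनी शुरू की।",  # "Rahul started a company in 2005."
    "triples": [
        ["राहुल", "शुरू की", "एक कंपनी"],  # (subject, relation, object)
    ],
    "thought_process": "The verb phrase 'शुरू की' relates the subject "
                       "'राहुल' to the object 'एक कंपनी'.",
}
```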
You can find the whole script here.
Test Deployment and Testing
Server Setup
First things first, the server was set up with Ubuntu 24.x. SSH access was configured, and then I installed the required dependencies: Docker, Python, and pip.
Creating the Docker Image
Once the server was set up, I made the changes needed for simple deployment of the Hindi chapter via a Docker image. The Dockerfile sets the environment variables, copies the RDF data into the container, copies the data loading and startup scripts, and exposes the required ports.
Then I built and ran the image locally for testing.
docker build -t hindi-dbpedia-sparql:latest .
docker run -d -p 8890:8890 -p 1111:1111 --name hindi-sparql-test hindi-dbpedia-sparql:latest
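Once the container is up, a quick sanity check can be run against the endpoint. Here is a minimal sketch, assuming Virtuoso's default /sparql path on port 8890:

```python
import requests

# Ask the endpoint for a single triple to confirm it is serving data.
resp = requests.get(
    "http://localhost:8890/sparql",
    params={
        "query": "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 1",
        "format": "application/sparql-results+json",
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["results"]["bindings"])
```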
There were a few hiccups with the data loading script, which were resolved; the image was then pushed to my Docker Hub using a multi-platform build to support both AMD and ARM architectures.
docker buildx build --platform linux/amd64,linux/arm64 \
-t 42bitstogo/hindi-dbpedia-sparql:latest \
--push .
On the server I deployed this image using:
docker run -d -p 8890:8890 -p 1111:1111 --name hindi-sparql 42bitstogo/hindi-dbpedia-sparql:latest
Performance Testing
The server was tested for performance at varying load levels: I simulated hitting it with multiple requests at different levels of concurrency.
Simple Query Performance
3 different simple queries:
- SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }
- SELECT (COUNT(DISTINCT ?s) AS ?count) WHERE { ?s ?p ?o }
- SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10
Results: 0.7-0.9ms response times
Complex Query Performance
- Aggregation query with GROUP BY and ORDER BY
- Result: 0.8ms response time
Quick Load Test
- 5 concurrent requests, 10 iterations
- Result: 20ms average, 100% success rate
Comprehensive Python Performance Testing
queries = [
"Total Triples Count",
"Distinct Subjects",
"Distinct Properties",
"Person Entities",
"Geographic Data"
]
Purpose: Test fundamental data access patterns
Results: 0.02s - 1.12s response times
Coverage: Count queries, entity queries, geographic data
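As a sketch of how such timings can be collected (the endpoint URL and the concrete queries here are placeholders, not the exact test script):

```python
import time
import requests

ENDPOINT = "http://localhost:8890/sparql"  # assumed endpoint URL

# Placeholder queries standing in for the named test categories above.
QUERIES = {
    "Total Triples Count": "SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }",
    "Distinct Subjects": "SELECT (COUNT(DISTINCT ?s) AS ?count) WHERE { ?s ?p ?o }",
}

for name, query in QUERIES.items():
    start = time.perf_counter()
    resp = requests.get(ENDPOINT, params={
        "query": query,
        "format": "application/sparql-results+json",
    }, timeout=30)
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.3f}s (HTTP {resp.status_code})")
```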
complex_queries = [
"Multi-Join Query",
"Aggregation Query",
"Text Search"
]
Purpose: Test advanced SPARQL features
Results: 0.01s - 0.04s response times
Coverage: Joins, aggregations, text search with Hindi filters
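As an illustration of the text-search case, a query with a Hindi-language filter might look like this (an assumed example, not necessarily the exact query from the test suite):

```python
# Illustrative text search: labels tagged Hindi that contain "भारत" ("India").
text_search_query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s ?label WHERE {
  ?s rdfs:label ?label .
  FILTER(LANG(?label) = "hi" && CONTAINS(STR(?label), "भारत"))
}
LIMIT 10
"""
```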
Concurrent Load Testing
def test_concurrent_load(self, num_threads=10, duration=60):
# 10 threads running for 60 seconds
# Continuous query execution
Purpose: Measure sustained throughput under load
Results: 23,515 queries, 391.92 QPS, 25.5ms average response time
Coverage: Multi-threaded sustained load
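A self-contained sketch of such a sustained load test is below; the endpoint URL and query are assumptions, and the real harness likely records more detail (per-query success, percentiles):

```python
import time
import threading
import requests

ENDPOINT = "http://localhost:8890/sparql"  # assumed endpoint URL
QUERY = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"

def worker(stop_at, latencies, lock):
    # Fire queries back-to-back until the deadline, recording each latency.
    while time.time() < stop_at:
        start = time.perf_counter()
        requests.get(ENDPOINT, params={
            "query": QUERY, "format": "application/sparql-results+json",
        }, timeout=10)
        with lock:
            latencies.append(time.perf_counter() - start)

def run_load_test(num_threads=10, duration=60):
    latencies, lock = [], threading.Lock()
    stop_at = time.time() + duration
    threads = [threading.Thread(target=worker, args=(stop_at, latencies, lock))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    qps = len(latencies) / duration
    avg_ms = 1000 * sum(latencies) / len(latencies)
    print(f"{len(latencies)} queries, {qps:.2f} QPS, {avg_ms:.1f}ms avg")

run_load_test()
```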
Memory Usage Testing
def test_memory_usage(self):
# Monitor memory before/after queries
# Test with different query sizes
Purpose: Measure memory efficiency and stability
Results: 1.95GB baseline, 1.97GB peak, 0.01GB increase
Coverage: Memory leak detection, resource efficiency
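One way to approximate this kind of check is sketched below, using psutil to sample system-level memory; this is an assumption about the measurement approach, which may differ from the actual script:

```python
import psutil
import requests

ENDPOINT = "http://localhost:8890/sparql"  # assumed endpoint URL

def used_gb():
    # System-wide used memory in GB; Virtuoso dominates usage on this host.
    return psutil.virtual_memory().used / 1e9

baseline = used_gb()
for limit in (10, 1000, 100000):  # queries returning increasingly large results
    requests.get(ENDPOINT, params={
        "query": f"SELECT ?s ?p ?o WHERE {{ ?s ?p ?o }} LIMIT {limit}",
        "format": "application/sparql-results+json",
    }, timeout=60)
print(f"baseline {baseline:.2f}GB, after queries {used_gb():.2f}GB")
```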
Advanced Load Testing (load_test.sh)
Progressive Concurrency Testing
for concurrency in [1, 5, 10, 20, 50]:
# 100 requests per concurrency level
# Measure success rate, response times
Purpose: Progressive concurrency scaling analysis
Results: 100% success rate across all levels
Response Times: 12ms → 16ms → 25ms → 45ms → 64ms
Coverage: Concurrency limits and scaling characteristics
Resource Monitoring During Load
Background monitoring during 60-second heavy load test with 20 workers, continuous requests
Purpose: Monitor system resources under sustained load
Results: 23,488 requests in 60s, 391.47 QPS
Coverage: CPU, memory, network usage during peak load
Final Results
Comprehensive Test Coverage Matrix
| Test Category | Test Type | Duration | Concurrency | Metrics Measured |
|---|---|---|---|---|
| Connectivity | Single Request | <1s | 1 | Response time, HTTP status |
| Basic Performance | Simple Queries | <5s | 1 | Query response times |
| Complex Performance | Advanced Queries | <10s | 1 | Join/aggregation performance |
| Quick Load | Concurrent Requests | <30s | 5 | Success rate, response times |
| Sustained Load | Continuous Load | 60s | 10 | Throughput, QPS, response times |
| Concurrency Scaling | Progressive Load | Variable | 1-50 | Scaling characteristics |
| Resource Monitoring | System Resources | 60s | 20 | CPU, memory, I/O usage |
| Memory Testing | Memory Usage | <30s | 1 | Memory efficiency, leaks |
Key Performance Metrics Collected
Response Time Metrics
- Baseline: 12ms (single request)
- Simple Queries: 0.7-0.9ms
- Complex Queries: 0.7-0.8ms
- Concurrent Load: 25.5ms (10 threads)
- Peak Load: 63.5ms (50 concurrent users)
Throughput Metrics
- Peak QPS: 391.92 requests/second
- Total Queries: 23,515 (60-second test)
- Success Rate: 100% across all tests
Resource Utilization
- CPU Usage: 0% (idle) → 750% (peak load)
- Memory Usage: 1.3GB stable
- Memory Efficiency: 2% of available 62GB RAM
Scalability Metrics
- Concurrency Scaling: Linear response time increase
- Maximum Tested: 50 concurrent users
- Performance Degradation: Predictable and acceptable
All of the code and detailed results are available on GitHub.
Next Steps
- Continue working on finetuning once we have the data collected from the synthetic data generation script.
- Once we receive the permanent server, we can deploy the Hindi chapter SPARQL endpoint.