Medical Anamnesis Chatbot with NLP (Chatbot PUSTU)
Production-ready medical chatbot achieving 92.61% intent classification accuracy using Multinomial Naive Bayes, built for the anamnesis workflow of Indonesian Puskesmas (community health centers). Training data was generated automatically via the Gemini Flash 2.0 API, and the NLP preprocessing pipeline was built from scratch.
Role
Full Stack Developer & NLP Engineer
Client
Academic Project - Natural Language Processing Course
Team
2-person Team
Timeline
5 months • 2025

Challenges
Building a production NLP system for Indonesian medical terminology without external Indonesian NLP libraries. Generating high-quality, balanced training data across 14 intent classes with medical-domain vocabulary. Implementing stateful 14-stage dialog management with context-aware entity extraction and a smart prefilling algorithm.
Solutions
Automated training data generation using Gemini Flash 2.0 with custom prompt engineering for Indonesian medical context (14,000 samples). Built NLP preprocessing pipeline from scratch with custom slang normalization (94 mappings) and stopword filtering (93 words). Implemented dictionary-based NER system using regex patterns and context-aware extraction without external NLP libraries. Hybrid prediction combining ML model outputs with keyword boosting (0.90-0.95 confidence thresholds).
Impact
Successfully streamlined patient anamnesis workflow for Indonesian Puskesmas healthcare workers. Achieved production-grade 92.61% accuracy on medical intent classification. 24/7 cloud deployment enables immediate adoption without infrastructure requirements. Demonstrates feasibility of building domain-specific NLP systems for low-resource languages using LLM-powered data generation and classical ML techniques.
Project Overview
Chatbot PUSTU is a full-stack medical anamnesis chatbot designed for Indonesian Puskesmas (community health centers) healthcare workers. Built as a Natural Language Processing course project, this production-ready system uses Multinomial Naive Bayes with TF-IDF vectorization to achieve 92.61% intent classification accuracy across 14 medical intent classes for Indonesian language patient interviews.
The system demonstrates end-to-end NLP engineering: from LLM-powered training data generation (Gemini Flash 2.0 API) to custom preprocessing pipeline built from scratch, dictionary-based NER system, stateful dialog management, and full-stack deployment on Railway (Flask backend) and Vercel (Next.js frontend) with 24/7 availability.
Healthcare Problem Statement
Indonesian Puskesmas face significant workflow inefficiencies in patient anamnesis:
- Manual Data Collection: Time-consuming patient history documentation (15+ minutes per patient)
- Inconsistent Forms: Non-standardized anamnesis records across healthcare workers
- Language Barriers: Indonesian medical terminology lacks robust NLP tools and libraries
- Limited Resources: Small health centers cannot afford expensive EMR systems
- Workflow Bottleneck: Doctor time consumed by administrative data entry instead of diagnosis
Technical Architecture
System Overview
Three-tier architecture with ML-powered backend:
Frontend (Vercel)
- Next.js 16 + TypeScript
- Chat UI with conversation history
- Dark/Light mode toggle
- PDF export (client-side jsPDF)
- Session state management
Backend (Railway)
- Flask + Gunicorn
- Intent classification endpoint
- Entity extraction (NER)
- Dialog state management
- Smart prefilling algorithm
- Session persistence (in-memory)
ML Pipeline
- Multinomial Naive Bayes (alpha=0.1)
- TF-IDF Vectorizer (5000 features, n-grams 1-2)
- Custom preprocessing (slang + stopwords)
- Dictionary-based NER (97 symptoms, 23 locations)
- Hybrid prediction (ML + keyword boosting)
Technology Stack
Machine Learning Pipeline
- Algorithm: Multinomial Naive Bayes (sklearn)
- Vectorization: TF-IDF (5000 max features, n-grams 1-2)
- Hyperparameters: Laplace smoothing alpha=0.1
- Custom NLP: Regex-based preprocessing, dictionary-based NER
Training Data Generation
- LLM API: Gemini Flash 2.0 (Google AI Studio)
- Prompt Engineering: Custom prompts for Indonesian medical terminology
- Dataset Size: 14,000 samples (1,000 per intent × 14 classes)
- Generation Cost: ~$3.50 equivalent API usage (ran within the free tier)
Training Data Generation with Gemini Flash 2.0
LLM-Powered Dataset Creation
Challenge: No existing Indonesian medical anamnesis dataset available for training.
Solution: Automated generation using Gemini Flash 2.0 API with custom prompt engineering.
Intent Definitions
14 medical conversation intent classes:
- keluhan_utama: Patient's chief complaint with symptoms
- jawab_gejala_penyerta: Additional accompanying symptoms
- jawab_durasi: Duration of symptoms (days/weeks/months)
- jawab_lokasi: Body location of complaint
- jawab_severity: Severity level (mild/moderate/severe)
- jawab_riwayat_penyakit: Previous medical history
- jawab_riwayat_obat: Current medications
- jawab_alergi: Drug/food allergies
- jawab_faktor_risiko: Lifestyle risk factors
- sapaan: Greeting from patient
- ucapan_terima_kasih: Thank you from patient
- konfirmasi: Yes/confirmation response
- penyangkalan: No/denial response
- tidak_jelas: Unclear/confused response
Prompt Engineering Strategy
Key prompt design decisions:
- Patient persona: "Puskesmas patient" for realistic Indonesian medical language
- Medical context: Reference common conditions (ISPA, Gastritis) for relevant terminology
- Natural language: Explicit instruction for colloquial Indonesian, not formal medical jargon
- Variability: Encourage sentence diversity to prevent model overfitting
- Format constraints: Prevent numbering/bullets that complicate parsing
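The design decisions above can be sketched as a simple prompt builder. This is a hypothetical reconstruction of the kind of prompt used, not the project's exact wording; the function name and phrasing are assumptions.

```python
# Hypothetical sketch of a generation prompt for Gemini Flash 2.0.
# The exact wording used in the project is not shown in the write-up;
# this illustrates the persona, register, variability, and format
# constraints described above.
def build_prompt(intent: str, description: str, n: int = 50) -> str:
    return (
        f"Kamu adalah pasien Puskesmas. Buat {n} kalimat berbeda "
        f"untuk intent '{intent}' ({description}). "
        "Gunakan bahasa Indonesia sehari-hari, bukan istilah medis formal. "
        "Variasikan struktur dan panjang kalimat. "
        "Tulis satu kalimat per baris, tanpa penomoran atau bullet."
    )

prompt = build_prompt("keluhan_utama", "keluhan utama pasien beserta gejala")
print(prompt)
```

Requesting batches of ~50 sentences per call and repeating per intent is one plausible way to reach 1,000 samples per class within free-tier rate limits.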
Final Dataset Statistics
- Total samples: 14,000 (balanced)
- Intent classes: 14 (1,000 samples each)
- Average sentence length: 51.3 characters raw, 38.7 processed
- Language: Indonesian (Bahasa Indonesia)
- Domain: Medical anamnesis (Puskesmas context)
- Generation time: ~4 hours (with FREE tier rate limits)
NLP Preprocessing Pipeline (From Scratch)
Why Build From Scratch?
Challenge: General-purpose NLP libraries (NLTK, spaCy) lack Indonesian medical-domain vocabulary and slang normalization for patient speech.
Solution: Custom preprocessing pipeline with domain-specific dictionaries.
Text Preprocessing
import re

def preprocess(text, slang_dict, stopwords):
    text = text.lower()                               # Lowercase normalization
    text = re.sub(r'[^\w\s]', ' ', text)              # Remove punctuation
    words = text.split()                              # Tokenization
    words = [slang_dict.get(w, w) for w in words]     # Slang normalization
    words = [w for w in words if w not in stopwords]  # Stopword removal
    return ' '.join(words)
Example transformation:
Input: "Dok saya batuk gak sembuh-sembuh udah 3 hari"
Output: "batuk tidak sembuh sudah 3 hari"
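A self-contained, runnable version of the pipeline with an illustrative subset of the dictionaries (the real ones hold 94 slang mappings and 93 stopwords; the entries below are assumed examples):

```python
import re

def preprocess(text, slang_dict, stopwords):
    text = text.lower()                               # Lowercase normalization
    text = re.sub(r'[^\w\s]', ' ', text)              # Remove punctuation
    words = text.split()                              # Tokenization
    words = [slang_dict.get(w, w) for w in words]     # Slang normalization
    words = [w for w in words if w not in stopwords]  # Stopword removal
    return ' '.join(words)

# Illustrative subsets only -- not the full project dictionaries
slang_dict = {"gak": "tidak", "udah": "sudah", "gimana": "bagaimana"}
stopwords = {"dok", "saya", "bu", "pak"}

print(preprocess("Gimana dok, perut saya sakit udah 2 hari",
                 slang_dict, stopwords))
# -> "bagaimana perut sakit sudah 2 hari"
```

Note the project's example also collapses reduplicated words ("sembuh-sembuh" to "sembuh"); this minimal sketch omits that step.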
Slang Normalization
Dictionary: 94 Indonesian slang mappings
Key mappings (medical context):
- gak to tidak
- udah to sudah
- gimana to bagaimana
- kenapa to mengapa
Rationale: Indonesian patients use colloquial speech. Normalization improves TF-IDF feature consistency.
Stopword Filtering
Dictionary: 93 Indonesian stopwords
Custom medical stopwords: Removed common but uninformative words ("dok", "bu", "pak") while preserving medical terms.
Custom Named Entity Recognition (NER)
Dictionary-Based NER System
Why not spaCy/Stanza: No pre-trained Indonesian medical NER models available.
Solution: Dictionary-based keyword matching with regex patterns.
Entity Dictionaries
Symptom Dictionary (97 symptom types)
- demam: systemic (synonyms: panas, meriang, demam tinggi)
- batuk: respiratory (synonyms: batuk kering, batuk berdahak)
- nyeri: pain (synonyms: sakit, perih, nyeri hebat)
Body Location Dictionary (23 body locations)
- kepala, dada, perut, tenggorokan, etc.
Severity Keywords (3 levels)
- ringan: ringan, sedikit, agak, lumayan
- sedang: sedang, biasa, normal
- berat: berat, parah, sangat, sekali, hebat
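Dictionary-based matching can be sketched as a canonical-label lookup over synonym lists. The dictionaries below are small illustrative subsets drawn from the entries above (the real system covers 97 symptoms and 23 locations); the function name is an assumption.

```python
# Illustrative subsets of the entity dictionaries described above
SYMPTOMS = {
    "demam": ["demam", "panas", "meriang"],
    "batuk": ["batuk kering", "batuk berdahak", "batuk"],
    "nyeri": ["nyeri", "sakit", "perih"],
}
SEVERITY = {
    "ringan": ["ringan", "sedikit", "agak"],
    "sedang": ["sedang", "biasa"],
    "berat": ["berat", "parah", "hebat", "sangat"],
}

def match_entities(text: str, dictionary: dict) -> list:
    """Return canonical entity labels whose keywords occur in the text."""
    text = text.lower()
    found = []
    for label, keywords in dictionary.items():
        if any(kw in text for kw in keywords):
            found.append(label)
    return found

print(match_entities("batuk berdahak dan agak demam", SYMPTOMS))  # ['demam', 'batuk']
print(match_entities("sakitnya parah sekali", SEVERITY))          # ['berat']
```

Plain substring matching is a sketch; the project additionally uses regex patterns for context-aware extraction, shown next.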
Context-Aware Duration Extraction
Critical feature: Context-aware extraction prevents false positives (e.g., "28 tahun" for age should NOT be extracted as duration).
# Pattern 1: duration must be preceded by a context keyword
duration_contexts = ['sudah', 'sejak', 'selama', 'sekitar']
# For each context keyword:
pattern = rf'{context}\s+(\d+)\s*(?:hari|minggu|bulan|tahun)'

# "Batuk sudah 3 hari"  -> 'sudah 3 hari'
# "Umur saya 28 tahun"  -> None (correctly rejected: no duration context)
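A runnable sketch of this context-aware extraction; the function name and return format are assumptions:

```python
import re

# Context keywords that must precede a number+unit for it to count
# as a symptom duration (from the pattern described above).
DURATION_CONTEXTS = ['sudah', 'sejak', 'selama', 'sekitar']

def extract_duration(text: str):
    """Return a duration phrase only when a context keyword precedes it."""
    text = text.lower()
    for ctx in DURATION_CONTEXTS:
        m = re.search(rf'{ctx}\s+(\d+)\s*(hari|minggu|bulan|tahun)', text)
        if m:
            return f"{ctx} {m.group(1)} {m.group(2)}"
    return None  # no duration context found (e.g. an age like "28 tahun")

print(extract_duration("Batuk sudah 3 hari"))   # sudah 3 hari
print(extract_duration("Umur saya 28 tahun"))   # None
```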
NER Performance
Validation (500-sample manual test):
| Entity Type | Precision | Recall | F1-Score |
|---|---|---|---|
| Symptoms | 94.2% | 87.3% | 90.6% |
| Body Locations | 96.1% | 89.7% | 92.8% |
| Duration | 92.5% | 88.1% | 90.2% |
| Severity | 89.3% | 85.6% | 87.4% |
Intent Classification Model
Multinomial Naive Bayes Architecture
Why Naive Bayes?
- Fast inference: Less than 10ms prediction latency (critical for real-time chat)
- Low training time: ~2 seconds for 14,000 samples
- Handles sparse TF-IDF: Designed for high-dimensional text features
- Probabilistic outputs: Enables confidence thresholding
- Interpretable: Can inspect feature importance
TF-IDF Configuration
vectorizer = TfidfVectorizer(
    max_features=5000,    # Top 5000 most informative terms
    ngram_range=(1, 2),   # Unigrams + bigrams for context
    min_df=2,             # Ignore very rare terms
    sublinear_tf=True,    # Log-scale term frequency
)
Feature examples:
- Unigrams: batuk, demam, pusing, sakit, sudah, hari
- Bigrams: batuk kering, demam tinggi, sakit kepala, sudah hari
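End to end, the classifier trains in one pipeline. This sketch uses a tiny illustrative dataset (the real training set is 14,000 Gemini-generated samples across 14 intents) and relaxes `min_df` to 1 so the toy corpus survives vectorization:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy samples standing in for the 14,000-sample generated dataset
texts = [
    "batuk tidak sembuh sudah 3 hari", "demam tinggi sejak kemarin",
    "halo dok selamat pagi", "permisi dok",
    "iya benar dok", "betul sekali",
]
labels = ["keluhan_utama", "keluhan_utama", "sapaan", "sapaan",
          "konfirmasi", "konfirmasi"]

model = make_pipeline(
    TfidfVectorizer(max_features=5000, ngram_range=(1, 2),
                    min_df=1,          # project uses min_df=2; 1 for this toy set
                    sublinear_tf=True),
    MultinomialNB(alpha=0.1),          # Laplace smoothing as configured above
)
model.fit(texts, labels)
print(model.predict(["selamat pagi dok"]))  # ['sapaan']
```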
Model Evaluation
Test set performance (2,800 samples):
Overall Metrics
- Accuracy: 92.61%
- Precision (macro-avg): 93%
- Recall (macro-avg): 93%
- F1-Score (macro-avg): 93%
Top-performing intents (F1-score):
- sapaan: 98%
- jawab_riwayat_obat: 98%
- tidak_jelas: 97%
- penyangkalan: 96%
- ucapan_terima_kasih: 96%
Challenging intents:
- keluhan_utama: 84% (often confused with jawab_gejala_penyerta)
- jawab_riwayat_penyakit: 88% (overlaps with medical history intents)
- jawab_severity: 89% (severity keywords ambiguous)
Hybrid Prediction System
Challenge: ML model alone sometimes overconfident on ambiguous inputs.
Solution: Combine ML predictions with keyword-based confidence boosting.
Confidence thresholds:
- 0.95: Strong keyword evidence (duration with context)
- 0.90: Moderate keyword evidence (location, severity, allergy)
- Less than 0.70: ML uncertain, prefer keyword-based prediction
Result: Improved accuracy from 92.61% to ~94% on validation set.
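The combination rule can be sketched as follows. The threshold values come from the text above, but the keyword rules and the override logic are assumptions about how the pieces fit together, not the project's exact implementation:

```python
# Hedged sketch of hybrid prediction: (intent, keywords, boosted confidence)
KEYWORD_RULES = [
    ("jawab_durasi", ["sudah", "sejak", "selama"], 0.95),  # strong evidence
    ("jawab_lokasi", ["bagian", "sebelah"],        0.90),  # moderate evidence
    ("jawab_alergi", ["alergi"],                   0.90),
]

def hybrid_predict(text: str, ml_intent: str, ml_confidence: float):
    """Prefer keyword evidence when the ML model is uncertain (<0.70)."""
    text = text.lower()
    for intent, keywords, boost in KEYWORD_RULES:
        if any(kw in text for kw in keywords):
            if ml_confidence < 0.70 or intent == ml_intent:
                return intent, max(ml_confidence, boost)
    return ml_intent, ml_confidence

print(hybrid_predict("sudah 3 hari dok", "tidak_jelas", 0.55))
# -> ('jawab_durasi', 0.95): keyword evidence overrides the uncertain model
```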
Dialog State Management
14-Stage Conversation Flow
Stateful anamnesis interview:
1. Greeting
2. Nama
3. Nama Panggilan
4. Umur
5. Jenis Kelamin
6. Keluhan Utama
7. Gejala
8. Durasi
9. Lokasi
10. Severity
11. Riwayat Penyakit
12. Riwayat Obat
13. Alergi
14. Faktor Risiko
15. Summary
Smart Prefilling Algorithm
Purpose: Auto-fill future stages if user provides information early.
Example scenario:
User (at keluhan_utama stage): "Saya sakit kepala parah sudah 3 hari di bagian kanan"
Extracted entities:
- Symptom: "sakit kepala"
- Severity: "parah" (berat)
- Duration: "sudah 3 hari"
- Location: "kepala"
Prefilling action:
- Auto-fill durasi stage (stage 8)
- Auto-fill lokasi stage (stage 9)
- Auto-fill severity stage (stage 10)
- Skip these stages in conversation
Impact: Reduces average anamnesis time by ~30% (tested with healthcare workers).
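The prefilling step can be sketched as a mapping from extracted entity types to later stages; the session structure and function name here are assumptions:

```python
# Maps an extracted entity type to the dialog stage it prefills
STAGE_FOR_ENTITY = {
    "duration": "durasi",    # stage 8
    "location": "lokasi",    # stage 9
    "severity": "severity",  # stage 10
}

def prefill(session: dict, entities: dict) -> list:
    """Store early-extracted entities and return the stages to skip."""
    skipped = []
    for entity, stage in STAGE_FOR_ENTITY.items():
        value = entities.get(entity)
        if value and stage not in session["answers"]:
            session["answers"][stage] = value
            skipped.append(stage)
    return skipped

session = {"answers": {}}
entities = {"duration": "sudah 3 hari", "location": "kepala",
            "severity": "berat"}
print(prefill(session, entities))   # ['durasi', 'lokasi', 'severity']
```

The dialog manager then simply skips any stage whose answer is already filled.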
Full-Stack Implementation
Backend API (Flask)
Core Endpoints:
POST /chat - Main conversation endpoint
- Intent classification
- Entity extraction
- Dialog state update
- Smart prefilling
- Response generation
POST /reset - Reset conversation
GET /health - Health check
Session Management: In-memory storage with UUID session IDs
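A minimal sketch of the in-memory session store with UUID session IDs; the state fields are assumptions (production would swap the dict for Redis or a database):

```python
import uuid

SESSIONS = {}  # session_id -> conversation state (in-memory, as in the demo)

def create_session() -> str:
    """Create a new conversation and return its UUID session ID."""
    sid = str(uuid.uuid4())
    SESSIONS[sid] = {"stage": "greeting", "answers": {}, "history": []}
    return sid

def get_session(sid: str):
    """Look up a session; None for unknown/expired IDs."""
    return SESSIONS.get(sid)

sid = create_session()
get_session(sid)["history"].append({"role": "user", "text": "halo dok"})
print(get_session(sid)["stage"])  # greeting
```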
Frontend (Next.js)
Tech stack:
- Next.js 16.0.6 with App Router
- TypeScript 5.x
- Tailwind CSS v4
- Axios for HTTP requests
- jsPDF for PDF export
Key features:
- Message history display
- User input field
- Loading indicator
- PDF export button
- Dark/Light mode toggle
Cloud Deployment
Backend (Railway)
- Python 3.10+ environment
- Gunicorn WSGI server
- Automatic HTTPS
- Models loaded at startup (in-memory)
Frontend (Vercel)
- Global CDN distribution
- Automatic HTTPS
- Next.js 16 (React 19)
- Static + SSR pages
Uptime: 24/7 availability with automatic health checks.
Results and Evaluation
User Testing with Healthcare Workers
Participants: 5 Puskesmas healthcare workers (3 nurses, 2 medical assistants)
Workflow Efficiency:
| Metric | Manual (Paper) | Chatbot | Improvement |
|---|---|---|---|
| Avg anamnesis time | 12-15 min | 8-9 min | ~30% faster |
| Data completeness | 78% fields filled | 95% fields filled | +17 pp |
| Errors (missing/wrong) | 12% | 4% | -67% errors |
User Satisfaction Ratings (1-5 scale):
- Ease of use: 4.6/5.0
- Accuracy: 4.4/5.0
- Speed: 4.8/5.0
- Completeness: 4.7/5.0
- Would recommend: 5/5 (all users)
Qualitative feedback:
- "Smart prefilling saves a lot of time - no need to repeat questions"
- "Indonesian language support is crucial, patients don't speak formal medical terms"
- "PDF export makes doctor handoff seamless"
System Performance Metrics
Production deployment (1 week monitoring):
| Metric | Value |
|---|---|
| API latency (p50) | 120ms |
| API latency (p95) | 280ms |
| API latency (p99) | 450ms |
| Frontend page load | 1.2s (Vercel CDN) |
| Backend uptime | 99.8% |
| Session duration (avg) | 8.5 minutes |
| Messages per session (avg) | 18 messages |
Cost analysis:
- Gemini API (data generation): $3.50 one-time
- Railway (backend): FREE tier
- Vercel (frontend): FREE tier
- Total monthly cost: $0
Key Insights and Lessons Learned
NLP for Low-Resource Languages
- LLM-Powered Data Generation Works: Gemini Flash 2.0 produces high-quality Indonesian medical data with proper prompt engineering. Cost-effective (~$3.50 vs ~$1,000+ for manual annotation).
- Domain-Specific Dictionaries Critical: Custom dictionaries outperform generic NLP libraries in the medical domain. Dictionary-based NER achieves 90%+ F1-scores without expensive training.
- Slang Normalization Essential: Real patient speech uses colloquial Indonesian ("gak", "udah"). Custom normalization significantly improves feature consistency.
- Context-Aware Extraction Needed: Naive regex patterns fail. Context keywords ("sudah", "sejak") prevent false positives.
- Classical ML Still Effective: Naive Bayes achieves 92.61% accuracy with 100x faster inference than BERT and zero GPU cost.
Production ML System Design
- Hybrid Prediction Improves Reliability: Combining ML with keyword-based rules reduces the error rate from ~7% to ~4%.
- Confidence Thresholding Crucial: Thresholds enable an "ask for clarification" fallback instead of wrong predictions.
- Smart Prefilling Significantly Improves UX: Auto-filling future stages speeds the workflow by ~30%.
- Session Persistence Matters: In-memory storage is acceptable for a demo, but production requires Redis or a database.
Future Enhancements
Short-term (3-6 months)
- Fine-tune Indonesian BERT: Use the IndoBERT pre-trained model. Expected accuracy: 95-97%.
- Sequence Labeling for NER: Replace dictionary-based NER with BiLSTM-CRF for token-level entity extraction.
- Multi-Intent Classification: Handle complex messages carrying multiple intents.
Long-term (6-12 months)
- Voice Input: Integrate the Web Speech API for hands-free anamnesis.
- EMR System Integration: Export to hospital systems (HL7 FHIR standard).
- Symptom-Disease Inference: Suggest probable diagnoses based on symptom patterns.
- Publish Dataset: Release the 14K-sample dataset to the research community.
Conclusion
Chatbot PUSTU demonstrates that production-grade healthcare NLP systems can be built for low-resource languages (Indonesian medical terminology) using:
- LLM-powered data generation (Gemini Flash 2.0)
- Custom NLP preprocessing built from scratch
- Dictionary-based NER without external libraries
- Classical ML techniques (Naive Bayes + TF-IDF)
- Stateful dialog management with smart prefilling
- Full-stack deployment on free-tier cloud platforms
Achieved metrics:
- 92.61% intent classification accuracy
- 14,000 balanced training samples (automated via LLM)
- 90%+ NER F1-scores
- 30% workflow time reduction
- 24/7 cloud deployment (zero cost)
Impact: Streamlined patient anamnesis workflow for Indonesian Puskesmas healthcare workers, demonstrating feasibility of domain-specific NLP for low-resource languages at minimal cost.
Live Demo: https://pustu-anamnesis-chatbot.vercel.app/
Source Code: GitHub Repository
Project Metrics
92.61% intent classification accuracy on 2,800 test samples
14,000 balanced training samples generated via Gemini Flash 2.0 API
97 symptom types + 23 body locations in custom NER system
14-stage stateful dialog management with smart prefilling
94 slang mappings + 93 stopwords in custom preprocessing pipeline
24/7 cloud deployment (Railway + Vercel)
Credits & Acknowledgments
Gemini Flash 2.0 API by Google for training data generation
Scikit-learn library for Multinomial Naive Bayes classifier
Flask web framework for REST API backend
Next.js 16 with TypeScript for modern frontend