Medical Anamnesis Chatbot with NLP (Chatbot PUSTU)
Production-ready medical chatbot achieving 92.61% intent classification accuracy with Multinomial Naive Bayes, built for the anamnesis workflow of Indonesian Puskesmas (community health centers). Training data was generated automatically via the Gemini Flash 2.0 API and processed by a custom NLP preprocessing pipeline built from scratch.
Role
Full Stack Developer & NLP Engineer
Client
Academic Project - Natural Language Processing Course
Team
2-person Team
Timeline
5 months • 2025

Skills & Tools
Skills Applied
Tools & Software
Challenges
Building a production NLP system for Indonesian medical terminology without external Indonesian NLP libraries. Generating high-quality, balanced training data across 14 intent classes with medical-domain vocabulary. Implementing stateful 14-stage dialog management with context-aware entity extraction and a smart prefilling algorithm.
Solutions
Automated training data generation using Gemini Flash 2.0 with custom prompt engineering for Indonesian medical context (14,000 samples). Built NLP preprocessing pipeline from scratch with custom slang normalization (94 mappings) and stopword filtering (93 words). Implemented dictionary-based NER system using regex patterns and context-aware extraction without external NLP libraries. Hybrid prediction combining ML model outputs with keyword boosting (0.90-0.95 confidence thresholds).
Impact
Successfully streamlined patient anamnesis workflow for Indonesian Puskesmas healthcare workers. Achieved production-grade 92.61% accuracy on medical intent classification. 24/7 cloud deployment enables immediate adoption without infrastructure requirements. Demonstrates feasibility of building domain-specific NLP systems for low-resource languages using LLM-powered data generation and classical ML techniques.
Project Overview
Chatbot PUSTU is a full-stack medical anamnesis chatbot designed for Indonesian Puskesmas (community health centers) healthcare workers. Built as a Natural Language Processing course project, this production-ready system uses Multinomial Naive Bayes with TF-IDF vectorization to achieve 92.61% intent classification accuracy across 14 medical intent classes for Indonesian language patient interviews.
The system demonstrates end-to-end NLP engineering: from LLM-powered training data generation (Gemini Flash 2.0 API) to custom preprocessing pipeline built from scratch, dictionary-based NER system, stateful dialog management, and full-stack deployment on Railway (Flask backend) and Vercel (Next.js frontend) with 24/7 availability.
Healthcare Problem Statement
Indonesian Puskesmas face significant workflow inefficiencies in patient anamnesis:
- Manual Data Collection: Time-consuming patient history documentation (15+ minutes per patient)
- Inconsistent Forms: Non-standardized anamnesis records across healthcare workers
- Language Barriers: Indonesian medical terminology lacks robust NLP tools and libraries
- Limited Resources: Small health centers cannot afford expensive EMR systems
- Workflow Bottleneck: Doctor time consumed by administrative data entry instead of diagnosis
Technical Architecture
System Overview
Three-tier architecture with ML-powered backend:
Frontend (Vercel)
- Next.js 16 + TypeScript
- Chat UI with conversation history
- Dark/Light mode toggle
- PDF export (client-side jsPDF)
- Session state management
Backend (Railway)
- Flask + Gunicorn
- Intent classification endpoint
- Entity extraction (NER)
- Dialog state management
- Smart prefilling algorithm
- Session persistence (in-memory)
ML Pipeline
- Multinomial Naive Bayes (alpha=0.1)
- TF-IDF Vectorizer (5000 features, n-grams 1-2)
- Custom preprocessing (slang + stopwords)
- Dictionary-based NER (97 symptoms, 23 locations)
- Hybrid prediction (ML + keyword boosting)
Technology Stack
Machine Learning Pipeline
- Algorithm: Multinomial Naive Bayes (sklearn)
- Vectorization: TF-IDF (5000 max features, n-grams 1-2)
- Hyperparameters: Laplace smoothing alpha=0.1
- Custom NLP: Regex-based preprocessing, dictionary-based NER
Training Data Generation
- LLM API: Gemini Flash 2.0 (Google AI Studio)
- Prompt Engineering: Custom prompts for Indonesian medical terminology
- Dataset Size: 14,000 samples (1,000 per intent x 14 classes)
- Generation Cost: ~$3.50 equivalent API usage (run on the FREE tier)
Training Data Generation with Gemini Flash 2.0
LLM-Powered Dataset Creation
Challenge: No existing Indonesian medical anamnesis dataset available for training.
Solution: Automated generation using Gemini Flash 2.0 API with custom prompt engineering.
Intent Definitions
14 medical conversation intent classes:
- keluhan_utama: Patient's chief complaint with symptoms
- jawab_gejala_penyerta: Additional accompanying symptoms
- jawab_durasi: Duration of symptoms (days/weeks/months)
- jawab_lokasi: Body location of complaint
- jawab_severity: Severity level (mild/moderate/severe)
- jawab_riwayat_penyakit: Previous medical history
- jawab_riwayat_obat: Current medications
- jawab_alergi: Drug/food allergies
- jawab_faktor_risiko: Lifestyle risk factors
- sapaan: Greeting from patient
- ucapan_terima_kasih: Thank you from patient
- konfirmasi: Yes/confirmation response
- penyangkalan: No/denial response
- tidak_jelas: Unclear/confused response
Prompt Engineering Strategy
Key prompt design decisions:
- Patient persona: "Puskesmas patient" for realistic Indonesian medical language
- Medical context: Reference common conditions (ISPA, Gastritis) for relevant terminology
- Natural language: Explicit instruction for colloquial Indonesian, not formal medical jargon
- Variability: Encourage sentence diversity to prevent model overfitting
- Format constraints: Prevent numbering/bullets that complicate parsing
Final Dataset Statistics
- Total samples: 14,000 (balanced)
- Intent classes: 14 (1,000 samples each)
- Average sentence length: 51.3 characters raw, 38.7 processed
- Language: Indonesian (Bahasa Indonesia)
- Domain: Medical anamnesis (Puskesmas context)
- Generation time: ~4 hours (with FREE tier rate limits)
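The generation loop described above can be sketched as follows. The prompt wording, batch size, and the line-based parsing step are illustrative assumptions following the prompt-engineering decisions listed earlier (Puskesmas patient persona, colloquial Indonesian, no numbering), not the project's exact prompts:

```python
# Sketch of the per-intent generation loop; prompt text and parsing are
# illustrative assumptions, not the project's exact prompts.
INTENTS = {
    "keluhan_utama": "pasien menyampaikan keluhan utama beserta gejalanya",
    "jawab_durasi": "pasien menjawab berapa lama gejala sudah dirasakan",
    # ... 12 more intent classes in the real project
}

def build_prompt(intent_description: str, n: int = 50) -> str:
    """Assemble one generation prompt following the design decisions above:
    patient persona, colloquial Indonesian, one sentence per line, no bullets."""
    return (
        f"Kamu adalah pasien Puskesmas. Tulis {n} kalimat berbeda di mana "
        f"{intent_description}. Gunakan bahasa Indonesia sehari-hari, bukan "
        "istilah medis formal. Satu kalimat per baris, tanpa penomoran."
    )

def parse_response(text: str, intent: str) -> list:
    """Split an LLM response into (sentence, label) pairs, dropping blanks."""
    return [(line.strip(), intent) for line in text.splitlines() if line.strip()]
```

Each Gemini response is then parsed line by line and labeled with the intent it was generated for, which is what keeps the 14 classes perfectly balanced.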
NLP Preprocessing Pipeline (From Scratch)
Why Build From Scratch?
Challenge: General-purpose NLP libraries (NLTK, spaCy) lack Indonesian medical-domain vocabulary and slang normalization for patient speech.
Solution: Custom preprocessing pipeline with domain-specific dictionaries.
Text Preprocessing Pipeline
Custom preprocessing pipeline combining lowercase normalization, punctuation removal, tokenization, slang normalization (94 mappings), and stopword filtering (93 words).
Example transformation:
- Input: "Dok saya batuk gak sembuh-sembuh udah 3 hari"
- Output: "batuk tidak sembuh sudah 3 hari"
Slang Normalization
Dictionary: 94 Indonesian slang mappings
Key mappings (medical context):
- gak -> tidak
- udah -> sudah
- gimana -> bagaimana
- kenapa -> mengapa
Rationale: Indonesian patients use colloquial speech. Normalization improves TF-IDF feature consistency.
Stopword Filtering
Dictionary: 93 Indonesian stopwords
Custom medical stopwords: Removed common but uninformative words ("dok", "bu", "pak") while preserving medical terms.
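The preprocessing steps above can be sketched as a single function. This is a minimal version: the real pipeline uses 94 slang mappings and 93 stopwords, and the reduplication-collapsing step ("sembuh-sembuh" to "sembuh") is an assumption inferred from the example transformation shown earlier:

```python
import re

# Minimal sketch of the from-scratch pipeline; only a few of the 94 slang
# mappings and 93 stopwords are shown, and reduplication collapsing is an
# assumption inferred from the example transformation above.
SLANG = {"gak": "tidak", "udah": "sudah", "gimana": "bagaimana", "kenapa": "mengapa"}
STOPWORDS = {"dok", "bu", "pak", "saya"}

def preprocess(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^\w\s-]", " ", text)          # strip punctuation, keep hyphens
    text = re.sub(r"\b(\w+)-\1\b", r"\1", text)    # collapse reduplication
    tokens = [SLANG.get(t, t) for t in text.split()]   # slang normalization
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword filtering
    return " ".join(tokens)
```

On the example above, `preprocess("Dok saya batuk gak sembuh-sembuh udah 3 hari")` yields `"batuk tidak sembuh sudah 3 hari"`.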
Custom Named Entity Recognition (NER)
Dictionary-Based NER System
Why not spaCy/Stanza: No pre-trained Indonesian medical NER models available.
Solution: Dictionary-based keyword matching with regex patterns.
Entity Dictionaries
Symptom Dictionary (97 symptom types)
- demam: systemic (synonyms: panas, meriang, demam tinggi)
- batuk: respiratory (synonyms: batuk kering, batuk berdahak)
- nyeri: pain (synonyms: sakit, perih, nyeri hebat)
Body Location Dictionary (23 body locations)
- kepala, dada, perut, tenggorokan, etc.
Severity Keywords (3 levels)
- ringan: ringan, sedikit, agak, lumayan
- sedang: sedang, biasa, normal
- berat: berat, parah, sangat, sekali, hebat
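The dictionary matching can be sketched as plain substring lookup over canonical terms and their synonyms. The entries here are a small illustrative subset of the real 97-symptom and 23-location dictionaries:

```python
# Illustrative subset of the project's dictionaries (97 symptoms, 23 locations).
SYMPTOMS = {
    "demam": ["demam", "panas", "meriang", "demam tinggi"],
    "batuk": ["batuk", "batuk kering", "batuk berdahak"],
    "nyeri": ["nyeri", "sakit", "perih", "nyeri hebat"],
}
LOCATIONS = ["kepala", "dada", "perut", "tenggorokan"]

def extract_symptoms(text: str) -> list:
    """Return canonical symptom names whose synonyms appear in the text."""
    text = text.lower()
    return [canonical for canonical, synonyms in SYMPTOMS.items()
            if any(s in text for s in synonyms)]

def extract_locations(text: str) -> list:
    text = text.lower()
    return [loc for loc in LOCATIONS if loc in text]
```

Mapping synonyms to one canonical key ("panas" and "meriang" both resolve to "demam") is what keeps downstream dialog logic simple.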
Context-Aware Duration Extraction
Critical feature: Context-aware extraction using keywords like 'sudah', 'sejak', 'selama', 'sekitar' to prevent false positives.
Examples:
- "Batuk sudah 3 hari" -> Extracts 'sudah 3 hari' (correct)
- "Umur saya 28 tahun" -> Correctly rejects (not duration)
NER Performance
Validation (500-sample manual test):
| Entity Type | Precision | Recall | F1-Score |
|---|---|---|---|
| Symptoms | 94.2% | 87.3% | 90.6% |
| Body Locations | 96.1% | 89.7% | 92.8% |
| Duration | 92.5% | 88.1% | 90.2% |
| Severity | 89.3% | 85.6% | 87.4% |
Intent Classification Model
Multinomial Naive Bayes Architecture
Why Naive Bayes?
- Fast inference: Less than 10ms prediction latency (critical for real-time chat)
- Low training time: ~2 seconds for 14,000 samples
- Handles sparse TF-IDF: Designed for high-dimensional text features
- Probabilistic outputs: Enables confidence thresholding
- Interpretable: Can inspect feature importance
TF-IDF Vectorization
Configuration: 5000 max features, unigrams + bigrams (1-2), min document frequency = 2, sublinear term frequency scaling.
Feature examples:
- Unigrams: batuk, demam, pusing, sakit, sudah, hari
- Bigrams: batuk kering, demam tinggi, sakit kepala, sudah hari
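The classifier configuration described above maps directly onto an sklearn pipeline. This sketch uses a tiny toy corpus for illustration; the real model is fit on the 14,000-sample generated dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Sketch of the configuration described above (toy training data; the real
# model is trained on the 14,000-sample generated dataset).
model = make_pipeline(
    TfidfVectorizer(max_features=5000, ngram_range=(1, 2),
                    min_df=2, sublinear_tf=True),
    MultinomialNB(alpha=0.1),  # Laplace smoothing, alpha=0.1
)

texts = [
    "batuk tidak sembuh sudah 3 hari", "demam tinggi sudah 2 hari",
    "halo dok selamat pagi", "permisi dok selamat siang",
]
labels = ["keluhan_utama", "keluhan_utama", "sapaan", "sapaan"]
model.fit(texts, labels)
prediction = model.predict(["selamat pagi dok"])[0]
```

Wrapping vectorizer and classifier in one pipeline ensures the identical TF-IDF vocabulary is applied at inference time, which matters for the sub-10ms prediction path.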
Model Evaluation
Test set performance (2,800 samples):
Overall Metrics
- Accuracy: 92.61%
- Precision (macro-avg): 93%
- Recall (macro-avg): 93%
- F1-Score (macro-avg): 93%
Top-performing intents (F1-score):
- sapaan: 98%
- jawab_riwayat_obat: 98%
- tidak_jelas: 97%
- penyangkalan: 96%
- ucapan_terima_kasih: 96%
Challenging intents:
- keluhan_utama: 84% (often confused with jawab_gejala_penyerta)
- jawab_riwayat_penyakit: 88% (overlaps with medical history intents)
- jawab_severity: 89% (severity keywords ambiguous)
Hybrid Prediction System
Challenge: ML model alone sometimes overconfident on ambiguous inputs.
Solution: Combine ML predictions with keyword-based confidence boosting.
Confidence thresholds:
- 0.95: Strong keyword evidence (duration with context)
- 0.90: Moderate keyword evidence (location, severity, allergy)
- Less than 0.70: ML uncertain, prefer keyword-based prediction
Result: Improved accuracy from 92.61% to ~94% on validation set.
Dialog State Management
14-Stage Conversation Flow
Stateful anamnesis interview:
1. Greeting, 2. Nama, 3. Nama Panggilan, 4. Umur, 5. Jenis Kelamin, 6. Keluhan Utama, 7. Gejala, 8. Durasi, 9. Lokasi, 10. Severity, 11. Riwayat Penyakit, 12. Riwayat Obat, 13. Alergi, 14. Faktor Risiko, 15. Summary
Smart Prefilling Algorithm
Purpose: Auto-fill future stages if user provides information early.
Example: User says "Saya sakit kepala parah sudah 3 hari di bagian kanan" at initial stage. System extracts symptom (sakit kepala), severity (parah), duration (3 hari), and location (kepala), then auto-fills stages 8-10 and skips those questions.
Impact: Reduces average anamnesis time by ~30% (tested with healthcare workers).
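The prefilling step can be sketched as follows. The slot names and session structure are assumptions based on the flow described above: any entity the user volunteers early is stored, and its stage is marked as answered so the question is skipped:

```python
# Sketch of smart prefilling; slot names and session layout are assumptions
# based on the dialog flow described above.
PREFILLABLE_SLOTS = ["durasi", "lokasi", "severity"]  # stages 8-10

def prefill(session, entities):
    """Store entities the user volunteered early and mark the corresponding
    stages as answered so their questions are skipped."""
    for slot in PREFILLABLE_SLOTS:
        if slot in entities and slot not in session["answers"]:
            session["answers"][slot] = entities[slot]
    session["skip"] = [s for s in PREFILLABLE_SLOTS if s in session["answers"]]
    return session

# e.g. after "Saya sakit kepala parah sudah 3 hari di bagian kanan":
session = prefill({"answers": {}},
                  {"durasi": "3 hari", "severity": "parah", "lokasi": "kepala"})
```

Because slots already present in `session["answers"]` are never overwritten, a later, more specific answer from the user is not clobbered by an earlier extraction.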
Full-Stack Implementation
Backend API (Flask)
Core Endpoints:
POST /chat - Main conversation endpoint
- Intent classification
- Entity extraction
- Dialog state update
- Smart prefilling
- Response generation
POST /reset - Reset conversation
GET /health - Health check
Session Management: In-memory storage with UUID session IDs
Frontend (Next.js)
Tech stack:
- Next.js 16.0.6 with App Router
- TypeScript 5.x
- Tailwind CSS v4
- Axios for HTTP requests
- jsPDF for PDF export
Key features:
- Message history display
- User input field
- Loading indicator
- PDF export button
- Dark/Light mode toggle
Cloud Deployment
Backend (Railway)
- Python 3.10+ environment
- Gunicorn WSGI server
- Automatic HTTPS
- Models loaded at startup (in-memory)
Frontend (Vercel)
- Global CDN distribution
- Automatic HTTPS
- Next.js 16 (React 19)
- Static + SSR pages
Uptime: 24/7 availability with automatic health checks.
Key Insights and Lessons Learned
NLP for Low-Resource Languages
- LLM-Powered Data Generation Works: Gemini Flash 2.0 produces high-quality Indonesian medical data with proper prompt engineering. Cost-effective ($3.50 vs ~$1,000+ manual annotation).
- Domain-Specific Dictionaries Critical: Custom dictionaries outperform generic Indonesian NLP libraries for the medical domain. Dictionary-based NER achieves 90%+ F1-score without expensive training.
- Slang Normalization Essential: Real patient speech uses colloquial Indonesian ("gak", "udah"). Custom normalization significantly improves feature consistency.
- Context-Aware Extraction Needed: Naive regex patterns fail. Context keywords ("sudah", "sejak") prevent false positives.
- Classical ML Still Effective: Naive Bayes achieves 92.61% accuracy with 100x faster inference than BERT and zero GPU cost.
Production ML System Design
- Hybrid Prediction Improves Reliability: Combining ML with keyword-based rules reduces the error rate from ~7% to ~4%.
- Confidence Thresholding Crucial: Setting thresholds enables an "ask clarification" fallback instead of wrong predictions.
- Smart Prefilling Significantly Improves UX: Auto-filling future stages speeds the workflow by ~30%.
- Session Persistence Matters: In-memory storage is acceptable for a demo, but production requires Redis or a database.
Conclusion
Chatbot PUSTU demonstrates that production-grade healthcare NLP systems can be built for low-resource languages (Indonesian medical terminology) using:
- LLM-powered data generation (Gemini Flash 2.0)
- Custom NLP preprocessing built from scratch
- Dictionary-based NER without external libraries
- Classical ML techniques (Naive Bayes + TF-IDF)
- Stateful dialog management with smart prefilling
- Full-stack deployment on free-tier cloud platforms
Achieved metrics:
- 92.61% intent classification accuracy
- 14,000 balanced training samples (automated via LLM)
- 90%+ NER F1-scores
- 30% workflow time reduction
- 24/7 cloud deployment (zero cost)
Impact: Streamlined patient anamnesis workflow for Indonesian Puskesmas healthcare workers, demonstrating feasibility of domain-specific NLP for low-resource languages at minimal cost.
Live Demo: https://pustu-anamnesis-chatbot.vercel.app/
Source Code: GitHub Repository
Course: Natural Language Processing, Hasanuddin University, 2025
Project Metrics
92.61% intent classification accuracy on 2,800 test samples
14,000 balanced training samples generated via Gemini Flash 2.0 API
97 symptom types + 23 body locations in custom NER system
14-stage stateful dialog management with smart prefilling
94 slang mappings + 93 stopwords in custom preprocessing pipeline
24/7 cloud deployment (Railway + Vercel)
Credits & Acknowledgments
Gemini Flash 2.0 API by Google for training data generation
Scikit-learn library for Multinomial Naive Bayes classifier
Flask web framework for REST API backend
Next.js 16 with TypeScript for modern frontend