Healthcare NLP Application • Project • 2025 • 5 months

Medical Anamnesis Chatbot with NLP (Chatbot PUSTU)

A production-ready medical chatbot achieving 92.61% intent classification accuracy with Multinomial Naive Bayes, built for the anamnesis workflow of Indonesian Puskesmas (community health centers). Training data was generated automatically via the Gemini Flash 2.0 API, and the NLP preprocessing pipeline was built from scratch.

Role

Full Stack Developer & NLP Engineer

Client

Academic Project - Natural Language Processing Course

Team

2-person team

Timeline

5 months • 2025

Medical Anamnesis Chatbot with NLP (Chatbot PUSTU) — project cover

Skills & Tools

Skills Applied

Natural Language Processing, Machine Learning, Full Stack Development, Dialog State Management, Cloud Deployment

Tools & Software

Python, Scikit-learn, Flask, Next.js, TypeScript, Gemini Flash 2.0 API, TF-IDF, Git, Railway, Vercel

Challenges

Building a production NLP system for Indonesian medical terminology without external Indonesian NLP libraries; generating high-quality, balanced training data across 14 intent classes with medical domain vocabulary; and implementing stateful 14-stage dialog management with context-aware entity extraction and a smart prefilling algorithm.

Solutions

Automated training data generation using Gemini Flash 2.0 with custom prompt engineering for the Indonesian medical context (14,000 samples). Built an NLP preprocessing pipeline from scratch with custom slang normalization (94 mappings) and stopword filtering (93 words). Implemented a dictionary-based NER system using regex patterns and context-aware extraction, without external NLP libraries. Added hybrid prediction combining ML model outputs with keyword boosting (0.90-0.95 confidence thresholds).

Impact

Successfully streamlined patient anamnesis workflow for Indonesian Puskesmas healthcare workers. Achieved production-grade 92.61% accuracy on medical intent classification. 24/7 cloud deployment enables immediate adoption without infrastructure requirements. Demonstrates feasibility of building domain-specific NLP systems for low-resource languages using LLM-powered data generation and classical ML techniques.

Project Overview

Chatbot PUSTU is a full-stack medical anamnesis chatbot designed for Indonesian Puskesmas (community health centers) healthcare workers. Built as a Natural Language Processing course project, this production-ready system uses Multinomial Naive Bayes with TF-IDF vectorization to achieve 92.61% intent classification accuracy across 14 medical intent classes for Indonesian language patient interviews.

The system demonstrates end-to-end NLP engineering: from LLM-powered training data generation (Gemini Flash 2.0 API) to custom preprocessing pipeline built from scratch, dictionary-based NER system, stateful dialog management, and full-stack deployment on Railway (Flask backend) and Vercel (Next.js frontend) with 24/7 availability.

Healthcare Problem Statement

Indonesian Puskesmas face significant workflow inefficiencies in patient anamnesis:

  • Manual Data Collection: Time-consuming patient history documentation (15+ minutes per patient)
  • Inconsistent Forms: Non-standardized anamnesis records across healthcare workers
  • Language Barriers: Indonesian medical terminology lacks robust NLP tools and libraries
  • Limited Resources: Small health centers cannot afford expensive EMR systems
  • Workflow Bottleneck: Doctor time consumed by administrative data entry instead of diagnosis

Technical Architecture

System Overview

Three-tier architecture with ML-powered backend:

Frontend (Vercel)

  • Next.js 16 + TypeScript
  • Chat UI with conversation history
  • Dark/Light mode toggle
  • PDF export (client-side jsPDF)
  • Session state management

Backend (Railway)

  • Flask + Gunicorn
  • Intent classification endpoint
  • Entity extraction (NER)
  • Dialog state management
  • Smart prefilling algorithm
  • Session persistence (in-memory)

ML Pipeline

  • Multinomial Naive Bayes (alpha=0.1)
  • TF-IDF Vectorizer (5000 features, n-grams 1-2)
  • Custom preprocessing (slang + stopwords)
  • Dictionary-based NER (97 symptoms, 23 locations)
  • Hybrid prediction (ML + keyword boosting)

Technology Stack

Machine Learning Pipeline

  • Algorithm: Multinomial Naive Bayes (sklearn)
  • Vectorization: TF-IDF (5000 max features, n-grams 1-2)
  • Hyperparameters: Laplace smoothing alpha=0.1
  • Custom NLP: Regex-based preprocessing, dictionary-based NER

Training Data Generation

  • LLM API: Gemini Flash 2.0 (Google AI Studio)
  • Prompt Engineering: Custom prompts for Indonesian medical terminology
  • Dataset Size: 14,000 samples (1,000 per intent × 14 classes)
  • Generation Cost: ~$3.50 total (FREE tier)

Training Data Generation with Gemini Flash 2.0

LLM-Powered Dataset Creation

Challenge: No existing Indonesian medical anamnesis dataset available for training.

Solution: Automated generation using Gemini Flash 2.0 API with custom prompt engineering.

Intent Definitions

14 medical conversation intent classes:

  • keluhan_utama: Patient's chief complaint with symptoms
  • jawab_gejala_penyerta: Additional accompanying symptoms
  • jawab_durasi: Duration of symptoms (days/weeks/months)
  • jawab_lokasi: Body location of complaint
  • jawab_severity: Severity level (mild/moderate/severe)
  • jawab_riwayat_penyakit: Previous medical history
  • jawab_riwayat_obat: Current medications
  • jawab_alergi: Drug/food allergies
  • jawab_faktor_risiko: Lifestyle risk factors
  • sapaan: Greeting from patient
  • ucapan_terima_kasih: Thank you from patient
  • konfirmasi: Yes/confirmation response
  • penyangkalan: No/denial response
  • tidak_jelas: Unclear/confused response

Prompt Engineering Strategy

Key prompt design decisions:

  • Patient persona: "Puskesmas patient" for realistic Indonesian medical language
  • Medical context: Reference common conditions (ISPA, Gastritis) for relevant terminology
  • Natural language: Explicit instruction for colloquial Indonesian, not formal medical jargon
  • Variability: Encourage sentence diversity to prevent model overfitting
  • Format constraints: Prevent numbering/bullets that complicate parsing
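The strategy above can be sketched as a per-intent prompt template. The exact wording sent to Gemini is not reproduced in this write-up, so the helper below (`build_prompt`) and its Indonesian phrasing are illustrative assumptions, not the project's actual prompts:

```python
def build_prompt(intent: str, description: str, n: int = 50) -> str:
    """Build a generation prompt for one intent class (illustrative sketch)."""
    return (
        f"Kamu adalah pasien Puskesmas di Indonesia.\n"
        f"Tuliskan {n} kalimat berbeda yang mewakili intent '{intent}' "
        f"({description}).\n"
        "Gunakan bahasa Indonesia sehari-hari, bukan istilah medis formal.\n"
        "Variasikan struktur kalimat. Satu kalimat per baris, "
        "tanpa penomoran atau bullet."
    )

prompt = build_prompt("keluhan_utama", "keluhan utama pasien beserta gejala")
print(prompt)
```

Each of the 14 intents would get its own call, with the responses split per line and deduplicated before being added to the dataset.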

Final Dataset Statistics

  • Total samples: 14,000 (balanced)
  • Intent classes: 14 (1,000 samples each)
  • Average sentence length: 51.3 characters raw, 38.7 processed
  • Language: Indonesian (Bahasa Indonesia)
  • Domain: Medical anamnesis (Puskesmas context)
  • Generation time: ~4 hours (with FREE tier rate limits)

NLP Preprocessing Pipeline (From Scratch)

Why Build From Scratch?

Challenge: Existing Indonesian NLP libraries (NLTK, spaCy) lack medical domain vocabulary and Indonesian slang normalization for patient speech.

Solution: Custom preprocessing pipeline with domain-specific dictionaries.

Text Preprocessing

import re

def preprocess(text, slang_dict, stopwords):
    text = text.lower()                               # Lowercase normalization
    text = re.sub(r'[^\w\s]', ' ', text)              # Remove punctuation
    words = text.split()                              # Tokenization
    words = [slang_dict.get(w, w) for w in words]     # Slang normalization
    words = [w for w in words if w not in stopwords]  # Stopword removal
    return ' '.join(words)

Example transformation:

Input:  "Dok saya batuk gak sembuh-sembuh udah 3 hari"
Output: "batuk tidak sembuh sudah 3 hari"
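The pipeline can be exercised end to end with toy dictionaries (redefined here so the snippet stands alone; the real project ships 94 slang mappings and 93 stopwords):

```python
import re

def preprocess(text, slang_dict, stopwords):
    text = text.lower()                               # lowercase normalization
    text = re.sub(r'[^\w\s]', ' ', text)              # remove punctuation
    words = text.split()                              # tokenization
    words = [slang_dict.get(w, w) for w in words]     # slang normalization
    words = [w for w in words if w not in stopwords]  # stopword removal
    return ' '.join(words)

# Toy slices of the project's dictionaries
slang = {'gak': 'tidak', 'udah': 'sudah'}
stop = {'dok', 'saya'}

print(preprocess("Dok saya demam udah 2 hari", slang, stop))
# -> "demam sudah 2 hari"
```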

Slang Normalization

Dictionary: 94 Indonesian slang mappings

Key mappings (medical context):

  • gak → tidak
  • udah → sudah
  • gimana → bagaimana
  • kenapa → mengapa

Rationale: Indonesian patients use colloquial speech. Normalization improves TF-IDF feature consistency.

Stopword Filtering

Dictionary: 93 Indonesian stopwords

Custom medical stopwords: Removed common but uninformative words ("dok", "bu", "pak") while preserving medical terms.

Custom Named Entity Recognition (NER)

Dictionary-Based NER System

Why not spaCy/Stanza: No pre-trained Indonesian medical NER models available.

Solution: Dictionary-based keyword matching with regex patterns.

Entity Dictionaries

Symptom Dictionary (97 symptom types)

  • demam: systemic (synonyms: panas, meriang, demam tinggi)
  • batuk: respiratory (synonyms: batuk kering, batuk berdahak)
  • nyeri: pain (synonyms: sakit, perih, nyeri hebat)

Body Location Dictionary (23 body locations)

  • kepala, dada, perut, tenggorokan, etc.

Severity Keywords (3 levels)

  • ringan: ringan, sedikit, agak, lumayan
  • sedang: sedang, biasa, normal
  • berat: berat, parah, sangat, sekali, hebat
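A minimal sketch of the dictionary matcher, using toy slices of the dictionaries above (the variable names and helper function are illustrative, not the project's actual identifiers):

```python
import re

# Toy slices of the dictionaries (97 symptoms / 23 locations in the project)
SYMPTOMS = {
    'demam': ['demam', 'panas', 'meriang'],
    'batuk': ['batuk', 'batuk kering', 'batuk berdahak'],
}
LOCATIONS = ['kepala', 'dada', 'perut', 'tenggorokan']

def extract_entities(text):
    """Map synonyms back to canonical symptom keys via word-boundary regex."""
    text = text.lower()
    found = {'symptoms': [], 'locations': []}
    for canonical, synonyms in SYMPTOMS.items():
        # Try longest synonyms first so 'batuk kering' wins over 'batuk'
        for syn in sorted(synonyms, key=len, reverse=True):
            if re.search(rf'\b{re.escape(syn)}\b', text):
                found['symptoms'].append(canonical)
                break
    for loc in LOCATIONS:
        if re.search(rf'\b{re.escape(loc)}\b', text):
            found['locations'].append(loc)
    return found

print(extract_entities("Saya batuk kering dan panas, dada sakit"))
# -> {'symptoms': ['demam', 'batuk'], 'locations': ['dada']}
```

Word-boundary anchors (`\b`) keep substrings such as "dada" inside longer words from matching accidentally.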

Context-Aware Duration Extraction

Critical feature: Context-aware extraction prevents false positives (e.g., "28 tahun" for age should NOT be extracted as duration).

import re

# Pattern 1: duration preceded by a context keyword
duration_contexts = ['sudah', 'sejak', 'selama', 'sekitar']
pattern = rf"(?:{'|'.join(duration_contexts)})\s+(\d+)\s*(?:hari|minggu|bulan|tahun)"

# "Batuk sudah 3 hari"  -> extracts "sudah 3 hari"
# "Umur saya 28 tahun"  -> no match (age is correctly rejected)

NER Performance

Validation (500-sample manual test):

| Entity Type    | Precision | Recall | F1-Score |
|----------------|-----------|--------|----------|
| Symptoms       | 94.2%     | 87.3%  | 90.6%    |
| Body Locations | 96.1%     | 89.7%  | 92.8%    |
| Duration       | 92.5%     | 88.1%  | 90.2%    |
| Severity       | 89.3%     | 85.6%  | 87.4%    |

Intent Classification Model

Multinomial Naive Bayes Architecture

Why Naive Bayes?

  1. Fast inference: Less than 10ms prediction latency (critical for real-time chat)
  2. Low training time: ~2 seconds for 14,000 samples
  3. Handles sparse TF-IDF: Designed for high-dimensional text features
  4. Probabilistic outputs: Enables confidence thresholding
  5. Interpretable: Can inspect feature importance

TF-IDF Configuration

vectorizer = TfidfVectorizer(
    max_features=5000,        # Top 5000 most informative terms
    ngram_range=(1, 2),       # Unigrams + bigrams for context
    min_df=2,                 # Ignore very rare terms
    sublinear_tf=True         # Log-scale term frequency
)

Feature examples:

  • Unigrams: batuk, demam, pusing, sakit, sudah, hari
  • Bigrams: batuk kering, demam tinggi, sakit kepala, sudah hari
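The configuration above slots into a standard scikit-learn pipeline. The snippet below is a toy sketch: six hand-written samples stand in for the 14,000 generated ones, and `min_df` is lowered to 1 so the tiny corpus is not filtered away (production uses `min_df=2`):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training set; the real model is fit on 14,000 generated samples
texts = ["halo dok", "selamat pagi dok", "permisi dok halo",
         "terima kasih dok", "makasih banyak ya", "terima kasih banyak"]
labels = ["sapaan", "sapaan", "sapaan",
          "ucapan_terima_kasih", "ucapan_terima_kasih", "ucapan_terima_kasih"]

model = make_pipeline(
    TfidfVectorizer(max_features=5000, ngram_range=(1, 2),
                    min_df=1,            # min_df=2 in production
                    sublinear_tf=True),
    MultinomialNB(alpha=0.1),
)
model.fit(texts, labels)

print(model.predict(["halo dok"])[0])  # expected: 'sapaan'
```

`predict_proba` on the same pipeline supplies the per-class confidences that the hybrid prediction layer thresholds against.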

Model Evaluation

Test set performance (2,800 samples):

Overall Metrics

  • Accuracy: 92.61%
  • Precision (macro-avg): 93%
  • Recall (macro-avg): 93%
  • F1-Score (macro-avg): 93%

Top-performing intents (F1-score):

  • sapaan: 98%
  • jawab_riwayat_obat: 98%
  • tidak_jelas: 97%
  • penyangkalan: 96%
  • ucapan_terima_kasih: 96%

Challenging intents:

  • keluhan_utama: 84% (often confused with jawab_gejala_penyerta)
  • jawab_riwayat_penyakit: 88% (overlaps with medical history intents)
  • jawab_severity: 89% (severity keywords ambiguous)

Hybrid Prediction System

Challenge: ML model alone sometimes overconfident on ambiguous inputs.

Solution: Combine ML predictions with keyword-based confidence boosting.

Confidence thresholds:

  • 0.95: Strong keyword evidence (duration with context)
  • 0.90: Moderate keyword evidence (location, severity, allergy)
  • Less than 0.70: ML uncertain, prefer keyword-based prediction

Result: Improved accuracy from 92.61% to ~94% on validation set.
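One way to sketch the hybrid rule, with the threshold values from above and an assumed (illustrative) keyword table:

```python
def hybrid_predict(ml_intent, ml_confidence, text):
    """Combine classifier output with keyword evidence (illustrative rules)."""
    keyword_rules = {
        'jawab_durasi': (0.95, ('sudah', 'sejak', 'selama')),
        'jawab_lokasi': (0.90, ('kepala', 'dada', 'perut')),
        'jawab_alergi': (0.90, ('alergi',)),
    }
    lowered = text.lower()
    for intent, (boost, keywords) in keyword_rules.items():
        if any(k in lowered for k in keywords):
            # Keyword evidence overrides an uncertain model (<0.70) or
            # boosts the confidence of a matching prediction
            if ml_confidence < 0.70 or intent == ml_intent:
                return intent, max(ml_confidence, boost)
    return ml_intent, ml_confidence

print(hybrid_predict('tidak_jelas', 0.55, 'sudah 3 hari dok'))
# -> ('jawab_durasi', 0.95): keyword evidence wins over the uncertain model
```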

Dialog State Management

14-Stage Conversation Flow

Stateful anamnesis interview:

  1. Greeting
  2. Nama
  3. Nama Panggilan
  4. Umur
  5. Jenis Kelamin
  6. Keluhan Utama
  7. Gejala
  8. Durasi
  9. Lokasi
  10. Severity
  11. Riwayat Penyakit
  12. Riwayat Obat
  13. Alergi
  14. Faktor Risiko
  15. Summary

Smart Prefilling Algorithm

Purpose: Auto-fill future stages if user provides information early.

Example scenario:

User (at keluhan_utama stage): "Saya sakit kepala parah sudah 3 hari di bagian kanan"

Extracted entities:
- Symptom: "sakit kepala"
- Severity: "parah" (berat)
- Duration: "sudah 3 hari"
- Location: "kepala"

Prefilling action:
- Auto-fill durasi stage (stage 8)
- Auto-fill lokasi stage (stage 9)
- Auto-fill severity stage (stage 10)
- Skip these stages in conversation

Impact: Reduces average anamnesis time by ~30% (tested with healthcare workers).
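A minimal sketch of the prefilling idea, with hypothetical stage and slot names (the real state machine covers all 14 stages):

```python
STAGES = ['keluhan_utama', 'gejala', 'durasi', 'lokasi', 'severity']

def prefill(answers, entities):
    """Copy entities mentioned early into their future stage slots."""
    mapping = {'duration': 'durasi', 'location': 'lokasi',
               'severity': 'severity'}
    for entity, stage in mapping.items():
        if entities.get(entity) and not answers.get(stage):
            answers[stage] = entities[entity]
    return answers

def next_stage(answers):
    """First unanswered stage; prefilled stages are skipped automatically."""
    for stage in STAGES:
        if not answers.get(stage):
            return stage
    return None

answers = prefill(
    {'keluhan_utama': 'sakit kepala'},
    {'duration': 'sudah 3 hari', 'location': 'kepala', 'severity': 'berat'},
)
print(next_stage(answers))  # 'gejala' (durasi, lokasi, severity were prefilled)
```

Because `next_stage` simply walks past filled slots, "skipping" falls out of the data rather than needing explicit branch logic.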

Full-Stack Implementation

Backend API (Flask)

Core Endpoints:

POST /chat - Main conversation endpoint

  • Intent classification
  • Entity extraction
  • Dialog state update
  • Smart prefilling
  • Response generation

POST /reset - Reset conversation

GET /health - Health check

Session Management: In-memory storage with UUID session IDs
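A skeletal version of the Flask surface described above. Response fields and handler internals are illustrative; the real classification, NER, and prefilling steps are elided to a comment:

```python
from uuid import uuid4
from flask import Flask, jsonify, request

app = Flask(__name__)
SESSIONS = {}  # in-memory, keyed by UUID (production would use Redis)

@app.route('/chat', methods=['POST'])
def chat():
    data = request.get_json()
    sid = data.get('session_id') or str(uuid4())
    state = SESSIONS.setdefault(sid, {'stage': 'greeting', 'answers': {}})
    # ... intent classification, entity extraction, prefilling happen here ...
    return jsonify({'session_id': sid, 'stage': state['stage'],
                    'reply': 'Halo! Boleh saya tahu nama Anda?'})

@app.route('/health', methods=['GET'])
def health():
    return jsonify({'status': 'ok'})
```

Flask's built-in test client can drive the endpoints without a running server, e.g. `app.test_client().post('/chat', json={'message': 'halo'})`.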

Frontend (Next.js)

Tech stack:

  • Next.js 16.0.6 with App Router
  • TypeScript 5.x
  • Tailwind CSS v4
  • Axios for HTTP requests
  • jsPDF for PDF export

Key features:

  • Message history display
  • User input field
  • Loading indicator
  • PDF export button
  • Dark/Light mode toggle

Cloud Deployment

Backend (Railway)

  • Python 3.10+ environment
  • Gunicorn WSGI server
  • Automatic HTTPS
  • Models loaded at startup (in-memory)

Frontend (Vercel)

  • Global CDN distribution
  • Automatic HTTPS
  • Next.js 16 (React 19)
  • Static + SSR pages

Uptime: 24/7 availability with automatic health checks.

Results and Evaluation

User Testing with Healthcare Workers

Participants: 5 Puskesmas healthcare workers (3 nurses, 2 medical assistants)

Workflow Efficiency:

| Metric                 | Manual (Paper)    | Chatbot           | Improvement |
|------------------------|-------------------|-------------------|-------------|
| Avg anamnesis time     | 12-15 min         | 8-9 min           | ~30% faster |
| Data completeness      | 78% fields filled | 95% fields filled | +17%        |
| Errors (missing/wrong) | 12%               | 4%                | -67% errors |

User Satisfaction Ratings (1-5 scale):

  • Ease of use: 4.6/5.0
  • Accuracy: 4.4/5.0
  • Speed: 4.8/5.0
  • Completeness: 4.7/5.0
  • Would recommend: 5/5 (all users)

Qualitative feedback:

  • "Smart prefilling saves a lot of time - no need to repeat questions"
  • "Indonesian language support is crucial, patients don't speak formal medical terms"
  • "PDF export makes doctor handoff seamless"

System Performance Metrics

Production deployment (1 week monitoring):

| Metric                     | Value              |
|----------------------------|--------------------|
| API latency (p50)          | 120 ms             |
| API latency (p95)          | 280 ms             |
| API latency (p99)          | 450 ms             |
| Frontend page load         | 1.2 s (Vercel CDN) |
| Backend uptime             | 99.8%              |
| Session duration (avg)     | 8.5 minutes        |
| Messages per session (avg) | 18 messages        |

Cost analysis:

  • Gemini API (data generation): $3.50 one-time
  • Railway (backend): FREE tier
  • Vercel (frontend): FREE tier
  • Total monthly cost: $0

Key Insights and Lessons Learned

NLP for Low-Resource Languages

  1. LLM-Powered Data Generation Works: Gemini Flash 2.0 produces high-quality Indonesian medical data with proper prompt engineering. Cost-effective ($3.50 vs ~$1,000+ manual annotation).

  2. Domain-Specific Dictionaries Critical: Custom dictionaries outperform generic Indonesian NLP libraries for medical domain. Dictionary-based NER achieves 90%+ F1-score without expensive training.

  3. Slang Normalization Essential: Real patient speech uses colloquial Indonesian ("gak", "udah"). Custom normalization significantly improves feature consistency.

  4. Context-Aware Extraction Needed: Naive regex patterns fail. Context keywords ("sudah", "sejak") prevent false positives.

  5. Classical ML Still Effective: Naive Bayes achieves 92.61% accuracy with 100x faster inference than BERT and zero GPU cost.

Production ML System Design

  1. Hybrid Prediction Improves Reliability: Combining ML with keyword-based rules reduces error rate from ~7% to ~4%.

  2. Confidence Thresholding Crucial: Setting thresholds enables "ask clarification" fallback instead of wrong predictions.

  3. Smart Prefilling Significantly Improves UX: Auto-filling future stages speeds workflow by ~30%.

  4. Session Persistence Matters: In-memory acceptable for demo, but production requires Redis/database.

Future Enhancements

Short-term (3-6 months)

  1. Fine-tune Indonesian BERT: Use IndoBERT pre-trained model. Expected accuracy: 95-97%.

  2. Sequence Labeling for NER: Replace dictionary-based NER with BiLSTM-CRF for token-level entity extraction.

  3. Multi-Intent Classification: Handle complex messages with multiple intents.

Long-term (6-12 months)

  1. Voice Input: Integrate Web Speech API for hands-free anamnesis.

  2. EMR System Integration: Export to hospital systems (HL7 FHIR standard).

  3. Symptom-Disease Inference: Suggest probable diagnoses based on symptom patterns.

  4. Publish Dataset: Release 14K-sample dataset to research community.

Conclusion

Chatbot PUSTU demonstrates that production-grade healthcare NLP systems can be built for low-resource languages (Indonesian medical terminology) using:

  • LLM-powered data generation (Gemini Flash 2.0)
  • Custom NLP preprocessing built from scratch
  • Dictionary-based NER without external libraries
  • Classical ML techniques (Naive Bayes + TF-IDF)
  • Stateful dialog management with smart prefilling
  • Full-stack deployment on free-tier cloud platforms

Achieved metrics:

  • 92.61% intent classification accuracy
  • 14,000 balanced training samples (automated via LLM)
  • 90%+ NER F1-scores
  • 30% workflow time reduction
  • 24/7 cloud deployment (zero cost)

Impact: Streamlined patient anamnesis workflow for Indonesian Puskesmas healthcare workers, demonstrating feasibility of domain-specific NLP for low-resource languages at minimal cost.


Live Demo: https://pustu-anamnesis-chatbot.vercel.app/

Source Code: GitHub Repository

Project Metrics

92.61% intent classification accuracy on 2,800 test samples

14,000 balanced training samples generated via Gemini Flash 2.0 API

97 symptom types + 23 body locations in custom NER system

14-stage stateful dialog management with smart prefilling

94 slang mappings + 93 stopwords in custom preprocessing pipeline

24/7 cloud deployment (Railway + Vercel)

Credits & Acknowledgments

Gemini Flash 2.0 API by Google for training data generation

Scikit-learn library for Multinomial Naive Bayes classifier

Flask web framework for REST API backend

Next.js 16 with TypeScript for modern frontend
