Medical Anamnesis Chatbot with NLP (Chatbot PUSTU)
Production-ready medical chatbot achieving 92.61% intent classification accuracy using Multinomial Naive Bayes, built for the anamnesis workflow of Indonesian Puskesmas (community health centers). Training data was generated automatically via the Gemini Flash 2.0 API, and the NLP preprocessing pipeline was built from scratch.
Role
Full Stack Developer & NLP Engineer
Client
Academic Project - Natural Language Processing Course
Team
2-person Team
Timeline
5 months • 2025

Challenges
Building a production NLP system for Indonesian medical terminology without external Indonesian NLP libraries. Generating high-quality, balanced training data across 14 intent classes with medical-domain vocabulary. Implementing stateful 14-stage dialog management with context-aware entity extraction and a smart prefilling algorithm.
Solutions
Automated training data generation using Gemini Flash 2.0 with custom prompt engineering for Indonesian medical context (14,000 samples). Built NLP preprocessing pipeline from scratch with custom slang normalization (94 mappings) and stopword filtering (93 words). Implemented dictionary-based NER system using regex patterns and context-aware extraction without external NLP libraries. Hybrid prediction combining ML model outputs with keyword boosting (0.90-0.95 confidence thresholds).
Impact
Successfully streamlined patient anamnesis workflow for Indonesian Puskesmas healthcare workers. Achieved production-grade 92.61% accuracy on medical intent classification. 24/7 cloud deployment enables immediate adoption without infrastructure requirements. Demonstrates feasibility of building domain-specific NLP systems for low-resource languages using LLM-powered data generation and classical ML techniques.
Project Overview
Chatbot PUSTU is a full-stack medical anamnesis chatbot designed for Indonesian Puskesmas (community health centers) healthcare workers. Built as a Natural Language Processing course project, this production-ready system uses Multinomial Naive Bayes with TF-IDF vectorization to achieve 92.61% intent classification accuracy across 14 medical intent classes for Indonesian language patient interviews.
The system demonstrates end-to-end NLP engineering: from LLM-powered training data generation (Gemini Flash 2.0 API) to custom preprocessing pipeline built from scratch, dictionary-based NER system, stateful dialog management, and full-stack deployment on Railway (Flask backend) and Vercel (Next.js frontend) with 24/7 availability.
Healthcare Problem Statement
Indonesian Puskesmas face significant workflow inefficiencies in patient anamnesis:
- Manual Data Collection: Time-consuming patient history documentation (15+ minutes per patient)
- Inconsistent Forms: Non-standardized anamnesis records across healthcare workers
- Language Barriers: Indonesian medical terminology lacks robust NLP tools and libraries
- Limited Resources: Small health centers cannot afford expensive EMR systems
- Workflow Bottleneck: Doctor time consumed by administrative data entry instead of diagnosis
Technical Architecture
System Overview
Three-tier architecture with ML-powered backend:
Frontend (Vercel)
- Next.js 16 + TypeScript
- Chat UI with conversation history
- Dark/Light mode toggle
- PDF export (client-side jsPDF)
- Session state management
Backend (Railway)
- Flask + Gunicorn
- Intent classification endpoint
- Entity extraction (NER)
- Dialog state management
- Smart prefilling algorithm
- Session persistence (in-memory)
ML Pipeline
- Multinomial Naive Bayes (alpha=0.1)
- TF-IDF Vectorizer (5000 features, n-grams 1-2)
- Custom preprocessing (slang + stopwords)
- Dictionary-based NER (97 symptoms, 23 locations)
- Hybrid prediction (ML + keyword boosting)
Technology Stack
Machine Learning Pipeline
- Algorithm: Multinomial Naive Bayes (sklearn)
- Vectorization: TF-IDF (5000 max features, n-grams 1-2)
- Hyperparameters: Laplace smoothing alpha=0.1
- Custom NLP: Regex-based preprocessing, dictionary-based NER
Training Data Generation
- LLM API: Gemini Flash 2.0 (Google AI Studio)
- Prompt Engineering: Custom prompts for Indonesian medical terminology
- Dataset Size: 14,000 samples (1,000 per intent × 14 classes)
- Generation Cost: ~$3.50 equivalent API usage (ran within the free tier)
Training Data Generation with Gemini Flash 2.0
LLM-Powered Dataset Creation
Challenge: No existing Indonesian medical anamnesis dataset available for training.
Solution: Automated generation using Gemini Flash 2.0 API with custom prompt engineering.
Intent Definitions
14 medical conversation intent classes:
- keluhan_utama: Patient's chief complaint with symptoms
- jawab_gejala_penyerta: Additional accompanying symptoms
- jawab_durasi: Duration of symptoms (days/weeks/months)
- jawab_lokasi: Body location of complaint
- jawab_severity: Severity level (mild/moderate/severe)
- jawab_riwayat_penyakit: Previous medical history
- jawab_riwayat_obat: Current medications
- jawab_alergi: Drug/food allergies
- jawab_faktor_risiko: Lifestyle risk factors
- sapaan: Greeting from patient
- ucapan_terima_kasih: Thank you from patient
- konfirmasi: Yes/confirmation response
- penyangkalan: No/denial response
- tidak_jelas: Unclear/confused response
Prompt Engineering Strategy
Key prompt design decisions:
- Patient persona: "Puskesmas patient" for realistic Indonesian medical language
- Medical context: Reference common conditions (ISPA, Gastritis) for relevant terminology
- Natural language: Explicit instruction for colloquial Indonesian, not formal medical jargon
- Variability: Encourage sentence diversity to prevent model overfitting
- Format constraints: Prevent numbering/bullets that complicate parsing
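The design decisions above can be sketched as a simple prompt builder. This is a hypothetical reconstruction of the kind of prompt used, not the project's exact wording; the function name and phrasing are assumptions.

```python
# Hypothetical sketch of a generation prompt for Gemini Flash 2.0.
# The exact wording used in the project is not shown in the write-up;
# this illustrates the persona, register, variability, and format
# constraints described above.
def build_prompt(intent: str, description: str, n: int = 50) -> str:
    return (
        f"Kamu adalah pasien Puskesmas. Buat {n} kalimat berbeda "
        f"untuk intent '{intent}' ({description}). "
        "Gunakan bahasa Indonesia sehari-hari, bukan istilah medis formal. "
        "Variasikan struktur dan panjang kalimat. "
        "Tulis satu kalimat per baris, tanpa penomoran atau bullet."
    )

prompt = build_prompt("keluhan_utama", "keluhan utama pasien beserta gejala")
print(prompt)
```

Requesting batches of ~50 sentences per call and repeating per intent is one plausible way to reach 1,000 samples per class within free-tier rate limits.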
Final Dataset Statistics
- Total samples: 14,000 (balanced)
- Intent classes: 14 (1,000 samples each)
- Average sentence length: 51.3 characters raw, 38.7 processed
- Language: Indonesian (Bahasa Indonesia)
- Domain: Medical anamnesis (Puskesmas context)
- Generation time: ~4 hours (with FREE tier rate limits)
NLP Preprocessing Pipeline (From Scratch)
Why Build From Scratch?
Challenge: General-purpose NLP libraries (NLTK, spaCy) lack Indonesian medical-domain vocabulary and slang normalization for patient speech.
Solution: Custom preprocessing pipeline with domain-specific dictionaries.
Text Preprocessing
import re

def preprocess(text, slang_dict, stopwords):
    text = text.lower()                               # Lowercase normalization
    text = re.sub(r'[^\w\s]', ' ', text)              # Remove punctuation
    words = text.split()                              # Tokenization
    words = [slang_dict.get(w, w) for w in words]     # Slang normalization
    words = [w for w in words if w not in stopwords]  # Stopword removal
    return ' '.join(words)
Example transformation:
Input: "Dok saya batuk gak sembuh-sembuh udah 3 hari"
Output: "batuk tidak sembuh sudah 3 hari"
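A self-contained, runnable version of the pipeline with an illustrative subset of the dictionaries (the real ones hold 94 slang mappings and 93 stopwords; the entries below are assumed examples):

```python
import re

def preprocess(text, slang_dict, stopwords):
    text = text.lower()                               # Lowercase normalization
    text = re.sub(r'[^\w\s]', ' ', text)              # Remove punctuation
    words = text.split()                              # Tokenization
    words = [slang_dict.get(w, w) for w in words]     # Slang normalization
    words = [w for w in words if w not in stopwords]  # Stopword removal
    return ' '.join(words)

# Illustrative subsets only -- not the full project dictionaries
slang_dict = {"gak": "tidak", "udah": "sudah", "gimana": "bagaimana"}
stopwords = {"dok", "saya", "bu", "pak"}

print(preprocess("Gimana dok, perut saya sakit udah 2 hari",
                 slang_dict, stopwords))
# -> "bagaimana perut sakit sudah 2 hari"
```

Note the project's example also collapses reduplicated words ("sembuh-sembuh" to "sembuh"); this minimal sketch omits that step.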
Slang Normalization
Dictionary: 94 Indonesian slang mappings
Key mappings (medical context):
- gak to tidak
- udah to sudah
- gimana to bagaimana
- kenapa to mengapa
Rationale: Indonesian patients use colloquial speech. Normalization improves TF-IDF feature consistency.
Stopword Filtering
Dictionary: 93 Indonesian stopwords
Custom medical stopwords: Removed common but uninformative words ("dok", "bu", "pak") while preserving medical terms.
Custom Named Entity Recognition (NER)
Dictionary-Based NER System
Why not spaCy/Stanza: No pre-trained Indonesian medical NER models available.
Solution: Dictionary-based keyword matching with regex patterns.
Entity Dictionaries
Symptom Dictionary (97 symptom types)
- demam: systemic (synonyms: panas, meriang, demam tinggi)
- batuk: respiratory (synonyms: batuk kering, batuk berdahak)
- nyeri: pain (synonyms: sakit, perih, nyeri hebat)
Body Location Dictionary (23 body locations)
- kepala, dada, perut, tenggorokan, etc.
Severity Keywords (3 levels)
- ringan: ringan, sedikit, agak, lumayan
- sedang: sedang, biasa, normal
- berat: berat, parah, sangat, sekali, hebat
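Dictionary-based matching can be sketched as a canonical-label lookup over synonym lists. The dictionaries below are small illustrative subsets drawn from the entries above (the real system covers 97 symptoms and 23 locations); the function name is an assumption.

```python
# Illustrative subsets of the entity dictionaries described above
SYMPTOMS = {
    "demam": ["demam", "panas", "meriang"],
    "batuk": ["batuk kering", "batuk berdahak", "batuk"],
    "nyeri": ["nyeri", "sakit", "perih"],
}
SEVERITY = {
    "ringan": ["ringan", "sedikit", "agak"],
    "sedang": ["sedang", "biasa"],
    "berat": ["berat", "parah", "hebat", "sangat"],
}

def match_entities(text: str, dictionary: dict) -> list:
    """Return canonical entity labels whose keywords occur in the text."""
    text = text.lower()
    found = []
    for label, keywords in dictionary.items():
        if any(kw in text for kw in keywords):
            found.append(label)
    return found

print(match_entities("batuk berdahak dan agak demam", SYMPTOMS))  # ['demam', 'batuk']
print(match_entities("sakitnya parah sekali", SEVERITY))          # ['berat']
```

Plain substring matching is a sketch; the project additionally uses regex patterns for context-aware extraction, shown next.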
Context-Aware Duration Extraction
Critical feature: Context-aware extraction prevents false positives (e.g., "28 tahun" for age should NOT be extracted as duration).
# Pattern 1: duration must be preceded by a context keyword
duration_contexts = ['sudah', 'sejak', 'selama', 'sekitar']
# For each context keyword:
pattern = rf'{context}\s+(\d+)\s*(?:hari|minggu|bulan|tahun)'

# "Batuk sudah 3 hari"  -> 'sudah 3 hari'
# "Umur saya 28 tahun"  -> None (correctly rejected: no duration context)
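A runnable sketch of this context-aware extraction; the function name and return format are assumptions:

```python
import re

# Context keywords that must precede a number+unit for it to count
# as a symptom duration (from the pattern described above).
DURATION_CONTEXTS = ['sudah', 'sejak', 'selama', 'sekitar']

def extract_duration(text: str):
    """Return a duration phrase only when a context keyword precedes it."""
    text = text.lower()
    for ctx in DURATION_CONTEXTS:
        m = re.search(rf'{ctx}\s+(\d+)\s*(hari|minggu|bulan|tahun)', text)
        if m:
            return f"{ctx} {m.group(1)} {m.group(2)}"
    return None  # no duration context found (e.g. an age like "28 tahun")

print(extract_duration("Batuk sudah 3 hari"))   # sudah 3 hari
print(extract_duration("Umur saya 28 tahun"))   # None
```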
NER Performance
Validation (500-sample manual test):
| Entity Type | Precision | Recall | F1-Score |
|---|---|---|---|
| Symptoms | 94.2% | 87.3% | 90.6% |
| Body Locations | 96.1% | 89.7% | 92.8% |
| Duration | 92.5% | 88.1% | 90.2% |
| Severity | 89.3% | 85.6% | 87.4% |
Intent Classification Model
Multinomial Naive Bayes Architecture
Why Naive Bayes?
- Fast inference: Less than 10ms prediction latency (critical for real-time chat)
- Low training time: ~2 seconds for 14,000 samples
- Handles sparse TF-IDF: Designed for high-dimensional text features
- Probabilistic outputs: Enables confidence thresholding
- Interpretable: Can inspect feature importance
TF-IDF Configuration
vectorizer = TfidfVectorizer(
    max_features=5000,    # Top 5000 most informative terms
    ngram_range=(1, 2),   # Unigrams + bigrams for context
    min_df=2,             # Ignore very rare terms
    sublinear_tf=True,    # Log-scale term frequency
)
Feature examples:
- Unigrams: batuk, demam, pusing, sakit, sudah, hari
- Bigrams: batuk kering, demam tinggi, sakit kepala, sudah hari
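End to end, the classifier trains in one pipeline. This sketch uses a tiny illustrative dataset (the real training set is 14,000 Gemini-generated samples across 14 intents) and relaxes `min_df` to 1 so the toy corpus survives vectorization:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy samples standing in for the 14,000-sample generated dataset
texts = [
    "batuk tidak sembuh sudah 3 hari", "demam tinggi sejak kemarin",
    "halo dok selamat pagi", "permisi dok",
    "iya benar dok", "betul sekali",
]
labels = ["keluhan_utama", "keluhan_utama", "sapaan", "sapaan",
          "konfirmasi", "konfirmasi"]

model = make_pipeline(
    TfidfVectorizer(max_features=5000, ngram_range=(1, 2),
                    min_df=1,          # project uses min_df=2; 1 for this toy set
                    sublinear_tf=True),
    MultinomialNB(alpha=0.1),          # Laplace smoothing as configured above
)
model.fit(texts, labels)
print(model.predict(["selamat pagi dok"]))  # ['sapaan']
```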
Model Evaluation
Test set performance (2,800 samples):
Overall Metrics
- Accuracy: 92.61%
- Precision (macro-avg): 93%
- Recall (macro-avg): 93%
- F1-Score (macro-avg): 93%
Top-performing intents (F1-score):
- sapaan: 98%
- jawab_riwayat_obat: 98%
- tidak_jelas: 97%
- penyangkalan: 96%
- ucapan_terima_kasih: 96%
Challenging intents:
- keluhan_utama: 84% (often confused with jawab_gejala_penyerta)
- jawab_riwayat_penyakit: 88% (overlaps with medical history intents)
- jawab_severity: 89% (severity keywords ambiguous)
Hybrid Prediction System
Challenge: ML model alone sometimes overconfident on ambiguous inputs.
Solution: Combine ML predictions with keyword-based confidence boosting.
Confidence thresholds:
- 0.95: Strong keyword evidence (duration with context)
- 0.90: Moderate keyword evidence (location, severity, allergy)
- Less than 0.70: ML uncertain, prefer keyword-based prediction
Result: Improved accuracy from 92.61% to ~94% on validation set.
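The combination rule can be sketched as follows. The threshold values come from the text above, but the keyword rules and the override logic are assumptions about how the pieces fit together, not the project's exact implementation:

```python
# Hedged sketch of hybrid prediction: (intent, keywords, boosted confidence)
KEYWORD_RULES = [
    ("jawab_durasi", ["sudah", "sejak", "selama"], 0.95),  # strong evidence
    ("jawab_lokasi", ["bagian", "sebelah"],        0.90),  # moderate evidence
    ("jawab_alergi", ["alergi"],                   0.90),
]

def hybrid_predict(text: str, ml_intent: str, ml_confidence: float):
    """Prefer keyword evidence when the ML model is uncertain (<0.70)."""
    text = text.lower()
    for intent, keywords, boost in KEYWORD_RULES:
        if any(kw in text for kw in keywords):
            if ml_confidence < 0.70 or intent == ml_intent:
                return intent, max(ml_confidence, boost)
    return ml_intent, ml_confidence

print(hybrid_predict("sudah 3 hari dok", "tidak_jelas", 0.55))
# -> ('jawab_durasi', 0.95): keyword evidence overrides the uncertain model
```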
Dialog State Management
14-Stage Conversation Flow
Stateful anamnesis interview:
1. Greeting
2. Nama
3. Nama Panggilan
4. Umur
5. Jenis Kelamin
6. Keluhan Utama
7. Gejala
8. Durasi
9. Lokasi
10. Severity
11. Riwayat Penyakit
12. Riwayat Obat
13. Alergi
14. Faktor Risiko
15. Summary
Smart Prefilling Algorithm
Purpose: Auto-fill future stages if user provides information early.
Example scenario:
User (at keluhan_utama stage): "Saya sakit kepala parah sudah 3 hari di bagian kanan"
Extracted entities:
- Symptom: "sakit kepala"
- Severity: "parah" (berat)
- Duration: "sudah 3 hari"
- Location: "kepala"
Prefilling action:
- Auto-fill durasi stage (stage 8)
- Auto-fill lokasi stage (stage 9)
- Auto-fill severity stage (stage 10)
- Skip these stages in conversation
Impact: Reduces average anamnesis time by ~30% (tested with healthcare workers).
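The prefilling step can be sketched as a mapping from extracted entity types to later stages; the session structure and function name here are assumptions:

```python
# Maps an extracted entity type to the dialog stage it prefills
STAGE_FOR_ENTITY = {
    "duration": "durasi",    # stage 8
    "location": "lokasi",    # stage 9
    "severity": "severity",  # stage 10
}

def prefill(session: dict, entities: dict) -> list:
    """Store early-extracted entities and return the stages to skip."""
    skipped = []
    for entity, stage in STAGE_FOR_ENTITY.items():
        value = entities.get(entity)
        if value and stage not in session["answers"]:
            session["answers"][stage] = value
            skipped.append(stage)
    return skipped

session = {"answers": {}}
entities = {"duration": "sudah 3 hari", "location": "kepala",
            "severity": "berat"}
print(prefill(session, entities))   # ['durasi', 'lokasi', 'severity']
```

The dialog manager then simply skips any stage whose answer is already filled.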
Full-Stack Implementation
Backend API (Flask)
Core Endpoints:
POST /chat - Main conversation endpoint
- Intent classification
- Entity extraction
- Dialog state update
- Smart prefilling
- Response generation
POST /reset - Reset conversation
GET /health - Health check
Session Management: In-memory storage with UUID session IDs
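A minimal sketch of the in-memory session store with UUID session IDs; the state fields are assumptions (production would swap the dict for Redis or a database):

```python
import uuid

SESSIONS = {}  # session_id -> conversation state (in-memory, as in the demo)

def create_session() -> str:
    """Create a new conversation and return its UUID session ID."""
    sid = str(uuid.uuid4())
    SESSIONS[sid] = {"stage": "greeting", "answers": {}, "history": []}
    return sid

def get_session(sid: str):
    """Look up a session; None for unknown/expired IDs."""
    return SESSIONS.get(sid)

sid = create_session()
get_session(sid)["history"].append({"role": "user", "text": "halo dok"})
print(get_session(sid)["stage"])  # greeting
```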
Frontend (Next.js)
Tech stack:
- Next.js 16.0.6 with App Router
- TypeScript 5.x
- Tailwind CSS v4
- Axios for HTTP requests
- jsPDF for PDF export
Key features:
- Message history display
- User input field
- Loading indicator
- PDF export button
- Dark/Light mode toggle
Cloud Deployment
Backend (Railway)
- Python 3.10+ environment
- Gunicorn WSGI server
- Automatic HTTPS
- Models loaded at startup (in-memory)
Frontend (Vercel)
- Global CDN distribution
- Automatic HTTPS
- Next.js 16 (React 19)
- Static + SSR pages
Uptime: 24/7 availability with automatic health checks.
Results and Evaluation
User Testing with Healthcare Workers
Participants: 5 Puskesmas healthcare workers (3 nurses, 2 medical assistants)
Workflow Efficiency:
| Metric | Manual (Paper) | Chatbot | Improvement |
|---|---|---|---|
| Avg anamnesis time | 12-15 min | 8-9 min | ~30% faster |
| Data completeness | 78% fields filled | 95% fields filled | +17 pp |
| Errors (missing/wrong) | 12% | 4% | -67% errors |
User Satisfaction Ratings (1-5 scale):
- Ease of use: 4.6/5.0
- Accuracy: 4.4/5.0
- Speed: 4.8/5.0
- Completeness: 4.7/5.0
- Would recommend: 5/5 (all users)
Qualitative feedback:
- "Smart prefilling saves a lot of time - no need to repeat questions"
- "Indonesian language support is crucial, patients don't speak formal medical terms"
- "PDF export makes doctor handoff seamless"
System Performance Metrics
Production deployment (1 week monitoring):
| Metric | Value |
|---|---|
| API latency (p50) | 120ms |
| API latency (p95) | 280ms |
| API latency (p99) | 450ms |
| Frontend page load | 1.2s (Vercel CDN) |
| Backend uptime | 99.8% |
| Session duration (avg) | 8.5 minutes |
| Messages per session (avg) | 18 messages |
Cost analysis:
- Gemini API (data generation): $3.50 one-time
- Railway (backend): FREE tier
- Vercel (frontend): FREE tier
- Total monthly cost: $0
Key Insights and Lessons Learned
NLP for Low-Resource Languages
- LLM-Powered Data Generation Works: Gemini Flash 2.0 produces high-quality Indonesian medical data with proper prompt engineering. Cost-effective (~$3.50 vs ~$1,000+ for manual annotation).
- Domain-Specific Dictionaries Critical: Custom dictionaries outperform generic NLP libraries in the medical domain. Dictionary-based NER achieves 90%+ F1-scores without expensive training.
- Slang Normalization Essential: Real patient speech uses colloquial Indonesian ("gak", "udah"). Custom normalization significantly improves feature consistency.
- Context-Aware Extraction Needed: Naive regex patterns fail. Context keywords ("sudah", "sejak") prevent false positives.
- Classical ML Still Effective: Naive Bayes achieves 92.61% accuracy with 100x faster inference than BERT and zero GPU cost.
Production ML System Design
- Hybrid Prediction Improves Reliability: Combining ML with keyword-based rules reduces the error rate from ~7% to ~4%.
- Confidence Thresholding Crucial: Thresholds enable an "ask for clarification" fallback instead of wrong predictions.
- Smart Prefilling Significantly Improves UX: Auto-filling future stages speeds the workflow by ~30%.
- Session Persistence Matters: In-memory storage is acceptable for a demo, but production requires Redis or a database.
Future Enhancements
Short-term (3-6 months)
- Fine-tune Indonesian BERT: Use the IndoBERT pre-trained model. Expected accuracy: 95-97%.
- Sequence Labeling for NER: Replace dictionary-based NER with BiLSTM-CRF for token-level entity extraction.
- Multi-Intent Classification: Handle complex messages carrying multiple intents.
Long-term (6-12 months)
- Voice Input: Integrate the Web Speech API for hands-free anamnesis.
- EMR System Integration: Export to hospital systems (HL7 FHIR standard).
- Symptom-Disease Inference: Suggest probable diagnoses based on symptom patterns.
- Publish Dataset: Release the 14K-sample dataset to the research community.
Conclusion
Chatbot PUSTU demonstrates that production-grade healthcare NLP systems can be built for low-resource languages (Indonesian medical terminology) using:
- LLM-powered data generation (Gemini Flash 2.0)
- Custom NLP preprocessing built from scratch
- Dictionary-based NER without external libraries
- Classical ML techniques (Naive Bayes + TF-IDF)
- Stateful dialog management with smart prefilling
- Full-stack deployment on free-tier cloud platforms
Achieved metrics:
- 92.61% intent classification accuracy
- 14,000 balanced training samples (automated via LLM)
- 90%+ NER F1-scores
- 30% workflow time reduction
- 24/7 cloud deployment (zero cost)
Impact: Streamlined patient anamnesis workflow for Indonesian Puskesmas healthcare workers, demonstrating feasibility of domain-specific NLP for low-resource languages at minimal cost.
Live Demo: https://pustu-anamnesis-chatbot.vercel.app/
Source Code: GitHub Repository
Project Metrics
92.61% intent classification accuracy on 2,800 test samples
14,000 balanced training samples generated via Gemini Flash 2.0 API
97 symptom types + 23 body locations in custom NER system
14-stage stateful dialog management with smart prefilling
94 slang mappings + 93 stopwords in custom preprocessing pipeline
24/7 cloud deployment (Railway + Vercel)
Credits & Acknowledgments
Gemini Flash 2.0 API by Google for training data generation
Scikit-learn library for Multinomial Naive Bayes classifier
Flask web framework for REST API backend
Next.js 16 with TypeScript for modern frontend