Joshua | Reliability Engineer & GPU Infrastructure Expert

Joshua

Systems Stability Engineer

Vintage Chip Collector

Hardware DIYer

Retro Gaming Hobbyist

Experience & Education

Speciality

Failure Prediction Innovator upholding "stability before scale." Created GPU cluster health metrics now adopted industry-wide.

Experience

1. Reliability Engineer, NVIDIA DGX Systems (4 years)
2. Lead architect for national supercomputing center disaster recovery
3. Core developer of WhaleFlux Self-Healing System

Education

1. MS High-Performance Computing, MIT
2. BSc Electrical Engineering, UC Berkeley

Posts

GPU Management: Slashing Costs in Gemini Fine-Tuning

Joshua July 17, 2025

Mastering PEFT Fine-Tuning: How PEFT & WhaleFlux Slash LLM Tuning Costs & Boost Performance

Joshua July 17, 2025

Cluster Model: Integrating Computational Management and Data Clustering

Joshua July 17, 2025

AI Inference: From Training to Practical Use

Joshua July 15, 2025

Optimize Your End-to-End ML Workflow: From Experimentation to Deployment

Joshua July 14, 2025

Quantization in Machine Learning: Shrink ML Models, Cut Costs, Boost Speed

Joshua July 14, 2025

Fine-Tuning LLMs Without Supercomputers: How GPU Optimization Unlocks Cost-Effective Customization

Joshua July 10, 2025

Real-Time Alerts for GPU Clusters: Stop Costly AI Downtime Before It Starts

Joshua July 10, 2025

Full-Stack Observability: The Secret Weapon for Efficient AI/GPU Operations

Joshua July 10, 2025
