CUDA Fundamentals: Tiled Matrix Multiply & Bitonic Sort
Writing real GPU kernels, exploring shared memory tiling, parallel sorting algorithms, and performance optimization on an NVIDIA H100.
The notes, thoughts, and projects I've been working on.
What I Learned in GPU Hardware and Software (CS 8803) - A Retrospective
Writing real GPU kernels, exploring shared memory tiling, parallel sorting algorithms, and performance optimization on an NVIDIA H100.
Building a cycle-level GPU simulator from the inside, implementing GTO and CCWS warp schedulers, modeling compute instruction latencies, and extending to tensor core simulation.
A simple if statement can halve GPU throughput, exploring how static dataflow analysis detects branch divergence by building a def-use chain over a SASS control flow graph.
Writing a FlashAttention CUDA kernel from scratch, tiling the attention matrix to avoid materializing N×N memory, building a KV cache for token generation, and running GPT-2 with custom kernels end-to-end.
What I Learned in Graduate Introduction to Operating Systems (GIOS) - A Retrospective
A project that explores multithreading, sockets, and the basics of operating systems
Exploring Inter-Process Communication and synchronization in C by building a high-performance proxy and cache.
An exploration of remote communication, data consistency, and synchronization in a C++ distributed file system project.