3 phases
4 GitHub projects
6 months
Phase 1: C++ systems foundation (Weeks 1-6)
Learn
Modern C++ (C++17/20): memory ownership, RAII, smart pointers
Resource: "A Tour of C++" by Bjarne Stroustrup. 2 chapters/week.
CMake build system: how to build, link, and structure C++ projects
You'll need this before any real project compiles correctly.
Multithreading: std::thread, mutexes, condition variables, atomics
Everything in inference is concurrent. This is non-negotiable.
Sockets and HTTP from scratch: BSD socket API in C++
Don't use a library yet. Write raw send/recv. Understand the protocol.
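Two of the items above, RAII and raw send/recv, meet in the first code a socket server needs. The following is a minimal sketch, assuming a POSIX system; `UniqueFd`, `send_all`, and `recv_some` are hypothetical names chosen for illustration, not part of any library.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#include <cerrno>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical RAII owner for a POSIX file descriptor: closes it exactly once.
class UniqueFd {
public:
    UniqueFd() = default;
    explicit UniqueFd(int fd) : fd_(fd) {}
    ~UniqueFd() { reset(); }

    // Copying would lead to a double close, so only moves are allowed.
    UniqueFd(const UniqueFd&) = delete;
    UniqueFd& operator=(const UniqueFd&) = delete;

    UniqueFd(UniqueFd&& other) noexcept : fd_(std::exchange(other.fd_, -1)) {}
    UniqueFd& operator=(UniqueFd&& other) noexcept {
        if (this != &other) {
            reset();
            fd_ = std::exchange(other.fd_, -1);
        }
        return *this;
    }

    int get() const { return fd_; }
    bool valid() const { return fd_ >= 0; }

    void reset() {
        if (fd_ >= 0) {
            ::close(fd_);
            fd_ = -1;
        }
    }

private:
    int fd_ = -1;
};

// send() may accept fewer bytes than requested, so loop until all are sent.
// Returns false on any error other than EINTR.
bool send_all(int fd, const std::string& data) {
    std::size_t sent = 0;
    while (sent < data.size()) {
        ssize_t n = ::send(fd, data.data() + sent, data.size() - sent, 0);
        if (n < 0) {
            if (errno == EINTR) continue;
            return false;
        }
        sent += static_cast<std::size_t>(n);
    }
    return true;
}

// One recv() call; returns the bytes that arrived (empty on EOF or error).
std::string recv_some(int fd, std::size_t max_bytes) {
    std::string buf(max_bytes, '\0');
    ssize_t n = ::recv(fd, buf.data(), max_bytes, 0);
    buf.resize(n > 0 ? static_cast<std::size_t>(n) : 0);
    return buf;
}
```

A real server would also have to decide what to do about SIGPIPE when the peer disconnects mid-send; this sketch leaves that out.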
Project 1: HTTP server in C++
tokoro: HTTP/1.1 server in C++
Build a multi-threaded HTTP server that serves static files and handles concurrent connections. No libraries: raw POSIX sockets. Implement: TCP accept loop, HTTP parser, thread pool, keep-alive.
Why this matters: every inference server is a networked C++ process. This teaches you what's below frameworks like NestJS, and the project is immediately legible to AI infrastructure engineers.
Milestones by week 6
Server handles 1000 concurrent connections
HTTP parser handles chunked transfer encoding
README with architecture diagram and benchmark results
Published to GitHub with CI via GitHub Actions
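For the chunked-encoding milestone, the core of the work is a small decoder. The sketch below assumes the whole body is already in memory and uses C++17; `decode_chunked` is a hypothetical helper, and it deliberately skips chunk extensions and ignores trailers.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Decodes a complete HTTP/1.1 chunked body.
// Returns std::nullopt if the input is malformed or truncated.
// Simplifications: chunk extensions are skipped, trailers are ignored.
std::optional<std::string> decode_chunked(const std::string& in) {
    std::string out;
    std::size_t pos = 0;
    for (;;) {
        std::size_t eol = in.find("\r\n", pos);
        if (eol == std::string::npos) return std::nullopt;

        // Chunk size is hexadecimal; anything after ';' is a chunk extension.
        std::size_t size = 0;
        std::size_t i = pos;
        for (; i < eol && in[i] != ';'; ++i) {
            char c = in[i];
            unsigned digit;
            if (c >= '0' && c <= '9') digit = c - '0';
            else if (c >= 'a' && c <= 'f') digit = c - 'a' + 10;
            else if (c >= 'A' && c <= 'F') digit = c - 'A' + 10;
            else return std::nullopt;
            size = size * 16 + digit;
            if (size > in.size()) return std::nullopt;  // larger than the input itself
        }
        if (i == pos) return std::nullopt;  // size line had no hex digits
        pos = eol + 2;

        if (size == 0) return out;  // zero-length chunk marks the end

        // Need the chunk data plus its trailing CRLF.
        if (in.size() - pos < size + 2) return std::nullopt;
        out.append(in, pos, size);
        pos += size;
        if (in.compare(pos, 2, "\r\n") != 0) return std::nullopt;
        pos += 2;
    }
}
```

A production parser would work incrementally on partial reads instead of a complete buffer; this version only shows the framing rules.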
Phase 2: AI inference internals (Weeks 7-16)
Learn
How LLM inference works: tokenization, attention, KV cache, batching
Read the llama.cpp source. Understand every struct. Don't just run it.
GGUF format: how model weights are quantized and stored on disk
Open a .gguf file in a hex editor. Map the header. Then load it in code.
Profiling C++: perf, Valgrind, gprof, cache miss analysis
You can't optimize what you haven't measured. This skill separates real systems engineers from the rest.
SIMD basics: SSE2/AVX2 intrinsics for vectorized float math
Optional but powerful. llama.cpp uses SIMD heavily. Even reading it builds intuition.
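After mapping the header in a hex editor, the "load it in code" step can start as small as this. The sketch assumes the little-endian fixed header used by GGUF version 2 and later (4-byte magic "GGUF", 32-bit version, 64-bit tensor count, 64-bit metadata key/value count); `GgufHeader` and `parse_gguf_header` are illustrative names, not llama.cpp APIs. Verify the layout against the current GGUF specification before relying on it.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Illustrative struct; field names are this sketch's own.
struct GgufHeader {
    std::uint32_t version;
    std::uint64_t tensor_count;
    std::uint64_t metadata_kv_count;
};

// Reads an n-byte little-endian unsigned integer starting at p.
static std::uint64_t read_le(const unsigned char* p, std::size_t n) {
    std::uint64_t v = 0;
    for (std::size_t i = 0; i < n; ++i) {
        v |= static_cast<std::uint64_t>(p[i]) << (8 * i);
    }
    return v;
}

// Parses only the fixed-size header at the start of the file.
// Assumption: version >= 2 layout with 64-bit counts; older files are rejected.
std::optional<GgufHeader> parse_gguf_header(const std::vector<unsigned char>& bytes) {
    constexpr std::size_t kHeaderSize = 4 + 4 + 8 + 8;
    if (bytes.size() < kHeaderSize) return std::nullopt;
    if (std::memcmp(bytes.data(), "GGUF", 4) != 0) return std::nullopt;

    GgufHeader h;
    h.version = static_cast<std::uint32_t>(read_le(bytes.data() + 4, 4));
    h.tensor_count = read_le(bytes.data() + 8, 8);
    h.metadata_kv_count = read_le(bytes.data() + 16, 8);
    if (h.version < 2) return std::nullopt;  // version 1 used 32-bit counts
    return h;
}
```

The metadata key/value pairs and tensor descriptors follow the header; parsing those is the natural next exercise.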
Project 2: inference server in C++
vahan: LLM inference server in C++
Build an HTTP server using your tokoro base or cpp-httplib that loads a GGUF model via llama.cpp as a library, accepts prompts, streams tokens via SSE, and handles concurrent request queuing. Expose /generate and /health endpoints. Add latency metrics.
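Streaming tokens via SSE comes down to framing each token as a Server-Sent Events message on a response sent with `Content-Type: text/event-stream`. A minimal sketch of that framing follows; `sse_event` is a hypothetical helper, and the event name in the usage note is an arbitrary choice, not something vahan must use.

```cpp
#include <cstddef>
#include <string>

// Frames one payload as a Server-Sent Events message.
// Each line of the payload gets its own "data:" field, and a blank line
// terminates the message.
std::string sse_event(const std::string& data, const std::string& event = "") {
    std::string out;
    if (!event.empty()) out += "event: " + event + "\n";

    std::size_t start = 0;
    for (;;) {
        std::size_t nl = data.find('\n', start);
        out += "data: ";
        out.append(data, start,
                   nl == std::string::npos ? std::string::npos : nl - start);
        out += '\n';
        if (nl == std::string::npos) break;
        start = nl + 1;
    }
    out += '\n';
    return out;
}
```

For example, `sse_event("Hello")` yields `data: Hello` followed by a blank line, and a final `sse_event("[DONE]", "end")` could signal completion to the client. Splitting on newlines matters because a token containing a raw newline would otherwise end the message early.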
Project 3: inference benchmarking CLI
drishti: inference benchmark CLI
A CLI tool in C++ or Python that stress-tests any OpenAI-compatible inference endpoint. Measures TTFT, throughput, p50/p95/p99 latency, and concurrent load. Outputs structured JSON reports. Genuinely useful to the community.
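The p50/p95/p99 figures can be computed from the recorded per-request latencies in a few lines. The sketch below uses the nearest-rank definition, which is one of several common percentile definitions; the function name `percentile` is an assumption for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it. Takes the samples by value
// because it sorts them.
double percentile(std::vector<double> samples, double p) {
    if (samples.empty()) throw std::invalid_argument("no samples");
    if (p <= 0.0 || p > 100.0) throw std::invalid_argument("p must be in (0, 100]");

    std::sort(samples.begin(), samples.end());
    const double n = static_cast<double>(samples.size());
    const std::size_t rank = static_cast<std::size_t>(std::ceil(p * n / 100.0));
    return samples[rank - 1];  // rank is 1-based and always in [1, n]
}
```

With few samples the high percentiles collapse onto the maximum, so a report should state the sample count next to p95 and p99.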
Milestones by week 16
vahan streams Llama 3.2 3B locally over HTTP
Handles 4 concurrent requests with queuing
drishti publishes benchmark report for 3 popular inference endpoints
Blog post: "What I learned reading the llama.cpp source"
Phase 3: Specialise and signal (Weeks 17-26)
Go deep on one track
Contribute a real PR to llama.cpp, vLLM, or whisper.cpp
Not a docs fix. A bug fix, a perf improvement, or a missing feature. One merged PR > 10 side projects.
If targeting ElevenLabs: build real-time voice pipeline
STT -> LLM -> TTS, full duplex, WebSocket, interruption handling. Hard latency budget: <800ms.
If targeting Anthropic: study PagedAttention, speculative decoding
Read the vLLM paper. Then implement a toy speculative decoder in C++.
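A toy speculative decoder can start from the greedy case, where a draft token is accepted only if it equals the target model's own greedy choice. The sketch below is that simplified variant under loud assumptions: both "models" are plain functions from a token prefix to a next token, verification is done position by position instead of in one batched forward pass, and all names (`NextTokenFn`, `SpecStats`, `speculative_decode`) are hypothetical. Sampling-based speculative decoding uses a probabilistic acceptance rule instead of exact matching.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

using Token = int;
// Toy "model": maps a token prefix to its greedy next token.
using NextTokenFn = std::function<Token(const std::vector<Token>&)>;

struct SpecStats {
    std::size_t verify_rounds = 0;          // one per draft-then-verify cycle
    std::size_t accepted_draft_tokens = 0;  // draft tokens the target agreed with
};

// Greedy speculative decoding: output is identical to decoding with `target`
// alone; the draft only determines how many tokens each round can confirm.
std::vector<Token> speculative_decode(const NextTokenFn& target,
                                      const NextTokenFn& draft,
                                      std::vector<Token> seq,
                                      std::size_t new_tokens,
                                      std::size_t k,
                                      SpecStats* stats = nullptr) {
    k = std::max<std::size_t>(k, 1);  // always propose at least one token
    const std::size_t goal = seq.size() + new_tokens;

    while (seq.size() < goal) {
        // 1. The cheap draft model proposes up to k tokens autoregressively.
        const std::size_t want = std::min(k, goal - seq.size());
        std::vector<Token> proposal = seq;
        for (std::size_t i = 0; i < want; ++i) proposal.push_back(draft(proposal));

        // 2. The target checks the proposal left to right. A real engine
        //    would score all positions in a single batched forward pass.
        if (stats) ++stats->verify_rounds;
        const std::size_t base = seq.size();
        for (std::size_t i = 0; i < want; ++i) {
            const Token expected = target(seq);
            seq.push_back(expected);  // the target's token is always kept
            if (expected != proposal[base + i]) break;  // rest of draft is discarded
            if (stats) ++stats->accepted_draft_tokens;
        }
    }
    return seq;
}
```

The useful property to test is that the output never differs from plain target decoding, whatever the draft proposes; only the number of verify rounds changes.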
Project 4: the flagship
shabda: real-time voice AI pipeline
End-to-end: microphone input -> Whisper STT -> local LLM -> TTS -> speaker output. WebSocket server in C++. Full duplex with interruption handling. Latency budget per stage. Deployed as a demo anyone can run. This is the project that gets you the recruiter email.
Signal the work publicly
Write 1 technical post/week on LinkedIn about what you're building
Not summaries of articles. Your own findings, failures, benchmarks. "I ran X and found Y."
Publish 2 long-form blog posts on your personal site
"How I built a streaming inference server in C++" and "Benchmarking open-source LLM inference"
All 4 projects starred, documented, and CI-passing on GitHub
Recruiters look at GitHub before they look at your resume. Make it easy for them.
Milestones by week 26
One merged PR in a major open-source inference project
shabda demo runs end-to-end in under 800ms
500+ GitHub stars across all projects combined
Resume updated: inference infra as the lead skill
First recruiter outreach from a target company