Llama.cpp

Project Overview

An academic project built by a team of four students to deploy private, offline artificial intelligence. The objective was to design a system that automates downloading and running large language models locally with hardware acceleration.

Key Implementation Details

Containerized Orchestration: Developed an integrated Docker Compose environment that automates installing and running the Llama.cpp libraries.

Hardware Pass-through: Configured NVIDIA Container Toolkit integration so that containers can use CUDA on the host GPUs, accelerating inference.

Quantization Pipeline: Wrote scripts that automate model quantization and output GGUF-formatted models, reducing local RAM usage while maintaining response quality.

🔗 GitHub Repository
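
As an illustration of the quantization pipeline described above, here is a minimal sketch of how such a script might assemble its two steps: converting a Hugging Face checkpoint to an f16 GGUF file, then quantizing it. This is not the project's actual code. The function name build_quantize_commands, the directory layout, the output file names, and the Q4_K_M default are all assumptions made for illustration; only convert_hf_to_gguf.py and llama-quantize are tools that ship with llama.cpp.

```python
import shlex
from pathlib import Path

# Tools shipped with llama.cpp; where they live on disk is an assumption.
CONVERT_SCRIPT = "convert_hf_to_gguf.py"
QUANTIZE_BIN = "llama-quantize"


def build_quantize_commands(model_dir, out_dir, quant_type="Q4_K_M"):
    """Return two argv lists: HF checkpoint -> f16 GGUF -> quantized GGUF.

    Hypothetical helper: file naming and the default quantization type
    are illustrative choices, not taken from the project.
    """
    name = Path(model_dir).name
    f16_path = (Path(out_dir) / f"{name}-f16.gguf").as_posix()
    quant_path = (Path(out_dir) / f"{name}-{quant_type}.gguf").as_posix()

    # Step 1: convert the original checkpoint to an unquantized GGUF file.
    convert = ["python", CONVERT_SCRIPT, model_dir,
               "--outfile", f16_path, "--outtype", "f16"]
    # Step 2: quantize the f16 file to the requested type.
    quantize = [QUANTIZE_BIN, f16_path, quant_path, quant_type]
    return [convert, quantize]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch is safe
    # to execute on a machine without llama.cpp installed.
    for cmd in build_quantize_commands("models/my-model", "out"):
        print(shlex.join(cmd))
```

Building the commands as argument lists keeps the step easy to test and lets a caller pass each one to subprocess.run(cmd, check=True) inside the container once the paths match the real setup.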