Project Overview
An academic project, built by a team of four students, for deploying private, offline AI. The objective was to design a system that automates downloading large language models and running them locally with hardware acceleration.
Key Implementation Details
- Containerized Orchestration: Developed an integrated Docker Compose environment that automates installing and running llama.cpp.
- Hardware Pass-through: Configured NVIDIA Container Toolkit integration so that containers can use CUDA on the host GPUs, speeding up inference.
- Quantization Pipeline: Wrote scripts that automate model quantization and output GGUF-formatted models, reducing local RAM usage while largely preserving response quality.
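
The three pieces above can be sketched in Python. This is a minimal illustration under stated assumptions, not the project's actual code. The image name, model paths, port, and the 7-billion-parameter figure are hypothetical placeholders, and the container image is assumed to use `llama-server` as its entrypoint. The GPU reservation follows the Docker Compose device syntax, and `Q4_K_M` is used only as an example quantization type for `llama-quantize`.

```python
from __future__ import annotations

import json


def gpu_service(image: str, model_path: str, gpu_layers: int = 99) -> dict:
    """Build a Compose service definition that reserves the host's NVIDIA GPUs."""
    return {
        "image": image,
        # Assumes the image's entrypoint is llama-server, so these are its arguments:
        # -m selects the model file, -ngl sets how many layers are offloaded to the GPU.
        "command": ["-m", model_path, "-ngl", str(gpu_layers),
                    "--host", "0.0.0.0", "--port", "8080"],
        "ports": ["8080:8080"],
        "volumes": ["./models:/models"],  # hypothetical host directory holding GGUF files
        # Compose GPU reservation; requires the NVIDIA Container Toolkit on the host.
        "deploy": {"resources": {"reservations": {"devices": [
            {"driver": "nvidia", "count": "all", "capabilities": ["gpu"]}
        ]}}},
    }


def quantize_command(src_gguf: str, dst_gguf: str, quant_type: str = "Q4_K_M") -> list[str]:
    """Argument list for llama.cpp's llama-quantize tool: input, output, type."""
    return ["llama-quantize", src_gguf, dst_gguf, quant_type]


def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough size of the weights alone in decimal GB (ignores KV cache and scale data)."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Hypothetical image name and model path, for illustration only.
    service = gpu_service("example/llama-cpp-cuda:latest", "/models/model-q4.gguf")
    # JSON is valid YAML, so this output can be saved and used as a Compose file.
    print(json.dumps({"services": {"llama": service}}, indent=2))
    print(quantize_command("/models/model-f16.gguf", "/models/model-q4.gguf"))
    print(weight_size_gb(7e9, 16), weight_size_gb(7e9, 4))  # prints "14.0 3.5"
```

The size estimate shows why quantization matters for local use: a hypothetical 7-billion-parameter model needs about 14 GB for 16-bit weights but about 3.5 GB at an idealized 4 bits per weight. Real GGUF quantization types store extra per-block scale data, so actual files are somewhat larger than this estimate.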
🔗 GitHub Repository
