Project Overview

An academic project, built with a team of four students, for deploying private, offline AI. The goal was to design a system that automates downloading large language models and running them locally with GPU acceleration.

Key Implementation Details

  • Containerized Orchestration: Built a Docker Compose environment that automates installing and running llama.cpp.
  • Hardware Pass-through: Configured the NVIDIA Container Toolkit so that containers can use CUDA on the host GPUs, accelerating inference.
  • Quantization Pipeline: Wrote scripts that automate model quantization and output GGUF-formatted models, reducing local RAM usage while largely preserving response quality.
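
As a rough illustration of the quantization step, the sketch below builds a command line for llama.cpp's quantize tool and estimates how much memory the weights alone would need at a given precision. It is a minimal sketch under stated assumptions, not the project's actual scripts: the function names, the `models/model-f16.gguf` path, the `llama-quantize` binary name on `PATH`, the `Q4_K_M` preset, and the 4.5 bits-per-weight figure are all illustrative choices.

```python
from pathlib import Path


def build_quantize_command(
    src: Path,
    quant_type: str = "Q4_K_M",        # assumed preset, chosen for illustration
    binary: str = "llama-quantize",    # assumed name of the llama.cpp quantize tool
) -> list[str]:
    """Return an argv list of the form: <binary> <input.gguf> <output.gguf> <type>."""
    # Hypothetical naming scheme: write the output next to the input file.
    dst = src.with_name(f"{src.stem}-{quant_type}.gguf")
    return [binary, str(src), str(dst), quant_type]


def estimate_weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Estimate memory for the weights only, in GiB (ignores KV cache and buffers)."""
    return n_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    # Hypothetical input path; a real pipeline would pass this argv to subprocess.run.
    cmd = build_quantize_command(Path("models/model-f16.gguf"))
    print(" ".join(cmd))

    # Illustrative comparison for a 7-billion-parameter model.
    # 4.5 bits per weight is an assumed average, not a measured value.
    for label, bits in [("16-bit", 16.0), ("assumed ~4.5-bit", 4.5)]:
        print(f"{label}: {estimate_weight_gib(7e9, bits):.1f} GiB")
```

The estimate covers weights only, so actual RAM or VRAM use at inference time will be higher once the context cache and runtime buffers are included.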

🔗 GitHub Repository