Quincy is my custom-built personal AI assistant. The goal of Quincy is to build a useful AI assistant that knows everything about me, without me having to worry about other people having my data. Quincy is both a voice and a text assistant, to allow for whichever mode of interaction is most convenient at a given moment. This project is also an excuse to keep up with the latest models and research in the ML field.

Philosophy of the Code

I want to have the AI model inference separated from the backend logic. This lets me deploy models on whatever device is best, and it keeps me from having to deal with the mass of code that a monolith would become. It also allows models to be called by other projects I am working on without having to interface with the assistant's logic. The project is also meant to be adaptable. The ML field is advancing rapidly, with new, better models coming out every week. Because of this, Quincy is more of a framework that allows easy replacement, fine-tuning, and evaluation of models, so it can always use the latest and greatest that the field has to offer. A rough sketch of what the separation looks like from the backend's side is below.
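As an illustration of the split, the backend only ever talks to a model server over HTTP. The sketch below assumes a vLLM instance exposing its OpenAI-compatible API on localhost:8000 and uses the openai Python client; the host, port, and model name are placeholders, not my actual deployment.

```python
# Minimal sketch: backend logic talks to a separately deployed model server.
# Assumes a vLLM (or similar) server exposing an OpenAI-compatible API at
# http://localhost:8000/v1 -- host, port, and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

def ask_assistant(question: str) -> str:
    """Send a user question to whatever model happens to be deployed right now."""
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # swappable without touching backend code
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(ask_assistant("What's on my calendar today?"))
```

Swapping models or moving inference to another machine only changes the server side; the backend code stays the same.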

Hardware

Since we are running a local AI assistant, it would be good to list what hardware constraints we are working with. My current PC is as follows:

PCPartPicker part list (total: $1729.93)

Built it in April 2023; the CPU came from Micro Center and the 3090 was bought used from eBay. I had to return the hard drive twice before I got one that worked, but otherwise everything went smoothly. The system is dead quiet, except when I am training a model on the 3090 for an extended period of time, and even then it's no louder than the city streets outside my window.

I may upgrade the RAM to 128GB in the future, as I find myself hovering in the 50-60GB range a bit more often than I would like.

Future Hardware Plans

In the future I would like to build a GPU server with anywhere between three and eight 3090s in it, to handle inference for multiple models and tasks and to run training jobs without having to take down the inference models.

I also want to either get an M2 or M4 Mac (the M3 does not actually have better TPS than the M2) with the maximum amount of RAM, or build a beefy CPU server (64 cores, 256GB+ of RAM), to host larger models that don't make sense to run on a consumer GPU cluster. I am most interested in running the large DeepSeek V2 coder/chat models: they have roughly GPT-4-level ability while only having ~20B active params, so they will not actually be that slow given their total size of over 230B params. Current back-of-the-envelope math says that on either system I should be able to get 30 TPS or more using llama.cpp or MLX.
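To show what that back-of-the-envelope math looks like: decoding is roughly memory-bandwidth bound, so tokens per second is about memory bandwidth divided by the bytes of active weights read per token. The bandwidth figures and quantization level below are assumptions for the sketch, not measurements.

```python
# Rough upper-bound TPS estimate for a ~20B-active-param MoE model.
# Assumption: decode is memory-bandwidth bound, so TPS ~= bandwidth / bytes-per-token.
# Bandwidth numbers are ballpark hardware specs, not benchmarks.
active_params = 20e9      # ~20B params activated per token (MoE)
bytes_per_param = 0.5     # assuming roughly 4-bit quantization

bytes_per_token = active_params * bytes_per_param  # ~10 GB read per generated token

for system, bandwidth_gbs in [("M2 Ultra (~800 GB/s)", 800),
                              ("big DDR5 CPU server (~300 GB/s)", 300)]:
    tps = bandwidth_gbs * 1e9 / bytes_per_token
    print(f"{system}: ~{tps:.0f} TPS theoretical ceiling")
# Real llama.cpp / MLX throughput lands below these ceilings, but they suggest
# 30+ TPS is at least in reach.
```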

Architecture

LLMs

There are two main LLMs in use.

  • A small one
    • Currently: Llama-3.1 8B
  • A large one (not deployed yet)
    • Currently: DeepSeek Coder V2

The small model handles everyday question answering and function calling (for now; the function calling will hopefully be replaced by a DeBERTa model in the future). It is meant to run fast: served with vLLM on my 3090, it should be quick enough that the time to first word from the voice assistant is under one second, and its TPS should be at least 2x reading speed (~50 TPS). A sketch of how to check those two numbers is below.
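Here is a minimal sketch of checking those two targets by streaming a completion from the running vLLM server and timing the first token and overall throughput. It assumes the same placeholder endpoint and model name as the earlier sketch.

```python
# Minimal sketch: measure time-to-first-token and rough TPS over a streaming request.
# Assumes a vLLM OpenAI-compatible server at localhost:8000 serving the small model;
# the endpoint and model name are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Give me a quick summary of my day."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1

if first_token_at is not None:
    gen_time = time.perf_counter() - first_token_at
    print(f"time to first token: {first_token_at - start:.2f}s")          # target: < 1s
    print(f"throughput: ~{chunks / max(gen_time, 1e-6):.0f} tokens/sec")  # target: ~50 TPS
```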

The large model is used for coding, which is my main use case for LLMs, as well as for any agentic tasks, if I can ever get those working consistently.

TTS

Right now I am using StyleTTS, as it is a bit faster and better than XTTS. I am pretty disappointed in open-source TTS right now (which kind of makes sense; open-sourcing extremely good TTS models probably isn't the best thing for humanity at the moment), but I would like a model that sounds more lifelike in the future. The main reason people don't release SOTA TTS models is their voice-cloning ability, but for my AI assistant I don't really care about that; I just want it to be really good at one voice. If I ever get bored of tinkering with LLMs, I will probably start working on fine-tuning my own high-quality single-speaker model.

ASR

Right now I am using the nvidia/parakeet-ctc-0.6b model for automatic speech recognition. I am always torn between using one of the NVIDIA models and Whisper. I hate Whisper because its architecture isn't the objectively better one for ASR (a Conformer) and it was trained in the classic OpenAI way of "just throw more data at it until it's good," but at the end of the day the model is good and has a lot of community support.
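That community support is real: Whisper runs in a few lines through the Hugging Face transformers pipeline, with the dtype and single-file input fully under my control. The checkpoint and audio path below are placeholders.

```python
# Illustration of Whisper's community support: a few lines via transformers,
# with dtype and device under my control. Checkpoint and audio path are placeholders.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small.en",
    torch_dtype=torch.float16,  # my choice of dtype, unlike the NeMo path
    device=0,                   # run on the 3090
)
print(asr("some_recording.wav")["text"])
```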

On the other hand, to use the NVIDIA models you need to use NVIDIA NeMo, and anyone who has had to work with NVIDIA libraries before knows that it's a nightmare. Right off the bat, NeMo requires hundreds of dependencies to be installed, 99% of which I can guarantee are not necessary for running inference on the one model I want. You also don't get much control over the one line of code I actually need from the library: I can't change the dtype or pass in a single audio file. Nope. I get fp32 inference and an array of audio files to pass in. Someone (most likely me, in the future) needs to port the poor model out of there and into transformers or something. That one line looks roughly like the sketch below.
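For reference, that one line (plus the import and model load) looks roughly like this. It is a sketch from memory of the NeMo ASR API, so treat the exact call as an assumption; the audio path is a placeholder.

```python
# Rough sketch of Parakeet inference through NeMo. transcribe() wants a list of
# audio files and runs in fp32 -- the two things I can't change. Path is a placeholder.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-ctc-0.6b")
transcripts = asr_model.transcribe(["some_recording.wav"])  # list in, list out
print(transcripts[0])
```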

At the end of the day, though, the NVIDIA model is faster and more accurate, so it gets the nod in production.