4d ago

Specialize LLM

Hi, I'm not too informed about LLMs so I'll appreciate any correction to what I might be getting wrong. I have a collection of books I would like to train an LLM on so I could use it as a quick source of information on the topics covered by the books. Is this feasible?

8 comments

It is indeed possible! The nerd speak for what you want to do is 'fine-tuning on a dataset', the dataset being your books. It's a non-trivial task that takes setup, plus money to pay a training provider for their compute. There are no guarantees it will come out the way you want on the first bake either.
A softer version of this that's the big talk right now is RAG (retrieval-augmented generation), which is essentially a way for your LLM to look up an external dataset and pull the relevant pieces into its active context. It's a useful tool worth looking into, and much easier and cheaper than model training. But while your model can recall information with RAG, it won't really build an internal understanding of that information within its abstraction space. It's the difference between being able to recall a piece of information and internally understanding the concepts it's trying to convey. RAG is for rote memorization; training is for deeper abstraction-space mapping.
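To make the "recall information into its active context" part concrete, here is a rough sketch of what a RAG pipeline does before the model ever sees your question. It is only an illustration: the function names and sample passages are made up, and plain word overlap stands in for the embedding search a real setup would use.

```python
import re

def tokenize(text):
    """Lowercase a string and return the set of words in it."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, passages, top_k=2):
    """Rank passages by how many words they share with the question.

    Word overlap is a toy stand-in (an assumption for this sketch);
    real RAG tools compare embeddings instead.
    """
    q_words = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q_words & tokenize(p)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, passages, top_k=2):
    """Paste the best-matching passages into the prompt ahead of the question."""
    context = "\n\n".join(retrieve(question, passages, top_k))
    return (
        "Answer using only the excerpts below.\n\n"
        f"Excerpts:\n{context}\n\n"
        f"Question: {question}\n"
    )

# Hypothetical book excerpts, just for the demo.
passages = [
    "Sourdough starter needs regular feeding with flour and water.",
    "Cast iron pans should be seasoned with a thin layer of oil.",
    "A sourdough loaf is usually baked in a preheated Dutch oven.",
]
question = "How often should I feed my sourdough starter with flour?"
prompt = build_prompt(question, passages, top_k=1)
print(prompt)
```

The printed prompt is what gets sent to the LLM. The model itself is unchanged, which is why this is so much cheaper than training.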
- Would you recommend fine-tuning over RAG to improve domain-specific performance? My end goal would be a small, efficient and very specialised LLM to help get info on the contents of the books (all of them are about the same topic, from different POVs and authors).
  
  I would recommend you read over the work of the person who fine-tuned a Mistral model on many US Army field guides to understand what fine-tuning on a lot of books to bake in knowledge looks like.
  If you are a newbie just learning how this technology works, I would suggest trying to get RAG working with a small model and one or two books converted to a big text file, just to see how it works, because it's cheap or free to do some tool calling and fill up a model's context.
  Once you have a little more experience, and if you are financially well off to the point where 1-2 thousand dollars to train a model is who-cares play money to you, then go for fine-tuning.

The easiest option for a layperson is retrieval-augmented generation, or RAG. Basically you encode your books and upload them into a special kind of database (a vector database) and then tell a regular off-the-shelf LLM to check that data when making an answer. I know ChatGPT has a built-in UI for this (and maybe Anthropic too), but you can also build something out using LangChain or Open WebUI and the model of your choice.
The next step up from there is fine tuning, where you kinda retrain a base model on your books. This is more complex and time consuming but can give more nuanced answers. It’s often done in combination with RAG for particularly large bodies of information.
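For the RAG route, the "encode your books" step usually starts with splitting the text into overlapping chunks before anything gets embedded or stored. A minimal sketch of that split, assuming the book is already plain text; `chunk_words` and its default sizes are illustrative choices, not something a particular tool prescribes:

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words so a sentence that straddles
    a boundary still appears whole in at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

# Tiny stand-in for a book so the behaviour is easy to inspect.
book_text = " ".join(f"word{i}" for i in range(10))
chunks = chunk_words(book_text, chunk_size=4, overlap=1)
for c in chunks:
    print(c)
```

Each chunk is what would then be embedded and written to the database. Real tools often count tokens rather than words, but the idea is the same.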
- Umm, fine-tuning the model that makes the embeddings, right? Or is there an API for messing with the generative AI somewhere? Or are we assuming that newbie has a lot of compute resources? And they would have to use the generative model to create queries for their passages as well, right?
  I would try something like
  Guides | RAGFlow - https://ragflow.io/docs/dev/category/guides
  or a similar tool.
  Edit: not for fine-tuning, just to get started. Local models, RAG, your books are your knowledge base
  
  Making your own embeddings is for RAG. Many tools have standardized on an OpenAI-style embeddings API, but there are many ways to do it. Typically you embed one chunk of text at a time (often a few hundred tokens) and store the result in your vector database. This lets your AI later do some vector math (usually a cosine similarity search) to see how similar (related) the embeddings are to each other and to what you asked about. There are fine-tuning schemes where you make embeddings before the tuning as well, but most people today use whatever fine-tuning service their model provider offers, which usually comes with some layers of abstraction.
- And as far as I know, people do fine-tuning so the model picks up on the style of writing and things like that, for example to mimic an author or the specifics of a genre. I'd say that to just fetch facts from a pile of text, RAG would be the easier approach. It depends on the use case and the collection of books, however. Fine-tuning is definitely a thing people do as well.
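
The "vector math" mentioned in the replies above is small enough to show in full. This is only a sketch: `embed` here is a toy word-count vector invented for the example (a real setup would call an embedding model), but the cosine similarity formula is the standard one.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': a word-count vector. An assumption for this sketch;
    a real pipeline would get a dense vector from an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Dot product of the two vectors divided by the product of their lengths."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query, chunks, top_k=3):
    """Return (score, chunk) pairs for the chunks most similar to the query."""
    q = embed(query)
    scored = [(cosine_similarity(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]

# Hypothetical chunks standing in for rows in a vector database.
chunks = [
    "Tomato seedlings should be hardened off before planting outdoors.",
    "Water purification tablets need thirty minutes to work.",
    "Compost piles heat up when green and brown material is balanced.",
]
results = search("how long do water purification tablets take", chunks, top_k=1)
print(results[0][1])
```

A vector database does the same comparison, just over precomputed embeddings and with an index so it does not have to score every chunk.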

https://www.youtube.com/watch?v=5qlLJrv_q-Q
I have not watched this video but the description sounds like something you would benefit from
