Landmark Attention Oobabooga Support + GPTQ Quantized Models!
Models: https://huggingface.co/TheBloke/WizardLM-7B-Landmark
https://huggingface.co/TheBloke/Minotaur-13B-Landmark
Repo: https://github.com/eugenepentland/landmark-attention-qlora
Notes when using the models:
Trust-remote-code must be enabled for the landmark attention code to work correctly.
Add bos_token must be disabled in the Parameters tab.
"Truncate the prompt up to this length" must be increased to allow for a larger context. The slider goes up to a max of 8192, but the models can handle larger contexts as long as you have the memory. If you want to go higher, open text-generation-webui/modules/shared.py and increase truncation_length_max to whatever you want it to be (see the sketch after these notes).
You may need to raise the repetition_penalty when asking questions about a long context to get the correct answer.
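For reference, here's roughly what the shared.py edit looks like. This is a minimal sketch; the exact contents of the settings dict vary between webui versions, but truncation_length_max is the key named above.

```python
# text-generation-webui/modules/shared.py (sketch; surrounding keys may
# differ between versions -- only truncation_length_max needs to change)
settings = {
    # ... other settings unchanged ...
    'truncation_length': 2048,
    'truncation_length_min': 0,
    'truncation_length_max': 32768,  # was 8192; raise as far as your memory allows
    # ...
}
```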
Performance Notes:
Inference over a long context is slow. On the Quadro RTX 8000 I'm testing with, it takes about a minute to get an answer at a 10k-token context. Work is underway to improve this.
Remember that for complex queries the model is only as capable as its base model. If you don't get the answer you're looking for, it's worth testing whether the base model could answer the same question within its 2k context.
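Putting the notes together, here's a hedged sketch of scripting against one of these models outside the webui, using AutoGPTQ and transformers. It applies the same settings as above (trust_remote_code on, no BOS token, a repetition penalty for long-context questions). The file-layout details (use_safetensors, the example prompt and filenames) are assumptions on my part, not confirmed against the repos.

```python
# Hedged sketch: loading a Landmark GPTQ model outside the webui.
# Assumes auto-gptq and transformers are installed; file-layout details
# (use_safetensors etc.) are guesses -- check the HF repo's file listing.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

repo = "TheBloke/WizardLM-7B-Landmark"

tokenizer = AutoTokenizer.from_pretrained(repo, use_fast=False)
model = AutoGPTQForCausalLM.from_quantized(
    repo,
    device="cuda:0",
    use_safetensors=True,    # assumption; match the actual files in the repo
    trust_remote_code=True,  # required for the landmark attention code
)

long_context = open("big_document.txt").read()  # your >2k-token context (hypothetical file)
prompt = f"{long_context}\n\nQuestion: What does section 3 say?\nAnswer:"

# add_special_tokens=False mirrors disabling "Add bos_token" in the webui.
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda:0")

output = model.generate(
    **inputs,
    max_new_tokens=200,
    repetition_penalty=1.15,  # helps long-context questions, per the notes above
)
# Print only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```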