Skip Navigation

How reliable are modern LLMs?

I wanted to extract some crime statistics broken by the type of crime and different populations, all of course normalized by the population size. I got a nice set of tables summarizing the data for each year that I requested.

When I shared these summaries I was told this is entirely unreliable due to hallucinations. So my question to you is how common of a problem this is?

I compared results from Chat GPT-4, Copilot and Grok and the results are the same (Gemini says the data is unavailable, btw :)

So is are LLMs reliable for research like that?

21 comments
  • Absolutely not.

    • Definitely. The thing you might want to consider as well is what you are using it for. Is it professional? Not reliable enough. Is it to try to understand things a bit better? Well, it's hard to say if it's reliable enough, but it's heavily biased just as any source might be, so you have to take that into account.

      I don't have the experience to tell you how to suss out its biases. Sometimes, you can push it in one direction or another with your wording. Or with follow-up questions. Hallucinations are a thing but not the only concern. Cherrypicking, lack of expertise, the bias of the company behind the llm, what data the llm was trained on, etc.

      I have a hard time understanding what a good way to double-check your llm is. I think this is a skill we are currently learning, as we have been learning how to sus out the bias in a headline or an article based on its author, publication, platform, etc. But for llms, it feels fuzzier right now. For certain issues, it may be less reliable than others as well. Anyways, that's my ramble on the issue. Wish I had a better answer, if only I could ask someone smarter than me.


      Oh, here's gpt4o's take.

      When considering the accuracy and biases of large language models (LLMs) like GPT, there are several key factors to keep in mind:

      1. Training Data and Biases

      • Source of Data: LLMs are trained on vast amounts of data from the internet, books, articles, and other text sources. The quality and nature of this data can greatly influence the model's output. Biases present in the training data can lead to biased outputs. For example, if the data contains biased or prejudiced views, the model may unintentionally reflect these biases in its responses.
      • Historical and Cultural Biases: Since data often reflects historical contexts and cultural norms, models might reproduce or amplify existing stereotypes and biases related to gender, race, religion, or other social categories.

      2. Accuracy and Hallucinations

      • Factual Inaccuracies: LLMs do not have an understanding of facts; they generate text based on patterns observed during training. They may provide incorrect or misleading information if the topic is not well represented in their training data or if the data is outdated.
      • Hallucinations: LLMs can "hallucinate" details, meaning they can generate plausible-sounding information that is entirely fabricated. This can occur when the model attempts to fill in gaps in its knowledge or when asked about niche or obscure topics.

      3. Context and Ambiguity

      • Understanding Context: While LLMs can generate contextually appropriate responses, they might struggle with nuanced understanding, especially in cases where subtle differences in wording or context significantly change the meaning. Ambiguity in a prompt or query can lead to varied interpretations and outputs.
      • Context Window Limitations: LLMs have a fixed context window, meaning they can only "remember" a certain amount of preceding text. This limitation can affect their ability to maintain context over long conversations or complex topics.

      4. Updates and Recency

      • Outdated Information: Because LLMs are trained on static datasets, they may not have up-to-date information about recent events, scientific discoveries, or new societal changes unless explicitly fine-tuned or updated.

      5. Mitigating Biases and Ensuring Accuracy

      • Awareness and Critical Evaluation: Users should be aware of potential biases and inaccuracies and approach the output critically, especially when discussing sensitive or fact-based topics.
      • Diverse and Balanced Data: Developers can mitigate biases by training models on more diverse and balanced datasets and employing techniques such as debiasing algorithms or fine-tuning with carefully curated data.
      • Human Oversight and Expertise: Where high accuracy is critical (e.g., in legal, medical, or scientific contexts), human oversight is necessary to verify the information provided by LLMs.

      6. Ethical Considerations

      • Responsible Use: Users should consider the ethical implications of using LLMs, especially in contexts where biased or inaccurate information could cause harm or reinforce stereotypes.

      In summary, while LLMs can provide valuable assistance in generating text and answering queries, their accuracy is not guaranteed, and their outputs may reflect biases present in their training data. Users should use them as tools to aid in tasks, but not as infallible sources of truth. It is essential to apply critical thinking and, when necessary, consult additional reliable sources to verify information.

  • So are LLMs reliable for research like that?

    No. Of course not. They're not reliable for anything. They don't have any kind of database of facts and don't know or attempt to know anything at all.

    They're just a more advanced version of your phone's predictive text. All they do is try to figure out which words most likely go in what order as a response to the prompt. That's it. There is no logic of any kind dictating what an LLM outputs.

  • If you're using an LLM, you should limit the output via a grammar to something like json, jsonl, or csv so you can load it into scripts and validate that the generated data matches the source data. Though at that point you might as well just parse the raw data and do it yourself. If I were you, I'd honestly use something like pandas/polars or even excel to get it done reliably without people bashing you for using the forbidden technology even if you can 100% confirm that the data is real and not hallucinated.

    I also wouldn't use any cloud LLM solution like OpenAI, Gemini, Grok, etc. Since those can change and are really hard to validate and give you little to no control of the model. I'd recommend using a local solution like running an open weight model like Mistral Nemo 2407 Instruct locally using llama.cpp or vLLM since the entire setup will not change unless you manually go in and change something. We use a custom finetuned version of Mixtral 8x7B Instruct at work in a research setting and it works very well for our purposes (translation and summarization) despite what critics think.

    Tl;dr Use pandas/polars if you want 100% reliable (Human error not accounted). LLMs require lots of work to get reliable output from

    Edit: There's lots of misunderstanding for LLMs. You're not supposed to use the bare LLM for any tasks except extremely basic ones that could be done by hand better. You need to build a system around them for your specific purpose. Using a raw LLM without a Retrieval Augmented Generation (RAG) system and complaining about hallucinations is like using the bare ass Linux kernel and complaining that you can't use it as a desktop OS. Of course an LLM will hallucinate like crazy if you give it no data. If someone told you that you have to write a 3 page paper on the eating habits of 14th century monarchs in Europe and locked you in a room with absolutely nothing except 3 pages of paper and a pencil, you'd probably write something not completely accurate. However, if you got access to the internet and a few databases, you could write something really good and accurate. LLMs are exceptionally good at summarization and translation. You have to give them data to work with first.

21 comments