3mo ago

When AI is tested on questions it can't model from pre-existing answers on the internet, it only scores 10% in the test.

qz.com

Researchers just stumped AI with their most difficult test — but for how long?

15 comments

AI scores really low on subjects it’s never read about. Real shocker there. I’d put money on humans scoring even less on subjects they’ve never heard of.
- I’d put money on humans scoring even less on subjects they’ve never heard of.
  What they are testing is the ability to reason. The AI, or human, can still use the internet to find out the answer. Here's a sample question that illustrates the distinction.
  Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.
  
  Failing that question doesn't mean it can't independently reason; it just means it doesn't have the knowledge to reason about it. That question is basically: do you know how many paired tendons are attached to each of those bones, and can you add them up? If the AI, like 99.999% of people, doesn't know how many tendons are attached to those bones, it can't reason its way to the answer.
  If you give the AI a similar question about something it knows, it can reason through it fine. For example, the question:
  How many legs do 13 humans, 4 cats and 63 dogs have in total?
  ChatGPT 4o gives the answer:
  To calculate the total number of legs:
  Humans: Each human has 2 legs. 13 × 2 = 26 legs.
  Dogs: Each dog has 4 legs. 63 × 4 = 252 legs.
  Cats: Each cat has 4 legs. 4 × 4 = 16 legs.
  Now, add them together: 26 + 252 + 16 = 294.
  Total legs = 294.
  I guess I can't guarantee it's never seen this question before, but I'd say the odds of that are pretty low, and the odds that it's doing independent reasoning, as you call it, are high.
  
  That reads like the sort of thing Wolfram Alpha was designed to absolutely obliterate, if only the raw data representing each of those keywords had been loaded in.
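
For anyone who wants to check the leg-count arithmetic quoted above, here is a minimal Python sketch. The `LEGS_PER_ANIMAL` table and the `total_legs` helper are made-up names for illustration, and the per-animal leg counts are the same assumptions the quoted answer makes (2 per human, 4 per cat, 4 per dog).

```python
# Assumed legs per animal, matching the quoted answer (hypothetical lookup table).
LEGS_PER_ANIMAL = {"human": 2, "cat": 4, "dog": 4}


def total_legs(counts):
    """Sum legs over a mapping of animal name -> number of animals."""
    return sum(LEGS_PER_ANIMAL[animal] * n for animal, n in counts.items())


# The question from the thread: 13 humans, 4 cats and 63 dogs.
counts = {"human": 13, "cat": 4, "dog": 63}
print(total_legs(counts))  # 26 + 16 + 252 = 294
```

The point made in the thread still holds in this form: the sum is trivial once the per-animal numbers are known, and impossible without them.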
