AI Mimics Toddler-Like Learning to Unlock Human Cognition

Summary: The new AI model, based on the PV-RNN framework, learns to generalize language and action in a toddler-like manner by integrating perception, proprioception, and language instructions. Unlike large language models (LLMs) that rely on vast datasets, this system achieves compositionality through embodied interaction while requiring far less data and computing power.
Researchers have found the AI's modular, transparent design useful for studying how humans acquire cognitive skills such as combining language and action. The model offers insights for developmental neuroscience and could lead to safer, more ethical AI by grounding learning in behavior and making decision-making processes transparent.
Important Facts:
- Learning Like a Toddler: The AI learns compositionality by integrating sensory input, language, and action.
- Transparent Design: Its architecture allows researchers to study its internal decision-making processes.
- Practical Benefits: It requires far less data than LLMs and points toward embodied, transparent AI development.
Source: OIST
We humans excel at generalizing. If you teach a toddler to identify the color red by showing them a red ball, a red truck, and a red rose, they will most likely correctly identify the color of a tomato, even if it is the first tomato they have ever seen.
An important milestone in learning to generalize is compositionality: the ability to compose and decompose wholes into reusable parts, such as the redness of an object. How we acquire this ability is a key question in developmental neuroscience – and in AI research.
The first neural networks, which have since evolved into the large language models (LLMs) now transforming our society, were developed to study how the brain processes information.
Ironically, as these models grew more complex, their internal information-processing mechanisms became increasingly opaque, with some models today having billions of parameters that resist interpretation.
But now, members of the Cognitive Neurorobotics Research Unit at the Okinawa Institute of Science and Technology (OIST) have created an embodied intelligence model with a novel architecture that gives researchers access to the neural network's various internal states – and which appears to learn to compose in much the same way that children do.
Their findings have now been published in Science Robotics.
"This paper demonstrates one possible way for neural networks to achieve compositionality," said Dr. Prasanna Vijayaraghavan, first author of the study.
"Our model achieves this not by making predictions from vast datasets, but by combining language with vision, proprioception, working memory, and attention – just as toddlers do."
LLMs, built on a transformer network architecture, learn the statistical relationships between words from vast amounts of text data. They have access to essentially every word in every conceivable context, and from this they predict the most probable response to a given prompt.
In contrast, the new model is based on the PV-RNN (Predictive coding inspired Variational Recurrent Neural Network) framework, trained through embodied interaction involving three simultaneous input streams: vision, in the form of video of a robot arm moving colored blocks; proprioception – the sense of our limbs' movement – here the joint angles of the robot arm as it moves; and a language instruction such as "put red on blue."
The model is then tasked with generating either a visual prediction and the corresponding joint angles in response to a language instruction, or a language instruction in response to sensory input.
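The three input streams and the two generation tasks described above can be sketched as simple data structures. This is a minimal illustrative sketch, not the authors' code: the field names, feature shapes, and example values are all assumptions for clarity.

```python
from dataclasses import dataclass

@dataclass
class Timestep:
    vision: list          # visual features at this step (illustrative values)
    proprioception: list  # joint angles of the robot arm at this step

@dataclass
class Episode:
    instruction: str      # language stream, e.g. "put red on blue"
    frames: list          # ordered sequence of Timestep objects

# One toy episode pairing an instruction with a short sensorimotor sequence.
# The model is trained so that, given the instruction, it can generate the
# frames (vision + joint angles), and given the frames, it can generate the
# instruction.
episode = Episode(
    instruction="put red on blue",
    frames=[
        Timestep(vision=[0.1, 0.9], proprioception=[0.0, 0.5, 1.2]),
        Timestep(vision=[0.4, 0.9], proprioception=[0.2, 0.6, 1.1]),
    ],
)
```

The key design point is that language never appears in isolation: every instruction is bound to a concrete visuomotor sequence, which is what lets meaning emerge from embodied experience rather than text statistics.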
The system is inspired by the free energy principle, which holds that our brain continuously predicts sensory input based on past experience and takes action to minimize the difference between prediction and observation.
This difference, quantified as "free energy," is a measure of uncertainty, and by minimizing free energy, our brain maintains a stable state.
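The prediction-error loop described above can be illustrated with a toy example. This is a didactic sketch of the general idea – repeatedly nudging a prediction toward an observation to shrink the mismatch – and not the paper's actual model or objective; the learning rate and step count are arbitrary assumptions.

```python
def minimize_error(prediction: float, observation: float,
                   lr: float = 0.3, steps: int = 20) -> float:
    """Iteratively reduce the prediction-observation mismatch,
    a crude stand-in for minimizing a free-energy-like quantity."""
    for _ in range(steps):
        error = prediction - observation   # mismatch between belief and input
        prediction -= lr * error           # update belief to reduce mismatch
    return prediction

# Starting from a belief of 0.0, the prediction converges toward the
# observed value 1.0 as the error is driven down step by step.
final = minimize_error(prediction=0.0, observation=1.0)
```

In the real framework, minimization happens over a full probabilistic latent state rather than a single number, and the agent can also reduce error by acting on the world, not only by revising its predictions.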
With its limited working memory and attention span, the AI exhibits human-like constraints, forcing it to process inputs and update its predictions sequentially rather than all at once, as LLMs do.
By studying the flow of information within the model, researchers can gain insights into how it integrates its various inputs to carry out its simulated actions.
It is thanks to this modular architecture that the researchers have learned more about how toddlers may develop compositionality. As Dr. Vijayaraghavan recounts, "We found that the more exposure the model has to the same word in different contexts, the better it learns that word.
This mirrors real life, where a toddler will learn the concept of the color red much faster if they interact with various red objects in different ways, rather than just pushing a red truck around over and over again."
"Our model requires a much smaller training set and far less computing power to achieve compositionality. It makes more mistakes than LLMs do, but its mistakes resemble the ones humans make," said Dr. Vijayaraghavan.
It is precisely this feature that makes the model so useful to cognitive scientists, as well as to AI researchers trying to map the decision-making processes of their own models.
Although it serves a different purpose than the LLMs currently in use, and so cannot be meaningfully compared with them on performance, the PV-RNN nevertheless shows how neural networks can be organized to offer greater insight into their information processing: its shallow architecture lets researchers visualize the network's latent state – the evolving internal representation of information retained from the past and used in present predictions.
The model also addresses the poverty of the stimulus problem, which posits that the linguistic input available to children is insufficient to explain their rapid language acquisition.
Despite having a very limited dataset, especially compared to LLMs, the model still achieves compositionality, suggesting that grounding language in behavior may be an important catalyst for children's remarkable language learning ability.
This embodied learning may also point toward a safer, more ethical approach to AI in the future, both by improving transparency and by helping the model better understand the consequences of its actions.
Learning the word "suffering" from language alone, as LLMs do, may carry less emotional weight than it does for the PV-RNN, which learns meaning through embodied experience alongside language.
"We are continuing to work on improving the capabilities of this model and are using it to explore various domains of developmental neuroscience.
"We are excited to see what future insights into cognitive development and language learning processes it can give us," said Professor Jun Tani, head of the research unit and senior author of the paper.
How we acquire the intelligence needed to create our society is one of the great questions in science. While the PV-RNN is not the full answer, it opens new avenues of research into how information is processed in the brain.
"By studying how the model learns to combine language and action," said Dr. Vijayaraghavan, "we gain insight into the fundamental processes that underlie human cognition.
"It has already taught us a great deal about compositionality in language acquisition, and it showcases the potential of more efficient, transparent, and safe models."
About this AI and learning research news
Author: Jun Tani
Source: OIST
Contact person: Jun Tani – OIST
Image: The image is credited to Neuroscience News
Original Research: Closed access.
"Development of compositionality through interactive learning of language and action of robots" by Prasanna Vijayaraghavan et al. Science Robotics
Abstract
Development of compositionality through interactive learning of language and action of robots
Humans excel at applying learned behavior to unlearned situations. An important component of this generalization behavior is our ability to compose/decompose a whole into reusable parts, an attribute known as compositionality.
One of the fundamental questions in robotics concerns this characteristic: How can linguistic compositionality be developed in conjunction with sensorimotor skills through associative learning, particularly when individuals only learn partial linguistic compositions and their corresponding sensorimotor patterns?
To address this question, we propose a brain-inspired neural network model that integrates vision, proprioception, and language within a framework of predictive coding and active inference, on the basis of the free energy principle.
The effectiveness and capabilities of this model were assessed through various simulation experiments performed with a robot arm.
Our results show that generalization in learning to unlearned verb-noun compositions is significantly enhanced when the variability of the training tasks is increased.
We attribute this to self-organized compositional structures in the linguistic latent state space, which are substantially shaped by sensorimotor learning.
Ablation studies further show that visual attention and working memory are essential for accurately generating visuomotor sequences that achieve linguistically represented goals.
These insights advance our understanding of the mechanisms underlying the development of compositionality through the interaction of linguistic and sensorimotor experience.