Perception and Conception
It’s really hard to live like Helen Keller
Vision is essential for an organism to understand the world around it beyond touch. However, it's not a simple process. By some estimates, more than 50 percent of the cortex, the surface of the human brain, is involved in processing visual information. This shows how much processing power vision demands. The same holds for robots and other artificial intelligences.
Vision is very challenging, but adds enormous capability. Machine vision enables machines to perceive their surroundings, to recognize individuals and objects, to understand context, to discern the attributes of things, and to freely navigate environments. Without vision, machines could not know where to find objects to pick them off a conveyor or out of a box. Machines could not perform quality control on stock to check for damage or missing pieces. Nor could they recognize a catastrophic error, or notify a human colleague.
Advanced machine vision allows for greatly improved mobility and independence. Traditional industrial robots live in a literal cage and cannot easily be repositioned (let alone reposition themselves). Modern robots take themselves wherever they anticipate the greatest need, with minimal oversight or correction needed from a human guide.
We're living in a time of virtual worlds of rapidly increasing sophistication. The killer app of the metaverse isn't entertainment; it's teaching robots. Humans and machines can work together in a virtual sandbox, with humans teaching machines how to act in a certain situation, location, or context. By teaching in a virtual environment, we can quickly and cheaply demonstrate a very wide range of potential scenarios. Once a machine has gained that experience, its learning can be put to work immediately in the real world, enabling it to fold laundry, or to recognize uniquely deformed empty drinks cans as trash.
The wide variety of affordable and powerful GPUs (graphics cards) has been a game changer for machine vision, due to their speed, parallel processing on thousands of cores, and relative compactness and energy efficiency. Finally we have the raw computational power to enable machines to explore the world in real time, at a high frame rate.
Deep Learning techniques have been transformative for machine vision these past ten years, especially Convolutional Neural Networks, which are particularly well suited to vision tasks. However, in the past few years we have seen the emergence of a new generation of machine intelligence techniques, such as Transformers, which are capable of handling many different tasks in one model, unlike the deep but narrow focus of earlier deep learning models. Transformers are now taking over even specialist deep learning domains, and doing a better job of it.
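To give a feel for why convolutions suit vision, here is a minimal sketch in plain Python of the sliding-window operation at the heart of a Convolutional Neural Network. The function name `convolve2d`, the toy image, and the hand-written edge kernel are all illustrative assumptions; a real network learns its kernels from data and stacks many such layers.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map.

    Like most deep learning libraries, this computes a cross-correlation:
    the kernel is not flipped before it is applied.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy 4x6 grayscale image (assumed data): dark on the left, bright on the right.
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Hand-written vertical-edge kernel (assumed; a CNN would learn such weights).
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

response = convolve2d(image, kernel)
print(response)  # [[0, 3, 3, 0], [0, 3, 3, 0]]
```

The response is zero over the flat regions and peaks where dark meets bright. The same small set of weights is reused at every position, which is a large part of why this approach fits images so well.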
We can expect that the future of machine vision will be a blend of onboard and remote (cloud) intelligence. Onboard will be used for time-sensitive purposes, and remote intelligence will aid recognition of context and decision making, making sense of situation updates and sending back advice.
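One way to picture that onboard and remote split is as a simple routing rule: anything that needs an answer faster than the network can deliver stays on the machine. The sketch below is purely illustrative. The `VisionTask` type, the `choose_processor` function, the task names, and the 150 ms round-trip figure are all assumptions made for the example, not measurements or a description of any particular system.

```python
from dataclasses import dataclass


@dataclass
class VisionTask:
    name: str
    deadline_ms: float  # how soon the machine needs an answer


# Assumed cloud round-trip time, for illustration only.
CLOUD_ROUND_TRIP_MS = 150.0


def choose_processor(task, cloud_round_trip_ms=CLOUD_ROUND_TRIP_MS):
    """Keep time-sensitive work onboard; send everything else to the cloud."""
    if task.deadline_ms < cloud_round_trip_ms:
        return "onboard"
    return "cloud"


# Hypothetical tasks with made-up deadlines.
tasks = [
    VisionTask("obstacle avoidance", 20.0),
    VisionTask("scene context", 2000.0),
]

plan = {task.name: choose_processor(task) for task in tasks}
print(plan)  # {'obstacle avoidance': 'onboard', 'scene context': 'cloud'}
```

A real system would weigh more than one number (bandwidth, connectivity, privacy, battery), but the deadline comparison captures the basic idea.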
The advances in the past few years have been enormous. Few fundamental limitations to machine vision remain. It's now mostly a matter of productizing these new developments, and building in safety and ethical constraints to help ensure that the greater mobility and autonomy now possible don't accidentally result in greater liability.
The Cambrian Explosion, 530-545 million years ago, occurred when eyes first evolved, enabling primitive animals to understand their environment at a distance. It seems likely that the ability to make sense of multiple modalities of data in physical space will create a similarly rapid expansion in the capabilities of machines.