Multimodal human-computer interaction refers to the “interaction with the virtual and physical environment through natural modes of communication”. This implies that multimodal interaction enables freer and more natural communication, interfacing users with automated systems in both input and output.
There are five senses: sight, hearing, touch, taste and smell. Sight, hearing and touch are the most important in human-computer interaction. We can receive information from the computer through sight, hearing and touch, and we can send information to the computer through touch (e.g. a mouse) and sight (e.g. an eye-gaze system that tracks eye movements).
Some of the alternative modes of human-computer interaction are sound, touch, handwriting and gesture.
Sound in the interface
Sound is an important contributor to usability. There is experimental evidence to suggest that the addition of audio confirmation of modes, in the form of changes in key clicks, reduces errors. The dual presentation of information through sound and vision supports universal design, by enabling access for users with visual and hearing impairments respectively. It also enables information to be accessed in poorly lit or noisy environments. Sound can convey transient information and does not take up screen space, making it potentially useful for mobile applications. There are two types of sound that we can use: speech and non-speech.
Speech in the interface
The term speech interface describes a software interface that employs either human speech or simulated human speech.
There have been many attempts at developing speech recognition systems, but, although commercial systems are now commonly and cheaply available, their success is still limited to single-user systems that require considerable training.
Speech recognition problems we can identify
- Different people speak differently (accent, intonation, stress, idiom, volume).
- The syntax of semantically similar sentences may vary.
- Background noises can interfere.
- People often “ummm…..” and “errr…..”.
- Recognising the words is not enough: understanding a sentence requires intelligence. The context of the utterance often has to be known, as well as information about the subject and speaker. e.g. even if “Errr…. I, um, don’t like this” is recognised, it is a fairly useless piece of information on its own.
Uses of speech recognition
- Digital assistants like Amazon’s Alexa, Apple’s Siri, Google’s Google Assistant and Microsoft’s Cortana.
- Helpful for users with physical disabilities.
- Open use, limited vocabulary systems can work satisfactorily e.g. some voice-activated telephone systems.
- When users’ hands are already occupied (e.g. driving, manufacturing).
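The open-use, limited-vocabulary case above can be sketched in code. The following is a minimal illustration, not a real recogniser: it assumes the speech recogniser has already produced text, and the command phrases, filler words and action names are all invented for the example.

```python
# Hypothetical limited-vocabulary command matcher, of the kind a
# voice-activated telephone system might apply to recognised text.
COMMANDS = {
    "check balance": "BALANCE",
    "pay bill": "PAY",
    "speak to an agent": "AGENT",
}

# Hesitations and politeness words are stripped before matching.
FILLERS = {"umm", "um", "errr", "err", "uh", "please", "i", "want", "to"}

def match_command(recognised_text):
    """Strip filler words, then look for a known command phrase."""
    words = [w for w in recognised_text.lower().split() if w not in FILLERS]
    cleaned = " ".join(words)
    for phrase, action in COMMANDS.items():
        if phrase in cleaned:
            return action
    return None  # fall back to a re-prompt or a human operator

print(match_command("umm I want to check balance please"))  # BALANCE
```

Because the vocabulary is small and fixed, even noisy recognition with hesitations can still be mapped to a useful action.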
Speech synthesis is a form of output where a computer or other machine reads words out loud in a real or simulated voice played through a loudspeaker. Put simply, speech synthesis is the computer-generated simulation of human speech.
Speech synthesis problems we can identify
- The most difficult problem is that we are highly sensitive to variations and intonation in speech, and are therefore intolerant of imperfections in synthesized speech.
- The problems are similar to those of recognition - prosody particularly.
- Intrusive - it needs headphones or creates noise in the workplace.
- Transient - spoken output cannot be reviewed or browsed easily.
Uses of speech synthesis
- Natural and familiar way of receiving information.
- For users who are blind or partially sighted, synthesized speech offers an output medium which they can access.
- Useful as a communication tool to assist people with physical disabilities that affect their speech. It produces output that is as natural as possible with as little input effort from the user as possible, perhaps using a simple switch.
The meaning of non-speech sounds can be learned regardless of language. It can be used in several ways in interactive systems.
e.g. boings, bangs, squeaks, clicks etc.
Uses of non-speech sound
- It is often used to provide transitory information, like indications of network or system changes, or errors.
- Commonly used for warnings and alarms.
- It can also be used to provide status information on background processes.
- It provides a second representation of actions and objects in the interface to support the visual mode and provide confirmation for the user.
- It can be used for navigation round a system, either giving redundant supporting information to the sighted user or providing the primary source of information for the visually impaired.
- Dual-mode displays: information presented along two different sensory channels, redundant presentation of information, resolution of ambiguity in one mode through information in another.
- Sound is good for transient information and background status information.
e.g. sound can be used as a redundant mode on the Apple Macintosh: almost any user action (file selection, window active, disk insert, search error, copy complete, etc.) can have a different sound associated with it.
- Fewer typing mistakes with key clicks.
- Video games are harder without sound.
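The redundant, dual-mode use of sound described above can be sketched as an event-to-sound mapping. This is only an illustration: the event names and sound files are invented, and `play_sound()` just records what would be played rather than producing audio.

```python
# Hypothetical mapping from interface events to redundant sound cues,
# in the spirit of the Macintosh example above.
EVENT_SOUNDS = {
    "file_selected": "click.wav",
    "window_activated": "whoosh.wav",
    "disk_inserted": "chime.wav",
    "search_error": "buzz.wav",
    "copy_complete": "ding.wav",
}

played = []

def play_sound(filename):
    """Stand-in for a real audio call; just logs the cue."""
    played.append(filename)

def handle_event(event):
    # Visual handling would happen here; the sound is purely redundant,
    # so an unknown event simply plays nothing.
    cue = EVENT_SOUNDS.get(event)
    if cue is not None:
        play_sound(cue)

handle_event("disk_inserted")
handle_event("copy_complete")
print(played)  # ['chime.wav', 'ding.wav']
```

Because the audio channel only duplicates what the screen already shows, the interface still works with the sound turned off.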
Auditory icons use natural sounds to represent different types of objects and actions in the interface. Natural sounds are used because they have associated semantics which can be mapped onto similar meanings in the interaction such as breaking glass, cutting trees.
Auditory icons are used to represent desktop objects and actions. So, for example, a folder is represented by a papery noise and throwing something in the wastebasket by the sound of smashing. This helps the user to learn the sounds, since they suggest familiar actions from everyday life. The problem is that not all things have an associated, easily recognisable sound. Additional information can also be presented: muffled sounds can indicate that an object is obscured or an action is in the background, and the use of stereo allows positional information to be added.
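The idea of attaching natural sounds and extra state (muffling, stereo position) to desktop objects can be sketched as follows. The object names, sound files and rendering parameters are all invented for illustration; `render()` returns what would be played rather than playing it.

```python
# Hypothetical auditory-icon renderer: each desktop object has a
# natural sound, with muffling and stereo pan conveying extra state.
ICON_SOUNDS = {
    "folder": "papery_rustle.wav",
    "wastebasket": "smash.wav",
}

def render(obj, obscured=False, screen_x=0.5):
    """Return the sound plus rendering hints for an interface object."""
    return {
        "sound": ICON_SOUNDS[obj],
        "muffled": obscured,       # obscured objects sound muffled
        "pan": screen_x * 2 - 1,   # -1 = far left, +1 = far right
    }

print(render("wastebasket", screen_x=1.0))
# {'sound': 'smash.wav', 'muffled': False, 'pan': 1.0}
```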
Earcons use structured combinations of notes, called motives, to represent actions and objects. These vary according to rhythm, pitch, timbre, scale and volume. There are two types of combination of earcons and they are compound earcons and family earcons.
Compound earcons - Combine different motives to build up a specific action, for example combining the motives for ‘create’ and ‘file’.
Family earcons - represent compound earcons of similar types. For example, operating system errors and syntax errors would be in the ‘error’ family.
In this way, earcons can be hierarchically structured to represent menus. They are easily grouped and refined owing to their compositional and hierarchical nature, but the user must learn to associate each earcon with its specific task in the interface.
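The compositional structure of earcons can be sketched in code. This is a minimal illustration, assuming each motive is a short sequence of (pitch in Hz, duration in seconds) notes; the motives themselves are invented, and nothing is actually played.

```python
# Hypothetical motives: short note sequences of (pitch_hz, duration_s).
MOTIVES = {
    "create": [(440, 0.1), (550, 0.1)],
    "file":   [(330, 0.2)],
    "error":  [(220, 0.3)],   # shared family motive
    "syntax": [(660, 0.1)],
    "os":     [(770, 0.1)],
}

def compound(*names):
    """Combine motives to build a specific action, e.g. 'create file'."""
    notes = []
    for name in names:
        notes.extend(MOTIVES[name])
    return notes

def family(family_name, member_name):
    """Family earcons share a common leading motive."""
    return compound(family_name, member_name)

create_file = compound("create", "file")
syntax_error = family("error", "syntax")
print(create_file)   # [(440, 0.1), (550, 0.1), (330, 0.2)]
print(syntax_error)  # [(220, 0.3), (660, 0.1)]
```

Because every member of a family starts with the same motive, a user who has learned the ‘error’ prefix can recognise any error earcon, even an unfamiliar one.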
Difference between Earcons and Auditory icons
Auditory icons: emphasis on natural sounds and metaphor with the real world. e.g. the sound of filling a bottle with water to match moving a large file.
Earcons: artificial (generated) sounds.
e.g. a more abstract, metaphorical relationship to the action, or purely a convention (like corporate colour schemes).
Touch in the interface
Touch is the only sense that can be used both to send and to receive information. The use of touch in the interface is known as haptic interaction. Haptics can provide information on shape, texture, resistance, temperature and comparative spatial factors. It can be divided into two areas: cutaneous perception and kinesthetics.
Cutaneous perception - which is concerned with tactile sensations through the skin like vibrations on the skin.
Kinesthetics - which is the perception of movement and position like force feedback, resistance, texture, friction.
Examples of haptic devices:
- Electronic braille displays.
- Force feedback devices, which convey resistance, texture and friction, such as the PHANTOM, which allows users to touch virtual objects.
Handwriting is another communication mechanism which we are used to in day-to-day life. Handwriting recognition is the ability of a computer or device to take as input handwriting from sources such as printed physical documents, pictures and other devices, or to use handwriting as a direct input to a touchscreen and then interpret this as text.
Handwriting consists of complex strokes and spaces. This technology is used to input text to a computer by using a pen and a digitizing tablet. Here,
- Free-flowing strokes (made with a pen) are transformed into a sequence of coordinates.
- The sampled dots depend on pressure and movement: rapid movements give widely spaced dots, slow movements give closely spaced dots.
- Capturing all useful information such as stroke path, pressure etc in a natural manner.
- Segmenting joined-up writing into individual letters.
- Interpreting individual letters.
- Coping with different styles of handwriting.
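The capture and segmentation steps above can be sketched as follows. This is a simplified illustration: pen input is assumed to arrive as (x, y, pressure) samples, the sample data is invented, and real systems must segment joined-up writing, not just pen-lifts.

```python
# Split a pen trace into individual strokes at pen-lifts (pressure 0).
def segment_strokes(samples):
    """Split a sequence of (x, y, pressure) samples into strokes."""
    strokes, current = [], []
    for x, y, pressure in samples:
        if pressure > 0:
            current.append((x, y))
        elif current:          # pen lifted: close the current stroke
            strokes.append(current)
            current = []
    if current:
        strokes.append(current)
    return strokes

trace = [(0, 0, 1), (1, 1, 1), (2, 2, 0),   # first stroke, then pen-lift
         (5, 0, 1), (5, 1, 1)]              # second stroke
print(segment_strokes(trace))
# [[(0, 0), (1, 1)], [(5, 0), (5, 1)]]
```

Each resulting stroke would then be passed to a recogniser for interpretation as a letter or part of a letter.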
Large digitizing tablets make this suitable for digitising maps and technical drawings, and handwriting input is also used in PDAs and tablet computers.
What is a gesture? A gesture is a body motion used for communication. Being able to control the computer with certain movements of the hand would be advantageous in many situations where there is no possibility of typing, or when other senses are fully occupied.
- Dataglove - The dataglove provides easier access to highly accurate information, but is a relatively intrusive technology, requiring the user to wear the special Lycra glove.
- Computer vision
- An example for position sensing devices is MIT Media Room. The Media Room has one wall that acts as a large screen, with smaller touchscreens on either side of the user, who sits in a central chair. The user can navigate through information using the touchscreens, or by joystick, or by voice.
- Natural form of interaction - pointing.
- Combined with short, simple verbal statements, gestures are easily interpreted alongside a speech recognition system.
- Enhance communication between signing and non-signing users.
- It could support communication for people who have hearing loss if signing could be ‘translated’ into speech or vice versa.
- Expensive - the technology for capturing gestures is still expensive.
- User-dependent - gestures vary between users, and there are issues of co-articulation.
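A very simple form of gesture interpretation can be sketched in code. This is an invented illustration, not a real recognition algorithm: it assumes a dataglove or vision system has already produced a sequence of (x, y) hand positions, and classifies the overall motion as a swipe direction.

```python
# Classify a hand trajectory as a swipe by its dominant displacement.
def classify_swipe(points):
    """Classify a sequence of (x, y) positions as a swipe direction."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    dx, dy = x1 - x0, y1 - y0
    if abs(dx) >= abs(dy):                  # horizontal motion dominates
        return "right" if dx > 0 else "left"
    return "up" if dy > 0 else "down"       # otherwise vertical

print(classify_swipe([(0, 0), (3, 1), (8, 2)]))   # right
print(classify_swipe([(5, 9), (5, 4), (4, 0)]))   # down
```

Real systems face the problems listed above: the same swipe varies between users, and co-articulation blurs the boundary between one gesture and the next.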