Internet-Based Personal Services on Demand
Gottfried Zimmermann, Ph.D., zimmer@trace.wisc.edu
Gregg Vanderheiden, Ph.D., gv@trace.wisc.edu
Al Gilman, D.Sc., asgilman@iamdigex.net
Trace R&D Center, University of Wisconsin, Madison
August 16, 2001
Zimmermann, G., & Vanderheiden, G. (2002). Internet-Based Personal Services on Demand. In: Winters, J.; Robinson, C.; Simpson, R.; Vanderheiden, G. Emerging and Accessible Telecommunications, Information and Healthcare Technologies - Engineering Challenges in Enabling Universal Access. RESNA Press.
Abstract
Internet-based services tailored to a person's personal communication, information, and assistance needs could improve the quality of their lives. Examples include language translation for business and private communication, text transcription for lawyers and people with hearing impairments, assistance for drivers searching their way through unfamiliar environments, and assistance for people with mental retardation to help them live more independently.
This chapter introduces the "Modality Translation and Assistance Services on Demand" concept, a variety of remote personal services available anywhere and anytime, to enhance the lives of people with and without disabilities. It identifies the research and development challenges for a telecommunication and information infrastructure aiming to provide personal services on demand.
1 Introduction
We are at the edge of radical technological changes in our environments. Recent and ongoing advancements in telecommunications and information technologies are bringing the vision of pervasive computing closer to reality. Pervasive computing will allow a wide variety of devices and services to talk with each other, which in turn opens up new possibilities for personal services. While the future of machine-to-machine connectivity could make our lives easier, we still struggle with the very essentials of human-to-human conversation and human-machine interaction. For example, people don't understand each other because they speak different languages, or because one cannot hear and the other person does not understand sign language. To overcome these restrictions we could harness emerging technologies to provide personal services on demand remotely. The following fictional example illustrates this.
Sarah is a lawyer working at a legal agency in New York. Because she wants to get accurate minutes of a current trial, she uses a remote speech-to-text service to get text transcriptions of the court hearings. The resulting text is automatically stored on her handheld computer.
Back in her car, she decides to visit a new client who is deaf, Rob, located somewhere in the vicinity of New York. In order to get verbal driving directions, she speaks the name of the client to the Web-enabled car radio. When she arrives at Rob's office, he welcomes her by signing in American Sign Language. Instantly a clear voice from his pocket computer translates the signs into English language. When Sarah talks to Rob, a signing avatar being projected onto his glasses provides a speech-to-sign translation.
A week later, Sarah is on business travel, sitting in a restaurant in Tokyo. As she cannot read the Japanese menu card, she directs her handheld computer's camera to it. Having a scanned image of the menu on the screen of her handheld, she taps on the menu items on her handheld and gets an English translation of each, plus a symbol telling whether or not she may eat it on her special diet.
2 Personal Services on Demand
The "Modality Translation and Assistance Services on Demand" concept[1] includes a variety of services, like the ones described in the scenario above. These services are becoming possible as a result of recent technological advancements in wide-area, high-bandwidth networks and wireless communication technologies. The concept uses telecommunication technologies to allow people to call up services on demand at any time, from anywhere, and on a variety of access devices. The services are operated on a moment-by-moment basis, and users pay for a service only while they are using it.
2.1 A Diversity of Personal Services
Figure 1: Modality Translation and Assistance Services on Demand Spectrum
Modality translation and assistance services on demand can render information from one specific presentation form (mode) to another, or provide other forms of assistance on demand (see Figure 1).
- The Speech Recognition Service accurately translates speech to text, for any speaker. There is a wide range of applications for this service. By translating spoken information into an "ears-free" text format, it can be used in noisy environments (e.g. a factory building), enforced quiet environments (e.g. in a library, in a business meeting), by persons needing accurate meeting minutes, or by people with hearing impairments. By generating text without requiring typing or writing by the user, it can be part of a highly accurate speech interface for devices with tiny or no keyboards (e.g. car radio, PDAs, cell phones), or for users with limited manual dexterity. Finally it can be used to generate a text index for audio content of real-time audio and video streams.
- The Sign Language and the Sign Language Recognition Services facilitate communication between a person who is deaf and using a sign language, and a person who can hear and does not understand sign language. These services may be applied in any situation where sign language interpretation is required: in face-to-face conversations, phone calls (involving a video phone and a regular phone), meetings, or tele-collaborative sessions.
- The International Language Translation Service translates from one language to another. It can be applied either real-time to spoken language (in face-to-face conversations, phone calls, meetings, or tele-collaborative sessions), or to written content in a foreign language (e.g. in documents, e-mails, or web pages).
- The Assistance/Mentoring Service provides expertise or assistance in certain situations. This includes help for people in unfamiliar surroundings (e.g. driving directions), mentoring or assistance (e.g. help desks, or medical assistance), or specialized expertise (e.g. for medical doctors, scientists). This service also provides assistance for people with cognitive impairment or older people needing help to live more independently.
- The Language Simplification Service simplifies text, or real-time speech, presented in complex language or at an expert knowledge level. This service addresses the needs of people with cognitive impairment, but also anyone listening to someone who is much more expert on a topic (e.g. a doctor in an emergency room), and anyone who needs to present a highly sophisticated topic to an audience not familiar with it (e.g. a scientist writing a newspaper column about their science).
- The Print Recognition Service provides text or speech for written information available only in print or pixel-based formats. This service delivers textual content for people with visual impairments (e.g. for reading paper documents, getting text buried in online images, or reading menus or environmental signs). Another application for this service is generating an electronic index for real-time video streams with textual content. Coupled with the international language translation service it provides instant translation for scanned information in a foreign language (e.g. menu cards in restaurants, tourist panels).
- The Image/Video Description Service provides a speech or text equivalent for a visual image or video. When used by people with visual impairments, it can provide a verbal description where no text-based information is available (e.g. for a live photo on the web, or for a diagram being discussed in a meeting or tele-collaborative environment).
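The coupling of services mentioned above (print recognition feeding international language translation for a menu card) can be pictured as simple function composition. The following is a minimal sketch under stated assumptions, not an implementation of any real service: the names `chain`, `print_recognition`, `language_translation`, and `TOY_DICTIONARY` are hypothetical, the "image" is just a tagged string, and the dictionary lookup merely stands in for machine translation.

```python
from typing import Callable

# Assumption: a service is modelled as a function from one representation
# (here a plain string) to another.
Service = Callable[[str], str]


def chain(*services: Service) -> Service:
    """Concatenate services so that the output of one feeds the next."""
    def pipeline(data: str) -> str:
        for service in services:
            data = service(data)
        return data
    return pipeline


def print_recognition(image: str) -> str:
    """Hypothetical stand-in for OCR: strips a fake 'image:' tag."""
    prefix = "image:"
    return image[len(prefix):] if image.startswith(prefix) else image


# Hypothetical toy vocabulary standing in for a translation engine.
TOY_DICTIONARY = {"sakana": "fish", "gohan": "rice"}


def language_translation(text: str) -> str:
    """Hypothetical stand-in for translation: word-by-word lookup."""
    return " ".join(TOY_DICTIONARY.get(word, word) for word in text.split())


# Print recognition feeds language translation, as in the menu card example.
read_menu = chain(print_recognition, language_translation)
print(read_menu("image:sakana gohan"))  # fish rice
```

Because every stand-in shares the same signature, further services (for example a speech output step) could be appended to the chain without changing the others.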
2.2 Try Harder
One barrier to moving forward is that today there is no sophisticated, fully automated, and reliable implementation of any of the services described above. Instead, we still rely on human assistance to provide a service, or to verify and correct the results of a machine-provided service. For example, speech recognition software installed locally on a wearable computer may suffice for some face-to-face conversations if it is quiet and the people speak carefully and clearly. However, it may fail when there is too much background noise. In this case a more sophisticated (and more expensive) service implementation employing noise suppression running on a powerful network computer may be used in order to yield reasonable quality of text transcription. Again, this implementation might fail when dealing with a speaker's strong foreign accent. In this case a human-assisted service implementation could meet the user's needs. For example, a "re-speaking" method may be used where a person listens to the conversation remotely and re-voices everything distinctly into a high-quality speech recognition system, checking the output for errors.
Although fully automated services may be possible with future technologies, today's implementations are not as mature as needed in most situations. In these cases a "Try Harder" feature could be used to easily promote the task to more powerful applications (network-enhanced services), or to human assistance in the automatic translation process. Thus a "try harder" feature would allow users to try the least expensive approach first, but have an easy way to escalate the power (and cost) as needed until the service works for whatever situation they find themselves in[2].
From the perspective of a service provider, the "Try Harder" feature is a convenient method for introducing future automated services today. Automated service implementations that are not yet reliable enough to sell on their own could be offered if they are backed up by more effective (albeit more expensive) services (humans or additional resources) until they are mature enough to be used stand-alone.
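The escalation idea can be sketched as a ladder of service tiers tried from the least to the most expensive. This is only an illustration under stated assumptions: the `Tier` and `Utterance` types, the three toy tiers, and the per-minute prices are hypothetical, and each toy tier simply "fails" under the conditions named in the example above (background noise, strong accent).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Utterance:
    """Toy input: the text spoken plus the conditions that make it hard."""
    text: str
    noisy: bool = False
    accented: bool = False


@dataclass
class Tier:
    name: str
    cost_per_minute: float  # hypothetical price, for ordering only
    transcribe: Callable[[Utterance], Optional[str]]  # None means failure


def local_asr(u: Utterance) -> Optional[str]:
    # Assumption: the local recognizer only copes with clean, clear speech.
    return None if (u.noisy or u.accented) else u.text


def network_asr(u: Utterance) -> Optional[str]:
    # Assumption: network-side noise suppression helps, accents still fail.
    return None if u.accented else u.text


def human_respeaker(u: Utterance) -> Optional[str]:
    # Assumption: the human-assisted tier always succeeds in this toy model.
    return u.text


TIERS: List[Tier] = [
    Tier("human", 2.00, human_respeaker),
    Tier("local", 0.00, local_asr),
    Tier("network", 0.10, network_asr),
]


def try_harder(
    tiers: List[Tier],
    utterance: Utterance,
    accept: Callable[[str], bool] = lambda result: True,
) -> Tuple[str, str]:
    """Try the cheapest tier first and escalate until a result is accepted."""
    for tier in sorted(tiers, key=lambda t: t.cost_per_minute):
        result = tier.transcribe(utterance)
        if result is not None and accept(result):
            return tier.name, result
    raise RuntimeError("no tier produced an acceptable result")


print(try_harder(TIERS, Utterance("hello", noisy=True)))  # ('network', 'hello')
```

The `accept` callback stands in for the user's judgment: rejecting a result has the same effect as pressing a "Try Harder" control, moving the request to the next, more costly tier.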
3 Building an Infrastructure for Personal Services on Demand
Almost all of the parts of modality translation and assistance services on demand exist as automatic or human-provided implementations today. There is, however, no implementation of the overall network concept and no infrastructure on which to build it. Building such an infrastructure for diverse Internet-based personal services on demand poses a number of challenges. We identify challenges for three different areas involved: communication networks (global access to a reliable and secure network), middleware standards (common service framework), and service implementations (automated service model that does not rely on human assistance).
3.1 Challenges for Global Access to a Reliable and Secure Network
The concept is for people to be able to use modality translation and assistance services on demand in virtually all situations. This means that people can tap into the future network of information and services using a number of different access devices (e.g. computers, wearable computers, public information kiosks, telephones, cell phones, PDAs, car equipment), outfitted with a variety of input and output devices (e.g. glasses with a built-in monitor and camera, earpieces, inconspicuous microphones) from any location (e.g. at work or at home, at school, on the road, on travel, in a tele-collaborative environment, in an emergency room).
As the network evolves to bridge the gap between the person requesting service and the service provider, this network connection has to be as reliable and secure as if the service provider were in the same room. The network must be able to provide the required bandwidth for different multi-media stream formats (text, audio, and video) with (almost) no time delay, and even be able to flexibly change bandwidth requirements on demand during a session. Eventually a network featuring a sophisticated "Quality of Service" (QoS) implementation could meet these requirements.
3.2 Challenges for a Common Service Framework
In order to develop a common service framework for modality translation and assistance services on demand, a standard meeting the following requirements would be needed (this standard may be part of a broader standard for Internet-based services):
- Provides a mechanism for registering service providers so that a service request could automatically be directed to an available service implementation.
- Provides a common service description scheme in order to distinguish different service types and features.
- Provides a mechanism for automatic dispatching of service requests, taking into account user preferences and service availability. (Consider commercial and volunteer models for service provision.)
- Specifies common formats for multi-media streams (text, audio, and video) that are appropriate for all access devices (see above).
- Provides a mechanism for splitting off and merging streams in a virtual meeting. Several combinations of services may be used in a virtual meeting, with each combination used by a different subset of users, spread over several locations.
- Specifies a registry and common formats for globally available speaker profiles (for speech recognition service).
- Modularity: Allows services to be combined (e.g. speech recognition and video description for a deaf-blind person) and concatenated (e.g. print recognition feeds the international language translation service for the translation of a restaurant menu card).
- Allows seamless switching of the user's access devices within a session (e.g. when leaving home the user wants to switch from the desktop computer to a cell phone).
- Allows seamless switching of the service provider within a session (facilitating shift changes during a session).
- Provides a micro-payment model for accounting and charging for delivered services. This includes third-party payment models.
- Accommodates a wide range of security and privacy technologies.
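The first three requirements (provider registration, a common service description, and automatic dispatching by user preferences and availability) can be illustrated with a small in-memory registry. This is a sketch under stated assumptions and not a proposal for the standard itself: the `Provider` fields, the feature labels, and the "cheapest matching provider wins" rule are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass
class Provider:
    """Hypothetical service description: type, features, price, availability."""
    name: str
    service_type: str            # e.g. "speech-recognition"
    features: FrozenSet[str]     # e.g. frozenset({"en", "realtime"})
    cost: float                  # hypothetical price per minute
    available: bool = True


class Registry:
    """Toy registry that providers register with and requests are sent to."""

    def __init__(self) -> None:
        self._providers: List[Provider] = []

    def register(self, provider: Provider) -> None:
        self._providers.append(provider)

    def dispatch(
        self,
        service_type: str,
        required: FrozenSet[str] = frozenset(),
        max_cost: Optional[float] = None,
    ) -> Optional[Provider]:
        """Return the cheapest available provider matching the request."""
        candidates = [
            p for p in self._providers
            if p.available
            and p.service_type == service_type
            and required <= p.features          # all required features offered
            and (max_cost is None or p.cost <= max_cost)
        ]
        return min(candidates, key=lambda p: p.cost, default=None)


registry = Registry()
registry.register(Provider("volunteer-asr", "speech-recognition",
                           frozenset({"en"}), 0.0, available=False))
registry.register(Provider("auto-asr", "speech-recognition",
                           frozenset({"en", "realtime"}), 0.1))
registry.register(Provider("respeaker", "speech-recognition",
                           frozenset({"en", "realtime", "human"}), 2.0))

choice = registry.dispatch("speech-recognition", frozenset({"realtime"}))
print(choice.name if choice else None)  # auto-asr
```

A volunteer provider is modelled here simply as one with zero cost; a real framework would also need the payment, security, and session-handover mechanisms listed above, which this sketch leaves out.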
3.3 Challenges for an Automated Service Model
Humans can provide all services mentioned in the Modality Translation and Assistance Services on Demand concept today. Connecting to human-assisted remote services facilitated by a globally available and reliable network represents a time-efficient and flexible service provision model, as opposed to the traditional model requiring advance arrangement, traveling and on-site presence of the service provider.
However, automated service applications implemented in the local and network-enhanced layer could facilitate a more cost-effective service model. Among the services, some (speech recognition, international language translation, and print recognition) are available as automated services today, but often rely on human assistance for verifying results and making corrections if needed (human-assisted layer). For the other services (sign language, sign language recognition, assistance/mentoring, language simplification, and image/video description) there are no implementations yet mature enough to be used even in conjunction with human assistance.
For each of these services we identify research issues and challenges, mainly in the area of artificial intelligence and natural language processing, that need to be addressed in order to develop highly sophisticated, automated implementations for modality translation and assistance services on demand:
- Research Issues and Challenges for Speech Recognition Service. Today's speech recognition works moderately well on modern desktop computers and for speakers who have trained the system in advance. However, highly accurate voice recognition technologies that can adapt themselves to any speaker, suppress disturbing influences like background noise, and improve recognition quality on the fly are needed. By recognizing the speaker's voice and its parameters, systems could automatically tap into globally available profiles. Grammatical and semantic analysis may further improve accuracy. One of the main challenges is to derive implementations that can run on handheld devices. However, mobile devices of tomorrow may have the same computational power as desktop computers today.
- Research Issues and Challenges for Sign Language Service. A sign language service implementation consists of three consecutive tasks: speech recognition (optional), machine translation, and sign language rendering (signing avatar). Above all, more research on sophisticated linguistic models for sign language processing is needed. Advances in the area of human modeling and computer-generated animation could eventually facilitate a speaking interlocutor being "morphed" into a signing avatar so that the person who is speaking will appear to be signing.
- Research Issues and Challenges for Sign Language Recognition Service. A sign language recognition service implementation consists of three consecutive tasks: sign tracking, machine translation, and text-to-speech translation (optional). The greatest challenges here are developing the advanced algorithms for face recognition and tracking of upper body movements necessary to deliver high-quality input for the machine translation process, and developing the sign gesture recognition system.
- Research Issues and Challenges for International Language Translation Service. This service will benefit from intensive research and world-wide collaborations in the area of natural language processing and machine translation. Challenges are mostly caused by the complexity and context-sensitive nature of human language. Progress made in this area will also facilitate other services being implemented.
- Research Issues and Challenges for Assistance/Mentoring Service. We anticipate that this service could be partly provided by automatic service implementations made possible by emerging Artificial Intelligence technologies. Expert systems should respond to and generate output in natural language, providing answers for special knowledge areas (e.g. help desk), and even assist in making every-day decisions in the lives of cognitively impaired and mentally retarded persons.
- Research Issues and Challenges for Language Simplification Service. An automatic implementation of this service depends on advancements in processing and generating natural language, machine translation, and information extraction algorithms. Eventually a language simplification system should react to questions and commands in natural language while simultaneously adapting to the user's language level.
- Research Issues and Challenges for Print Recognition Service. Optical character recognition (OCR) provides the basis for an automated implementation of this service. The main challenge for this service is to harness heuristics and Artificial Intelligence technologies to facilitate the process of serializing the textual parts of a complex layout (e.g. graphically arranged text, frames and columns), if no structural information for the document is available.
- Research Issues and Challenges for Image/Video Description Service. An automatic implementation would involve advanced image and pattern recognition algorithms, combined with natural language generation. The eventual goal for a fully-automated service implementation is to provide verbal descriptions for images and video that are indistinguishable from those of a human narrator.
For all these services, the "Try Harder" feature allows for a smooth transition from inferior to superior automated implementations, and from automated to human-provided service implementations. Beyond the challenges of implementing the individual services in an automated manner, there is the overall challenge of an automatic "Try Harder" feature. This feature could be facilitated by having each service implement a probabilistic model that keeps track of the probabilities associated with the output it provides. Then, based on heuristics, a more sophisticated service implementation (machine or human-based) could be automatically consulted for certain parts of the problem, or the whole assignment could be transferred to a superior (or inferior) service implementation, if appropriate.
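One possible heuristic for such an automatic decision is sketched below. It assumes, purely for illustration, that a service reports its output as segments with a probability attached to each; the function name `route`, the threshold of 0.8, and the "more than half the segments are doubtful" rule are hypothetical and not taken from any existing system.

```python
from typing import List, Tuple

# Assumption: each output segment carries the service's own probability
# estimate that the segment is correct, as a value in [0, 1].
Segment = Tuple[str, float]


def route(
    segments: List[Segment],
    threshold: float = 0.8,          # hypothetical confidence cut-off
    max_low_fraction: float = 0.5,   # hypothetical limit before full transfer
) -> Tuple[str, List[int]]:
    """Decide which segment indices to hand to a superior implementation.

    Returns ("escalate_parts", indices) when only some segments are doubtful,
    or ("transfer_all", all indices) when too many segments are doubtful.
    """
    low = [i for i, (_, prob) in enumerate(segments) if prob < threshold]
    if segments and len(low) / len(segments) > max_low_fraction:
        return "transfer_all", list(range(len(segments)))
    return "escalate_parts", low


transcript = [("the court", 0.95), ("is adjourned", 0.42), ("until Monday", 0.91)]
print(route(transcript))  # ('escalate_parts', [1])
```

Only the doubtful segment would be sent on to a more powerful or human-assisted implementation here, which keeps the cost of escalation proportional to how much of the output is actually uncertain.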
4 Related Work
Applications and services that provide personal translation and assistance services similar to those described in this paper are envisioned by researchers, developers, and service providers. Some interesting research and development results in this area are mentioned below. Once a service infrastructure is built up, these service implementations (or their successors) could be integrated into the concert of personal services on demand.
The company Vcom3D produces signing avatar software[3], that uses scripting technology to convey Web content by signing (see Figure 2). The TEAM system[4] uses a machine translation approach to translate English sentences to American Sign Language rendered by an avatar application. The ViSiCAST project[5] features a signing avatar that translates standardized content of a weather forecast Web page into several European sign languages. The iCommunicator[6] is a computer-based system providing speech to text translation, and speech to sign language translation via digital movie images. The SignTel interpreter[7] is a similar system. The Media Access Group[8] at Boston's WGBH provides text captioning and descriptive video services for the media industry. Ultratec in Madison, Wisconsin, provides a speech-to-text service called "FASTRAN" for telephone users with special premises[9]. The Trace R&D Center has demonstrated the application of speech-to-text translation to tele-collaborative environments at the SuperComputing conference 2001 in Denver, Colorado[10].
Figure 2: Signing Avatar 1.0 from Vcom3D
In the area of potential architectures, Foster and Kesselman (1999)[11] provide input for several aspects of the Modality Translation and Assistance Services on Demand concept, including communication and security issues. Postel and Touch (1999)[12] mention the convergence of media, of telephony, cable television, radio, and the Internet as a possible driving factor for a future network infrastructure. Foster (2000)[13] developed a set of requirements for an "Integrated Grid Architecture" and presents a candidate structure for this architecture.
Another potential architecture model is the eCommerce driven Universal Description, Discovery and Integration (UDDI) standard[14] which aims to connect buyers, suppliers, marketplaces, and service providers within a global, open electronic business framework.
5 Conclusion
The "Modality Translation and Assistance Services on Demand" concept has the potential to improve the quality of life for everybody with regard to human-to-human communication, access to information, and independent living. Moreover, it seamlessly integrates the needs of people with disabilities into a more general network-based service delivery model.
We have identified some challenges for building an infrastructure for Internet-based personal services on demand. In order to reach this goal, we rely on advancements in three areas. First, computer networks are needed that can make these services available to anybody from any location, on a reliable basis. Second, Web-based middleware standards are needed to provide a service framework that accommodates the structural needs of a common service infrastructure. And finally, advances in the area of Artificial Intelligence and natural language processing are needed in order to create fully automated implementations of these services, thus lessening today's dependence on expensive human assistance.
6 References
[1] Zimmermann, G., & Vanderheiden, G. (2001, March). Translation on Demand Anytime and Anywhere. CSUN's Sixteenth Annual International Conference, Los Angeles, California, March 19-24, 2001. Retrieved Aug. 14, 2001, from the World Wide Web.
[2] Vanderheiden, G. (in press). Telecommunications Accessibility and Future Directions. In: Abascal, & Nicolle, (eds.). Inclusive Guidelines for HCI. Taylor & Francis Ltd., in press.
[3] Wideman, C.; & Popson, S. (2001). Sign Language Assistive Technology Offers Access to Digital Media. Proceedings of CSUN's Sixteenth Annual International Conference on "Technology And Persons With Disabilities", Los Angeles, CA, March 19 - 24, 2001. Retrieved Aug. 14, 2001, from the World Wide Web.
[4] Zhao, L.; Kipper, K.; Schuler, W.; & Badler, N. (2000). A Machine Translation System from English to American Sign Language. Proceedings of AMTA-2000: Envisioning Machine Translation in the Information Future, Mexico, 2000.
[5] Verlinden, M.; Tijsseling, C.; & Frowein, H. (2001). A Signing Avatar on the WWW. International Gesture Workshop 2001, City University, London, April 2001. Retrieved Aug. 14, 2001, from the World Wide Web.
[6] Teach The Deaf, Interactive Solutions, Florida.
[7] SignTel Inc., Connecticut.
[8] Media Access Group, WGBH Educational Foundation, Boston, Massachusetts.
[9] Ultratec (2000). Some ultratec.com. Service announcement Aug. 2000. Retrieved Aug. 14, 2001, from the World Wide Web.
[10] Gores, N. (2002). Building an Accessible Access Grid. NCSA Access Online, Jan. 15, 2002. Retrieved Jan. 25, 2002, from the World Wide Web.
[11] Foster, I., & Kesselman, C. (1999). The Globus Toolkit. In: Foster, I., & Kesselman, C. (editors). The Grid Blueprint for a New Computing Infrastructure (chapter 11, pp. 259-278). San Francisco: Morgan Kaufmann.
[12] Postel, J., & Touch, J. (1999). Network Infrastructure. In: Foster, I., & Kesselman, C. (editors). The Grid Blueprint for a New Computing Infrastructure (chapter 21, pp. 533-566). San Francisco: Morgan Kaufmann.
[13] Foster, I. (2000). Building the Grid: An Integrated Services and Toolkit Architecture for Next Generation Networked Applications (Draft). Retrieved Aug. 14, 2001, from the World Wide Web.
[14] Universal Description, Discovery and Integration (UDDI).
Acknowledgments
This work was partly funded by the National Science Foundation (USA) via the Alliance Partnership for Advanced Computational Infrastructure within the Education, Outreach and Training (EOT) program; and the National Institute on Disability and Rehabilitation Research (NIDRR), US Department of Education under grants H133E980008, & H133E990006. Opinions expressed are those of the authors and not the funding agencies.
For more information see Modality Translation Services Program.

