Perakakis Manolis ,
Department of Electronics & Computer Engineering
Technical University of Crete
In my diploma thesis ``Distributed Speech Recognition for wired/wireless data networks (Internet,GPRS) using 2 Kbps coding'' (see  and related publications , , ) the use of Distributed Speech Recognition (DSR) over data instead of voice networks is proposed as a promising technology for information retrieval, especially from Wireless Information Devices (WID) like PDAs & smartphones. The emergence of wireless data networks like GPRS and the low data rate achieved (only 2 Kbps) make this technology promising although many related issues have still to be considered before the deployment of this technology to the real world.
For a technology to be successful, criteria like user friendliness, low cost and standard compliance, along with competitive advantages over similar technologies, need to be met. Of course one could study many related issues & topics, but to keep the scope limited to the most interesting issues and the organization as clean as possible, 4 different areas are discussed in this report, each including several related issues. First, a related standard for DSR proposed by ETSI (European Telecommunications Standards Institute) is reviewed. Then a close look at the GPRS standard is given, since it is the standard expected to give a boost to data applications over wireless networks. Next, some related work by W3C & the WAP Forum is reviewed, & the report closes by examining application development issues related to the WID platform.
Although all four areas are more or less related to DSR topic, they might seem to be irrelevant to each other. The purpose of this short report is not to closely connect all four topics together but rather just focus on 4 different topics as seen from the DSR perspective. The collection of related information although a bit difficult was fruitful enough, since knowledge from many interesting different areas was acquired. Compacting all that information to a small report was also a bit tedious. I hope that this report will be interesting to the reader too.
This section deals with a DSR proposal by ETSI. Most of the content was extracted from  and describes the STQ Aurora group's work on the standardization of a front-end for DSR applications. The general idea is well described but some interesting extra information is missing (or can be obtained through membership of the specific group) and some parts of it are rather confusing. Nevertheless, an effort is made to present most of the information as clearly as possible here.
The DSR idea is based on decoupling the front-end mechanism from the recognition process. This way the front-end mechanism can be located, for example, in a mobile terminal/phone (client) while the recognition process takes place in a remote central host (server). ETSI's work is focused on standardizing the front-end mechanism so that the front-end information produced by different clients is identical, no matter what kind of device the client is. Some requirements should be met, which according to the ETSI proposal are :
The focus is on using a data network instead of a voice network. The first reason is that mobile voice networks degrade recognition performance due to low bit rate speech coding & channel transmission errors. By using an "error protected" data channel instead, recognition performance can be kept at a higher level. The second reason is the use of one single speech coding scheme instead of today's 4 different speech coding schemes. Apart from that, easy integration of speech and data applications can be established and ubiquitous access from different networks with guaranteed performance can be achieved.
The block diagram of the DSR system is given in the following figure.
The front-end mechanism was proposed by Nokia and follows the approach used by most modern recognition systems. 13 coefficients per frame are produced and logE is added as an extra coefficient (the front-end is depicted in Figure 2).
The VQ technique was proposed by Motorola and uses a Split VQ scheme. A codebook of size 64 is used for each pair of coefficients from C1 to C12, yielding 6 x 6 bits, a total of 36 bits. A codebook of size 256 is used for the pair of C0 and logE coefficients (8 bits), yielding a total of 44 bits per frame. Adding 4 bits of CRC for error protection, a data rate of 4.8 Kbps is produced (48 bits/frame x 100 frames/sec). To make sure that VQ does not degrade recognition performance compared with using the floating point MFCC parameters directly, some experiments took place using both the Resource Management (RM) and ATIS databases.
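The bit budget above can be checked with a short sketch. It only reproduces the arithmetic given in the text (six 64-entry codebooks, one 256-entry codebook, 4 CRC bits, 100 frames per second) and adds a generic nearest-neighbour split VQ encoder. The codebook contents, the feature ordering and all function names are placeholders chosen for illustration; they are not the codebooks or the layout of the ETSI standard.

```python
import random

FRAMES_PER_SECOND = 100            # one feature frame every 10 ms (from the text)
PAIR_CODEBOOK_SIZES = [64] * 6     # (C1,C2), (C3,C4), ..., (C11,C12)
ENERGY_CODEBOOK_SIZE = 256         # (C0, logE)
CRC_BITS = 4


def index_bits(codebook_size):
    """Bits needed to address one entry of a power-of-two codebook."""
    return codebook_size.bit_length() - 1


def bits_per_frame():
    """VQ index bits for all seven pairs plus the CRC bits."""
    payload = sum(index_bits(s) for s in PAIR_CODEBOOK_SIZES)
    payload += index_bits(ENERGY_CODEBOOK_SIZE)
    return payload + CRC_BITS


def bit_rate_bps():
    return bits_per_frame() * FRAMES_PER_SECOND


def nearest_index(codebook, pair):
    """Index of the codeword closest (squared Euclidean distance) to `pair`."""
    return min(
        range(len(codebook)),
        key=lambda i: (codebook[i][0] - pair[0]) ** 2
        + (codebook[i][1] - pair[1]) ** 2,
    )


def encode_frame(features, codebooks):
    """Split VQ: quantize each consecutive pair with its own codebook.

    `features` holds 14 values; the order C1..C12, C0, logE is an assumption.
    """
    pairs = [(features[i], features[i + 1]) for i in range(0, len(features), 2)]
    return [nearest_index(cb, p) for cb, p in zip(codebooks, pairs)]


def make_placeholder_codebooks(seed=0):
    """Random codebooks, standing in for trained ones (illustration only)."""
    rng = random.Random(seed)
    sizes = PAIR_CODEBOOK_SIZES + [ENERGY_CODEBOOK_SIZE]
    return [[(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)] for n in sizes]


if __name__ == "__main__":
    print(bits_per_frame(), "bits/frame ->", bit_rate_bps(), "bps")
    print(encode_frame([0.1] * 14, make_placeholder_codebooks()))
```

Each frame is reduced to seven codebook indices, which is why the rate depends only on the codebook sizes and the frame rate and not on the numerical precision of the MFCC values.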
Some extra experiments to evaluate error robustness were also conducted using the TIdigits task. The process of creating a test database that has been subjected to channel errors is shown in Figure 3. A test set of digits at 20 dB SNR was used, with models trained on multicondition 8 kHz data. TETRA and GSM channels of varying quality (EP1 to EP3) were used and the channel error mask was applied before decoding. The results are shown in Table 3. Although EP3 represents an extreme, only a 5% drop in performance was noticed, in contrast with EFR GSM which gives a performance of 78.1% as shown in Figure 4.
The latency introduced over the wireless channel is 10 ms for the encoder, 9.6 ms for transmission over GSM (at 9.6 kbps), and 0 or 20 ms for the decoder (error free or error mitigation), yielding a total of about 40 ms at most (10 + 9.6 + 20 = 39.6 ms). In conclusion, the STQ Aurora group completed the preparation of a standard for DSR, proposed in February 2000. A second standard that will give half the error rate in noise is expected sometime in 2002. The working group will also cooperate with other organizations (WAP, W3C) in order to define appropriate combinations of protocols in the chain from the client terminal to the recognition server utilizing the DSR standard.
Since 1992, when GSM was first deployed in Europe, GSM mobile networks have grown enormously, creating a mass market with more than 150 million subscribers, about half the population. In parallel, the evolution and growth of the Internet has raised the demand for the integration of mobile and data communications in order to create new mobile data services. This industry shift to personal communication services opens up exciting new business opportunities for operators & content providers. Considering the business theory which suggests that successful apps & products are customer-driven, the adoption of the first mobile protocol to offer the above integration can be considered a major move towards the so-called "mobile internet". What GPRS really offers is integration with the Internet/Intranet by data-enabling current GSM networks. In this section some key features of GPRS networks are given, along with more technical information on network characteristics, architecture and protocol stack (see ).
Before talking about GPRS a short review of GSM is given (see ). GSM was the first digital mobile network to be deployed & incorporates key ideas like cellular coverage with adaptive cell size (see Table 1). For multiple access a combination of FDMA and TDMA is used. Specifically, a total of 25 MHz bandwidth is available, which is divided into 124 carriers of 200 kHz each (FDMA), each of which in turn carries 8 time-slots (TDMA). The speech coding used is RPE-LTP, producing a data rate of 13 Kbps (260 bits per frame of 20 msecs). The number of bits per frame finally put onto the radio link is 456, since error protection is a major consideration. The 260 bits are divided into 3 classes according to their sensitivity to errors. Class Ia contains the first 50 bits, Class Ib the following 132 bits and Class II the last 78 bits. A 3 bit CRC code is added to Class Ia; Class Ib and a 4 bit tail are then appended and the result is protected by a half rate convolutional code, which together with the unprotected Class II bits yields the total of 456 bits. Interleaving is used and the 456 bits are divided into 8 blocks of 57 bits, with 26 bits used for equalization as shown in Figure 2. Finally the signal is modulated using the Gaussian-filtered Minimum Shift Keying (GMSK) method.
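The channel coding arithmetic can be verified with a few lines. The sketch assumes the standard GSM full-rate split of 50 + 132 + 78 = 260 bits per 20 ms frame; the function and constant names are chosen for illustration only.

```python
FRAME_MS = 20                      # one speech frame every 20 ms

CLASS_IA_BITS = 50                 # most sensitive bits, CRC protected
CLASS_IB_BITS = 132                # convolutionally coded, no CRC
CLASS_II_BITS = 78                 # sent unprotected
CRC_BITS = 3                       # parity bits computed over Class Ia
TAIL_BITS = 4                      # flush bits for the convolutional coder
CODE_EXPANSION = 2                 # half rate code: 2 output bits per input bit
INTERLEAVE_BLOCKS = 8


def speech_bits_per_frame():
    return CLASS_IA_BITS + CLASS_IB_BITS + CLASS_II_BITS


def coded_bits_per_frame():
    """Bits per frame after CRC, tail bits and half rate convolutional coding."""
    protected_input = CLASS_IA_BITS + CRC_BITS + CLASS_IB_BITS + TAIL_BITS
    return protected_input * CODE_EXPANSION + CLASS_II_BITS


def rate_bps(bits_per_frame):
    return bits_per_frame * 1000 // FRAME_MS


def interleave_block_size():
    return coded_bits_per_frame() // INTERLEAVE_BLOCKS


if __name__ == "__main__":
    print(speech_bits_per_frame(), "speech bits ->", rate_bps(speech_bits_per_frame()), "bps")
    print(coded_bits_per_frame(), "coded bits ->", rate_bps(coded_bits_per_frame()), "bps")
    print(INTERLEAVE_BLOCKS, "blocks of", interleave_block_size(), "bits")
```

Error protection therefore raises the 13 Kbps speech stream to 22.8 Kbps on the radio link, and the 456 coded bits split evenly into the 8 blocks of 57 bits mentioned above.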
In order for GPRS to become a mass product it must offer features that add user value and friendliness. In this paragraph some of those key features are briefly discussed. One of the most important features GPRS offers is instant & constant connectivity. Instant connectivity means that the cumbersome setup procedure of circuit switched data calls will no longer be needed. Constant connectivity means that the mobile terminal is always ``online'' (attached to the network) whether data are being transmitted or not. In contrast with circuit switched calls, GPRS doesn't require the preallocation & setup of a virtual circuit. Resources are instead allocated dynamically, yielding better network efficiency and utilization for the operator & a new billing scheme for the subscriber, who will only pay for the data traffic he generates. This is perhaps the most beneficial feature for the end user, among other features like increased data rates (that theoretically can reach up to 171.2 Kbps - see next paragraph). GPRS features will be beneficial for many existing apps found in today's circuit switched environment (SMS, WAP, browsing, email, news-alerts), but new specially designed services & apps that take advantage of the GPRS environment are also expected to become available.
The maximum data rate of 171.2 Kbps is a theoretical limit that is far from the data rate actually expected. Because much of the press coverage of the upcoming GPRS is inaccurate, mostly owing to marketing, this paragraph aims to clarify the topic. The truth is that the expected data rate will be modest & vary significantly. Three restrictions are responsible for this : allocation of timeslots, restrictions of terminals & availability of Coding Schemes (CS). As noted above each carrier ``carries'' 8 timeslots, but not all of them will be available for GPRS traffic, since circuit switched traffic will remain the dominant traffic for many years. So perhaps 1 timeslot will be statically assigned to GPRS traffic, while the remaining 7 timeslots will be dynamically shared between GPRS & circuit switched traffic as depicted in Figure 3. Next, it is extremely difficult for terminals to support 8 timeslots for uplink/downlink, since this would impose great complexity & would require great processing & transceiver power for such a small device. So manufacturers will initially support (1,3) and up to (2,4) timeslots for uplink/downlink respectively. Third but not least, the availability of Coding Schemes is also a crucial factor. GPRS will support 4 different Coding Schemes (CS1-CS4) with data rates ranging from 9.05 to 21.4 Kbps per timeslot (see Table 2). 171.2 Kbps can be realized only when CS4 is used for all 8 timeslots, but this is not expected to happen, since supporting CS3 or CS4 requires changes to the Abis link (see Figure 4) and is very risky from an economic perspective. Given the above restrictions, a maximum of 53.6 Kbps (4 timeslots, CS2) can be achieved for downlink on the radio link. In practice the data rate seen at application level will be lower, depending on QoS, available timeslots & noise, yielding a typical rate of about 30 Kbps, far less than the advertised 171.2 Kbps!
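The figures quoted above follow directly from the per-timeslot rate of each Coding Scheme multiplied by the number of timeslots. The sketch below reproduces them. The text gives only the end points of the range (9.05 and 21.4 Kbps) and implies 13.4 Kbps for CS2 (53.6 / 4); the 15.6 Kbps value for CS3 is taken from the commonly quoted GPRS figures rather than from this report. Names are illustrative.

```python
# Radio link rate per timeslot, in bits per second.
CODING_SCHEME_BPS = {
    "CS1": 9050,
    "CS2": 13400,   # implied by the text: 53.6 Kbps over 4 timeslots
    "CS3": 15600,   # assumption: commonly quoted value, not stated in the text
    "CS4": 21400,
}
TIMESLOTS_PER_CARRIER = 8


def radio_link_rate_bps(coding_scheme, timeslots):
    """Peak radio link rate for a coding scheme and a number of timeslots."""
    if not 1 <= timeslots <= TIMESLOTS_PER_CARRIER:
        raise ValueError("timeslots must be between 1 and 8")
    return CODING_SCHEME_BPS[coding_scheme] * timeslots


if __name__ == "__main__":
    # Theoretical maximum: CS4 on all 8 timeslots.
    print(radio_link_rate_bps("CS4", 8) / 1000, "Kbps")
    # Realistic early terminal: CS2, 4 downlink and 2 uplink timeslots.
    print(radio_link_rate_bps("CS2", 4) / 1000, "Kbps downlink")
    print(radio_link_rate_bps("CS2", 2) / 1000, "Kbps uplink")
```

These are radio link peaks only; the rate seen by an application is lower again because of protocol overhead, retransmissions and timeslot sharing with circuit switched traffic.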
To obtain access to GPRS, a mobile phone or terminal that supports GPRS is needed, together with the subscription to and activation of an account on a GPRS enabled mobile network. Most network operators & mobile manufacturers are supposed to support GPRS by the 1st quarter of 2001. By early 2002 GPRS will be incorporated as a standard into new GSM phones, and later that year the second standardization phase of GPRS will become available, addressing even more issues & offering even better services (for the GPRS roadmap see Table 4). In the early phase of GPRS adoption the most common use will be the connection of a laptop computer to the Internet via a GPRS enabled mobile phone, but as apps mature and mobile phones become more advanced, direct use of GPRS from these phones is expected to form the bulk of the market.
GPRS may be the most successful standard on the way to 3rd generation mobile networks (UMTS), since it offers so many new features while remaining compatible with existing GSM networks. To upgrade an existing GSM network to GPRS, 2 new nodes need to be added, namely the Gateway and Serving GPRS Support Nodes (GGSN & SGSN). Hardware upgrades to the Base Station Controller (BSC or just BS) are needed, along with software updates for most of the remaining GSM nodes. A GPRS enabled GSM network is depicted in Figure 4. The main difference to note is that the BS now has two connections : one with the Mobile Switching Center (MSC) for circuit switched traffic and one with the SGSN node for packet switched data (GPRS traffic). The GPRS backbone is connected to other GPRS operators' backbones or directly to an external data network (Internet/Intranet) via the GGSN.
Going into more detail, it should be noted how the upgrade will affect the network components. Beginning with mobile stations (terminals), three classes will be available. Class A terminals will support simultaneous circuit & packet switched traffic, Class B will support simultaneous attach but not simultaneous traffic, & Class C will support only one kind of attach at a time. In the GPRS architecture the BS acts as a concentrator, bringing together connections from Base Transceiver Stations (25-150). The BS is responsible for setting up, supervising & disconnecting CS & PS connections to the MSC/SGSN respectively, as noted above. To support packet switched data, a hardware upgrade (the Packet Control Unit (PCU)) is required, which adds the Radio Link Control (RLC) & Medium Access Control (MAC) layers to the radio interface. The functionality of the GPRS Support Nodes (GSN) is given below :
Serving GSN functionality includes :
To give a better idea of how the GPRS protocol stack is implemented, Figure 6 highlights the transmission of data traffic between the ``main'' GPRS nodes (namely MS, BSS, SGSN & GGSN) and a host. The Application Layer may include other protocols like HTTP, SNMP and IMAP built over TCP/IP. TCP/IP is needed for interoperability with the Internet/Intranet and poses some interesting challenges regarding its successful implementation over wireless networks, which are briefly addressed :
First of all, recall that TCP was designed with wireline communications in mind & offers a reliable flow of data on the end-to-end connection, retransmission of unacknowledged packets, i.e. Automatic Repeat Request (ARQ), with buffers at both sender & receiver, and an adaptive timeout mechanism.
Implementing TCP over the GPRS wireless environment is radically different, since delay will be high and variable owing to the available bandwidth & high noise levels. Varying radio conditions will cause retransmissions at the Radio Link Control layer. When TCP assumes that packets were lost rather than just delayed, it falls back to the well known "slow start state". To avoid this, GPRS offers fast RLC ARQ, allowing many RLC retransmissions before TCP times out.
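Why a sudden delay spike looks like a loss to TCP can be illustrated with the standard retransmission timeout estimator (smoothed round trip time plus four times its variation, as specified in RFC 6298). The sketch below is a simplification under stated assumptions: the round trip times are invented for illustration, the 1 second lower bound on the timeout recommended by the RFC is left as an optional parameter, and every sample is fed to the estimator even after a timeout, which a real TCP would not do.

```python
class RtoEstimator:
    """Adaptive retransmission timeout in the style of RFC 6298."""

    ALPHA = 1 / 8   # gain for the smoothed RTT
    BETA = 1 / 4    # gain for the RTT variation
    K = 4           # variation multiplier

    def __init__(self, min_rto=0.0):
        self.min_rto = min_rto
        self.srtt = None
        self.rttvar = None

    @property
    def rto(self):
        return max(self.min_rto, self.srtt + self.K * self.rttvar)

    def update(self, rtt):
        """Feed one RTT measurement (seconds) and return the new timeout."""
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        return self.rto


def spurious_timeouts(rtt_samples, min_rto=0.0):
    """Indices of samples whose delay exceeds the timeout in force at the time."""
    estimator = RtoEstimator(min_rto)
    timeouts = []
    for i, rtt in enumerate(rtt_samples):
        if estimator.srtt is not None and rtt > estimator.rto:
            timeouts.append(i)   # TCP would retransmit and enter slow start
        estimator.update(rtt)
    return timeouts


if __name__ == "__main__":
    # Hypothetical GPRS link: steady 0.6 s RTT, then one 3 s stall caused by
    # link layer retransmissions, then back to normal.
    samples = [0.6] * 10 + [3.0] + [0.6] * 3
    print(spurious_timeouts(samples))
```

On a steady link the variation term shrinks and the timeout tightens towards the smoothed RTT, so a packet that is merely delayed by radio retransmissions can outlive the timer even though it was never lost.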
As far as IP is concerned, the IP user addresses will be allocated via the GGSN, from an ISP or LAN, using RADIUS/DHCP techniques. Both dynamic and static, and both private and public addresses, from either IPv4 or IPv6, will be available.
The SubNetwork Dependent Convergence Protocol (SNDCP) maps network level characteristics onto the underlying radio layers and deals with segmentation and with header & data compression. LLC (Logical Link Control) decouples the upper layers from the underlying radio interface, while RLC maps LLC data packets to RLC data blocks and supports the ARQ protocol. Finally MAC handles data traffic & control signaling on the physical radio interface. Figure 7 shows the RLC block & how it is transformed into a radio block according to the CS used.
Speech Recognition technology may be the dominant application for voice networks (wireline/wireless), but what are the chances of successful deployment of DSR applications for wireline/wireless data networks? Speech Recognition will have to compete with other technologies which try to adapt the Internet information space to the WID platform (WAP, W3C's XHTML). DSR is expected to be a complementary service where the other services fail to be appealing enough (e.g. because of the small screen size for XHTML). In this section the WID platform is presented and some of the competing technologies are reviewed, such as XHTML, WAP, the ETSI STQ Aurora applications group and W3C's Voice Interfaces.
The evolution of the Internet has made access to the enormous information space of the WWW easy. The demand for information retrieval is expected to be high for devices like PDAs & smartphones (WIDs) as well. But Internet technology has been designed for desktop computers supporting medium to high bandwidth connectivity over generally reliable data networks. Wireless devices present a more constrained computing environment compared to desktop computers (less powerful CPUs, less memory (ROM and RAM), restricted power consumption and input/output scalability). Moreover, wireless data networks present a more constrained communication environment compared to wired networks (less bandwidth, more latency, less connection stability and less predictable availability). So what are the currently available approaches to presenting WWW information on WIDs ?
The W3C approach is motivated by the huge device diversity versus the content to deliver. The first step to this approach is to categorize the different devices used for information retrieval along with the various input/output methods used for user interface. Such a possible categorization is shown in the following table.
|Output Methods||Devices||Input Methods|
|1600 x 1200 pixels||Desktop Computers||Windows+icons+mouse|
|1024 x 768 pixels||Set top box + TVs||Full keyboards|
|800 x 600 pixels||Public Kiosks||On screen keyboard|
|640 x 480 pixels||Watches||Touch Tablet|
|Web TV||Web pads||Pens|
|Public kiosk||Handhelds/PDAs||Speech only devices|
|Palm Pilot size||Mobile Phones||Speech augmented|
|Cell phone size||Standard phones||Phone keypads|
|Very large screens||Pagers||Wheels, knobs|
|Audio only||Electronic books||Buttons|
|Audio augmentation||Walkmans||Hand gestures|
|3D screens, VR||Automobiles||Facial expressions|
|Color, Black/white||TTYs (no graphics)||Brainwave detection|
A simplified model for study is depicted in the next figure, dividing devices into linear, semi-linear and non-linear. Voice browsing, for example, is perhaps the only linear method, since information is presented to the user in a ``one piece at a time'' fashion.
Another simple model is to divide devices according to their presentation capabilities. One such categorization is shown in the next figure.
Since it is not possible to use a ``unified'' device to represent all devices, the key idea is to accept device diversity but keep that diversity from being applied directly to the content source. To support service convergence one should keep the content source the same, independently of the device it is presented on, and insert a transformation layer capable of making the right content transformation for a specific family of devices or just a specific device. In other words, the content is completely separated from its presentation. To add support for a new device, just a new presentation method is needed.
This way it would be possible to avoid creating directly different content for different devices as done today. Today one has to create separate sites :
The content source is written in XML (Extensible Markup Language) and transformed into [X]HTML by XSLT (Extensible Stylesheet Language Transformations) or DOM (Document Object Model), which in turn can be rendered using different CSS (Cascading Style Sheets) for different device families. XHTML will be the convergence of HTML and WML (Wireless Markup Language - used by WAP) and will be renderable everywhere (W3C & WAP cooperation is addressed later in this section). Several groups inside W3C have been formed in order to address content reuse issues. Composite Capability/Preferences Profiles (CC/PP) is one such group, which deals with the specification of the profiles the content transformation is based on. Other related groups, such as the ``Mobile Interest Group'', deal with issues specific to information access for mobile devices and are in close cooperation with the WAP Forum.
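The separation of content from presentation can be sketched in a few lines. The example below keeps a single XML content source and selects one renderer per device family. It uses plain Python functions as a stand-in for XSLT stylesheets; the element names, the device families and the renderers are all hypothetical and serve only to illustrate the idea.

```python
import xml.etree.ElementTree as ET

# Single content source, independent of any device (hypothetical schema).
CONTENT_XML = """
<news>
  <item><title>GPRS launch</title><summary>Operators prepare packet data services.</summary></item>
  <item><title>DSR standard</title><summary>ETSI publishes a front-end for speech recognition.</summary></item>
</news>
"""


def render_xhtml(root):
    """Full presentation for desktop-class browsers."""
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    for item in root.findall("item"):
        ET.SubElement(body, "h2").text = item.findtext("title")
        ET.SubElement(body, "p").text = item.findtext("summary")
    return ET.tostring(html, encoding="unicode")


def render_small_screen(root):
    """Titles only, for phone-sized displays."""
    return "\n".join(item.findtext("title") for item in root.findall("item"))


def render_voice(root):
    """One utterance per item, for linear (speech only) devices."""
    return [
        f"{item.findtext('title')}. {item.findtext('summary')}"
        for item in root.findall("item")
    ]


# The transformation layer: one presentation method per device family.
RENDERERS = {
    "desktop": render_xhtml,
    "phone": render_small_screen,
    "voice": render_voice,
}


def present(xml_text, device_family):
    root = ET.fromstring(xml_text)
    return RENDERERS[device_family](root)


if __name__ == "__main__":
    for family in RENDERERS:
        print(family, "->", present(CONTENT_XML, family))
```

Supporting a new device family means adding one entry to `RENDERERS`; the content source is not touched.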
The WAP Forum was formed by telecommunication leaders (Ericsson, Nokia, Motorola, etc.) in order to bring Internet content to mobile devices in the best possible way. The objectives of the WAP Forum are :
This effort created an initial hype (as with many promising technologies), but soon after the initial adoption of the standard, disappointment became evident in nearly every media discussion of WAP. It seems that the truth lies somewhere between hype and disappointment. First of all, WAP has the advantage of being bearer independent. Today it is experienced through GSM CSD, which not only imposes a cumbersome procedure on the user but is also expensive, since the user has to pay for the duration of the connection, independently of the amount of information retrieved. If GPRS proves to be a successful standard, WAP may again be considered a successful application. A possible mistake in the approach taken by WAP may be the rush to bring the Internet to mobile devices, which led to the unavoidable marriage of content & presentation, a radically different approach compared to W3C's content reuse.
In order to close the gap between the W3C & WAP approaches a cooperation has been established. The cooperation goals include the creation of a unified information space based on common standards and technologies & the design and delivery of sophisticated information and services to mobile devices. W3C's Mobile Access Interest Group is responsible for presentation of the information (e.g. through CSS), management of information (e.g. through RDF) & technologies that structure and distribute data as objects (e.g. XML and HTTP-NG). WAP related work will include issues dealing with bandwidth efficiency, smart web proxies, efficient protocols and content encoding, latency constraints and content scalability.
W3C has recently realized the importance of voice interfaces and has formed the Voice Browser Working Group. The interest for voice interfaces comes from the following facts :
Another figure showing the incorporation of IP interface to Speech Interface Framework is shown right after.
The IP interface can be used by standards like ETSI's DSR. In fact, ETSI has formed the ``DSR Applications & protocols'' group in order to implement complete end-to-end DSR services using the front-end standard (protocol elements, system architecture, API, etc.). The group intends to cooperate with W3C and the WAP Forum on harmonising protocol elements and a multimodal markup language for DSR applications.
In this section, issues affecting the development of related software for WIDs are brought into focus. First of all, one should make clear that application development for WIDs is entirely different from that for desktop PCs. WIDs impose a constrained environment for application development, mainly because of their hardware characteristics (slow CPU, limited amount of RAM & ROM, slow network connections). Although there are already many applications for PDA-like devices (for which well known APIs are offered), what about less PC-like devices such as smart-phones or even ``simple'' mobile phones? What are the choices for application development on such devices ? Two different development platforms, namely the Symbian & Java platforms, which appear to be two promising application environments for WIDs, are presented in this section.
As previously noted, WIDs impose a constrained environment for application development. CPU speeds are low, although new generations of CPUs for the embedded market have lately appeared. For example, for the PDA market there is currently a wide range of CPUs available, from the 16 MHz Motorola DragonBall processor to the latest 206 MHz Intel StrongARM processor. Available memory is restricted to up to 16 MB ROM and 32 MB RAM. But the User Interface is what is radically different. There is no mouse/keyboard and the preferred input modality is pen/buttons. Screen resolution is limited to 320x240 for advanced PDAs. And that is just for PDAs, which are considered a ``high-end'' solution. PDAs are powered by Operating Systems like Windows CE versions 2.* & 3.0 (mainly high-end PDAs), Symbian's EPOC, Palm OS and Linux, which is also a promising solution since it is widely adopted by many programmers & is open source. Symbian's EPOC is an OS specially designed from the ground up for devices ranging from high end PDAs to smart-phones. That is why we will have a closer look at it later in this section (see ).
For non-PDA devices, hardware & operating system specifications are hardly known to the public (for example nobody knows what OS runs on Ericsson's T10 mobile phone). That is why application development for those devices is still largely closed to third party developers & remains manufacturer-specific. Here comes the Java platform, acting as an OS wrapper and opening application development for such devices. In fact many mobile phone manufacturers claim they will support J2ME (Java 2 Micro Edition - see ) in their next generation phones, & estimates suggest that very soon at least 60% of development will be done using the Java platform.
The Symbian Platform is supported by Ericsson, Nokia, Motorola, Panasonic and Psion. As noted above it is an OS specially designed for WIDs from the ground up. Some key features of the Symbian platform, version 6.0, are :
The huge diversity of computing platforms, ranging from servers to mobile phones, is a fact. Java's promise to bring a ``common'' application environment which will be platform independent is ``accomplished'' by providing 3 different Java editions : Enterprise, Standard & Micro editions. Java 2 Micro Edition is defined for a wide range of devices (from mobile phones to PDAs). But even within this range, big differences can exist between different ``families'' of devices. To address this problem J2ME defines ``configurations'' & ``profiles''. A configuration defines the minimum Java technology libraries and VM capabilities for a family of devices, whereas a profile is a collection of APIs that supplement a configuration to provide capabilities for a specific vertical market or device type (Fig. 1).
Currently 2 configurations are provided : Connected Limited Device Configuration (CLDC) & Connected Device Configuration (CDC). CLDC is targeted at devices with :
The only currently available profile for CLDC is the Mobile Information Device Profile (MIDP), which targets mobile devices implementing the J2ME CLDC & addresses :
Java apps for WIDs are expected to be simple apps requiring a minimal amount of resources. For more resource demanding Java apps the use of a JIT compiler, a static compiler or even a native Java processor (www.zucotto.com) will be needed. For even more advanced applications a custom hardware solution may be needed. Parthus (www.parthus.com) offers such solutions.
Developers can use a standard PC for testing the application. Emulators for both Symbian & Java platforms are already available. For GPRS testing Ericsson's test center () claims to support testing & troubleshooting of GPRS apps.