Forensic Speaker Recognition: Mirages and Reality

Jean-François Bonastre1, Juliette Kahn2, Solange Rossato3 and Moez Ajili1

1Laboratoire d’Informatique d’Avignon, Avignon

2Laboratoire National de Métrologie et d’Essais, Paris

3Laboratoire d’Informatique de Grenoble, Grenoble


Abstract: Forensic speaker recognition is a topic similar to a tropical climate, where a big storm could form any day. It is a particularly controversial topic for three main reasons: the nature of the material it relies on, the maturity of scientific knowledge in this field, and its history. Forensic speaker recognition is under the spotlight because there is a huge and increasing demand for expertise in courts. In this chapter, we highlight the importance of the Bayesian decision framework, which is now the standard paradigm for forensic speaker comparison and which reopens the question of science in court. We acknowledge the impressive progress achieved in the field of Automatic Speaker Recognition (ASR) during the last decade, which raises the question of the use of ASR in forensic voice comparison. In this context, we point out several important weaknesses in the evaluation protocols, insisting that the whole communication process has to be taken into account, including speaker specificities not only from a speech production perspective but also from the perspective of interactions with interlocutors. The final objective is to strongly reaffirm, within the scientific community, the "need for caution" message concerning forensic speaker recognition applications in courts.

1.   Introduction

Forensic speaker recognition is a hot topic primarily because of the forensic aspect. In the forensic field, mistakes have a direct impact on humans and on their lives. Forensics is also an area that science and law have to share, which is not easy, but not impossible either (Roberts, 2013). These two characteristics tend to make forensics similar to a tropical climate, where a big storm could form any day. Forensic speaker recognition is under the spotlight because there is a huge and increasing demand for expertise in courts. This is mainly due to the development of modern communication services: it is becoming increasingly rare to see a law case in which there is no mention of the use of a smartphone or some other modern communication tool. In addition to sharing the ‘hot’ nature common to all forensic media, speaker recognition is also a particularly controversial topic for three main reasons: the nature of the material it relies on, the maturity of scientific knowledge in this field, and its history.

1.1.    Should speech analysis be regarded as a physical biometric?

Firstly, speech is not exclusively a physical biometric. Language is alive: while part of it is rooted in our genetic makeup, most of it is learned and varies constantly over time. The idea that we can recognize individuals by their voices is nonetheless widespread. The main reason is that speech is a human activity, directly attributed to a human speaker. When hearing speech, one imagines a speaker, and assigns sex, age, geographical and social origin, and even some personality features. Moreover, speech is subject to diastratic, diatopic, and diaphasic variation. Sociolinguistics studies how languages differ among social groups (because of e.g. age, sex, level of education, status, and ethnicity; see Labov, 1972), while geolinguistics is concerned with the spatial distribution of linguistic phenomena. Pragmatics studies how speech production depends not only on phonology, lexicon and syntax but also on the inferred intent of the speaker and the context of the utterance (Austin, 1970). Speech also conveys the emotional and psychological states of the speaker (Scherer, 1986). All these factors may influence the realization of a speech utterance. Voice “biometrics” aims to identify idiosyncratic features in the speech signal produced at a given time and with a given communicative intention. This task is difficult because a speech signal is not a direct reading of body traces (like fingerprints or DNA), and includes a large variability caused by factors such as speech acts, languages or speakers’ roles, without even taking into account the possibility of intentional changes in voice (disguise) or speaker-independent conditions like noise. It clearly appears that voice authentication is largely based on behavioral variables: it looks at the way one is speaking, not at the physical properties of her/his body! If the notion of behavioral biometrics is accepted, then speech could be considered as belonging to it.

1.2.    Lack of commonly accepted approaches or techniques

In addition to the difficulty related to speech not being a true (well-defined) biometric feature, there is a clear lack of scientifically accepted knowledge, approaches or techniques in the field of forensic speaker recognition. This is a consequence of the nature of the material studied: human language mediated through speech. As explained in the previous paragraph, the large number of open variables creates a real and complex difficulty in experimental assessment. Gathering experimental confirmation from a sample database with specific conditions does not allow the scientist to generalize the results to other conditions. Most often, researchers have to propose new hypotheses and repeat the experiments under other conditions. The lack of commonly accepted methods in forensic speaker recognition is linked to this variability, as well as to the involvement of multiple scientific areas such as acoustic phonetics, signal processing, phonology and other linguistic disciplines.

1.3.    Historical charlatanry and controversy in forensic speaker recognition

Finally, forensic speaker recognition is also a hot and controversial topic because of its history, with charlatanry present in the field since the sixties. In 1962, Kersta introduced the misleading term “Voiceprint identification”, referring to the speech spectrogram representation. However, a spectrogram is only a visual representation of the speech signal, based on the acoustic properties of speech which result from articulatory movements controlled by the speaker. It is not a trace of the speaker himself (Bolt et al., 1970). This could be seen as a classical scientific controversy of the past, but the misconception persists: several associations of forensic speaker recognition experts still remind us, in their “best practices” or resolutions, that speech spectrograms should not be used. For example, in 2007, the IAFPA1 voted a resolution2 considering that the spectrogram comparison approach (with a methodological reference to Tosi, 1979) is “without scientific foundation and it should not be used in forensic casework”. This resolution was proposed and voted 37 years after Bolt’s paper (Bolt et al., 1970), which clearly indicates that this misleading visual representation of speech was still being used by some “experts” in 2007, despite the scientific evidence against it. Boë (2000) described the “Voiceprint” history in detail, as well as several other examples of science misused in forensic speaker authentication, like the Micro-Surface “REVAO” tool in France during the 1984 “Grégory” case3 or the “Prieto” case4. In these cases, the methods used by the “experts” were questioned by the court and finally rejected. Unfortunately, we are not only talking about history. For instance, Morrison (2014) discussed “distinguishing between forensic science and forensic pseudoscience”. The need to reaffirm, in 2007, the unscientific nature of spectrogram reading is reinforced by recent charlatanism in different aspects of forensic speech science, as highlighted in recent articles (Eriksson and Lacerda, 2007; Boë and Bonastre, 2012).

To conclude this introduction, while the first important novelty in the voice comparison area comes from the general acceptance of the Bayesian paradigm, which reopens the question of science in court, the real innovation is the strong emergence of Automatic Speaker Recognition (ASR) processes. Over the last few decades, automatic systems have improved from error rates of around 20% to error rates of less than 1%, even though the difficulty of the task has increased significantly. This raises the question of the use of ASR in forensic voice comparison. In the next sections, we will focus on these two main aspects, with a short side note on voice convergence phenomena.

2.   Bayesian decision framework: Evolution or revolution?

“Would jurists accept that the concept of reasonable doubt on the identification of a suspect escapes their province and that the threshold is imposed onto the court by the scientist?” This question, asked by Christophe Champod and Didier Meuwly (Champod and Meuwly, 2000)5,6, marks an important change in the understanding of “forensics” by the speaker recognition community. Champod and Meuwly’s work followed several similar studies in forensics, like the one of Balding and Donnelly (1994) for DNA, but it was the first for forensic voice comparison. While automatic speaker recognition researchers were working on how to decrease the probability of a false identification in a forensic report, and hotly debating whether this probability was well enough known and low enough to authorize forensic applications7, Champod and Meuwly showed all experts that they are not in charge of making decisions. Experts have to provide the court with an evaluation which conveys the convincing force of the results, not to take part in the judicial debate. This is definitely not possible if science says “the suspect is guilty” through scientific expertise. Unfortunately, in many trials, saying that a suspect is the one speaking in a given trace/recording is actually equivalent to stating that he or she is guilty.

In other words, with the Bayesian paradigm, the speech scientist does not “identify” people, but provides the jury with the specialist’s scientific information in a procedure which is conceptually identical to the one used nowadays in the presentation of DNA evidence.

2.1.    Implementation of the Bayesian decision framework in forensic trials

Scientifically speaking, the need for the expert to stay out of the province of jurists is implemented using the Bayesian decision framework. Based on a piece of evidence E (a vocal message X), the experts have to present their conclusion using a Likelihood Ratio (LR), which expresses how likely the evidence is under the prosecutor’s hypothesis (the suspect pronounced message X) versus the defender’s hypothesis (the suspect did not pronounce message X). The LR is presented in equation 1, where H1 is the prosecutor’s hypothesis and H2 is the defender’s hypothesis (this formula differs slightly from Champod and Meuwly (2000); the difference will be explained later):

$$\mathrm{LR} = \frac{\Pr(E \mid H_1)}{\Pr(E \mid H_2)} \qquad (1)$$

The numerator is the probability of the evidence given H1 and the denominator is the probability of the evidence given H2. While the numerator can be estimated by the expert by considering the evidence and the suspect, the denominator is the random match probability, “which can be derived from an objective or subjective estimation of the relative frequency of the concordant features in the relevant population” (Champod and Meuwly, 2000).
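To make equation 1 concrete, here is a minimal, purely illustrative sketch of a score-based likelihood ratio, in the spirit of how the automatic systems discussed later in this chapter turn a comparison score into an LR. The score value and the two score distributions are invented; a real system would estimate them from large databases of same-speaker and different-speaker comparisons.

```python
# Purely illustrative sketch of equation 1: the LR is the density of the observed
# comparison score under H1 divided by its density under H2. All values are invented.
from scipy.stats import norm

evidence_score = 2.1  # similarity score between the trace and the suspect's speech

h1_scores = norm(loc=3.0, scale=1.0)   # assumed score distribution when H1 is true
h2_scores = norm(loc=-1.0, scale=1.5)  # assumed score distribution when H2 is true

lr = h1_scores.pdf(evidence_score) / h2_scores.pdf(evidence_score)
print(f"LR = {lr:.1f}")  # LR > 1 supports H1, LR < 1 supports H2
```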

The use of the LR framework for the forensic expert’s report is very attractive. As expected, it places the expert on neutral ground by removing the need for her/him to conclude the report with a “decision”. Furthermore, it also helps the expert to follow a scientific approach, since the work is based only on the evidence E.

However, the LR alone is not sufficient for the court, which must also consider the posterior odds of the two hypotheses H1 and H2 as expressed in equation 2:

$$\frac{\Pr(H_1 \mid E, I)}{\Pr(H_2 \mid E, I)} = \mathrm{LR} \times \frac{\Pr(H_1 \mid I)}{\Pr(H_2 \mid I)} \qquad (2)$$

In equation 2, the LR issued by the expert can be recognized. This LR is multiplied by the ratio of the prior probabilities of H1 and H2, respectively, Pr(H1|I) and Pr(H2|I). These prior probabilities are based on all the elements of the case, denoted I here. They may change during the law case or the trial, for example due to new elements added in I. Paraphrasing Champod and Meuwly, the prior probabilities are clearly in the province of the jurist and the court, and not in the province of the expert.
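As a purely illustrative numerical reading of equation 2 (all values hypothetical): if the court’s prior odds in favour of H1 are 1 to 1000 and the expert reports an LR of 100, the posterior odds become

$$\frac{\Pr(H_1 \mid E, I)}{\Pr(H_2 \mid E, I)} = 100 \times \frac{1}{1000} = \frac{1}{10},$$

so the evidence strengthens H1 considerably, yet under these priors H2 remains ten times more probable than H1; only the court, not the expert, can set the prior odds.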

Although this Bayesian formalism was new for most caseworkers engaged in forensic speech comparison in 1998/2000, it is now widely accepted and considered by many experts as the logically correct framework (Rose, 2006; Gonzalez-Rodriguez et al., 2007; Jessen, 2008). It is interesting to notice that the references provided come mainly from the articulatory-phonetic voice comparison community. This clearly shows the wide acceptance of the (Bayesian) likelihood ratio for forensic voice comparison, even if the question is still debated, mainly when discussing what should be presented in courts (French and Harrison, 2007; Rose and Morrison, 2009; French et al., 2010). Gold and Hughes (2014) recently presented an interesting survey on the application of the “numerical likelihood ratio framework to forensic speaker comparison”, which emphasizes both the advantages of the LR approach and its practical difficulties.

2.2.    Bayesian decision framework limitations for forensic trials

As shown previously, the Bayesian formalism has become a cornerstone of forensic expertise and is adopted in several areas, including speech. It provides a very elegant theoretical framework and places the expert (back) in her/his proper domain, which is science and not judgment. However, implementing a theoretical framework to handle real-world cases requires some “adaptations” and raises three main problems:

a. Estimation of Pr(E|H2)

Pr(E|H2) plays a very important role in the LR, at least equivalent to that of Pr(E|H1), even if it is clearly underrepresented in the voice comparison and speaker recognition literature. Estimating this probability is not an easy task. For example, with a machine learning approach, it is possible to learn a class model for H1 using several samples of the suspect’s voice, while it is not trivial to train such a model for H2. H2 implies that the speech recording under scrutiny was pronounced by someone other than the suspect; hence the corresponding class represents all voices except the suspect’s, and it is much more difficult to gather representative samples for it.
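A common workaround in the automatic speaker recognition literature cited in this chapter (e.g. Reynolds et al., 2000) is to approximate the H2 model with a background model trained on many other speakers. The following is a minimal sketch under that assumption, with random numbers standing in for acoustic feature vectors (e.g. MFCC frames); it is an illustration, not the procedure of any particular forensic system.

```python
# Minimal sketch (invented data): Pr(E|H1) is approximated by a model trained on
# the suspect's speech, Pr(E|H2) by a background model trained on many other
# speakers, in the spirit of GMM-UBM systems (Reynolds et al., 2000).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
suspect_frames = rng.normal(0.0, 1.0, size=(500, 20))      # suspect's reference speech
background_frames = rng.normal(0.5, 1.5, size=(5000, 20))  # many other speakers
evidence_frames = rng.normal(0.1, 1.0, size=(300, 20))     # the disputed recording

suspect_model = GaussianMixture(n_components=8, covariance_type="diag",
                                random_state=0).fit(suspect_frames)
background_model = GaussianMixture(n_components=8, covariance_type="diag",
                                   random_state=0).fit(background_frames)

# score() returns the average per-frame log-likelihood, so the difference is a log-LR
log_lr = suspect_model.score(evidence_frames) - background_model.score(evidence_frames)
print(f"log LR = {log_lr:.2f}")
```

Choosing the data behind the background model is precisely the “relevant population” question discussed in the following paragraphs.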

In Champod and Meuwly (2000), Pr(E|H2) is the “random match probability” and “can be derived from an objective or subjective estimation of the relative frequency of the concordant features in the relevant population”. It is interesting to read that, for H2, the notion of “subjective estimation” is introduced into the scientific process of the forensic expert.

Furthermore, it is important to notice that three elements have to be evaluated in order to estimate Pr(E|H2): the concordant features, their relative frequency and the relevant population. This means that a forensic approach claiming to comply with the Bayesian formalism, which is very often described as the only scientific formalism accepted for forensic evidence, should define these three elements explicitly. The last of these, the relevant population, does not depend on the forensic expert, or at least not completely, since it is “dictated by the hypothesis proposed by the defense” (Champod and Meuwly, 2000). It means that the forensic expert’s referral should include a clear description of the expected relevant population. We should also remember that this hypothesis is not definitive and may evolve during the trial.

b. Background information

Frequently, the forensic expert has access to several pieces of background information concerning the current case, other than the piece of evidence E and the hypotheses to be evaluated. Therefore, the LR equation very often includes I’, a subset of I, in addition to the evidence E, in the expert’s knowledge:

$$\mathrm{LR} = \frac{\Pr(E \mid H_1, I')}{\Pr(E \mid H_2, I')} \qquad (3)$$

Forensic experts often have unrestricted access to the background information. Consequently, the LR is often formulated using I’ = I (Champod and Meuwly, 2000).

We saw earlier that the LR denominator is an estimation of the random match probability in the “relevant population”. This is understandable if the expert wishes to use as much information as possible in order to determine the H2 probability. Unfortunately, it is in obvious contradiction with the scientific position, which is to be as little subjective as possible. It may be useful here to remember the well-known double-blind principle and why it is so important in medical research assessment.

So, the question is: could a completely scientific and objective assessment be achieved if the expert has additional information beyond the evidence itself, for example about the suspect’s origins, preferences and criminal record? If the answer is no, and if we want to keep experts’ reports as scientific as possible, it is important to define clearly which information can be provided to the experts. More generally speaking, this problem is known as “forensic confirmation bias”. The clearest example of this bias is given by the high-profile mistaken fingerprint identification of Brandon Mayfield in the Madrid bomber case (Kassin et al., 2013). From a juristic point of view, it also seems important to make sure that the details provided to the experts are accessible, case by case, to the various parties, e.g. the defender’s.

c. Understandability of LR by the court

Champod and Meuwly (2000) claim that the LR is useful “for assisting scientists to assess the value of scientific evidence” and to “clarify the respective roles of scientists and of members of the court”. These two claims have been discussed previously and are quite easy to accept. But Champod and Meuwly also claim that the LR is useful to “help jurists to interpret scientific evidence”. Of course, a forensic analysis is of interest only if judges, lawyers, and jurors are able to understand precisely the work done by the expert, as well as the intrinsic nature of the scientific evidence presented.

However, understanding probabilities in general, and LRs more specifically, is not straightforward. Daniel Kahneman, the 2002 economics Nobel Prize (co-)laureate, a specialist in judgement and decision-making, one of the two proposers of prospect theory, and co-author of seminal work on judgment under uncertainty (Tversky and Kahneman, 1974), states in his 2011 book “Thinking, Fast and Slow” that Bayesian reasoning is not natural for humans. This is true not only for laypeople but also for statistics specialists. In Thompson et al. (2013), the perception of LRs by jurors is analyzed. It appears that it is not easy for them to correctly understand statistical evidence. As highlighted by the authors, this is particularly true when forensic experts, prosecutors or lawyers provide arguments that invite or encourage fallacious conclusions from statistical evidence, which is not uncommon in courts.

Moreover, since Bayesian theory, and statistics and probability in general, are now a mandatory part of presenting and understanding forensic evidence, it would be worthwhile to include serious courses in these areas in law curricula, which is rarely the case at present.

3.   Automatic approaches: A new avenue for forensic speaker recognition?

The use of automatic approaches for forensic speaker recognition clearly offers important advantages, in terms of objectivity and repeatability of the voice comparison measures but also in terms of human time costs. The limited cost of automatic processes can also allow the expert to test several voices against the piece of evidence, which is clear progress towards double-blind, objective procedures. This interest in the use of automatic systems for forensic applications has been present in the literature for a long time (Nakasone and Beck, 2001; Alexander et al., 2005; Drygajlo, 2007; Becker et al., 2010; Mandasari et al., 2011).

For decades, from the early days of speaker recognition (Pruzansky, 1963) until the end of the past millennium, the performance of automatic speaker recognition systems was so poor that using them for real forensic cases was not feasible. The situation began to change with new statistics-based approaches and the large-scale speaker recognition evaluation campaigns organized by NIST since 1996 (Przybocki and Martin, 2004). In order to take this evolution into account, several scientific institutions (see Bonastre et al., 2003) sent the forensic field a clear need-for-caution message concerning the use of automatic speaker recognition technologies, and forensic speaker authentication in general, including statements such as: “currently, it is not possible to completely determine whether the similarity between two recordings is due to the speaker or to other factors”, “caution and judgment must be exercised when applying speaker recognition techniques, whether human or automatic” or “at the present time, there is no scientific process that enables one to uniquely characterize a person’s voice or to identify with absolute certainty an individual from his or her voice”.

Campbell et al. (2009) started from this “need for caution” message and revisited it in light of the impressive improvement in (measured) performance made during the last decade in the field of automatic speaker recognition (see Przybocki et al., 2006, 2007; Fauve et al., 2007). They observed that the performance measured in terms of Equal Error Rates (EERs) dropped from around 9% for the year 2000 system (Reynolds et al., 2000; Bimbot et al., 2004) to 4.5% for the 2006/2007 system (Kenny et al., 2007). The EER even went down as far as about 1% when longer training excerpts or unsupervised speaker adaptation were used (Barras et al., 2004; McLaren et al., 2008, 2011). Since 2009, progress in terms of error rate decrease has remained noticeable, mainly thanks to the “iVector” approach (Kenny et al., 2007; Bousquet et al., 2014). Nowadays, EERs lower than 1% are obtained on quite large-scale evaluation sets, with millions of voice comparisons. Figure 1 proposes a schematic view of the evolution of the EER over the last two and a half decades. It is only a schematic view, since experimental protocols evolved over the years and are not directly comparable.


Figure 1:   Schematic view of speaker detection error rates.
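For readers unfamiliar with the EER values quoted above and in Figure 1, the following minimal sketch shows how an equal error rate can be computed from a pooled set of comparison scores; the scores are invented and the procedure is a simplified illustration, not the official NIST tooling.

```python
# Minimal sketch (invented scores): the EER is the operating point where the
# false alarm rate equals the false reject (miss) rate over a pooled trial set.
import numpy as np

rng = np.random.default_rng(1)
target_scores = rng.normal(2.0, 1.0, 1_000)       # same-speaker comparisons
nontarget_scores = rng.normal(-1.0, 1.0, 10_000)  # different-speaker comparisons

thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
miss_rates = np.array([(target_scores < t).mean() for t in thresholds])
false_alarm_rates = np.array([(nontarget_scores >= t).mean() for t in thresholds])

eer_idx = np.argmin(np.abs(miss_rates - false_alarm_rates))
print(f"EER ~ {(miss_rates[eer_idx] + false_alarm_rates[eer_idx]) / 2:.2%}")
```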

Recently, several studies have investigated the use of Deep Neural Networks for automatic speaker recognition (Stafylakis et al., 2012; Lei et al., 2014; Kenny et al., 2014; Vasilakakis et al., 2013). The results presented clearly show that this approach is able, or soon will be, to bring an additional and significant decrease in error rates.

While reporting this impressive progress and these error rates, it is worth questioning the results of such studies. In Campbell et al. (2009), the authors showed that an error rate is often not enough to understand the behavior of a system. In the following paragraphs, we propose a fresh look at the performance-related numbers, their meaning and their limits.

3.1.    Instability, imprecision and inadequacy of the performance measures

An unquestionable advantage of automatic approaches for forensic applications is the ability to assess the techniques on a large number of voice comparison trials. For example, in the NIST SRE evaluations, hundreds of thousands of tests are done. The impressive error rates reported earlier in this chapter are obtained with this kind of experimental protocol. The robustness of such an evaluation protocol relies on respecting some straightforward rules (Phillips et al., 2000; Petrovska-Delacrétaz et al., 2009) and on “brute force”, i.e. the size of the evaluation set. In particular, in the NIST SRE evaluation, when a system is working on a voice comparison between recordings X and Y, only the use of these two recordings from the evaluation set is allowed (i.e. knowledge of Z is not allowed, if Z is another recording of the evaluation set).

Soong et al. (1987) is one of the first speaker recognition studies with a strong evaluation protocol: 50 male and 50 female speakers were recorded, each of them pronouncing 200 digits in 5 recording sessions, a database size that corresponded to the maximum computing power available at that time. By comparison, NIST SRE 2010 (NIST, 2010) involved, for its main testing condition, 6,000 speaker models, 25,000 test segments and up to 750,000 voice comparison tests. Looking at the difference in magnitude between the two experiments reported here, it is easy to understand why there has been little interest in evaluation protocols in the last decades: the progress made by computers, following Moore’s law and reflected in the size of the databases, gave a strong impression of increasing robustness, based on the “brute force” aspect alone.

During that period, performance was measured only by using global error rates averaged over the whole test set8. This way of evaluating the performance of speaker recognition systems presents two main drawbacks: the criterion itself and the global nature of the performance measure.

The classical speaker detection performance criteria – false alarm, false reject and cost functions – depend on decision making (on a threshold), while the Bayesian decision paradigm rejects this notion of decision for a forensic voice comparison. In the Bayesian paradigm, the system outputs a likelihood ratio, whose value is meaningful in itself, not simply because this LR is large enough (or small enough) to allow a “good” decision relative to a threshold. For example, we expect LRs with a low power (close to 1) when the piece of evidence contains little speech material, i.e. little speaker-specific information. The same effect is expected if the quality of the audio material is low. This is well described in Morrison (2011). The author uses the notions of validity/accuracy and precision, which are illustrated in Figure 2. While this approach is clearly the accepted one, we would like a solution which is able to represent both notions in one number. The “log-likelihood ratio cost function” introduced by Brümmer (Brümmer and du Preez, 2006; van Leeuwen and Brümmer, 2013), denoted CLLR, could be seen as the best available solution to this problem. CLLR is an LR-oriented performance criterion based on assumptions about LR distributions. Although some of its underlying hypotheses are not always validated in practice, CLLR is now the official criterion for the NIST Speaker Recognition Evaluations and Language Recognition Evaluations. CLLR also makes it possible to separate calibration loss from discrimination loss. Calibration loss9 is the loss due to badly formatted LR values, a problem which could be solved with an adequate calibration process (“calibration” is often used as another word for “normalization”). Discrimination loss corresponds to the rest of the losses, which come from the two speech recordings and from the system itself.


Figure 2:   Schematic view of accuracy and precision (Morrison, 2011)
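A minimal sketch of the CLLR criterion mentioned above, following the definition in Brümmer and du Preez (2006): the cost averages log2(1 + 1/LR) over comparisons where H1 is true and log2(1 + LR) over comparisons where H2 is true. The toy LR values below are invented.

```python
# Sketch of the log-likelihood-ratio cost CLLR (Brümmer and du Preez, 2006),
# computed from the LRs a system outputs on known target (H1 true) and
# non-target (H2 true) comparisons. Toy values only.
import numpy as np

def cllr(target_lrs, nontarget_lrs):
    """0 for perfect LRs; about 1 for uninformative LRs (all equal to 1)."""
    target_lrs = np.asarray(target_lrs, dtype=float)
    nontarget_lrs = np.asarray(nontarget_lrs, dtype=float)
    cost_targets = np.mean(np.log2(1.0 + 1.0 / target_lrs))
    cost_nontargets = np.mean(np.log2(1.0 + nontarget_lrs))
    return 0.5 * (cost_targets + cost_nontargets)

print(cllr(target_lrs=[20.0, 5.0, 100.0], nontarget_lrs=[0.1, 0.5, 0.02]))
```

In the standard decomposition, the discrimination loss is the CLLR that remains after an optimal recalibration of the LRs, and the calibration loss is the difference between the actual CLLR and that minimum.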

The second drawback of these evaluation performance measures is the way evaluation data is used, or the way the evaluation database and protocol were designed. Until now, a test condition has been defined by only a little information, such as the durations of the two speech files composing a voice comparison, the language used and the “channel” (e.g. close or distant microphone, fixed phone or cellphone). All the available voice comparison samples corresponding to these conditions are taken together and a global performance is computed in terms of classical error rates or CLLR. The robustness of the evaluation for the given test condition once again relies on “brute force”: a large number of voice comparison samples10. The number of samples per speaker and the characteristics of the speakers are not taken into account, except the sex and the mother tongue of the speaker. It is amazing to observe that the “speaker factor” is still not taken into account in the design of evaluation plans even though its great influence is well known. Doddington et al. (1998) showed that, for an automatic speaker recognition system, there are different “speaker profiles”. Depending on their “profile”, a few speakers are responsible for a large part of the reported errors. The authors showed that the performance measures depend significantly on this factor.

Revisiting the perspective opened by Doddington et al. (1998), Kahn et al. (2010) demonstrated that the notion of “speaker profile” is in fact a simplified view of a more general problem: speaker recognition systems model speech files and not, or not only, the speech or the voice of a given speaker. In order to demonstrate this, the authors built a new experimental setup using the NIST 2008 evaluation database. The experiment was composed of voice comparison trials, each represented by a couple of speech signals (Xi, Yk). The right value, Yk, is fixed and is simply one of the K speech extracts from recording set Y. The left value, Xi, is the factor of interest. Xi is a recording of speaker Si, taken from a set of recordings Xij pronounced by Si. For each speaker Si, voice comparison trials (Xi, Yk), with k varying from 1 to K, are carried out using each available speech signal Xij, with j varying from 1 to J. For each speaker Si, the speech extract which allowed the speaker recognition system to make the fewest errors is given a “best” label. Conversely, the speech extract showing the maximum number of errors is given a “worst” label. Figure 3 plots the performance of the system when the recordings selected for the Xi parts of the voice comparisons are the “best” ones or the “worst” ones. The EER moves from less than 5% for the recordings with the “best” labels to more than 20% with the “worst” labels11. It is important to emphasize that the only difference between the “best” condition and the “worst” condition is the speech sample selected to represent a given speaker12. Clearly, the speaker recognition system attaches great importance to the speech extract itself. For forensic voice comparison, this means that the choice of the speech material used for comparison has an important effect on the voice comparison result itself.

Figure 3:   DET performance curves of a speaker recognition system using (1) the “best” speech extracts, (3) the “worst” speech extracts and (2) randomly selected speech extracts (Kahn et al., 2010).
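The selection protocol described above can be summarised schematically as follows. This is a reconstruction for illustration only: compare(), recordings_of(), test_set and is_same_speaker() are hypothetical placeholders standing in for the real system and data of Kahn et al. (2010), not the authors’ code.

```python
# Schematic reconstruction of the "best"/"worst" excerpt selection (Kahn et al., 2010).
# For each speaker, every available excerpt Xij is scored against the fixed test set Y;
# the excerpt producing the fewest errors is labelled "best", the one producing the
# most errors is labelled "worst". All helper functions are hypothetical placeholders.
def label_best_and_worst(speakers, recordings_of, test_set, compare, is_same_speaker,
                         threshold=0.0):
    labels = {}
    for speaker in speakers:
        errors_per_excerpt = {}
        for excerpt in recordings_of(speaker):
            errors = 0
            for test_recording in test_set:
                accepted = compare(excerpt, test_recording) > threshold
                if accepted != is_same_speaker(speaker, test_recording):
                    errors += 1
            errors_per_excerpt[excerpt] = errors
        labels[speaker] = {
            "best": min(errors_per_excerpt, key=errors_per_excerpt.get),
            "worst": max(errors_per_excerpt, key=errors_per_excerpt.get),
        }
    return labels
```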

Even if we have just highlighted the limitations of speaker recognition evaluations, it is important to remember that international evaluations like NIST SRE and HASR (Greenberg et al., 2010, 2011; Martin et al., 2014), NFI-TNO (van Leeuwen et al., 2006) or AHUMADA (Ortega-Garcia et al., 2000) have allowed us to discover or evaluate several variability factors over the years.

HASR is an interesting and specific case as it merges phonetic-forensic aspects with automatic approaches. The NIST HASR initiative started in 2010. It is based on a short subset of trials extracted from the NIST SRE evaluation set13. The trials are processed by human experts who are allowed to use automatic tools. This initiative was at the origin of numerous studies (Schwartz et al., 2010; Ramos et al., 2011; Audibert et al., 2010; Shen et al., 2011; Kahn et al., 2011) and, more recently, of others (Hautamäki et al., 2013; van Dijk et al., 2013; Univaso et al., 2013).

Campbell et al. (2009) show a striking “voice aging” effect detected by NIST after the SRE 2005 evaluation: performance decreased significantly when the two recordings of the voice comparison trial were separated by only a few weeks. Figure 4 presents the corresponding DET curves. During his Speaker Odyssey 2014 keynote talk (Campbell, 2014), Campbell presented two other factors of variability with a potentially strong impact on forensic speaker recognition: the recording device and the microphone distance. The diversity of recording devices and the mismatch between different recording devices are known problems in speaker recognition. Figure 5 shows a wide performance gap depending on the recording device used. This gap widens significantly when different devices are used for the two recordings (mismatched conditions). In Figure 6, the variability factor is the distance to the microphone. The experimental results presented, extracted from NIST SRE 2008, show an EER varying from about 1% to about 3% (a threefold difference) depending on this factor.


Figure 4:   Performance difference reported by NIST in NIST SRE 2005, depending on the time elapsed between the recording sessions (NIST 2005 speaker recognition evaluation final meeting).


Figure 5:   Effect on performance of the recording device and of mismatched recording conditions (Campbell, 2014).


Figure 6:   EER variations depending on the microphone distance (Campbell, 2014).

3.2.    The speaker-specific information used by automatic speaker recognition systems

The results of the experiment in Kahn et al. (2010) are summarized in Figure 3. They show that it is not straightforward to know what information is used by automatic speaker recognition systems, even if the measured performance of these systems is high. Other research work emphasizes this question. In Matrouf et al. (2006) and Bonastre et al. (2007), an artificial transformation14 of the voice was proposed in order to spoof a speaker recognition system: after the voice transformation, the system should recognize an impostor’s voice as coming from a targeted speaker. Note that only the automatic system was targeted in this spoofing experiment, not a human listener. In addition, the voice transformation should not be detectable by a human listener. The targeted speaker was described only by a short speech sample of his/her voice (less than 2 minutes of speech), taken from outside the evaluation dataset. The transformation was applied to all the impostor trials of NIST SRE 2006 (restricted to the male trials). Table 1 reports the results of this experiment: the false alarm rate increases from 0.8% to 49.72%.

Table 1:   Effect of artefact-free artificial voice transformation of impostor voices (Bonastre et al., 2007)

                                        False alarm (%)    Miss probability (%)
Baseline (without transformation)             0.8                27.45
Using impostor voice transformation          49.72               27.45

The ability of this transparent, artefact-free transformation technique to disrupt the speaker recognition system clearly calls into question the nature of the information used by the system. Several researchers (Perrot et al., 2007; Zhang and Tan, 2008; Alegre et al., 2012; Wu et al., 2012; Evans et al., 2014) have performed similar experiments and explored other spoofing attacks (and countermeasures), with comments and conclusions similar to those of Matrouf et al. (2006) and Bonastre et al. (2007).

4.   Voice convergence: A fundamental open question for forensic voice comparison

Quite recently, several interesting research studies have focused on voice convergence, whereby interlocutors establish common ground and align their linguistic production (Krauss and Pardo, 2006; Pardo, 2006; Babel, 2010; Kim et al., 2011). This phenomenon of interlocutor adjustment increases perceived similarity. Several acoustic attributes have been examined, such as speech rate, voice quality, formants or MFCCs (Giles et al., 1991; Levitan and Hirschberg, 2011; Lelong and Bailly, 2012; Pardo et al., 2012; Pardo, 2013). This phenomenon potentially appears as a major threat to forensic speaker comparison for two reasons. First, voice convergence is an additional variability factor. Secondly, due to this type of speaker adjustment, the voice of speaker X could appear closer to the voice of speaker Y simply because X and Y each participated in a conversation with the same third speaker Z. And to date, no scientific work excludes the hypothesis that the effects of voice convergence could persist after the conversation itself.

5.   Concluding remarks

In this chapter, we first reminded readers of the controversial aspects of forensic speaker comparison, due mainly to the intrinsic nature of the voice, which is very different from physical biometrics like DNA or fingerprints. We highlighted the importance of the Bayesian decision framework, which has become the standard paradigm for forensics in general and for forensic speaker comparison specifically. We went deeper into the question of the use of automatic systems in forensic applications. We acknowledged the impressive progress achieved in the field of automatic speaker recognition during the last decade, but we also pointed out several important weaknesses in the evaluation protocols. We then returned to the speaker-specific nature of the information used by automatic systems. Clearly, some doubts about automatic systems remain, as demonstrated for instance by Kahn et al. (2010) and Bonastre et al. (2007). This is particularly true if we use the broader perspective of “dependability” (Avizienis et al., 2004), which takes the whole process into account. It is important to have a comprehensive picture of forensic speaker recognition processes. Campbell et al. (2009) reported the importance of calibration, and Bousquet et al. (2014) showed that normalization in the iVector domain also plays a major role in the performance of a system, although there is still no theoretical explanation for this.

The previous findings on automatic speaker recognition should not give the reader the wrong impression about the use of automatic approaches in forensic speaker recognition versus human-based approaches. Even if automatic approaches present some weaknesses, they are unavoidable if the scientific nature of forensic speaker comparison is to be asserted. We do not know whether it will be possible in the future to propose a fully automatic system for forensic speaker recognition which would follow strong scientific guidelines like the Daubert rules15. But we think it is quite impossible to meet such scientific rules without automatic processes, as the typicality of each speaker-specific criterion16 has to be assessed on very large databases. Emerging studies on tools and methods for computer-assisted approaches, like the SPAAT tool17 used by USSS-MITLL during their HASR 2010 participation (Schwartz et al., 2010), demonstrate the interest of such an approach.

However, our intention is not to dismiss human expert knowledge and manual approaches. Once again, the Daubert case offers a nice proposal: “If scientific rules are not fulfilled, a witness qualified as an expert by knowledge, skill, experience, training, or education, may testify in the form of an opinion”. We fully support this statement and we wish to emphasize the distinction between an expert’s opinion and a scientifically assessed method.

Finally, it is interesting to see that Campbell et al.’s (2009) conclusions are quite close to the 2003 conclusions. The main one concerns the “caution” message: “Looking at the different points highlighted in this article, we affirm that forensic applications of speaker recognition should still be taken under a necessary need for caution. Disseminating this message remains one of the most important responsibilities of speaker recognition researchers.” Since 2009, the research and remarks reported in this chapter have tended to significantly reinforce these conclusions.

Moving towards scientifically sound speaker comparison approaches requires continuous research efforts. We are contributing to this effort, for example with the work carried out within the scope of Juliette Kahn’s PhD thesis (Kahn, 2011)18. Figure 7 presents the logic of the work done. It reports an experiment in which the part of inter-speaker variability explained by the different formants of the vowels was estimated for male and female speakers. The numbers reported in the figure mean, for example, that the first formant of vowel /a/ explains 15% of the inter-speaker variability. Speaker-specific information is not equally distributed across vowels and depends on the vocalic quality of the sounds. These interactions between speaker-specific variability and acoustic-phonetic classes are the subject of only a few studies (Bonastre and Meloni, 1994; Besacier et al., 2000). Further research is needed in order to provide an objective estimation of “the relative frequency of the concordant features in the relevant population” (Champod and Meuwly, 2000). Moreover, voice comparison reliability depends not only on the relative frequency of the features but also on the concordance, or homogeneity, of the speaker-specific information classes in the two speech excerpts. This work is being carried on in the context of Moez Ajili’s ongoing PhD19: Ajili et al. (2015) present a first measure of the data homogeneity between the two speech extracts of a voice comparison trial.


Figure 7:   Part of inter-speaker variability explained by formant and vowel. Results are given for males (H) and females (F) (Kahn, 2011).
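The exact computation behind Figure 7 is not reproduced in this chapter; as a hedged illustration, the sketch below shows one common way of quantifying the share of inter-speaker variability explained by a single formant measurement (the between-speaker share of the total variance, computed on invented data). It is not necessarily the procedure used in Kahn (2011).

```python
# Invented-data sketch: share of the variance of one formant measurement
# (e.g. F1 of /a/) explained by speaker identity, i.e. between-speaker
# sum of squares divided by total sum of squares (eta-squared).
import numpy as np

rng = np.random.default_rng(2)
n_speakers, n_tokens = 30, 40
speaker_means = rng.normal(700.0, 60.0, n_speakers)   # per-speaker F1 targets (Hz)
f1 = speaker_means[:, None] + rng.normal(0.0, 90.0, (n_speakers, n_tokens))

grand_mean = f1.mean()
between_ss = n_tokens * ((f1.mean(axis=1) - grand_mean) ** 2).sum()
total_ss = ((f1 - grand_mean) ** 2).sum()
print(f"share of inter-speaker variability explained: {between_ss / total_ss:.1%}")
```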

Acknowledgments

While opinions, interpretations, conclusions, and recommendations are those of the authors, this work would not have been possible without the invaluable help of Joseph P. (“Joe”) Campbell and Anders Eriksson. This work was also stimulated by discussions with Reva Schwartz, Driss Matrouf, Pierre-Michel Bousquet and Guillaume Galou.

References

Aitken C.G.G., and Taroni F. (2004). Statistics and the evaluation of evidence for forensic scientists. 2nd ed. Wiley: Chichester.

Ajili, M., Bonastre, J.-F., Rossato, S., Kahn, J., and Lapidot, I. (2015). An information theory based data-homogeneity measure for voice comparison. In Interspeech 2015, Dresden.

Alegre, F., Vipperla, R., Evans, N., and Fauve, B. (2012). On the vulnerability of automatic speaker recognition to spoofing attacks with artificial signals. In Proceedings of the 20th European Signal Processing Conference, 36–40.

Alexander, A., Dessimoz, D., Botti, F., and Drygajlo, A. (2005). Aural and automatic forensic speaker recognition in mismatched conditions. International Journal of Speech Language and the Law, 12(2), 214.

Audibert, N., Larcher, A., Lévy, C., Kahn, J., Rossato, S., Matrouf, D., and Bonastre, J. F. (2010). LIA human-based system description for NIST HASR 2010. In Proceedings of NIST 2010 Speaker Recognition Evaluation Workshop, Brno.

Austin, J. L. (1970). Quand dire, c’est faire. Seuil: Paris.

Avizienis, A., Laprie, J.-C., Randell, B., and Landwehr, C. (2004). Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1), 11–33.

Babel, M. (2010). Dialect divergence and convergence in New Zealand English. Language in Society, 39(4), 437–456.

Balding, D. J., and Donnelly, P. (1994). The prosecutor’s fallacy and DNA evidence. Criminal Law Review, 711–721.

Barras, C., Meignier, S., and Gauvain, J.-L. (2004). Unsupervised online adaptation for speaker verification over the telephone. In Odyssey 2004 The Speaker and Language Recognition Workshop.

Becker, T., Jessen, M., Alsbach, S., Broß, F., and Meier, T. (2010). SPES: The BKA forensic automatic voice comparison system. In Odyssey 2010 – The Speaker and Language Recognition Workshop.

Besacier, L., Bonastre, J.-F., and Fredouille, C. (2000). Localization and selection of speaker-specific information with statistical modeling. Speech Communication, 31 (2–3), 89–106.

Bimbot, F., Bonastre, J.-F., Fredouille, C., Gravier, G., Magrin-Chagnolleau, I., Meignier, S., and Reynolds, D. A. (2004). A tutorial on text-independent speaker verification. EURASIP Journal on Advances in Signal Processing, 4, 430–451.

Boë, L.-J. (2000). Forensic voice identification in France. Speech Communication, 31(2), 205–224.

Boë, L.-J., and Bonastre, J.-F. (2012). L’identification du locuteur: 20 ans de témoignage dans les cours de Justice. In JEP-TALN-RECITAL Grenoble, 1, 417–424.

Bolt, R. H., Cooper, F. S., David, E. E., Jr., Denes, P. B., Pickett, J. M., and Stevens, K. N. (1970). Speaker identification by speech spectrograms: A scientists’ view of its reliability for legal purposes. The Journal of the Acoustical Society of America, 47(2B), 597–612.

Bonastre, J.-F., and Meloni, H. (1994). Inter- and intra-speaker variability of French phonemes. Advantages of an explicit knowledge-based approach. In Proceedings of the ESCA Workshop on Speaker Recognition, Identification and Verification, Martigny, 157–160.

Bonastre, J.-F., Bimbot, F., Boë, L.-J., Campbell, J., Reynolds, D., and Magrin-Chagnolleau, I. (2003). Person authentication by voice: A need for caution. In Proceedings of EUROSPEECH.

Bonastre, J.-F., Matrouf, D., and Fredouille, C. (2007). Artificial impostor voice transformation effects on false acceptance rates. In Proceedings of Interspeech, Antwerp, 2053–2056.

Bousquet, P.-M., Bonastre, J.-F., and Matrouf, D. (2014). Exploring some limits of Gaussian PLDA modeling for i-vector distributions. In Odyssey 2014 – The Speaker and Language Recognition Workshop.

Brümmer, N., and du Preez, J. (2006). Application-independent evaluation of speaker detection. Computer Speech & Language, 20 (2–3), 230–275.

Campbell, J. P. (2014). Speaker recognition for forensic applications. Presentation at Odyssey 2014 – The Speaker and Language Recognition Workshop.

Campbell, J. P., Shen, W., Campbell, W. M., Schwartz, R., Bonastre, J.-F., and Matrouf, D. (2009). Forensic speaker recognition. Signal Processing Magazine, IEEE, 26(2), 95–103.

Champod, C., and Meuwly, D. (2000). The inference of identity in forensic speaker recognition. Speech Communication, 31 (2–3), 193–203.

Doddington, G., Liggett, W., Martin, A., Przybocki, M., and Reynolds, D. (1998). Sheep, goats, lambs and wolves: A statistical analysis of speaker performance in the NIST 1998 speaker recognition evaluation. In Proceedings of ICSLP-1998, Sydney.

Drygajlo, A. (2007). Forensic automatic speaker recognition [Exploratory DSP]. IEEE Signal Processing Magazine, 24(2), 132–135.

Eriksson, A., and Lacerda, F. (2007). Charlatanry in forensic speech science: A problem to be taken seriously. International Journal of Speech Language and the Law, 14(2), 169–193.

Gold, E., and Hughes, V. (2014). Issues and opportunities: The application of the numerical likelihood ratio framework to forensic speaker comparison. Science and Justice, 54(4), 292–299.

Evans, N., Kinnunen, T., Yamagishi, J., Wu, Z., Alegre, F., and Leon, P. D. (2014). Speaker recognition anti-spoofing. In S. Marcel, M. S. Nixon, and S. Z. Li (ed.) Handbook of Biometric Anti-Spoofing (pp. 125–146). Springer: London.

Fauve, B. G. B., Matrouf, D., Scheffer, N., Bonastre, J.-F., and Mason, J. S. D. (2007). State-of-the-art performance in text-independent speaker verification through open-source software. IEEE Transactions on Audio, Speech and Language Processing, 15(7), 1960–1968.

Fienberg, S. E. (1989). The evolving role of statistical assessments as evidence in the courts. Springer: New York.

French, P., and Harrison, P. (2007). Position Statement concerning use of impressionistic likelihood terms in forensic speaker comparison cases. International Journal of Speech Language and the Law, 14(1), 137–144.

French, P., Nolan, F., Foulkes, P., Harrison, P., and McDougall, K. (2010). The UK position statement on forensic speaker comparison: A rejoinder to Rose and Morrison. International Journal of Speech Language and the Law, 17(1), 143–152.

Giles, H., Coupland, J., and Coupland, N. (1991). Contexts of accommodation: Developments in applied sociolinguistics. Cambridge University Press: New York.

Gonzalez-Rodriguez, J., Rose, P., Ramos, D., Toledano, D. T., and Ortega-Garcia, J. (2007). Emulating DNA: Rigorous quantification of evidential weight in transparent and testable forensic speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing, 15(7), 2104–2115.

Greenberg, C. S., Martin, A. F., Brandschain, L., Campbell, J. P., Cieri, C., Doddington, G. R., and Godfrey, J. J. (2010). Human assisted speaker recognition in NIST SRE10. In Odyssey 2010 – The Speaker and Language Recognition Workshop.

Greenberg, C. S., Martin, A. F., Doddington, G. R., and Godfrey, J. J. (2011). Including human expertise in speaker recognition systems: report on a pilot evaluation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5896–5899.

Hautamäki, R. G., Hautamäki, V., Rajan, P., and Kinnunen, T. (2013). Merging human and automatic system decisions to improve speaker recognition performance. In Proceedings of Interspeech, 2519–2523.

Jessen, M. (2008). Forensic phonetics. Language and Linguistics Compass, 2(4), 671–711.

Kahn, J. (2011). Parole de locuteur: performance et confiance en identification biométrique vocale. PhD thesis, Université d’Avignon.

Kahn, J., Audibert, N., Rossato, S., and Bonastre, J.-F. (2010). Intra-speaker variability effects on speaker verification performance. In Odyssey 2010 – The Speaker and Language Recognition Workshop.

Kahn, J., Audibert, N., Rossato, S., and Bonastre, J.-F. (2011). Speaker verification by inexperienced and experienced listeners vs. speaker verification system. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5912–5915.

Kahneman, D. (2011). Thinking, fast and slow. Macmillan.

Kassin, S. M., Dror, I. E., and Kukucka, J. (2013). The forensic confirmation bias: Problems, perspectives, and proposed solutions. Journal of Applied Research in Memory and Cognition, 2(1), 42–52.

Kenny, P., Boulianne, G., Ouellet, P., and Dumouchel, P. (2007). Joint factor analysis versus eigenchannels in speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing, 15(4), 1435–1447.

Kenny, P., Gupta, V., Stafylakis, T., Ouellet, P., and Alam, J. (2014). Deep neural networks for extracting baum-welch statistics for speaker recognition. Odyssey 2014 – The Speaker and Language Recognition Workshop.

Kersta, L. G. (1962). Voiceprint identification. Nature, 196(4861), 1253–1257.

Kim, M., Horton, W. S., and Bradlow, A. R. (2011). Phonetic convergence in spontaneous conversations as a function of interlocutor language distance. Laboratory Phonology, 2(1), 125–156.

Krauss, R. M., and Pardo, J. S. (2006). Speaker perception and social behavior: bridging social psychology and speech science. In P. A. M. van Lange (ed.) Bridging social psychology: Benefits of transdisciplinary approaches, pp. 273–278.

Labov, W. (1972). Sociolinguistic patterns. University of Pennsylvania Press: Philadelphia.

Lelong, A., and Bailly, G. (2012). Characterizing phonetic convergence with speaker recognition techniques. The Listening Talker Workshop (LISTA 2012), 28–31.

Lei, Y., Scheffer, N., Ferrer, L., and McLaren, M. (2014). A novel scheme for speaker recognition using a phonetically-aware deep neural network. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1695–1699.

Levitan, R., and Hirschberg, J. B. (2011). Measuring acoustic-prosodic entrainment with respect to multiple levels and dimensions. Proceedings of Interspeech, 3081–3084.

Mandasari, M. I., McLaren, M., and van Leeuwen, D. A. (2011). Evaluation of i-vector speaker recognition systems for forensic application. Proceedings of Interspeech, 21–24.

Martin, A., Doddington, G., Kamm, T., Ordowski, M., and Przybocki, M. (1997). The DET curve in assessment of detection task performance. Proceedings of Eurospeech, Rhodes.

Martin, A. F., Greenberg, C. S., Stanford, V. M., Howard, J. M., Doddington, J. J., and Godfrey, J. J. (2014). Performance factor analysis for the 2012 NIST speaker recognition evaluation. Proceedings of Interspeech. Singapore.

Matrouf, D., Bonastre, J.-F., and Fredouille, C. (2006). Effect of speech transformation on impostor acceptance. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 933–936.

McLaren, M. L., Matrouf, D., Vogt, R. J., and Bonastre, J.-F. (2008). Combining continuous progressive model adaptation and factor analysis for speaker verification. Proceedings of Interspeech, Brisbane.

McLaren, M., Matrouf, D., Vogt, R., and Bonastre, J.-F. (2011). Applying SVMs and weight-based factor analysis to unsupervised adaptation for speaker verification. Computer Speech & Language, 25(2), 327–340.

Morrison, G. S. (2011). Measuring the validity and reliability of forensic likelihood-ratio systems. Science & Justice, 51(3), 91–98.

Morrison G. S. (2014). Distinguishing between forensic science and forensic pseudoscience: Testing of validity and reliability, and approaches to forensic voice comparison. Science and Justice, 54 (3), 245–256.

Nakasone, H., and Beck, S. B. (2001). Forensic automatic speaker recognition. In Odyssey 2001 – The Speaker and Language Recognition Workshop.

NIST. (2010). The NIST year 2010 speaker recognition evaluation plan. Available at http://itl.nist.gov/iad/mig/tests/sre/2010/NIST_SRE10_evalplan.r6.pdf

Ortega-Garcia, J., Gonzalez-Rodriguez, J., and Marrero-Aguiar, V. (2000). AHUMADA: A large speech corpus in Spanish for speaker characterization and identification. Speech Communication, 31 (2–3), 255–264.

Pardo, J. S. (2006). On phonetic convergence during conversational interaction. The Journal of the Acoustical Society of America, 119(4), 2382–2393.

Pardo, J. S. (2013). Measuring phonetic convergence in speech production. Frontiers in Psychology, 4, 559.

Pardo, J. S., Gibbons, R., Suppes, A., and Krauss, R. M. (2012). Phonetic convergence in college roommates. Journal of Phonetics, 40(1), 190–197.

Perrot, P., Aversano, G., and Chollet, G. (2007). Voice disguise and automatic detection: Review and perspectives. In Y. Stylianou, M. Faundez-Zanuy, and A. Esposito (ed.), Progress in nonlinear speech processing (pp. 101–117). Springer Berlin Heidelberg.

Petrovska-Delacrétaz, D., Chollet, G., and Dorizzi, B. (2009). Guide to biometric reference systems and performance evaluation. Springer: London.

Phillips, P. J., Martin, A., Wilson, C. L., and Przybocki, M. (2000). An introduction evaluating biometric systems. Computer, 33(2), 56–63.

Pruzansky, S. (1963). Pattern-matching procedure for automatic talker recognition. The Journal of the Acoustical Society of America, 35(3), 354–358.

Przybocki, M., and Martin, A. F. (2004). NIST speaker recognition evaluation chronicles. Odyssey 2004 – The Speaker and Language Recognition Workshop.

Przybocki, M., Martin, A. F., and Le, A. N. (2006). NIST speaker recognition evaluation chronicles – part 2. In Odyssey 2006 – The Speaker and Language Recognition Workshop.

Przybocki, M. A., Martin, A. F., and Le, A. N. (2007). NIST speaker recognition evaluations utilizing the mixer corpora – 2004, 2005, 2006. IEEE Transactions on Audio, Speech and Language Processing, 15(7), 1951–1959.

Ramos, D., Franco-Pedroso, J., and Gonzalez-Rodriguez, J. (2011). Calibration and weight of the evidence by human listeners. The ATVS-UAM submission to NIST HUMAN-aided speaker recognition 2010. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5908–5911.

Reynolds, D. A., Quatieri, T. F., and Dunn, R. B. (2000). Speaker verification using adapted Gaussian Mixture Models. Digital Signal Processing, 10 (1–3), 19–41.

Roberts, P. (2013). Renegotiating forensic cultures: Between law, science and criminal justice. Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences, 44(1), 47–59.

Rose, P. (2006). Technical forensic speaker recognition: Evaluation, types and testing of evidence. Computer Speech & Language, 20 (2–3), 159–191.

Rose, P., and Morrison, G. (2009). A response to the UK position statement on forensic speaker comparison. The International Journal of Speech, Language and the Law, 16(1), 139.

Scherer, K. (1986). Vocal affect expression: A review and a model for future research. Psychological Bulletin, 99, 143–165.

Schwartz, R., Campbell, J. P., Shen, W., Sturim, D. E., Campbell, W. M., Richardson, F. S., Dunn, R. B., and Granville, R. (2010). USSS-MITLL 2010 human assisted speaker recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5904–5907.

Shen, W., Campbell, J.P., Straub, D., and Schwartz, R. (2011). Assessing the speaker recognition performance of naïve listeners using mechanical turk. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5916–5919.

Soong, F. K., Rosenberg, A. E., Juang, B.-H., and Rabiner, L. R. (1987). Report: A vector quantization approach to speaker recognition. AT&T Technical Journal, 66(2), 14–26.

Stafylakis, T., Kenny, P., Senoussaoui, M., and Dumouchel, P. (2012). Preliminary investigation of Boltzmann machine classifiers for speaker recognition. In Odyssey 2012 – The Speaker and Language Recognition Workshop.

Thompson, W. C., Kaasa, S. O., and Peterson, T. (2013). Do jurors give appropriate weight to forensic identification evidence? Journal of Empirical Legal Studies, 10(2), 359–397.

Tosi, O. (1979). Voice identification theory and legal applications. Baltimore: University Park Press.

Tversky, A., and Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185(4157), 1124–1131.

Univaso, P., Soler, M. M., and Gurlekian, J. A. (2013). Human assisted speaker recognition using forced alignments on HMM. International Journal of Engineering Research and Technology, 2(9), ESRSA Publications.

van Dijk, M., Orr, R., van der Vloed, D., and van Leeuwen, D. (2013). A human benchmark for automatic speaker recognition. In Proceedings of the 1st International Conference Biometric Technologies in Forensic Science, Nijmegen, 39–45.

van Leeuwen, D. A., and Brümmer, N. (2013). The distribution of calibrated likelihood-ratios in speaker recognition. In Proceedings of Interspeech, Lyon.

van Leeuwen, D. A., Martin, A. F., Przybocki, M. A., and Bouten, J. S. (2006). NIST and NFI-TNO evaluations of automatic speaker recognition. Computer Speech & Language, 20(2–3), 128–158.

Vasilakakis V., Cumani S., Laface P. (2013). Speaker recognition by means of Deep Belief Networks. Proceedings of Biometric Technologies in Forensic Science, Nijmegen.

Wu, Z., Siong, C. E., and Li, H. (2012). Detecting converted speech and natural speech for anti-spoofing attack in speaker recognition. Proceedings of Interspeech, Portland.

Zhang, C., and Tan, T. (2008). Voice disguise and automatic speaker recognition. Forensic Science International, 175 (2–3), 118–122.


1       International Association for Forensic Phonetics and Acoustics (http://www.iafpa.net/)

2       http://www.iafpa.net/voiceprintsres.htm

3       Grégory Villemin was a young boy murdered in 1984. This unresolved case involves several members of his family and is very famous in France.

4       Jérôme Prieto was accused, on the basis of a recorded phone message, of participating in a Basque terrorism case which took place in 1996.

5       Firstly presented in Christophe Champod’s tutorial, RLA2C, Avignon, 1998 (RLA2C was one of the precursors of “Speaker Odyssey” workshops).

6       As presented by the authors, this sentence was inspired by the report of a panel on statistical assessments as evidence in courts (Fienberg, 1989, p. 141), from which the following quotation is taken “it is the utility function of the court that is appropriate, not the utility function of the statistician”.

7       Of course, these questions were important and still are. We will get back to these aspects later, in the light of the Bayesian decision framework.

8       Mainly false alarm, miss probability, EER, DCF and DET plots (Martin et al., 1997)

9       An example could help us to define “calibration loss”. Let us imagine that we have a perfect system which outputs perfect LRs. Now, something like a constant background noise disturbs this system and adds a constant bias to its output. The CLLR of the system will then degrade significantly, while its discrimination power remains the same. The difference between the two CLLR values is the “calibration loss”.

10     The number of different speakers involved in the condition is often taken into consideration.

11     Kahn et al. (2010) reported similar performance differences when different databases or systems are used.

12     All the speech excerpts come from the same evaluation condition of NIST 2008, in order to limit biases like channel, language or duration.

13     The trials were selected in 2010 according to their “intrinsic difficulty”, estimated by an automatic system. This choice could be questioned; several other variants are possible, like average-difficulty selection, random selection or auditory-based selection.

14     The transformation is done acoustically frame by frame, only on the filter parameters of the classical source-filter model.

15     See Daubert v. Merrell Dow Pharmaceuticals, Inc. (1993) 509 U.S. 579, 589 and US Federal Rules of Evidence, Rule 702, as amended Apr. 17, 2000, eff. Dec. 1, 2000; Apr. 26, 2011, eff. Dec. 1, 2011.

16     A speaker specific criterion could be based on a manual or computer assisted measure on the signal.

17     Super Phonetic Annotation and Analysis Tool

18     Speech of speakers: Performance and reliability in voice biometrics

19     Moez Ajili’s PhD is funded by the French National Agency funded project “Fabiole”, about reliability in voice comparison.