das :: Activity :: Just Me

minutes080508
das | page | Thu May 08
  - present: M, T, D
  - priority: close processing chain. Finally get something from ASR
    to parser to DM to TTS -- even if it is only a parrot system!
  - Dialogue Manager:
    - can be something like Dipper, i.e. information-state update
      based.
    - or FSA (specified in SCXML or similar?)
    - but rules can be simple anyway, simple FSA-stuff:
      - identification -> confirmation (repeat on negative) ->
        orientation -> confirmation (repeat on negative) -> placement
        (repeat on negative)
    - S: "Welches Teil?" (Which piece?) U: "Das zweite von links" (The
      second from the left) S: "Das hier?" (This one?) U: "Nein,
      daneben" (No, next to it). [ --> need to be able to deal with
      context-dependent utterances ]
  - do WOz pretty soon? Wizard hears user utterances, can trigger
    simple prompts:
    - "Welches Teil?" (Which piece?) "Soll ich es drehen?" (Should I
      rotate it?) "Wohin?" (Where to?); "So?" (Like this?) "Hier?"
      (Here?)
    - to hide that Wizard is human, let GUI do mouse movements? I.e.,
      wizard selects parameters of action (selecting piece, rotating
      it, dragging it), then selects prompt ("So?"); this is then sent
      to system which executes action (e.g., computes and executes
      mouse path; plays synchronised utterance). This won't allow us
      to test reaction to smooth turn-taking (since it is
      non-incremental; the wizard will have to fully specify the
      action), but it will allow us to test user reactions & learn
      about the complexity of their speech, especially the reactions to
      CRs like "So?" (Like this?), e.g., "nein, eins weiter hoch"
      (no, one further up).
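
The simple FSA variant of the dialogue manager discussed above could be sketched roughly as follows. Only the stage order and the "repeat on negative" rule come from the minutes; the state names, the extra confirm_placement state, and the keyword-based yes/no classifier are assumptions made for illustration.

```python
# Minimal FSA sketch of the stage sequence from the minutes.
# State names and the classifier are hypothetical.
YES, NO, OTHER = "yes", "no", "other"

TRANSITIONS = {
    "identification": {OTHER: "confirm_identification"},
    "confirm_identification": {YES: "orientation", NO: "identification"},
    "orientation": {OTHER: "confirm_orientation"},
    "confirm_orientation": {YES: "placement", NO: "orientation"},
    "placement": {OTHER: "confirm_placement"},
    "confirm_placement": {YES: "done", NO: "placement"},
}


def classify(utterance):
    """Very crude input classes: looks only at the first word (assumption)."""
    words = utterance.lower().replace(",", " ").split()
    if not words:
        return OTHER
    if words[0] == "ja":
        return YES
    if words[0] == "nein":
        return NO
    return OTHER


def step(state, utterance):
    """Return the next state; stay in the current state if no rule matches."""
    return TRANSITIONS.get(state, {}).get(classify(utterance), state)


def run(utterances, state="identification"):
    """Feed a list of user utterances through the FSA and return the state trace."""
    trace = [state]
    for utterance in utterances:
        state = step(state, utterance)
        trace.append(state)
    return trace
```

Note that "Nein, daneben" is treated here as a plain negative, so the "daneben" part is thrown away and the user is asked again. That is exactly the context-dependent case flagged in the minutes, which a pure FSA of this kind does not handle.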


THE FRIGGING WIKI IS BROKEN. you can find the complete minutes on my weblog.

Home Page
das | page | Thu May 08

Meeting minutes

(newest first)

05/05/08 minutes080508 

14/04/08 minutes140408

03/02/08 minutes030208b

04/12/07 minutes041207

26/11/07 @Timo

19/11/07 minutes191107

13/11/07 minutes131107

05/11/07 minutes051107

22/10/07 minutes221007

01/10/07 minutes2007_10_01

10/09/07 minutes100907

23/08/07 minutes230807

03/07/07 minutes030707

19/06/07 minutes190607

05/06/07 minutes050607_zeitwort2

21/05/07 minutes210507

 

Miscellaneous

 

Conferences2008

minutes140408
das | page | Mon Apr 14
- InPro, meeting, minutes, 14/04/08
  - present: M, T, G, D
  - Gabriel demo'ed current state of Higgins. Displays duration of
    vocal action (both recognised and own) on timeline, uses simple
    threshold-based boundary tone classification (up, down) as the
    basis for its decisions. (This is mostly a test of the
    architecture at the moment, the strategies are very simple.)
  - disfluencies: what to do with aborted words? Most likely, Sphinx
    will recognise rubbish. Including aborted versions of all words
    would be too unrestrictive; adding other methods (e.g., using
    prosodic info) would require too many changes at the low level of
    ASR. (Hm. But at some point we'll have frame-level /
    syllable-level prosodic info anyway. Shouldn't be too hard to let
    a classifier judge whether a word was perhaps misrecognised
    because it was a different, aborted one.)
  - Timo and Gabriel will work together on getting better classifier
    for boundary tone detection to work. Does it need to do speaker
    adaptation?
  - first step on syntax side: toy grammar for Pento domain
    (``Nimm das {Kreuz | Teil | lange Ding} aus der Mitte links
    oben'', i.e. `Take the {cross | piece | long thing} from the
    middle, top left') in Higgins parser.
  - using a grammar as language model in Sphinx apparently doesn't
    work incrementally (doesn't return results before top category has
    been found), but using statistical LM does work. (There are still
    technical problems, but it looks promising.)
  - even if we can't use a grammar, we can still bootstrap an n-gram
    LM with utterances generated from a domain grammar.
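
The bootstrapping idea in the last item could look roughly like the sketch below: expand the toy grammar into all its utterances, then count bigrams over them. The slot-list representation of the grammar, the function names, and the unsmoothed maximum-likelihood estimate are assumptions for illustration; a model actually fed to Sphinx would need smoothing and a suitable file format.

```python
from collections import Counter
from itertools import product

# Toy grammar from the minutes, written as a sequence of slots.
# Each slot lists its alternatives (representation is an assumption).
SLOTS = [
    ["Nimm das"],
    ["Kreuz", "Teil", "lange Ding"],
    ["aus der Mitte links oben"],
]


def generate(slots):
    """Yield every utterance the slot grammar licenses, as a list of words."""
    for choice in product(*slots):
        yield " ".join(choice).split()


def bigram_counts(sentences):
    """Count histories and bigrams, with sentence boundary markers."""
    histories, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        histories.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return histories, bigrams


def prob(word, history, histories, bigrams):
    """Unsmoothed maximum-likelihood estimate of P(word | history)."""
    if histories[history] == 0:
        return 0.0
    return bigrams[(history, word)] / histories[history]


sentences = list(generate(SLOTS))
histories, bigrams = bigram_counts(sentences)
```

With more alternatives per slot the generated corpus grows multiplicatively, so for a larger domain grammar one would probably sample utterances rather than enumerate them all.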


030208cont
das | page | Mon Mar 03
  - short-term projects:
    - bababa2, SIGdial poster
      - TO DOs, unprioritised: a) syllable boundaries, taken from the
        pronunciation dictionary; b) use real audio, Kiel Corpus;
        c) use ASR, words, n-grams; d) better speech states, phrase
        boundaries (for BCs); e) better TT strategies; f) simulation,
        constant time < (or >) real-time; g) better evaluation;
        h) interruption management; i) BC management;
        j) parametrisation (chattiness, interruption probability,
        etc.); k) adaptivity
      - possible angles for a paper:
        - in the direction of David T., `believable, non-scripted
          content-free background chatter'
          Not very convincing; for something that is generated online
          it is rather resource-hungry. Nobody would seriously use
          this just for background chatter.
        - `simple rules create realistic turn-taking patterns'
          SSJ rules as *generative* rules, not just descriptive. Shows
          that such a set of rules, together w/ some audio magic, is
          enough to produce patterns that are `natural' (in a way that
          needs to be defined properly). Again sort of upper-bound; to
          get something like this working properly within a real
          system, here's what we would need in terms of components.
          - to do first: b), d), e), g).
          - needed: more principled metric for `naturalness' of
            resulting corpus. Multi-dimensional: distribution of gaps
            & overlaps, balance btw speakers, turn length (in time,
            but also # of utterances).
    - `syntactic and prosodic language modelling for incremental
      utterance segmentation', for Coling
      Utterance end-pointing, but in an incremental setup. Needed to
      know where to clear the chart of the parser. Connected to a
      well-researched task (i.e., easy to motivate & compare), but
      different in that we don't allow (as much?) right context.
      - method:
        - select only multi-utterance turns; EOUs to find are the
          turn-internal ones.
        - use original data & variants w/ various WER.
          Those need plausible time information. How much does
          this degrade performance?
      - what's a good way to evaluate this? follow-on effects of wrong
        decisions: an insert for example makes us restart the parser,
        and hence get other things wrong?
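
As a starting point for the multi-dimensional `naturalness' metric asked for under the bababa2 item, three of the listed dimensions could be computed roughly like this. The (speaker, start, end) turn representation in seconds and all function names are assumptions; turn length in number of utterances is left out because it would need utterance-level annotation.

```python
from collections import defaultdict


def transition_offsets(turns):
    """Offsets at speaker changes: positive = gap, negative = overlap.

    `turns` is a list of (speaker, start, end) tuples in seconds
    (hypothetical format).
    """
    ordered = sorted(turns, key=lambda turn: turn[1])
    offsets = []
    for (spk_a, _, end_a), (spk_b, start_b, _) in zip(ordered, ordered[1:]):
        if spk_a != spk_b:
            offsets.append(start_b - end_a)
    return offsets


def speaker_balance(turns):
    """Share of total speaking time per speaker."""
    time = defaultdict(float)
    for speaker, start, end in turns:
        time[speaker] += end - start
    total = sum(time.values())
    if total == 0:
        return {}
    return {speaker: t / total for speaker, t in time.items()}


def mean_turn_length(turns):
    """Mean turn duration in seconds."""
    if not turns:
        return 0.0
    return sum(end - start for _, start, end in turns) / len(turns)
```

Comparing the distribution of these values between the simulated corpus and a real one (e.g., via histograms of the offsets) would be one possible way to make `natural' more precise.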
