Diarization (beta)

Diarization is the process of automatically detecting speakers and speech segments within an audio file. Diarization is available via the phon-diarization-plugin, which may be downloaded from https://github.com/phon-ca/phon-diarization-plugin/packages/1025013. Diarization actions are available from the Timeline view via the Diarization toolbar button; clicking the button displays the diarization menu.

Installing the Plug-in

The following steps describe how to install the diarization plug-in for the current user:
  • Download the newest version of the diarization plug-in jar from https://github.com/phon-ca/phon-diarization-plugin/packages/1025013. Choose the phon-diarization-plugin-<version>.jar file from the right-hand side of the webpage.
  • Copy this file into the Phon plug-ins folder for this user:
    • macOS (Finder)
      1. Open a new Finder window and choose Go > Go to Folder... from the window menu
      2. Type ~/Library/Application Support/Phon in the dialog and press Enter
      3. Create a new folder named plugins if it does not already exist
      4. Copy the downloaded file, phon-diarization-plugin-<version>.jar, into the plugins folder
      5. Restart Phon
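    • macOS (command line)
      Copy and paste the following into a new Terminal window, replacing <version> with the version you downloaded (the first command creates the destination folder if it does not exist), then restart Phon:
      mkdir -p ~/Library/Application\ Support/Phon/plugins
      cp ~/Downloads/phon-diarization-plugin-<version>.jar ~/Library/Application\ Support/Phon/plugins/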
    • Windows
      1. Open a new Explorer window and type %APPDATA%\Phon into the address bar
      2. Create a new folder named plugins if it does not already exist
      3. Copy the downloaded file, phon-diarization-plugin-<version>.jar, into the plugins folder
      4. Restart Phon
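
The command-line installation above can also be wrapped in a small shell function that checks for the downloaded file before copying it. This is a minimal sketch and not part of Phon or the plug-in: the function name install_plugin is hypothetical, and the plugins folder is passed as an argument so the function makes no assumption about where that folder lives.

```shell
# Hypothetical helper: copy a downloaded plug-in jar into a plugins folder.
# $1 = path to the downloaded jar, $2 = plugins folder (created if missing)
install_plugin() {
  plugin_jar="$1"
  plugins_dir="$2"

  # Stop early if the jar was not downloaded where we expect it.
  if [ ! -f "$plugin_jar" ]; then
    echo "plug-in jar not found: $plugin_jar" >&2
    return 1
  fi

  # Create the destination folder if needed, then copy the jar into it.
  mkdir -p "$plugins_dir" && cp "$plugin_jar" "$plugins_dir/"
}

# Example call on macOS (replace <version> with the version you downloaded):
#   install_plugin "$HOME/Downloads/phon-diarization-plugin-<version>.jar" \
#                  "$HOME/Library/Application Support/Phon/plugins"
```

As with the manual steps, Phon must be restarted after the file has been copied.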

Diarization Wizard

The diarization wizard guides you through running the available diarization tools. Running the wizard is the first step in the diarization procedure.

Step one of the wizard is to select the desired diarization tool:
  • LIUM

    The LIUM diarization tool (https://projets-lium.univ-lemans.fr/spkdiarization/) is an embedded (i.e., offline) diarization tool which is packaged with Phon.

    • Max speakers

      Results must contain no more than the specified number of speakers. A value of 0 indicates this parameter is not considered.

  • Google Cloud Speech-to-Text

    Requires a Google Cloud account and Internet access. Audio for the session will be uploaded and transcribed using Google Speech-to-Text. See Setting Up Your Google Cloud Account for instructions on setting up Google Cloud for use with the diarization tool.

    • Project id

      The project id for your Google Cloud project, available from the IAM > Settings page of your Google Cloud account.

    • Google Cloud service account credentials file

      The service account credentials file (.json) created when Setting Up Your Google Cloud Account.

    • Google Cloud Storage bucket location

      The location of the storage bucket, which is used when the audio for diarization is longer than 60 seconds. For best results, choose the option closest to your geographical location.

    • Language model

      The language model used for Speech-to-Text transcription. Only languages for which the diarization feature is available are shown in the list; more information can be found at https://cloud.google.com/speech-to-text/docs/languages.

    • Audio format model

      The format of the audio file. Choose the most appropriate option (for most users this will be video).

      • default

        Use this model if your audio does not fit any of the other models. For example, you can use this for long-form audio recordings that feature a single speaker only. The default model will produce transcription results for any type of audio, including audio, such as video clips, that has a separate model specifically tailored to it. However, recognizing video clip audio using the default model would likely yield lower-quality results than using the video model. Ideally, the audio is high-fidelity, recorded at a sampling rate of 16,000 Hz or greater.

      • video

        Use this model for transcribing audio from video clips or other sources (such as podcasts) that have multiple speakers. This model is also often the best choice for audio that was recorded with a high-quality microphone or that has lots of background noise. For best results, provide audio recorded at a sampling rate of 16,000 Hz or greater.

      • phone_call

        Use this model for transcribing audio from a phone call. Typically, phone audio is recorded at a sampling rate of 8,000 Hz.

      • command_and_search

        Use this model for transcribing shorter audio clips. Some examples include voice commands or voice search.

    • Max speakers

      Results must contain no more than the specified number of speakers. A value of 0 indicates this parameter is not considered.
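
Before entering the Google Cloud options above, it can help to confirm that the credentials file is a service account key and to see which project it belongs to. The sketch below is an illustration only and not part of the plug-in: the function name key_project_id is hypothetical, and it assumes the usual layout of a Google service account key file, which records "type": "service_account" together with a "project_id" field. Whether that project is the one you intend to use with the wizard is something to confirm in your Google Cloud account.

```shell
# Hypothetical helper: print the project_id stored in a service account key file.
# Fails if the file is missing or does not look like a service account key.
# $1 = path to the credentials file (.json)
key_project_id() {
  key_file="$1"

  if [ ! -f "$key_file" ]; then
    echo "no such file: $key_file" >&2
    return 1
  fi

  # Service account keys identify themselves with "type": "service_account".
  if ! grep -q '"type"[[:space:]]*:[[:space:]]*"service_account"' "$key_file"; then
    echo "not a service account key: $key_file" >&2
    return 1
  fi

  # Print the first project_id value found (prints nothing if the field is absent).
  sed -n 's/.*"project_id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$key_file" | head -n 1
}

# Example call (the path is a placeholder):
#   key_project_id "$HOME/Downloads/my-credentials.json"
```

The function relies only on grep and sed, so it does not require a JSON tool to be installed; it is a plain text match and not a full JSON parser.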

After clicking Next, the selected diarization tool will execute with output displayed in the Log (this may take some time, depending on the length of the audio). Once the tool has completed, Diarization Results will be loaded in the Timeline view. Results are also saved in the session resources folder as __res/diarization/<corpus>_<session>.xml. Results may be modified, saved, and recalled at a later time using the diarization menu.
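
The naming pattern for the saved results file can be sketched as a pair of shell functions. This is only an illustration: the function names are hypothetical, and the folder that contains __res is passed in as an argument because its exact location is not specified here.

```shell
# Hypothetical helper: print the expected results file for a session.
# $1 = folder containing __res (assumed to be known), $2 = corpus, $3 = session
diarization_results_file() {
  printf '%s/__res/diarization/%s_%s.xml\n' "$1" "$2" "$3"
}

# Hypothetical helper: succeed if saved diarization results exist for a session.
has_diarization_results() {
  [ -f "$(diarization_results_file "$1" "$2" "$3")" ]
}

# Example call (all three arguments are placeholders):
#   has_diarization_results "/path/to/folder" "MyCorpus" "Session1" && echo "results saved"
```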

Modifying and Adding Diarization Results

Diarization results are not automatically added to the session. Results from the diarization tool are loaded in the Timeline view for review and modification when the diarization wizard has finished. Diarization results will appear in place of the normal record tier and may be modified using typical mouse and keyboard interactions (see the Timeline view documentation for details on modifying and playing record segments using keyboard and mouse). A label titled Diarization Results will appear above the record grid to indicate that results have been loaded. This label will include a '*' if the results have been modified. Results may be saved using the appropriate option in the diarization menu and recalled at a later time if necessary.

Actions

Diarization actions are available from a contextual menu displayed when clicking the Diarization button in the Timeline view's toolbar or when clicking the Diarization Results label in the Timeline view while results are loaded. Participant-specific actions are available from the contextual menu displayed when clicking the participant name in the diarization results tier.

Merging Diarization Participants

The diarization process may produce more speakers than exist in the audio recording, so it is often useful to merge the records of two diarization speakers. To merge participants in loaded diarization results, use the Assign records to speaker sub-menu of the participant contextual menu and select a participant under Diarization Participants.

Assign Records to Existing Participant

To assign diarization results to an existing session participant, use the Assign records to speaker sub-menu of the participant contextual menu and select a participant under Session Participants.

Adding Results to Session

There are several methods for adding diarization results to the session:
  • Add all diarization results to session

    Available from the diarization menu, this will add all diarization results to the session and close the results.

  • Add all results for participant to session

    Available from the diarization and participant contextual menus, this will add all diarization results for a speaker to the session and remove them from the diarization results. Diarization results will be closed if all records have been added.

  • Add specific records to session

    Add individual records by selecting them and clicking the '+' icon displayed above the selected records, by pressing Enter, or via the diarization menu. Records will be removed from the diarization results, and the results will be closed if all records have been added.