Orthography

The Orthography tier in Phon encodes the spoken form of an utterance. Phon's orthography format is based on the CHAT transcription system used by TalkBank and CHILDES. The CHAT main line maps directly to Phon's Orthography tier, with one difference: CHAT embeds media segment timing on the main line, while Phon stores media segments in a separate Segment tier.

Within the Orthography tier, words are separated by spaces. Each word may be modified using the prefixes and suffixes defined below. Utterances end with a terminator (e.g., period, question mark). Events, annotations, and other coding may also appear inline.

For the complete CHAT specification, see the CHAT Manual.

Words

Words are the basic units of the Orthography tier. A word is a sequence of characters delimited by spaces. The first word of an utterance is not capitalized unless it is a proper noun or a word normally capitalized on its own (e.g., "I" in English, nouns in German).

Word Prefixes

Word prefixes identify words with special status. Prefixed words appear in a secondary color in the Transcript view to distinguish them from regular word text.


Prefix	Category	Example
`0`	word omission (excluded from word alignment)	`0word`
`&+`	fragment (incomplete word)	`&+ba`
`&-`	filler	`&-um`
`&~`	nonword	`&~ba`
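As a rough illustration of the table above, the prefixes can be matched with a simple lookup. This is a minimal sketch and not Phon's own parser; the names `WORD_PREFIXES` and `classify_prefix` are hypothetical.

```python
# Lookup table copied from the prefix table above (illustrative only).
WORD_PREFIXES = {
    "&+": "fragment",
    "&-": "filler",
    "&~": "nonword",
    "0": "omission",
}

def classify_prefix(token):
    """Return (category, bare_word); category is None for a regular word."""
    for prefix, category in WORD_PREFIXES.items():
        # Require at least one character after the prefix,
        # so a lone "0" is left as a regular token.
        if token.startswith(prefix) and len(token) > len(prefix):
            return category, token[len(prefix):]
    return None, token
```

For example, `classify_prefix("&+ba")` yields the category `"fragment"` with the bare word `"ba"`, while an unprefixed word is returned unchanged with category `None`.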

Special Form Markers

Special form markers identify words that are not found in standard dictionaries or have some special status. They are placed at the end of a word after the @ symbol. Special form markers appear in a secondary color in the Transcript view to distinguish them from regular word text.


Suffix	Category	Example
`@a`	addition	`xxx@a`
`@b`	babbling	`abame@b`
`@c`	child-invented form	`gumma@c`
`@d`	dialect form	`younz@d`
`@e`	echolalia	`want@e more@e`
`@f`	family-specific form	`bunko@f`
`@fp`	filled pause	`um@fp`
`@g`	general special form	`gonga@g`
`@i`	interjection	`uhhuh@i`
`@k`	multiple letters (kana)	`abcd@k`
`@l`	letter	`b@l`
`@n`	neologism	`breaked@n`
`@nv`	no voice	`ha@nv`
`@o`	onomatopoeia	`woofwoof@o`
`@p`	phonology consistent form	`aga@p`
`@q`	quoted metareference	`no@q`
`@sas`	sign & speech	`apple@sas`
`@si`	singing	`lalala@si`
`@sl`	signed language	`apple@sl`
`@t`	test word	`wug@t`
`@u`	UNIBET transcription	`binga@u`
`@wp`	word play	`goobarumba@wp`
`@x`	words to be excluded	`stuff@x`
`@z:*`	user-defined code	`word@z:rtfd`
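The suffix notation above can be separated from the word with a regular expression. The pattern below is an assumption made for illustration (lowercase marker letters, optional `:code` part as in the `@z` example); it is not taken from Phon's source, and `split_special_form` is a hypothetical name.

```python
import re

# Assumed shape: word body, "@", lowercase marker, optional ":code".
SPECIAL_FORM = re.compile(
    r"^(?P<word>[^@\s]+)@(?P<marker>[a-z]+)(?::(?P<code>\S+))?$"
)

def split_special_form(token):
    """Return (word, marker, code); marker and code are None when absent."""
    m = SPECIAL_FORM.match(token)
    if m is None:
        return token, None, None
    return m.group("word"), m.group("marker"), m.group("code")
```

With this sketch, `gumma@c` splits into the word `gumma` and the marker `c`, and `word@z:rtfd` additionally yields the code `rtfd`.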

Untranscribed Material

These special words represent material that cannot be transcribed normally. Untranscribed words should be aligned with * in IPA Target and IPA Actual tiers.


Code	Meaning
`xxx`	unintelligible speech
`yyy`	unintelligible with phonological coding on %pho line
`www`	untranscribed material

Incomplete Words

When a word is incomplete but the intended meaning is clear, the missing material is enclosed in parentheses within the word: (be)cause, sit(ting). The word is treated as the complete form for analysis.
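The two readings of an incomplete word (what was said and what was intended) can be derived mechanically. The following is a small sketch under the assumption that it is applied to word tokens only, not to pauses such as `(.)`; the function name is hypothetical.

```python
import re

def expand_incomplete(word):
    """Return (spoken, intended) forms for a word such as "(be)cause"."""
    # Spoken form: drop the parenthesized material entirely.
    spoken = re.sub(r"\([^()]*\)", "", word)
    # Intended form: keep the material, drop only the parentheses.
    intended = word.replace("(", "").replace(")", "")
    return spoken, intended
```

Here `expand_incomplete("(be)cause")` gives `("cause", "because")`; the second element is the complete form that the paragraph above says is used for analysis.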

Compound Words and Clitics

Two types of word concatenation are supported:

+ joins compound words, e.g., bird+house
~ joins clitics to their host word, e.g., is~n't

Language Specification

Words from a secondary language are marked with @s followed by a language code, e.g., istenem@s:hu for a Hungarian word in an English transcript. Language codes follow the ISO 639 standard.

Utterance Terminators

Every utterance must end with a terminator. The three basic terminators are the period, question mark, and exclamation point. Special terminators begin with + and end with a basic terminator.


Terminator	Name	Description
`.`	period	end of a declarative utterance
`?`	question	end of a question
`!`	exclamation	end of an imperative or emphatic utterance
`+.`	broken for coding	utterance broken at a phrasal boundary to mark overlap
`+...`	trail off	incomplete utterance where speaker trails off
`+..?`	trail off question	question that trails off
`+!?`	question with exclamation	question spoken with amazement
`+/.`	interruption	utterance interrupted by another speaker
`+/?`	interruption question	question interrupted by another speaker
`+//.`	self interruption	speaker breaks off and starts a new utterance
`+//?`	self interruption question	question where speaker self-interrupts
`+"/.`	quotation next line	quoted material follows on next line
`+".`	quotation precedes	quoted material preceded this utterance
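To make the terminator table concrete, the sketch below looks up the final token of an utterance. It assumes the terminator is the last space-delimited token and does not handle postcodes that follow a terminator; `TERMINATORS` and `find_terminator` are illustrative names, not part of Phon.

```python
# Terminators and names as listed in the table above.
TERMINATORS = {
    ".": "period",
    "?": "question",
    "!": "exclamation",
    "+.": "broken for coding",
    "+...": "trail off",
    "+..?": "trail off question",
    "+!?": "question with exclamation",
    "+/.": "interruption",
    "+/?": "interruption question",
    "+//.": "self interruption",
    "+//?": "self interruption question",
    '+"/.': "quotation next line",
    '+".': "quotation precedes",
}

def find_terminator(utterance):
    """Return (name, body); name is None when no terminator is found."""
    tokens = utterance.split()
    if tokens and tokens[-1] in TERMINATORS:
        return TERMINATORS[tokens[-1]], " ".join(tokens[:-1])
    return None, utterance
```

A check like this could be used to flag utterances that lack a terminator before further processing.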

Utterance Linkers

Linkers appear at the beginning of an utterance to indicate how it connects to a preceding utterance. All linkers begin with +.


Linker	Name	Description
`+"`	quoted utterance	marks an utterance being directly quoted
`+^`	quick uptake	utterance follows immediately after previous speaker
`+<`	lazy overlap	utterance overlaps previous utterance (without specifying extent)
`+,`	self completion	completion of own utterance after interruption
`++`	other completion	completion of another speaker's utterance

Pauses

Unfilled pauses between words are coded with parentheses containing periods. Pauses are included in word alignment and should be aligned with the corresponding pause in IPA Target and IPA Actual tiers.


Notation	Length	Example
`(.)`	simple (short pause)	`I don't (.) know .`
`(..)`	long pause	`I don't (..) know .`
`(...)`	very long pause	`(...) what do you think ?`
`(1.5)`	numeric (exact seconds)	`I don't (1.5) know .`

Numeric pauses may include minutes using a colon, e.g., (1:05.15) for one minute and 5.15 seconds.
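The pause notations can be converted to a category and, for numeric pauses, a duration in seconds. This is a minimal sketch; the regular expression is an assumed reading of the notation described above, and `parse_pause` is a hypothetical helper.

```python
import re

SYMBOLIC_PAUSES = {"(.)": "simple", "(..)": "long", "(...)": "very long"}

# Assumed shape: optional "minutes:" followed by seconds with optional fraction.
NUMERIC_PAUSE = re.compile(r"^\((?:(\d+):)?(\d+(?:\.\d+)?)\)$")

def parse_pause(token):
    """Return (kind, seconds) for a pause token, or None if it is not a pause."""
    if token in SYMBOLIC_PAUSES:
        return SYMBOLIC_PAUSES[token], None
    m = NUMERIC_PAUSE.match(token)
    if m is None:
        return None
    minutes = int(m.group(1)) if m.group(1) else 0
    return "numeric", minutes * 60 + float(m.group(2))
```

Under these assumptions, `(1:05.15)` is read as one minute plus 5.15 seconds, and symbolic pauses carry no numeric duration.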

Events

Events describe actions, sounds, or other non-word occurrences that happen at a specific point within an utterance.

Simple Events (Happenings)

Simple events use the &= prefix to mark actions and sounds such as coughs, laughs, and sneezes: &=laughs, &=coughs, &=sneezes.

Events may include an object after a colon: &=imit:motor, &=points:car.
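A simple event token can be split into its action and optional object as follows. This is only a sketch of the notation described above; `parse_event` is a hypothetical name.

```python
def parse_event(token):
    """Return (action, obj) for a token like "&=points:car", else None."""
    if not token.startswith("&=") or len(token) == 2:
        return None
    # Everything before the first colon is the action; the rest is the object.
    action, _, obj = token[2:].partition(":")
    return action, (obj or None)
```

For `&=laughs` the object is `None`; for `&=imit:motor` the action is `imit` and the object is `motor`.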

Interposed Words

An interposed word from another speaker is marked with &* followed by the speaker's three-letter ID, a colon, and the word:

when I was at my friend's house &*MOT:mhm the dog tried to lick me .

Long Events

Events spanning multiple words use begin/end markers:

Vocal: &{l=laughs ... &}l=laughs
Nonvocal: &{n=waving:hands ... &}n=waving:hands
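One way to recover the words covered by a long event is to pair each begin marker with the end marker carrying the same label. The sketch below assumes space-delimited tokens and non-repeating labels within an utterance; `long_event_spans` is a hypothetical helper, not Phon's implementation.

```python
def long_event_spans(tokens):
    """Return (label, first, last) triples, where first..last are the
    token indices enclosed between matching begin/end markers."""
    open_events = {}  # label -> index of its begin marker
    spans = []
    for i, tok in enumerate(tokens):
        if tok.startswith("&{"):
            open_events[tok[2:]] = i
        elif tok.startswith("&}"):
            start = open_events.pop(tok[2:], None)
            if start is not None:  # ignore end markers without a begin
                spans.append((tok[2:], start + 1, i - 1))
    return spans
```

Applied to the tokens of `&{l=laughs that was funny &}l=laughs .`, this reports that the event `l=laughs` covers tokens 1 through 3.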

Scoped Annotations

Scoped annotations are enclosed in square brackets and refer to stretches of speech. When material enclosed in angle brackets precedes a scoped annotation, the annotation applies to all of the enclosed material. Without angle brackets, the annotation applies only to the single preceding word.

Example: <I wanted> [/] I wanted cereal .
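The scoping rule can be sketched in code: an annotation applies to the preceding angle-bracket group if there is one, and otherwise to the preceding word. The tokenizer pattern below is a simplifying assumption (no nested brackets), and `annotation_scopes` is a hypothetical name.

```python
import re

# Assumed tokens: an angle-bracket group, a square-bracket annotation,
# or any other run of non-space characters.
TOKEN = re.compile(r"<[^<>]*>|\[[^\[\]]*\]|\S+")

def annotation_scopes(utterance):
    """Return (annotation, scoped_words) pairs in order of appearance."""
    results = []
    previous = None  # last non-annotation token seen
    for tok in TOKEN.findall(utterance):
        if tok.startswith("[") and tok.endswith("]"):
            if previous is None:
                scope = []
            elif previous.startswith("<") and previous.endswith(">"):
                scope = previous[1:-1].split()
            else:
                scope = [previous]
            results.append((tok, scope))
        else:
            previous = tok
    return results
```

For the example utterance, the `[/]` annotation is paired with both words of the group, `I` and `wanted`. Consecutive annotations share the same scope, since only non-annotation tokens update the preceding material.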

Retracing and Repetition Markers


Marker	Name	Description
`[/]`	retracing (repetition)	speaker repeats preceding material without change
`[//]`	retracing with correction	speaker repeats and corrects preceding material
`[///]`	retracing reformulation	complete reformulation of the message
`[/?]`	retracing unclear	type of retracing is uncertain
`[/-]`	false start	speaker abandons utterance and starts a new one

Stressing, Guessing, and Exclusion


Marker	Description
`[!]`	stressing of preceding word(s)
`[!!]`	contrastive stressing
`[?]`	best guess (transcription uncertain)
`[e]`	excluded material (omitted from analysis)

Group Annotations

Group annotations provide explanatory text within square brackets:


Notation	Name	Example
`[= text]`	explanation (target word)	`etymologist [= entomologist]`
`[=! text]`	paralinguistics	`that's mine [=! cries] .`
`[=? text]`	alternative transcription	`one or two [=? one too]`
`[% text]`	inline comment	`wouldn't [% said with emphasis] do that`

Replacements

A replacement substitutes a standard form for a nonstandard form on the preceding word:

[: text] — replacement (e.g., whyncha [: why don't you])
[:: text] — real-word replacement

Error Marking

Errors are marked by placing [*] after the error, optionally with an error code as [* text], e.g., goed [: went] [*] .

Duration

Duration of a preceding event or word can be marked as [# M:S.ms], e.g., [# 1:05.3].

Overlap Markers

Overlap markers indicate simultaneous speech between speakers:

[>] — overlap follows (material overlaps with next speaker)
[<] — overlap precedes (material overlaps with previous speaker)

When multiple overlaps occur in a single utterance, they are numbered: [>1], [<1], [>2], [<2].

Postcodes and Freecodes

[+ code] — postcode: utterance-level code placed after the terminator
[^ code] — freecode: marks a local event at the point of insertion

Separators and Tag Markers

Separators provide conventional punctuation within utterances. Unlike terminators, separators do not end the utterance. Tag markers indicate pragmatic function.


Symbol	Description
`,`	comma (pause, syntactic juncture)
`;`	semicolon
`[^c]`	clause delimiter
`‡` (double dagger)	vocative marker
`„` (double low quote)	tag marker

Quotation

Short quoted stretches within an utterance are enclosed in curly (typographic) quotation marks: “ (begin, U+201C) and ” (end, U+201D). For longer quoted material spanning multiple utterances, use the quotation terminators and linkers (+"/. and +").

Tone Direction Markers

Tone markers indicate intonation contour at the end of or within utterances:


Symbol	Direction
`⇗` (U+21D7)	rising to high
`↗` (U+2197)	rising to mid
`→` (U+2192)	level
`↘` (U+2198)	falling to mid
`⇘` (U+21D8)	falling to low

Prosody Within Words

Prosodic features can be marked within words:


Symbol	Feature	Example
`:`	drawl (lengthened syllable)	`bana:nas`
`^`	pause between syllables	`rhi^noceros`
`ˈ` (U+02C8)	primary stress	`baˈna:nas`
`ˌ` (U+02CC)	secondary stress	`ˌbaˈna:nas`
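When a plain word form is needed (for example, for a dictionary lookup), the prosodic symbols in the table above can be removed. This is a minimal sketch; it assumes the symbols never occur as ordinary characters in the word, and `strip_prosody` is a hypothetical helper.

```python
# Map each prosodic symbol from the table above to None (delete it).
PROSODY_MARKS = {
    ord(":"): None,   # drawl
    ord("^"): None,   # pause between syllables
    0x02C8: None,     # primary stress
    0x02CC: None,     # secondary stress
}

def strip_prosody(word):
    """Return the word with within-word prosody symbols removed."""
    return word.translate(PROSODY_MARKS)
```

For instance, `bana:nas` and `rhi^noceros` reduce to `bananas` and `rhinoceros`.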

Phonetic Groups

When multiple orthographic words map to a single phonetic transcription, they can be grouped using angle quotation marks: ‹ (U+2039) and › (U+203A). For example, ‹going to› represents two orthographic words with a single phonetic form.
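The effect of phonetic groups on alignment can be illustrated by splitting an utterance into units, where each unit is either a single word or the words of one group. This sketch assumes groups are not nested; `alignment_units` is a hypothetical name and does not reflect Phon's internal data model.

```python
import re

# U+2039 and U+203A delimit a phonetic group; anything else is a plain token.
TOKEN = re.compile("\u2039[^\u2039\u203a]*\u203a|\\S+")

def alignment_units(orthography):
    """Return a list of units; each unit is a list of orthographic words."""
    units = []
    for tok in TOKEN.findall(orthography):
        if tok.startswith("\u2039"):
            units.append(tok[1:-1].split())  # grouped words form one unit
        else:
            units.append([tok])
    return units
```

In this reading, a group such as the `going to` example forms one two-word unit, while every other token forms a unit of its own.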

Conversation Analysis (CA) Coding

Phon supports an extensive set of Conversation Analysis symbols for detailed transcription of speech features. These are specialized symbols used primarily in CA research.

CA Elements Within Words

The following Unicode symbols can appear within words to mark articulatory features:


Symbol	Unicode	Feature
`≠`	U+2260	blocked segments
`∾`	U+223E	constriction
`∙`	U+2219	inhalation
`↓`	U+2193	pitch down
`↻`	U+21BB	pitch reset
`↑`	U+2191	pitch up
`⁑`	U+2051	hardening
`⤇`	U+2907	hurried start
`⤆`	U+2906	sudden stop

CA Scope Delimiters

These symbols mark the beginning of a stretch of speech with a particular quality. The same symbol marks the end of the stretch:


Symbol	Unicode	Quality
`⁎`	U+204E	creaky voice
`∆`	U+2206	faster
`◉`	U+25C9	louder
`°`	U+00B0	softer
`∇`	U+2207	slower
`∬`	U+222C	whisper
`☺`	U+263A	smile voice
`∮`	U+222E	singing
`§`	U+00A7	precise
`↫`	U+21AB	repeated segment
`⁇`	U+2047	unsure
`♋`	U+264B	breathy voice
`▔`	U+2594	high pitch
`▁`	U+2581	low pitch
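Because the same symbol opens and closes a stretch, spans can be found by toggling on each occurrence. The sketch below lists only a few symbols from the table for brevity; `CA_QUALITIES` and `quality_spans` are hypothetical names.

```python
# A subset of the delimiters in the table above, for illustration.
CA_QUALITIES = {
    "\u204e": "creaky voice",
    "\u2206": "faster",
    "\u25c9": "louder",
    "\u00b0": "softer",
    "\u2207": "slower",
}

def quality_spans(text):
    """Return (quality, enclosed_text) pairs, in the order the spans close."""
    open_at = {}  # symbol -> index where its stretch began
    spans = []
    for i, ch in enumerate(text):
        if ch not in CA_QUALITIES:
            continue
        if ch in open_at:
            start = open_at.pop(ch)
            spans.append((CA_QUALITIES[ch], text[start + 1:i]))
        else:
            open_at[ch] = i
    return spans
```

A stretch wrapped in two degree signs is reported as `softer`; an unmatched opening symbol produces no span in this sketch.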

CA Overlap Points

Precise overlap onset and offset can be marked within words using bracket-like symbols:


Symbol	Unicode	Position
`⌈`	U+2308	top start (first speaker overlap begins)
`⌉`	U+2309	top end (first speaker overlap ends)
`⌊`	U+230A	bottom start (second speaker overlap begins)
`⌋`	U+230B	bottom end (second speaker overlap ends)