as the name implies, the detect_syllables script attempts to detect and count the syllables in English words. it is written in bash but uses an insane amount of embedded sed code to massage the words so the count comes out right. the only reasonable way to count syllables (afaik) is by counting vowels, but a syllable isn't always a single vowel neatly encased in consonants. no, instead we have shit like "instead", which is 2 syllables yet 3 vowels. so the script catches that and changes it to "instid", so the vowel count and the syllable count are equal and we get a correct result. then, just when you think you can do that for every case of the "ea" combination, you suddenly realise that the word "realise" has "ea" too, but there the "ea" is 2 syllables (re-a). fuck. and so begins the epic saga of writing sed rules.
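to make the idea concrete, here's a minimal sketch of the bare principle. this is not the actual script: the function name count_syllables_naive is made up for this example, and it only knows the one "instead" exception mentioned above.

```bash
#!/bin/bash
# minimal sketch of the vowel-counting idea, NOT the real detect_syllables.
# count_syllables_naive is a made-up name for this example.
count_syllables_naive() {
    local word
    # lowercase the word
    word=$(tr '[:upper:]' '[:lower:]' <<< "$1")
    # one hand-written exception: rewrite "instead" so its "ea" is one vowel
    word=$(sed 's/instead/instid/' <<< "$word")
    # delete everything that is not a vowel; what remains gets counted
    word=$(tr -cd 'aeiou' <<< "$word")
    echo "${#word}"
}

count_syllables_naive instead   # prints 2
count_syllables_naive banana    # prints 3
count_syllables_naive realise   # prints 4 (wrong: the silent final e is counted)
```

the last call shows why a single exception gets you nowhere: every spelling quirk needs its own rule.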
i soon realised the number of rules needed to do this was staggering and i had no idea what i was doing. so i Googled and found out some dude wrote a paper on this exact problem. i tried to read it but it was too complex, and it was a bit of a shock that this was apparently difficult enough for somebody to write a paper on. but i continued valiantly, arrogant as i am, thinking i might be able to solve it. then i came across a word... a word that shattered my world. a filthy homograph. two meanings to the same spelling, each with its own pronunciation and its own syllable count. it was such a sneaky bastard that even now the exact word escapes me. at this point, i gave up and cursed the universe.
after about three months i had recovered from my depression and just lived my life as if nothing had ever happened, the damned script still sitting in my collection unfinished. until a couple of nights ago, when i looked at it again out of sheer boredom. as i read through the code, refamiliarizing myself with this old enemy, i suddenly had a brainwave that was quite interesting. however, i found it hard to focus on it and it slipped away. my friend suggested going to bed, and as she walked by the computer i, for some reason, decided to attempt to explain this weird script to her. to my amazement she seemed to understand well enough what it did and everything (as opposed to some other people i know, who, as soon as i use the words "script" and "program" in a sentence, completely zone out and decide they will never understand this). when i was finished explaining, she said "so you could use it to write poetry?" - exactly. i was happy for a moment, but then had to explain to her that it didn't work. in her innocence she then asked me "why is it in English?", and i didn't really know the answer to that. i just happened to apply it to the English language. "maybe you should try it on Dutch", she said. and that's what i did.
it turns out Dutch is much simpler than English when it comes to counting syllables with a script. after only a handful of rules (97 lines vs 314 in the English version) it already works without much error. i figured i'd post it here for posterity, for the Dutch-speakers who might like to try this, and for anyone interested in perhaps adapting this to their own language (most of the comments are in English, only a handful in Dutch, plus you will have to change most shit anyway).
here it is.
#!/bin/bash
# script to detect syllables in a word
# count the vowels in the word.
# subtract any silent vowels (like the silent e at the end of a word, or the second vowel when two vowels are together in a syllable)
# subtract one vowel from every diphthong (diphthongs only count as one vowel sound)
# the number of vowel sounds left is the same as the number of syllables.
# usage: detect_syllables [word] ([word] [word] ...)
#
# not factoring in abbreviations or special characters
# exit when no argument is given
if [ $# -lt 1 ]; then
echo "$(basename "$0"): no argument given." >&2
exit 1
fi
# continue if there is an argument
full_count=""
declare -a syll_arr=()
for arg in "$@"; do
# convert to lowercase (character classes quoted so the shell cannot glob them)
word="$( tr '[:upper:]' '[:lower:]' <<< "$arg" )"
# cleanup of the word for syllable-matching
# '!' special character, translates into a single vowel.
clean=$(
sed '
####### --- PREREQUISITES
####### stuff which needs to be done beforehand.
s/\(.*\)/>\1-/
# A1. append a > at the start, and a - at the end of every word,
# to denote its beginning and ending
s/^>..-$/>!-/
# A2. 2-letter words are 1 vowel
####### --- DIPHTHONGS / TWEEKLANKEN
####### !!! mind the pattern of substitution:
####### 1st strong/common, 2nd weak/uncommon diphthongs
####### every word ending in vowel + i: make the i a j
s/\([aeoui]\)i-/\1j-/g
####### (3-letter diphthongs):
####### ----------------------
# handle these before the j-rule
s/eui\|ioe/!!/g
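# fuck de i. hier doet ie weer raar met de e. eei & iee: 2 klinkers
# (en: fuck the i. here it acts weird with the e again. eei & iee: 2 vowels)
# script zal geen rekening houden met fucking tremas
# (en: the script will not take fucking diaereses into account)
# maar essentieel enz fokken t dus eerst...
# (en: but essentieel etc. fuck it up, so those go first...)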
s/\([aeoui]l\)ien/\1!!n/g
s/\([sct]\)ieel/\1!l/g
s/\([bdfghklmnpqrsvwxz]\)ieel/\1!!l/g
s/eei/!!/g
s/iee/!!/g
# de letter i binnen een 3-klinker-groep is stiekem een j
# (en: the letter i inside a 3-vowel group is secretly a j)
s/\([aeoui][aeoui]\)i/\1j/g
s/\([aeoui]\)i\([aeoui]\)/\1j\2/g
s/i\([aeoui][aeoui]\)/j\1/g
# eeu: 1 klinker
# (en: eeu: 1 vowel)
s/eeu/!/g
####### (2-letter diphthongs):
####### ----------------------
# ij = y
s/ij/y/g
# y + vowel = j; y + consonant = ij
s/y\([aeoui]\)/j\1/g
s/y\([bcdfghklmnpqrstvwxz]\)/!\1/g
s/\([bcdfghklmnpqrstvwxz]\)y/\1!/g
# common diphthongs
s/aa\|ae\|ao\|au\|ei\|ee\|eu\|ie\|oe\|oi\|oo\|ou\|ui\|uu\|io/!/g
####### --- SPECIALS
s/!/o/g
# X1. translate special char ! into o
#s/-//g
# X2. translate special char - into nothing.
#s/>//g
# X3. translate special char > into nothing.
s/[aeui]/o/g
# MEES. translate all vowels into o
' <<< "$word"
)
# DEBUG:
#echo -n "$clean, "
# count the vowels/syllables
syll_count=$(grep -io '[aeiou]' <<< "$clean" | wc -l)
# put them in an array
syll_arr+=("$syll_count")
done
echo "${syll_arr[@]}"
exit 0
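since the script prints one count per word, getting the syllables of a whole line (which is what you'd want for the poetry idea) is just a matter of adding them up. below is a rough sketch of that, nothing more. count_word and line_total are names i made up for this example, and count_word is only a stripped-down stand-in that knows the common diphthongs rule, so it can run on its own; in practice you'd call the real script instead. it assumes GNU sed, because of the \| alternation.

```bash
#!/bin/bash
# sketch only: sum per-word counts to get the syllables in a whole line.
# count_word is a made-up, stripped-down stand-in for detect_syllables
# (it only knows the common diphthongs); swap in the real script if you like.
count_word() {
    local clean
    # lowercase, collapse each common diphthong into one "!" placeholder,
    # then keep only placeholders and remaining single vowels
    clean=$(tr '[:upper:]' '[:lower:]' <<< "$1" |
        sed 's/aa\|ae\|ao\|au\|ei\|ee\|eu\|ie\|oe\|oi\|oo\|ou\|ui\|uu\|io/!/g' |
        tr -cd '!aeiou')
    echo "${#clean}"
}

# line_total (also made up) adds up the counts of all words passed to it
line_total() {
    local total=0 w
    for w in "$@"; do
        total=$(( total + $(count_word "$w") ))
    done
    echo "$total"
}

line_total de fiets staat buiten   # prints 5
```

for these four words the full script above prints "1 1 1 2", which adds up to the same 5.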