[script] detect syllables

Unread post by **rhowaldt** » Thu Feb 12, 2015 11:00 pm

so i was tired and stoned and my friend who was over was spacing out on her emotions so i decided i'd stare at a script again for a change. i wanted something entertaining so i grabbed my old detect_syllables script. probably only a handful of you know about this, and i enjoy typing so i'll explain.

as the name implies, the detect_syllables script attempts to detect and count the syllables in English words. it is written in bash but uses an insane amount of embedded sed code to process the words in order to make the correct count. the only reasonable way you can count syllables (afaik) is by counting vowels, but a syllable isn't always a single vowel neatly encased in consonants. no, instead we have shit like "instead", which is 2 syllables yet 3 vowels. so, the script catches that and changes it to "instid" so the vowel count and the syllable count are equal, and thus we have obtained a correct result. then just when you thought you'd do that for every case of the "ea" combination, you suddenly realise that in the word "realise" the "ea" is 2 syllables. fuck. and so begins the epic saga of writing sed rules.
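
A minimal sketch of the idea described above. It is not the actual English script (which isn't shown in this post): `count_vowels` and `collapse_ea` are hypothetical helper names, and the single "ea" rule is only an illustration.

```bash
# naive approach: syllable count = number of vowel letters (expects lowercase input)
count_vowels() {
    local only_vowels="${1//[^aeiou]/}"   # strip everything that is not a vowel
    echo "${#only_vowels}"
}

# hypothetical rewrite rule, for illustration only: treat every "ea" as one vowel
collapse_ea() {
    sed 's/ea/i/g' <<< "$1"
}

count_vowels "instead"                    # prints 3, one more than the 2 syllables
collapse_ea "instead"                     # prints instid
count_vowels "$(collapse_ea "instead")"   # prints 2
collapse_ea "realise"                     # prints rilise
```

The last call shows why a blanket rule backfires: in "realise" the "e" and the "a" are pronounced separately, so merging them is wrong there and the rule needs an exception.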

i soon realised the amount of rules needed to do this was staggering and i had no idea what i was doing. so i Googled and found out some dude wrote a paper on this exact problem. i tried to read it but it was too complex, and i found it a bit of a shock to think that this was apparently a difficult enough thing for somebody to write a paper on. but, i continued valiantly, arrogant as i am, thinking i might be able to solve this. then i came across a word... a word that shattered my world. a filthy heteronym: one spelling, two meanings, and each meaning came with its own pronunciation and its own syllable count. it was such a sneaky bastard that even now the exact word escapes me. at this point, i gave up and cursed the universe.

after about three months i had recovered from my depression and just lived my life as if nothing ever happened, the damned script still sitting in my collection unfinished. until that moment, a couple of nights ago, when i looked at it again out of sheer boredom. as i read through the code, refamiliarizing myself with this old enemy, i suddenly had a brainwave that was quite interesting. however, i found it hard to focus on it and it slipped away. my friend suggested going to bed, and as she walked by the computer i, for some reason, decided to attempt to explain this weird script to her. to my amazement she seemed to understand well enough what it did and everything (as opposed to some other people i know, who, as soon as i use the words "script" and "program" in a sentence, completely zone out and decide they will never understand this). when i was finished explaining, she said "so you could use it to write poetry?" - exactly. i was happy for a moment, but then had to explain to her that it didn't work. in her innocence she then asked me "why is it in English?", and i didn't really know the answer to that. i just happened to apply it to the English language. "maybe you should try it on Dutch", she said. and that's what i did.

it turns out the Dutch language is much simpler than the English language when it comes to the counting of syllables through a script. it seems after only a handful of rules (97 lines vs 314 in the English version) it is already functioning without much error. i figured i'd post it here for posterity, for the Dutch-speakers who might like to try this, and for anyone interested in perhaps adapting this to their own language (most of the comments are in English, only a handful in Dutch, plus you will have to change most shit anyway).

here it is.

Code:

#!/bin/bash
# script to detect syllables in a word
# count the vowels in the word.
# subtract any silent vowels, (like the silent e at the end of a word, or the second vowel when two vowels are together in a syllable)
# subtract one vowel from every diphthong (diphthongs only count as one vowel sound.)
# the number of vowels sounds left is the same as the number of syllables.
# usage: detect_syllables [word] ([word] [word] ...)
#
# not factoring in abbreviations or special characters

# exit when no argument is given
if [ $# -lt 1 ]; then
    echo "$(basename "$0"): no argument given." >&2
    exit 1
fi
# continue if there is an argument
full_count=""
declare -a syll_arr=()
for arg in "$@"; do
# convert to lowercase
word="$( tr '[:upper:]' '[:lower:]' <<< "$arg" )"

# cleanup of the word for syllable-matching
# '!' special character, translates into a single vowel.
clean=$(
   sed '
####### --- PREREQUISITES
####### stuff which needs to be done beforehand.
      s/\(.*\)/>\1-/
      # A1. append a > at the start, and a - at the end of every word,
      # to denote its beginning and ending
      s/^>..-$/>!-/
      # A2. 2-letter words are 1 vowel
      # (the word starts with > and ends with -, see A1)
         
####### --- DIPHTHONGS / TWEEKLANKEN
####### !!! mind the pattern of substitution:
####### 1st strong/common, 2nd weak/uncommon diphthongs

####### every word ending in vowel + i: make the i a j
        s/\([aeoui]\)i-/\1j-/g

####### (3-letter diphthongs):
####### ----------------------
      # handle these before the j-rule
      s/eui\|ioe/!!/g
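      # fuck de i. hier doet ie weer raar met de e. eei & iee: 2 klinkers
      # script zal geen rekening houden met fucking tremas
      # maar essentieel enz fokken t dus eerst...
      # EN: fuck the i. here it acts weird with the e again. eei & iee: 2 vowels.
      # the script will not take fucking diaereses (tremas) into account,
      # but essentieel etc. mess that up, so handle those first...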
      s/\([aeoui]l\)ien/\1!!n/g
      s/\([sct]\)ieel/\1!l/g
      s/\([bdfghklmnpqrsvwxz]\)ieel/\1!!l/g
      s/eei/!!/g
      s/iee/!!/g
      # de letter i binnen een 3-klinker-groep is stiekem een j
      # EN: the letter i inside a 3-vowel group is secretly a j
      s/\([aeoui][aeoui]\)i/\1j/g
      s/\([aeoui]\)i\([aeoui]\)/\1j\2/g
      s/i\([aeoui][aeoui]\)/j\1/g
      # eeu: 1 klinker (EN: eeu counts as 1 vowel)
      s/eeu/!/g
      
####### (2-letter diphthongs):
####### ----------------------
      # ij = y
      s/ij/y/g
      # y + vowel = j; y + consonant = ij
      s/y\([aeoui]\)/j\1/g
      s/y\([bcdfghklmnpqrstvwxz]\)/!\1/g
      s/\([bcdfghklmnpqrstvwxz]\)y/\1!/g
      # common diphthongs
      s/aa\|ae\|ao\|au\|ei\|ee\|eu\|ie\|oe\|oi\|oo\|ou\|ui\|uu\|io/!/g

      
####### --- SPECIALS         
      s/!/o/g
      # X1. translate special char ! into o
      #s/-//g
      # X2. translate special char - into nothing.
      #s/>//g
      # X3. translate special char > into nothing.
      s/[aeui]/o/g
      # MEES. translate all vowels into o
   ' <<< "$word"
)

# DEBUG:
 #echo -n "$clean, "

# count the vowels/syllables
syll_count=$(grep -o '[aeiou]' <<< "$clean" | wc -l)
# put them in an array
syll_arr+=("$syll_count")

done

echo "${syll_arr[@]}"
exit 0
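
The way the `!` placeholder works can be traced with a cut-down version of the rules. This is only a sketch: `trace_word` is a made-up helper, it copies just the ij/y rules and the common-diphthong rule from the script above and skips everything else, so it will be wrong for words that need the other rules. The `\|` alternation assumes GNU sed.

```bash
# trace_word is a hypothetical helper, not part of detect_syllables
trace_word() {
    local marked count
    # subset of the rules above: every ! stands for one vowel sound
    marked=$(sed '
        s/ij/y/g
        s/y\([aeoui]\)/j\1/g
        s/y\([bcdfghklmnpqrstvwxz]\)/!\1/g
        s/\([bcdfghklmnpqrstvwxz]\)y/\1!/g
        s/aa\|ae\|ao\|au\|ei\|ee\|eu\|ie\|oe\|oi\|oo\|ou\|ui\|uu\|io/!/g
    ' <<< "$1")
    # count the remaining single vowels plus the ! placeholders
    count=$(grep -o '[aeiou!]' <<< "$marked" | wc -l)
    echo "$1 -> $marked ($((count)))"
}

trace_word godgeleerdheid   # prints: godgeleerdheid -> godgel!rdh!d (4)
trace_word tijd             # prints: tijd -> t!d (1)
```

The first result agrees with the count of 4 that the full script gives for "godgeleerdheid" further down the thread.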

my next step will now be to build a script that will make poetry (rhythm based on syllable counts, perhaps i'll start with haiku to keep it simple ;). i got this huge fucking Dutch wordlist, 200,000 words. it had some question-mark characters in it; took Geany about 10 minutes to find-and-replace them :)
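
For the haiku idea, checking the 5-7-5 pattern comes down to summing the counts per line. A rough sketch, assuming each line of the poem has already been run through detect_syllables and produced a space-separated list of counts (which is what the script prints). `sum_counts` and `is_haiku` are made-up names, and the numbers below are sample input, not real output.

```bash
# add up a space-separated list of syllable counts, e.g. "2 1 2" -> 5
sum_counts() {
    local total=0 n
    for n in $1; do            # unquoted on purpose: split the list on spaces
        total=$((total + n))
    done
    echo "$total"
}

# succeed only if the three lines have 5, 7 and 5 syllables
is_haiku() {
    [ "$(sum_counts "$1")" -eq 5 ] &&
    [ "$(sum_counts "$2")" -eq 7 ] &&
    [ "$(sum_counts "$3")" -eq 5 ]
}

if is_haiku "2 1 2" "3 2 2" "1 4"; then
    echo "haiku"               # this branch runs: 5, 7 and 5
else
    echo "no haiku"
fi
```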

Unread post by **wuxmedia** » Fri Feb 13, 2015 12:02 am

* writes script using sed to interpret and translate syllables
* uses geany to search and replace... 0-O

Thanks Rhow, English language is way too fucked to be governed by rules.

Unread post by **Dr_Chroot** » Fri Feb 13, 2015 12:24 am

Rho, this is incredible. Well done!

wuxmedia wrote:
* writes script using sed to interpret and translate syllables
* uses geany to search and replace... 0-O

That went through my mind too ;D M-% would have solved the woes in a single elegant swoop, but alas it was not to be. Amazing all the same... I'm not nearly intelligent enough to begin to come up with things like this organically without first seeing someone do something remotely similar!

Code:

 ~/git  >>  ./rhoscript godgeleerdheid
4

Unread post by **rhowaldt** » Fri Feb 13, 2015 9:29 am

@wux: i used Geany for two reasons. 1. i am a fucking noob and finding out how to do it with non-GUI tools would've taken me more time. 2. it was an unknown character, and i find they often give me trouble on the commandline, because how the fuck do i reference it? it will just show up as some block or whatever, i think. so i copy-pasted it from the file inside Geany so i knew i was search-and-replacing the correct thing.
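
One way to deal with an unknown character on the command line is to address it by byte value rather than trying to type it. A minimal sketch, assuming the odd character is a single non-ASCII byte. The sample word and the `replace_odd_bytes` name are made up, and the real wordlist may well need a proper encoding conversion instead.

```bash
# hypothetical sample: "caf" followed by the single byte 0xE9 (octal 351)
sample=$(printf 'caf\351')

# dump the bytes so the stray character can be identified by its value
printf '%s' "$sample" | od -An -c

# replace_odd_bytes is a made-up helper: every byte that is not a tab, newline,
# carriage return or printable ASCII character becomes a ?
replace_odd_bytes() {
    LC_ALL=C tr -c '\11\12\15\40-\176' '?'
}

printf '%s\n' "$sample" | replace_odd_bytes   # prints caf?
```

Note that `tr` works per byte, so a multi-byte UTF-8 character would turn into several question marks.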

@Dr: trust me, this shit isn't that difficult. it is just a process of elimination, basically. all it takes is some brainpower and some time. for example, i understand fuck all of maths, and most of the shit you guys post is way over my head. i just have my niches where i happened to run into a problem to solve, then did it.

thanks for the kind words :)

Unread post by **ivanovnegro** » Fri Feb 13, 2015 1:20 pm

Epic.

Btw, the blocks show up because you do not use Unicode, but I told you that already somewhere. :)

Unread post by **rhowaldt** » Fri Feb 13, 2015 1:49 pm

^ ja, you're right. another thing i still have to fix :)

Unread post by **GekkoP** » Fri Feb 13, 2015 5:54 pm

Wow.
Now do that for Chinese and you're the man. ;)

Unread post by **machinebacon** » Fri Feb 13, 2015 11:51 pm

moved to /usr/local/bin :)
