Czech Words Frequency List

May 27, 2020 2 minutes

I decided to learn the Czech language, and to procrastinate I made up a problem for myself:

I need an Anki deck with the most frequently occurring words in that language.

So my mind treated that problem as a blocker to further Czech lessons and forced me to spend time on it. First of all, I needed data, and a very convenient way to get a lot of Czech text was movie subtitles. Luckily, they can be downloaded as one big archive from here. That file contains lines of tokens that can be easily separated.

My first implementation was blunt and straightforward: Python + SQLite. Both are simple and I have some experience with them. But there was an obvious problem: it was slow. I had some ideas about optimization at first, but then decided to change the approach drastically.

The second implementation was in pure Rust, and in the end it was even easier. It turned out that there is no need for a database: all the collected data fits in memory without any problem.

use std::io::BufReader;
use std::io::BufRead;
use std::fs::File;

// I'm using an external crate to have an easy way to get the most common items.
use counter::Counter;

fn main() {
    let mut counter: Counter<String> = Counter::new();
    // Create a reader from a file
    let file = BufReader::new(File::open("cs.tok").expect("Cannot open file."));

    // Each line is split on whitespace
    for line in file.lines() {
        for tok in line.unwrap().split_whitespace(){
            // Skip a token if first character is not a letter.
            if tok.chars().next().unwrap().is_alphabetic() {
                counter[&tok.to_lowercase()] += 1
            }
        }
    }

    // Print first 5000 most common tokens
    for (tok, _) in counter.most_common_ordered().iter().take(5000) {
        println!("{}", tok);
    }
}
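
For comparison, here is a minimal sketch of the same counting idea using only the standard library, in case pulling in an external crate is not wanted. The function names `count_words` and `most_common` are made up for this illustration, and sorting ties alphabetically is my own assumption rather than something the program above does.

```rust
use std::collections::HashMap;
use std::io::{self, BufRead};

/// Count lowercased tokens whose first character is a letter.
/// (`count_words` is a hypothetical helper name, not part of any crate.)
fn count_words<R: BufRead>(reader: R) -> io::Result<HashMap<String, u64>> {
    let mut counts = HashMap::new();
    for line in reader.lines() {
        // Propagate read errors instead of panicking.
        for tok in line?.split_whitespace() {
            // Skip numbers, punctuation and other non-letter tokens.
            if tok.chars().next().map_or(false, char::is_alphabetic) {
                *counts.entry(tok.to_lowercase()).or_insert(0) += 1;
            }
        }
    }
    Ok(counts)
}

/// Return the `n` most frequent words, highest count first.
/// Ties are broken alphabetically (an assumption made here so that
/// the output order is deterministic).
fn most_common(counts: &HashMap<String, u64>, n: usize) -> Vec<(&str, u64)> {
    let mut items: Vec<(&str, u64)> = counts
        .iter()
        .map(|(word, count)| (word.as_str(), *count))
        .collect();
    items.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(b.0)));
    items.truncate(n);
    items
}
```

To use it on the subtitle file, wrap `File::open("cs.tok")` in a `BufReader`, pass it to `count_words`, and print the words returned by `most_common(&counts, 5000)`.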

So now I have a list of the most frequently used words in my hands. In movies, at least. The current list is available here. This method is applicable to any language for which you can get a decent amount of text.

The next step will be Anki deck generation.

