wordfreq: Open source and open data about word frequencies

Robyn Speer

September 1, 2015

Unfortunately, the images in this post have been lost to history. We blame WordPress, which we don’t use anymore. We recommend reading a more recent post anyway.

Often, in NLP, you need to answer the simple question: “is this a common word?” It turns out that this leaves the computer to answer a more vexing question: “What’s a word?”

Let’s talk briefly about why word frequencies are important. In many cases, you want to assign more significance to uncommon words. For example, a product review might contain the word “use” and the word “defective”, and the word “defective” carries way more information. If you’re wondering what the deal is with John Kasich, a headline that mentions “Kasich” will be much more likely to be what you’re looking for than one that merely mentions “John”.

For purposes like these, it would be nice if we could just import a Python package that could tell us whether one word was more common than another, in general, based on a wide variety of text. We looked for a while and couldn’t find it. So we built it.

wordfreq provides estimates of the frequencies of words in many languages, loading its data from efficiently-compressed data structures so it can give you word frequencies down to 1 occurrence per million without having to access an external database. It aims to avoid being limited to a particular domain or style of text, getting its data from a variety of sources: Google Books, Wikipedia, OpenSubtitles, Twitter, and the Leeds Internet Corpus.

The 10 most common words that wordfreq knows in 15 languages. Yes, it can handle multi-character words in Chinese and Japanese; those just aren’t in the top 10. A puzzle for Unicode geeks: guess where the start of the Arabic list is.

Partial solutions: stopwords and inverse document frequency

Those who are familiar with the basics of information retrieval probably have a couple of simple suggestions in mind for dealing with word frequencies.

One is to come up with a list of stopwords, words such as “the” and “of” that are too common to use for anything. Discarding stopwords can be a useful optimization, but it’s far too blunt an operation to solve the word frequency problem in general. There’s no place to draw the bright line between stopwords and non-stopwords, and in the “John Kasich” example, it’s not the case that “John” should be a stopword.

Another partial solution would be to collect all the documents you’re interested in, and re-scale all the words according to their inverse document frequency or IDF. This is a quantity that decreases as the proportion of documents a word appears in increases, reaching 0 for a word that appears in every document.
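To make that definition concrete, here is a minimal Python 3 sketch of IDF computed as the logarithm of the total number of documents divided by the number of documents containing the word. The function name and the three toy "reviews" are made up for illustration, and splitting on whitespace is a deliberate oversimplification.

```python
import math

def inverse_document_frequency(documents):
    """Map each word to log(N / number of documents containing it)."""
    n_docs = len(documents)
    doc_counts = {}
    for doc in documents:
        # Count each word once per document, however often it repeats.
        for word in set(doc.lower().split()):
            doc_counts[word] = doc_counts.get(word, 0) + 1
    return {word: math.log(n_docs / count)
            for word, count in doc_counts.items()}

# Hypothetical toy corpus, only for illustration.
reviews = [
    "the product is defective",
    "i use the product daily",
    "the cat likes the box",
]
idf = inverse_document_frequency(reviews)
print(idf["the"], idf["product"], idf["defective"])
```

In this toy corpus “the” appears in every document and scores 0, while “defective” appears in only one and scores highest.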

One problem with IDF is that it can’t distinguish a word that appears in a lot of documents because it’s unimportant, from a word that appears in a lot of documents because it’s very important to your domain. Another, more practical problem with IDF is that you can’t calculate it until you’ve seen all your documents, and it fluctuates a lot as you add documents. This is particularly an issue if your documents arrive in an endless stream.

We need good domain-general word frequencies, not just domain-specific word frequencies, because without the general ones, we can’t determine which domain-specific word frequencies are interesting.

Avoiding biases

The counts of one resource alone tend to tell you more about that resource than about the language. If you ask Wikipedia alone, you’ll find that “census”, “1945”, and “stub” are very common words. If you ask Google Books, you’ll find that “propranolol” is supposed to be 10 times more common than “lol” overall (and also that there’s something funny going on, so to speak, in the early 1800s).

If you collect data from Twitter, you’ll of course find out how common “lol” is. You also might find that the ram emoji “🐏” is supposed to be extremely common, because that guy from One Direction once tweeted “We are derby super 🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏”, and apparently every fan of One Direction who knows what Derby Super Rams are retweeted it.

Yes, wordfreq considers emoji to be words. Its Twitter frequencies would hardly be complete without them.

We can’t entirely avoid the biases that come from where we get our data. But if we collect data from enough different sources (not just larger sources), we can at least smooth out the biases by averaging them between the different sources.

What’s a word?

You have to agree with your wordlist on the matter of what constitutes a “word”, or else you’ll get weird results that aren’t supported by the actual data.

Do you split words at all spaces and punctuation? Which of the thousands of symbols in Unicode are punctuation? Is an apostrophe punctuation? Is it punctuation when it puts a word in single quotes? Is it punctuation in “can’t”, or in “l’esprit”? How many words is “U.S.” or “google.com”? How many words is “お早うございます” (“good morning”), taking into account that Japanese is written without spaces? The symbol “-” probably doesn’t count as a word, but does “+”? How about “☮” or “♥”?

The process of splitting text into words is called “tokenization”, and everyone’s got their own different way to do it, which is a bit of a problem for a word frequency list.

We tried a few ways to make a sufficiently simple tokenization function that we could use everywhere, across many languages. We ended up with our own ad-hoc rule, including large sets of Unicode characters and a special case for apostrophes. This is in fact what we used when we originally released wordfreq 1.0, which came packaged with regular expressions that look like attempts to depict the Flying Spaghetti Monster in text.

But shortly after that, I realized that the Unicode Consortium had already done something similar, and they’d probably thought about it for more than a few days.

Word splitting in Unicode. Not pictured: how to decide which of these segments count as “words”.

This standard for tokenization looked like almost exactly what we wanted, and the last thing holding me back was that implementing it efficiently in Python looked like it was going to be a huge pain. Then I found that the regex package (not the re package built into Python) contains an efficient implementation of this standard. Defining how to split text into words became a very simple regular expression… except in Chinese and Japanese, because a regular expression has no chance in a language where the separation between words is not written in any way.

So this is how wordfreq 1.1 identifies the words to count and the words to look up. Of course, there is going to be data that has been tokenized in a different way. When wordfreq gets something that looks like it should be multiple words, it will look them up separately and estimate their combined frequency, instead of just returning 0.
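This post does not say how the combined frequency is estimated. As one plausible illustration, the Python 3 sketch below takes the reciprocal of the sum of the reciprocals of the individual frequencies, so the phrase always comes out rarer than its rarest word. The function name, the formula, and the toy frequencies are all assumptions, not wordfreq’s documented behavior.

```python
def combined_frequency(tokens, freqs):
    """Estimate a phrase frequency from per-token frequencies.

    Assumption: 1 / (1/f1 + 1/f2 + ...). An unknown token makes the
    whole phrase unknown, so the estimate falls back to 0.
    """
    total_inverse = 0.0
    for token in tokens:
        freq = freqs.get(token, 0.0)
        if freq == 0.0:
            return 0.0
        total_inverse += 1.0 / freq
    return 1.0 / total_inverse

# Hypothetical frequencies, as proportions of all words.
toy_freqs = {"new": 0.001, "york": 0.0001}
estimate = combined_frequency(["new", "york"], toy_freqs)
print(estimate)
```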

Language support

wordfreq supports 15 commonly-used languages, but of course some languages are better supported than others. English is quite polished, for example, while Chinese so far is just there to be better than nothing.

The reliability of each language corresponds pretty well with the number of different data sources we put together to make the wordlist. Some sources are hard to get in certain languages. Perhaps unsurprisingly, for example, not much of Twitter is in Chinese. Perhaps more surprisingly, not much of it is in German either.

The word lists that we’ve built for wordfreq represent the languages where we have at least two sources. I would consider the ones with two sources a bit dubious, while all the languages that have three or more sources seem to have a reasonable ranking of words.

5 sources: English
4 sources: Arabic, French, German, Italian, Portuguese, Russian, Spanish
3 sources: Dutch, Indonesian, Japanese, Malay
2 sources: Chinese, Greek, Korean

Compact wordlists

When we were still figuring this all out, we made several 0.x versions of wordfreq that required an external SQLite database with all the word frequencies, because there are millions of possible words and we had to store a different floating-point frequency for each one. That’s a lot of data, and it would have been infeasible to include it all inside the Python package. (GitHub and PyPI don’t like huge files.) We ended up with a situation where installing wordfreq would either need to download a huge database file, or build that file from its source data, both of which would consume a lot of time and computing resources when you’re just trying to install a simple package.

As we tried different ways of shipping this data around to all the places that needed it, we finally tried another tactic: What if we just distributed less data?

Two assumptions let us greatly shrink our word lists:

We don’t care about the frequencies of words that occur less than once per million words. We can just assume all those words are equally informative.
We don’t care about, say, 2% differences in word frequency.

Now instead of storing a separate frequency for each word, we group the words into 600 possible tiers of frequency. You could call these tiers “centibels”, a logarithmic unit similar to decibels, because there are 100 of them for each factor of 10 in the word frequency. Each of them represents a band of word frequencies that spans about a 2.3% difference. The data we store can then be simplified to “Here are all the words in tier #330… now here are all the words in tier #331…” and converted to frequencies when you ask for them.
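The arithmetic behind the tiers can be sketched in a few lines of Python 3. This assumes that tier n sits n centibels below a frequency of 1, so tier 600 is one occurrence per million. That numbering convention is a guess that fits the figures above, not something wordfreq’s storage format is confirmed to use, and both function names are made up for illustration.

```python
import math

def freq_to_tier(freq):
    """Round a frequency (0 < freq <= 1) to whole centibels below 1."""
    return round(-100 * math.log10(freq))

def tier_to_freq(tier):
    """Convert a centibel tier back to an approximate frequency."""
    return 10 ** (-tier / 100)

# Adjacent tiers differ by a factor of 10 ** 0.01, about 2.3%.
step = tier_to_freq(330) / tier_to_freq(331)
print(freq_to_tier(1e-6), round(step, 4))
```

Storing only the tier number throws away differences smaller than one step, which is exactly the second assumption above.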

Some tiers of word frequencies in English.

This let us cut down the word lists to an entirely reasonable size, so that we can put them in the repository, and just keep them in memory while you’re using them. The English word list, for example, is 245 KB, or 135 KB compressed.

But it’s important to note the trade-off here, that wordfreq only represents sufficiently common words. It’s not suited for comparing rare words to each other. A word rarer than “amulet”, “bunches”, “deactivate”, “groupie”, “pinball”, or “slipper”, all of which have a frequency of about 1 per million, will not be represented in wordfreq.

Getting the package

wordfreq is available on GitHub, or it can be installed from the Python Package Index with the command pip install wordfreq. Documentation can be found in its README on GitHub.

Comparing the frequency per million words of two spellings of “café”, in English and French.

ftfy (fixes text for you) 4.0: changing less and fixing more

Robyn Speer

May 21, 2015


ftfy is a Python tool that takes in bad Unicode and outputs good Unicode. I developed it because we really needed it at Luminoso — the text we work with can be damaged in several ways by the time it gets to us. It’s become our most popular open-source project by far, as many other people have the same itch that we’re scratching.

The coolest thing that ftfy does is to fix mojibake: those mix-ups in encodings that cause the word más to turn into mÃ¡s or even mÃƒÂ¡s. (I’ll recap why this happens and how it can be reversed below.) Mojibake is often intertwined with other problems, such as un-decoded HTML entities (m&aacute;s), and ftfy fixes those as well. But as we worked with the ftfy 3 series, it gradually became clear that the default settings were making some changes that were unnecessary, and from time to time they would actually get in the way of the goal of cleaning up text.

ftfy 4 includes interesting new fixes to creative new ways that various software breaks Unicode. But it also aims to change less text that doesn’t need to be changed. This is the big change that made us increase the major version number from 3 to 4, and it’s fundamentally about Unicode normalization. I’ll discuss this change below under the heading “Normalization”.

Mojibake and why it happens

Mojibake is what happens when text is written in one encoding and read as if it were a different one. It comes from the Japanese word “•¶Žš‰»‚¯” — no, sorry, “文字化け” — meaning “character corruption”. Mojibake turns everything but basic ASCII characters into nonsense.

Suppose you have a word such as “más”. In UTF-8 — the encoding used by the majority of the Internet — the plain ASCII letters “m” and “s” are represented by the familiar single byte that has represented them in ASCII for 50 years. The letter “á”, which is not ASCII, is represented by two bytes.
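The byte-level story for “más” can be reproduced with the standard library. This Python 3 sketch assumes the wrong decoder was Windows-1252, which is one common culprit. It shows one round of damage, a second round, and both being undone in reverse order.

```python
original = 'm\u00e1s'                        # "más"
utf8_bytes = original.encode('utf-8')        # "á" becomes the two bytes C3 A1

# Reading those bytes as Windows-1252 turns "á" into two characters.
once = utf8_bytes.decode('windows-1252')

# Making the same mistake again doubles the damage.
twice = once.encode('utf-8').decode('windows-1252')

# Undo each layer by reversing the steps, innermost mistake last.
repaired = (twice.encode('windows-1252').decode('utf-8')
                 .encode('windows-1252').decode('utf-8'))

print(utf8_bytes, once, twice, repaired)
```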

The details of all the changes can be found, of course, in the CHANGELOG.

Has ftfy solved a problem for you? Have you stumped it with a particularly bizarre case of mojibake? Let us know in the comments or on Twitter.

ftfy (fixes text for you) version 3.0

Robyn Speer

August 26, 2013


About a year ago, we blogged about how to ungarble garbled Unicode in a post called Fixing common Unicode mistakes with Python â€” after they’ve been made. Shortly after that, we released the code in a Python package called ftfy.

You have almost certainly seen the kind of problem ftfy fixes. Here’s a shoutout from a developer who found that her database was full of place names such as “BucureÅŸti, Romania” because of someone else’s bug. That’s easy enough to fix:

pip install ftfy
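Under the hood, repairing that particular place name amounts to undoing one wrong decoding step. Here is a minimal Python 3 sketch using only the standard library, assuming the damage came from UTF-8 bytes being read as Windows-1252. ftfy’s `fix_text` function works out for you which repair, if any, applies.

```python
# The two characters "Å" (U+00C5) and "Ÿ" (U+0178) stand where "ş" should be.
garbled = 'Bucure\u00c5\u0178ti, Romania'

# Turn the characters back into the bytes they came from, then decode
# those bytes the way they were meant to be decoded.
fixed = garbled.encode('windows-1252').decode('utf-8')
print(fixed)
```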

If ftfy is useful to you, we’d love to hear how you’re using it. You can reply to the comments here or e-mail us at info@luminoso.com.

Fixing Unicode mistakes and more: the ftfy package

Robyn Speer

August 24, 2012


There’s been a great response to my earlier post, Fixing common Unicode mistakes with Python. This is clearly something that people besides me needed. In fact, someone already made the code into a web site, at fixencoding.com. I like the favicon.

I took the suggestion to split the code into a new standalone package. It’s now called ftfy, standing for “fixes text for you”. You can install it with pip install ftfy.

I observed that I was doing interesting things with Unicode in Python, and yet I wasn’t doing it in Python 3, which basically makes me a terrible person. ftfy is now compatible with both Python 2 and Python 3.

Something else amusing happened: At one point, someone edited the previous post and WordPress barfed HTML entities all over its text. All the quotation marks turned into &ldquo;, for example. So, for a bit, that post was setting a terrible example about how to handle text correctly!

I took that as a sign that I should expand ftfy so that it also decodes HTML entities (though it will leave them alone in the presence of HTML tags). While I was at it, I also made it turn curly quotes into straight ones, convert Windows line endings to Unix, normalize Unicode characters to their canonical forms, strip out terminal color codes, and remove miscellaneous control characters. The original fix_bad_unicode is still in there, if you just want the encoding fixer without the extra stuff.

Fixing common Unicode mistakes with Python â€“ after they’ve been made

Robyn Speer

August 20, 2012


Update: Not only can you fix Unicode mistakes with Python, you can fix Unicode mistakes with our open source Python package, “ftfy”.

You have almost certainly seen text on a computer that looks something like this:

If numbers arenâ€™t beautiful, I donâ€™t know what is. â€“Paul ErdÅ‘s

Somewhere, a computer got hold of a list of numbers that were intended to constitute a quotation and did something distinctly un-beautiful with it. A person reading that can deduce that it was actually supposed to say this:

If numbers aren’t beautiful, I don’t know what is. –Paul Erdős

Here’s what’s going on. A modern computer has the ability to display text that uses over 100,000 different characters, but unfortunately that text sometimes passes through a doddering old program that believes there are only the 256 that it can fit in a single byte. The program doesn’t even bother to check what encoding the text is in; it just uses its own favorite encoding and turns a bunch of characters into strings of completely different characters.

Now, you’re not the programmer causing the encoding problems, right? Because you’ve read something like Joel Spolsky’s The Absolute Minimum Every Developer Absolutely, Positively Must Know About Unicode And Character Sets or the Python Unicode HOWTO and you’ve learned the difference between text and bytestrings and how to get them right.

But the problem is that sometimes you might have to deal with text that comes out of other code. We deal with this a lot at Luminoso, where the text our customers want us to analyze has often passed through several different pieces of software, each with their own quirks, probably with Microsoft Office somewhere in the chain.

So this post isn’t about how to do Unicode right. It’s about a tool we came up with for damage control after some other program does Unicode wrong. It detects some of the most common encoding mistakes and does what it can to undo them.

Here’s the type of Unicode mistake we’re fixing.

Some text, somewhere, was encoded into bytes using UTF-8 (which is quickly becoming the standard encoding for text on the Internet).
The software that received this text wasn’t expecting UTF-8. It instead decodes the bytes in an encoding with only 256 characters. The simplest of these encodings is the one called “ISO-8859-1”, or “Latin-1” among friends. In Latin-1, you map the 256 possible bytes to the first 256 Unicode characters. This encoding can arise naturally from software that doesn’t even consider that different encodings exist.
The result is that every non-ASCII character turns into two or three garbage characters.
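The steps above can be reproduced on part of the Erdős quotation. This Python 3 sketch uses a made-up helper name, `garble`. It encodes the text as UTF-8 and decodes it with a single-byte codec, once as Windows-1252 and once as Latin-1, to show that each non-ASCII character becomes as many garbage characters as it had bytes.

```python
def garble(text, wrong_encoding):
    """Encode as UTF-8, then decode as if it were a single-byte encoding."""
    return text.encode('utf-8').decode(wrong_encoding)

quote = 'aren\u2019t \u2013Paul Erd\u0151s'   # "aren’t –Paul Erdős"

as_cp1252 = garble(quote, 'windows-1252')    # visible junk such as "â€™"
as_latin1 = garble(quote, 'latin-1')         # junk with invisible control characters

# 3 + 3 + 2 bytes replace the 3 non-ASCII characters.
print(len(quote), len(as_cp1252), len(as_latin1))

# Latin-1 maps every byte to a character, so it can always be reversed.
restored = as_latin1.encode('latin-1').decode('utf-8')
```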

The three most commonly-confused codecs are UTF-8, Latin-1, and Windows-1252. There are lots of other codecs in use in the world, but they are so obviously different from these three that everyone can tell when they’ve gone wrong. We’ll focus on fixing cases where text was encoded as one of these three codecs and decoded as another.

A first attempt

When you look at the kind of junk that’s produced by this process, the character sequences seem so ugly and meaningless that you could just replace anything that looks like it should have been UTF-8. Just find those sequences, replace them unconditionally with what they would be in UTF-8, and you’re done. In fact, that’s what my first version did, skipping a bunch of edge cases and error handling.
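The code of that first version is not reproduced here. As a rough stand-in, here is a minimal Python 3 sketch of the unconditional approach. `naive_fix` is a hypothetical name, and re-encoding the whole string as Windows-1252 is a simplifying assumption rather than the original code. The last lines show why the approach is too eager: a perfectly valid string ending in “ë…”” also happens to look like UTF-8.

```python
import unicodedata

def naive_fix(text):
    """Reinterpret text as UTF-8 whenever that is possible at all."""
    try:
        return text.encode('windows-1252').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable as mis-decoded UTF-8; leave it alone.
        return text

print(naive_fix('m\u00c3\u00a1s'))             # repairs "mÃ¡s"

# A false positive: "Brontë…”" is valid text, but its last three characters
# map to the bytes EB 85 94, which also form a valid UTF-8 sequence.
overfixed = naive_fix('Bront\u00eb\u2026\u201d')
print(overfixed, unicodedata.name(overfixed[-1]))
```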

An intelligent Unicode fixer

Because encoded text can actually be ambiguous, we have to figure out whether the text is better when we fix it or when we leave it alone. The venerable Mark Pilgrim has a key insight when discussing his chardet module:

Encoding detection is really language detection in drag. –Mark Pilgrim, Dive Into Python 3

The reason the word “Bront녔” is so clearly wrong is that the first five characters are Roman letters, while the last one is Hangul, and most words in most languages don’t mix two different scripts like that.

This is where Python’s standard library starts to shine. The unicodedata module can tell us lots of things we want to know about any given character:

>>> import unicodedata
>>> unicodedata.category(u't')
'Ll'
>>> unicodedata.name(u't')
'LATIN SMALL LETTER T'
>>> unicodedata.category(u'녔')
'Lo'
>>> unicodedata.name(u'녔')
'HANGUL SYLLABLE NYEOSS'

Now we can write a more complicated but much more principled Unicode fixer by following some rules of thumb:

We want to apply a consistent transformation that minimizes the number of “weird things” that happen in a string.
Obscure single-byte characters, such as ¶ and ƒ, are weird.
Math and currency symbols adjacent to other symbols are weird.
Having two adjacent letters from different scripts is very weird.
Causing new decoding errors that turn normal characters into � is unacceptable and should count for much more than any other problem.
Favor shorter strings over longer ones, as long as the shorter string isn’t weirder.
Favor correctly-decoded Windows-1252 gremlins over incorrectly-decoded ones.
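As an illustration of just one of these rules, the Python 3 sketch below counts adjacent letters that come from different scripts, taking the first word of each letter’s Unicode name as its script. The function names are made up for this example, and it is far cruder than the scoring in the full recipe, which also groups related scripts and weights the other rules.

```python
import unicodedata

def script_of(char):
    """Return the first word of a letter's Unicode name, or None for non-letters."""
    if not unicodedata.category(char).startswith('L'):
        return None
    return unicodedata.name(char, 'UNKNOWN').split()[0]

def count_script_changes(text):
    """Count places where two adjacent letters belong to different scripts."""
    changes = 0
    prev = None
    for char in text:
        script = script_of(char)
        if prev is not None and script is not None and script != prev:
            changes += 1
        prev = script
    return changes

print(count_script_changes('Bront\ub154'))             # Latin letters, then Hangul
print(count_script_changes('Bront\u00eb\u2026\u201d')) # Latin letters, then punctuation
```

The over-corrected string has one script change and the untouched one has none, so a cost function built on this count would prefer to leave the text alone.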

That leads us to a complete Unicode fixer that applies these rules. It does an excellent job at fixing files full of garble line-by-line, such as the University of Leeds Internet Spanish frequency list, which picked up that “mÃ¡s” is a really common word in Spanish text because there is so much incorrect Unicode on the Web.

The code we arrive at appears below. (But as I edit this post six years later, I should remind you that this was 2012! We’ve gotten much fancier about this, so you should try our full-featured Unicode fixing library, ftfy.)

# -*- coding: utf-8 -*-
#
# This code has become part of the "ftfy" library:
#
# http://ftfy.readthedocs.io/en/latest/
#
# That library is actively maintained and works on Python 2 or 3. This recipe
# is not.

import unicodedata

def fix_bad_unicode(text):
    u"""
    Something you will find all over the place, in real-world text, is text
    that's mistakenly encoded as utf-8, decoded in some ugly format like
    latin-1 or even Windows codepage 1252, and encoded as utf-8 again.

    This causes your perfectly good Unicode-aware code to end up with garbage
    text because someone else (or maybe "someone else") made a mistake.

    This function looks for the evidence of that having happened and fixes it.
    It determines whether it should replace nonsense sequences of single-byte
    characters that were really meant to be UTF-8 characters, and if so, turns
    them into the correctly-encoded Unicode character that they were meant to
    represent.

    The input to the function must be Unicode. It's not going to try to
    auto-decode bytes for you -- then it would just create the problems it's
    supposed to fix.

        >>> print fix_bad_unicode(u'Ãºnico')
        único

        >>> print fix_bad_unicode(u'This text is fine already :þ')
        This text is fine already :þ

    Because these characters often come from Microsoft products, we allow
    for the possibility that we get not just Unicode characters 128-255, but
    also Windows's conflicting idea of what characters 128-160 are.

        >>> print fix_bad_unicode(u'This â€” should be an em dash')
        This — should be an em dash

    We might have to deal with both Windows characters and raw control
    characters at the same time, especially when dealing with characters like
    \x81 that have no mapping in Windows.

        >>> print fix_bad_unicode(u'This text is sad .â\x81”.')
        This text is sad .⁔.

    This function even fixes multiple levels of badness:

        >>> wtf = u'\xc3\xa0\xc2\xb2\xc2\xa0_\xc3\xa0\xc2\xb2\xc2\xa0'
        >>> print fix_bad_unicode(wtf)
        ಠ_ಠ

    However, it has safeguards against fixing sequences of letters and
    punctuation that can occur in valid text:

        >>> print fix_bad_unicode(u'not such a fan of Charlotte Brontë…”')
        not such a fan of Charlotte Brontë…”

    Cases of genuine ambiguity can sometimes be addressed by finding other
    characters that are not double-encoding, and expecting the encoding to
    be consistent:

        >>> print fix_bad_unicode(u'AHÅ™, the new sofa from IKEA®')
        AHÅ™, the new sofa from IKEA®

    Finally, we handle the case where the text is in a single-byte encoding
    that was intended as Windows-1252 all along but read as Latin-1:

        >>> print fix_bad_unicode(u'This text was never Unicode at all\x85')
        This text was never Unicode at all…
    """
    if not isinstance(text, unicode):
        raise TypeError("This isn't even decoded into Unicode yet. "
                        "Decode it first.")
    if len(text) == 0:
        return text

    maxord = max(ord(char) for char in text)
    tried_fixing = []
    if maxord < 128:
        # Hooray! It's ASCII!
        return text
    else:
        attempts = [(text, text_badness(text) + len(text))]
        if maxord < 256:
            tried_fixing = reinterpret_latin1_as_utf8(text)
            tried_fixing2 = reinterpret_latin1_as_windows1252(text)
            attempts.append((tried_fixing, text_cost(tried_fixing)))
            attempts.append((tried_fixing2, text_cost(tried_fixing2)))
        elif all(ord(char) in WINDOWS_1252_CODEPOINTS for char in text):
            tried_fixing = reinterpret_windows1252_as_utf8(text)
            attempts.append((tried_fixing, text_cost(tried_fixing)))
        else:
            # We can't imagine how this would be anything but valid text.
            return text

        # Sort the results by badness
        attempts.sort(key=lambda x: x[1])
        #print attempts
        goodtext = attempts[0][0]
        if goodtext == text:
            return goodtext
        else:
            return fix_bad_unicode(goodtext)


def reinterpret_latin1_as_utf8(wrongtext):
    newbytes = wrongtext.encode('latin-1', 'replace')
    return newbytes.decode('utf-8', 'replace')


def reinterpret_windows1252_as_utf8(wrongtext):
    altered_bytes = []
    for char in wrongtext:
        if ord(char) in WINDOWS_1252_GREMLINS:
            altered_bytes.append(char.encode('WINDOWS_1252'))
        else:
            altered_bytes.append(char.encode('latin-1', 'replace'))
    return ''.join(altered_bytes).decode('utf-8', 'replace')


def reinterpret_latin1_as_windows1252(wrongtext):
    """
    Maybe this was always meant to be in a single-byte encoding, and it
    makes the most sense in Windows-1252.
    """
    return wrongtext.encode('latin-1').decode('WINDOWS_1252', 'replace')


def text_badness(text):
    u'''
    Look for red flags that text is encoded incorrectly:

    Obvious problems:
    - The replacement character \ufffd, indicating a decoding error
    - Unassigned or private-use Unicode characters

    Very weird things:
    - Adjacent letters from two different scripts
    - Letters in scripts that are very rarely used on computers (and
      therefore, someone who is using them will probably get Unicode right)
    - Improbable control characters, such as 0x81

    Moderately weird things:
    - Improbable single-byte characters, such as ƒ or ¬
    - Letters in somewhat rare scripts
    '''
    assert isinstance(text, unicode)
    errors = 0
    very_weird_things = 0
    weird_things = 0
    prev_letter_script = None
    for pos in xrange(len(text)):
        char = text[pos]
        index = ord(char)
        if index < 256:
            # Deal quickly with the first 256 characters.
            weird_things += SINGLE_BYTE_WEIRDNESS[index]
            if SINGLE_BYTE_LETTERS[index]:
                prev_letter_script = 'latin'
            else:
                prev_letter_script = None
        else:
            category = unicodedata.category(char)
            if category == 'Co':
                # Unassigned or private use
                errors += 1
            elif index == 0xfffd:
                # Replacement character
                errors += 1
            elif index in WINDOWS_1252_GREMLINS:
                lowchar = char.encode('WINDOWS_1252').decode('latin-1')
                weird_things += SINGLE_BYTE_WEIRDNESS[ord(lowchar)] - 0.5

            if category.startswith('L'):
                # It's a letter. What kind of letter? This is typically found
                # in the first word of the letter's Unicode name.
                name = unicodedata.name(char)
                scriptname = name.split()[0]
                freq, script = SCRIPT_TABLE.get(scriptname, (0, 'other'))
                if prev_letter_script:
                    if script != prev_letter_script:
                        very_weird_things += 1
                    if freq == 1:
                        weird_things += 2
                    elif freq == 0:
                        very_weird_things += 1
                prev_letter_script = script
            else:
                prev_letter_script = None

    return 100 * errors + 10 * very_weird_things + weird_things


def text_cost(text):
    """
    Assign a cost function to the length plus weirdness of a text string.
    """
    return text_badness(text) + len(text)

#######################################################################
# The rest of this file is esoteric info about characters, scripts, and their
# frequencies.
#
# Start with an inventory of "gremlins", which are characters from all over
# Unicode that Windows has instead assigned to the control characters
# 0x80-0x9F. We might encounter them in their Unicode forms and have to figure
# out what they were originally.

WINDOWS_1252_GREMLINS = [
    # adapted from http://effbot.org/zone/unicode-gremlins.htm
    0x0152,  # LATIN CAPITAL LIGATURE OE
    0x0153,  # LATIN SMALL LIGATURE OE
    0x0160,  # LATIN CAPITAL LETTER S WITH CARON
    0x0161,  # LATIN SMALL LETTER S WITH CARON
    0x0178,  # LATIN CAPITAL LETTER Y WITH DIAERESIS
    0x017E,  # LATIN SMALL LETTER Z WITH CARON
    0x017D,  # LATIN CAPITAL LETTER Z WITH CARON
    0x0192,  # LATIN SMALL LETTER F WITH HOOK
    0x02C6,  # MODIFIER LETTER CIRCUMFLEX ACCENT
    0x02DC,  # SMALL TILDE
    0x2013,  # EN DASH
    0x2014,  # EM DASH
    0x201A,  # SINGLE LOW-9 QUOTATION MARK
    0x201C,  # LEFT DOUBLE QUOTATION MARK
    0x201D,  # RIGHT DOUBLE QUOTATION MARK
    0x201E,  # DOUBLE LOW-9 QUOTATION MARK
    0x2018,  # LEFT SINGLE QUOTATION MARK
    0x2019,  # RIGHT SINGLE QUOTATION MARK
    0x2020,  # DAGGER
    0x2021,  # DOUBLE DAGGER
    0x2022,  # BULLET
    0x2026,  # HORIZONTAL ELLIPSIS
    0x2030,  # PER MILLE SIGN
    0x2039,  # SINGLE LEFT-POINTING ANGLE QUOTATION MARK
    0x203A,  # SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
    0x20AC,  # EURO SIGN
    0x2122,  # TRADE MARK SIGN
]

# a list of Unicode characters that might appear in Windows-1252 text
WINDOWS_1252_CODEPOINTS = range(256) + WINDOWS_1252_GREMLINS

# Rank the characters typically represented by a single byte -- that is, in
# Latin-1 or Windows-1252 -- by how weird it would be to see them in running
# text.
#
#   0 = not weird at all
#   1 = rare punctuation or rare letter that someone could certainly
#       have a good reason to use. All Windows-1252 gremlins are at least
#       weirdness 1.
#   2 = things that probably don't appear next to letters or other
#       symbols, such as math or currency symbols
#   3 = obscure symbols that nobody would go out of their way to use
#       (includes symbols that were replaced in ISO-8859-15)
#   4 = why would you use this?
#   5 = unprintable control character
#
# The Portuguese letter Ã (0xc3) is marked as weird because it would usually
# appear in the middle of a word in actual Portuguese, and meanwhile it
# appears in the mis-encodings of many common characters.

SINGLE_BYTE_WEIRDNESS = (
#   0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
    5, 5, 5, 5, 5, 5, 5, 5, 5, 0, 0, 5, 5, 5, 5, 5,  # 0x00
    5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,  # 0x10
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0x20
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0x30
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0x40
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0x50
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0x60
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5,  # 0x70
    2, 5, 1, 4, 1, 1, 3, 3, 4, 3, 1, 1, 1, 5, 1, 5,  # 0x80
    5, 1, 1, 1, 1, 3, 1, 1, 4, 1, 1, 1, 1, 5, 1, 1,  # 0x90
    1, 0, 2, 2, 3, 2, 4, 2, 4, 2, 2, 0, 3, 1, 1, 4,  # 0xa0
    2, 2, 3, 3, 4, 3, 3, 2, 4, 4, 4, 0, 3, 3, 3, 0,  # 0xb0
    0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0xc0
    1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0,  # 0xd0
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0xe0
    1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0,  # 0xf0
)

# Pre-cache the Unicode data saying which of these first 256 characters are
# letters. We'll need it often.
SINGLE_BYTE_LETTERS = [
    unicodedata.category(unichr(i)).startswith('L')
    for i in xrange(256)
]

# A table telling us how to interpret the first word of a letter's Unicode
# name. The number indicates how frequently we expect this script to be used
# on computers. Many scripts not included here are assumed to have a frequency
# of "0" -- if you're going to write in Linear B using Unicode, you're
# probably aware enough of encoding issues to get it right.
#
# The lowercase name is a general category -- for example, Han characters and
# Hiragana characters are very frequently adjacent in Japanese, so they all go
# into category 'cjk'. Letters of different categories are assumed not to
# appear next to each other often.
SCRIPT_TABLE = {
    'LATIN': (3, 'latin'),
    'CJK': (2, 'cjk'),
    'ARABIC': (2, 'arabic'),
    'CYRILLIC': (2, 'cyrillic'),
    'GREEK': (2, 'greek'),
    'HEBREW': (2, 'hebrew'),
    'KATAKANA': (2, 'cjk'),
    'HIRAGANA': (2, 'cjk'),
    'HIRAGANA-KATAKANA': (2, 'cjk'),
    'HANGUL': (2, 'cjk'),
    'DEVANAGARI': (2, 'devanagari'),
    'THAI': (2, 'thai'),
    'FULLWIDTH': (2, 'cjk'),
    'MODIFIER': (2, None),
    'HALFWIDTH': (1, 'cjk'),
    'BENGALI': (1, 'bengali'),
    'LAO': (1, 'lao'),
    'KHMER': (1, 'khmer'),
    'TELUGU': (1, 'telugu'),
    'MALAYALAM': (1, 'malayalam'),
    'SINHALA': (1, 'sinhala'),
    'TAMIL': (1, 'tamil'),
    'GEORGIAN': (1, 'georgian'),
    'ARMENIAN': (1, 'armenian'),
    'KANNADA': (1, 'kannada'),  # mostly used for looks of disapproval
    'MASCULINE': (1, 'latin'),
    'FEMININE': (1, 'latin')
}
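The hand-written WINDOWS_1252_GREMLINS inventory in the recipe can be cross-checked against Python’s own codec tables. This Python 3 sketch decodes each byte from 0x80 to 0x9F as Windows-1252 and collects the resulting code points. The function name is hypothetical and not part of the recipe.

```python
def windows_1252_gremlins():
    """Code points that Windows-1252 assigns to the bytes 0x80-0x9F."""
    gremlins = []
    for byte in range(0x80, 0xA0):
        try:
            char = bytes([byte]).decode('windows-1252')
        except UnicodeDecodeError:
            # Python's codec leaves a few of these bytes unassigned.
            continue
        gremlins.append(ord(char))
    return sorted(gremlins)

derived = windows_1252_gremlins()
print(len(derived), [hex(cp) for cp in derived[:3]])
```

The derived list has the same 27 entries as the table in the recipe, including the euro sign and the trade mark sign.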