What makes a language look like itself?

You can often guess which language you are looking at just from certain letters. I don't speak Icelandic, for example, but I recognize Icelandic text when I see letters such as “ð” or other letters that are rare elsewhere. Similarly, I have noticed that when I see a lot of “ijk” in a text, it is probably Dutch.

This article explores how we can use simple statistics to learn these visual fingerprints: the character sequences that are most characteristic of each language, across 20 different languages.

How to learn the fingerprints with statistics

To learn the visual “fingerprints” of a language, we first need a way to measure how distinctive a given character pattern is. A natural starting point might be to look at the most common character patterns in each language. However, this approach quickly falls short, because a character pattern can be very common in one language and also appear frequently in others. Frequency alone doesn't capture uniqueness. Instead, we want to ask:

“How much more likely is this pattern to come from one language than from all the others?”

This is where the math comes in! Formally, let:

  • L be the set of all 20 studied languages
  • S be the set of all character patterns observed across those languages

To quantify how strongly a given character pattern s ∈ S identifies a language l ∈ L, we use the likelihood ratio:

\[LR_{s,l} = \frac{P(s|l)}{P(s|\neg l)}\]

This compares the probability of seeing character pattern s in language l with the probability of seeing it in any other language. The higher the ratio, the more distinctively that pattern belongs to the language.

Computing the likelihood ratio in practice

To compute the likelihood ratio of each character pattern in practice, we need to translate the conditional probabilities into quantities we can actually measure. Here is how we define the relevant counts:

  • c_l(s): the number of times character pattern s appears in language l
  • c_¬l(s): the number of times character pattern s appears in all other languages
  • N_l: the total number of character pattern occurrences in language l
  • N_¬l: the total number of character pattern occurrences in all other languages

Using these, the conditional probabilities become:

\[P(s|l) = \frac{c_l(s)}{N_l},~P(s|\neg l)=\frac{c_{\neg l}(s)}{N_{\neg l}}\]

And the likelihood ratio simplifies to:

\[LR_{s,l} = \frac{P(s|l)}{P(s|\neg l)} = \frac{c_l(s)\cdot N_{\neg l}}{c_{\neg l}(s)\cdot N_l}\]

This gives us a numerical score that quantifies how much more likely a character pattern is to appear in language l than in all the others.
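As a minimal sketch of this ratio, the four counts can be plugged in directly. The function name and the toy numbers below are illustrative assumptions, not taken from the original implementation.

```python
def likelihood_ratio(c_l, n_l, c_not_l, n_not_l):
    """Unsmoothed likelihood ratio: (c_l * N_not_l) / (c_not_l * N_l)."""
    p_given_l = c_l / n_l              # P(s | l)
    p_given_not_l = c_not_l / n_not_l  # P(s | not l)
    return p_given_l / p_given_not_l


# Toy counts, made up for illustration: the pattern occurs 30 times among
# 1,000 pattern occurrences in language l, and 6 times among 20,000
# pattern occurrences in all other languages.
lr = likelihood_ratio(c_l=30, n_l=1_000, c_not_l=6, n_not_l=20_000)
print(lr)
```

With these toy numbers the pattern comes out roughly 100 times more likely in language l than elsewhere. Note that this version fails with a ZeroDivisionError as soon as `c_not_l` is 0.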

Handling zero counts

Unfortunately, there is a problem with our likelihood ratio formula: what happens when c_¬l(s) = 0?

In other words, what if a character pattern appears only in language l and nowhere else? This leads to a division by zero in the denominator, and an infinite likelihood ratio.

Technically, this means that we have found a perfectly unique pattern for that language. But in practice, it is not very helpful. A character pattern could appear just once in one language and nowhere else, and it would be given an infinite score. That is not very useful as a stable “fingerprint” of the language.

To avoid this issue, we use a technique called additive smoothing. This approach slightly adjusts the raw counts to eliminate zeros and reduce the impact of rare events.

Specifically, we add a small constant α to each count in the numerator, and α|S| to the denominator, with |S| being the total number of distinct character patterns observed. This amounts to assuming that every character pattern has a small chance of occurring in every language, even if it has not been observed there yet.

With smoothing, the adjusted probabilities become:

\[P'(s|l) = \frac{c_l(s) + \alpha}{N_l + \alpha|S|},~P'(s|\neg l)=\frac{c_{\neg l}(s) + \alpha}{N_{\neg l} + \alpha|S|}\]

And the final likelihood ratio to be maximized is:

\[LR_{s,l} = \frac{P'(s|l)}{P'(s|\neg l)} = \frac{(c_l(s) + \alpha)\cdot(N_{\neg l} + \alpha|S|)}{(N_l + \alpha|S|)\cdot(c_{\neg l}(s) + \alpha)}\]

This keeps things numerically stable and ensures that a rare pattern is not automatically favored just because it is exclusive to one language.
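The smoothed ratio can be sketched in a few lines. This is an illustration under stated assumptions rather than the author's code: the function name, the parameter names, and the toy counts are hypothetical, and `num_patterns` stands for |S|.

```python
import math


def smoothed_likelihood_ratio(c_l, n_l, c_not_l, n_not_l, num_patterns, alpha=0.5):
    """Likelihood ratio with additive smoothing (alpha added to every count)."""
    p_given_l = (c_l + alpha) / (n_l + alpha * num_patterns)
    p_given_not_l = (c_not_l + alpha) / (n_not_l + alpha * num_patterns)
    return p_given_l / p_given_not_l


# Toy counts, made up for illustration. Both patterns never occur outside
# language l, so the unsmoothed ratio would divide by zero for each of them.
rare = smoothed_likelihood_ratio(1, 1_000, 0, 20_000, num_patterns=500)
frequent = smoothed_likelihood_ratio(50, 1_000, 0, 20_000, num_patterns=500)

print(math.isfinite(rare), math.isfinite(frequent))
print(rare < frequent)
```

Both scores are finite, and the pattern seen 50 times scores far higher than the one seen once, which is the behaviour we want from a stable fingerprint.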

Data

Now that we have defined a metric for identifying the most distinctive character patterns (our “fingerprints”), it is time to gather language data to analyze.

For this, I used the Python library wordfreq, which provides word frequency lists based on large sources such as Wikipedia, books, subtitles and web text.

One function that is particularly useful for this analysis is top_n_list(), which returns a sorted list of the most frequent words in a given language. For example, to get the 40 most common Icelandic words, we would call:

wordfreq.top_n_list("is", 40, ascii_only=False)

The argument ascii_only=False ensures that non-ASCII characters, such as the Icelandic “ð” and “þ”, are kept in the output. This is essential for this analysis, because we are specifically looking for language-specific character patterns, including single characters.

To build the dataset, I pulled the 5,000 most common words in each of the following 20 European languages:

Catalan, Czech, Danish, Dutch, English, Finnish, French, German, Hungarian, Icelandic, Italian, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Spanish, Swedish, Turkish.

This yields a large multilingual vocabulary, rich enough to extract meaningful statistics for every language.

To extract the character patterns used in the analysis, all possible substrings of length 1 to 5 were generated from each word in the dataset. For example, the word language would contribute patterns like l, la, lan, lang, langu, a, an, ang, and so on. The result is a comprehensive set S of more than 180,000 unique character patterns observed across the 20 studied languages.
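The extraction step can be sketched as follows. The helper names are hypothetical, and the sketch assumes that every word contributes its substrings exactly once, without weighting by word frequency; the original implementation may differ.

```python
from collections import Counter


def char_patterns(word, max_len=5):
    """Yield every contiguous substring of `word` with length 1..max_len."""
    for start in range(len(word)):
        for length in range(1, max_len + 1):
            if start + length > len(word):
                break  # the substring would run past the end of the word
            yield word[start:start + length]


def count_patterns(words, max_len=5):
    """Count how often each character pattern occurs across a list of words."""
    counts = Counter()
    for word in words:
        counts.update(char_patterns(word, max_len))
    return counts


patterns = list(char_patterns("language"))
counts = count_patterns(["language", "lang"])
print(patterns[:8])
print(counts["lang"])
```

Running `count_patterns` once per language gives the per-language counts c_l(s), and the sum of all counts for a language gives N_l.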

Results

For each language, the five character patterns with the highest likelihood ratio are shown below. The smoothing constant was set to α = 0.5.

Because the raw likelihood ratios can be quite large, I report the base-10 logarithm of the likelihood ratio (log10(LR)) instead. For instance, a log likelihood ratio of 3 means that the character pattern is 10^3 = 1,000 times more likely to appear in that language than in all the others. Note that due to smoothing, these likelihood ratios are approximate rather than exact, and extreme values are dampened.
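Putting the pieces together, the ranking step might look like the sketch below. It is an illustration under stated assumptions: `top_patterns` is a hypothetical helper, the per-language counts are tiny made-up numbers rather than real data, and |S| is taken to be the number of distinct patterns across all languages.

```python
import math
from collections import Counter


def top_patterns(counts_by_lang, lang, k=5, alpha=0.5):
    """Return the k patterns with the highest log10 smoothed likelihood ratio."""
    all_patterns = set().union(*counts_by_lang.values())
    in_lang = counts_by_lang[lang]

    # Pool the counts of every other language into a single "not l" counter.
    rest = Counter()
    for other, counts in counts_by_lang.items():
        if other != lang:
            rest.update(counts)

    n_l = sum(in_lang.values())
    n_not_l = sum(rest.values())
    size = len(all_patterns)  # |S|

    scores = {}
    for s in all_patterns:
        p_l = (in_lang.get(s, 0) + alpha) / (n_l + alpha * size)
        p_not_l = (rest.get(s, 0) + alpha) / (n_not_l + alpha * size)
        scores[s] = math.log10(p_l / p_not_l)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Made-up toy counts, only to show the mechanics.
counts_by_lang = {
    "nl": Counter({"ijk": 8, "e": 20, "n": 12}),
    "en": Counter({"th": 9, "e": 25, "n": 10}),
    "is": Counter({"ð": 7, "e": 5, "n": 15}),
}
top_nl = top_patterns(counts_by_lang, "nl", k=2)
top_is = top_patterns(counts_by_lang, "is", k=1)
print(top_nl)
print(top_is)
```

In this toy example the pattern that occurs in only one language ranks first for that language, while patterns shared by all three languages score close to zero on the log scale.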

Each cell shows a top-ranked character pattern and its log likelihood ratio.

| Language | #1 | #2 | #3 | #4 | #5 |
|---|---|---|---|---|---|
| Catalan | èenc (3.03) | èci (3.01) | milk (2.95) | ècia (2.92) | has been lined (2.77) |
| Czech | ě (4.14) | ř (3.94) | ni (3.65) | ů (3.59) | ře (3.55) |
| Danish | øw (2.82) | pream and then (2.77) | (2.73) | insist (2.67) | øez (2.67) |
| Dutch | ijk (3.51) | lijk (3.45) | elijk (3.29) | joke (3.04) | sauce (3.04) |
| English | counselor (2.79) | tly (2.64) | another place (2.54) | y (2.54) | strong (2.52) |
| Finnish | räi (3.74) | äää (3.33) | tää (3.27) | wellÄ (3.13) | ssä (3.13) |
| French | êt (2.83) | fortune (2.78) | rése (2.73) | pressing (2.68) | herrré (2.64) |
| German | purpose (3.03) | object (2.98) | tlich (2.98) | Serminal Type (2.98) | cancel (2.90) |
| Hungarian | ő (3.80) | ű (3.17) | rye (3.16) | sweat (3.14) | észz (3.09) |
| Icelandic | ð (4.32) | opport (3.74) | a (3.64) | ¿ (3.63) | ði (3.60) |
| Italian | Zionone feeling (3.41) | azion (3.29) | stock (3.07) | aggi (2.90) | zion (2.87) |
| Latvian | ā (4.50) | Hochures booklets (4.20) | + They See (4.10) | (3.66) | (3.64) |
| Lithuanian | ė (4.11) | ų (4.03) | (3.58) | į (3.57) | ės (3.56) |
| Norwegian | kind (3.17) | asj (2.93) | øy (2.88) | asjon (2.88) | alarm (2.88) |
| Polish | ν (4.13) | ś (3.79) | ³ (3.77) | ż (3.69) | For what to do (3.59) |
| Portuguese | Taking Nody (3.73) | Ç ರ (3.53) | ção (3.53) | ação (3.32) | açã (3.32) |
| Romanian | 3: (4.31) | ţ (4.01) | ţi (3.86) | ş ş (3.64) | chance (3.60) |
| Spanish | mat (3.51) | ación (3.29) | und (3.14) | subtitle (2.86) | eento (2.85) |
| Swedish | förs (2.89) | recently (2.72) | stäl (2.72) | ång (2.68) | öra (2.68) |
| Turkish | + (4.52) | ş ş (4.10) | Ок (3.83) | ın (3.80) | l (3.60) |

Discussion

Below are some interesting interpretations of the results. This is not meant to be a complete analysis, just a few observations that I found noteworthy:

  • Many of the character patterns with the highest likelihood ratios are single characters unique to their language, such as the previously mentioned Icelandic “ð” or the Turkish “ı”. Because these characters are essentially absent from all other languages in the dataset, they would have produced infinite likelihood ratios if it were not for the additive smoothing.
  • In some languages, most notably Dutch, many of the top results are substrings of one another. For example, the top pattern “ijk” also appears inside the next-highest patterns, such as “lijk” and “elijk”. This shows how certain letter combinations are reused frequently in longer words, making them even more distinctive for that language.
  • English has the least distinctive character patterns of all the analyzed languages, with a maximum log likelihood ratio of only 2.79. This may be due to the presence of English loanwords in many other languages' top 5,000 words, which dilutes the uniqueness of English-specific patterns.
  • There are several cases where the top character patterns reflect structures shared across languages. For example, the Spanish “-ación”, Italian “-azione” and Norwegian “-asjon” all work as noun-forming suffixes. These endings stand out strongly in each language and highlight how different languages can follow similar patterns while using different spellings.

Conclusion

The project began with a simple question: what makes a language look like itself? By analyzing the 5,000 most common words in 20 languages and comparing the character patterns they use, we were able to uncover distinctive “fingerprints” for each language, from accented letters to letter combinations like “ijk” or “ción”. While the results are not definitive, they offer a fun and statistically grounded way to explore what sets languages apart visually, even without understanding a single word.

See my GitHub repository for the full implementation of this method.

Thank you for reading!

References

wordfreq Python library:

  • Robyn Speer. (2022). rspeer/wordfreq: v3.0 (v3.0.2). Zenodo.
