Strategic Client Relationship Manager

Discuss smarter ways to manage and optimize cv data.
rifat28dddd
Posts: 688
Joined: Fri Dec 27, 2024 12:31 pm


Post by rifat28dddd »

Many people ask: if large language models are so powerful, can they save the world's low-resource languages? K, the founder of linguistics, devoted his life to developing a universal grammar. He has a famous metaphor: if aliens came to Earth, they could understand and read every language on Earth, because from their point of view every language follows the same grammar and people merely speak different "dialects." If a model can switch between many languages, has it cracked the mystery of universal grammar? In practice, low-resource languages are still underrepresented in large language models. Despite their transformative potential, these models cater primarily to English and a few other high-resource languages.


A close examination of the training corpora used by models such as - shows a clear imbalance among languages. English dominates: the vast majority of -'s training corpus is English, and subsequent models based on - continue this trend. Representation of other languages is limited (the analysis here is restricted to the - corpus). Only two languages account for more than 10% of the - corpus, namely French () and German (). Other languages that fall into the range include Spanish, Italian, Portuguese, Dutch, Russian, Romanian, Polish, Finnish, Danish, Swedish, Japanese, and Norwegian. It is worth noting that languages like Chinese and Hindi, which have more than 100 million speakers in total, do not even meet the corpus threshold.


Training data concentration: there is a clear head effect, with the top languages in the training corpus accounting for 10% in total. Limited word coverage: only 10 languages in the training corpus have more than 10 million words, of which the 10th language is Khmer. Although Khmer is spoken by 10 million people in Cambodia, it has only 10,000 words in the training corpus of . The bias towards English and a few selected high-resource languages is not intentional on the part of the parent company. Most of the corpus comes from the Internet, and the Internet reflects the wealth, openness, and activity of countries and languages. As a result, large language models largely ignore the majority of the world's existing languages.
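For anyone who wants to check this kind of head effect on their own data, here is a minimal sketch of how per-language share and top-k concentration could be computed from word counts. The function names and the numbers in `example_counts` are hypothetical and made up for illustration only; they are not taken from any real model's training corpus.

```python
from typing import Dict, List, Tuple


def language_shares(word_counts: Dict[str, int]) -> List[Tuple[str, float]]:
    """Return (language, share of total words), sorted from largest to smallest."""
    total = sum(word_counts.values())
    if total == 0:
        raise ValueError("corpus is empty")
    ranked = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(lang, count / total) for lang, count in ranked]


def top_k_share(word_counts: Dict[str, int], k: int) -> float:
    """Combined share of the k largest languages (the 'head' of the distribution)."""
    return sum(share for _, share in language_shares(word_counts)[:k])


# Hypothetical word counts, invented for this example (not real corpus statistics).
example_counts = {"en": 900, "fr": 50, "de": 30, "km": 20}

if __name__ == "__main__":
    for lang, share in language_shares(example_counts):
        print(f"{lang}: {share:.1%}")
    print(f"top-1 share: {top_k_share(example_counts, 1):.1%}")
```

With real per-language counts in place of `example_counts`, a top-k share close to 100% for a small k would be one simple way to quantify the concentration described above.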