How to classify multilingual texts with AI support

Are you tired of manually categorizing multilingual text data for your website? Are you struggling to keep up with SEO tasks because of the growing volume of multilingual content on your site?

To help, we have created a small Google Colab script that acts as an AI-powered multilingual text classifier.

In this article, we’ll define what an AI multilingual text classifier is and discuss how it increases productivity in SEO migrations and helps optimize the hreflang tags of a multilingual website.

We will also go into the features and advantages of using a multilingual text classifier with AI.

But that’s not all. Our Google Colab notebook offers a Multilingual Text Classifier tool that can handle about 50 languages and provides up to 3 matches for each text, ranked by a semantic similarity score.

This makes it an excellent tool for SEO migrations of multilingual sites. So let’s get started and find out how you can automate language identification and text classification with ease!

What is multilingual text classification?

In today’s fast-paced world, companies and individuals generate an enormous amount of textual data on a daily basis.

Text classification is the process of categorizing this data into groups based on specific characteristics, which facilitates information management, analysis, and retrieval.

With the advent of AI-powered multilingual text classifiers, text classification has become faster, more efficient and more accurate.

The goal of text classification is to automatically classify unstructured data, such as emails, social media posts and news articles, into predefined categories.

Multilingual text classification consists of categorizing text data in multiple languages.

Creating a multilingual text classifier using AI can help organizations automate the process of language identification and text categorization for various applications.

The relevance of accurate and automated multilingual text categorization

Multilingual text classification helps with:

1. Multilingual website migrations:

When migrating a website to a new domain or content management system (CMS), it is crucial to ensure that all pages and content are correctly categorized in the target language. A multilingual text classification tool can help automate this process by identifying the language of each page and categorizing content accordingly.

2. Identification of keyword cannibalization:

Identifying cannibalization is another vital aspect of SEO, and it is essential to check for it in every language so that the problem is not replicated across a global audience. A multilingual text classification tool can help by grouping content around the keywords and phrases relevant to each language, making it easier to spot different texts that may be targeting the same search intent.

Ingredients: What you need to classify texts on a multilingual website

In fact, you only need to do two things:

Install the dependencies of our script.
Load the Excel file with the texts in different languages that you want to compare.

The script uses libraries that perform a symmetric semantic search across texts in multiple languages at the same time.

What it does is compare texts in different languages and group those that are most similar from a semantic point of view, giving each match a score between 0 and 1.

What we get back is a file that groups the texts in different languages whose content is similar, pairing each text with its equivalents.

Functions of the script to classify multi-language texts

This code builds a language-to-language mapping between two CSVs containing texts in different languages, based on their semantic similarity.

It can be extended by adding more CSV documents to the comparison with other languages.

Several multi-language models are available:

distiluse-base-multilingual-cased-v1: supports 15 languages, including Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, and Turkish.
distiluse-base-multilingual-cased-v2: multilingual knowledge-distilled version of the multilingual Universal Sentence Encoder. This version supports more than 50 languages, but its performance is somewhat lower than that of v1.
paraphrase-multilingual-MiniLM-L12-v2: multilingual version of paraphrase-MiniLM-L12-v2, trained on parallel data for more than 50 languages.
paraphrase-multilingual-mpnet-base-v2: multilingual version of paraphrase-mpnet-base-v2, trained on parallel data for more than 50 languages.

In the script, we have chosen the model that performed best in our tests.

In our case: paraphrase-multilingual-MiniLM-L12-v2.

As you can see, it is not a GPT model, but it is 100% free, and we can install it directly in Colab and use it from there.
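As a rough illustration of this step, the sketch below loads the chosen model with the sentence-transformers library and turns a list of texts into embeddings. It assumes the sentence-transformers package is already installed in the Colab session; the helper names load_model and embed are hypothetical and are not taken from the original notebook.

```python
MODEL_NAME = "paraphrase-multilingual-MiniLM-L12-v2"


def load_model(name=MODEL_NAME):
    """Load a multilingual sentence-embedding model.

    The import is done lazily so this file can be loaded even where the
    library is missing; the model weights are downloaded on first use.
    """
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(name)


def embed(model, texts):
    """Return one embedding per text.

    normalize_embeddings=True asks the model for unit-length vectors,
    so a plain dot product between two of them equals their cosine
    similarity.
    """
    return model.encode(texts, normalize_embeddings=True)
```

In a notebook cell this would typically be used as model = load_model() followed by vectors = embed(model, titles), where titles is a plain Python list of strings.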

Once this is done, you only need to upload the CSVs, each of which must include a column named Title.

Colab will generate the data frames, convert the text to lists, assign the main language that we have selected and return, by default, up to 3 results per text.
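The loading step could look roughly like the following. This is only a sketch using Python's standard csv module instead of data frames; the function name read_titles, the sample titles and the TOP_K constant are assumptions made for illustration, and only the Title column name comes from the text above.

```python
import csv
import io

TOP_K = 3  # default number of matches to keep per title


def read_titles(csv_file, column="Title"):
    """Read the given column from an open CSV file into a list of strings."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"CSV must contain a '{column}' column")
    # Skip empty cells so they are not matched against anything later.
    return [
        row[column].strip()
        for row in reader
        if row[column] and row[column].strip()
    ]


# In-memory stand-ins for the two uploaded files (hypothetical content).
spanish_csv = io.StringIO("Title\nZapatillas de correr\nCamiseta de algodón\n")
english_csv = io.StringIO("Title\nCotton T-shirt\nRunning shoes\n")

source_titles = read_titles(spanish_csv)  # main language
target_titles = read_titles(english_csv)  # language to map against
```

With real uploads, the same function can be called on a file opened with open(path, newline="", encoding="utf-8").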

Then two functions perform the necessary calculations:

Embeddings are generated for the texts in both languages.
As the embeddings are vectors, we can compute the cosine similarity between them to measure how close their meanings are.
We keep the most similar titles.
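The matching logic described in these steps can be sketched in plain Python. The toy two-dimensional vectors below stand in for real embeddings, and the function names cosine_similarity and top_matches are hypothetical; the actual notebook may rely on library helpers instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_matches(source_vecs, target_vecs, k=3):
    """For each source vector, return the k best (target_index, score) pairs."""
    results = []
    for src in source_vecs:
        scored = [
            (idx, cosine_similarity(src, tgt))
            for idx, tgt in enumerate(target_vecs)
        ]
        # Highest similarity first, then keep only the k best.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        results.append(scored[:k])
    return results


# Toy embeddings: two source texts, three candidate texts.
source_vecs = [[1.0, 0.0], [0.0, 1.0]]
target_vecs = [[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]]
matches = top_matches(source_vecs, target_vecs, k=3)
```

Here the first source vector is matched best with the second target and the second source vector with the first target. Cosine similarity can in principle be negative, although for semantically related texts the scores usually land between 0 and 1.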

At the end, we get a spreadsheet with the results, each marked with a match score between 0 and 1.

We recommend manually reviewing any result with a score below 0.9.
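One simple way to apply this recommendation is to split the results into two lists before exporting them. The row structure (dictionaries with source, match and score keys), the sample values and the function name split_for_review are assumptions made for this sketch; only the 0.9 threshold comes from the text.

```python
REVIEW_THRESHOLD = 0.9


def split_for_review(rows, threshold=REVIEW_THRESHOLD):
    """Separate confident matches from those that need a manual check."""
    accepted = [row for row in rows if row["score"] >= threshold]
    needs_review = [row for row in rows if row["score"] < threshold]
    return accepted, needs_review


# Hypothetical results, as they might appear in the output spreadsheet.
rows = [
    {"source": "Zapatillas de correr", "match": "Running shoes", "score": 0.96},
    {"source": "Camiseta de algodón", "match": "Cotton T-shirt", "score": 0.84},
]
accepted, needs_review = split_for_review(rows)
```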

Running the script in Google Colab

To simplify the classification process, Luis Fernández from iSocialWeb has prepared the following video, where he explains step by step how to run the Google Colab notebook and get the most out of it.

You can watch it right here below:

Video transcript

This content is generated from the audio voiceover so it may contain errors.

(00:01) Hello and welcome to a new iSocialWeb video. We continue with artificial intelligence and automation, and in this case we are going to explain step by step how to transcribe YouTube videos for free and with very high quality. With this, from any YouTube video, whether or not it has subtitles, we will generate a text with everything that is said in the video, with correct punctuation and very high transcription quality, and all of this completely free.
(00:30) This is a big advantage over earlier methods such as downloading a video's subtitles. We can use it to generate subtitles for a video manually, or even translate it and publish our video in other languages. For this we are going to use Google Colab, as we have in previous videos. In this video we will go a bit faster; if you want a more detailed step-by-step explanation of how Python works or how to use Colab, you can take a look at the previous videos on the channel. In those videos of
(00:59) the series, for example how to classify the search intent of your keywords or how to use artificial intelligence to write articles, the content is similar and they also use Colab and Python, but they go into a bit more detail on the execution of each line. The only thing you need to know is that in Colab things are run in blocks: the code executes block by block. This is one block, and the next block is separate here, and to execute it you simply click in the
(01:27) upper left corner of each block, on the Play button. To begin installing things, we click here. As you can see, everything is heavily commented: the text in green preceded by a hash sign is a comment that explains things to you. Here the installation begins. You can see a small green arrow on the left that indicates which line is executing, and once it finishes, which should be any moment now, it shows the button again and marks
(01:58) a green check here to show that it has executed, next to the execution time. Here you have all the execution logs; you can close them to leave everything a bit cleaner and continue executing the code cell by cell. So here we install all the necessary dependencies, next we import all the libraries we have installed plus some extras we will need, and finally we enter the YouTube videos that we want to transcribe. As an example we will take one of the videos that
(02:30) we have published. We open it, copy the URL, and here it prompts us: enter a list of YouTube video URLs. In practice I am going to use only one; if I wanted to add more, I would put a comma and another video. That is not necessary, so I simply run the cell and press Enter, and you can see the green check here: it has executed correctly. Now we have a list of videos with all those we entered, separated by commas. This is a manual input that we configured this way for convenience; if you know a bit of
(03:02) programming, you can always modify it to upload a list, a CSV with a list of URLs for example, or any method of loading URLs you prefer. You simply need a final list. Here you can see that in this case it is a list called videolist and that it has a single item, the YouTube video we selected. In the following cell we run the whole process of downloading the video and transcribing it. For this we create two lists, one in which we will save the texts and one in which
(03:31) we will save the titles, and next we loop over all the videos. We use the pytube library to download the video, and we download the Whisper model, which is the OpenAI model that we are going to use to transcribe it. At the top I have left you a link to the blog post in which they announce it, in case you want to go into more detail, but you simply need to know that it is a model trained to transcribe video and audio to text. It is trained on many languages and works very, very
(04:03) well; it even allows translations. In this case we are not going to translate; we will simply work with a transcription, with text in Spanish and audio in Spanish. We continue with the code, and here you can see that we download the model. There is an important point here that you can modify, which is the size of the model. There are several types of models, and they usually come in different sizes to make them easier to use. In this case we are going to use small. You can use small, medium or
(04:32) large. What each one does is this: smaller models give you faster download and processing but lower quality, while bigger models give higher quality. You can modify it here: you would simply put large here and it would download the large model instead of the small one. I am going to undo this and leave small. Then we download the model, download the video, and next we extract the audio from the
(05:05) video, since we do not need the video itself and it would be extra processing that takes up space. We download the audio, and here we go to the model and simply tell it to transcribe the audio we downloaded. It saves the result in the list of transcribed texts, and we delete the downloaded audio to free up space and continue iterating over more videos, if there are any. I am not going to execute this cell, since I executed it previously and it takes a while to process all the audio,
(05:36) but you can see the result here, and that it executes correctly. Once the previous cell has executed, we have the transcribed video in a list and can publish it automatically in WordPress. We have already seen this in another video, this one here, on how to automate content creation in WordPress, so I will go over it briefly. You simply need a user, which in this case we have created for articles, a password and a URL. This is the URL with which we are going to
(06:06) publish through the WordPress publishing endpoint. We are going to use the agency's website, and for the user and password you need a user with administrator rights. You have to go to the user edit screen and create an application password: here you give the application password a name, for example the same name as the user so as not to complicate things, and click Add New Password. This generates the password, which I will later
(06:35) revoke so that you cannot do anything with it. You copy everything here: the URL here, your credentials here, and simply execute. The only interesting point here that you can modify, apart from the publication time, is that we set the status to draft. We do not want it to publish directly, but you can set it to publish if you wish. We execute it once again, it starts to run and shows the green check: 2 seconds, created correctly. If we go to the iSocialWeb posts and refresh,
(07:14) we wait a second for it to load and you will be able to see it here. You can see that we have a draft titled How to classify the search intent of your keywords in a minute, which is the video I mentioned. If we open it, you will see that we have all the content, perfectly transcribed. You can see it is a big block of text: it is quite a long video, 8 minutes, and you can see that it is perfectly punctuated and perfectly formatted. You do have to give it a
(07:56) little work to make it look nicer; do not publish it like this directly. But it works very, very well, even with keywords and with words in English and in Spanish. It handles the mix perfectly, recognizes brand names and renders them perfectly. It is a much more advanced type of transcription than those used previously. So here you have all the information. I will leave you the link to this Colab in the video, and for any questions or anything you need, leave them
(08:28) in the comments. You can also find us on social networks. Goodbye!

Our script presents a free and useful alternative for the task of multilingual migration.

It uses a Transformer model from the Sentence Transformers library.

In the video, the model chosen for the demo is paraphrase-multilingual-MiniLM-L12-v2.

With this model, a semantic search can be run to compare texts in different languages and map them to each other according to their semantic similarity.

This is especially useful for multilingual migrations involving large volumes of content and for automating the assignment of products or hreflang tags.

In addition, the model used can work with more than 50 languages, including some of the most widely spoken languages in the world, such as Spanish, Italian, German, and Chinese.
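To make the hreflang use case more concrete, here is a minimal sketch of how matched pages could be turned into hreflang link tags. It assumes each matched text has already been associated with a language code and an absolute URL, which goes beyond the Title column described earlier; the function name hreflang_links and the example.com URLs are hypothetical.

```python
from html import escape


def hreflang_links(alternates):
    """Build one <link rel="alternate"> tag per language version.

    alternates maps a language code (for example "es") to the absolute
    URL of the equivalent page in that language.
    """
    return [
        '<link rel="alternate" hreflang="{}" href="{}" />'.format(
            escape(lang, quote=True), escape(url, quote=True)
        )
        for lang, url in sorted(alternates.items())
    ]


# Hypothetical pair of pages matched by semantic similarity.
tags = hreflang_links(
    {
        "es": "https://example.com/es/zapatillas-de-correr",
        "en": "https://example.com/en/running-shoes",
    }
)
```

Each page in the group would then include the full set of tags, so that every language version points to all of its equivalents.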

Features of the multilingual text classifier with AI:

Multilingual support: The AI-powered multilingual text classifier can handle text data in multiple languages.
High accuracy: The tool uses machine learning algorithms that learn from a given set of text to identify patterns and apply them to new, unseen data, resulting in high accuracy.
Speed: The AI-powered multilingual text classifier can analyze large amounts of text data in a matter of seconds, making it an effective solution for companies that generate a large amount of textual data.
Ease of use: The tool has a user-friendly interface that makes it easy to use even for non-technical users.
Customizable: Multilingual Text Classifier with AI allows users to create custom categories based on their specific needs.

Advantages of the Multilingual Text Classifier with AI:

Automation: Multilingual Text Classifier with AI automates the text classification process, reducing the time and effort required for manual classification of multilingual text data.
Accuracy: The tool’s high accuracy ensures that data is classified correctly, leading to better decision-making and analysis.
Improved productivity: The AI-enabled multilingual text classifier allows users to handle large amounts of text data quickly, leading to increased productivity.
Reduced costs: By automating the text classification process, companies can reduce the costs associated with manual work.

How does the multilingual text classifier with AI work?

The AI Multilingual Text Classifier uses natural language processing (NLP) techniques and machine learning algorithms to analyze and categorize text data. The tool starts by analyzing a given set of texts to identify patterns and associations between words, phrases, and sentences. The algorithm then applies these patterns to new, unseen data to classify it into predefined categories.

Questions about the multilingual text classifier:

Can the multilingual text classifier with AI handle text data in multiple languages?

Yes, the tool can handle text data in multiple languages, such as English, Spanish, German, French and many more.

Is the AI Multilingual Text Classifier accurate?

Yes, the high accuracy of the tool ensures that data is classified correctly, enabling better decisions and analysis.

Do you need to migrate your website to several languages?

With our tools and experience in migration management, you can have your multilingual website ready to rank in less time and your content ready to achieve amazing results.

Alvaro Peña de Luna

Head of SEO and co-CEO at iSocialWeb

Co-CEO and Head of SEO at iSocialWeb, an agency specializing in SEO, SEM and CRO that manages more than 350M organic visits per year with a 100% decentralized infrastructure.

In addition to the company Virality Media, a company with its own projects with more than 150 million active monthly visits spread across different sectors and industries.

Systems Engineer by training and SEO by vocation. Tireless learner, fan of AI and dreamer of prompts.
