Semantic Keyword Clustering For 10,000+ Keywords [With Script]
![Semantic Keyword Clustering For 10,000+ Keywords [With Script]](https://altwhed.com/wp-content/uploads/2023/01/1673418946_Semantic-Keyword-Clustering-For-10000-Keywords-With-Script-780x470.png)
Semantic keyword groups can help take your keyword research to the next level.
In this article, you will learn how to use a Google Colab sheet shared exclusively with Search Engine Journal readers.
This article will walk you through using the Google Colab sheet, give a high-level view of how it works under the hood, and show how to make adjustments to suit your needs.
But first, why cluster keywords at all?
Common use cases for keyword grouping
Here are some use cases for keyword grouping.
Faster keyword research:
- Filter out branded keywords or keywords that have no commercial value.
- Group related keywords together to create more in-depth articles.
- Group related questions and answers together to create an FAQ.
Paid search campaigns:
- Build negative keyword lists for ads that use large datasets faster – stop wasting money on spam keywords!
- Group similar keywords into ad campaign ideas.
Here is an example of the script in action, grouping similar questions together – perfect for an in-depth article!
Problems with earlier versions of this tool
If you’ve been following my work on Twitter, you’ll know that I’ve been experimenting with keyword clustering for a while now.
Previous versions of this script were built using the PolyFuzz library with TF-IDF matching.
While it got the job done, there were always some head-scratching matches that I felt the original score could be improved upon.
Words that share a similar pattern of letters would be grouped together, even if they were not semantically related.
For example, it could not group words such as “bike” with “bicycle.”
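To see why purely lexical matching falls short, here is a quick standard-library sketch using Python’s `difflib` as a stand-in for string-based similarity (a simplified illustration, not the old script’s actual TF-IDF code):

```python
from difflib import SequenceMatcher

def lexical_similarity(a: str, b: str) -> float:
    # Character-level similarity ratio, completely blind to meaning.
    return SequenceMatcher(None, a, b).ratio()

# Semantically related, but lexically distant:
print(round(lexical_similarity("warm", "heating"), 2))      # 0.18
# Lexically close, but semantically unrelated:
print(round(lexical_similarity("heating", "cheating"), 2))  # 0.93
```

String matching scores “heating” and “cheating” as near-identical while missing the connection between “warm” and “heating” – exactly the failure mode semantic embeddings fix.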
Previous versions of the script had other issues as well:
- It did not do well in languages other than English.
- It created a large number of keywords that could not be clustered.
- There wasn’t a lot of control over how groups were created.
- The script was limited to approximately 10,000 rows before it timed out due to a lack of resources.
Semantic keyword clustering using deep learning natural language processing (NLP)
Fast-forward four months to the latest release, which has been completely rewritten to take advantage of the latest state-of-the-art sentence embeddings from deep learning.
Check out some of these great semantic combos!
Notice that heating, thermal, and warm are included in the same keyword group?

Or how about wholesale and bulk?

Dog, dachshund, and Christmas?

It can even group keywords in over a hundred different languages!

New features versus previous iterations
In addition to semantic keyword grouping, the following improvements have been added to the latest version of this script.
- Support for clustering more than 10,000 keywords at a time.
- Fewer unclustered results.
- Ability to choose different pre-trained models (although the default model works just fine!).
- The ability to choose how closely related groups should be.
- Choose the minimum number of keywords to use in each group.
- Auto detect character encoding and CSV delimiters.
- Multilingual clustering.
- Works with many popular keyword exports out of the box. (Search Console, AdWords data, or third-party keyword tools like Ahrefs and Semrush).
- Works with any CSV file with a column called “Keyword”.
- Easy to use (the script works by inserting a new column called Cluster Name into any loaded keyword list).
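For illustration, the encoding and delimiter auto-detection can be sketched with the standard library alone (a simplified stand-in – the script itself may use a dedicated detection library):

```python
import csv

def sniff_csv(raw: bytes) -> tuple[str, str]:
    # Try a few common export encodings; latin-1 always succeeds as a fallback.
    for encoding in ("utf-8-sig", "utf-16", "latin-1"):
        try:
            text = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    # Let the stdlib sniff the delimiter from the first chunk of the file.
    delimiter = csv.Sniffer().sniff(text[:1024].strip(), delimiters=",;\t").delimiter
    return encoding, delimiter

raw = "Keyword;Clicks\nalpaca socks;120\n".encode("utf-8")
print(sniff_csv(raw))  # ('utf-8-sig', ';')
```

This is why Search Console, Ahrefs, and Semrush exports – which vary in encoding and delimiter – can all be dropped in without manual cleanup.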
How to use the script in five steps (Quick Start)
To get started, click on this link, then choose the Open in Colab option as shown below.

Change the runtime type to GPU by selecting Runtime > Change runtime type.

Choose Runtime > Run all from the top navigation bar within Google Colaboratory (or press Ctrl+F9).

Upload a .csv file containing a column called “Keyword” when prompted.

Clustering should be reasonably quick, but it ultimately depends on the number of keywords and the model used.
Generally speaking, you should be good for up to 50,000 keywords.
If you see a CUDA out-of-memory error, you are trying to cluster too many keywords at once!
(It is worth noting that this script can easily be adapted to run on a local machine without the Google Colaboratory restrictions.)
Script output
The script will run and append clusters to your original file with a new column called Cluster Name.
Group names are assigned using the shortest keyword in the group.
For example, the group name for the following keyword group is set as “alpaca socks” because that is the shortest keyword in the group.
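The naming rule is simple to reproduce – a minimal sketch (a hypothetical helper, not the script’s exact code):

```python
def name_clusters(clusters: dict[int, list[str]]) -> dict[int, str]:
    # Label each cluster with its shortest member keyword.
    return {cid: min(kws, key=len) for cid, kws in clusters.items()}

clusters = {0: ["alpaca socks for men", "warm alpaca socks", "alpaca socks"]}
print(name_clusters(clusters))  # {0: 'alpaca socks'}
```

The shortest keyword usually reads like the head term of the group, which makes it a reasonable default label.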

Once clustering is complete, a new file is automatically saved, with the cluster names in a new column appended to the original file.
How does the clustering work?
This script relies on a fast clustering algorithm and uses models that are pre-trained at scale on large amounts of data.
This makes it easy to calculate the semantic relationships between keywords using ready-made models.
(You don’t have to be a data scientist to use it!)
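The core idea can be sketched as: embed every keyword as a vector, then group keywords whose cosine similarity clears a threshold. Below is a simplified, self-contained stand-in – hand-made 2-D vectors take the place of real sentence embeddings, and a greedy loop takes the place of the faster community-detection routine the real script uses:

```python
import numpy as np

def cluster_by_similarity(labels, embeddings, threshold=0.85, min_size=2):
    # Normalise so a dot product equals cosine similarity.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = emb @ emb.T
    unassigned = set(range(len(labels)))
    clusters = []
    while unassigned:
        seed = min(unassigned)
        # Everything still unassigned that is similar enough to the seed.
        members = [i for i in sorted(unassigned) if sim[seed, i] >= threshold]
        if len(members) >= min_size:
            clusters.append([labels[i] for i in members])
        unassigned -= set(members)  # too-small groups stay unclustered
    return clusters

# Hypothetical 2-D "embeddings" – the real script gets high-dimensional
# vectors from a pre-trained sentence transformer instead.
labels = ["bike", "bicycle", "alpaca socks"]
vecs = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]])
print(cluster_by_similarity(labels, vecs))  # [['bike', 'bicycle']]
```

Because similarity is computed between embeddings rather than letters, “bike” and “bicycle” land in the same group while “alpaca socks” is left out.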
In fact, while I made it customizable for those who like to tinker and experiment, I chose some well-balanced defaults that should be reasonable for most people’s use cases.
Different models can be swapped in and out of the script as required (faster clustering, better multilingual support, better semantic performance, and so on).
After a lot of testing, I found the sweet spot with the all-MiniLM-L6-v2 transformer, which provides a nice balance between speed and accuracy.
If you prefer to experiment with your own models, you can swap the existing pre-trained model with any of the models listed here or on the Hugging Face Model Hub.
Swap in pre-trained models
Swapping in models is as easy as replacing the variable with the name of your preferred model.
For example, you can change the default model all-miniLM-L6-v2 to all-mpnet-base-v2 by editing:
transformer = ‘all-miniLM-L6-v2’
to
transformer = ‘all-mpnet-base-v2’
Here is where you can edit it in the Google Colab sheet.

Trade-off between cluster accuracy and unclustered keywords
A common complaint with previous iterations of this script was that they generated a large number of unclustered results.
Unfortunately, it will always be a balancing act between cluster accuracy and the number of clusters.
A higher cluster accuracy setting will result in a higher number of unclustered results.
There are two variables that can directly affect the size and accuracy of all clusters:
min_cluster_size
and
cluster_accuracy
I set a default value of 85 (out of 100) for cluster accuracy and a minimum cluster size of 2.
During testing, I found this to be the sweet spot, but feel free to experiment!
Here’s where to set those variables in the script.
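To make the trade-off concrete, here is a toy sketch with hypothetical similarity scores (not the script’s actual internals) that counts how many keywords end up unclustered as the accuracy threshold rises:

```python
def unclustered_count(sim_pairs, n, threshold):
    # sim_pairs: {(i, j): similarity} for keyword pairs i < j — a toy
    # stand-in for the cosine similarities computed from real embeddings.
    assigned = set()
    for (i, j), s in sim_pairs.items():
        if s >= threshold:
            assigned |= {i, j}
    return n - len(assigned)  # keywords left without any cluster

sims = {(0, 1): 0.90, (0, 2): 0.30, (1, 2): 0.25}
print(unclustered_count(sims, 3, threshold=0.85))  # 1
print(unclustered_count(sims, 3, threshold=0.95))  # 3
```

Raising the threshold from 0.85 to 0.95 leaves every keyword unclustered in this example; min_cluster_size prunes groups in the same way by discarding any cluster smaller than the minimum.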

That’s it! I hope this keyword clustering script is useful for your work.
More resources:
- Introduction to Python and machine learning for technical SEO
- 6 SEO Tasks to Automate Using Python
- Advanced Technical SEO: A Complete Guide
Featured image: Graphic Grid / Shutterstock