Unless you've been living on a remote planet without internet access, you probably believe that hashtags are important. If you are looking for a way to extract those hashtags, along with mentions, you'll be happy to read this tutorial. It shows ways of finding hashtags and mentions using Python.
Hashtags are a way of marking certain terms in a social media post as important. Twitter is generally credited with popularizing them, and most social media platforms now support them.
It's easy for people to mark a term as a hashtag by placing the '#' character before the term. For instance, if you are posting about an investing concept, you could identify this as #investing.
When you do this, the platform (like Twitter) will group your post with other posts that use the same hashtag. That means when people search on the hashtag, your post has a chance of being included in the search results.
If people click on your post and you have included a link to your website, you could receive more traffic by using hashtags.
Mentions are another useful feature of many social media platforms. They let a poster flag that a person or company is referred to in a post, and the platform then alerts that person or company. Mentions are designated with an '@' symbol before the account name.
For instance, if a company XYZ is mentioned in a post (and they have the account named XYZ on the platform), a poster can use @XYZ to alert someone from the entity that it is the topic of conversation. This feature is now supported on most of the major social media platforms.
How to Find the Mentions and the Hashtags
Suppose you had a way to extract a series of social media posts as a text file. Knowing which hashtags and mentions appear in it can be useful. For instance, knowing which hashtags appear most often may help you decide what to write about in your own posts.
Mentions can also help you discover which people and companies are influential. Well-known people and companies are obvious, but people whose influence is still growing are not. Counting who gets mentioned can help you spot them.
There are several ways to extract the hashtags or mentions. For this article, we'll concentrate on an easy way using regular expressions and the regexp_tokenize function in nltk.tokenize.
Disclaimer: this article assumes a basic knowledge of Python. At the end of the article, I will include resources if you need a refresher (or want to start learning).
We'll use one package for this exercise: nltk
You are going to be amazed at just how easy it is to accomplish this. First, though, we'll need to load up the proper packages:
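The import would look something like this, assuming the regexp_tokenize function from the nltk.tokenize module is the only piece we need:

```python
# Import the regular-expression tokenizer function from NLTK
from nltk.tokenize import regexp_tokenize
```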
Package not installed?
If you experience an error when running this line, you probably need to install the package. To do this:
pip install nltk
Check the help guide for the Python installation you are running on how to execute the pip command.
Now let's load up a string variable for the text we want to process:
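As a sketch, here is a made-up sample. The wording and the @JaneDoe account are invented for illustration, while #investing and @XYZ echo the examples from earlier in the article:

```python
# Hypothetical sample text standing in for real social media posts
tweets = (
    "Reading up on #investing and #stocks today, thanks to @XYZ. "
    "Great tips from @JaneDoe on #investing!"
)
```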
You can fill the string with any text you want. Just make sure to include a few hashtags and mentions (# and @).
In practice, you would use text that comes from a downloaded CSV file or from a platform API such as Twitter's or Facebook's. But for this example, we'll use a hardcoded string for simplicity.
Define the pattern to search for the hashtags. Here is where simplicity shines: a single short regular expression is all we need for this to work.
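A minimal pattern could look like the one below. It assumes a hashtag is a '#' followed by one or more letters, digits, or underscores; real platforms may apply stricter rules, and the variable name is only a suggestion:

```python
# '#' followed by one or more word characters (letters, digits, underscore)
pattern_hashtag = r"#\w+"
```

The `r` prefix makes this a raw string, so the backslash in `\w` reaches the regex engine unchanged.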
Then feed this pattern and the tweets string into regexp_tokenize(), which returns the matching hashtags as a list.
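Here is a sketch of that call, with the import, a made-up sample string, and an assumed pattern included so that it runs on its own:

```python
from nltk.tokenize import regexp_tokenize

# Hypothetical sample text standing in for real social media posts
tweets = (
    "Reading up on #investing and #stocks today, thanks to @XYZ. "
    "Great tips from @JaneDoe on #investing!"
)

# Assumed pattern: '#' followed by one or more word characters
pattern_hashtag = r"#\w+"

# regexp_tokenize takes the text first, then the pattern
hashtags = regexp_tokenize(tweets, pattern_hashtag)
print(hashtags)  # ['#investing', '#stocks', '#investing']
```

Note that repeated hashtags are kept, which is what you want if you plan to count them later.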
You can probably guess that finding the mentions (@) is just as easy, and you'd be correct.
Create a new pattern and feed it with the tweets string into regexp_tokenize again:
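The mention version might look like this. As before, the sample text, the pattern, and the variable names are assumptions made for illustration:

```python
from nltk.tokenize import regexp_tokenize

# Hypothetical sample text standing in for real social media posts
tweets = (
    "Reading up on #investing and #stocks today, thanks to @XYZ. "
    "Great tips from @JaneDoe on #investing!"
)

# Assumed pattern: '@' followed by one or more word characters
pattern_mention = r"@\w+"

mentions = regexp_tokenize(tweets, pattern_mention)
print(mentions)  # ['@XYZ', '@JaneDoe']
```

Be aware that a pattern this simple would also pick up the tail end of an email address, so text containing emails may need a stricter pattern.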
Here is the complete listing of the code:
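Pulling the steps together, a full listing could look like the sketch below. The sample text, the patterns, and the variable names are assumptions rather than the only way to write it:

```python
from nltk.tokenize import regexp_tokenize

# Hypothetical sample text standing in for real social media posts
tweets = (
    "Reading up on #investing and #stocks today, thanks to @XYZ. "
    "Great tips from @JaneDoe on #investing!"
)

# Find the hashtags: '#' followed by one or more word characters
pattern_hashtag = r"#\w+"
hashtags = regexp_tokenize(tweets, pattern_hashtag)

# Find the mentions: '@' followed by one or more word characters
pattern_mention = r"@\w+"
mentions = regexp_tokenize(tweets, pattern_mention)
```

After this runs, `hashtags` and `mentions` are plain Python lists that you can print, count, or save.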
We performed some rather powerful tasks with just six lines of code. This may have set the wheels in your brain spinning with possibilities.
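One such possibility is the hashtag-frequency idea raised earlier. The sketch below assumes you already have a list of extracted hashtags (hardcoded here as a made-up example) and uses Counter from Python's standard library to tally them:

```python
from collections import Counter

# Hypothetical list, as it might come back from regexp_tokenize
hashtags = ["#investing", "#stocks", "#investing"]

# Lowercase each tag so '#Investing' and '#investing' count as one
counts = Counter(tag.lower() for tag in hashtags)

# The two most frequent hashtags with their counts
top_hashtags = counts.most_common(2)
print(top_hashtags)  # [('#investing', 2), ('#stocks', 1)]
```

The same approach works for a list of mentions if you want a rough picture of who is being talked about most.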
A Few Caveats
- As mentioned, hardcoding strings is not the best way to use this. You can either use a tool that captures the text into a CSV file or use an API. Both of these are beyond the scope of this article.
- If you are unfamiliar with the patterns in this example, they are part of a programming concept called Regular Expressions (regex). Regex is a bit tricky to learn, as it is often cryptic. Sometimes the best way to handle a concept like this is to accept that it works and then learn the details later as you get more familiar with it.
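Although a full CSV workflow is out of scope, a rough sketch may help show where the text would come from. It assumes the posts sit in a column named "text" (a hypothetical name) and uses an in-memory string in place of a real file so the example runs as is:

```python
import csv
import io

# Stand-in for a real file; with an actual file you would use open(...) instead
sample_csv = io.StringIO(
    "user,text\n"
    "alice,Reading about #investing with @XYZ\n"
    "bob,More #stocks news from @JaneDoe\n"
)

# Read each row as a dict keyed by the header names
reader = csv.DictReader(sample_csv)

# Join the 'text' column (hypothetical name) into one string
tweets = " ".join(row["text"] for row in reader)
print(tweets)
# Reading about #investing with @XYZ More #stocks news from @JaneDoe
```

The resulting string can then be passed to regexp_tokenize in place of the hardcoded one.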