Text Analysis with Machine Learning: Social Media Data Mining

Social networks have dramatically changed the world we live in. Many people cannot imagine a day without visiting Facebook, LinkedIn, WeChat, Weibo, VKontakte, or another social network.

This has changed our outlook. People entrust their data to social platforms and sometimes unintentionally reveal personal information to strangers. Almost everyone has thought about showing off a brand-new iPhone to friends by posting a photo on Facebook.

Soon after, the person who posted the photo starts receiving offers for iPhone accessories.

So is it harmless or not? That depends on the purpose the data is used for. The curious question is: “How did they know the right person to reach out to?”

Today we will talk about social media scraping, text extraction, data analysis and the benefits of social media data mining for your business.

Interested? Keep on reading!


What social media scraping is

Web scraping (also known as web data extraction) is a technique for extracting data from websites. A person can do it manually, but that is slow and inefficient, so there are many software tools that automate data mining and extraction.

The process is simple: a bot (called a web spider or crawler) collects the HTML code from the website. After harvesting the pages, it parses the HTML and extracts the data that is needed. This is how it works for regular sites.
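
The harvest-then-extract flow for a regular site can be sketched with Python's standard library. This is only a minimal illustration, assuming the page has already been downloaded: the sample HTML, the "post" class name, and the PostExtractor and extract_posts names are all hypothetical and do not describe any real site's markup.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real crawler would download this over HTTP.
SAMPLE_HTML = """
<html><body>
  <div class="post">Just got my new phone!</div>
  <div class="ad">Buy accessories now</div>
  <div class="post">Flying to Sydney tomorrow.</div>
</body></html>
"""


class PostExtractor(HTMLParser):
    """Collects the text of every <div class="post"> element.

    Assumption: posts are flat divs without nested divs inside them.
    """

    def __init__(self):
        super().__init__()
        self.posts = []
        self._in_post = False

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "div" and dict(attrs).get("class") == "post":
            self._in_post = True
            self.posts.append("")

    def handle_endtag(self, tag):
        if tag == "div":
            self._in_post = False

    def handle_data(self, data):
        # Keep text only while we are inside a post element.
        if self._in_post:
            self.posts[-1] += data


def extract_posts(html):
    """Parse the HTML and return the cleaned text of each post."""
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return [post.strip() for post in parser.posts]


posts = extract_posts(SAMPLE_HTML)
```

Here the advertisement block is skipped and only the two posts are kept. A page rendered on the client side by JavaScript would not expose its content to this kind of parser, because the downloaded HTML would not yet contain the data.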

Social networks are different from regular websites. The core difference is that social networks understand the value of the data their users entrust to them.

As mentioned above, users sometimes do not understand the value of the personal data they make public. The simplest example is boarding pass theft.

In short, it is possible to reconstruct a boarding pass from its photo. Posting a picture of your boarding pass on Facebook or Instagram may therefore be dangerous: you could end up missing your flight.

With a deep understanding of the real value of data comes great responsibility for storing it. Social platforms have dedicated departments responsible for data storage and processing. The data they guard makes social networks the most coveted target for researchers and data miners.

The most notable example of data protection policies is LinkedIn. Let’s have a closer look: you can access almost no data without authentication. LinkedIn also uses a JavaScript framework as its front-end core, so the HTML is rendered on the client side, in the browser. These are not the only methods LinkedIn uses to prevent data scraping.

Knowing these facts, any specialist competent in web scraping would say: “The majority of data mining apps (and even enterprise software) won’t be able to handle LinkedIn scraping!”

This is why social networks are different: they know the actual value of the data and store it securely. Extracting data from social platforms is complicated and quite expensive, even today, after the software revolution.


Is there anything problematic with analyzing data gained from social networks?

Data analysis is another problem developers face after scraping social networks. What do people use social networks for? Mostly for communication, conversation, and data exchange.

Let’s have a closer look at the top social networks for a clearer understanding.

Top Social Networks (August 2018)


The most commonly used social networks today are Facebook, YouTube, Instagram, Qzone (China only), Weibo, and Twitter. As we can see, these social networks store content of different types.

We can categorize the content data into several types:

– Text Data (human speech mostly)

– Audio/Video

– Images

Every data category needs its own approach, and this is where machine learning is the best choice. Machine learning can handle this kind of data processing well.


Text data processing is not as complicated as it is considered to be, but it is still not a straightforward process. There are many libraries, frameworks, and even platforms that help a computer understand human speech. But social networks are different. Again.

English is different in every country. An excellent example is Australia. Even for a native English speaker, it may be quite challenging to understand the way Aussies (people from Australia) talk.

For example, the sentence “Hey, ya. Wanna pop around for a cuppa?” in Australian English means “Hello! Want to come over for a cup of tea?”

And it is not only about slang. The same language differs all over the globe, and many nations use the same words differently.

Can machine learning handle different language versions? Yes, machine learning can understand slang and different language versions, but unfortunately not with ease. It relies on three natural language processing (NLP) techniques: tokenization, stemming, and lemmatization. Even when you know how to do it, it takes a lot of resources and time to teach a neural network how to extract the words from complicated sentences.
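
To make the three techniques more concrete, here is a toy sketch in plain Python. It is an illustration under heavy assumptions, not how a production NLP library works: the SLANG and LEMMAS tables are tiny hypothetical lookups, and the stemmer only strips a few common suffixes. Real systems use large vocabularies and trained models.

```python
import re

# Hypothetical slang lookup; a real system would learn this from data.
SLANG = {"wanna": "want to", "cuppa": "cup of tea"}

# Toy lemma dictionary for irregular forms; real lemmatizers use full vocabularies.
LEMMAS = {"went": "go", "better": "good", "mice": "mouse"}

# Suffixes the crude stemmer knows about, checked in this order.
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Tokenization: split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def normalize_slang(tokens):
    """Replace known slang tokens with their standard wording."""
    normalized = []
    for token in tokens:
        normalized.extend(SLANG.get(token, token).split())
    return normalized


def stem(token):
    """Stemming: strip one common suffix (crude, may produce non-words)."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def lemmatize(token):
    """Lemmatization: map a word to its dictionary form if we know it."""
    return LEMMAS.get(token, token)


tokens = tokenize("Hey, ya. Wanna pop around for a cuppa?")
normalized = normalize_slang(tokens)
```

After these calls, tokens holds the eight lowercase words of the sentence, and normalized replaces "wanna" and "cuppa" with their standard equivalents. Stemming cuts words mechanically, for example stem("posting") gives "post", while lemmatization looks words up, for example lemmatize("went") gives "go".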


Another thing you need to think about is image processing.

To gain valuable business insights, your social media scraper should process images accurately. Many different techniques are used for image processing; let’s look at one that is simple to understand but no less effective: convolution.

During convolution, we split the large image into a number of small overlapping tiles and process each tile with the same small neural network. This approach helps us find objects more accurately, but it takes time to train the neural network to understand the image.
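
The sliding-tile idea can be sketched in a few lines of plain Python. This is a minimal sketch, not a neural network: the 4x4 image and the 2x2 kernel are hypothetical, hand-picked values, whereas in a real convolutional network the kernel weights are learned during training. The function name convolve2d is also just an illustrative choice.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the
    weighted sum at each position.

    As in most deep learning libraries, the kernel is not flipped, so
    strictly speaking this is cross-correlation.
    """
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            # Weighted sum over the tile whose top-left corner is (i, j).
            for di in range(kernel_h):
                for dj in range(kernel_w):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 4x4 grayscale image: dark left half, bright right half.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]

# Hand-picked 2x2 kernel that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, kernel)
```

The resulting feature_map is a 3x3 grid whose middle column holds 18 and whose other entries are 0: the kernel fires only on the tiles that contain the edge between the dark and bright halves. Every tile is scored with the same weights, which is why one small kernel can scan an image of any size.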

How convolution works


In the image shown above, a complex neural network may find several objects:

– The ground is covered in grass and concrete

– There is a child

– The child is sitting on a bouncy horse

– The bouncy horse is on top of the grass

– There is a fence

– There is a baby carriage

In fact, for every object, we need to develop a separate machine learning model. As you can imagine, social media scraping yields millions of photos with billions of different objects, which makes data analysis quite complex and laborious.


Audio and video processing is similar to text and image processing.

We convert audio to plain text with speech recognition and then translate the plain text into a machine-readable format. Speech recognition is another level of abstraction on top of NLP. There are open source libraries to deal with it.

Video processing is a different task from image processing but uses the same methodologies. The most common way to process video with machine learning is to split the video into frames and process each frame separately. After the processing is complete, the output data from the different frames is merged.
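
The split, process, and merge steps can be outlined as follows. This is a sketch under stated assumptions: the video is taken to be already split into frames, each frame is a plain dictionary, and detect_objects is a hypothetical stand-in for a real image model. Counting in how many frames each label appears is just one possible way to merge the results.

```python
from collections import Counter


def detect_objects(frame):
    """Stand-in for a real image model: here a frame is just a dict that
    already carries the labels a detector would return."""
    return frame["labels"]


def process_video(frames):
    """Process each frame separately, then merge the per-frame results."""
    per_frame = [detect_objects(frame) for frame in frames]
    merged = Counter()
    for labels in per_frame:
        # Count each label at most once per frame.
        merged.update(set(labels))
    return merged


# Hypothetical video already split into three frames.
frames = [
    {"labels": ["child", "fence"]},
    {"labels": ["child", "fence", "baby carriage"]},
    {"labels": ["child"]},
]

summary = process_video(frames)
```

In this example the merged summary reports the child in three frames, the fence in two, and the baby carriage in one. Because frames are processed independently, the per-frame step can be run in parallel before merging.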


What benefits can social media web scraping bring to the business

Information about customers is vital today, even for a manufacturer. Data about our customers’ habits gives us a deep insight into how to develop our services and make our product better. Moreover, there are industries where information about the potential customer is not just critical: it is everything.

The most obvious example of such an industry is insurance. Let us have a closer look.

We live in a fascinating world, where almost everyone makes their life less private from year to year. When you apply for a job, your potential employer will do some social media research, or at the very least check your profiles on Facebook and LinkedIn. Insurance agents do the same today.

The majority of insurance companies are run for profit, but they are quite fair with their potential clients. For life insurance, the agent evaluates every client and looks for the different risks the client may face.

For example, the average whole life insurance policy for a 35-year-old non-smoking woman will cost about $731 per year if she wants coverage of about $1M. This means that the insurance company would pay a maximum of $1M if something serious happens. The insurance agent will most likely evaluate the risks, and social networks are a great help with that.

From social networks, the insurance agent may learn about your true habits: fast food, extreme sports, risky driving, travel locations, and so on. The data that the customer makes public on social networks helps both the insurance agent and the customer. The agent provides a high-quality insurance policy that covers all the customer’s needs and lowers the price for the customer by removing unnecessary options.

Frankly speaking, social network scraping and data mining become more affordable every year. Today, it makes sense both for huge enterprises and for small local businesses.

The best way to improve your business is to know who your target audience is. The deep knowledge gained from social network scraping gives you an understanding of how to improve your product to fit customer needs.


Interested in social media scraping? Want to build your private scraping engine? Contact us for a consultation: call +1 201 464 6906.