Customer Profile Scraping

 

At Azati Labs, our business analysts helped our partner build a progressive web scraping platform for a US-based real estate firm. The main idea of the solution was to generate a customer profile from information extracted from various websites.

Project idea:

 

As the customer is a real estate agency located in northern California, many legal limitations and restrictions apply to personal data aggregation, storage, and processing. Real estate is a highly competitive industry, so it is crucial to understand your clients, as a considerable number of them trust and follow recommendations and reviews from friends or family members.

The real estate firm wanted to learn more about a potential customer before signing a contract. Our partner had suitable experience for this project but no business analysts experienced in real estate, so our team decided to take part.

We were asked to build a system that looks for a specific person on different websites, scrapes information from them, and presents it in a single dashboard.

Main challenges:

 

If a project is easy to implement, it is not worth writing a case study about. Web scraping is a resource-intensive and time-consuming process with a considerable number of challenges. Let’s have a closer look at the main issues we faced.

Challenge #1:

The first and most complicated tech-related issue was the set of limitations and restrictions imposed by the websites we wanted to extract data from. For example, the customer wanted the solution to collect information from Yelp, TripAdvisor, Facebook, Airbnb, and other well-known resources with strict privacy policies.

It is a challenging process: if a website detects abnormal behavior or abusive activity, it automatically bans the user account and sometimes the IP address as well.

So, if you want to extract information from, for example, Facebook, you will need some additional resources: a set of accounts and a list of proxies (private proxies or VPN servers) that you can rotate to reduce the probability of being banned.

Fortunately, both proxies and accounts can be bought on the aftermarket, so this is not a real problem. Our team spent plenty of time learning how social networks and popular review websites track user behavior and detect bots and crawlers.

The information collected while solving this challenge helped us build robust and reliable algorithms that look like a regular user to any tracking system. Of course, the customer paid us extra for the research, but the algorithms we developed help the customer cut maintenance costs in the long run.

Challenge #2:

Another tech-related challenge was data extraction from the HTML. In recent years, front-end frameworks like React, Angular, Ember, and Vue have become incredibly popular and have largely replaced the older approach of jQuery with hand-written AJAX calls.

This means that traditional web scrapers are now less useful than they were five years ago, because some critical parts of a website are generated by JavaScript after the main content of the page has already loaded. So, one of the main requirements for a modern web scraper is JavaScript rendering. And that was an issue.

As JavaScript rendering is a comparatively resource-intensive task, it takes much more resources and time to process a single page. Multithreading and multiprocessing are a solution to this issue. The team decided to develop the core of the application in Go (Golang), a programming language known for its concurrency capabilities.

The existing open-source Go library WebLoop helped us build an efficient JS rendering engine, but it made the app resource-demanding. The solution could not be hosted on a cheap $5 VPS and required a small but dedicated server with several cores and threads available.
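To illustrate the concurrency idea, here is a minimal sketch of a worker pool in Go. It is not the project's actual code: `renderPage` is a hypothetical placeholder for the expensive JavaScript rendering step, and `renderAll` and the example URLs are names assumed only for this illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// renderPage is a hypothetical stand-in for the expensive
// JavaScript rendering step; here it only tags the URL.
func renderPage(url string) string {
	return "rendered:" + url
}

// renderAll processes urls with a fixed number of worker goroutines
// and returns the results in the same order as the input.
func renderAll(urls []string, workers int) []string {
	if workers < 1 {
		workers = 1
	}
	results := make([]string, len(urls))
	jobs := make(chan int)
	var wg sync.WaitGroup

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each worker writes to a distinct index, so no lock is needed.
			for i := range jobs {
				results[i] = renderPage(urls[i])
			}
		}()
	}

	for i := range urls {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	pages := renderAll([]string{"https://example.com/a", "https://example.com/b"}, 2)
	for _, p := range pages {
		fmt.Println(p)
	}
}
```

Capping the number of workers keeps memory and CPU usage predictable, which matters when every page needs a full rendering pass.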

Challenge #3:

The third challenge was not related to data extraction; it concerned the collected data. Our team quickly understood that data collection and aggregation do not provide the required insights for the customer by themselves.

The engine uses basic information about the customer for the initial research, but if the first and last names are common, it is hard to find the right John Smith among thousands of John Smiths from different countries and continents.

Thus, intelligent data matching was the main and most critical issue our team faced. As we were building a prototype, our team discussed the level of data matching complexity with the customer. It would take thousands of dollars to create truly intelligent algorithms, so the customer agreed to match records partly manually, relying on some algorithmic calculations.

We briefly trained the solution to find similarities in usernames, education information, email addresses, and natural language, which was enough to make the MVP usable.
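One simple way to score how similar two scraped records are is sketched below. This is an assumption-laden illustration, not the matching logic used in the project: the `Profile` type, the bigram-based Jaccard similarity, and the weights in `matchScore` are all hypothetical choices made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// Profile is a hypothetical record scraped from one website.
type Profile struct {
	Username  string
	Email     string
	Education string
}

// bigrams returns the set of two-character substrings of a lowercased string.
func bigrams(s string) map[string]bool {
	r := []rune(strings.ToLower(s))
	set := make(map[string]bool)
	for i := 0; i+1 < len(r); i++ {
		set[string(r[i:i+2])] = true
	}
	return set
}

// jaccard returns |A∩B| / |A∪B| over the bigram sets of a and b.
// It returns 0 when neither string has any bigrams.
func jaccard(a, b string) float64 {
	x, y := bigrams(a), bigrams(b)
	if len(x) == 0 && len(y) == 0 {
		return 0
	}
	inter := 0
	for g := range x {
		if y[g] {
			inter++
		}
	}
	union := len(x) + len(y) - inter
	return float64(inter) / float64(union)
}

// matchScore combines field similarities using illustrative weights
// (assumed for this sketch). A case-insensitive email match is treated
// as a strong signal.
func matchScore(a, b Profile) float64 {
	score := 0.4*jaccard(a.Username, b.Username) + 0.2*jaccard(a.Education, b.Education)
	if a.Email != "" && strings.EqualFold(a.Email, b.Email) {
		score += 0.4
	}
	return score
}

func main() {
	a := Profile{Username: "john.smith84", Email: "js@example.com", Education: "UC Davis"}
	b := Profile{Username: "johnsmith84", Email: "JS@example.com", Education: "UC Davis"}
	c := Profile{Username: "jsmith_uk", Email: "other@example.com", Education: "Leeds"}
	fmt.Printf("a vs b: %.2f\n", matchScore(a, b))
	fmt.Printf("a vs c: %.2f\n", matchScore(a, c))
}
```

A score like this does not decide a match on its own; it can rank candidates so that a person reviewing them manually sees the most likely ones first.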

Prototype description:

 

The project team developed a reliable solution that scrapes information from various websites on the fly, without a database. The customer types a client’s name into the search field, and the solution automatically looks for mentions. When the process is complete, the application builds an interactive dashboard from the collected data.
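The on-the-fly flow described above could look roughly like the following sketch, which queries several sources concurrently and groups the results in memory instead of storing them. The `Source` interface, the `stubSource` type, and `buildDashboard` are hypothetical names introduced for illustration; a real scraper would fetch and parse pages where the stub returns canned data.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Source is a hypothetical interface that each per-website scraper would satisfy.
type Source interface {
	Name() string
	Search(fullName string) []string
}

// stubSource returns canned data in place of real scraping.
type stubSource struct {
	name string
	hits []string
}

func (s stubSource) Name() string                    { return s.name }
func (s stubSource) Search(fullName string) []string { return s.hits }

// buildDashboard queries all sources concurrently and groups the
// mentions by source name, keeping everything in memory.
func buildDashboard(fullName string, sources []Source) map[string][]string {
	dashboard := make(map[string][]string)
	var mu sync.Mutex
	var wg sync.WaitGroup

	for _, src := range sources {
		wg.Add(1)
		go func(s Source) {
			defer wg.Done()
			hits := s.Search(fullName)
			mu.Lock() // the map is shared between goroutines
			dashboard[s.Name()] = append(dashboard[s.Name()], hits...)
			mu.Unlock()
		}(src)
	}
	wg.Wait()
	return dashboard
}

func main() {
	sources := []Source{
		stubSource{name: "reviews", hits: []string{"Great host, very responsive"}},
		stubSource{name: "social", hits: []string{"Public profile found"}},
	}
	dashboard := buildDashboard("John Smith", sources)

	// Sort source names so the output order is stable.
	names := make([]string, 0, len(dashboard))
	for n := range dashboard {
		names = append(names, n)
	}
	sort.Strings(names)
	for _, n := range names {
		fmt.Println(n, dashboard[n])
	}
}
```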

This prototype was built in several phases:

01. Pre-alpha: we developed a set of scripts that extract information from the most popular websites

02. Alpha: our engineers connected these scripts into one system and built a basic UI with Vue.js

03. MVP 1.0: business analysts determined the optimal way to match extracted records and combine the data into a single user profile

As soon as the first version of the MVP was released, we presented it to the customer. The feedback was very positive, and our team was asked to build a complete solution.

Results:

 

The development process was quite challenging for both the partner and our BAs. We established the development workflow and communication from scratch, while our development partner built the core of the application functionality.

It took about three weeks for our team to create a prototype: from initial business analysis and requirements gathering to application deployment. The team consisted of two business analysts, two developers, and a QA engineer.

Ways of improvement:

 

Now that the prototype is developed and the core functionality has been thoroughly tested and appears to work, the central area of future development is intelligent data matching and record linking.

We proposed that the client try machine learning to match data intelligently, but it takes additional time to collect records and analyze how they are interconnected.

There are some traditional algorithms we can also try, but these methods require considerable processing power and cannot run on the frontend alone.

Now:

 

Together with our partner, we are building a second version of the MVP according to the project specification.