– Vincent Sgro, Co-Founder and Chief Technology Officer at Connotate

Do your Web scraping efforts let you down?

Being a high-performance business today requires agility and innovation, not just to meet executive and shareholder demands but also to keep pace with the competition. The ability to mobilize Big Data can provide the means to create and sustain competitive advantage.

A homegrown Web scraping technology to access critical data located on Web sites may be holding you back. Here are the top ten features your Web scraper — and your business — could be missing out on:

Top Feature #1: Scalability

Traditional programmatic approaches to Web site scraping isolate the “moving parts” of a solution to make the problem easier for the programmer to solve, but that isolation cuts the code off from runtime usage scenarios. When a non-programmatic approach generates the code instead, it opens up the possibility of receiving clues about the intended usage of the extracted data. An automated Web data extraction and monitoring agent can, for example:

  • Bypass useless links and arrive at desired data more rapidly
  • Consume fewer hardware resources
  • Create a lighter load footprint on target websites

This functionality is imperative for extracting data at scale. Further, non-programmatic methods can capture knowledge about individual sites and leverage it to speed learning across multiple sites, adding to the ability to scale efficiently and effectively.
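To make the idea concrete, here is a minimal Python sketch of a usage-aware agent: because it knows the user only wants press releases, it fetches a fraction of the site. This is an illustration of the principle, not Connotate's technology; the URL and the keyword hint are invented for the example.

```python
# Minimal sketch of a usage-aware fetcher. The start URL and the
# "press release" hint are hypothetical; requests and BeautifulSoup
# are standard third-party libraries (pip install requests beautifulsoup4).
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def relevant_links(start_url, hint):
    """Yield only links whose anchor text matches the usage hint,
    so the agent bypasses pages it will never need."""
    html = requests.get(start_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        if hint.lower() in a.get_text(" ", strip=True).lower():
            yield urljoin(start_url, a["href"])

# Fetching only what the usage scenario calls for cuts hardware use
# and leaves a lighter load footprint on the target site.
for url in relevant_links("https://example.com/news", "press release"):
    print(url)
```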

Top Feature #2: Minimized errors

Visual abstraction is a non-programmatic approach that uses machine learning to generate efficient code we call an agent. Visual abstraction interprets each Web page the way a human perceives the page visually. Unlike a homegrown web scraping solution, an automated Web data extraction and monitoring solution can support a higher level of abstraction and does not require knowledge of HTML structures. And it doesn’t break when a page’s layout changes.
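The flavor of the idea can be shown with a toy heuristic: pick content by what it looks like to a reader rather than by where the markup puts it. The simplified Python example below is only a stand-in for the machine-learning approach described above, and the sample HTML snippets are invented.

```python
# Toy stand-in for visual abstraction: select the block a reader would
# see as the page body (the one with the most visible text), regardless
# of the HTML structure around it. Sample markup is hypothetical.
from bs4 import BeautifulSoup

def main_block(html):
    """Return the element with the most visible text."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = soup.find_all(["div", "article", "section", "td"])
    return max(blocks, key=lambda b: len(b.get_text(strip=True)), default=None)

old_layout = "<table><tr><td>A long product description goes here...</td></tr></table>"
new_layout = "<article><p>A long product description goes here...</p></article>"

# The same selector works before and after a template redesign because
# it keys on the content itself, not on the tag path.
for page in (old_layout, new_layout):
    print(main_block(page).get_text(strip=True))
```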

Top Feature #3: Resiliency

Do the words “maintenance problem” sound familiar? A homegrown data scraping solution that depends on HTML delimiters most likely stops working when Web sources change their underlying page templates. It’s difficult, if not impossible, to write code that can adjust itself to HTML formatting changes; the changes are just too dynamic. Lack of resiliency, then, means the programmer has to be on call to repair broken scripts, because broken scripts will happen. Often.

But using a non-programmatic approach changes the game here. Based on research from Rutgers University and the University of Southern California, a hybrid, non-programmatic platform that uses machine learning and visual data abstraction alleviates the “maintenance problem.” The platform allows an agent to learn what “content” looks like and to recognize when a data element moves from one location on a Web page to another. The agent adapts to formatting changes without breaking. The ultimate result? No programming, and the “maintenance problem” is no longer a burden.
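As a concrete contrast, consider the two hand-written extractors below. The first is wired to one template and fails after a redesign; the second anchors on the human-visible label and survives the change. Both are illustrations written for this article (the markup and the label are invented), not the machine-learning method the research describes.

```python
# Illustrative contrast between delimiter-based and content-anchored
# extraction. OLD and NEW are hypothetical before/after page templates.
from bs4 import BeautifulSoup

OLD = "<table><tr><td class='lbl'>Price:</td><td class='val'>$19.99</td></tr></table>"
NEW = "<div><span>Price:</span> <b>$19.99</b></div>"

def brittle(html):
    """Delimiter-based scraping: hard-wired to one template."""
    cell = BeautifulSoup(html, "html.parser").find("td", class_="val")
    return cell.get_text(strip=True) if cell else None

def resilient(html):
    """Anchor on the visible label and take the value that follows,
    so changing the tags does not break the extraction."""
    soup = BeautifulSoup(html, "html.parser")
    label = soup.find(string=lambda s: s and "Price:" in s)
    if label is None:
        return None
    value = label.parent.find_next(string=lambda s: s and "$" in s)
    return value.strip() if value else None

print(brittle(OLD), brittle(NEW))      # $19.99 None  <- the maintenance problem
print(resilient(OLD), resilient(NEW))  # $19.99 $19.99
```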

Top Feature #4: Integration with business processes and operational workflow

In today’s data-driven business environment, multiple teams of workers often interact with data collection and data analysis processes. Organizations seeking web scraping at scale must support the often diverse requirements of the different purposes for the data. Because the requirements are distinct, built-in features supporting the variety of needs are critical to scaling to high volumes and high frequencies of data collection.

Top Feature #5: Ability to work with dynamic unstructured data

Homegrown scripts and web scraper software depend on HTML delimiters, which break when the underlying HTML changes, and the need for fixes has to be monitored manually. An automated Web data extraction and monitoring solution detects changes and additions with precision, returning only the desired data.
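Change detection itself can be sketched in a few lines of Python. This hash-and-compare illustration is one simple way to do it, not a description of any product's detection engine; the sample records are invented.

```python
# Minimal change-detection sketch: fingerprint each extracted record
# and report only what is new or altered since the previous run.
import hashlib

def fingerprint(record):
    """Stable hash of one extracted record."""
    return hashlib.sha256(record.encode("utf-8")).hexdigest()

def new_or_changed(previous, current):
    """Return only records absent from the previous run."""
    seen = {fingerprint(r) for r in previous}
    return [r for r in current if fingerprint(r) not in seen]

yesterday = ["ACME,19.99", "Globex,5.00"]
today = ["ACME,18.49", "Globex,5.00", "Initech,7.25"]

# The repriced ACME row and the new Initech row come back;
# the unchanged Globex row is filtered out.
print(new_or_changed(yesterday, today))
```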

Top Feature #6: Ability to create and manage scripts and agents

Certain capabilities in an automated Web data extraction and monitoring solution can help streamline processes and workflows at scale, producing real productivity gains. These include:

  • Shared schemas and request lists to support the management of large, ongoing projects and consistent practices across a team
  • Tools that easily invoke mass adjustment actions
  • Automatic agent deployment and load management
  • Bulk manipulation of job scheduling
  • Migration of agents and data user subscriptions among systems
  • Improved quality assurance

Top Feature #7: Unstructured data transformed to usable structured data

Unstructured data is designed for the human eye, while structured data is designed for computers. A homegrown Web scraper uses HTML tags and other Web page text as delimiters or landmarks, scanning code and discarding irrelevant delimiters to extract the data. A homegrown Web scraper and an automated web data extraction and monitoring solution can both turn unstructured data into .csv, .xls, .xml, or other forms of structured data usable by a computer, facilitating analysis and application of the data to inform better business decisions. However, the automated solution incorporates data normalization and transformation methods to ensure the structured data can easily be turned into actionable insights. For example, Connotate provides at least 20 normalization templates, ready for use, and clients can use these to define others.
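As an illustration of what a normalization template does, the Python sketch below canonicalizes dates and prices before writing the rows out as CSV. The field names, input formats, and rules are invented for the example; Connotate's actual templates are a product feature, not this code.

```python
# Toy normalization template: canonicalize dates to ISO format and
# prices to plain floats, then emit CSV. Field names and formats
# are hypothetical examples.
import csv
import io
from datetime import datetime

def normalize(row):
    """Return one canonical record shape for downstream analysis."""
    return {
        "date": datetime.strptime(row["date"], "%B %d, %Y").date().isoformat(),
        "price": float(row["price"].replace("$", "").replace(",", "")),
        "title": row["title"].strip(),
    }

raw = [
    {"date": "January 5, 2015", "price": "$1,299.00", "title": "  Widget Pro "},
    {"date": "March 12, 2015", "price": "$89.50", "title": "Widget Mini"},
]

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["date", "price", "title"])
writer.writeheader()
writer.writerows(normalize(r) for r in raw)
print(out.getvalue())
```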

Top Feature #8: Intelligent agents

Agents can be created in minutes without programming. Based on machine learning and visual abstraction, they can easily monitor, gather, analyze and deliver high-value Web content. Unlike traditional Web scraping software, an automated Web data extraction and monitoring solution, through the use of intelligent agents, produces a lighter load on target websites while retrieving the data more quickly.

Top Feature #9: Full-story extraction

Each news website, be it The New York Times, The Washington Post or a similar news source, has a unique look and feel for story detail pages, even though their RSS feeds are standard. An automated Web data extraction and monitoring solution employs algorithms that combine measurements (e.g., the length of sentences, the frequency of punctuation, the impact and dependency of neighboring sections) from the experience of many agents collecting news. Automation of full-story text extraction across thousands of Web sites also enhances scalability.
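The measurements named above lend themselves to a short sketch. The scoring function below uses sentence length and punctuation frequency to separate story prose from navigation chrome; the weighting is invented for illustration, whereas a production system would learn it from the experience of many agents.

```python
# Illustrative story-vs-chrome scorer using two of the signals named
# above: sentence length and punctuation frequency. The weighting is
# a made-up example, not a learned model.
import re

def story_score(block):
    """Longer sentences with sentence-ending punctuation suggest
    article prose; short link-like fragments suggest navigation."""
    sentences = [s for s in re.split(r"[.!?]+\s*", block) if s]
    if not sentences:
        return 0.0
    avg_words = sum(len(s.split()) for s in sentences) / len(sentences)
    punct_rate = len(re.findall(r"[.!?]", block)) / max(len(block.split()), 1)
    return avg_words * punct_rate

blocks = [
    "Home | World | Politics | Sports",
    "The committee voted late on Tuesday after hours of debate. "
    "Officials said the measure would take effect next spring.",
]

# The prose paragraph outscores the navigation bar.
print(max(blocks, key=story_score))
```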

Top Feature #10: Flexible deployment

Does your homegrown Web scraper mean your clients must reconfigure processes and infrastructure? Ideally, it should be the other way around. Web data extraction should accommodate the IT department’s current processes and be able to adapt as needed.