– Rob Sobers, technical manager at Varonis (www.varonis.com), says:
Organizations already have concerns about managing human-generated big data. The fact that Web crawlers are now scouring the Web and indexing what they find shows that companies need to treat their human-generated content more carefully.
The Web is full of all kinds of data – some of which is being cleaned up and interpreted by services such as Infochimps or Microsoft’s Windows Azure Marketplace. Other services – such as Datafiniti and Factual – are reportedly building entire businesses on scraping content from Web sites and creating customized databases for clients.
Other companies are also taking advantage of human-generated data (e.g., reviews and comments) to enrich themselves. Yelp, for example, went public on the strength of content provided by its users, content that it then vigorously defended from Google’s indexing.
More than anything, this highlights the fact that the power of the content that humans create is only now being realized. And, of course, organizations have a lot of human-generated content: some of it is on the Web, but most resides inside the organization.
It is now clear that external entities can glean enough information from the Internet to build a business. Thankfully, the data held inside most organizations remains largely untapped by outsiders at present.
With a growing number of companies whose business is harvesting and interpreting externally available data, it should be clear to any IT professional that – as well as protecting their internal data from prying external eyes – they also need to harness its power to take advantage of new opportunities.
This is where the idea of human-generated big data enters the frame. It is defined as data sets generated by people (rather than machines) that grow so large and change so quickly that their content, usage, and permissions become difficult to capture, store, search, and analyze.
The same kinds of big data technologies that crawl Web sites can also be applied to human-generated content inside the organization: documents, spreadsheets, emails, presentations, and audio and video files. This lets corporations better manage and protect that content, analyze it, and benefit from its “hidden” value. It is also extremely important, of course, to prevent human-generated big data from being stolen by anyone – whether an insider or an information-harvesting company.
This makes it all the more imperative that companies understand the need to use analytics on their human-generated big data. Conventional data security software rarely provides adequate coverage for this content, so IT professionals need to take action to fully protect all of their digital data assets.