Data Automation 101

What data automation is, how to use it, and where we’re going

Have you ever checked Google’s mission? 

“To organize the world’s information.” That’s pretty slick, right? 

Well, our mission at Captain Data is “To let anyone easily access web data.” 

Not too far from Google, in some ways, but with a very different objective: Google lets you access that data, but you can’t really make sense of it or use it yourself. 

And Google doesn’t really want you to do anything other than consume it because… well, because selling ads based on this data is very lucrative. 

Technology

Zapier really paved the way for automation. If you’ve ever wanted to automate your business without coding, Zapier is (was?) the tool to use. 

It connects 3rd party services together thanks to triggers and actions. Under the hood, Zapier uses websites’ APIs and allows you to play with them, even without knowing how to code.
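
To make that concrete, here’s a minimal sketch of the trigger → action idea in TypeScript. The two endpoints are made-up placeholders, not real services (and certainly not Zapier’s internals): poll one API for new items, then push each item to another.

```typescript
// Minimal trigger -> action loop: poll one API for new items (the trigger),
// then push each item to another API (the action).
// Both endpoints are hypothetical placeholders, not real services.

type Lead = { email: string; name: string };

async function pollNewLeads(): Promise<Lead[]> {
  const res = await fetch("https://crm.example.com/api/new-leads");
  return (await res.json()) as Lead[];
}

async function createContact(lead: Lead): Promise<void> {
  await fetch("https://mailer.example.com/api/contacts", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(lead),
  });
}

async function runZap(): Promise<void> {
  for (const lead of await pollNewLeads()) {
    await createContact(lead); // one "action" per triggered item
  }
}

runZap().catch(console.error);
```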

We’ve recently seen a lot of similar entrants in this market. 

A few companies are also trying to re-invent the spreadsheet, like Actiondesk or DashDash.

The problem is, 99% of websites do not provide a way to access their data.

There are a lot of reasons for this, ranging from legacy development to a lack of (good) implementation to, simply, a lack of desire to do so.

But recent developments have made it easier than ever to extract data, with headless browsers and most notably Puppeteer. 

Don’t get me wrong, the technology is not that new: for example, Selenium has been around for more than 15 years now (since 2004). 

At first, this was meant to “Automate Browser Testing.” Over time, the mantra has evolved to something more like “Automate Anything.” 
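
Here’s roughly what that looks like with Puppeteer: a headless browser doing what a human would do. The URL and selectors below are placeholders, just to show the shape of it.

```typescript
import puppeteer from "puppeteer";

// A headless browser does whatever a human would do: load a page,
// type, click, wait, read. URL and selectors below are placeholders.
async function main() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  await page.goto("https://example.com/search", { waitUntil: "networkidle2" });
  await page.type("input[name=q]", "data automation"); // fill the search box
  await page.click("button[type=submit]");             // submit it
  await page.waitForNavigation({ waitUntil: "networkidle2" });

  console.log("Landed on:", page.url(), "-", await page.title());
  await browser.close();
}

main().catch(console.error);
```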

What’s missing

Semantics

One problem remains: The web is not yet semantic. By that, I mean there’s no way to just navigate to a website and get something we’ll call its “schema”. 

In data, everything is in the schema. This is the representation of the data: what you get and, by extension, how you get it. 

I’m not talking about a random and arbitrary object composed of keys and values, that you, as a website owner, would define. I’m talking about a universal schema: schema.org for example. 

Let’s imagine you navigate to Facebook:

  • You’d know that Facebook is a registered company selling ads with a $XXXb valuation.
  • Description: “Facebook’s platform connects and empowers people.” (That actually sounds like a lot of bullshit, but that’s another topic!)
  • When navigating on a person’s profile, you’d have their full name along with whatever they decide to display publicly. (So yes, don’t just publish everything out in the open.)
  • You’ll probably get a basic understanding of how this person is connected to you or a company.

If you’re thinking “Huh, that’s pretty useless to me”, well, think about it this way:

  • Every website on the web has the same way to describe people; so when you’re on LinkedIn, you get the same description of the data (metadata) as on Facebook, Twitter, or any other social network.
  • The same goes for companies and other organizations.

It’s actually pretty neat. You’d (almost) have instant access to valuable data.
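
To make this concrete, here’s roughly what a schema.org description of a person looks like (the values are invented). This is the JSON-LD flavor, written here as a TypeScript object:

```typescript
// A schema.org "Person": the kind of machine-readable description
// a semantic web page could expose (values are invented for illustration).
const personSchema = {
  "@context": "https://schema.org",
  "@type": "Person",
  name: "Jane Doe",
  jobTitle: "Head of Growth",
  worksFor: {
    "@type": "Organization",
    name: "Acme Corp",
  },
  sameAs: [
    "https://www.linkedin.com/in/janedoe",
    "https://twitter.com/janedoe",
  ],
};

// Whatever the site is — LinkedIn, Facebook, Twitter — a "Person" is
// always described with the same keys, so one parser works everywhere.
console.log(personSchema["@type"], personSchema.name);
```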

Access

Let’s say for a minute that every website on the web has been upgraded and displays its (well-formed) schema: how do you access that data? 

Well, the only way to retrieve such data would be to use and implement an API (application programming interface). 

Everything becomes so much more complicated, because now you have to implement something on your backend (i.e. your application server) instead of just saying, “Hey, look at my schema.”
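
For contrast, here’s a minimal sketch of that backend path, using Express; the route and payload are hypothetical:

```typescript
import express from "express";

// The "API" route: instead of exposing a schema on the page itself,
// you now have to build and maintain a backend endpoint for the data.
const app = express();

app.get("/api/company", (_req, res) => {
  res.json({
    "@context": "https://schema.org",
    "@type": "Organization",
    name: "Acme Corp",
    description: "We sell anvils.",
  });
});

app.listen(3000, () => console.log("API listening on :3000"));
```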

There are a few reasons this could go badly:

  • Legacy: there are dinosaur architectures; ever heard of COBOL? Well, good luck implementing an API
  • Heterogeneous datasets: you’ve got a bit of data here and a bit of data there; it looks like a mess and you don’t even understand your own semantics
  • Interoperability: not everything is HTTP; you have to think about every other component of your architecture

Bots to the rescue

Instead of having to implement an API, you could use a bot. 

Remember how we were talking about having a unified schema? Based on that universal schema description, anyone could easily extract data. 

When extracting data on the web, you have two options:

  • Scraping
  • Crawling

Scraping is great because you know exactly what you get at the end; for each website, you select which data you want to extract. 

The downside with web scraping is that you have to maintain each bot since a website can change anytime.
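
In practice, “selecting which data you want” means hand-picking CSS selectors per site, which is exactly why every bot needs maintenance. A sketch, with placeholder selectors:

```typescript
import puppeteer from "puppeteer";

// Per-site selector maps: you decide up front exactly which fields you want.
// These selectors are placeholders; real ones break whenever a site's HTML changes.
const siteConfigs = {
  "news.example.com": { title: "h1.article-title", author: ".byline a" },
  "shop.example.com": { title: "h1.product-name", author: ".brand-name" },
};

async function scrape(url: string) {
  const host = new URL(url).hostname as keyof typeof siteConfigs;
  const selectors = siteConfigs[host];
  if (!selectors) throw new Error(`No selector config for ${host}`);

  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle2" });

  const result: Record<string, string | null> = {};
  for (const [field, selector] of Object.entries(selectors)) {
    result[field] = await page
      .$eval(selector, (el) => el.textContent?.trim() ?? null)
      .catch(() => null); // selector missing -> the bot needs maintenance
  }

  await browser.close();
  return result;
}
```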

Crawling (and not scrolling 🤦‍♂️) is a bit more complicated, because it is more “generalized” and usually involves a lot more volume. 

It follows a set of (more or less) intelligent rules: go to that website, try to extract this type of content, then continue to the next level/website, and so on.
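
A toy version of those rules, with plain fetch and a naive regex for links (a real crawler would parse HTML properly, respect robots.txt, throttle requests, and so on):

```typescript
// Toy breadth-first crawler: start from seed URLs, extract links,
// keep going up to a depth limit. The rules here are deliberately naive.
async function crawl(seeds: string[], maxDepth = 2): Promise<Set<string>> {
  const seen = new Set<string>();
  let frontier = seeds;

  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      if (seen.has(url)) continue;
      seen.add(url);

      const html = await fetch(url).then((r) => r.text()).catch(() => "");
      // Naive link extraction; a real crawler would parse the DOM properly.
      for (const match of html.matchAll(/href="(https?:\/\/[^"]+)"/g)) {
        next.push(match[1]);
      }
    }
    frontier = next;
  }
  return seen;
}

crawl(["https://example.com"]).then((urls) => console.log(urls.size, "pages seen"));
```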

As always, the trick lies in the metadata (the description of the data). 

If you start crawling every PDF out there, great, you’ll have an awesome collection of PDFs … but without semantics, what do you make out of it? Not much. 

You’d have to use Machine Learning to try and make sense out of it. And that suddenly becomes a lot more complicated.

Now, if there’s a unique schema, there could also be a unique bot. You’d plug in the URL and start extracting the semantics from any page on the web – that would be truly awesome! 

Any developer would be able to code this bot in 5 minutes. And of course, there would be companies, like Captain Data, that would empower non-coders to also access that data. 
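
For the pages that already expose schema.org JSON-LD, that “unique bot” is roughly this: grab every `<script type="application/ld+json">` block and parse it. A sketch:

```typescript
import puppeteer from "puppeteer";

// The "unique bot": given any URL, pull whatever schema.org JSON-LD
// the page exposes. It only works where the page actually publishes it.
async function extractSchema(url: string): Promise<unknown[]> {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle2" });

  const blocks = await page.$$eval(
    'script[type="application/ld+json"]',
    (scripts) => scripts.map((s) => s.textContent ?? "")
  );

  await browser.close();

  return blocks.flatMap((raw) => {
    try {
      return [JSON.parse(raw)];
    } catch {
      return []; // malformed JSON-LD, skip it
    }
  });
}

extractSchema("https://example.com").then(console.log);
```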

It’s not artificial intelligence

Sorry to disappoint you, but there’s no big data or artificial intelligence involved in any way. 

I mean, sure, you could develop a bot that uses “AI” to extract data. 

But the bot would have to be very specialized: for example, you could train it to extract job postings from job boards, because the data is roughly the same, so it’s easier to define the semantics and the expected output.

At the moment, web scraping is pretty stupid, but I’m sure it will evolve with time. 

The developer’s job is basically to replicate human behavior in a script (a program).

The road to web automation

If you’re using Zapier or something equivalent, you know you can’t do things like:

  • Automatically connect with a LinkedIn profile, given an email
  • Automatically extract results from a search; pretty much any website these days has search functionality
  • Collect someone’s followers on Instagram or Twitter
  • Extract every customer review on Yelp or TripAdvisor
  • And so on…

At Captain Data, we want to bridge this gap by enabling sales, marketing, and operations teams to easily extract and automate this kind of data. 

We provide a platform that allows you to extract and automate web data in real time. 

You can use premade bots that we maintain or code your own, as well as use native integrations like HubSpot or Salesforce.

And like Zapier, you can connect and chain multiple bots together to create what we call a recipe.

A recipe is an automated workflow made of bots and integrations.
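
Purely as an illustration (this is a simplified sketch, not our actual recipe format), you can picture a recipe as an ordered list of steps, each step being a bot or an integration whose output feeds the next one:

```typescript
// Hypothetical shape of a "recipe": a chain of steps where each bot's
// output becomes the next step's input. All names below are invented.
type Step =
  | { kind: "bot"; name: string; input?: Record<string, unknown> }
  | { kind: "integration"; name: string; action: string };

const recipe: Step[] = [
  { kind: "bot", name: "search-extractor", input: { query: "head of growth" } },
  { kind: "bot", name: "profile-enricher" },
  { kind: "integration", name: "crm", action: "upsert-contact" },
];

// A runner would execute the steps in order, piping results along the chain.
```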

But what’s fundamentally different in our approach is that you also get access to data that was previously not accessible, thanks to web scraping. 

For example, you can create as many workflows as you need, using as many bots and integrations as you want.

The only limit is your imagination 👻


You want to work with us? Shoot us an email at sales@captaindata.co.

If you’re looking for a job/internship in marketing or engineering, shoot us an email at join@captaindata.co.

Just want to say hello? Ping us at hello@captaindata.co or come visit us.
