Most people don’t really care about Big Data anymore. This is an early 2010s buzzword, after all. That was before the whole AI, VR and TikTok stuff.
They should care: Big Data is what made all that other stuff possible.
A Simple Definition
So what is Big Data? Big Data is a very, very large collection of different types of information which change a lot. At least, that’s the simple definition.
Doesn’t sound all that scary, now, does it? Kinda just sounds like a library, or an archive. But Big Data is slightly more alive than that. Slightly more troubling. Let’s take a look.
A Real-World Example
Let’s start with an example to illustrate.
As of this writing, Facebook reports having about 3 billion active users per month, making it the most used social media platform in the world. Assuming these users are all people, that means that slightly less than 40% of all humans go on Facebook each month. You might be one of them.
Let’s say you make a post on Facebook. What happens then?
Well, Facebook is going to store the information about that post in a database. That way, they can load it for your Facebook friends to see. The exact information they save will of course be the text in the post, and the picture if you added one. They will also store the time and location encoded in the digital picture, the location of your computer where you posted from, and the time at that location according to your computer, so your friends can see which tropical island you’re posting beach photos from at 10 in the morning on a Tuesday. Facebook already stores your profile and your friend network, so it knows who to show the post to. Your profile has your name, your email, your phone number, your gender, your birthday, the town you grew up in, the high school you went to, your favorite TV show, and pictures of what you look like, all the information you knowingly and willingly shared when you first created your account. They also know the type of devices you like to use, as well as the places you’ve lived in and traveled to and accessed social media from. They know the time of day at which you like to scroll, the type of post you spend just slightly longer looking at, all the tells that you unknowingly share. They note the users you are Facebook friends with, the users you like to chat with, the users you marked as family, the users you were in relationships with, and the users who you don’t know, but whose behavior and interests are similar to yours.
And then you make your post, and they use that information to show your post to other users, and they save which users spend time looking at it, and who reacts to it, and who comments on it, and the comments themselves, and who replies to those comments, and whether the words in the comments relate to positive emotions (happy, great, pretty!!!!), or negative (hate, ugh, angry emoji), and whether or not users are more likely to click on an ad for a beach vacation after seeing your post.
And that’s just one Facebook post. Now multiply that by 3 billion every month. That’s Big Data.
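To make the comment-scoring step a little more concrete, here is a minimal sketch of how words in a comment could be tallied as positive or negative. The word lists and the function name are made up for illustration. Nothing here is Facebook's actual method, and a real system would almost certainly rely on a trained model rather than two tiny word lists.

```python
# Hypothetical word lists, chosen only to mirror the examples in the text.
POSITIVE = {"happy", "great", "pretty"}
NEGATIVE = {"hate", "ugh", "angry"}

def comment_mood(comment: str) -> str:
    """Label a comment by counting positive and negative words."""
    # Strip surrounding punctuation so "pretty!!!!" still matches "pretty".
    words = [w.strip("!?.,;:()\"'").lower() for w in comment.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(comment_mood("So pretty!!!! Great beach"))
print(comment_mood("ugh, I hate Tuesdays"))
```

Run that over every comment on every post, and you start to see where the volume comes from.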
What Makes Data Big?
When you have a very, very large collection of data, you can decide whether it qualifies as Big Data with the help of the three V’s: Volume, Velocity and Variety.
Volume
This one is easy: do you have a lot of data? Gigabytes and terabytes of it? Do you need to keep it on the Cloud because it’s bigger than your laptop’s entire storage? Does downloading it as a file take forever? Does your computer keep freezing when you try to open the file to look at it?
Probably the single most defining feature of Big Data is its volume. It’s so big that you need special data tools to look at it. This is way out of Excel’s league.
Velocity
We mentioned archives earlier. Archives also have a lot of volume, but they are not Big Data. Why not? Because archives are usually ink on paper, which is a fairly permanent way of saving information. Maybe a new book will be added once in a while, or a revised edition brought in. But in general, there isn’t a whole lot of change.
Big Data isn’t like that. It grows. It changes. A lot. All the time. Thousands of entries updating at once every millisecond. Think of everyone who’s out there writing Facebook posts right now. Those posts are not only going to add to Facebook’s already gigantic data collection. They are also going to create ripple effects across the data, as timestamps, locations, views and reactions are updated, like a fire hose with no shutoff valve blasting into an already flooded pond. (One common name for a storage system that holds Big Data is literally a data lake.)
Variety
Libraries have it easy when it comes to storing data. They mostly have books with words printed in them, mostly with proper spelling. The more modern ones might have picture books too, maybe magazines or even DVDs, but that’s about it for variety.
Facebook alone can top that. There’s posts, there’s comments, there’s replies to comments, all spelled in every way you can imagine, spellcheck be damned. If you’re searching for all mentions of “cats” in the Facebook databases, you better be prepared to search for “cts” and “ctas” as well. And then do that in almost every language on Earth.
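For the curious, here is a rough sketch of what a typo-tolerant search for “cats” could look like, using the SequenceMatcher class from Python's standard difflib module. The function name, the sample posts and the 0.7 similarity cutoff are all assumptions picked for illustration. This only handles one language and one word, which is sort of the point.

```python
from difflib import SequenceMatcher

def mentions(posts, target="cats", cutoff=0.7):
    """Return the posts containing a word that looks enough like `target`."""
    hits = []
    for post in posts:
        # Lowercase each word and trim common punctuation before comparing.
        words = [w.strip("!?.,").lower() for w in post.split()]
        # ratio() is 1.0 for identical strings and drops toward 0.0 as they differ.
        if any(SequenceMatcher(None, target, w).ratio() >= cutoff for w in words):
            hits.append(post)
    return hits

posts = ["I love my cts", "ctas are the best!", "Went to the gym today"]
print(mentions(posts))
```

The catch is that a loose cutoff also lets in near misses like “cars”, so someone has to decide how many false alarms are acceptable.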
There’s also reactions, emojis, gifs, links to other websites, pictures, voice recordings and videos. We give Facebook a lot of grief for not removing violent and/or lewd content fast enough, but in all honesty, it is a very difficult job to monitor all these different formats in every language. (How hard they are trying is another question, but for now let’s give them the benefit of the doubt.)
Speaking of user moderation, there’s user relationships with other users to track, user messages, user histories, user reports of inappropriate content, track records of removed posts and banned accounts.
Then there’s ads, and user classifications of the types of ads they might like. Track records of everyone the ads were shown to, and how they responded to them. After all, Facebook makes most of its money from targeted advertising, and most of the Big Data they collect is saved to serve this goal.
There’s tags, and hashtags, and user groups, a million and one different columns in a million different tables, all constantly expanding and updating, although of course previous entries are kept, because you wouldn’t want to lose anything. Things are rarely deleted. Old tables from the early days of Facebook, when technology was different, and information had a different structure, are probably still floating around somewhere.
That’s what’s meant by variety. Big Data is not just a bunch of information. It’s a set of groups of subsets of information of lots of different types, all connected in different ways.
And it’s someone’s job to manage that.
Who Has Big Data
Now Facebook may have been a pioneer in the field of compulsive data hoarding, but it is by no means the only guilty party at this point. Most social media platforms will also collect any piece of information you give them, both intentionally and accidentally, just in case it helps sell you to advertisers.
Non-social media companies do this with information about their clients as well. Amazon really wants to know what kind of pillow you looked at so it can advertise similar pillows to you, and to people who are similar to you. Shipping companies need to know where in the world your pillow is, where it’s going, and how it’s getting there. The healthcare industry wants to compare your medical data to all the other patient medical data so they can find out what all the people with neck problems have in common and find a cure.
And it’s not just data about people being collected. Cybersecurity workers collect Big Data to detect online security threats. Physicists and astronomers get absolutely enormous amounts of numbers about space from their measurement tools and satellites. Finance people obsessively compile stock market numbers as they come in as part of their continued quest to predict future prices based on past prices. Weather channels compile readings from across the world in their far more successful quest to predict future storms based on past storms.
Using Big Data for Dummies
So once you have it: how do you use it?
That’s the work of a data scientist. Remember when this was the job everyone was talking about? Data scientists are somewhere between computer scientists and statisticians. Their first job is to write code to collect the information from all the different sources in a company’s computer system. Then, once they’ve collected the data, they need to clean it. Remember searching for cats above? There’s a lot of that. Data will come in misspelled, with missing timestamps, in the wrong format, encrypted or damaged or downright wrong. A significant portion of a data scientist’s job is to set up filters and safeguards to keep bad data from tainting the results. Or, as it’s known in the computer science world: “Garbage in, garbage out”.
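Here is a minimal sketch of what one of those filters might look like. The record layout (a dictionary with "text" and "timestamp" fields) and the function name are assumptions made up for this example. Real pipelines deal with far messier input and far more rules.

```python
from datetime import datetime

def clean(records):
    """Split records into usable ones and rejects.

    A record is kept only if it has non-empty text and a timestamp
    that parses as an ISO 8601 date and time.
    """
    good, rejected = [], []
    for rec in records:
        text = (rec.get("text") or "").strip()
        try:
            when = datetime.fromisoformat(rec.get("timestamp") or "")
        except (ValueError, TypeError):
            # Missing, malformed or non-string timestamp: set the record aside.
            rejected.append(rec)
            continue
        if not text:
            rejected.append(rec)
            continue
        good.append({"text": text, "timestamp": when})
    return good, rejected

records = [
    {"text": " Beach day ", "timestamp": "2024-03-05T10:00:00"},
    {"text": "no time", "timestamp": None},
    {"text": "bad format", "timestamp": "05/03/2024 10am"},
    {"text": "", "timestamp": "2024-03-05T10:00:00"},
]
good, rejected = clean(records)
print(len(good), "kept,", len(rejected), "rejected")
```

Keeping the rejects around instead of silently dropping them is a deliberate choice here: it lets someone check later whether the filter is throwing away good data.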
Once the pipelines are set up to collect and clean the data (remember, it’s still gushing in all the time), the data scientists can actually get to the fun part: exploiting data for profit. And boy do they get creative.
Ever been to a party where you talked to a friend about a new gym they went to? And then a few days later you went on Instagram and saw an ad for that gym? Many people have reported experiencing something like this, to the point where there’s conspiracy theories that social media apps are tapping our phones and listening to our conversations.
I’m here to reassure you: that is not happening. First of all, because speech-to-text is still a shaky technology even in controlled environments, let alone for espionage. Secondly, that would be illegal.
You know what’s not illegal though? Checking your location at a given time, which you gave them permission to do so they could add times and locations to your posts. Comparing it to the time and location of your Facebook friend. Noting that you two were in the exact same location at the same time, and that you also have similar ages and belong to similar social classes. And therefore probably have similar interests. Then checking your friend’s recent activity, noticing a new gym whose Facebook page they looked up, and sending you the ad for it. None of that is illegal.
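The “same place, same time” check is simpler than it sounds. Below is a minimal sketch, assuming each location reading is a (time, latitude, longitude) tuple. The function names, the 100 meter radius and the 30 minute window are arbitrary values invented for illustration, not anything a real platform is known to use. The distance calculation is the standard haversine formula.

```python
from datetime import datetime, timedelta
from math import radians, sin, cos, asin, sqrt

def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, via the haversine formula."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))  # 6371 km is the Earth's mean radius

def were_together(ping_a, ping_b, max_km=0.1, max_gap=timedelta(minutes=30)):
    """True if two (time, lat, lon) readings are close in both space and time."""
    time_a, lat_a, lon_a = ping_a
    time_b, lat_b, lon_b = ping_b
    close_in_time = abs(time_a - time_b) <= max_gap
    close_in_space = distance_km(lat_a, lon_a, lat_b, lon_b) <= max_km
    return close_in_time and close_in_space

you = (datetime(2024, 3, 2, 21, 0), 40.7128, -74.0060)
friend = (datetime(2024, 3, 2, 21, 10), 40.7129, -74.0061)
print(were_together(you, friend))
```

Once two accounts keep turning up together like this, the friend's recent activity becomes a reasonable guess at what you might want to see.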
Big Data-powered targeted advertising is something else. Not to mention incredibly lucrative.
How Did This Happen?
So obviously at this point it’s more or less common knowledge that privacy is dead. What we’re increasingly finding out is that Big Data is the murderer. But when did this happen? When did Data start becoming so big and personal? Before the 2010s, there were already sizable collections of data being compiled and shared. Phone books were a thing. Most census statistics started being collected decades ago. The World Wide Web was literally invented so physicists could share the massive amounts of data from their experiments with each other.
But again, Big Data is different. It’s not just a matter of having a lot of information. It’s a matter of having the tools that make that data incredibly easy to exploit.
A big piece of the puzzle is the Cloud, which entered the mainstream just a short while before Big Data became a thing we worried about. Fast internet connections and large amounts of computer storage space online for relatively cheap suddenly made it realistic to actually manipulate all the data out there. All of a sudden it could be collected, stored and used to find patterns, generalize about behaviors, and even predict future events. And if that sounds familiar, it might be because those are all things that AI is very good at. And that’s no accident. Modern AI requires huge amounts of variable, up-to-date data for training. Without Big Data, there is no AI.
The Last V
We already discussed the three V’s above. Volume, Velocity, Variety. But there’s a fourth V that’s become an increasing point of contention in the past couple of years: Veracity.
Let’s play a little game: what’s the worst thing that could happen if some of the data in Facebook’s Big Data was untruthful?
In 2015, a Facebook group in a Mexican town published posts about two fifteen-year-old girls going missing. A day later, two young men who’d just arrived in the city were beaten to death and set on fire by an angry mob. According to the local authorities who were powerless to stop the lynching, there was no official report of the girls being missing and no evidence that the two brothers were in any way related to anyone’s disappearance. They had come to the city to conduct a survey on tortilla consumption.
This was not an isolated incident. Mob lynchings fueled by fake news spread on social media have notably been observed in India, Indonesia, Nigeria, Myanmar and Sri Lanka. Sri Lanka has even resorted several times to shutting down Facebook and other social media platforms to prevent the spread of violence.
The big takeaway from this article should be as follows: Big Data is incredibly powerful. It gives us insights into information that we’ve never had access to before. But it is also incredibly dangerous when inaccurate. Big Data knows a lot about you. But it does not know which parts are true.