Big Data (A problem, not a technology)

😮😲😧 But data is the new oil of the IT industry!

Kajal Kashyap
8 min read · Mar 13, 2021

Got confused??? Let’s start from the basics…

👉 What is Data ?

Facts and statistics collected together for reference or analysis.

👉 What is Big Data ?

Big Data is a collection of data that is huge in volume, yet keeps growing exponentially with time. Its size and complexity are so large that no traditional data management tool can store or process it efficiently. In short, Big Data is still data, just at a scale that breaks conventional tooling.

👉 Why do we use Big Data?

Big Data helps organizations create new growth opportunities, and even entirely new categories of companies that combine and analyze industry data. These companies hold ample information about products and services, buyers and suppliers, and consumer preferences, all of which can be captured and analyzed.

Eric Schmidt (Executive Chairman of Google) said:

“There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days.”

Want to know just how huge Big Data is?!

You will be surprised by the amount of data generated every single minute 🙄

🔴 Examples Of Big Data

Following are some examples of Big Data:

✅ The New York Stock Exchange generates about one terabyte of new trade data per day.

✅ Social Media

Statistics show that 500+ terabytes of new data get ingested into the databases of the social media site Facebook every day. This data is mainly generated from photo and video uploads, message exchanges, posting comments, etc.

✅ A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.

🔴 Types Of Big Data

Following are the types of Big Data:

  1. Structured
  2. Unstructured
  3. Semi-structured

👉 Structured

Any data that can be stored, accessed, and processed in a fixed format is termed ‘structured’ data. Over time, talent in computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and deriving value out of it. However, nowadays we are foreseeing issues as such data grows to a huge extent; typical sizes are now in the range of multiple zettabytes.

Do you know? 10²¹ bytes equal one zettabyte; in other words, one billion terabytes form a zettabyte.
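If that conversion feels too big to trust, here is a two-line sanity check in Python (decimal SI units assumed, not binary):

ZETTABYTE = 10 ** 21  # bytes
TERABYTE = 10 ** 12   # bytes
print(ZETTABYTE // TERABYTE)  # 1000000000 -> one billion terabytes per zettabyte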

Looking at these figures, one can easily understand why the name Big Data was given, and imagine the challenges involved in its storage and processing.

Do you know? Data stored in a relational database management system is one example of ‘structured’ data.

Examples Of Structured Data

An ‘Employee’ table in a database is an example of structured data:

Employee_ID | Employee_Name   | Gender | Department | Salary_In_lacs
----------- | --------------- | ------ | ---------- | --------------
2365        | Rajesh Kulkarni | Male   | Finance    | 650000
3398        | Pratibha Joshi  | Female | Admin      | 650000
7465        | Shushil Roy     | Male   | Admin      | 500000
7500        | Shubhojit Das   | Male   | Finance    | 500000
7699        | Priya Sane      | Female | Finance    | 550000
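To make “fixed format, known in advance” concrete, here is a minimal sketch of the same table held in a relational store. SQLite is chosen purely for illustration; the table and column names simply mirror the example above:

# Structured data: the schema is declared up front, so storage and
# querying are routine. SQLite used here only as a throwaway example.
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for the demo
conn.execute("""
    CREATE TABLE Employee (
        Employee_ID     INTEGER PRIMARY KEY,
        Employee_Name   TEXT,
        Gender          TEXT,
        Department      TEXT,
        Salary_In_lacs  INTEGER
    )
""")
rows = [
    (2365, "Rajesh Kulkarni", "Male", "Finance", 650000),
    (3398, "Pratibha Joshi", "Female", "Admin", 650000),
    (7465, "Shushil Roy", "Male", "Admin", 500000),
    (7500, "Shubhojit Das", "Male", "Finance", 500000),
    (7699, "Priya Sane", "Female", "Finance", 550000),
]
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?, ?, ?)", rows)

# Because the format is fixed, queries like this are trivial:
for row in conn.execute(
    "SELECT Employee_Name, Salary_In_lacs FROM Employee WHERE Department = 'Finance'"
):
    print(row)

This is exactly the kind of data traditional tools handle well; it only becomes a Big Data problem when the volume outgrows a single machine.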

👉 Unstructured

Any data with unknown form or structure is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges when it comes to processing it for value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Nowadays organizations have a wealth of data available to them but, unfortunately, they don’t know how to derive value out of it, since this data is in its raw, unstructured form.

Examples Of Unstructured Data

The output returned by ‘Google Search’

👉 Semi-structured

Semi-structured data can contain both forms of data. We can see semi-structured data as structured in form, but it is actually not defined with, e.g., a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file.

Examples Of Semi-structured Data

Personal data stored in an XML file:

<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
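A minimal sketch of reading those records in Python. The <rec> elements above have no enclosing root, so one is wrapped around them before parsing; the tag names are taken from the example itself:

# Semi-structured data: tags give the records shape, but no schema enforces it.
import xml.etree.ElementTree as ET

xml_records = """
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
"""

# Wrap the bare records in a root element so the parser accepts them.
root = ET.fromstring(f"<records>{xml_records}</records>")
for rec in root.findall("rec"):
    # Fields may be missing in any record, so defaults are needed.
    name = rec.findtext("name", default="?")
    age = rec.findtext("age", default="?")
    print(name, age)

The structure is there, but nothing enforces it: a record could drop or add a field at any time, which is exactly what makes the data semi-structured.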

Data Growth over the years

🔴Characteristics Of Big Data

Big data can be described by the following characteristics:

  • Volume
  • Variety
  • Velocity
  • Variability

👉 Volume — The name Big Data itself is related to an enormous size. The size of data plays a very crucial role in determining its value. Whether particular data can actually be considered Big Data or not also depends upon its volume. Hence, ‘Volume’ is one characteristic which needs to be considered while dealing with Big Data.

👉 Variety — The next aspect of Big Data is its variety.

Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only data sources considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining, and analyzing data.

👉 Velocity — The term ‘velocity’ refers to the speed of data generation. How fast the data is generated and processed to meet demands determines the real potential in the data.

Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.

👉 Variability — This refers to the inconsistency which the data can show at times, hampering the process of handling and managing the data effectively.

🔴 Benefits of Big Data Processing

The ability to process Big Data brings multiple benefits, such as:

✅ Businesses can utilize outside intelligence while making decisions

Access to social data from search engines and sites like Facebook and Twitter is enabling organizations to fine-tune their business strategies.

✅ Improved customer service

Traditional customer feedback systems are being replaced by new systems designed with Big Data technologies. In these new systems, Big Data and natural language processing technologies are used to read and evaluate consumer responses.

✅ Early identification of risks to products/services, if any

✅ Better operational efficiency

Big Data technologies can be used to create a staging area or landing zone for new data before identifying what should be moved to the data warehouse. In addition, such integration of Big Data technologies and the data warehouse helps an organization offload infrequently accessed data.

Now we know what exactly Big Data is.

But if Big Data is the problem, then what is the solution??

Apache Hadoop solves the Big Data problem using the concept of HDFS (Hadoop Distributed File System). Hadoop stores the data in distributed form across different machines. There is plenty of data, but it has to be stored in a cost-effective way and processed efficiently.

Hadoop is designed to scale up from a single computer to thousands of clustered computers, with each machine offering local computation and storage. In this way, Hadoop can efficiently store and process large datasets ranging in size from gigabytes to petabytes of data.
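The real HDFS machinery is far richer, but a conceptual sketch of its two core ideas, splitting a file into fixed-size blocks and replicating each block across machines, looks roughly like this (this is not the real HDFS API and all function and node names are hypothetical; only the 128 MB block size and replication factor of 3 are actual HDFS defaults):

# Conceptual sketch only: how a file becomes replicated blocks on a cluster.
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB
REPLICATION = 3                 # HDFS default replication factor

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # Chop a file's bytes into block-sized chunks.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, machines, replication=REPLICATION):
    # Assign each block to `replication` distinct machines, round-robin,
    # so losing one machine never loses a block.
    return {
        b: [machines[(b + r) % len(machines)] for r in range(replication)]
        for b in range(num_blocks)
    }

machines = ["node-1", "node-2", "node-3", "node-4"]
blocks = split_into_blocks(b"0123456789", block_size=4)  # tiny demo "file"
print(len(blocks))                            # -> 3 blocks
print(place_replicas(len(blocks), machines))  # block index -> replica nodes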

🔴 What is Apache Hadoop?
Apache Hadoop is an open-source, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is based on the Google File System (GFS).

🔴 Why Apache Hadoop?
Hadoop runs applications on distributed systems with thousands of nodes, involving petabytes of information. It has a distributed file system, called the Hadoop Distributed File System (HDFS), which enables fast data transfer among the nodes.

👉 Apache Hadoop Architecture

🔴 How Did Google Solve the Big Data Problem?

This problem hit Google first, as their search-engine data exploded with the growth of the internet. They smartly resolved this difficulty using the theory of parallel processing, designing an algorithm called MapReduce. This algorithm distributes a task into small pieces, assigns those pieces to many computers joined over the network, and assembles all the results to form the final result dataset.

Well, this looks logical once you understand that I/O is the most costly operation in data processing. Traditionally, database systems stored data on a single machine, and when you needed data, you sent commands in the form of an SQL query. The system fetched data from the store, put it in a local memory area, processed it, and sent it back to you. That is only feasible with limited data and limited processing capability.

But with Big Data, you cannot collect all the data on a single machine. You MUST save it onto multiple computers (maybe thousands of devices). And when you need to run a query, you cannot aggregate the data into a single place due to the high I/O cost. So what the MapReduce algorithm does is run your query individually on every node where data is present, then aggregate the final result and return it to you.

It brings two significant improvements: very low I/O cost, because data movement is minimal; and less time, because your job runs in parallel across multiple machines on smaller data sets.
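A toy word count makes those phases visible: map runs on each node’s own partition, the shuffle groups the intermediate pairs by key, and reduce aggregates each group. This single-process sketch only mimics the flow; real Hadoop runs the map tasks where the data blocks live and moves only the small intermediate pairs across the network:

# Toy MapReduce word count: three phases in one process, for illustration only.
from collections import defaultdict
from itertools import chain

# Phase 0: the input is already partitioned across machines (one list per node).
partitions = [
    ["big data is a problem", "data is the new oil"],
    ["hadoop solves the big data problem"],
]

# Phase 1 (map): each node independently emits (word, 1) pairs for its own data.
def map_phase(lines):
    return [(word, 1) for line in lines for word in line.split()]

mapped = [map_phase(p) for p in partitions]  # runs in parallel in real Hadoop

# Phase 2 (shuffle): group all pairs by key so each word lands in one place.
groups = defaultdict(list)
for word, count in chain.from_iterable(mapped):
    groups[word].append(count)

# Phase 3 (reduce): aggregate each group into the final result.
result = {word: sum(counts) for word, counts in groups.items()}
print(result)  # e.g. {'big': 2, 'data': 3, 'problem': 2, ...}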

🙂 Thank you, LinuxWorld Informatics Pvt. Ltd. and Vimal13 Sir, for your incredible efforts to make the journey of all Arth learners remarkable…
