Introduction to Big Data
Big Data is a collection of data that is huge in volume and grows exponentially with time. Its size and complexity are such that no traditional data management tool can store or process it efficiently. Some examples of Big Data are:
- The New York Stock Exchange generates about one terabyte of new trade data per day.
- Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated through photo and video uploads, message exchanges, comments, etc.
- A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, the data generated reaches many petabytes.
Characteristics of Big Data:
The characteristics of Big Data are commonly described by the 5 V’s:
- Volume
- Variety
- Velocity
- Value
- Veracity
- Volume
The prominent feature of any dataset is its size. Volume refers to the amount of data generated and stored in a Big Data system. We are talking about data in the petabytes and exabytes range. Such massive amounts of data necessitate advanced processing technology, far more powerful than a typical laptop or desktop CPU. As an example of a massive-volume dataset, think about Instagram or Twitter: people spend a lot of time posting pictures, commenting, liking posts, playing games, etc. With this ever-growing data, there is huge potential for analysis, pattern finding, and much more.
- Variety
Variety refers to the heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications.
- Velocity
The term ‘velocity’ refers to the speed at which data is generated. How fast data is generated and processed to meet demand determines the real potential of the data.
Big Data velocity deals with the speed at which data flows in from sources such as business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
- Value
Value refers to the benefits that your organisation derives from the data. Does the data match your organisation’s goals? Does it help your organisation improve? Value is among the most important core characteristics of Big Data.
- Veracity
Veracity refers to the trustworthiness and quality of the data. If the data is not trustworthy or reliable, the value of Big Data becomes questionable. This is especially true when working with data that is updated in real time. Therefore, data authenticity requires checks and balances at every stage of Big Data collection and processing.
Types of Big Data:
- Structured Data:
Structured data refers to the data that you can process, store, and retrieve in a fixed format. It is highly organised information that you can readily and seamlessly store and access from a database by using simple algorithms. This is the easiest type of data to manage as you know what data format you are working with in advance. For example, the data that a company stores in its databases in the form of tables and spreadsheets is structured data.
- Unstructured Data:
Any data with an unknown form or structure is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges when it comes to processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Nowadays, organisations have a wealth of data available to them, but they often do not know how to derive value from it because the data is in its raw, unstructured form.
- Semi-structured data:
Semi-structured data refers to data that is not captured or formatted in conventional ways. Semi-structured data does not follow the format of a tabular data model or relational databases because it does not have a fixed schema. However, the data is not completely raw or unstructured, and does contain some structural elements such as tags and organisational metadata that make it easier to analyse.
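To make the distinction concrete, below is a minimal sketch of what a semi-structured record might look like and how it could be handled in code. It assumes Java 15+ (for text blocks) and the Jackson library (jackson-databind) on the classpath, and the field names (user, tags, comments, title) are purely hypothetical; the point is that the record carries its own tags instead of conforming to a fixed table schema.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class SemiStructuredExample {
    public static void main(String[] args) throws Exception {
        // A hypothetical semi-structured record: no fixed schema, but
        // self-describing tags (field names) give it some structure.
        String json = """
                {
                  "user": "alice",
                  "uploaded": "2024-05-01T10:15:00Z",
                  "tags": ["holiday", "beach"],
                  "comments": [
                    { "by": "bob", "text": "Nice photo!" }
                  ]
                }
                """;

        // Jackson's tree model navigates the tags without a predefined schema.
        ObjectMapper mapper = new ObjectMapper();
        JsonNode record = mapper.readTree(json);

        System.out.println("user  = " + record.get("user").asText());
        System.out.println("#tags = " + record.get("tags").size());
        // Optional fields that may be absent are handled gracefully with path().
        System.out.println("title = " + record.path("title").asText("<missing>"));
    }
}
```

Because the structure is self-describing, the same code keeps working if a record gains or loses optional fields, which a fixed relational schema could not tolerate without a migration.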
Challenges of Big Data:
Some of the key challenges of Big Data include:
- Big Volume: Big Data is characterised by its massive volume. Analysing such large data sets requires specialised infrastructure, hardware, and software, which can be expensive to acquire and maintain.
- Speed of generation of Data: The speed at which data is generated and updated is another challenge of Big Data. Real-time data processing and analysis require advanced tools and techniques that can handle rapid data ingestion, processing, and output.
- Different varieties of Data: Big Data is generated from a wide variety of sources, including structured and unstructured data. This requires advanced tools and techniques that can handle multiple data formats and types, such as text, audio, video, and images.
- Quality of Data: Data quality is a crucial challenge of Big Data. Large data sets often contain errors, inconsistencies, and inaccuracies that can affect the accuracy and reliability of the analysis. Ensuring data quality requires specialised data cleansing and validation tools and techniques.
- Value of Data: Turning Big Data into actionable insights and value is another challenge. Analysing such large data sets can be time-consuming, and it can be difficult to extract relevant insights that can be used to drive business decisions.
- Privacy and Security: Big Data often includes sensitive information such as personal and financial data, making it a potential target for cybercriminals. Protecting the privacy and security of Big Data requires specialised tools and techniques, including data encryption, access controls, and cybersecurity measures.
- Ethical concerns: With Big Data come ethical concerns about privacy, bias, and discrimination. The use of Big Data can have unintended consequences and create ethical dilemmas, such as how to balance the benefits of data analysis with individual privacy rights and societal norms.
Advantages of Big Data:
Big Data can provide several advantages to organisations, including:
- Innovation: Big Data can provide new insights and ideas that can drive innovation and improve product development. By analysing data from a variety of sources, businesses can identify new market trends and customer needs that can inform product development and innovation.
- Improved healthcare: Big data can be used to analyse patient data and provide personalised treatment plans. It can also be used to track disease outbreaks, monitor the spread of diseases, and identify health trends in different populations.
- Enhanced education: Big data can be used to analyse student performance and provide personalised learning experiences. It can also be used to improve student outcomes, measure the effectiveness of teaching methods, and develop new educational tools and resources.
- Increased efficiency: Big data can be used to optimise business processes and improve operational efficiency. It can also be used to reduce waste, improve resource allocation, and increase productivity.
- Improved decision-making: Big data can provide insights that can be used to make better decisions. It can be used to identify opportunities, mitigate risks, and optimise outcomes.
- Enhanced public safety: Big data can be used to analyse crime data and provide insights that can be used to reduce crime rates. It can also be used to improve emergency response times, monitor traffic patterns, and improve transportation safety.
- Better customer service: Big data can be used to analyse customer data and provide personalised service experiences. It can also be used to identify customer needs and preferences, develop new products and services, and improve customer satisfaction.
Introduction to Hadoop
What is Hadoop?
Hadoop is an Apache open-source framework, written in Java, that allows distributed processing of large datasets across clusters of computers using simple programming models. It is meant for storing and processing Big Data in a distributed manner, and it is one of the most widely used solutions for handling Big Data challenges.
Some important features of Hadoop are:
- Open Source
Hadoop is an open-source framework, which means it is available free of cost. Users are also allowed to change the source code as per their requirements.
- Distributed Processing
Hadoop supports distributed processing of data, which means faster processing. The data in Hadoop HDFS is stored in a distributed manner, and MapReduce is responsible for the parallel processing of that data.
- Fault Tolerance
Hadoop is highly fault tolerant. By default, it creates three replicas of each data block on different nodes, so the data survives the failure of an individual node; a minimal configuration sketch follows this list.
- Reliability
Hadoop stores data on the cluster in a reliable manner that is independent of any single machine, so the data stored in the Hadoop environment is not affected by the failure of one machine.
- Scalability
Hadoop is compatible with commodity hardware, and we can easily add nodes to or remove nodes from the cluster as requirements change.
- High Availability
The data stored in Hadoop remains accessible even after a hardware failure; in that case, it can be accessed from another node that holds a replica.
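As a small illustration of the fault-tolerance point above, the sketch below shows how the default block replication factor could be set through Hadoop's Configuration API. This is only an illustrative sketch: on a real cluster the value normally lives in hdfs-site.xml, and the Hadoop client libraries must be on the classpath.

```java
import org.apache.hadoop.conf.Configuration;

public class ReplicationConfigExample {
    public static void main(String[] args) {
        // On a real cluster this property is normally set in hdfs-site.xml;
        // it is set programmatically here purely for illustration.
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3); // number of replicas kept per block

        System.out.println("dfs.replication = " + conf.getInt("dfs.replication", -1));
    }
}
```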
Hadoop Architecture
At its core, Hadoop has two major layers namely:
- Processing/Computation layer (MapReduce), and
- Storage layer (Hadoop Distributed File System).
1. MapReduce
MapReduce is a programming model for data processing. Hadoop can run MapReduce programs written in Java, Ruby, and Python. MapReduce programs are inherently parallel, so very large-scale data analysis can be done quickly. In MapReduce programming, jobs (applications) are split into a set of map tasks and reduce tasks.
The map task takes care of loading, parsing, transforming, and filtering the input data.
The reduce task is responsible for grouping and aggregating the data produced by the map tasks to generate the final output.
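The classic illustration of this split is word counting. The sketch below is a minimal word-count job written against the Hadoop Java MapReduce API: the mapper parses each input line and emits (word, 1) pairs, and the reducer groups by word and sums the counts. The class names and command-line input/output paths are placeholders, and a real job would be packaged into a JAR before submission.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: parse each input line and emit (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: group by word and aggregate the counts into the final output.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A job like this would typically be submitted with something like `hadoop jar wordcount.jar WordCount /input /output`, where /input and /output are hypothetical HDFS paths and the output directory must not already exist.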
2. HDFS
The Hadoop Distributed File System (HDFS) was developed using a distributed file system design and runs on commodity hardware. Unlike many other distributed systems, HDFS is highly fault tolerant even though it is designed for low-cost hardware.
HDFS holds very large amounts of data and provides easy access to it. To store such huge data, files are spread across multiple machines. These files are stored in a redundant fashion to protect the system from possible data loss in case of failure. HDFS also makes data available to applications for parallel processing.
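As a brief sketch of how an application talks to HDFS, the example below uses the Hadoop FileSystem Java API to write a small file, read it back, and report its replication factor. The NameNode URI (hdfs://namenode:9000) and the file path are assumptions for illustration; in practice the filesystem URI comes from the cluster's fs.defaultFS setting.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; use your cluster's fs.defaultFS value.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path file = new Path("/user/demo/hello.txt"); // hypothetical path

        // Write a small file; HDFS transparently splits files into blocks and
        // replicates each block across DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back and copy its contents to stdout.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        // Report how many replicas of the file's blocks HDFS keeps (default 3).
        short replication = fs.getFileStatus(file).getReplication();
        System.out.println("replication factor = " + replication);

        fs.close();
    }
}
```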
Advantages of Hadoop
- The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines, in turn utilising the underlying parallelism of the CPU cores.
- Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA); rather, the Hadoop library itself is designed to detect and handle failures at the application layer.
- Servers can be added or removed from the cluster dynamically and Hadoop continues to operate without interruption.
- Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms since it is Java based.