ITCS3190-001 Homework -1
Survey Report: Big Data
By Big Data we refer to exactly what the name indicates: the treatment and analysis of huge repositories of data, so disproportionately large that it is impossible to handle them with conventional database and analytical tools. This flood of data is generated by web pages, image and video applications, social networks, mobile devices, apps, sensors, the Internet of Things, and more.
Figure 1. The four Vs of Big Data.
Big Data is commonly defined as any data source that has at least three shared characteristics:
1-Volume of data
2-Velocity of data
3-Variety of data
In most cases, in order to process and handle Big Data effectively, it must be combined with structured data with the help of one or more conventional data-processing frameworks (such as Hadoop or Spark).
Hadoop, which marked a milestone for processing data in batch, gave way to Spark as the reference platform for the analysis of large amounts of data in real time. Spark has the advantage of running up to 100 times faster in memory, and up to 10 times faster on disk, than Hadoop and its MapReduce paradigm [2]. Data intake is the next challenge.
Figure 2. MapReduce vs Spark processing.
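The MapReduce paradigm that Hadoop popularized can be illustrated with a small, self-contained sketch: a map phase emits (word, 1) pairs and a reduce phase sums them per key. This is an in-memory Python analogue for illustration only, not Hadoop itself.

```python
from collections import Counter
from functools import reduce

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the line.
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(counts, pair):
    # Reduce: sum the emitted counts for each word key.
    word, n = pair
    counts[word] += n
    return counts

lines = ["big data needs big tools", "spark processes data in memory"]
mapped = [pair for line in lines for pair in map_phase(line)]
totals = reduce(reduce_phase, mapped, Counter())
print(totals["big"])   # 2
print(totals["data"])  # 2
```

In Hadoop the map and reduce phases run on many machines over files on disk, which is exactly why Spark's in-memory execution is so much faster for iterative workloads.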
Data intake refers to the process by which data obtained in real time is captured temporarily for further processing. That processing moment is practically instantaneous on the relevant time scale. This pattern is especially common in the world of sensors and the Internet of Things.
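A rough sketch of this capture-then-process pattern: the snippet below (a hypothetical in-memory buffer, not any particular product) accumulates simulated sensor readings temporarily and hands them off in batches for downstream processing.

```python
from collections import deque

class IngestBuffer:
    """Temporarily capture real-time readings until a batch is ready."""
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = deque()

    def ingest(self, reading):
        # Capture one incoming reading (e.g. from a sensor).
        self.buffer.append(reading)

    def drain(self):
        # Hand off up to one batch for near-instant downstream processing.
        batch = []
        while self.buffer and len(batch) < self.batch_size:
            batch.append(self.buffer.popleft())
        return batch

buf = IngestBuffer(batch_size=3)
for temp in [21.5, 21.7, 22.0, 22.4]:   # simulated sensor readings
    buf.ingest(temp)
first = buf.drain()
print(first)  # [21.5, 21.7, 22.0]
```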
Data streaming begins at the stage of data ingestion. We have to connect to data sources in real time, as noted above, to allow instant processing. In the era of Business Intelligence platforms (Tableau and MicroStrategy), conventional methods of extracting data from the source and bringing it into a warehouse are not up to the task. However, the following tools exist to handle Big Data streaming:
Flume: a tool for data ingestion in real-time environments. It has three main components: Source (the data source), Channel (the channel through which data will be processed) and Sink (where data is persisted). For environments that are demanding in terms of response speed, it is a very good alternative to traditional warehouse techniques [3].
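Flume's Source/Channel/Sink pipeline can be mimicked with Python's standard queue module. This is a simplified analogue of the pattern, not Flume's actual API:

```python
import queue

def source(events, channel):
    # Source: push incoming events onto the channel.
    for e in events:
        channel.put(e)

def sink(channel, store):
    # Sink: drain the channel and persist each event.
    while not channel.empty():
        store.append(channel.get())

channel = queue.Queue()   # Channel: buffers events between source and sink
store = []                # stand-in for durable storage
source(["evt1", "evt2", "evt3"], channel)
sink(channel, store)
print(store)  # ['evt1', 'evt2', 'evt3']
```

The value of the middle Channel is decoupling: the Source can keep accepting events even when the Sink is momentarily slower.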
Kafka: a distributed and replicated storage system that is very fast and agile for both reads and writes. It works as a messaging service and was created by LinkedIn. As a distributed queueing system it is one of the best, but there are others, such as RabbitMQ, and cloud solutions such as AWS Kinesis [3].
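Kafka's core abstraction, an append-only log per topic that consumers read from an offset of their choosing, can be sketched in a few lines of Python. This is a toy single-process analogue, not the Kafka client API:

```python
class MiniLog:
    """Append-only log per topic, read by offset (Kafka-style, simplified)."""
    def __init__(self):
        self.topics = {}

    def produce(self, topic, message):
        # Append the message and return its offset in the topic log.
        log = self.topics.setdefault(topic, [])
        log.append(message)
        return len(log) - 1

    def consume(self, topic, offset):
        # Consumers track their own offset and can re-read from any position.
        return self.topics.get(topic, [])[offset:]

log = MiniLog()
log.produce("clicks", {"user": 1})
log.produce("clicks", {"user": 2})
all_msgs = log.consume("clicks", 0)   # both messages
later = log.consume("clicks", 1)      # only the second
print(all_msgs, later)
```

Because reads never remove messages, many independent consumers can process the same stream at their own pace, which is what distinguishes this model from a classic destructive queue.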
There are many other technologies that can help with the challenge of data intake. Nevertheless, data intake is not the only problem when processing data in order to gain business insight; that insight is one of the main drivers of Big Data technology. Before making a decision, businesses look at data in order to find possible insights that can help with the following:
better decision making
products and services
The need for companies to extract value from data has increased demand for Cloud technologies, or cloud computing. Among these technologies we find elastic computing, a model similar to the efficient use of electricity: the service is provided, or not, depending on the demand for a certain resource. Elastic computing adapts the use of computational resources to the data (its size, its type, its speed) and therefore gives a more effective response.
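A minimal sketch of the elastic idea, assuming a hypothetical workload measured by queue depth: the worker count grows and shrinks with demand, within a budgeted cap, much like metered electricity use.

```python
import math

def workers_needed(queue_depth, per_worker_capacity, max_workers):
    # Scale the worker count with demand: pay for capacity
    # only while the load actually requires it.
    needed = math.ceil(queue_depth / per_worker_capacity)
    return min(max(needed, 1), max_workers)  # at least 1, capped by budget

print(workers_needed(0, 100, 10))     # 1: idle, minimum footprint
print(workers_needed(950, 100, 10))   # 10: scaled up for a burst
print(workers_needed(5000, 100, 10))  # 10: capped at the budget
```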
The phase in which the data are collected is not the same as the phase in which those data are processed to generate reports and draw key conclusions beneficial to the business. Computing resources in the cloud can therefore be managed with greater efficiency, depending on when we perform the Big Data analysis; in this way services can be released so that another company can use them.
This type of effective use of Cloud systems for Big Data allows companies to offer their services in the cloud in three modes, depending on the needs of use: infrastructure (IaaS), pre-configured platform (PaaS) and software (SaaS) [4].
The infrastructure mode facilitates low-level control of servers: operating system, memory usage, disk storage technology, etc. With the pre-configured platform mode, the provider offers you in the Cloud service the programming languages you need (Java, Python, Ruby, etc.) along with Apache Hadoop and Apache Spark; in this mode you only have to worry about collecting and analyzing the data. The last mode is software as a Cloud service, which provides you with an environment to work directly with Big Data.
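The responsibility split behind the three modes can be summarized as data. The mapping below is a simplified, illustrative sketch rather than any vendor's official definition:

```python
# Simplified view of who manages what in each cloud service model.
# (Illustrative only; real offerings differ between providers.)
models = {
    "IaaS": {"provider": ["hardware", "virtualization"],
             "customer": ["OS", "runtime", "data", "application"]},
    "PaaS": {"provider": ["hardware", "virtualization", "OS", "runtime"],
             "customer": ["data", "application"]},
    "SaaS": {"provider": ["hardware", "virtualization", "OS", "runtime",
                          "application"],
             "customer": ["data"]},
}

for name, split in models.items():
    print(f"{name}: customer manages {', '.join(split['customer'])}")
```

Moving from IaaS to SaaS, more layers shift to the provider, leaving the customer free to focus on the data itself.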
Data storage is no longer a concern for the general population, and this is especially true for businesses and firms. Alongside cheap storage, we live in an era that is purely driven by the data generated by every technology we come across. The issue is no longer a lack of data, but rather how to process this massive amount of information in order to gain value. Programming skills, statistics, and business insight are the core of today’s data science. As time moves forward, we expect data to be accessible in a matter of seconds, which will lead businesses and firms to move toward a cloud-based model and provide clients with easy-to-use tools to access their media, documents, records, etc.
References
1. Wolfe, Patrick. “Making Sense of Big Data.” Proceedings of the National Academy of Sciences of the United States of America, November 15, 2013. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3831484/.
2. “Spark vs Hadoop: Choosing the Right Framework.” Edureka, January 04, 2018.
3. “Data Streaming Tools And Technologies – An Overview.” Algoworks, July 15,
4. Conner. “SaaS, PaaS, and IaaS: Understand the differences.” ZDNet, November 07, 2017. http://www.zdnet.com/article/saas-paas-and-iaas-understand-the-differences/.