ITCS3190-001 Homework - 1
Survey Report: Big Data

I. INTRODUCTION

By Big Data we refer exactly to what its own name indicates: the treatment and analysis of huge repositories of data, so disproportionately large that it is impossible to treat them with conventional database and analytical tools. This growth in the amount of data is driven by web pages, image and video applications, social networks, mobile devices, apps, sensors, the Internet of Things, etc.

Figure 1. The four Vs of Big Data.

Big Data is defined as any data source that has at least three shared characteristics [1]:
1- Extremely large data volumes.
2- Extremely high data speed.
3- Extremely broad variety of data.

In most cases, in order to effectively process and handle Big Data, it must be combined with structured data with the help of one or more processing frameworks (such as Hadoop or Spark).

II. RELATED TECHNOLOGY

Hadoop, which marked a milestone in batch data processing, gave way to Spark as a reference platform for the analysis of large amounts of data in real time. Spark has the advantage of running up to 100 times faster in memory and up to 10 times faster on disk than Hadoop and its MapReduce paradigm [2]. Data intake is the next challenge.

Figure 2. MapReduce vs Spark processing.

Data intake refers to the process by which the data obtained in real time is captured temporarily for further processing. On the relevant time scale, that processing is practically instantaneous. This happens frequently in the world of sensors and the Internet of Things. Data streaming begins at the stage of data ingestion: we have to connect to data sources in real time to allow instant processing. In the era of Business Intelligence platforms (Tableau and MicroStrategy), conventional methods of extracting data from the source and bringing it into a warehouse are not up to the task.
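The data intake idea described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the sensor readings, the `sensor_source` and `process_stream` names, and the alert threshold are all hypothetical, and an in-memory `queue.Queue` stands in for the temporary capture stage that a real ingestion tool would provide.

```python
import queue
import threading


def sensor_source(buffer, n_readings):
    """Hypothetical source: pushes readings into the buffer as they arrive."""
    for seq in range(n_readings):
        buffer.put({"sensor_id": "sensor-1", "seq": seq, "value": 20.0 + seq})
    buffer.put(None)  # sentinel: tells the consumer there is no more data


def process_stream(buffer, processed):
    """Consumes readings from the buffer as soon as they become available."""
    while True:
        reading = buffer.get()  # blocks until a reading has been captured
        if reading is None:
            break
        # Placeholder processing step (assumption): flag values above 21.0.
        reading["alert"] = reading["value"] > 21.0
        processed.append(reading)


def run_intake(n_readings=5):
    """Wires source -> temporary buffer -> processing and returns the results."""
    buffer = queue.Queue(maxsize=100)  # temporary capture stage
    processed = []
    consumer = threading.Thread(target=process_stream, args=(buffer, processed))
    consumer.start()
    sensor_source(buffer, n_readings)
    consumer.join()
    return processed


if __name__ == "__main__":
    for item in run_intake():
        print(item)
```

The bounded queue is the point of the sketch: data is held only briefly, and the consumer works on each reading as soon as it is captured rather than waiting for a batch to be loaded into a warehouse.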
Several tools exist to handle Big Data streaming:

Flume: a tool for data ingestion in real-time environments. It has three main components: Source (the data source), Channel (the channel through which data passes) and Sink (where data is persisted). For environments that demand fast response times, it is a very good alternative to traditional warehouse techniques [3].

Kafka: a distributed and replicated storage system, very fast and agile in reads and writes. It works as a messaging service and was created by LinkedIn. As a distributed queueing system it is one of the best, but there are others such as RabbitMQ, and cloud solutions such as AWS Kinesis [3].

There are many other technologies that can help with the challenge of data intake. Nevertheless, data intake is not the only problem when processing data in order to gain business insight, which is the reason why all these systems and technologies exist.

III. RELATED WORK

Business analysis is one of the main drivers of Big Data technology. Before making a decision, businesses look at data in order to find insights that can help with the following:
· Cost reduction
· Faster, better decision making
· New products and services

The need for companies to extract value from data has increased the need for Cloud technologies, or cloud computing. Among these technologies we find elastic computing, a computing model comparable to the efficient use of electricity.
The service is provided or not depending on the demand for a given resource. An elastic computing system adapts the use of computational resources to the data (its size, its type, and its speed) and can therefore give a more effective response.

The phase in which the data are collected is not the same as the phase in which these data are processed to generate reports and draw key conclusions beneficial to the business.
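As a rough illustration of the elastic idea above, the sketch below sizes a pool of workers from the current load. The function name `workers_needed`, the capacity of 1000 records per second per worker, the worker limits, and the per-phase loads are all illustrative assumptions, not figures from any cloud provider.

```python
import math


def workers_needed(records_per_second, records_per_worker=1000,
                   min_workers=1, max_workers=20):
    """Return how many workers to run for the given load (assumed capacities)."""
    needed = math.ceil(records_per_second / records_per_worker)
    # Clamp to the allowed range: never below the minimum, never above the cap.
    return max(min_workers, min(max_workers, needed))


if __name__ == "__main__":
    # Hypothetical loads for different phases of a Big Data workflow.
    phase_loads = {"collection": 800, "report generation": 7500, "idle": 0}
    for phase, load in phase_loads.items():
        print(f"{phase}: {workers_needed(load)} worker(s)")
```

With these assumed numbers, the collection phase needs far fewer workers than report generation, so the difference can be handed back to the provider between phases.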
These computing resources in the cloud can be managed with greater efficiency, depending on when we perform the Big Data analysis. In this way resources can be released so that another company can use them.

This type of effective use of Cloud systems for Big Data allows companies to offer their services in the cloud in three modes depending on the needs of use: infrastructure (IaaS), pre-configured platform (PaaS) and software (SaaS) [4].

The infrastructure mode gives low-level control over servers: operating system, memory usage, disk storage technology, etc. With the pre-configured environment mode, the provider offers as a Cloud service the programming languages you need (Java, Python, Ruby, etc.) along with Apache Hadoop and Apache Spark. In this mode you only have to worry about collecting and analyzing the data. The last mode is software as a Cloud service, which provides you with an environment to work directly with Big Data.

IV. CONCLUSION

Data storage is no longer a concern for the common population. This is especially true for businesses and different firms. Alongside cheap storage, we live in an era that is driven purely by data generated by every single technology we come across. The issue is no longer lack of data, but rather how to process this massive amount of information in order to gain value. Programming skills, statistics, and business insight are the core of today's data science.
As time moves forward, we expect data to be accessible in a matter of seconds, which would lead businesses and firms to move towards a cloud-based model and provide clients with easy-to-use tools to access their media, documents, records, etc.

REFERENCES

[1] Wolfe, Patrick. "Making Sense of Big Data." Proceedings of the National Academy of Sciences of the United States of America, November 15, 2013. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3831484/.
[2] "Apache Spark vs Hadoop: Choosing the Right Framework." Edureka. January 04, 2018. https://www.edureka.co/blog/apache-spark-vs-hadoop-mapreduce.
[3] "Real-Time Data Streaming Tools And Technologies – An Overview." Algoworks. July 15, 2017. http://www.algoworks.com/blog/real-time-data-streaming-tools-and-technologies/.
[4] Forrest, Conner. "SaaS, PaaS, and IaaS: Understand the differences." ZDNet. November 07, 2017. http://www.zdnet.