I have tried to provide simple explanations for some of the most important technologies and terms you will come across as you get into big data.
Here are some of the key terms:
- Algorithm: A mathematical formula or statistical process used to perform an analysis of data.
- Analytics: The practice of drawing insights from raw data to support decisions, such as reviewing your own spending to plan for the upcoming year. Do the same exercise on the tweets or Facebook posts of an entire city's population and you are talking big data analytics: making inferences and telling stories with very large sets of data.
- Batch processing: An efficient way of processing high volumes of data, where a group of transactions is collected over a period of time and then processed together.
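As a rough sketch of the idea (the transaction values and batch size here are invented for illustration), batch processing means letting work accumulate and then handling it as a group:

```python
# Minimal batch-processing sketch: transactions accumulate in a buffer
# and are processed together once the batch is full.
from typing import List

BATCH_SIZE = 3  # process once this many transactions have accumulated

def process_batch(batch: List[float]) -> float:
    """Process a whole group of transactions at once, e.g. total them."""
    return sum(batch)

buffer: List[float] = []
batch_totals: List[float] = []

for amount in [10.0, 25.5, 4.5, 100.0, 1.0, 9.0]:
    buffer.append(amount)
    if len(buffer) >= BATCH_SIZE:
        batch_totals.append(process_batch(buffer))
        buffer.clear()

print(batch_totals)  # one entry per processed batch
```

The efficiency comes from amortizing per-run overhead (startup, I/O, connections) across many records instead of paying it per transaction.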
- Cassandra: A popular open source database management system managed by The Apache Software Foundation, which can be credited with many big data technologies. Cassandra was designed to handle large volumes of data across distributed servers.
- Dark Data: Basically, this refers to all the data that is gathered and processed by enterprises but not used for any meaningful purpose; hence it is ‘dark’ and may never be analyzed. It could be social network feeds, call center logs, meeting notes and what have you. By many estimates, anywhere from 60 to 90% of all enterprise data may be ‘dark data’.
- Data Lake: A large repository of enterprise-wide data in raw format. Data warehouses are similar in concept in that they, too, are repositories for enterprise-wide data, but they hold data in a structured format after cleaning and integration with other sources. Data warehouses are typically (though not exclusively) used for conventional data. A data lake supposedly makes it easy to access enterprise-wide data, but you really need to know what you are looking for and how to process it and make intelligent use of it.
- ETL: ETL stands for extract, transform, and load. It refers to the process of ‘extracting’ raw data, ‘transforming’ it by cleaning and enriching it to make it fit for use, and ‘loading’ it into the appropriate repository for the system’s use. Even though it originated with data warehouses, ETL processes are also used when ‘ingesting’, i.e. taking in data from external sources, in big data systems.
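A minimal sketch of the three ETL steps (the source rows and cleaning rules are made up for illustration) might look like this:

```python
# Minimal ETL sketch: extract raw rows, transform them into clean
# records, and load the survivors into a target store.
raw_rows = ["  Alice , 34 ", "Bob,  ", "Carol, 29"]  # 'extract': raw CSV-like strings

def transform(row: str):
    """Clean a raw row: strip whitespace, drop records with missing age."""
    name, age = [field.strip() for field in row.split(",")]
    if not age:
        return None  # reject rows that are not fit for use
    return {"name": name, "age": int(age)}

warehouse = []  # 'load' target: stands in for the real repository
for row in raw_rows:
    record = transform(row)
    if record is not None:
        warehouse.append(record)

print(warehouse)
```

Real ETL pipelines do the same thing at scale, with the transform step carrying most of the business logic.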
- Hadoop: When people think of big data, they immediately think about Hadoop. Hadoop (with its cute elephant logo) is an open source software framework that consists of what is called the Hadoop Distributed File System (HDFS) and allows for storage, retrieval, and analysis of very large data sets using distributed hardware. If you really want to impress someone, talk about YARN (Yet Another Resource Negotiator) which, as the name suggests, is a resource scheduler. The Apache Software Foundation, which came up with Hadoop, is also responsible for Pig, Hive, and Spark.
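To make the distributed-processing idea concrete, the classic MapReduce model Hadoop popularized (map, shuffle, reduce) can be illustrated with a toy word count in a single Python process; this is only a sketch of the model, not how you would actually write a Hadoop job:

```python
# Toy MapReduce-style word count: the map phase emits (word, 1) pairs,
# the shuffle groups them by key, and the reduce phase sums each group.
# Real Hadoop runs these phases in parallel across many machines.
from collections import defaultdict

documents = ["big data big ideas", "data lakes hold raw data"]

# Map phase: emit (word, 1) for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group emitted values by key (the word).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: sum the counts for each word.
counts = {word: sum(vals) for word, vals in grouped.items()}
print(counts)
```

Because each map call and each reduce call is independent, the work splits cleanly across distributed hardware, which is the whole point of the model.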
- In-memory computing: In general, any computing that can be done without accessing I/O is expected to be faster. In-memory computing is a technique of moving the working datasets entirely into a cluster’s collective memory and avoiding writing intermediate calculations to disk. Apache Spark is an in-memory computing system, and it has a huge advantage in speed over I/O-bound systems like Hadoop’s MapReduce.
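The contrast can be sketched in plain Python (a single-machine toy, not a cluster): one version persists its intermediate result to disk and reads it back, the other keeps everything in RAM.

```python
# Sketch of the in-memory idea: keep an intermediate dataset in process
# memory and reuse it directly, instead of writing it to disk and reading
# it back (as MapReduce does between jobs). Purely illustrative.
import json
import os
import tempfile

data = list(range(1_000))

# Disk-bound style: persist the intermediate result, then reload it.
squares = [x * x for x in data]
path = os.path.join(tempfile.mkdtemp(), "intermediate.json")
with open(path, "w") as f:
    json.dump(squares, f)          # intermediate result hits the disk
with open(path) as f:
    reloaded = json.load(f)
disk_total = sum(reloaded)

# In-memory style: the intermediate values never leave RAM.
mem_total = sum(x * x for x in data)

assert disk_total == mem_total     # same answer, far fewer I/O round trips
```

At cluster scale the gap is dramatic, because disk and network round trips between stages dominate the runtime of iterative workloads.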
- IoT: The latest buzzword is the Internet of Things, or IoT. IoT is the interconnection of computing devices embedded in everyday objects (sensors, wearables, cars, fridges, etc.) via the internet, enabling them to send and receive data. IoT generates huge amounts of data, presenting many big data analytics opportunities.
- R: ‘R’ is a programming language that is very well suited to statistical computing.
- Spark (Apache Spark): Apache Spark is a fast, in-memory data processing engine that can efficiently execute streaming, machine learning, or SQL workloads requiring fast iterative access to datasets. Spark is generally a lot faster than the MapReduce we discussed earlier.
How many of these concepts were new additions to your knowledge of big data?
There are more useful terms in this article.