Jennifer Kauffman - Marketing Major, Digital Marketing Minor

E-portfolio Points assignment

Brief Overview

MapR takes its name from MapReduce, a programming model used to process big data with a parallel, distributed algorithm on a cluster. MapR is a software company that develops and sells Hadoop-derived software. The company contributes to projects such as HBase, Pig, Apache Hive, and Apache ZooKeeper. MapR promises to provide data protection, no single points of failure, and improved performance.

Case Study

Ancestry.com has more than 12 billion records that are part of a 10-petabyte data store, and it relies on MapR’s high availability. Its more than 30,000 record collections include birth, death, census, military, and immigration records. There are five ways that the company makes this data useful to customers. The first way is by mining data patterns in search behavior. For example, only a section of users will be interested in newly released Mexican census data. The second way is by mining data to provide product development direction to the product team. It analyzes behavior such as where a subscriber may get stuck or where they leave the service. The third way is by relying on big data stores to develop new statistical approaches, such as record linking and search relevance algorithms. The fourth way is by using data forensics to mine data for security purposes and ensure that information stays secure. The fifth way is by genotyping to provide information about genetic genealogy. Customers are tested with molecular tests, and computational analysis is used to predict a person’s ethnicity and identify relatives in the database. The solution relies on data processed by three clusters, which are used for DNA matching, machine learning, and data mining. The high availability that MapR provides enables the company to run different tasks on the same cluster.

How this relates to MIS2502 material

The topics relate to MIS2502 through the use of data mining, which draws on artificial intelligence, pattern recognition, and statistics. In a large set of data, these types of patterns are not obvious, and analytics software is needed to predict an outcome. Just as we did in SAS, implicit, previously unknown, and potentially useful information is extracted from the data, and then it is explored to discover meaningful patterns, such as the patterns of searches for relatives on Ancestry.com. MapR also uses variables to predict the unknown values of other variables, like the certainty of relevance. In the map step, the master node takes the input and divides it into smaller sub-problems, which are distributed to worker nodes. A worker node can divide its sub-problem again, which eventually leads to a tree structure. In the reduce step, the master node collects the answers to the sub-problems and combines them to form the output for the original problem. This is similar to how SAS gives an output, by providing a model to predict values of dependent variables based on one or more predictor variables. For example, for every Ancestry DNA customer, 700,000 distinct variable regions in DNA are measured and analyzed, resulting in 10 million cousin predictors for every user to date. As we learned in SAS, it is possible to run different tasks on the same cluster. MapR also allows different tasks to run on the same cluster because of the high availability of this data.
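
The map and reduce steps described above can be illustrated with a small sketch. This is not MapR or Hadoop code; it is a minimal Python imitation that runs on one machine, and the sample records, the function names, and the choice of three chunks are all assumptions made only for illustration. It counts how many records belong to each collection type by dividing the input into sub-problems, mapping each one, and then combining the answers.

```python
from collections import defaultdict

# Hypothetical sample data: each record is (record_id, collection type).
RECORDS = [
    (1, "census"),
    (2, "birth"),
    (3, "census"),
    (4, "military"),
    (5, "birth"),
    (6, "census"),
]


def split_input(records, n_chunks):
    """Divide the input into smaller sub-problems, one chunk per worker."""
    return [records[i::n_chunks] for i in range(n_chunks)]


def map_step(chunk):
    """Each worker emits a (key, 1) pair for every record in its chunk."""
    return [(collection, 1) for _, collection in chunk]


def reduce_step(mapped_chunks):
    """Collect the workers' answers and combine them into one output."""
    totals = defaultdict(int)
    for pairs in mapped_chunks:
        for key, count in pairs:
            totals[key] += count
    return dict(totals)


def count_by_collection(records, n_chunks=3):
    chunks = split_input(records, n_chunks)
    # On a real cluster these map calls would run in parallel on worker nodes;
    # here they simply run one after another.
    mapped = [map_step(chunk) for chunk in chunks]
    return reduce_step(mapped)


result = count_by_collection(RECORDS)
print(result)
```

The number of chunks does not change the final answer, only how the work is divided, which is the reason the same job can be spread across many nodes in a cluster.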

Works Cited

“Ancestry.com Relies on the High Availability of MapR to Run Their DNA Pipeline Constantly, with No Interruptions.” MapR.com. N.p., n.d. Web.

“MapR Technologies Announces HP Vertica Analytics Platform on MapR.” MapR.com. N.p., n.d. Web. 29 Apr. 2014.

“MapR Technologies Announces Presentations on Big Data and Hadoop at Upcoming Big Data Conferences in May.” MarketWatch. N.p., n.d. Web. 29 Apr. 2014.

Jennifer Lynn Kauffman