CodeKill : 10 Awesome Tools (Open Source) For Your Big Data Needs

Monday, 30 June 2014

10 Awesome Tools (Open Source) For Your Big Data Needs

Open Source tools, Big Data, Gluster, Lucene, Sqoop, Chukwa, Terracotta, Rattle, ZooKeeper, Oozie, Apache Flume, KNIME

Big Data and Open Source go hand in hand. We live in an age where both are being embraced ever more widely. Big Data is one of the fastest-growing technologies around, while Open Source is pulling the rug out from under closed software. Here are 10 awesome Open Source tools for your Big Data needs.

1.Gluster

GlusterFS is an open source, distributed file system capable of scaling to several petabytes (actually, 72 brontobytes!) and handling thousands of clients. GlusterFS clusters together storage building blocks over InfiniBand RDMA or TCP/IP interconnects, aggregating disk and memory resources and managing data in a single global namespace.
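To picture what a "single global namespace" over many storage building blocks means, here is a toy Python sketch. It is not GlusterFS code and does not reproduce GlusterFS's actual placement algorithm; the `Namespace` class, the brick names, and the simple modulo hash are all assumptions made for illustration.

```python
import hashlib


class Namespace:
    """Toy global namespace that spreads files across storage bricks."""

    def __init__(self, brick_names):
        self.bricks = {name: {} for name in brick_names}
        self._names = sorted(brick_names)

    def _brick_for(self, path):
        # Hash the path so the same file always lands on the same brick.
        digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
        return self._names[int(digest, 16) % len(self._names)]

    def write(self, path, data):
        self.bricks[self._brick_for(path)][path] = data

    def read(self, path):
        return self.bricks[self._brick_for(path)].get(path)

    def listdir(self):
        # Clients see one merged listing, whichever brick holds each file.
        return sorted(p for files in self.bricks.values() for p in files)


# Hypothetical bricks and file names
ns = Namespace(["brick1", "brick2", "brick3"])
for i in range(6):
    ns.write(f"/logs/app-{i}.log", f"contents {i}")
```

The caller only ever deals with paths; which brick holds a given file is an internal detail, which is the point of a global namespace.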

2.Lucene

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
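To make the idea of full-text search concrete, here is a minimal Python sketch of an inverted index, the kind of data structure that full-text search engines are built around. It is not Lucene's API (Lucene is a Java library); the functions `tokenize`, `build_index`, and `search` and the sample documents are made up for illustration.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query (AND search)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents
docs = {
    1: "Hadoop stores big data in HDFS",
    2: "Lucene indexes text for fast search",
    3: "Big data search with Lucene and Hadoop",
}
index = build_index(docs)
```

With this index, `search(index, "lucene")` returns the ids of documents 2 and 3. A real engine adds ranking, stemming, and on-disk storage on top of the same basic lookup.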

3.Sqoop

Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

4.Chukwa

Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.

5.Terracotta

Terracotta's BigMemory Max lets you create a distributed in-memory data store made up of pairs of servers, or stripes. Each stripe consists of an active server and a mirror server, which stores a copy of the active server's data. If the active server goes offline, the mirror takes over, ensuring high availability.
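The active/mirror arrangement can be sketched in a few lines of Python. This is only a toy model of the failover idea described above, not the BigMemory API; the `Server` and `Stripe` classes and the server names are hypothetical.

```python
class Server:
    """A single in-memory server holding key/value data."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.online = True


class Stripe:
    """One active server plus a mirror that holds a copy of its data."""

    def __init__(self, active, mirror):
        self.active = active
        self.mirror = mirror

    def put(self, key, value):
        # Write to the active server and replicate to the mirror.
        self.active.data[key] = value
        if self.mirror is not None and self.mirror.online:
            self.mirror.data[key] = value

    def get(self, key):
        return self.active.data.get(key)

    def fail_active(self):
        # Simulate the active going offline: the mirror takes over.
        self.active.online = False
        if self.mirror is None or not self.mirror.online:
            raise RuntimeError("no mirror available to take over")
        self.active, self.mirror = self.mirror, None


# Hypothetical server names
stripe = Stripe(Server("server-a"), Server("server-b"))
stripe.put("user:1", "alice")
stripe.fail_active()
```

After `fail_active()`, reads are served by the former mirror, so `stripe.get("user:1")` still returns `"alice"`.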

6.Rattle

Rattle (the R Analytical Tool To Learn Easily) presents statistical and visual summaries of data, transforms data into forms that can be readily modelled, builds both unsupervised and supervised models from the data, presents the performance of models graphically, and scores new datasets.

7.ZooKeeper

ZooKeeper is a centralised service for maintaining configuration information, naming, providing distributed synchronisation, and providing group services. All of these kinds of services are used in some form or another by distributed applications.

8.Oozie

Oozie is a workflow scheduler system to manage Apache Hadoop jobs. It is integrated with the rest of the Hadoop stack and supports several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system-specific jobs (such as Java programs and shell scripts).
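The core job of a workflow scheduler is to run each action only after the actions it depends on have finished. The Python sketch below shows that ordering step using the standard library's `graphlib` (Python 3.9+). It is not how Oozie is configured (Oozie workflows are defined in XML), and the action names and the `run_workflow` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "sqoop_import": set(),
    "pig_clean": {"sqoop_import"},
    "hive_report": {"pig_clean"},
    "shell_notify": {"hive_report"},
}


def run_workflow(workflow, run_action):
    """Run every action after all of its dependencies and return the run order."""
    order = []
    for action in TopologicalSorter(workflow).static_order():
        run_action(action)
        order.append(action)
    return order


# The callback here does nothing; a real one would submit the job.
executed = run_workflow(workflow, lambda action: None)
```

Because this example is a simple chain, `executed` lists the actions from `sqoop_import` through `shell_notify` in dependency order.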

9.Apache Flume

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop's HDFS.
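Flume data flows are usually described in terms of sources, channels, and sinks. The Python sketch below borrows that vocabulary to show the collect, buffer, and deliver steps; it is not Flume's API or configuration format, and `MemoryChannel`, `source`, `sink`, and the sample log lines are illustrative assumptions.

```python
from collections import deque


class MemoryChannel:
    """Buffer that holds events between the source and the sink."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take_batch(self, size):
        batch = []
        while self._events and len(batch) < size:
            batch.append(self._events.popleft())
        return batch


def source(lines, channel):
    """Collect raw log lines and place non-empty ones on the channel."""
    for line in lines:
        line = line.strip()
        if line:
            channel.put(line)


def sink(channel, store, batch_size=2):
    """Drain the channel in batches and append each batch to the store."""
    while True:
        batch = channel.take_batch(batch_size)
        if not batch:
            break
        store.append(batch)


channel = MemoryChannel()
store = []  # stand-in for files written to HDFS
source(["GET /index\n", "\n", "POST /login\n", "GET /logout\n"], channel)
sink(channel, store)
```

The channel decouples the two ends: the source can keep accepting log lines while the sink writes them out in batches at its own pace.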

10.KNIME

KNIME is a user-friendly graphical workbench for the entire analysis process: data access, data transformation, initial investigation, powerful predictive analytics, visualisation and reporting. The open integration platform provides over 1000 modules (nodes), including those of the KNIME community and its extensive partner network.
