
Is HDFS efficient with Multi-Join Query on large-Scale Data?

Question added by Ebrahim Alhussam, Marketing Manager - Social Media and ERP, Fission Technology
Date posted: 2014/10/06
mohammad feroz feroz
by mohammad feroz feroz, Software Designer, Royal Bank of Scotland

HDFS involves extensive disk I/O in its MapReduce processing. Therefore, if the number of nodes is high, parallelism will be high and the query will run efficiently. With fewer nodes, query performance will be poor owing to higher latency.

Ahmed Lateef
by Ahmed Lateef, Member Technical Services, eBay

It depends on whether you have partitioning/bucketing defined on the tables, and if so, how you have defined and used them. The whole idea is that for a large table join you need a sorted merge join to get the best performance (a sketch follows below).
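A minimal sketch of what that could look like, not from the original answer: the table names, bucket counts, join key, and HiveServer2 URL are all assumptions. It uses the Hive JDBC driver from Java to create bucketed, sorted tables and enable the settings that let Hive pick a sort-merge-bucket (SMB) join.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveSortMergeJoinSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint and credentials.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "user", "");
             Statement stmt = conn.createStatement()) {

            // Both tables are bucketed and sorted on the join key with the same
            // bucket count, which is what makes a sort-merge-bucket join possible.
            stmt.execute("CREATE TABLE IF NOT EXISTS orders (id BIGINT, customer_id BIGINT, amount DOUBLE) " +
                         "CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 32 BUCKETS");
            stmt.execute("CREATE TABLE IF NOT EXISTS customers (id BIGINT, name STRING) " +
                         "CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS");

            // Settings that allow the optimizer to choose the SMB join.
            stmt.execute("SET hive.enforce.bucketing=true");
            stmt.execute("SET hive.optimize.bucketmapjoin=true");
            stmt.execute("SET hive.optimize.bucketmapjoin.sortedmerge=true");
            stmt.execute("SET hive.auto.convert.sortmerge.join=true");

            // With matching buckets and sort order, Hive can merge corresponding
            // buckets instead of shuffling both full tables.
            stmt.execute("CREATE TABLE order_report AS " +
                         "SELECT o.id, c.name, o.amount " +
                         "FROM orders o JOIN customers c ON o.customer_id = c.id");
        }
    }
}
```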

 

Shekh Firoz Alam
by Shekh Firoz Alam, Senior Data Engineer, Innovative solutions

Yes, it is. Go for HBase or Cassandra.

Jassim Moideen
by Jassim Moideen, Senior Data Scientist & Big Data Program Manager, Vodafone

The Hadoop framework consists of two major components: HDFS (the storage file system) and MapReduce (the processing engine).

 

HDFS stands for "Hadoop Distributed File System", a distributed, scalable, and portable file system written in Java for the Hadoop framework (similar in role to NTFS or FAT32 in Windows, but not POSIX compliant). It can be mounted on a conventional OS such as Linux and was designed to run on commodity hardware, providing storage and computation capability for big data. This addresses the scaling and processing limitations of distributed platforms while keeping the risk of data corruption or loss minimal. The key features that allow HDFS to deliver this are rack awareness, the ability to dynamically diagnose the health of the file system and rebalance data with minimal data motion, redundancy, and support for high availability; it requires minimal operator intervention, allowing a single operator to maintain a cluster of thousands of nodes on commodity hardware. HDFS is optimized for streaming access to large files, stores files in various formats (flat CSV, Avro, Parquet) with compression formats such as Snappy, and is accessed through Hadoop's processing component, MapReduce, which processes data in batch mode in sequential order. MapReduce jobs are commonly written in Pig, and Pig scripting lacks full-scale modelling capability because it has far fewer libraries than R or Python.
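To make the streaming-access point concrete, here is a minimal sketch (not part of the original answer) of writing and then reading a file on HDFS with the Java FileSystem API; the NameNode address and file path are hypothetical.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStreamingExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/data/example/events.csv");

        // Write once: HDFS favours large, streaming-style writes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("id,event\n1,login\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read many: files are consumed as sequential streams, not updated in place.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```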

 

Key HDFS properties and modelling limitations:

 

1.       HDFS was designed for mostly immutable files and is not suitable for systems requiring concurrent write operations, since random read/write access would cause data corruption and errors; it is therefore not ACID compliant. HDFS follows a write-once, read-many ideology, and under these conditions Hadoop can perform only batch processing over the entire dataset, which the processor accesses sequentially. This means the entire dataset has to be scanned even for the simplest of jobs, which is time consuming. A huge dataset, when processed by an iterative model, produces another huge dataset, which must again be processed sequentially on the next iteration. At that point a new solution is needed that can reach any point of the data in a single unit of time (random access). This is where NoSQL solutions like HBase come into the picture.

 

2.       HDFS natively supports only the MapReduce processing paradigm, which deals with the embarrassingly parallel computing problems associated with big data. Note that not all mathematical and statistical models are map-reducible (e.g. the Travelling Salesman Problem). MapReduce is therefore suitable only for aggregated batch-processing jobs, and implementing models with interactive jobs becomes practically impossible, so other modelling platforms are needed on top of Hadoop. Use cases with common modelling problems such as conditional iteration, hashing functions, state-machine operations, cascading tasks, etc. require quick, random read/write access to the big data and cannot rely on sequential access. Modelling tasks that have mutual dependencies cannot be parallelized and are not possible on HDFS alone. On the other hand, a 'full scan' (the most common and generic map-reduce pattern) is likely to be much faster with raw file-system access than on a NoSQL DB, because map-reduce over HDFS decouples computation from storage more effectively and is much easier to scale out. We therefore need to use the right paradigm for the right problem: our DPI response efficiency would have been much better, with better query efficiency, if ES had been indexed over NoSQL rather than over raw HDFS files. (A minimal MapReduce sketch follows this list.)
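As a concrete illustration of the batch, embarrassingly parallel style referred to above, here is a minimal sketch (not part of the original answer) of the classic word-count job using the Hadoop MapReduce Java API; the input and output paths are hypothetical.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs in parallel over every block of the input files.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: aggregates the per-word counts emitted by all mappers.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Hypothetical input and output locations on HDFS.
        FileInputFormat.addInputPath(job, new Path("/data/example/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/example/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```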

 

A NoSQL database such as HBase is a non-relational, distributed database modelled after Google's BigTable and written in Java to run on top of HDFS within Hadoop. It is a distributed, persistent, strictly consistent storage system with near-optimal writes (in terms of I/O channel saturation) and excellent read performance, and it makes efficient use of disk space by supporting pluggable compression algorithms. It considers only a single index, similar to a primary key in the RDBMS world, while offering server-side hooks to implement flexible secondary-index solutions. Row atomicity and read-modify-write operations make up for this in practice, as they cover most use cases and remove the wait- or deadlock-related pauses experienced with other systems. It handles shifting load and failures gracefully and transparently to the clients.

 

Key NoSQL properties:

 

1.       Random read/write access and compatibility with big data. This allows low-latency access to small amounts of data from within a large dataset. Quick responses, fast scans across tables, and the ability to scale writes with no risk of data corruption or delayed returns are the advantages over HDFS.

 

2.       A flexible data model that can work with any modelling tool, allowing unrestricted choice of modelling libraries on big data.

 

HBase internally uses HDFS to store its data. HBase sits on top of HDFS and provides the capability to store and read data, most commonly structured data. It can be used to store and access structured and unstructured data, much as MySQL is used on a Windows OS to store tables and data. HBase's key advantage is that it provides faster lookups as well as high-volume random-access inserts and updates at scale. The HBase schema is very flexible and effectively variable: columns can be added or removed at runtime. HBase supports low-latency, strongly consistent read and write operations.
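As a rough illustration of the random read/write access and runtime-variable columns described above (not part of the original answer), here is a minimal sketch using the HBase Java client. The table name, column family, row key, and ZooKeeper host are hypothetical, and the table is assumed to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccessExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical ZooKeeper quorum; normally read from hbase-site.xml.
        conf.set("hbase.zookeeper.quorum", "zk-host");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Random write: update a single row by key, no full scan required.
            Put put = new Put(Bytes.toBytes("user#1001"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            // A "new" column can be written at any time: the schema is variable.
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("last_login"), Bytes.toBytes("2014-10-06"));
            table.put(put);

            // Random read: fetch that one row directly by key (low latency).
            Get get = new Get(Bytes.toBytes("user#1001"));
            Result result = table.get(get);
            String name = Bytes.toString(
                    result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name")));
            System.out.println("name = " + name);
        }
    }
}
```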

 

Some specific HBase benefits:

 

·         Random and consistent read/write access under high-volume requests

·         Auto failover and reliability

·         Flexible, column-based multidimensional map structure

·         Variable Schema: columns can be added and removed dynamically

·         Integration with Java client, Thrift and REST APIs

·         MapReduce and Hive/Pig integration with other modelling toolsets

·         Auto Partitioning and sharding

·         Low latency access to data

·         BlockCache and Bloom filters for query optimization

·         HBase allows data compression and is ideal for sparse data

 

HBase (NoSQL) is to HDFS what MySQL is to NTFS/FAT (Windows).

Adil Kaleem
by Adil Kaleem, Data Engineer, Teradata Corporation

Hive has relatively poor performance on heavy queries, especially joins. If it is a right outer join, performance will depend on the partitioning of the table; if it is a left outer join, it will be relatively faster than the right one. Just as in the case explained above, the answer lies in which Hadoop tool is used on top of HDFS, as well as in the structure of the tables. You cannot simply declare HDFS's performance good or poor based on a single metric.
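One way to check the behaviour this answer describes is to ask Hive for the query plan and see how the join and the partition filter are handled. This is a hedged sketch, not from the original answer: the table names, the partition column dt, and the HiveServer2 URL are all hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJoinPlanCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint and credentials.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "user", "");
             Statement stmt = conn.createStatement()) {

            // EXPLAIN shows whether the partition predicate (dt = '2014-10-06')
            // is pruned before the join, which drives the join cost described above.
            String explain =
                "EXPLAIN " +
                "SELECT o.id, c.name " +
                "FROM orders o LEFT OUTER JOIN customers c ON o.customer_id = c.id " +
                "WHERE o.dt = '2014-10-06'";

            try (ResultSet rs = stmt.executeQuery(explain)) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}
```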

Srinivasa Tata
by Srinivasa Tata, Program Management, Marsh

You could try Impala, Presto, or Apache Drill.
