21 Mar 2021

Interview - HDFS

Hadoop Distributed File System - HDFS, is a distributed files system that allows big data to be stored across multiple nodes.

Yarn is the resource management in hadoop that allows multiple processing engines to manage data in a single platform.

RDBMS vs Hadoop

	RDBMS	HDFS
Data Types	RDBMS relies on the structured data and the schema of the data is always known.	Any kind of data can be stored into Hadoop i.e. Be it structured, unstructured or semi-structured.
Processing	RDBMS provides limited or no processing capabilities.	Hadoop allows us to process the data which is distributed across the cluster in a parallel fashion.
Schema on Read Vs. Write	RDBMS is based on ‘schema on write’ where schema validation is done before loading the data.	On the contrary, Hadoop follows the schema on read policy.
Read/Write Speed	In RDBMS, reads are fast because the schema of the data is already known.	The writes are fast in HDFS because no schema validation happens during HDFS write.
Cost	Licensed software, therefore, I have to pay for the software.	Hadoop is an open source framework. So, I don’t need to pay for the software.
Best Fit Use Case	RDBMS is used for OLTP (Online Trasanctional Processing) system.	Hadoop is used for Data discovery, data analytics or OLAP system.

Problem with Small Files:
- Each small file will take up 150 bytes in memory to store it’s metadata, too much RAM is not cost-efficient;
- Reading is not efficient, as there are lots of seek and jump from datanode to datanode
Block: these are the smallest continuous location on the hard drive where data is stored, hdfs store each files as blocks and distributed it across the hadoop cluster; by default it’s size is 128 MB
Replication: blocks of data are also replicated to provide fault tolerance, so even when a DataNode fails or a data block gets corrupted, the data can still be retrieved
- Rack Awareness: this is an algorithm to make sure that not all the replica of a blcok are stored on the same rack or a single rack.
HDFS Write:
- Client reachout to NameNode for request to writting to two blocks
- NameNode gives a list of IP adress for the DataNode (all replicates)
- The block will also be copied to these three DataNodes

HDFS Write

HDFS Write Acknowledgement