Top 100 Hadoop Interview Questions for Beginners and Professionals

1. What is Apache Hadoop?
Hadoop is an open-source software framework for distributed storage and distributed processing of large data sets. Being open source means it is freely available and its source code can be modified as per our requirements. Apache Hadoop makes it possible to run applications on systems with thousands of commodity hardware nodes. Its distributed file system provides rapid data transfer rates among nodes and allows the system to continue operating in case of node failure.
2. What are the main components of Hadoop?
Storage layer – HDFS
Batch processing engine – MapReduce
Resource Management Layer – YARN
HDFS – HDFS (Hadoop Distributed File System) is the storage unit of Hadoop. It is responsible for storing different kinds of data as blocks in a distributed environment. It follows a master-slave topology.
MapReduce – The Hadoop MapReduce framework is used for processing large data sets in parallel across a Hadoop cluster. Data analysis uses a two-step map and reduce process.
YARN – YARN (Yet Another Resource Negotiator) is the processing framework in Hadoop, which manages resources and provides an execution environment to the processes.
3. Why do we need Hadoop?
Storage – Since the data is very large, storing such a huge amount of data is difficult.
Security – Since the data is huge in size, keeping it secure is another challenge.
Analytics – In Big Data, most of the time we are unaware of the kind of data we are dealing with, so analyzing that data is even more difficult.
Data Quality – In the case of Big Data, data is very messy, inconsistent and incomplete.
Discovery – Using a powerful algorithm to find patterns and insights is very difficult.
4. What are the four characteristics of Big Data?
Volume: Volume represents the amount of data, which is growing at an exponential rate, i.e. into petabytes and exabytes.
Velocity: Velocity refers to the rate at which data is growing, which is very fast; today, yesterday’s data is considered old data. Nowadays, social media is a major contributor to the velocity of growing data.
Variety: Variety refers to the heterogeneity of data types. In other words, the data that is gathered comes in a variety of formats such as videos, audio files, CSVs, etc. These various formats represent the variety of data.
Value: It is all well and good to have access to big data, but unless we can turn it into value it is useless.
5. What are the modes in which Hadoop runs?
Local (Standalone) Mode – By default, Hadoop runs in a single-node, non-distributed mode, as a single Java process.
Pseudo-Distributed Mode – Like standalone mode, Hadoop runs on a single node in pseudo-distributed mode, but each Hadoop daemon runs in a separate Java process.
Fully-Distributed Mode – In this mode, all daemons execute on separate nodes, forming a multi-node cluster. Thus, it allows separate nodes for master and slave roles.

6. Explain the indexing process in HDFS.
The indexing process in HDFS depends on the block size. HDFS does not maintain a traditional index over the data; the NameNode keeps the metadata that maps each file to its blocks, and the last part of each stored block points to the address where the next part of the data chunk is located.
7. What happens to a NameNode that has no data?
There is no such thing as a NameNode without data. A NameNode always holds data, namely the metadata (namespace and block locations) for the files stored in the cluster; if it holds no data, it is not functioning as a NameNode.
8. What is Hadoop streaming?
The Hadoop distribution provides a generic application programming interface for writing Map and Reduce jobs in any desired programming language such as Python, Perl or Ruby. This is referred to as Hadoop Streaming. Users can create and run jobs with any kind of shell script or executable as the Mapper or the Reducer.
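As an illustration, here is a minimal word-count job written as two Python streaming scripts. This is only a sketch: the file names mapper.py and reducer.py are placeholders, and the exact path of the Hadoop streaming jar used to launch the job depends on the distribution.

    #!/usr/bin/env python
    # mapper.py - reads raw input lines from stdin and writes tab-separated
    # "word<TAB>1" pairs to stdout, one pair per word.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t1" % word)

Hadoop Streaming sorts the mapper output by key before it reaches the reducer, so all counts for the same word arrive together:

    #!/usr/bin/env python
    # reducer.py - sums the counts for each word; input arrives sorted by key.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

The job is then typically launched through the hadoop-streaming jar, passing these scripts via the -mapper and -reducer options (and shipping them with -file), together with -input and -output paths on HDFS.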
9. What is a block and block scanner in HDFS?
Block – The minimum amount of data that can be read or written is generally referred to as a “block” in HDFS. The default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later.
Block Scanner – The Block Scanner tracks the list of blocks present on a DataNode and verifies them to detect checksum errors. Block Scanners use a throttling mechanism so that scanning does not consume too much disk bandwidth on the DataNode.
10. What is a checkpoint?
The Checkpoint Node keeps track of the latest checkpoint in a directory that has the same structure as the NameNode’s directory. It creates checkpoints of the namespace at regular intervals by downloading the edits and fsimage files from the NameNode and merging them locally. The new image is then uploaded back to the active NameNode.
11. What is commodity hardware?
Commodity hardware refers to inexpensive systems that do not have high availability or high quality. Such systems still need adequate RAM, because there are specific services that have to be executed in memory. Hadoop can run on any commodity hardware and does not require supercomputers or high-end hardware configurations to execute jobs.
12. What is a heartbeat in HDFS?
A heartbeat is a signal sent periodically from a DataNode to the NameNode (and, in MRv1, from a TaskTracker to the JobTracker). If the NameNode or JobTracker stops receiving these signals, it concludes that there is some issue with the DataNode or TaskTracker.
13. What happens when a data node fails?

When a DataNode fails:
  • The JobTracker and the NameNode detect the failure
  • All tasks on the failed node are re-scheduled
  • The NameNode replicates the user’s data to another node

14. What happens in TextInputFormat?
In TextInputFormat, each line of the text file is a record. The value is the content of the line, while the key is the byte offset of the line within the file. For instance, Key: LongWritable, Value: Text.
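A purely local sketch in plain Python (not Hadoop API code) of how TextInputFormat presents records, with the byte offset as the key and the line as the value; the sample data is made up for illustration:

    # Each line of the input is one record: key = byte offset of the line,
    # value = the line's content.
    data = b"hadoop\nhdfs\nmapreduce\n"

    offset = 0
    for line in data.split(b"\n")[:-1]:
        print(offset, line.decode())   # -> 0 hadoop, 7 hdfs, 12 mapreduce
        offset += len(line) + 1        # +1 accounts for the newline byte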
15. What is Sqoop in Hadoop?
Sqoop is a tool used to transfer data between relational database management systems (RDBMS) and Hadoop HDFS. Using Sqoop, data can be imported from an RDBMS such as MySQL or Oracle into HDFS, and data can also be exported from HDFS back to an RDBMS.
16. What are the data components used by Hadoop?
Data components used by Hadoop are

  • Pig
  • Hive

17. What is rack awareness?
Rack awareness is the way in which the NameNode decides how to place blocks, based on the rack definitions.
18. Explain how ‘map’ and ‘reduce’ work.
The input is divided into splits, and map tasks are assigned to nodes in the cluster (by the JobTracker in classic MapReduce, or by the ApplicationMaster under YARN). These nodes process the tasks assigned to them, produce intermediate key-value pairs, and pass this intermediate output to the Reducer. The Reducer collects the key-value pairs from all the nodes, combines them, and generates the final output.
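To make this key-value flow concrete, here is a small, purely local simulation of the three phases (map, shuffle, reduce) in plain Python; it is not Hadoop API code, just an illustration of the data flow described above.

    from collections import defaultdict

    # Map phase: each input record is turned into intermediate (key, value) pairs.
    records = ["the quick brown fox", "the lazy dog", "the fox"]
    intermediate = []
    for line in records:
        for word in line.split():
            intermediate.append((word, 1))

    # Shuffle phase: the framework groups all values that share the same key.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce phase: each key and its list of values is combined into the final output.
    result = {key: sum(values) for key, values in grouped.items()}
    print(result)   # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}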
19. What is a Combiner?
The Combiner is a ‘mini-reduce’ process which operates only on data generated by a mapper. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers.
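A small sketch of the idea in plain Python (not Hadoop API code): the combiner performs a partial, local reduce on a single mapper’s output before anything is sent across the network to the reducers.

    from collections import Counter

    # Output of one mapper on a single node: many repeated (word, 1) pairs.
    mapper_output = [("the", 1), ("fox", 1), ("the", 1), ("the", 1), ("dog", 1)]

    # Combiner: partial aggregation on the mapper's node, before the shuffle.
    combined = Counter()
    for word, count in mapper_output:
        combined[word] += count

    # Only three records instead of five now travel to the reducers.
    print(list(combined.items()))   # [('the', 3), ('fox', 1), ('dog', 1)]

Because the framework may apply the combiner zero, one or several times, it should only be used for operations that are associative and commutative, such as sums or counts.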
20. Consider this case scenario: in an M/R system,
– HDFS block size is 64 MB
– Input format is FileInputFormat
– We have 3 files of sizes 64 KB, 65 MB and 127 MB
How many input splits will be made by Hadoop framework?
Hadoop will make 5 splits as follows −
– 1 split for the 64 KB file
– 2 splits for the 65 MB file
– 2 splits for the 127 MB file
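The arithmetic behind this answer can be sketched in a few lines of Python using a simple ceiling rule, splits = ceil(file size / block size); note that the real FileInputFormat also applies a small slop factor before creating the final split, which this simplified sketch ignores. The file names are made up for illustration.

    BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB block / split size in this scenario

    # File sizes from the scenario: 64 KB, 65 MB and 127 MB.
    files = {"a.txt": 64 * 1024, "b.txt": 65 * 1024 * 1024, "c.txt": 127 * 1024 * 1024}

    total = 0
    for name, size in files.items():
        # Ceiling division: every started block yields one more split.
        splits = max(1, (size + BLOCK_SIZE - 1) // BLOCK_SIZE)
        print(name, "->", splits, "split(s)")   # a.txt -> 1, b.txt -> 2, c.txt -> 2
        total += splits

    print("total splits:", total)               # 5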
