Exploring MapReduce: A Crucial Element in Distributed Computing

Week 2 and Week 3

MapReduce stands out as a Java-based programming model and processing technique designed for distributed computing. It is a fundamental component of the Hadoop framework, alongside the Hadoop Distributed File System (HDFS) and YARN.

Its strength lies in processing large volumes of distributed data, enabling users to distribute work efficiently and execute computations in parallel.

MapReduce consists of two steps:

  • Map - During this phase, input data is split into chunks that can be processed in parallel, with transformation logic applied to each chunk.

  • Reduce - The Reduce phase aggregates the outputs produced by the Map phase (see the word-count sketch after this list).
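
To make these two steps concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API. The class names (WordCount, TokenizerMapper, SumReducer) are illustrative and follow Hadoop's well-known word-count example rather than any code from this article:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map step: transform each input line into (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE); // transformation logic per chunk
            }
        }
    }

    // Reduce step: aggregate all counts emitted for the same word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum)); // one total per word
        }
    }
}
```

With the default TextInputFormat, each mapper receives a line offset as the key and the line text as the value; the framework then groups the emitted (word, 1) pairs by word before handing them to the reducer.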

How MapReduce Works:

A typical MapReduce job unfolds in three steps, which together integrate the Map and Reduce operations:

  • Map: Input data is segmented into smaller blocks, and each block is assigned to a mapper for processing. The number of mappers usually corresponds to the number of blocks. Each worker node applies the map function to its local data and writes the output to temporary storage. The primary (master) node ensures that only one copy of any redundant input data is processed.

  • Shuffle, Combine, and Partition: Worker nodes redistribute data based on output keys, so that all data belonging to one key ends up on the same node. Optionally, a combiner pre-aggregates data on each mapper server, shrinking the volume that must be shuffled and sorted. Partitioning, which always occurs, determines which reducer receives each key.

  • Reduce: Execution begins once the mappers complete their tasks. Worker nodes process groups of <key, value> pairs concurrently, with all map output values that share a key directed to a single reducer. The reducer aggregates the values for each key; the reduce function itself is optional, so map-only jobs are possible. A driver sketch wiring all three steps together follows below.

[Image: MapReduce and its phases with a numerical example - GeeksforGeeks]
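
As a sketch of how these three steps are wired together, the hypothetical driver below reuses the TokenizerMapper and SumReducer classes from the earlier snippet (assumed to be in the same package), registers the reducer as the optional combiner, and explicitly names the default HashPartitioner; all class names are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Map phase: one mapper per input split (roughly one per HDFS block).
        job.setMapperClass(WordCount.TokenizerMapper.class);

        // Optional combiner: pre-aggregates (word, 1) pairs on each mapper
        // node, reducing the data shuffled across the network.
        job.setCombinerClass(WordCount.SumReducer.class);

        // Partitioner (always applied): routes each key to one reducer.
        // HashPartitioner is the default, shown explicitly here.
        job.setPartitionerClass(HashPartitioner.class);
        job.setNumReduceTasks(2);

        // Reduce phase: all values for a key arrive at a single reducer.
        job.setReducerClass(WordCount.SumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Reusing the reducer as the combiner is safe here because summation is associative and commutative; for aggregations that lack those properties, a combiner would change the result.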

Disadvantages of MapReduce:

  • Rigid Programming Paradigm: MapReduce's stringent programming model forces even ill-fitting logic into its map-and-reduce structure, resulting in verbose code. Moreover, it integrates only with HDFS and YARN, limiting compatibility with other storage systems or resource managers.

  • Read/Write Intensity: Because MapReduce keeps little data in memory, it must continuously read from and write to HDFS, making jobs I/O-intensive.

  • Java-Centric: MapReduce is predominantly Java-based, largely restricting developers to a single programming language.

  • Transition from Big Data Offerings: As a legacy technology, MapReduce is gradually being phased out of Big Data offerings. Emerging tools and cloud alternatives, explored in the upcoming article, are becoming the preferred choices.


Further Readings:


The resources I consulted are credited above; they contributed valuable insights to the content presented.

Image Credits: I do not claim credit for the image; all acknowledgment and appreciation go to the original creator.