MapReduce and Data Clustering

The internet is developing rapidly, making data available in massive volumes, and the rate at which data is generated has grown exponentially. Unfortunately, existing multi-core computers are complex to program efficiently, which limits their applicability and effectiveness in handling data at large scale. To address this problem, programmers have developed a software framework known as MapReduce, which distributes processing across clusters of machines and on which data clustering techniques can be run to handle big data. This paper describes data clustering processes and techniques in MapReduce.

Definitions of MapReduce and Data Clustering

Li, Hu, Li, Wu, and Yang (2016) define MapReduce as a programming model and processing technique for running different types of jobs on clusters of machines. The fundamental objective of the MapReduce framework is to simplify programming on distributed computing platforms. MapReduce exposes two interfaces, map and reduce. Programmers implement the map and reduce functions, while the framework handles the scheduling and synchronization of the map and reduce tasks. The success and wide adoption of the MapReduce framework are attributed to its flexibility, efficiency, and fault tolerance.
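The division of labor between programmer and framework can be sketched with a toy, single-process driver. This is only an illustration under stated assumptions: `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names rather than the API of Hadoop or any other real framework, the temperature readings are made up, and scheduling, distribution, and fault tolerance are left out entirely.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy stand-in for the framework: runs map, groups by key, runs reduce."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record may emit any number of (key, value) pairs.
        for key, value in map_fn(record):
            groups[key].append(value)  # shuffle: collect values per key
    # Reduce phase: one call per distinct key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# The two functions below are the only parts the programmer writes.
def map_fn(record):
    city, temperature = record
    yield city, temperature

def reduce_fn(city, temperatures):
    return max(temperatures)

# Made-up input records: (city, temperature reading).
readings = [("Oslo", 3), ("Cairo", 31), ("Oslo", 7), ("Cairo", 28)]
result = run_mapreduce(readings, map_fn, reduce_fn)
print(result)  # {'Oslo': 7, 'Cairo': 31}
```

Swapping in a different `map_fn` and `reduce_fn` changes the job without touching the driver, which is the simplification the framework is meant to provide.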

On the other hand, data clustering is the process of partitioning unlabeled data into groups of similar objects (Eng, Beng-Chin, Ozsu, & Sai, 2014). Similarity is determined either by business needs or by the requirements of the clustering algorithm. The main goal is to place similar objects in the same cluster and dissimilar objects in different clusters. Data clustering can be achieved through various means, including, but not limited to, canopy clustering, K-means, and MinHash clustering. Canopy clustering is mainly applied as a pre-clustering step that speeds up the actual clustering algorithm. K-means clustering, in turn, is a method that forms k groups from an input of n data points. MinHash clustering is applied when data points are sets spanning many dimensions, as it estimates the similarity between them from short hash signatures.
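The idea behind MinHash can be sketched in a few lines. The sketch assumes each data point is a non-empty set of items and uses salted SHA-1 digests as the family of hash functions. The helper names and the choice of 64 hash functions are illustrative assumptions, not taken from the cited sources.

```python
import hashlib

def salted_hash(seed, item):
    """One member of a family of hash functions, selected by seed."""
    digest = hashlib.sha1(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(items, num_hashes=64):
    """For each hash function, keep the smallest hash value over the set."""
    return [min(salted_hash(seed, item) for item in items)
            for seed in range(num_hashes)]

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

doc_a = {"map", "reduce", "cluster", "data"}
doc_b = {"map", "reduce", "cluster", "node"}
sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)
# The true Jaccard similarity of doc_a and doc_b is 3/5;
# the estimate fluctuates around that value.
print(estimated_similarity(sig_a, sig_b))
```

Points whose signatures agree in many positions can then be treated as candidates for the same cluster without comparing the full sets.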

The Process of Data Clustering Using MapReduce

Principally, four clustering algorithms are used to achieve data clustering on MapReduce. These algorithms work with partitioned data, and the nature of the data must be taken into account in order to model each algorithm suitably. Modeling an algorithm for a distributed platform is a big challenge for programmers because of its complexity. However, technologies such as Phoenix and Apache Hadoop allow programmers to model their algorithms without worrying about the complexities of the underlying architecture (Nivranshu, Sana, & SN, 2015). To understand the process of data clustering, it is important to analyze MapReduce data clustering in terms of the algorithms and techniques used. These techniques are K-means, canopy, greedy agglomerative, and expectation-maximization clustering.

The K-means clustering process begins by partitioning the input data into k sets using a chosen initialization method. The centroid of each of the k sets is then determined, and each data point is associated with its nearest centroid. A new centroid is then computed for each set from the points assigned to it. This process is repeated until convergence, which is reached when the centroids no longer shift between successive iterations. It is important to note that each run of the algorithm is fast, while its result depends on the initial partition. Therefore, it is advisable to run the algorithm several times with different initial partitions and keep the best clustering.
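The iteration described above maps naturally onto the two MapReduce phases: assigning each point to its nearest centroid is a map step, and averaging the points of each group is a reduce step. The following is a minimal single-process sketch, assuming Euclidean distance and initial centroids supplied by the caller (a simplification of the initial partitioning step). The function names and sample points are illustrative.

```python
from collections import defaultdict
from math import dist

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))

def kmeans_step(points, centroids):
    # "Map": group each point under the index of its nearest centroid.
    groups = defaultdict(list)
    for point in points:
        groups[nearest(point, centroids)].append(point)
    # "Reduce": the new centroid is the mean of the points in its group.
    updated = []
    for i, old in enumerate(centroids):
        members = groups.get(i)
        if not members:
            updated.append(old)  # an empty cluster keeps its old centroid
        else:
            updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return updated

def kmeans(points, centroids, max_iter=100):
    for _ in range(max_iter):
        updated = kmeans_step(points, centroids)
        if updated == centroids:  # converged: centroids no longer shift
            break
        centroids = updated
    return centroids

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(centroids)
```

In a real MapReduce job each iteration would be one map and reduce pass over the partitioned data, with the current centroids shared with every map task.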

Xie, Lu, Mei, Du, and Man (2016) describe canopy clustering as an algorithm intended to be combined with other algorithms that would be impractical to apply directly. With this technique, a cheap distance metric is applied to partition the data into overlapping canopies. After this, the data is sorted into clusters using another clustering algorithm such as K-means. Generally, canopy clustering provides a rapid and reasonably accurate solution through its divide-and-conquer approach. However, the approach pays off mainly when the data set is so large that the time and space costs of applying algorithms such as K-means directly become prohibitive.
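One common formulation of canopy clustering uses two thresholds, a loose one and a tight one. The sketch below assumes that formulation and uses Manhattan distance as the cheap metric. The names `t1` and `t2`, the function names, and the sample points are assumptions made for illustration.

```python
def cheap_distance(a, b):
    """Manhattan distance, standing in for any inexpensive metric."""
    return sum(abs(x - y) for x, y in zip(a, b))

def build_canopies(points, t1, t2):
    """Return (center, members) pairs; t1 is the loose and t2 the tight threshold."""
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining[0]
        # Every remaining point within the loose threshold joins this canopy.
        members = [p for p in remaining if cheap_distance(center, p) <= t1]
        canopies.append((center, members))
        # Points within the tight threshold are settled and leave the pool;
        # points between t2 and t1 stay and may join later canopies too.
        remaining = [p for p in remaining if cheap_distance(center, p) > t2]
    return canopies

points = [(0, 0), (1, 0), (3, 0), (10, 0), (11, 0)]
canopies = build_canopies(points, t1=4, t2=2)
for center, members in canopies:
    print(center, members)
# (0, 0) [(0, 0), (1, 0), (3, 0)]
# (3, 0) [(3, 0)]
# (10, 0) [(10, 0), (11, 0)]
```

The point (3, 0) appears in two canopies, which shows the overlap. A more expensive algorithm such as K-means would then only need to compare points that share a canopy.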

Greedy agglomerative clustering works by repeatedly merging the two most similar clusters. The merging process is repeated until the required number of clusters is attained. Notably, the first step in agglomerative clustering is to merge individual objects into small clusters. The resulting clusters are then merged on the basis of a distance measure. Figure 1 illustrates the agglomerative clustering process.

Figure 1: Greedy agglomerative clustering process
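The merging loop can be sketched as follows. The sketch assumes that every object starts as its own cluster and that the distance between two clusters is the distance between their closest members (single linkage). Both choices, along with the function names and sample points, are illustrative assumptions and not details given in the text.

```python
from itertools import combinations
from math import dist

def cluster_distance(a, b):
    """Single linkage: distance between the closest pair of members."""
    return min(dist(p, q) for p in a for q in b)

def agglomerate(points, target_clusters):
    clusters = [[p] for p in points]  # every object starts as its own cluster
    while len(clusters) > target_clusters:
        # Greedy choice: the pair of clusters that are currently closest.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i].extend(clusters[j])
        del clusters[j]  # j > i, so index i is unaffected by the deletion
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters = agglomerate(points, target_clusters=3)
print(clusters)
# [[(0, 0), (0, 1)], [(5, 5), (5, 6)], [(10, 0)]]
```

Each pass recomputes all pairwise cluster distances, which is simple but expensive. That cost is one reason such algorithms are usually paired with partitioning or pre-clustering on large data sets.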

Expectation-maximization (EM) is a clustering technique that estimates the value of an unknown quantity using the known values of some correlated quantities. EM is said to have better convergence properties than K-means and is therefore often preferred over it. The first step in the EM clustering process is to initialize the parameters of the distribution. The process then alternates between two steps, the E-step and the M-step, until a convergence point is reached. The E-step estimates the expected value of the unknown variable, while the M-step re-estimates the distribution parameters based on that estimated value.
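The two steps can be made concrete with a small sketch. It assumes the data is one-dimensional and modeled as a mixture of Gaussian distributions, so the unknown quantity is which component produced each point, and it runs a fixed number of iterations instead of testing for convergence. These choices, the function names, and the sample data are assumptions for illustration only.

```python
from math import exp, pi, sqrt

def gaussian(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return exp(-((x - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)

def em_gaussian_mixture(data, means, variances, weights, iterations=50):
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(iterations):
        # E-step: expected membership of each point in each component.
        resp = []
        for x in data:
            likes = [weights[j] * gaussian(x, means[j], variances[j]) for j in range(k)]
            total = sum(likes)
            resp.append([like / total for like in likes])
        # M-step: re-estimate the parameters from the expected memberships.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(spread, 1e-6)  # floor keeps the variance positive
            weights[j] = nj / len(data)
    return means, variances, weights

data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
means, variances, weights = em_gaussian_mixture(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
print(means, weights)
```

Unlike K-means, each point contributes to every component in proportion to its expected membership, so the assignment is soft and not a hard choice of a single cluster.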

Thus, the four clustering algorithms above can be used to achieve data clustering on MapReduce. Each algorithm executes the clustering process through its own distinct steps in order to achieve a specific objective. The choice of clustering algorithm for a MapReduce job depends on the requirements and the nature of the data.

Application of Data Clustering and MapReduce

MapReduce and data clustering find application in various fields. One such field is data mining, where clustering is used for the analysis of data. The objective of the analysis is to identify groups of related records, which provides a starting point for exploring other relationships between records. Furthermore, data clustering is very useful in developing a segmentation model for a population. For example, it is applied in customer segmentation based on demographic features. Text mining is another area where MapReduce and data clustering find application. By definition, text mining is the process of generating high-quality information from texts. Usually, this process involves structuring the input texts, deriving patterns from the structured data, and evaluating and interpreting the output (Beibei, Bo, Weiwei, & Ying, 2017).

MapReduce Workflow Example

MapReduce computation involves breaking a large job into smaller tasks that can be executed in parallel across servers. The results of the smaller tasks are then combined to obtain the final result. Below is a workflow example of how a MapReduce program can be used for word count.

Figure 2: MapReduce workflow
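The word count workflow can also be sketched in code. In this single-machine illustration a thread pool stands in for the servers that would run the map tasks in parallel. The input lines and the function names are made up for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_words(line):
    """Map task: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(mapped_chunks):
    """Shuffle: gather the values emitted for each word across all map tasks."""
    groups = defaultdict(list)
    for chunk in mapped_chunks:
        for word, one in chunk:
            groups[word].append(one)
    return groups

def reduce_counts(word, ones):
    """Reduce task: sum the occurrences of one word."""
    return word, sum(ones)

lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

# Each line is one input split; the threads stand in for parallel servers.
with ThreadPoolExecutor(max_workers=3) as pool:
    mapped_chunks = list(pool.map(map_words, lines))

counts = dict(reduce_counts(word, ones) for word, ones in shuffle(mapped_chunks).items())
print(counts)  # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```

The map tasks never share state, which is what allows a framework to spread them across machines and combine their outputs only in the shuffle and reduce phases.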

Conclusion

MapReduce is undoubtedly a great innovation in the field of data science. Data clustering concepts have, to a great extent, helped to simplify tasks that would otherwise be tedious and time-consuming. Additionally, MapReduce techniques are very useful in day-to-day operations in different fields. These programming techniques have improved big data operations and, in essence, advanced the field of computer science.

References

Beibei, L., Bo, L., Weiwei, L., & Ying, Z. (2017). Performance Analysis of Clustering Algorithm Under Two Kinds of Big Data Architecture. Journal of High-Speed Networks, 23(1), 49-57.

Eng, L., Beng-Chin, O., Ozsu, M. T., & Sai, W. (2014). Distributed Data Management Using MapReduce. ACM Computing Surveys, 46(3), 1-42.

Li, R., Hu, H., Li, H., Wu, Y., & Yang, J. (2016). MapReduce Parallel Programming Model: A State-of-the-Art Survey. International Journal of Parallel Programming, 44(4), 832-866.

Nivranshu, H., Sana, M., & SN, O. (2015). Big Data Clustering Using Genetic Algorithm on Hadoop MapReduce. International Journal of Scientific & Technology Research, 4(4), 58-62.

Xie, L., Lu, C., Mei, Y., Du, H., & Man, Z. (2016). An optimal method for data clustering. Neural Computing & Applications, 27(2), 283-289.

Cite this page

MapReduce and Data Clustering, Free Essay. (2022, Mar 09). Retrieved from https://speedypaper.net/essays/mapreduce-and-data-clustering
