Big Data Programming with Hadoop and Spark - PSC

Introduction to Hadoop Programming
Bryon Gill, Pittsburgh Supercomputing Center

Hadoop Overview

• Framework for Big Data
• Map/Reduce
• Platform for Big Data Applications

Map/Reduce

• Apply a function to all the data
• Harvest, sort, and process the output

Map/Reduce

[Diagram: Big Data → Split 1 … Split n → Map F(x) → Output 1 … Output n → Reduce F(x) → Result]

© 2014 Pittsburgh Supercomputing Center

HDFS

• Distributed filesystem layer
• Write Once, Read Many (WORM) filesystem
  – Optimized for streaming throughput
• Exports
• Replication
• Process data in place

HDFS Invocations: Getting Data In and Out

• hadoop dfs -ls
• hadoop dfs -put
• hadoop dfs -get
• hadoop dfs -rm
• hadoop dfs -mkdir
• hadoop dfs -rmdir

Writing Hadoop Programs

• Wordcount example: WordCount.java
  – Map class
  – Reduce class

Compiling

• javac -cp $HADOOP_HOME/hadoop-core*.jar \
    -d WordCount/ WordCount.java

Packaging

• jar -cvf WordCount.jar -C WordCount/ .

Submitting your Job

• hadoop jar WordCount.jar \
    org.myorg.WordCount \
    /datasets/compleat.txt \
    $MYOUTPUT \
    -D mapred.reduce.tasks=2

Configuring your Job Submission

• Mappers and Reducers • Java options • Other parameters

Monitoring

• Important ports:
  – hearth-00.psc.edu:50030 – JobTracker (MapReduce jobs)
  – hearth-00.psc.edu:50070 – HDFS (NameNode)
  – hearth-03.psc.edu:50060 – TaskTracker (worker node)
  – hearth-03.psc.edu:50075 – DataNode

Hadoop Streaming

• Write Map/Reduce Jobs in any language • Excellent for Fast Prototyping

Hadoop Streaming: Bash Example

• Bash wc and cat
• hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar \
    -input /datasets/plays/ \
    -output mynewoutputdir \
    -mapper '/bin/cat' \
    -reducer '/usr/bin/wc -l'

Hadoop Streaming Python Example

• Wordcount in Python
• hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar \
    -file mapper.py \
    -mapper mapper.py \
    -file reducer.py \
    -reducer reducer.py \
    -input /datasets/plays/ \
    -output pyout
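The mapper.py and reducer.py referenced above are not reproduced in the deck; this is a minimal sketch of the classic streaming wordcount pair, combined into one illustrative file (in practice each function lives in its own script reading stdin and writing stdout):

```python
import sys
from itertools import groupby

def map_words(lines):
    """Mapper: emit one tab-separated 'word<TAB>1' pair per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_counts(pairs):
    """Reducer: sum counts per word. Hadoop sorts mapper output by key
    before the reduce phase, so equal keys arrive adjacent and groupby
    is enough to aggregate them."""
    keyed = (p.split("\t") for p in pairs)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Select the stage via argv so one file can play both roles.
    stage = sys.argv[1] if len(sys.argv) > 1 else "map"
    fn = map_words if stage == "map" else reduce_counts
    for out in fn(sys.stdin):
        print(out)
```

A streaming job can be smoke-tested locally by emulating Hadoop's shuffle with a pipe: `cat input.txt | ./wordcount.py map | sort | ./wordcount.py reduce`.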

Applications in the Hadoop Ecosystem

• HBase (NoSQL database)
• Hive (data warehouse with SQL-like language)
• Pig (high-level dataflow language for MapReduce)
• Mahout (machine learning via MapReduce)
• Spark (caching computation framework)

Spark

• Alternate programming framework using HDFS
• Optimized for in-memory computation
• Well supported in Java, Python, and Scala

Spark Resilient Distributed Dataset (RDD)

• RDD for short
• Persistence-enabled data collections
• Transformations
• Actions
• Flexible implementation: memory vs. hybrid vs. disk

Spark example

• lettercount.py
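The deck cites lettercount.py without listing it; below is a plain-Python sketch of the aggregation such a job performs. The equivalent PySpark pipeline is shown only in the docstring as a hypothetical sketch, since running it needs a live SparkContext.

```python
from collections import Counter

def letter_count(lines):
    """Count each letter across all lines (case-folded, letters only).

    A Spark version would express the same thing as a transformation
    chain ending in an action, roughly (hypothetical sketch):
        sc.textFile(path) \\
          .flatMap(lambda line: [c for c in line.lower() if c.isalpha()]) \\
          .map(lambda c: (c, 1)) \\
          .reduceByKey(lambda a, b: a + b) \\
          .collect()
    """
    counts = Counter()
    for line in lines:
        counts.update(c for c in line.lower() if c.isalpha())
    return dict(counts)
```

For example, `letter_count(["To be,", "or not"])` tallies each letter of the two lines, ignoring punctuation and whitespace.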

Spark Machine Learning Library

• Clustering (K-Means) • Many others, list at http://spark.apache.org/docs/1.0.1/mllib-guide.html

K-Means Clustering

• Randomly seed cluster starting points
• Assign each point to its nearest centroid, then recompute each cluster's mean
• If the centroids change, do it again
• If the centroids stay the same, they've converged and we're done
• Awesome visualization: http://www.naftaliharris.com/blog/visualizing-k-means-clustering/
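The steps above can be sketched in a few lines of plain Python (a toy Lloyd's-algorithm implementation for illustration; MLlib's KMeans is what the examples below actually run):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Toy k-means over points given as tuples of floats."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # randomly seed starting points
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            clusters[i].append(p)
        # Recompute each centroid as the mean of its cluster.
        new = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:                 # unchanged: converged, done
            break
        centroids = new                      # changed: do it again
    return centroids
```

Two well-separated groups of 1-D points converge to the two group means regardless of which points seed the clusters.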

K-Means Examples

• spark-submit \
    $SPARK_HOME/examples/src/main/python/mllib/kmeans.py \
    hdfs://hearth-00.psc.edu:/datasets/kmeans_data.txt 3
• spark-submit \
    $SPARK_HOME/examples/src/main/python/mllib/kmeans.py \
    hdfs://hearth-00.psc.edu:/datasets/archiver.txt 2

Questions?

• Thanks!

References and Useful Links

• HDFS shell commands: http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html
• Writing and running your first program: http://www.drdobbs.com/database/hadoop-writing-and-running-your-first-pr/240153197
• https://sites.google.com/site/hadoopandhive/home/how-to-run-and-compile-a-hadoop-program
• Hadoop Streaming: http://hadoop.apache.org/docs/stable1/streaming.html
• https://sites.google.com/site/hadoopandhive/home/how-to-run-and-compile-a-hadoop-program/hadoop-streaming
• http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
• Hadoop Stable API: http://hadoop.apache.org/docs/r1.2.1/api/
• Hadoop official releases: https://hadoop.apache.org/releases.html
• Spark documentation: http://spark.apache.org/docs/latest/
