my opinion is my own

WordCount with PySpark on EMR

Preparing the data for WordCount

head -c 200m /dev/urandom > test.txt
hadoop fs -put test.txt /user/hadoop/
hadoop fs -ls /user/hadoop/
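
Bytes from /dev/urandom are not text, so splitting them on spaces mostly yields tokens that appear only once. If more word-like input is wanted, a small generator along these lines is one option. This is a minimal sketch: the vocabulary, the function name write_sample_text, and the file name words.txt in the note below are all hypothetical and not part of the original setup.

```python
import random

# Hypothetical vocabulary; any list of words would do.
VOCAB = ["spark", "emr", "hadoop", "aws", "python"]


def write_sample_text(path, n_lines=1000, words_per_line=10, seed=0):
    """Write n_lines lines of space-separated words drawn at random from VOCAB."""
    rng = random.Random(seed)  # fixed seed so the file is reproducible
    with open(path, "w") as f:
        for _ in range(n_lines):
            line = " ".join(rng.choice(VOCAB) for _ in range(words_per_line))
            f.write(line + "\n")
```

Calling write_sample_text("words.txt") would produce a file that can be uploaded with the same hadoop fs -put command as above.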

PySpark script to run (test.py)

from operator import add

from pyspark import SparkContext
from pyspark.sql.session import SparkSession

# Reuse the existing SparkContext if there is one, otherwise create it.
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

# HDFS path of the file uploaded above.
inputFile = "/user/hadoop/test.txt"
lines = sc.textFile(inputFile)
# Drop empty lines before splitting.
lines_nonempty = lines.filter(lambda x: len(x) > 0)
# Split each line on spaces, emit (word, 1) pairs, and sum the counts per word.
counts = lines_nonempty.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
# collect() brings every (word, count) pair back to the driver.
output = counts.collect()
for (word, count) in output:
    print("%s: %i" % (word, count))
sc.stop()
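
To see what the filter / flatMap / map / reduceByKey chain computes without a cluster, here is a minimal pure-Python sketch of the same steps. It runs in a single process and does not use Spark at all. The function name word_count is hypothetical.

```python
from functools import reduce
from operator import add


def word_count(lines):
    """Local, single-process equivalent of the filter/flatMap/map/reduceByKey chain."""
    # filter: drop empty lines
    nonempty = [line for line in lines if len(line) > 0]
    # flatMap: split each line on a single space and flatten the results
    words = [w for line in nonempty for w in line.split(" ")]
    # map: pair each word with 1
    pairs = [(w, 1) for w in words]
    # reduceByKey(add): group values by key, then fold each group with add
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return {key: reduce(add, values) for key, values in grouped.items()}


print(word_count(["a b a", "", "b c"]))  # {'a': 2, 'b': 2, 'c': 1}
```

Because the split uses split(' ') rather than split(), consecutive spaces produce empty-string tokens, and those get counted as a "word" too.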

Running it

spark-submit test.py
---

#AWS #EMR