This is an English translation of a Japanese blog. Some content may not be fully translated.
AWS

WordCount with EMR PySpark

Prepare Data for WordCount

Raw /dev/urandom output is binary, which can break UTF-8 decoding in textFile and yields no meaningful "words", so filter it down to printable, whitespace-separated text (GNU head also expects an uppercase M suffix):

tr -dc 'a-z \n' < /dev/urandom | head -c 200M > test.txt
hadoop fs -put test.txt /user/hadoop/
hadoop fs -ls /user/hadoop/
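If you prefer generating the test data in Python, a small sketch like the following produces space-separated random words; the helper name and the 1 MB default are illustrative, not part of the original setup:

```python
import random
import string

# Illustrative helper: write roughly target_bytes of space-separated
# random lowercase "words" to a local file (a smaller stand-in for
# the 200 MB shell one-liner above).
def write_random_words(path, target_bytes=1_000_000):
    written = 0
    with open(path, "w") as f:
        while written < target_bytes:
            word = "".join(
                random.choices(string.ascii_lowercase, k=random.randint(3, 10))
            )
            f.write(word + " ")
            written += len(word) + 1

write_random_words("test.txt")
```

The resulting file can then be uploaded to HDFS with the same hadoop fs -put command as above.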

PySpark Script for Execution

from operator import add

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; the SparkContext comes from it.
spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

input_file = "/user/hadoop/test.txt"

# Read the file, drop empty lines, then count word occurrences.
lines = sc.textFile(input_file)
lines_nonempty = lines.filter(lambda x: len(x) > 0)
counts = (lines_nonempty
          .flatMap(lambda x: x.split(' '))
          .map(lambda x: (x, 1))
          .reduceByKey(add))

# collect() pulls every result to the driver; fine for a demo,
# but consider saveAsTextFile for large outputs.
output = counts.collect()
for word, count in output:
    print("%s: %i" % (word, count))
sc.stop()

Execute

spark-submit test.py
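On EMR the job runs on YARN by default; if you want to set the target and resources explicitly, a sketch with common spark-submit flags looks like this (the executor counts and memory are placeholders, not tuned recommendations):

```shell
spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 4 \
  --executor-memory 2g \
  test.py
```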