Responsive Ad Area

Share This Post

test

unable to pass rdd to a udf in pyspark

I am trying to pass an rdd to a function, but the function is not being called.

My sample code is

def test(x):
    print("in func")
    print(x)

data = sc.textFile("path/to/data")
data.map(lambda x: x.split(",")).map(lambda row: test(row))

when I run this code it gives me nothing

So I have tried to do foreach on rdd and called the function.

data.foreach(test)

I am now able to see the data and the print statement also.
But this happens only when I run in pyspark shell level.
When I try it in code project level it throws me an error like

19/03/16 17:46:04 WARN TaskSetManager: Lost task 1.0 in stage 2.0 (TID 4, ip-172-31-20-247.ec2.internal, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File “/hadoop/yarn/local/usercache/vikastiwarim935995/appcache/application_1552564637687_0467/container_e54_1552564637687_0467_01_000002/pyspark.zip/pyspark/worker.py”, line 216, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File “/hadoop/yarn/local/usercache/vikastiwarim935995/appcache/application_1552564637687_0467/container_e54_1552564637687_0467_01_000002/pyspark.zip/pyspark/worker.py”, line 58, in read_command
command = serializer._read_with_length(file)
File “/hadoop/yarn/local/usercache/vikastiwarim935995/appcache/application_1552564637687_0467/container_e54_1552564637687_0467_01_000002/pyspark.zip/pyspark/serializers.py”, line 170, in _read_with_length
return self.loads(obj)
File “/hadoop/yarn/local/usercache/vikastiwarim935995/appcache/application_1552564637687_0467/container_e54_1552564637687_0467_01_000002/pyspark.zip/pyspark/serializers.py”, line 562, in loads
return pickle.loads(obj)

I could see the approach of rdd.foreach(func) in Apache Spark user guide.

I can see this error if I run the calling function and called the function in the same file or even in the different files.

I am not sure what I am missing here. Can someone please help me understand my mistake.


unable to pass rdd to a udf in pyspark
unable to pass rdd to a udf in pyspark
test
{$excerpt:n}

Share This Post

Leave a Reply

Your email address will not be Publishedd. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

Skip to toolbar