Web Snippets: Dynamically create DataFrames

Tuesday, July 10, 2018

Dynamically create DataFrames

We can dynamically build a string of rows and then generate a DataFrame from it.

However, the whole string is treated as a single line, and converting it directly throws an error.
We need to split it into lines based on the delimiter. This can be done with a split function, as shown in the SPLIT FUNCTION section below.

CREATE DATAFRAME

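# Build one delimited string holding every record.
# Assumed record layout: id,hour,value with ";" between records.
data_string = ""
for rw in baseline_row.collect():
    for i in range(24):
        hour = "h" + str(i + 1)
        hour_value = str(rw[hour])
        data = str(rw.id) + "," + hour + "," + hour_value + ";"
        data_string = data_string + data

# Dynamically generated data for hours
print(data_string)
rdds = spark_session.sparkContext.parallelize([data_string])
# Break the single string into records, then split each record into fields.
# split_the_line (see SPLIT FUNCTION) must be defined before this runs.
records = rdds.flatMap(lambda s: s.strip(";").split(";"))
records.map(split_the_line).toDF(["id", "hour", "value"]).show()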
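
To see what the generated string looks like without a Spark session, here is a minimal sketch in plain Python. It assumes each record is laid out as id,hour,value with ";" between records. The names `sample_row` and `build_data_string` are hypothetical and are not part of the original snippet. A dict stands in for one collected Row.

```python
# Hypothetical stand-in for one collected Spark Row: an id plus columns h1..h24.
sample_row = {"id": 7, **{"h" + str(i + 1): i * 0.5 for i in range(24)}}


def build_data_string(rows, field_sep=",", record_sep=";"):
    """Join every (id, hour, value) triple into one delimited string."""
    records = []
    for rw in rows:
        for i in range(24):
            hour = "h" + str(i + 1)
            # One record per hour: id, hour name, hour value.
            records.append(field_sep.join([str(rw["id"]), hour, str(rw[hour])]))
    return record_sep.join(records)


data_string = build_data_string([sample_row])
print(data_string[:26])  # 7,h1,0.0;7,h2,0.5;7,h3,1.0
```

Collecting the records in a list and joining them once avoids a trailing delimiter and repeated string concatenation inside the loop.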

SPLIT FUNCTION

# Function to split one comma-delimited line into its fields
def split_the_line(x):
    return x.split(',')
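
The split step can be checked on its own before it is handed to Spark. The sketch below assumes ";" separates records and "," separates fields. `parse_records` and `to_dataframe` are hypothetical helper names, and the column names are illustrative only. `to_dataframe` needs an active SparkSession and is not called here.

```python
def split_the_line(x):
    # Same helper as in the post: one record in, list of fields out.
    return x.split(",")


def parse_records(data_string, record_sep=";"):
    """Break the single string into records, then each record into fields."""
    records = [r for r in data_string.split(record_sep) if r]  # drop empty pieces
    return [split_the_line(r) for r in records]


def to_dataframe(spark_session, data_string):
    """Build a DataFrame from the parsed rows (assumes an active SparkSession)."""
    rows = parse_records(data_string)
    return spark_session.createDataFrame(rows, ["id", "hour", "value"])


rows = parse_records("7,h1,0.0;7,h2,0.5;")
print(rows)  # [['7', 'h1', '0.0'], ['7', 'h2', '0.5']]
```

Every field comes back as a string, so numeric columns would still need a cast after the DataFrame is created.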
