Question: I have a data frame with a column for dates in the format yyyyMMdd. How can I convert it into yyyy-MM-dd with an appropriate schema in Spark/PySpark?

I would like to answer the above question.
Converting yyyyMMdd to yyyy-MM-dd format with appropriate schema
Spark
Code
%spark
var df = Seq("20200601","20200602","20200603").toDF("Date")
.withColumn("Date", to_date(col("Date"), "yyyyMMdd"))
df.show
df.printSchema
Output
+----------+
| Date|
+----------+
|2020-06-01|
|2020-06-02|
|2020-06-03|
+----------+
root
|-- Date: date (nullable = true)
I used the method below.

def to_date(e: Column, fmt: String): Column

Converts the column into a DateType with a specified format. See Datetime Patterns for valid date and time format patterns.

- e: A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS.
- fmt: A date time pattern detailing the format of e when e is a string.
- returns: A date, or null if e was a string that could not be cast to a date or fmt was an invalid format.
- Since: 2.2.0
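The null-on-failure behavior described above can be sketched outside Spark in plain Python; `safe_to_date` and its `%Y%m%d` pattern are illustrative stand-ins for this post, not part of the Spark API.

```python
from datetime import datetime, date

def safe_to_date(s, fmt="%Y%m%d"):
    """Mimic to_date: return a date, or None when the string cannot be parsed."""
    try:
        return datetime.strptime(s, fmt).date()
    except (ValueError, TypeError):
        return None

print(safe_to_date("20200601"))    # 2020-06-01
print(safe_to_date("not-a-date"))  # None
```

Just as Spark's to_date yields null instead of raising an error, the sketch returns None for any value that does not match the pattern.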
pyspark
Code
%pyspark
from pyspark.sql.functions import to_date, col
df = spark.createDataFrame([("20200601",), ("20200602",), ("20200603",)]).toDF("date")
df = df.withColumn("date", to_date(col("date"), "yyyyMMdd"))
df.show()
df.printSchema()
Output
+----------+
| date|
+----------+
|2020-06-01|
|2020-06-02|
|2020-06-03|
+----------+
root
|-- date: date (nullable = true)
I used the method below.

pyspark.sql.functions.to_date(col, format=None)

Converts a Column into pyspark.sql.types.DateType using the optionally specified format. Specify formats according to datetime pattern. By default, it follows casting rules to pyspark.sql.types.DateType if the format is omitted. Equivalent to col.cast("date").

New in version 2.2.0.
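When the format is omitted, to_date simply casts the string, so the input must already look like yyyy-MM-dd; that is why the explicit "yyyyMMdd" pattern is passed in the answer above. A rough plain-Python analogue of the default cast (using the standard library, not Spark):

```python
from datetime import date

# Parsing an already-formatted yyyy-MM-dd string succeeds without a pattern,
# much like to_date(col) with the format omitted.
print(date.fromisoformat("2020-06-01"))  # 2020-06-01
```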
That’s all. Thank you.
If you are new to Spark, I recommend O'Reilly Safari online learning.