PySpark is the Python API wrapper for Apache Spark, a big data processing framework.
Previously I used Spark in an assignment and my team mates had slightly different results to mine when we were trying to solve the same question.
It turns out, it came down to the inferschema
parameter when reading in a Spark DataFrame.
Table of Contents
The Inferschema Trap
Essentially what happens when you set inferschema=True
is that PySpark tries to infer the data type, and this can result in a discrepancy when you’re filtering.
When a column is a string type, the F.between()
function is inclusive.
However, when the data type is timestamp
it is no longer inclusive.
Code Example
First let’s try and load the data into a
spark = SparkSession.builder.appName("delayed_flights").getOrCreate()
data_path = <path-to-data>
df = spark.read.format("csv") \
.options(
header="true",
inferschema="true"
) \
.load(data_path)
df.printSchema()
# Now let's filter by the dates...
flights_tiny_df = flights_tiny_df.filter(
flights_tiny_df["flight_date"].between(
"1996-01-01",
"1996-12-31"
)
)
flight_cols = ["tail_number", "flight_date"]
flight_dates = flights_tiny_df.select(flight_cols) \
.groupBy("flight_date") \
.count()
# Get the latest date
print(flight_dates.orderBy("flight_date", ascending=False) \
.collect()[0]
)
Filtering the dates gives
The printSchema()
function returns
> root
> |-- flight_id: integer (nullable = true)
> |-- carrier_code: string (nullable = true)
> |-- flight_number: integer (nullable = true)
> |-- flight_date: timestamp (nullable = true)
> |-- origin: string (nullable = true)
> |-- destination: string (nullable = true)
> |-- tail_number: string (nullable = true)
> |-- scheduled_depature_time: string (nullable = true)
> |-- scheduled_arrival_time: string (nullable = true)
> |-- actual_departure_time: string (nullable = true)
> |-- actual_arrival_time: string (nullable = true)
> |-- distance: integer (nullable = true)
Filtering the dates gives
> Row(flight_date=datetime.datetime(1996, 12, 30, 0, 0), count=145)
Now if you change inferschema=False
and run the exact same code again, you will get2
> root
> |-- flight_id: string (nullable = true)
> |-- carrier_code: string (nullable = true)
> |-- flight_number: string (nullable = true)
> |-- flight_date: string (nullable = true)
> |-- origin: string (nullable = true)
> |-- destination: string (nullable = true)
> |-- tail_number: string (nullable = true)
> |-- scheduled_depature_time: string (nullable = true)
> |-- scheduled_arrival_time: string (nullable = true)
> |-- actual_departure_time: string (nullable = true)
> |-- actual_arrival_time: string (nullable = true)
> |-- distance: string (nullable = true)
and the result
> Row(flight_date='1996-12-31', count=129)
This took me longer than necessary to debug because I did not suspect this to be an issue for a while.
Nevertheless I hope this means that you won’t run into this issue in your own work!
You download the dataset here to reproduce the code.