Tests with Spark: how to keep our heads above water

Are you drowning in your Spark tests? Are you tired of declaring a Spark session every time? We have the solution for you! With Spark-Test™, write your tests without any difficulty!

Great question, Baby Beaver! Spark-Test is one of the many libraries in Spark-Tools, a set of tools designed to make life easier for Spark users. We invite you to discover the other tools in the next few articles we'll be publishing. (Pardon our French! A few articles already exist en français: one on Spark-ZIO, a library that combines ZIO and Spark, and one on Spark Plumbus, which proposes creative solutions here and there.)

Let's discover Spark-Test

Today, we start with the library from the set that is the easiest to learn and use: Spark-Test. This library provides tools that make it easier to create tests involving dataframes and/or datasets, simplifying the tests themselves while providing clear and accurate information through metadata!

Some of you might already be familiar with MrPowers' spark-fast-tests (https://github.com/MrPowers/spark-fast-tests). We decided to tackle testing problems from a different angle, which resulted in a very different design. Apart from having a different architecture, Spark-Test offers more accurate error reporting. We're currently working on performance improvements for future releases, which should speed up testing considerably.

Caption: (For testing, I mean. For testing! Just kidding. We all love testing, don't we?)

Setting up

It's easy. You just need to add the Spark-Test dependency to your build.sbt, then extend SparkTest in your test file, and voilà! For those who are just starting to dabble with Scala libraries, here's a Spark-Test skeleton you can build on.

// In build.sbt
libraryDependencies += "io.univalence" % "spark-test_2.12" % "0.3"

// In your test file
package io.univalence.sparktest

import org.scalatest.FunSuite

class GettingStartedTest extends FunSuite with SparkTest {

  test("some test") {
    // Start the test here...
  }

}

Now that everything is ready, let's start by creating a dataframe that will serve as an example throughout this article. Spark-Test provides tools that help you quickly create a dataframe or a dataset for your tests, so let's use one of them right now!

You don't need to instantiate a SparkSession. Everything is already set up for you.

val df = dataframe("{a:1, b:true}", "{a:2, b:false}")
/*
| a |   b   |
+---+-------+
| 1 | true  |
| 2 | false |
*/
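By the way, the value returned by dataframe is an ordinary Spark DataFrame, so the whole Spark API remains available on it. A quick sketch (the filter and assertion here are just ours for illustration):

test("the helper returns a plain DataFrame") {
  val df = dataframe("{a:1, b:true}", "{a:2, b:false}")

  // Any regular Spark operation applies, e.g. a SQL-style filter.
  assert(df.filter("b = true").count() == 1)
}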

Let's compare dataframes

Now, let's imagine that we want to compare two dataframes. You should know that two kinds of differences can show up: differences between the schemas and differences between the values.

First, we will implement a solution without using Spark-Test. To do this, we need to retrieve the schemas of the two dataframes, retrieve their values, and compare both to see whether our dataframes differ. Here is one of the many possible solutions.

// without using Spark-Test
import org.apache.spark.sql.DataFrame

def isEqual(df1: DataFrame, df2: DataFrame): Unit = {
  if (!df1.schema.equals(df2.schema))
    throw new Exception("schemas are different")
  if (!df1.collect().sameElements(df2.collect()))
    throw new Exception("values are different")
}

test("some test without Spark-Test") {
  val df1 = dataframe("{a:1}")
  val df2 = dataframe("{a:1, c:false}")

  isEqual(df1, df2)
  /*
   * java.lang.Exception:
   * schemas are different
   */
}

This solution has several problems:

  • The lack of information provided by our Exception: we can tell whether the equality problem comes from the schema or from the values, but we cannot tell exactly what the problem is, how the schemas differ from each other, or which values are different.
  • The lack of flexibility: you may want to compare only the columns the two dataframes have in common, for example looking only at column "a" in the dataframes above.

This may seem trivial, since our dataframes do not contain more than two columns and two rows. But imagine two dataframes with thousands of columns and millions of rows...

Now, let's solve this problem using Spark-Test.

test("some test with Spark-Test") {
  val df1 = dataframe("{a:1}")
  val df2 = dataframe("{a:1, c:false}")

  df1.assertEquals(df2)
  /*
   * io.univalence.sparktest.SparkTest$SchemaError: 
   * Field c was not in the original DataFrame.
   */
}

We obtain a nice SchemaError. And that's not all! We also get the reason for this difference, and that is what Spark-Test is all about. The error here comes from the column "c", which is not present in the original dataframe.

But, but… I wanted to compare columns in common 😣

Don't panic! Don't panic! Simply use the Spark-Test configuration and specify that you want to ignore the extra columns located in df2.

test("some test with custom configuration") {
  val df1 = dataframe("{a:1}")
  val df2 = dataframe("{a:1, c:false}")

  withConfiguration(failOnMissingExpectedCol = false)({ df1.assertEquals(df2) })
  /*
   * Success
   */
}

Yeaaahh! The test goes off without a hitch. 🤗

Now, let's look at the case where two dataframes have the same schema but still differ. This is caused by one or more value errors, i.e. mismatches within the columns themselves.

test("some test with same schema") {
  val df1 = dataframe("{a:1, b:true}",  "{a:2, b:true}", "{a:3, b:true}")
  val df2 = dataframe("{a:1, b:false}", "{a:2, b:true}", "{a:4, b:false}")

  df1.assertEquals(df2)
  /*
   * io.univalence.sparktest.SparkTest$ValueError: 
   * The data set content is different :
   *
   * in value at b, false was not equal to true
   * dataframe({b: true, a: 1})
   * dataframe({b: false, a: 1})
   *
   * in value at b, false was not equal to true
   * in value at a, 4 was not equal to 3
   * dataframe({b: true, a: 3})
   * dataframe({b: false, a: 4})
   */
}

Finally, here it is, the second error: ValueError. This error tells us how the two dataframes differ, specifying each error as well as the rows where it was found in either df1 or df2.

Precise errors, flexibility, and ease of use: these are the strengths of Spark-Test. The library provides other features (such as a function to check a predicate over a whole Dataset, sketched below) that can make your life as a data engineer easier. The cherry on top? Spark-Test is fully open source. It is available here:

https://github.com/univalence/spark-tools/tree/master/spark-test
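About that predicate feature mentioned above: the project README documents helpers for asserting a predicate over a whole Dataset. Here is a minimal sketch, assuming the shouldForAll and shouldExists helpers described in the README, and assuming the SparkTest trait exposes its SparkSession as ss for the toDS conversion (double-check the exact names in the repository):

test("check a predicate over a whole Dataset") {
  import ss.implicits._ // assumption: SparkTest exposes a SparkSession named `ss`

  val ds = Seq(-1, 2, 3).toDS()

  ds.shouldExists((i: Int) => i > 0)  // passes: at least one element is positive
  ds.shouldForAll((i: Int) => i >= 0) // would fail and report the offending element, -1
}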

We welcome any feedback and of course, we'll be delighted to hear about your use cases! 😁