How to Remove Duplicate Rows in a Spark Data Frame

The Apache Spark Data Frame API allows you to read data from various sources and create a Spark Data Frame out of the source data. That Data Frame, however, may contain duplicate rows, and duplicates can creep in for various reasons; for example, an ETL tool that moves data from one place to another may create duplicate records. Spark does not prevent duplicates or throw an error when they are present, yet you usually do not want them in your Data Frame. If you have duplicates in your Spark Data Frame, you may want to ask the following questions.

  1. How to identify if you have duplicate rows in your Spark Data Frame
  2. How to identify and list duplicate rows in your Spark Data Frame
  3. How to remove duplicate rows from your Spark Data Frame

Here is an example PySpark program that demonstrates the answers to the above questions. An explanation is given below the code.

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession \
                .builder \
                .appName("DuplicateDemo") \
                .master("local[3]") \
                .getOrCreate()

    data_list = [(101, "Mumbai", "Goa"),
                 (102, "Mumbai", "Bangalore"),
                 (102, "Mumbai", "Bangalore"),
                 (103, "Delhi", "Chennai"),
                 (104, "Bangalore", "Kolkata")]

    df = spark.createDataFrame(data_list) \
              .toDF("id", "source", "destination")

    # How to check if you have duplicates
    dups = df.count() - df.distinct().count()
    if dups > 0:
        print("You have {} duplicates".format(dups))

    # How to find the duplicate row
    df.groupBy(df.columns) \
      .count() \
      .filter('count > 1') \
      .show()

    # How to remove duplicates
    df.dropDuplicates() \
      .show()

    # Stop the Spark session once the demo is done.
    spark.stop()


Creating a Test Data Frame

While working with Apache Spark, you will usually read data from a source file or a source system. However, for testing ideas or learning concepts, you may want to create a small test Data Frame. Apache Spark allows you to create a Data Frame from a Python list. In the code above, data_list is a plain Python list of tuples, spark.createDataFrame() turns it into a Data Frame, and toDF() assigns the column names id, source, and destination.
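In the code above, Spark infers the column types from the data. If you prefer explicit types, createDataFrame() also accepts a schema; here is a minimal sketch of the same test Data Frame built with a DDL-style schema string (the schema text is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
           .appName("DuplicateDemo") \
           .master("local[3]") \
           .getOrCreate()

# The same test data, but with an explicit schema so Spark
# does not have to infer the column types.
data_list = [(101, "Mumbai", "Goa"),
             (102, "Mumbai", "Bangalore"),
             (102, "Mumbai", "Bangalore"),
             (103, "Delhi", "Chennai"),
             (104, "Bangalore", "Kolkata")]

df = spark.createDataFrame(data_list,
                           schema="id INT, source STRING, destination STRING")
df.printSchema()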

How to identify if you have duplicate rows in your Apache Spark Data Frame

Checking for duplicate rows in an Apache Spark Data Frame is easy. All you need to do is count the Data Frame and take a distinct count of the same Data Frame. If the plain count is greater than the distinct count, you have duplicates, and the difference between the two counts is the number of duplicate records. The code above computes exactly this difference and prints a message when it is greater than zero.
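For example, you could wrap this check in a small reusable helper (the function name count_duplicates is just for illustration):

def count_duplicates(df):
    # The difference between the total row count and the
    # distinct row count is the number of duplicate rows.
    return df.count() - df.distinct().count()

# For the test data above, this prints: You have 1 duplicates
dups = count_duplicates(df)
if dups > 0:
    print("You have {} duplicates".format(dups))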

How to identify and list duplicate rows in your Apache Spark Data Frame

Once you know you have duplicates, you may want to list those duplicate records. In a real application, you may want to identify duplicate rows and log them or store them somewhere, so you can investigate the root cause of the duplication or take appropriate action at a later stage. Identifying duplicate rows is easy to do with a three-step process:

  1. Group the Data Frame by all of its columns.
  2. Calculate the count on the grouped Data Frame.
  3. Apply a filter and keep only the groups with a count greater than 1. These are your duplicate records (see the sample output after this list).
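
For the test Data Frame above, only the row (102, Mumbai, Bangalore) appears more than once, so show() prints a result along these lines (column widths may differ):

+---+------+-----------+-----+
| id|source|destination|count|
+---+------+-----------+-----+
|102|Mumbai|  Bangalore|    2|
+---+------+-----------+-----+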

The resultant Data Frame contains all the duplicate records. You can log these records or save them in a data file.
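A minimal sketch of saving them, assuming you have a writable output location (the path below is only a placeholder):

# Collect the duplicate rows into their own Data Frame.
dup_rows = df.groupBy(df.columns) \
             .count() \
             .filter('count > 1')

# Persist the duplicates for later investigation.
dup_rows.write.mode("overwrite").parquet("/tmp/duplicate_rows")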

How to remove duplicate rows from your Spark Data Frame

Removing duplicate rows is the easiest part of the process. You can simply call the distinct() method on your Data Frame, and the resultant Data Frame will have no duplicates. However, the Spark Data Frame API offers a more flexible method to remove duplicate rows: the code above uses dropDuplicates() instead. The dropDuplicates() method is more flexible because it can take a list of columns as input and remove duplicates based only on those columns, as shown in the sketch below. I will cover the difference between distinct() and dropDuplicates() with an appropriate example in a separate article.
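For example, if two rows should count as duplicates whenever they share the same id, whatever their other columns contain, you can pass just that column (a sketch using the test Data Frame from above):

# Keep one row per id; which of the matching rows survives
# is not guaranteed, so do not rely on a particular order.
df.dropDuplicates(["id"]).show()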

Conclusion

Working with duplicates in an Apache Spark Data Frame is one of the most common data engineering problems. It is also a frequent interview question and a popular exercise in coding assessments. Fortunately, dealing with duplicates is easy, as explained in this article. However, this article covers only the simplest scenario of record duplication; I will cover more scenarios in other articles.

For more details, visit www.scholarnest.com.
