Find duplicate rows in a DataFrame based on all or selected columns
In this article, we will discuss how to find duplicate rows in a DataFrame based on all columns or on a selected list of columns. For this we will use the DataFrame.duplicated() method of Pandas.
Syntax : DataFrame.duplicated(subset=None, keep='first')
Parameters:
subset: Takes a column label or a list of column labels. Its default value is None, which means all columns are used. When columns are passed, only those columns are considered when identifying duplicates.
keep: Controls which occurrences are marked as duplicates. It accepts three values and the default is 'first'.
- If 'first', the first occurrence is treated as unique and the remaining identical rows are marked as duplicates.
- If 'last', the last occurrence is treated as unique and the remaining identical rows are marked as duplicates.
- If False (the boolean, not the string 'False'), all identical rows are marked as duplicates.
Returns: Boolean Series denoting duplicate rows.
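To make the return value concrete, here is a minimal sketch on a small made-up DataFrame (the data and variable names are illustrative assumptions, not part of the article's dataset). It shows the boolean Series produced by each keep option and how that Series can be counted or used as a row filter.

```python
import pandas as pd

# Hypothetical toy data: rows 0, 2 and 3 are identical.
toy = pd.DataFrame({"Name": ["A", "B", "A", "A"], "Age": [1, 2, 1, 1]})

mask_first = toy.duplicated()             # keep='first' is the default
mask_last = toy.duplicated(keep="last")   # last occurrence is not marked
mask_all = toy.duplicated(keep=False)     # every repeated row is marked

print(mask_first.tolist())  # [False, False, True, True]
print(mask_last.tolist())   # [True, False, True, False]
print(mask_all.tolist())    # [True, False, True, True]

# The result is a boolean Series, so it can be summed or used for indexing.
n_dups = int(mask_first.sum())   # number of rows marked as duplicates
dup_rows = toy[mask_first]       # the marked rows themselves
print(n_dups)                    # 2
```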
Let's create a simple DataFrame from a list of tuples, with the column names 'Name', 'Age' and 'City'.
    # Import pandas library
    import pandas as pd

    # List of tuples
    employees = [("Stuti", 28, "Varanasi"),
                 ("Saumya", 32, "Delhi"),
                 ("Aaditya", 25, "Mumbai"),
                 ("Saumya", 32, "Delhi"),
                 ("Saumya", 32, "Delhi"),
                 ("Saumya", 32, "Mumbai"),
                 ("Aaditya", 40, "Dehradun"),
                 ("Seema", 32, "Delhi")]

    # Creating a DataFrame object
    df = pd.DataFrame(employees, columns=["Name", "Age", "City"])

    # Print the DataFrame
    df
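Output :

          Name  Age      City
    0    Stuti   28  Varanasi
    1   Saumya   32     Delhi
    2  Aaditya   25    Mumbai
    3   Saumya   32     Delhi
    4   Saumya   32     Delhi
    5   Saumya   32    Mumbai
    6  Aaditya   40  Dehradun
    7    Seema   32     Delhi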
Example 1 : Select duplicate rows based on all columns.
Here we do not pass any arguments, so both parameters take their default values: subset=None and keep='first'.
    # Import pandas library
    import pandas as pd

    # List of tuples
    employees = [("Stuti", 28, "Varanasi"),
                 ("Saumya", 32, "Delhi"),
                 ("Aaditya", 25, "Mumbai"),
                 ("Saumya", 32, "Delhi"),
                 ("Saumya", 32, "Delhi"),
                 ("Saumya", 32, "Mumbai"),
                 ("Aaditya", 40, "Dehradun"),
                 ("Seema", 32, "Delhi")]

    # Creating a DataFrame object
    df = pd.DataFrame(employees, columns=["Name", "Age", "City"])

    # Selecting duplicate rows except first
    # occurrence based on all columns
    duplicate = df[df.duplicated()]

    print("Duplicate Rows :")

    # Print the resultant DataFrame
    duplicate
Output :

         Name  Age   City
    3  Saumya   32  Delhi
    4  Saumya   32  Delhi
Example 2 : Select duplicate rows based on all columns, keeping the last occurrence.
If you want to mark every duplicate except the last occurrence, pass keep='last' as an argument.
    # Import pandas library
    import pandas as pd

    # List of tuples
    employees = [("Stuti", 28, "Varanasi"),
                 ("Saumya", 32, "Delhi"),
                 ("Aaditya", 25, "Mumbai"),
                 ("Saumya", 32, "Delhi"),
                 ("Saumya", 32, "Delhi"),
                 ("Saumya", 32, "Mumbai"),
                 ("Aaditya", 40, "Dehradun"),
                 ("Seema", 32, "Delhi")]

    # Creating a DataFrame object
    df = pd.DataFrame(employees, columns=["Name", "Age", "City"])

    # Selecting duplicate rows except last
    # occurrence based on all columns
    duplicate = df[df.duplicated(keep="last")]

    print("Duplicate Rows :")

    # Print the resultant DataFrame
    duplicate
Output :

         Name  Age   City
    1  Saumya   32  Delhi
    3  Saumya   32  Delhi
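Examples 1 and 2 cover keep='first' and keep='last'. The third option, keep=False, can be sketched in the same way. The snippet below reuses the article's employee data and is only an illustration of that option: it selects every copy of a repeated row, including the first and last occurrences.

```python
import pandas as pd

# Same employee data as in the examples above
employees = [("Stuti", 28, "Varanasi"),
             ("Saumya", 32, "Delhi"),
             ("Aaditya", 25, "Mumbai"),
             ("Saumya", 32, "Delhi"),
             ("Saumya", 32, "Delhi"),
             ("Saumya", 32, "Mumbai"),
             ("Aaditya", 40, "Dehradun"),
             ("Seema", 32, "Delhi")]

df = pd.DataFrame(employees, columns=["Name", "Age", "City"])

# keep=False marks all occurrences of a repeated row as duplicates
all_copies = df[df.duplicated(keep=False)]

print("All copies of duplicated rows :")
print(all_copies)
```

With this data the selection contains rows 1, 3 and 4, the three identical ("Saumya", 32, "Delhi") entries.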
Example 3 : To select duplicate rows based on only some of the columns, pass a column name or a list of column names as the subset argument.
    # Import pandas library
    import pandas as pd

    # List of tuples
    employees = [("Stuti", 28, "Varanasi"),
                 ("Saumya", 32, "Delhi"),
                 ("Aaditya", 25, "Mumbai"),
                 ("Saumya", 32, "Delhi"),
                 ("Saumya", 32, "Delhi"),
                 ("Saumya", 32, "Mumbai"),
                 ("Aaditya", 40, "Dehradun"),
                 ("Seema", 32, "Delhi")]

    # Creating a DataFrame object
    df = pd.DataFrame(employees, columns=["Name", "Age", "City"])

    # Selecting duplicate rows based
    # on "City" column
    duplicate = df[df.duplicated("City")]

    print("Duplicate Rows based on City :")

    # Print the resultant DataFrame
    duplicate
Output :

         Name  Age    City
    3  Saumya   32   Delhi
    4  Saumya   32   Delhi
    5  Saumya   32  Mumbai
    7   Seema   32   Delhi
Example 4 : Select duplicate rows based on more than one column name.
    # Import pandas library
    import pandas as pd

    # List of tuples
    employees = [("Stuti", 28, "Varanasi"),
                 ("Saumya", 32, "Delhi"),
                 ("Aaditya", 25, "Mumbai"),
                 ("Saumya", 32, "Delhi"),
                 ("Saumya", 32, "Delhi"),
                 ("Saumya", 32, "Mumbai"),
                 ("Aaditya", 40, "Dehradun"),
                 ("Seema", 32, "Delhi")]

    # Creating a DataFrame object
    df = pd.DataFrame(employees, columns=["Name", "Age", "City"])

    # Selecting duplicate rows based
    # on list of column names
    duplicate = df[df.duplicated(["Name", "Age"])]

    print("Duplicate Rows based on Name and Age :")

    # Print the resultant DataFrame
    duplicate
Output :

         Name  Age    City
    3  Saumya   32   Delhi
    4  Saumya   32   Delhi
    5  Saumya   32  Mumbai
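The subset and keep parameters can also be combined. As a minimal sketch on the same employee data (the variable name same_name_age is an illustrative choice), the snippet below uses subset=["Name", "Age"] together with keep=False to list every row that shares its Name and Age with at least one other row.

```python
import pandas as pd

# Same employee data as in the examples above
employees = [("Stuti", 28, "Varanasi"),
             ("Saumya", 32, "Delhi"),
             ("Aaditya", 25, "Mumbai"),
             ("Saumya", 32, "Delhi"),
             ("Saumya", 32, "Delhi"),
             ("Saumya", 32, "Mumbai"),
             ("Aaditya", 40, "Dehradun"),
             ("Seema", 32, "Delhi")]

df = pd.DataFrame(employees, columns=["Name", "Age", "City"])

# Compare only Name and Age, and mark every member of a repeated group
same_name_age = df[df.duplicated(subset=["Name", "Age"], keep=False)]

print("All rows sharing Name and Age with another row :")
print(same_name_age)
```

Here the selection contains rows 1, 3, 4 and 5. Row 5 is included even though its City differs, because City is not part of the subset.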