Data Manipulation in Python using Pandas

Published: March 27, 2022

In machine learning, a model needs a dataset to work with, that is, to train and test on. But data does not come fully prepared and ready to use. Many rows and columns contain discrepancies such as “NaN” / “Null” / “NA” values. Sometimes the dataset also contains rows and columns that our model does not even need. In such cases, the dataset has to be properly cleaned and modified to make it an effective input for the model. We achieve this by practicing “data wrangling” before feeding the data to the model.
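As a rough illustration of the cleanup described above, the sketch below drops a column that is not needed and then drops the rows that still have missing values. The table, its column names and its values are made up for this illustration and do not come from a real dataset.

```python
import numpy as np
import pandas as pd

# hypothetical raw data with one missing value
# and one column the model does not need
raw = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [20, np.nan, 19],
    "Comment": ["x", "y", "z"],
})

# count the missing values in each column
missing_per_column = raw.isna().sum()

# drop the unneeded column, then drop the rows
# that still contain a missing value
cleaned = raw.drop(columns=["Comment"]).dropna()

print(missing_per_column)
print(cleaned)
```

Dropping rows is only one option; depending on the data, filling the missing values (for example with fillna()) may be a better choice.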

Alright, let's dive into the programming part. Our first aim is to create a Pandas dataframe in Python, since pandas is one of the most widely used Python libraries.

Example:

# importing the pandas library
import pandas as pd
  
  
# creating a dataframe object
student_register = pd.DataFrame()
  
# assigning values to the 
# rows and columns of the
# dataframe
student_register["Name"] = ["Abhijit",
                            "Smriti",
                            "Akash",
                            "Roshni"]
  
student_register["Age"] = [20, 19, 20, 14]
student_register["Student"] = [False, True,
                               True, False]
  
student_register

Output:

As you can see, the dataframe object has four rows [0, 1, 2, 3] and three columns [“Name”, “Age”, “Student”]. The column that holds the index values [0, 1, 2, 3] is known as the index, and pandas adds it to every dataframe by default. It can be changed as needed, for example by using one of the existing columns as the index.
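A minimal sketch of changing the index, assuming we want to look students up by name. The variable names by_name and smriti_age are introduced here for illustration only; the dataframe is rebuilt from a dictionary so that the snippet runs on its own.

```python
import pandas as pd

# the same register as above, built in one step
student_register = pd.DataFrame({
    "Name": ["Abhijit", "Smriti", "Akash", "Roshni"],
    "Age": [20, 19, 20, 14],
    "Student": [False, True, True, False],
})

# use the "Name" column as the index instead of 0..3;
# set_index() returns a new dataframe and leaves
# student_register unchanged
by_name = student_register.set_index("Name")

# rows can now be looked up by name
smriti_age = by_name.loc["Smriti", "Age"]
print(by_name)
print(smriti_age)
```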
Next, suppose we want to add a new student to the register, i.e. add a new row to the existing dataframe.

One important concept is that each row of a pandas dataframe can be represented as a Series object whose index labels are the column names. Hence adding a new row means creating a new Series object with those labels and adding it to the dataframe.
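The row-as-Series idea can be checked with a short sketch that pulls one row out of the register. The names first_row and labels are introduced here for illustration, and the dataframe is rebuilt so that the snippet is self-contained.

```python
import pandas as pd

student_register = pd.DataFrame({
    "Name": ["Abhijit", "Smriti", "Akash", "Roshni"],
    "Age": [20, 19, 20, 14],
    "Student": [False, True, True, False],
})

# select the first row by position; the result is a Series
first_row = student_register.iloc[0]

# the index labels of that Series are the column names
labels = list(first_row.index)

print(type(first_row))
print(labels)
```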

Example:

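# creating a new pandas
# series object
new_person = pd.Series(["Mansi", 19, True],
                       index=["Name", "Age",
                              "Student"])

# DataFrame.append() was removed in pandas 2.0,
# so turn the series into a one-row dataframe
# and concatenate it; this returns a new dataframe
# and leaves student_register unchanged
pd.concat([student_register, new_person.to_frame().T],
          ignore_index=True)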

Output:

Before processing and wrangling any data, you need an overview of it, including summary statistics such as the mean, the standard deviation (std) and the quartiles. You also need to know what each column holds, i.e. what type of value it stores and how many values are missing. Three tools help here: the .shape attribute and the .info() and .describe() methods, which give the shape of the table, information on the index and columns, and summary statistics of the dataframe (numerical columns only) respectively.

Example:

# for showing the dimension 
# of the dataframe
print("Shape")
print(student_register.shape)
  
# showing info about the data 
print(" Info ")
student_register.info()
  
# for showing the statistical 
# info of the dataframe
print(" Describe")
student_register.describe()

Output:

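In the above example, the .shape attribute gives (4, 3), the size of the created dataframe: four rows and three columns. The row for “Mansi” is not counted because the result of adding it was never assigned back to student_register.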

The output of the .info() method can be read as follows:

  1. “RangeIndex” describes the index, i.e. [0, 1, 2, 3] in our dataframe. Its number of entries is the number of rows in the dataframe.
  2. As the name suggests, “Data columns” gives the total number of columns.
  3. “Name”, “Age” and “Student” are the names of the columns in our data. “non-null” tells us how many entries in the corresponding column are not NA / NaN / None; here it equals the number of rows, so no values are missing. “object”, “int64” and “bool” are the data types of the columns.
  4. “dtypes” summarizes how many columns of each data type are present in the dataframe, which in turn simplifies the data cleaning process.
  5. “memory usage” reports roughly how much memory the dataframe occupies, which matters when working with large datasets and models.
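The individual pieces of the .info() report can also be obtained separately, which is handy when they are needed in code rather than on screen. The sketch below is one way to do that; the variable names are introduced here for illustration, and the dataframe is rebuilt so that the snippet is self-contained.

```python
import io

import pandas as pd

student_register = pd.DataFrame({
    "Name": ["Abhijit", "Smriti", "Akash", "Roshni"],
    "Age": [20, 19, 20, 14],
    "Student": [False, True, True, False],
})

# data type of each column
column_types = student_register.dtypes

# number of missing values in each column
missing_counts = student_register.isna().sum()

# memory used by the index and each column, in bytes;
# deep=True also measures the strings themselves
memory_bytes = student_register.memory_usage(deep=True)

# info() prints to stdout by default;
# pass a buffer to capture the report as text
buffer = io.StringIO()
student_register.info(buf=buffer)
info_text = buffer.getvalue()

print(column_types)
print(missing_counts)
print(memory_bytes)
```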

The output of the .describe() method can be read as follows:

  1. count is the number of non-null entries in the column; here it equals the number of rows.
  2. mean is the mean value of all the entries in the “Age” column.
  3. std is the standard deviation of the corresponding column.
  4. min and max are the minimum and maximum entries in the column respectively.
  5. 25%, 50% and 75% are the first quartile, the second quartile (median) and the third quartile respectively. They give important information about the distribution of the data, which helps when choosing and applying an ML model.
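To see where these numbers come from, the sketch below computes the same statistics one at a time for the “Age” values used in this article and compares them with .describe(). The variable names are introduced here for illustration.

```python
import pandas as pd

# the "Age" column from the register
ages = pd.Series([20, 19, 20, 14], name="Age")

# all summary statistics at once
summary = ages.describe()

# the same numbers computed one at a time
count = ages.count()    # number of non-null entries
mean = ages.mean()
std = ages.std()        # sample standard deviation (ddof=1)
q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])

print(summary)
print(count, mean, std, q1, median, q3)
```

For these four values the mean is 18.25 and the median is 19.5. The mean is pulled below the median by the single low value 14, which is the kind of hint about the distribution that the quartiles provide.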
