How to Create a New Column in Pandas
Understanding DataFrames in Pandas
Before we dive into creating new columns, it's essential to understand the basic structure we're working with. In Pandas, a DataFrame can be thought of as a table, much like one you might create in a spreadsheet program like Microsoft Excel. This table is composed of rows and columns, with each column having a name that describes the data it contains.
Adding a New Column with a Default Value
Let's start with the simplest case: adding a new column where every row gets the same value. This is like handing out the same textbook to every student in a class.
import pandas as pd
# Create a simple DataFrame
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]
})
# Add a new column with a default value
df['Country'] = 'Unknown'
print(df)
This code will output a DataFrame with a new column named 'Country', and each row will have the value 'Unknown'.
Creating a Column Based on Operations with Existing Columns
Imagine you have a fruit basket with apples and oranges, and you want to know the total number of fruits. You simply add the number of apples to the number of oranges. Similarly, you can create a new column in a DataFrame by performing operations on existing columns.
# Assume 'df' already has numeric 'Apples' and 'Oranges' columns
# Add a new column that is the sum of two existing columns
df['Total Fruit'] = df['Apples'] + df['Oranges']
print(df)
This will add a new column 'Total Fruit' where each row contains the sum of the numbers in the 'Apples' and 'Oranges' columns for that row.
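The snippet above assumes the fruit columns already exist. As a minimal self-contained sketch, the version below builds a small DataFrame first; the name baskets, the fruit counts, and the extra 'Apple Share' column are made up for illustration.

```python
import pandas as pd

# Hypothetical fruit counts, one row per basket
baskets = pd.DataFrame({
    'Apples': [3, 5, 2],
    'Oranges': [4, 1, 6],
})

# Column arithmetic is element-wise: each row gets its own Apples + Oranges
baskets['Total Fruit'] = baskets['Apples'] + baskets['Oranges']

# Other arithmetic operators work the same way (illustrative extra column)
baskets['Apple Share'] = baskets['Apples'] / baskets['Total Fruit']

print(baskets)
```

Because the operation runs on whole columns at once, no explicit loop over rows is needed.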
Using the assign Method to Create Columns
Pandas also provides a method named assign that allows you to create new columns in a more functional style. Think of it as adding a new room to your house using a special construction kit that does it all in one go.
# Assume 'df' has numeric 'Income' and 'Age' columns
# Using `assign` to add a new column
df = df.assign(Income_Per_Age=df['Income'] / df['Age'])
print(df)
The assign method returns a new DataFrame with the new column added and leaves the original unchanged, which is why the result is assigned back to df. Here, 'Income_Per_Age' is calculated by dividing 'Income' by 'Age' for each row.
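To make the "returns a new DataFrame" behavior concrete, here is a small sketch with invented data. The people DataFrame and its 'Income' values are assumptions for illustration only. The second call passes a callable, which assign evaluates against the DataFrame it is called on.

```python
import pandas as pd

# Hypothetical data; the 'Income' values are invented for illustration
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Income': [50000, 66000, 70000],
})

# assign returns a new DataFrame; 'people' itself is not modified
enriched = people.assign(Income_Per_Age=people['Income'] / people['Age'])

# A callable receives the DataFrame being assigned to, which helps in method chains
enriched_chained = people.assign(Income_Per_Age=lambda d: d['Income'] / d['Age'])

print(enriched)
print('Income_Per_Age' in people.columns)  # False
```

The callable form can be convenient when the DataFrame is an intermediate result that has no variable name of its own.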
Conditional Column Creation with np.where
Sometimes you need to make decisions based on certain conditions, like turning on the lights only when it's dark. In Pandas, you can use np.where from NumPy to create a column with values based on a condition.
import numpy as np
# 'df' is our DataFrame
# Create a new column with conditions
df['Age Group'] = np.where(df['Age'] < 30, 'Young', 'Old')
print(df)
This will create a new column 'Age Group' where the value is 'Young' if the 'Age' is less than 30 and 'Old' otherwise.
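The following sketch breaks the same idea into two steps so the intermediate boolean condition is visible. The people DataFrame and the is_young variable are illustrative names, not part of the original example.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

# The condition is a boolean Series with one True/False per row
is_young = people['Age'] < 30

# np.where takes the second argument where the condition is True, the third otherwise
people['Age Group'] = np.where(is_young, 'Young', 'Old')

print(people)
```

Note that an age of exactly 30 falls into 'Old' here, because the comparison is a strict less-than.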
Applying Functions to Create Columns
Sometimes, the operation you need to perform is more complex, like preparing a gourmet dish instead of just making a sandwich. For these cases, you can apply a function to each value of a column to create a new column.
# 'df' is our DataFrame
# Define a function that will determine the category
def determine_category(age):
    if age < 20:
        return 'Teenager'
    elif age < 60:
        return 'Adult'
    else:
        return 'Senior'
# Apply the function to the 'Age' column to create a new 'Category' column
df['Category'] = df['Age'].apply(determine_category)
print(df)
The apply method runs the determine_category function for each value in the 'Age' column and creates a new 'Category' column with the resulting values.
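When the new value depends on more than one column, one option is to call apply on the DataFrame itself with axis=1, so the function receives a whole row. The sketch below is an illustration under that assumption; the describe function, the 'Label' column, and the sample ages are hypothetical.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [15, 30, 65],
})

def describe(row):
    """Build a label from several columns of a single row."""
    if row['Age'] < 20:
        category = 'Teenager'
    elif row['Age'] < 60:
        category = 'Adult'
    else:
        category = 'Senior'
    return f"{row['Name']} ({category})"

# axis=1 passes each row to the function as a Series
people['Label'] = people.apply(describe, axis=1)

print(people)
```

Row-wise apply is flexible but tends to be slower than column arithmetic on large DataFrames, so it is usually worth reserving for logic that cannot be expressed on whole columns.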
Concatenating Columns to Create a New Column
Sometimes you might want to combine two pieces of information, like a first name and a last name to get a full name. In Pandas, you can concatenate columns to create a new one.
# Assume 'df' has string 'First Name' and 'Last Name' columns
# Concatenate two columns to create a new one
df['Full Name'] = df['First Name'] + ' ' + df['Last Name']
print(df)
This will create a new column 'Full Name' by joining the values in 'First Name' and 'Last Name' for each row, separated by a space.
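The + operator only concatenates when both sides are strings. As a rough sketch with invented names and ages, the example below also shows one way to include a numeric column by converting it with astype(str) first. The 'Label' column is hypothetical.

```python
import pandas as pd

people = pd.DataFrame({
    'First Name': ['Alice', 'Bob'],
    'Last Name': ['Smith', 'Jones'],
    'Age': [25, 30],
})

# String columns can be joined directly with +
people['Full Name'] = people['First Name'] + ' ' + people['Last Name']

# Numeric columns need to be converted to strings before concatenation
people['Label'] = people['Full Name'] + ', ' + people['Age'].astype(str)

print(people)
```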
Using apply with a Lambda Function
If you need a quick, one-off function to create a column, you don't always have to define it separately. You can use a lambda function, which is like a disposable camera that you use once and then throw away.
# 'df' is our DataFrame
# Create a new column using a lambda function
df['Is Minor'] = df['Age'].apply(lambda x: 'Yes' if x < 18 else 'No')
print(df)
The lambda function here checks if the 'Age' is less than 18 and assigns 'Yes' or 'No' to the new 'Is Minor' column accordingly.
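For a simple two-way condition like this one, the lambda and the np.where approach from earlier produce the same column. The sketch below runs both side by side on made-up ages; the 'Is Minor (where)' column name is only there for comparison.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [12, 17, 40],
})

# Element-by-element version using apply and a lambda
people['Is Minor'] = people['Age'].apply(lambda x: 'Yes' if x < 18 else 'No')

# Same result from a single vectorized call
people['Is Minor (where)'] = np.where(people['Age'] < 18, 'Yes', 'No')

print(people)
```

Which form to use is largely a matter of readability; the lambda generalizes more easily to logic with several branches.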
Conclusion
Creating new columns in a Pandas DataFrame is a fundamental skill, akin to learning how to add new ingredients to a recipe to enhance the flavor. Whether you're assigning a default value, performing operations between columns, using conditional logic, or applying functions, each method offers a unique way to enrich your data. Just like a skilled chef knows which spices will perfect their dish, a proficient data analyst knows how to manipulate and extend their data to draw out the most insightful flavors. With practice, you'll find that adding columns becomes second nature, allowing you to deftly prepare your data for whatever analysis you have in mind. Keep experimenting with these techniques, and you'll soon be serving up data masterpieces with the confidence of a gourmet data chef.