How to filter columns in Pandas
Understanding DataFrames in Pandas
Before we dive into the specifics of filtering columns in Pandas, let's understand what a DataFrame is, as it's the central structure we'll be working with. You can think of a DataFrame as a table, much like one you would find in a spreadsheet. It's composed of rows and columns, with each column having a name that you can use to access it.
Getting Started with Pandas
To start working with Pandas, you need to have it installed on your computer. If you haven't done so, you can install it using pip, which is a package manager for Python:
pip install pandas
Once installed, you can import Pandas in your Python script like this:
import pandas as pd
Here, pd is a common alias for Pandas, so you don't have to type pandas every time you want to use a function from the library.
Creating a Sample DataFrame
To show you how to filter columns, we'll need a DataFrame to work with. Let's create a simple one:
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
}
df = pd.DataFrame(data)
print(df)
This code will give us a DataFrame that looks like this:
      Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago
3    David   32      Houston
4      Eva   29      Phoenix
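Before filtering, it can help to check which column names are available and how large the table is. The following is a small sketch that rebuilds the same sample DataFrame so it runs on its own; the variable names column_names, n_rows, and n_cols are just illustrative choices.

```python
import pandas as pd

# Rebuild the sample DataFrame so this snippet is self-contained.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
})

# Column labels as a plain Python list, in their original order.
column_names = df.columns.tolist()

# shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape

print(column_names)
print(n_rows, n_cols)
```

These are the names you can pass when selecting columns in the sections that follow.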
Basic Column Filtering
Filtering columns in Pandas is like asking a question: "Can you show me only this specific information from the table?" Let's say we want only the 'Name' and 'Age' columns from our DataFrame. Here's how you do it:
filtered_df = df[['Name', 'Age']]
print(filtered_df)
The output will be:
      Name  Age
0    Alice   24
1      Bob   27
2  Charlie   22
3    David   32
4      Eva   29
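Notice how we used double square brackets [[ ]]. The outer pair is the indexing operator, and the inner pair is a Python list of column names. Passing a list returns a DataFrame, whereas passing a single name in single brackets, such as df['Name'], returns a Series.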
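To see the difference the brackets make, here is a minimal sketch that compares the two forms on the sample data. The names single and double are only illustrative.

```python
import pandas as pd

# Rebuild the sample DataFrame so this snippet is self-contained.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
})

single = df['Name']    # one column name: a one-dimensional Series
double = df[['Name']]  # a list with one name: a DataFrame with one column

print(type(single).__name__)
print(type(double).__name__)
print(double.shape)
```

If later code expects a table (for example, to select more columns from the result), the list form is usually the safer choice.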
Using loc and iloc to Filter Columns
Pandas provides two powerful indexers, loc and iloc, for more advanced filtering. Both are used with square brackets rather than called like functions. The loc indexer is label-based, which means you use the column names to filter. The iloc indexer is position-based, which means you use the columns' integer positions, counting from 0.
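To connect the two ideas, the sketch below maps each column label in the sample DataFrame to its integer position. It relies on the Index.get_loc method of df.columns; the positions dictionary is just an illustrative name.

```python
import pandas as pd

# Rebuild the sample DataFrame so this snippet is self-contained.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
})

# For unique column labels, get_loc returns the zero-based position of a label.
positions = {name: df.columns.get_loc(name) for name in df.columns}

print(positions)
```

A label is what you would pass to loc, and the matching position is what you would pass to iloc.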
Label-based Indexing with loc
If you want to select all rows but only specific columns by their names, you can use loc like this:
filtered_df = df.loc[:, ['Name', 'City']]
print(filtered_df)
This will give you:
      Name         City
0    Alice     New York
1      Bob  Los Angeles
2  Charlie      Chicago
3    David      Houston
4      Eva      Phoenix
The : before the comma means "select all rows," and the list after the comma specifies the columns you want.
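Besides a list of names, loc also accepts a slice of labels for the columns. As a small sketch on the same sample data, the example below selects every column from 'Name' through 'Age'. Unlike ordinary Python slices, label slices with loc include both endpoints. The variable name sliced_df is only illustrative.

```python
import pandas as pd

# Rebuild the sample DataFrame so this snippet is self-contained.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
})

# All rows, and the columns from 'Name' to 'Age' (both endpoints included).
sliced_df = df.loc[:, 'Name':'Age']

print(sliced_df)
```

This form is convenient when the columns you want sit next to each other in the table.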
Position-based Indexing with iloc
Sometimes, you might not know the column names or you just prefer using their integer positions. Here's how you can do this with `