How to Join Two Dataframes in Pandas: A Practical Guide

Pandas is a popular Python library for data analysis and manipulation. One of the most common tasks that you may encounter when working with pandas is joining two dataframes. Joining dataframes allows you to combine data from different sources and perform analysis on the merged data.

In this article, you will learn how to join two dataframes in pandas using different methods, such as merge, concat, join, and append. You will also see some examples of code that demonstrate how to use these methods in practice. By the end of this article, you will be able to join two dataframes in pandas with ease and confidence.

What is a dataframe?

A dataframe is a two-dimensional data structure that consists of rows and columns. Each column represents a variable or a feature, and each row represents an observation or a record. A dataframe can store different types of data, such as numbers, strings, booleans, or dates.

You can create a dataframe in pandas by using the pd.DataFrame() function. For example, the following code creates a dataframe with four columns: name, age, gender, and country.

import pandas as pd
 
df1 = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'gender': ['F', 'M', 'M', 'M'],
    'country': ['USA', 'UK', 'Canada', 'Australia']
})
Pandas dataframe with list of people and name, age, gender and country information.

You can also create a dataframe from other sources, such as CSV files, Excel files, SQL databases, or web pages. For more information on how to create dataframes in pandas, you can refer to the official documentation.
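As a minimal sketch of loading a dataframe from CSV data, the example below reads from an in-memory string so that it runs without any file on disk. The CSV content and the variable names are made up for illustration; in practice you would usually pass a file path to pd.read_csv().

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file.
csv_text = """name,age,gender,country
Alice,25,F,USA
Bob,30,M,UK
"""

# read_csv accepts a path or any file-like object, such as StringIO.
people = pd.read_csv(io.StringIO(csv_text))

print(people.shape)   # (rows, columns)
print(people.dtypes)  # column types inferred from the data
```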

Why join two dataframes?

There are many reasons why you may want to join two dataframes in pandas. Some of the common scenarios are:

  • You have data from different sources that you want to combine and analyze together. For example, you may have sales data from one dataframe and customer data from another dataframe, and you want to see how the sales vary by customer attributes.
  • You have data that is split into multiple dataframes and you want to consolidate them into one dataframe. For example, you may have monthly data that is stored in separate dataframes, and you want to create a yearly dataframe that contains all the data.
  • You have data that is missing some values and you want to fill them with data from another dataframe. For example, you may have a dataframe that has some missing values in the country column, and you want to use another dataframe that has the country information for each name.
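The third scenario can be sketched as follows. This is one possible approach, assuming both dataframes identify people by a name column; the data and the variable names people, lookup, and filled are hypothetical. It uses combine_first(), which keeps the values of the calling dataframe and takes values from the other one only where the caller has a missing value.

```python
import pandas as pd

# Hypothetical data: two people are missing a country.
people = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'country': ['USA', None, None],
})
lookup = pd.DataFrame({
    'name': ['Bob', 'Charlie'],
    'country': ['UK', 'Canada'],
})

# Index both frames by name so that rows are matched by person,
# then fill the gaps in people with the values from lookup.
filled = (
    people.set_index('name')
    .combine_first(lookup.set_index('name'))
    .reset_index()
)

print(filled)
```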

How to join two dataframes in pandas?

There are different methods that you can use to join two dataframes in pandas, depending on how you want to join them. The main methods are:

  • merge(): This method allows you to join two dataframes based on one or more common columns or indexes. You can specify different types of joins, such as inner, outer, left, or right, depending on which rows you want to keep or discard from the dataframes. Rows without a match get NaN in the columns that come from the other dataframe, and the validate parameter lets you check whether the join keys are unique.
  • concat(): This method allows you to join two dataframes by appending them either vertically or horizontally. You can specify whether to keep or ignore the indexes of the dataframes, and whether to keep the union (outer) or the intersection (inner) of the labels on the other axis. Labels that are missing from one dataframe are filled with NaN.
  • join(): This method allows you to join two dataframes based on their indexes. You can specify different types of joins, such as inner, outer, left, or right, depending on which rows you want to keep or discard from the dataframes. If the dataframes share column names, you must supply the lsuffix and rsuffix parameters to tell them apart.
  • append(): This method allows you to join two dataframes by appending them vertically. It is a shortcut for using the concat() method with axis=0, and you can specify whether to keep or ignore the indexes of the dataframes. It was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions you should use concat() instead.
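Regarding duplicate and missing values in merge(), two parameters are worth knowing: validate, which raises an error if the join keys are not unique in the way you expect, and indicator, which adds a _merge column showing where each row came from. The following is a minimal sketch with made-up data; the names left, right, and right_dup are only for illustration.

```python
import pandas as pd

left = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35]})
right = pd.DataFrame({'name': ['Alice', 'Bob', 'Eve'], 'salary': [5000, 6000, 7000]})

# indicator=True adds a _merge column: both, left_only, or right_only.
# validate='one_to_one' checks that name is unique in both frames.
merged = pd.merge(left, right, on='name', how='outer',
                  validate='one_to_one', indicator=True)
print(merged[['name', '_merge']])

# Add a duplicate Alice row to show what validate does with repeated keys.
right_dup = pd.concat([right, right.iloc[[0]]], ignore_index=True)
duplicate_detected = False
try:
    pd.merge(left, right_dup, on='name', validate='one_to_one')
except pd.errors.MergeError:
    duplicate_detected = True
print('duplicate keys detected:', duplicate_detected)
```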

Examples of joining two dataframes in pandas

To illustrate how to use these methods, let us create another dataframe with four columns: name, salary, department, and city.

df2 = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Eve', 'Frank'],
    'salary': [5000, 6000, 7000, 8000],
    'department': ['IT', 'HR', 'Sales', 'Marketing'],
    'city': ['New York', 'London', 'Paris', 'Berlin']
})
 
df2
Pandas dataframe example list of people's names and salaries, departments, cities.

Using merge()

Let us use the merge() method to join the two dataframes based on the name column. We can specify the type of join as inner, which means that we only keep the rows that have matching values in the name column in both dataframes.

df3 = pd.merge(df1, df2, on='name', how='inner')
 
df3
Joined two dataframes in pandas via column name using merge() function

We can see that the resulting dataframe has only two rows, corresponding to Alice and Bob, who are present in both dataframes. Each row contains the name followed by the remaining columns from df1 and then the remaining columns from df2.

We can also specify the type of join as outer, which means that we keep all the rows from both dataframes, and fill the missing values with NaN.

df4 = pd.merge(df1, df2, on='name', how='outer')
 
df4
merged two dataframes with outer join

We can see that the resulting dataframe has six rows, corresponding to all the names from both dataframes. The missing values are filled with NaN.

We can also specify the type of join as left, which means that we keep all the rows from the left dataframe (df1), and fill the missing values with NaN.

df5 = pd.merge(df1, df2, on='name', how='left')
 
df5
join two dataframes in python via merge left join

We can see that the resulting dataframe has four rows, corresponding to all the names from the left dataframe (df1). The missing values are filled with NaN.

We can also specify the type of join as right, which means that we keep all the rows from the right dataframe (df2), and fill the missing values with NaN.

df6 = pd.merge(df1, df2, on='name', how='right')
 
df6
join two dataframes in python via merge right join

We can see that the resulting dataframe has four rows, corresponding to all the names from the right dataframe (df2). The missing values are filled with NaN.
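To compare the four join types side by side, the sketch below runs each of them on trimmed-down versions of df1 and df2 (only the name column plus one other column, to keep it short) and collects the number of rows in each result. The row_counts dictionary is just a helper for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
})
df2 = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Eve', 'Frank'],
    'salary': [5000, 6000, 7000, 8000],
})

# Run the same merge with each join type and record the row count.
row_counts = {
    how: len(pd.merge(df1, df2, on='name', how=how))
    for how in ['inner', 'outer', 'left', 'right']
}
print(row_counts)
```

Only Alice and Bob appear in both dataframes, so the inner join is the smallest result and the outer join is the largest.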

You can also use the merge() method to join two dataframes based on multiple columns or indexes. For example, if you have another dataframe that has the name and city columns, you can use the on parameter to specify a list of columns to join on.

df7 = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'city': ['New York', 'London', 'Toronto', 'Sydney']
})
 
df7
one more dataframe with name and city

To join this dataframe with df6 (the right-join result from above, which contains both the name and city columns), you can use the following code:

df8 = pd.merge(df6, df7, on=['name', 'city'], how='inner')
 
df8
join two dataframes in pandas on two columns with merge inner

We can see that the resulting dataframe has only two rows, corresponding to Alice and Bob, who have matching values in both the name and city columns in both dataframes.

This is how you can use the merge() method to join two dataframes in pandas based on one or more common columns or indexes. In the next section, we will see how to use the concat() method to join two dataframes by appending them either vertically or horizontally.
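The examples so far assume that the key column has the same name in both dataframes. When it does not, or when the key is stored in the index, merge() offers the left_on, right_on, left_index, and right_index parameters. Here is a minimal sketch; the pay dataframe and its employee column are hypothetical.

```python
import pandas as pd

people = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
# Hypothetical frame that calls the key column employee instead of name.
pay = pd.DataFrame({'employee': ['Alice', 'Bob'], 'salary': [5000, 6000]})

# Join on differently named columns; both key columns are kept.
by_columns = pd.merge(people, pay, left_on='name', right_on='employee')
print(by_columns)

# Same join when the key is the index of the right frame.
by_index = pd.merge(people, pay.set_index('employee'),
                    left_on='name', right_index=True)
print(by_index)
```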

Using concat()

Let us use the concat() method to join the two dataframes by appending them either vertically or horizontally. We can specify the axis parameter to indicate whether we want to append the dataframes along the rows (axis=0) or the columns (axis=1).

For example, if we want to append the two dataframes along the rows, we can use the following code:

df9 = pd.concat([df1, df2], axis=0)
 
df9
joining two dataframes in pandas via concat through rows axis (append)

We can see that the resulting dataframe has eight rows, corresponding to all the rows from both dataframes. The columns that are not present in one dataframe are filled with NaN.

We can also specify the ignore_index parameter to indicate whether we want to keep or ignore the indexes of the dataframes. By default, it is set to False, which means that the indexes of the dataframes are preserved (so the labels 0 to 3 appear twice). If we set it to True, the resulting dataframe will have a new index that runs from 0 to the number of rows minus one.

df10 = pd.concat([df1, df2], axis=0, ignore_index=True)
 
df10
joining dataframes via concat with append mode and index reset

We can see that the resulting dataframe has a new index that starts from 0 and ends at 7.
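If you would rather keep track of which dataframe each row came from than reset the index, concat() also accepts a keys parameter that adds an outer index level. A short sketch, using trimmed-down versions of df1 and df2 and the made-up labels 'df1' and 'df2':

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
df2 = pd.DataFrame({'name': ['Eve', 'Frank'], 'salary': [7000, 8000]})

# keys adds an outer index level that labels the source of each row.
tagged = pd.concat([df1, df2], keys=['df1', 'df2'])
print(tagged)

# Select only the rows that originally came from df2.
print(tagged.loc['df2'])
```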

If we want to append the two dataframes along the columns, we can use the following code:

df11 = pd.concat([df1, df2], axis=1)
 
df11
concat two dataframes in pandas - appending via columns axis=1

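We can see that the resulting dataframe has eight columns: the four columns of df1 followed by the four columns of df2, so the name column appears twice. The rows are aligned on the index labels of the dataframes (0 to 3), not on the names, so the row for Charlie sits next to the row for Eve.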
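Because concat() with axis=1 aligns rows by index label rather than by the values in the name column, the two name columns do not necessarily agree. One way to line the rows up by person is to set name as the index first. The sketch below uses trimmed-down versions of df1 and df2; the variable names side_by_side and aligned are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
})
df2 = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Eve', 'Frank'],
    'salary': [5000, 6000, 7000, 8000],
})

# Aligned on the default index 0..3: name shows up twice.
side_by_side = pd.concat([df1, df2], axis=1)
print(side_by_side.columns.tolist())

# Aligned on name: one row per person, NaN where a person is missing.
aligned = pd.concat([df1.set_index('name'), df2.set_index('name')], axis=1)
print(aligned)
```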

We can also specify the join parameter to indicate how to handle the alignment of the rows. By default, it is set to outer, which means that all the rows from both dataframes are included, and the missing values are filled with NaN. If we set it to inner, only the rows that have common indexes in both dataframes are included.

df12 = pd.concat([df1, df2], axis=1, join='inner')
 
df12
concat two dataframes in pandas with inner join via columns axis

Because df1 and df2 share the same default index (0 to 3), the resulting dataframe still has all four rows. An inner join only drops rows when an index label is missing from one of the dataframes.
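The effect of the join parameter only becomes visible when the indexes differ. The following sketch uses two small made-up dataframes indexed by name to show the difference between inner and outer alignment.

```python
import pandas as pd

# Hypothetical frames whose indexes only partly overlap.
left = pd.DataFrame({'age': [25, 30, 35]}, index=['Alice', 'Bob', 'Charlie'])
right = pd.DataFrame({'salary': [5000, 7000]}, index=['Alice', 'Eve'])

# inner keeps only the labels found in both indexes.
inner = pd.concat([left, right], axis=1, join='inner')
print(inner)

# outer keeps every label and fills the gaps with NaN.
outer = pd.concat([left, right], axis=1, join='outer')
print(outer)
```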

This is how you can use the concat() method to join two dataframes in pandas by appending them either vertically or horizontally. In the next section, we will see how to use the join() method to join two dataframes based on their indexes.

Using join()

Let us use the join() method to join the two dataframes based on their indexes. This method is similar to the merge() method, but by default it matches rows on the indexes of the dataframes instead of on columns. You can specify different types of joins, such as inner, outer, left, or right, depending on which rows you want to keep or discard from the dataframes.

Before we go to the examples, we need to set the name column as the index of both dataframes, because the index is what join() matches on. Note that this reuses the variable name df11 from the concat() section:

df11 = df1.set_index(['name'])
df21 = df2.set_index(['name'])

For example, if we want to join the two dataframes based on their indexes using an inner join, we can use the following code:

df13 = df11.join(df21, how='inner')
 
df13
inner join of two dataframes in pandas via join function with inner params

We can see that the resulting dataframe has only two rows, corresponding to the rows that have common indexes in both dataframes. The columns from both dataframes are also joined based on the indexes.

We can also specify the type of join as outer, which means that we keep all the rows from both dataframes, and fill the missing values with NaN.

df14 = df11.join(df21, how='outer')
 
df14

We can see that the resulting dataframe has six rows, corresponding to all the names from both dataframes. The missing values are filled with NaN.

We can also specify the type of join as left, which means that we keep all the rows from the left dataframe (df1), and fill the missing values with NaN.

df15 = df11.join(df21, how='left')
 
df15
join two dataframes in pandas via join function with join left param

We can see that the resulting dataframe has four rows, corresponding to all the rows from the left dataframe (df1). The missing values are filled with NaN.

We can also specify the type of join as right, which means that we keep all the rows from the right dataframe (df2), and fill the missing values with NaN.

df16 = df11.join(df21, how='right')
 
df16
join two dataframes in pandas via join function with right join

We can see that the resulting dataframe has four rows, corresponding to all the rows from the right dataframe (df2). The missing values are filled with NaN.

This is how you can use the join() method to join two dataframes in pandas based on their indexes. In the next section, we will see how to use the append() method to join two dataframes by appending them vertically.
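Two further details of join() are worth a short sketch: the on parameter, which matches a column of the calling dataframe against the index of the other one, and the lsuffix and rsuffix parameters, which are required when both dataframes contain a column with the same name. The example uses trimmed-down versions of df1 and df2; the suffix strings are arbitrary.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35]})
df2 = pd.DataFrame({'name': ['Alice', 'Bob', 'Eve'], 'salary': [5000, 6000, 7000]})

# on='name' matches df1's name column against the index of the other frame,
# so only the right-hand frame needs name as its index.
joined = df1.join(df2.set_index('name'), on='name', how='left')
print(joined)

# Joining on the default index: both frames have a name column,
# so suffixes are needed to keep the column names unique.
suffixed = df1.join(df2, lsuffix='_left', rsuffix='_right')
print(suffixed.columns.tolist())
```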

Using append()

Let us use the append() method to join two dataframes by appending them vertically. This method is a shortcut for using concat() with axis=0, and it accepts the same ignore_index parameter. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the examples in this section only run on older versions. On current versions, pd.concat() gives the same result.

For example, if we want to append the two dataframes vertically and keep the indexes of the dataframes, we can use the following code:

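# DataFrame.append() only exists in pandas versions before 2.0.
# On pandas 2.0 and later, use: df17 = pd.concat([df1, df2])
df17 = df1.append(df2)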
 
df17

We can see that the resulting dataframe has eight rows, corresponding to all the rows from both dataframes. The columns that are not present in one dataframe are filled with NaN. The indexes of the dataframes are preserved.

We can also specify the ignore_index parameter to indicate whether we want to keep or ignore the indexes of the dataframes. If we set it to True, the resulting dataframe will have a new index that runs from 0 to the number of rows minus one.

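# DataFrame.append() only exists in pandas versions before 2.0.
# On pandas 2.0 and later, use: df18 = pd.concat([df1, df2], ignore_index=True)
df18 = df1.append(df2, ignore_index=True)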
 
df18

We can see that the resulting dataframe has a new index that starts from 0 and ends at 7.

This is how you can use the append() method to join two dataframes in pandas by appending them vertically. This method is useful when you want to quickly combine two dataframes without worrying about the alignment of the columns or rows.
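On pandas 2.0 and later, where DataFrame.append() is no longer available, the same results can be obtained with pd.concat(). The sketch below shows the equivalent of the two examples above on trimmed-down versions of df1 and df2, plus one way to add a single row. The extra row for Grace is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
df2 = pd.DataFrame({'name': ['Eve', 'Frank'], 'salary': [7000, 8000]})

# Equivalent of df1.append(df2): original index labels are kept.
stacked = pd.concat([df1, df2])

# Equivalent of df1.append(df2, ignore_index=True): fresh index 0..n-1.
stacked_reset = pd.concat([df1, df2], ignore_index=True)

# To add a single row, wrap it in a one-row dataframe first.
new_row = pd.DataFrame([{'name': 'Grace', 'age': 45}])
with_row = pd.concat([df1, new_row], ignore_index=True)

print(stacked_reset)
print(with_row)
```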

Conclusion

In this article, you learned how to join two dataframes in pandas using different methods, such as merge, concat, join, and append. You also saw some examples of code that demonstrate how to use these methods in practice. By using these methods, you can easily combine data from different sources and perform analysis on the merged data.

Here are some key points to remember:

  • The merge() method allows you to join two dataframes based on one or more common columns or indexes. You can specify different types of joins, such as inner, outer, left, or right, depending on which rows you want to keep or discard from the dataframes.
  • The concat() method allows you to join two dataframes by appending them either vertically or horizontally. You can specify whether to keep or ignore the indexes of the dataframes, and how to handle the alignment of the columns or rows.
  • The join() method allows you to join two dataframes based on their indexes. You can specify different types of joins, such as inner, outer, left, or right, depending on which rows you want to keep or discard from the dataframes.