CSV Processing with Python and Pandas - Quick Examples

Below are examples you may have seen in a presentation and want to review at your own leisure.


Contents

  1. CSV files used
  2. Example: Import CSV -> Pandas. Print. Export to new CSV.
  3. Example: Filter out rows by last name
  4. Example: Fancy cell edits. Add, remove, & rename columns.
  5. Example: Merge 2 CSV files on a multi-column match
  6. Example: Filter rows based on aggregations (“keep oldest person per address”)
  7. Example: Add data based on aggregation (“flag oldest person per address”)
  8. Example: Pivot a transaction log into a “people and what they did” summary
  9. Example: Concatenate unique first+last names from every CSV in a folder
  10. Tips

First, the CSV files within the examples:

Sample CSV #1

Id First Last Email Company
5829 Jimmy Buffet jb@example.com RCA
2894 Shirley Chisholm sc@example.com United States Congress
294 Marilyn Monroe mm@example.com Fox
30829 Cesar Chavez cc@example.com United Farm Workers
827 Vandana Shiva vs@example.com Navdanya
9284 Andrea Smith as@example.com University of California
724 Albert Howard ah@example.com Imperial College of Science

Sample CSV #2

PersonId FirstName LastName Em FavoriteFood
983mv Shirley Temple st@example.com Lollipops
9e84f Andrea Smith as@example.com Kale
k28fo Donald Duck dd@example.com Pancakes
x934 Marilyn Monroe mm@example.com Carrots
8xi Albert Howard ahotherem@example.com Potatoes
02e Vandana Shiva vs@example.com Amaranth

Sample CSV #3

Id First Last D.O.B. Address
69435 Salli Broxup 12/3/1991 305 Grover Lane, Sunny, AK
67121 Quintina Lean 10/14/1963 305 Grover Lane, Sunny, AK
49617 Corny Noller 12/13/1990 305 Grover Lane, Sunny, AK
86605 Yuri Dalton 11/12/1980 800 Golden Leaf Street, Snowy, NM
22276 Doretta Herche 9/21/2010 800 Golden Leaf Street, Snowy, NM
64465 Mata Pierrepont 8/19/1970 800 Golden Leaf Street, Snowy, NM
32443 Othelia Eastbury 8/4/1955 87834 Lyons Terrace, Rainy, OR
22082 Pansy Mallya 8/4/1955 87834 Lyons Terrace, Rainy, OR
67526 Kata Windus 10/4/1991 98 Paget Trail, Cloudy, WY

Sample CSV #4

Id First Name Last Name Program Registered For
29 John Doe BasketWeaving
29 John Doe ScubaDiving
872 Jane Dill ScubaDiving
872 Jane Dill Acrobatics
872 Jane Dill ScubaDiving
75 Mick Jag ComputerProgramming

Second, a note on “°°°”


Example Code: Import CSV -> Pandas. Print. Export to new CSV.

Click here to run code like this.
(Note: first run takes a minute or so.)

import pandas
pandas.set_option('expand_frame_repr', False)
df = pandas.read_csv('c:\\yay\\sample1.csv')
print('---Here are all 7 lines---')
print(df)
print('---Here are the first 5 lines---')
print(df.head())
fivelinedf = df.head()
fivelinedf.to_csv('C:\\yay\\out_fiveline.csv', index=False, quoting=1)
Id First Last Email Company
5829 Jimmy Buffet jb@example.com RCA
2894 Shirley Chisholm sc@example.com United States Congress
294 Marilyn Monroe mm@example.com Fox
30829 Cesar Chavez cc@example.com United Farm Workers
827 Vandana Shiva vs@example.com Navdanya

Test yourself!

Click here and edit the code so that instead of saying “Here are the first 5 lines”, it says, “Here are the last 2 lines”, and edit the next line of code to do just that (display the last 2 lines).

(Note: first run takes a minute or so.)

Keep trying until your output looks like this:

---Here are all 7 lines---
      Id    First      Last           Email                      Company
0   5829    Jimmy    Buffet  jb@example.com                          RCA
1   2894  Shirley  Chisholm  sc@example.com       United States Congress
2    294  Marilyn    Monroe  mm@example.com                          Fox
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers
4    827  Vandana     Shiva  vs@example.com                     Navdanya
5   9284   Andrea     Smith  as@example.com     University of California
6    724   Albert    Howard  ah@example.com  Imperial College of Science
---Here are the last 2 lines---
     Id   First    Last           Email                      Company
5  9284  Andrea   Smith  as@example.com     University of California
6   724  Albert  Howard  ah@example.com  Imperial College of Science

Example Code: Filter out rows whose last names don’t start with a capital C or capital S

This example actually shows off 3 different Pandas operations:

  1. Displaying the contents of the CSV file’s column “Last”
  2. Displaying a single column indicating “True or False: Do the contents of ‘Last’ in this row start with a capital C or capital S?”
  3. Displaying a new table that only includes rows where the contents of “Last” actually started with a capital C or capital S.

Click here to run code like this.
(Note: first run takes a minute or so.)

import pandas
pandas.set_option('expand_frame_repr', False)
df = pandas.read_csv('c:\\yay\\sample1.csv')
print('---What is in "Last" for each row?---')
print(df['Last'])
print('---For each row, does "Last" start with capital "C" or "S"?---')
print(df['Last'].str.startswith('C') | df['Last'].str.startswith('S'))
print('---Show all columns, but only rows where "Last" starts with capital "C" or "S"---')
lastCSdf = df[df['Last'].str.startswith('C') | df['Last'].str.startswith('S')]
print(lastCSdf)
lastCSdf.to_csv('C:\\yay\\out_lastcs.csv', index=False, quoting=1)
Id First Last Email Company
2894 Shirley Chisholm sc@example.com United States Congress
30829 Cesar Chavez cc@example.com United Farm Workers
827 Vandana Shiva vs@example.com Navdanya
9284 Andrea Smith as@example.com University of California

Test yourself!

Click here and edit the code so that instead of saying ‘Show all columns, but only rows where “Last” starts with capital “C” or “S”’, it says, ‘Show all columns, but only rows where “Company” case-insensitively ends with “a” or “Id” is less than 800’, and edit the next line of code to do just that (display only rows where “Company” ends with “A” or “a” or the “Id” is a number less than 800).

(Note: first run takes a minute or so.)

Keep trying until your output looks like this:

---What is in "Last" for each row?---
0      Buffet
1    Chisholm
2      Monroe
3      Chavez
4       Shiva
5       Smith
6      Howard
Name: Last, dtype: object
---For each row, does "Last" start with capital "C" or "S"?---
0     True
1    False
2     True
3    False
4     True
5     True
6     True
dtype: bool
---Show all columns, but only rows where "Company" case-insensitively ends with "a" or "Id" is less than 800---
	 Id    First    Last           Email                      Company
0  5829    Jimmy  Buffet  jb@example.com                          RCA
2   294  Marilyn  Monroe  mm@example.com                          Fox
4   827  Vandana   Shiva  vs@example.com                     Navdanya
5  9284   Andrea   Smith  as@example.com     University of California
6   724   Albert  Howard  ah@example.com  Imperial College of Science

Example Code: Complex Cell Updates and Adding, Removing, and Renaming Columns

Click here to run code like this.
(Note: first run takes a minute or so.)

import pandas
pandas.set_option('expand_frame_repr', False)
df = pandas.read_csv('c:\\yay\\sample1.csv')
theseRowsLastNamesStartWithCapitalS = df['Last'].str.startswith('S')
theseRowsHaveA4InTheirId = df['Id'].astype(str).str.contains('4')
print('---Let\'s see what kind of output "df.loc[]" generates---')
print(df.loc[theseRowsLastNamesStartWithCapitalS,'Last'])
df.loc[theseRowsLastNamesStartWithCapitalS,'Last'] = 'aaa'
df.loc[theseRowsHaveA4InTheirId,'Email'] = 'bbb'
df.loc[theseRowsLastNamesStartWithCapitalS,'New1'] = 'ccc'
df.loc[theseRowsHaveA4InTheirId,'New2'] = 'ddd'
df['New3'] = 'eee'
df = df.drop(['Id','Company'], axis=1)
df = df.rename(columns = {'First':'First Name', 'Last':'Last Name', 'Email':'Email Address'})
print('---We have modified the Python variable "df" to have 3 new rows, plus changes in the "Last" and "Email" columns on specific rows only, and we dropped the "Id" and "Company" rows, and finally, we renamed the "First," "Last," and "Email" columns.---')
print(df)
df.to_csv('C:\\yay\\out_complexupdates.csv', index=False, quoting=1)

The output looks like this:

---Let's see what kind of output "df.loc[]" generates---
4    Shiva
5    Smith
Name: Last, dtype: object
---We have modified the Python variable "df" to have 3 new rows, plus changes in the "Last" and "Email" columns on specific rows only, and we dropped the "Id" and "Company" rows, and finally, we renamed the "First," "Last," and "Email" columns.---
  First Name Last Name   Email Address New1 New2 New3
0      Jimmy    Buffet  jb@example.com  NaN  NaN  eee
1    Shirley  Chisholm             bbb  NaN  ddd  eee
2    Marilyn    Monroe             bbb  NaN  ddd  eee
3      Cesar    Chavez  cc@example.com  NaN  NaN  eee
4    Vandana       aaa  vs@example.com  ccc  NaN  eee
5     Andrea       aaa             bbb  ccc  ddd  eee
6     Albert    Howard             bbb  NaN  ddd  eee
First Name Last Name Email Address New1 New2 New3
Jimmy Buffet jb@example.com eee
Shirley Chisholm bbb ddd eee
Marilyn Monroe bbb ddd eee
Cesar Chavez cc@example.com eee
Vandana aaa vs@example.com ccc eee
Andrea aaa bbb ccc ddd eee
Albert Howard bbb ddd eee

Test yourself!

Click here and make something cool.

(Note: first run takes a minute or so.)

Feel free to backspace over anything you want, except the first 3 lines, which import the CSV file:

import pandas
pandas.set_option('expand_frame_repr', False)
df = pandas.read_csv('sample1.csv')

There’s a lot of code in this example that you didn’t see in the other examples, so poke at them and see what you can get to run! Have fun and get creative.

Bored? Try throwing in some other common Python/Pandas operations once you’re feeling confident.


Example Code: Merging 2 CSV files w/ a multi-column match

Click here to run code like this.

(Note: first run takes a minute or so.)

import pandas
pandas.set_option('expand_frame_repr', False)
df1 = pandas.read_csv('C:\\yay\\sample1.csv', dtype=object)
df2 = pandas.read_csv('C:\\yay\\sample2.csv', dtype=object)
mergedf = df1.merge(df2.rename(columns = {'LastName':'Last', 'FirstName':'First', 'Em':'Email'}), how='outer', on=['Last', 'First'], suffixes=('_csv1', '_csv2'))
print('---Contents of DataFrame "mergedf":---')
print(mergedf)
mergedf.to_csv('C:\\yay\\out_outermerge.csv', index=False, quoting=1)
Id First Last Email_csv1 Company PersonId Email_csv2 FavoriteFood
5829 Jimmy Buffet jb@example.com RCA
2894 Shirley Chisholm sc@example.com United States Congress
294 Marilyn Monroe mm@example.com Fox x934 mm@example.com Carrots
30829 Cesar Chavez cc@example.com United Farm Workers
827 Vandana Shiva vs@example.com Navdanya 02e vs@example.com Amaranth
9284 Andrea Smith as@example.com University of California 9e84f as@example.com Kale
724 Albert Howard ah@example.com Imperial College of Science 8xi ahotherem@example.com Potatoes
Shirley Temple 983mv st@example.com Lollipops
Donald Duck k28fo dd@example.com Pancakes

Test yourself!

Click here and edit the code so that instead of so that instead of displaying everybody in both files, we only display the people from “CSV #2,” but we add in their details from “CSV #1” if they happen to appear in that CSV file, but only consider them a match if Last Name, First Name, AND EMAIL all match (meaning that we won’t pull in any “CSV #1” data about Albert Howard, since he has a different email address in that file).

(Note: first run takes a minute or so.)

Keep trying until your output looks like this – be sure you have Donald Duck and not Jimmy Buffet in your output!

---Contents of DataFrame "mergedf":---
     Id    First    Last                  Email                   Company PersonId FavoriteFood
0   294  Marilyn  Monroe         mm@example.com                       Fox     x934      Carrots
1   827  Vandana   Shiva         vs@example.com                  Navdanya      02e     Amaranth
2  9284   Andrea   Smith         as@example.com  University of California    9e84f         Kale
3   NaN  Shirley  Temple         st@example.com                       NaN    983mv    Lollipops
4   NaN   Donald    Duck         dd@example.com                       NaN    k28fo     Pancakes
5   NaN   Albert  Howard  ahotherem@example.com                       NaN      8xi     Potatoes

Example Code: Filter rows based on aggregations

Click here to run code like this.
(Note: first run takes a minute or so.)

import pandas
pandas.set_option('expand_frame_repr', False)
df = pandas.read_csv('C:\\yay\\sample3.csv', dtype=object, parse_dates=['D.O.B.'])
groupingByAddress = df.groupby('Address')
groupedDataFrame = groupingByAddress.apply(lambda x: x[x['D.O.B.'] == x['D.O.B.'].min()])
outputdf = groupedDataFrame.reset_index(drop=True)
print(outputdf)
outputdf.to_csv('C:\\yay\\out_oldest_person_per_address.csv', index=False, quoting=1)
Id First Last D.O.B. Address
69435 Salli Broxup 1991-12-03 305 Grover Lane, Sunny, AK
22276 Doretta Herche 2010-09-21 800 Golden Leaf Street, Snowy, NM
32443 Othelia Eastbury 1955-08-04 87834 Lyons Terrace, Rainy, OR
22082 Pansy Mallya 1955-08-04 87834 Lyons Terrace, Rainy, OR
67526 Kata Windus 1991-10-04 98 Paget Trail, Cloudy, WY

There seem to be twins at one of the addresses!

Enhance your skills!

This example … has a lot to unpack. So much that if you just start playing around and hitting “run,” you might not learn much.

I recommend reading this blog post for a full explanation of what’s going on in those 8 lines of code.

THEN click here to have some fun trying all sorts of fun with aggregations.

(Note: first run takes a minute or so.)

Feel free to backspace over anything you want, except the first 3 lines, which import the CSV file:

import pandas
pandas.set_option('expand_frame_repr', False)
df = pandas.read_csv('sample3.csv')

After reading the blog post, you might have noticed that some of the intermediate steps were practically an Excel PivotTable.

Yes! Good catch!

You can do things like PivotTables with Python and Pandas. This is one reason scientists and “data scientists” love Python and Pandas. They’re always summarizing data in one way or another.

Getting the output to export to CSV the way you envision it can be a bit tricky, to be honest, as a lot of the commands pertaining to PivotTable-style aggregation aren’t as intuitive as the other commands you’ve seen.

But isn’t it neat that there’s a relatively simple programming language powerful enough to let you hop back and forth between row-modification, filtering, pivot-table-style-aggregations, and then treating that PivotTable as if it were a “normal table?”

Such a process would be so much copying and pasting into new tabs in Excel.


Example Code: Add new data based on aggregation

Instead of eliminating rows that aren’t the oldest person at an address, you can merely add a new column noting who is. The code still starts out by saving data to a variable called “groupingByAddress,” but from there on out, it’s different.

Click here to run code like this.
(Note: first run takes a minute or so.)

import pandas
pandas.set_option('expand_frame_repr', False)
df = pandas.read_csv('C:\\yay\\sample3.csv', dtype=object, parse_dates=['D.O.B.'])
groupingByAddress = df.groupby('Address')
rowIsOldestPersonAtAddress = df['D.O.B.'] == groupingByAddress['D.O.B.'].transform('min')
df['IsOldestAtAddr'] = False
df.loc[rowIsOldestPersonAtAddress, 'IsOldestAtAddr'] = True
print(df)
df.to_csv('C:\\yay\\out_noted_if_is_oldest_per_address.csv', index=False, quoting=1)
Id First Last D.O.B. Address IsOldestAtAddr
69435 Salli Broxup 1991-12-03 305 Grover Lane, Sunny, AK True
67121 Quintina Lean 1963-10-14 305 Grover Lane, Sunny, AK False
49617 Corny Noller 1990-12-13 305 Grover Lane, Sunny, AK False
86605 Yuri Dalton 1980-11-12 800 Golden Leaf Street, Snowy, NM False
22276 Doretta Herche 2010-09-21 800 Golden Leaf Street, Snowy, NM True
64465 Mata Pierrepont 1970-08-19 800 Golden Leaf Street, Snowy, NM False
32443 Othelia Eastbury 1955-08-04 87834 Lyons Terrace, Rainy, OR True
22082 Pansy Mallya 1955-08-04 87834 Lyons Terrace, Rainy, OR True
67526 Kata Windus 1991-10-04 98 Paget Trail, Cloudy, WY True

Note that we add a column “IsOldestAtAddr” and set the whole thing to “False” before adding a conditional “True” to some of its cells with our command on the next line.

If we hadn’t included the first “False” line, our new “IsOldestAtAddr” would have still been created, but there would have been blank spaces instead of “False.”

In some cases, that may be exactly what you’d like to do.


Example Code: Pivot a course-registration log to a “people and what they registered for” summary

Click here to run code like this.
(Note: first run takes a minute or so.)

import pandas
import numpy
pandas.set_option('expand_frame_repr', False)
df = pandas.read_csv('C:\\yay\\sample4.csv')
df['Program Registered For'] = 'Program' + df['Program Registered For']
non_program_columns = list(filter(lambda x: x!= 'Program Registered For', df.keys()))
pivotdf = pandas.pivot_table(df, index=non_program_columns, columns='Program Registered For', aggfunc=numpy.size)
pivotdf[pandas.notnull(pivotdf)] = 'Registered'
pivotdf.reset_index(inplace=True)
print(pivotdf)
pivotdf.to_csv('C:\\yay\\out_pivoted_program_registrations.csv', index=False, quoting=1)
Id First Name Last Name ProgramAcrobatics ProgramBasketWeaving ProgramComputerProgramming ProgramScubaDiving
29 John Doe Registered Registered
872 Jane Dill Registered Registered
75 Mick Jag Registered Registered

(This could be useful for updating a simple “people” table, like a mailing list, with a summary of who already took which courses, so you don’t spam them asking them to take the course again.)


Example Code: Concatenate unique first+last names from every CSV in a folder, if the file has them

Click here to run code like this.
(Note: first run takes a minute or so.)

import pandas
pandas.set_option('expand_frame_repr', False)
inputfolder = 'C:\\yay\\folder\\'
listOfDFsToConcatenate = []
for file in [f for f in os.listdir(inputfolder) if f.endswith('.csv')]:
    df = pandas.read_csv(os.path.join(root, file))
    df = df.rename(columns = {'First Name':'First', 'FirstName':'First', 'Last Name':'Last', 'LastName':'Last'})
    if 'First' in df.columns and 'Last' in df.columns:
        df = df[['First','Last']]
        df['SourceFile'] = file
        listOfDFsToConcatenate.append(df)
concatdf = pandas.concat(listOfDFsToConcatenate, ignore_index=True)
concatdf = concatdf.drop_duplicates(subset=['First','Last'])
print(concatdf)
concatdf.to_csv('C:\\yay\\out_concatenated_unique_names.csv', index=False, quoting=1)
First Last SourceFile
Jimmy Buffet sample1.csv
Shirley Chisholm sample1.csv
Marilyn Monroe sample1.csv
Cesar Chavez sample1.csv
Vandana Shiva sample1.csv
Andrea Smith sample1.csv
Albert Howard sample1.csv
Shirley Temple sample2.csv
Donald Duck sample2.csv
Salli Broxup sample3.csv
Quintina Lean sample3.csv
Corny Noller sample3.csv
Yuri Dalton sample3.csv
Doretta Herche sample3.csv
Mata Pierrepont sample3.csv
Othelia Eastbury sample3.csv
Pansy Mallya sample3.csv
Kata Windus sample3.csv
John Doe sample4.csv
Jane Dill sample4.csv
Mick Jag sample4.csv

In this example, presume our 4 sample CSV files are all stored at C:\yay\folder\.

We loop through every file ending in “.csv” in it.

If the file has a “First” and a “Last” column, it’s a candidate for concatenation.

If the CSV file is a candidate for concatenation, we strip it down to just its “First” and “Last” columns, then add a third “SourceFile” column.

Then, after we’ve set aside all such CSV files into a Python “list” of “Pandas DataFrames,” we concatenate them all.

ignore_index=True indicates that we want to continue our row-numbering system, not start over at 0, when the concatenated file starts from the next CSV file).
Not that it really matters, since when we export to CSV, we’ll be suppressing the row numbers anyway.

Finally, in this case, we decide that we only want to keep the first time we encountered a given First+Last name combination, so we “drop the duplicates.”


Tips for learning more

IMPORTANT: Never, ever, ever use your company’s actual data in an online code editor. You have no idea who’s collecting what on the other side. Always install a proper “IDE” on your hard drive before playing with sensitive data in Python.

Once you practice Python & Pandas enough to understand how the “output values” of “expressions” impact the way you can write code, and to have a sense for how easy it is to daisy-chain little CSV-file transformations into bigger ones, and once you save enough sample files of your “practice” work to have a personal quick-reference (or bookmark this site), you will be well on your way to knowing how to write Python+Pandas programs that actually save you more time than opening up Excel and doing the job by hand.

In my opinion, you’ll get a lot of “bang for your buck” out of Python:

Don’t let your lack of ability to do an entire CSV-handling transformation in Python stop you. I often heed Randall Munroe’s advice from XKCD cartoon #1319 and do things in Excel until they become annoying, do the annoying part in Python, then switch back to Excel.

Believe in yourself, and how much saved-time-in-Excel you’re worth!