Instructions & Code Snippets You Will Need
Table of Contents
- Exercise 1: Make sure you can run Python code
- Exercise 2: Reading, displaying, & analyzing a CSV file
- Exercise 3: Add/Delete/Rename/Reorder Columns Yourself
- Exercise 4: Combine 5 spreadsheets into 1 CampaignMemberRecordsToInsert spreadsheet
- Exercise 5: Make “ContactsToInsert” and “CampaignMembersToInsert” for those not in Salesforce
- Exercise 6: Add a new “Note” column to our concatenated events roster
- Door Prize Script: A little pivot
- DON’T PEEK! Possible answers to exercises 3-6
Exercise 1: Make sure you can run Python code
Open the https://link.stthomas.edu/sfpy201901-hello “starter code.”
All it says is this:
print('Hello, world!')
Hit the big green “run” button, top center.
Do you see “Hello, world!” at right? Hooray! (Let me know if not.)
Now change Hello, world! to Yay, us! so that your code looks like this, and hit the “run” button again.
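print('Yay, us!')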
Do you see “Yay, us!” at right? Yay, you! (Let me know if not. We want to make this a “Yay, us!” moment.)
Exercise 2: Reading, displaying, & analyzing a CSV file
Open the https://link.stthomas.edu/sfpy201901-files “starter code.”
Add a new line to the end of the file and type these 4 lines exactly as seen here (hitting “enter” to start a new line as indicated):
df1 = pandas.read_csv(filepath1)
print(df1)
print('--------')
print('There are ' + str(len(df1)) + ' rows')
Hit the “run” button.
Do you see a table full of contacts at right, then a divider line, then an announcement that there are “7 rows”?
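If you get stuck, here’s roughly what the whole file might look like by the end of this exercise (I’m assuming the starter code already imports pandas and points filepath1 at the sample file, as in the “101 recap” code below):
import pandas
pandas.set_option('expand_frame_repr', False)
filepath1 = 'https://raw.githubusercontent.com/pypancsv/pypancsv/master/docs/_data/sample1.csv'
df1 = pandas.read_csv(filepath1)
print(df1)
print('--------')
print('There are ' + str(len(df1)) + ' rows')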
Code To Look At: Python 101 Recap
We’ll look together at the code at https://link.stthomas.edu/sfpy201901-101recap
import pandas
pandas.set_option('expand_frame_repr', False)
filepath1 = 'https://raw.githubusercontent.com/pypancsv/pypancsv/master/docs/_data/sample1.csv'
filepath2 = 'https://raw.githubusercontent.com/pypancsv/pypancsv/master/docs/_data/sample2.csv'
filepath3 = 'https://raw.githubusercontent.com/pypancsv/pypancsv/master/docs/_data/sample3.csv'
filepath4 = 'https://raw.githubusercontent.com/pypancsv/pypancsv/master/docs/_data/tallypivotinput.csv'
df4 = pandas.read_csv(filepath4)
print('---This is what a "column" series looks like:---')
print(df4['Program Registered For'])
print('-----------DIVIDER LINE------------\r\n')
print('---These are the unique values in that column:---')
print(df4['Program Registered For'].unique())
print('-----------DIVIDER LINE------------\r\n')
print('---Here is a sorted list of the values in that column:---')
print(sorted(df4['Program Registered For'].unique(), key=str.lower))
print('-----------DIVIDER LINE------------\r\n')
# Note the commas in this output. If you want commas w/o sorting,
# just put list() around the "....unique()"
print('---This is how MANY unique values are in that column:---')
print(len(df4['Program Registered For'].unique()))
print('-----------DIVIDER LINE------------\r\n')
# Note that len() goes around stuff, but .unique() comes out of stuff!
print('---This is how many rows are in the spreadsheet:---')
print(len(df4))
print('-----------DIVIDER LINE------------\r\n')
print('---This code doesn\'t show anything between the descriptor lines; watch the files at left; a new one should appear.---')
# Uncomment the next line to actually write the file:
#df4['Program Registered For'].to_csv('outputProgCol.csv', index=False, header=True)
print('-----------DIVIDER LINE------------\r\n')
df4['Last Name'] = df4['Last Name'].str.upper()
print('---After upper-casing "Last Name," here\'s what our table looks like now:---')
print(df4)
print('-----------DIVIDER LINE------------\r\n')
df4['Constant'] = 'Hello'
df4['EmptyCol'] = None
df4['Long Name'] = df4['First Name'].str[0] + '. ' + df4['Last Name']
print('---After adding 3 new columns, here\'s what our table looks like now:---')
print(df4)
print('-----------DIVIDER LINE------------\r\n')
df4['First Name'] = df4['First Name'].str.lower()
print('---Note that the computation of "Long Name" based on "First Name" was a one-off. If we lower-case "First Name" NOW the initial in "Long Name" will not lower-case accordingly the way it would with Excel formulas, so plan the order in which you want to compute cells based off each other!---')
print(df4)
print('-----------DIVIDER LINE------------\r\n')
# ####################################
print('---This code doesn\'t show anything between the descriptor lines; watch the files at left; a new one should appear. Note that I indicated I wanted last name before first name, unlike in the source table.---')
# Uncomment the next line to actually write the file:
#df4[['Last Name','First Name']].to_csv('outputNames.csv', index=False)
print('-----------DIVIDER LINE------------\r\n')
namecols = [x for x in df4.columns if 'Name' in x]
print('---Here are all the column labels with the word "Name" in them---')
print(namecols)
print('-----------Mini-Divider------------')
print('---And here\'s a sub-table of just those columns---')
print(df4[namecols])
print('-----------DIVIDER LINE------------\r\n')
df4[['First Name','Last Name']] = 'Kelly'
print('---After filling everyone down as "Kelly Kelly," here\'s what our table looks like now:---')
print(df4)
print('-----------DIVIDER LINE------------\r\n')
df4[['First Name','Last Name']] = ['Anush', 'Lopez']
print('---After filling everyone down as "Anush Lopez," here\'s what our table looks like now:---')
print(df4)
print('-----------DIVIDER LINE------------\r\n')
df4[['First Name','Last Name']] = df4[['Last Name','First Name']]
print('---After swapping the CONTENTS but not the HEADERS of first & last, here\'s what our table looks like now -- everyone\'s a "Lopez Anush!" Woah! Careful!---')
print(df4)
print('-----------DIVIDER LINE------------\r\n')
df4[['First Name','Last Name']] = df4[['Last Name','First Name']]
df4 = df4[['Id','Last Name','First Name','Program Registered For','Constant','EmptyCol','Long Name']]
print('---We swapped the contents back (everyone\'s an Anush Lopez again), and then we REORDERED the columns WITH their headers -- aaaaah, that\'s better! Note the difference in how you type things! It\'s the same idea as when we made "outputNames.csv" with last names listed to the left of first names, only we included all the column names in the table this time and overwrote the contents of our variable "df4" with this new "sub-table" of the old "df4" contents.---')
print(df4)
print('-----------DIVIDER LINE------------\r\n')
# ####################################
df4 = df4[['Id','Long Name','Program Registered For']]
print('---After discarding all but 3 columns here\'s what our table looks like now:---')
print(df4)
print('-----------DIVIDER LINE------------\r\n')
print('---Here\'s what our table would look like with only the first record for every "Long Name:"---')
print(df4.drop_duplicates(subset=['Long Name'], keep='first'))
print('-----------DIVIDER LINE------------\r\n')
print('---Here\'s what our table would look like having deleted all records that "had a duplicate" on "Long Name" (note that it\'s more complicated to "keep all duplicates and discard single-appearance records")---')
print(df4.drop_duplicates(subset=['Long Name'], keep=False))
print('-----------DIVIDER LINE------------\r\n')
df4['HasADuplicate'] = df4.duplicated(subset=['Long Name'], keep=False)
df4['IsFirstOfADupeSet'] = df4.duplicated(subset=['Long Name'], keep='first')
print('---Here\'s our table after adding some flags about duplication:---')
print(df4)
print('-----------DIVIDER LINE------------\r\n')
print('---Here\'s what our table would look like having KEPT ONLY records that "had a duplicate" on "Long Name"---')
print(df4[df4['HasADuplicate']])
print('-----------DIVIDER LINE------------\r\n')
print('---Here\'s how many dupes in how many dupe sets we have -- Power of One trick!---')
print('There are ' + str(df4['HasADuplicate'].sum()) + ' duplicated records in ' + str(df4['IsFirstOfADupeSet'].sum()) + ' groups.')
print('-----------DIVIDER LINE------------\r\n')
Exercise 3: Add/Delete/Rename/Reorder Columns Yourself
Open the https://link.stthomas.edu/sfpy201901-123 “starter code.”
If this “starter code” had a print(df1) command in it, df1 would look like this to start:
Id First Last Email Company
0 5829 Jimmy Buffet jb@example.com RCA
1 2894 Shirley Chisholm sc@example.com United States Congress
2 294 Marilyn Monroe mm@example.com Fox
3 30829 Cesar Chavez cc@example.com United Farm Workers
4 827 Vandana Shiva vs@example.com Navdanya
5 9284 Andrea Smith as@example.com University of California
6 724 Albert Howard ah@example.com Imperial College of Science
- At the end of the program, add code that modifies the contents of df1 as follows, and make your last line of code print(df1):
  - Add a column called Hello with the phrase Yay Us filled in all the way down
  - Rename Last to Last Name and First to First Name
  - Delete the Email column
  - Reorder the columns to be Hello, Last Name, Company, First Name, & then Id.
- You should see the following output:

     Hello Last Name                      Company First Name     Id
0   Yay Us    Buffet                          RCA      Jimmy   5829
1   Yay Us  Chisholm       United States Congress    Shirley   2894
2   Yay Us    Monroe                          Fox    Marilyn    294
3   Yay Us    Chavez          United Farm Workers      Cesar  30829
4   Yay Us     Shiva                     Navdanya    Vandana    827
5   Yay Us     Smith    University of California     Andrea   9284
6   Yay Us    Howard  Imperial College of Science     Albert    724
Don’t forget that:
- You might need to surround column names with single quotes to indicate that they’re text, not code
- When typing out a list in your code, you need a pair of square brackets around them to indicate it’s a list (in addition to whatever other square brackets might be present in your code due to Pandas loving square brackets)
Cheat Sheet: Add/Delete/Reorder/Rename Columns
Adding columns, approach 1 of 2
dfVarName['ColumnName'] = 'abc'
or
dfVarName['UpperCaseFirstName'] = dfVarName['FirstName'].str.upper()
etc.
Adding columns, approach 2 of 2
dfVarName = dfVarName.assign(ColumnName1 = 'abc', ColumnName2 = 'def')
or
dfVarName = dfVarName.assign(UpperCaseFirstName = dfVarName['FirstName'].str.upper(), LowerCaseFirstName = dfVarName['FirstName'].str.lower())
etc.
Deleting columns, approach 1 of 2 (keep only columns named)
dfVarName = dfVarName[['ColumnName1','ColumnName2','ColumnName3']]
or
dfVarName = dfVarName[varNameHoldingAListOfColumnNames]
etc.
Deleting columns, approach 2 of 2 (discard columns named)
dfVarName = dfVarName.drop(columns=['ColumnName1','ColumnName2','ColumnName3'])
or
dfVarName = dfVarName.drop(columns=varNameHoldingAListOfColumnNames)
etc.
Reordering columns
dfVarName = dfVarName[['ColumnName1','ColumnName2','ColumnName3']]
or
dfVarName = dfVarName[varNameHoldingAListOfColumnNames]
etc.
(They’ll come out in the order you specified.)
Renaming columns
dfVarName = dfVarName.rename(columns={'ColumnName1':'NewColumnName1','ColumnName2':'NewColumnName2'})
or
dfVarName = dfVarName.rename(columns=varNameHoldingAMapOfColumnRenameLogic)
etc.
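(In that last example, the “map” variable is nothing fancy – it’s just an ordinary Python “dict” you’ve stored ahead of time. A hypothetical one, with made-up column names:)
varNameHoldingAMapOfColumnRenameLogic = {'OldName1':'NewName1', 'OldName2':'NewName2'}
dfVarName = dfVarName.rename(columns=varNameHoldingAMapOfColumnRenameLogic)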
Door Prize Script: Dynamic Rename & Reorder
Tweet me if you use this in real life – remember to hashtag #AwesomeAdmin & #YayUs !
This code makes use of intermediate-level Python functionality called “list comprehensions” and “dict comprehensions” to build the contents of columnsWithProgramInTheName (a list), theRestOfTheColumns (a list), and renamingMap (a “dict,” which is a data structure indicating that certain things should “become” other things as specified).
They’re just concise-to-type “for loops,” to be honest.
To build newColumnOrder (a “list”), this code takes advantage of Python’s ridiculously simple grammar for combining two lists: you just put a “+” between them.
What’s really neat, though, is that it:
- dynamically extracts a list of all columns that have the word Program in their name,
- strips that word from those column names (a rename), and then
- shoves them all to the beginning of the spreadsheet, ahead of “the rest” of the columns (“the rest” being something it also computed dynamically).
Code:
import pandas
pandas.set_option('expand_frame_repr', False)
df = pandas.read_csv('https://raw.githubusercontent.com/pypancsv/pypancsv/master/docs/_data/tallypivotoutput.csv', dtype=object)
print('---At first, "df" looks like this:---')
print(df)
columnsWithProgramInTheName = [x for x in df.columns if 'Program' in x]
theRestOfTheColumns = [x for x in df.columns if x not in columnsWithProgramInTheName]
newColumnOrder = columnsWithProgramInTheName + theRestOfTheColumns
df = df[newColumnOrder]
renamingMap = {x:x.replace('Program','') for x in columnsWithProgramInTheName}
df = df.rename(columns=renamingMap)
print()
print('---After reordering and renaming dynamically, "df" looks like this:---')
print(df)
Output:
---At first, "df" looks like this:---
Id First Name Last Name ProgramAcrobatics ProgramBasketWeaving ProgramComputerProgramming ProgramScubaDiving
0 29 John Doe NaN Registered NaN Registered
1 872 Jane Dill Registered NaN NaN Registered
2 75 Mick Jag NaN NaN Registered Registered
---After reordering and renaming dynamically, "df" looks like this:---
Acrobatics BasketWeaving Computerming ScubaDiving Id First Name Last Name
0 NaN Registered NaN Registered 29 John Doe
1 Registered NaN NaN Registered 872 Jane Dill
2 NaN NaN Registered Registered 75 Mick Jag
(Spot the gotcha: .replace('Program','') removes every occurrence of “Program,” so “ProgramComputerProgramming” became “Computerming,” not “ComputerProgramming.” If that matters to you, strip only a leading “Program” instead – e.g. x[len('Program'):] if x.startswith('Program') else x.)
Exercise 4: Combine 5 spreadsheets into 1 CampaignMemberRecordsToInsert spreadsheet
Open the https://link.stthomas.edu/sfpy201901-eventmerge “starter code.”
Your data looks like this.
Event attendance spreadsheet #1 (evdf1):
First Last Email Event Name Event Date Attendance Status
0 Revkah Lilburn rl@example.com Python for Salesforce 101 2018-10-20 Attended
1 Haskel Southerns hs@example.com Python for Salesforce 101 2018-10-20 No-Show
2 Ermanno Withinshaw ew@example.com Python for Salesforce 101 2018-10-20 Attended
Event attendance spreadsheet #2 (evdf2):
First Last Email Event Name Event Date Attendance Status
0 Haskel Southerns hs@example.com Python for Salesforce 101-Office Hours 2018-11-10 No-Show
Event attendance spreadsheet #3 (evdf3):
First Last Email Event Name Event Date Attendance Status
0 Julianna Judron jj@example.com Python for Salesforce 102 2019-01-26 No-Show
1 Haskel Southerns hs@example.com Python for Salesforce 102 2019-01-26 Attended
2 Adah Dimmock ad@example.com Python for Salesforce 102 2019-01-26 Cancelled
Salesforce Contact object existing record dump:
ID FIRSTNAME LASTNAME EMAIL PHONE
0 003X01 Anna Appleton aa@example.com 555-555-0101
1 003X02 Adah Dimmock dima@example.com 555-555-0202
2 003X03 Ben Bensalem bs@example.com 555-555-0303
3 003X04 Julianna Judron jj@example.com 555-555-0404
4 003X05 Julianna Judron jj@example.com 555-555-0505
5 003X06 Zainab Zahar zz@example.com 555-555-0606
Salesforce Campaign object existing record dump:
ID NAME HAPPENED_ON__C
0 701X01 Parasailing Fun Day 2017-07-20
1 701X02 Python for Salesforce 101 2018-10-20
2 701X03 Python for Salesforce 101-Office Hours 2018-11-10
3 701X04 Hockey Outing 2019-01-01
4 701X05 Python for Salesforce 102 2019-01-26
We’re going to create a file called CampaignMemberRecordsToInsert.csv of people we can find in Salesforce already that looks like this:
ContactId | CampaignId | CampaignMemberStatus | Last | First | Email | Event Name | Event Date |
---|---|---|---|---|---|---|---|
003X04 | 701X05 | No-Show | Judron | Julianna | jj@example.com | Python for Salesforce 102 | 2019-01-26 |
003X05 | 701X05 | No-Show | Judron | Julianna | jj@example.com | Python for Salesforce 102 | 2019-01-26 |
Wow! Just Julianna?
- What about Adah? (Oh, his email address is different in Salesforce.)
- Note that we’re going to create CampaignMember records on both Juliannas in Salesforce. Maybe we have a tiny data set and we catch this. Maybe we have a huge data set and we don’t, and we catch it later when deduplicating Contacts. For now, we’ll go with the latter. Deciding how to deal with this issue is more of a business logic problem than a programming problem.
Here are the steps we’ll follow to get there:
- Concatenate the 3 EventBrite sheets vertically ↕ and save it as “eventsdf”
- Do an “inner” merge from “eventsdf” to “contactsdf” (“inner” implication: drops any attendees not yet in Salesforce – we’ll get to them later) matching on the FIRSTNAME, LASTNAME, & EMAIL; save the result as “merge1df”.
- Delete columns from “merge1df” so that only the columns of “eventsdf” and the “ID” column remain; ensure the change persists to “merge1df”.
- Rename the “ID” column of “merge1df” to “ContactId”; ensure “merge1df” changes.
- Merge “merge1df” against “campaignsdf” on event name & start date; “inner” merge; save the result as “merge2df”.
- Rename the “ID” column of “merge2df” to “CampaignId”; ensure “merge2df” changes.
- Rename the “Attendance Status” column of “merge2df” to “CampaignMemberStatus”; ensure “merge2df” changes.
- Re-order the fields of “merge2df” to be: ContactId, CampaignId, CampaignMemberStatus, Last, First, Email, Event Name, Event Date. Don’t bother including “NAME” or “HAPPENED_ON__C” in your final output if they exist.
- Export your data to “CampaignMemberRecordsToInsert.csv” and have a look. Does it look like it should?
Whew! That’s a lot!
Whether you’re coding in your own “solo” console or helping type into the “group” console, we’ll solve this one out loud together.
Cheat Sheet: Concatenation Examples ↕
Example 1: List all unique e-mail addresses in a spreadsheet, whether they be under the “Email,” “WorkEmail__c,” or “SchoolEmail__c” columns.
concat_series = pandas.concat([df['Email'], df['WorkEmail__c'], df['SchoolEmail__c']])
nonnullemails_series = concat_series.dropna()
uniquenonnullemails_ndarray = nonnullemails_series.unique()
pandas.Series(uniquenonnullemails_ndarray, name='Emails').to_csv('c:\\example\\uniqueemails.csv', index=False, header=True)
If we ran the above code after reading a spreadsheet into the “df” variable that looked like this:
Name | Email | WorkEmail__c | SchoolEmail__c |
---|---|---|---|
Annie Appleton | aa@example.com | aa@work.com | aa@school.com |
Andrew Appleton | aa@example.com | andrew@work.com | andrew@school.com |
Berenice Benita | bb@example.com | bb@work.com | bb@school.com |
Then the file we saved to, uniqueemails.csv, looks like this when opened (note that aa@example.com only appears once):
Emails |
---|
aa@example.com |
bb@example.com |
aa@work.com |
andrew@work.com |
bb@work.com |
aa@school.com |
andrew@school.com |
bb@school.com |
Note that the “.unique()” operation, which you can append to any “Series”-typed data, produces output that is not a Pandas Series (whereas “.dropna(),” for example, produces another Series as output). It’s yet another list-like data structure called a numpy.ndarray. To write back to disk as a 1-column CSV or Excel file, we’ll want to turn it back into a Series, which is what we do on the last line (giving the first column the name “Emails” because that feels like a good name), so we can take advantage of “.to_csv(...).” If instead we had put uniquenonnullemails_ndarray inside print(...), we would have seen the data surrounded by square brackets, separated by a space but no commas. If we had wanted commas as a separator, we would have put it inside of print(list(...)) to form print(list(uniquenonnullemails_ndarray)).
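(Here’s a tiny standalone demonstration of that printing behavior, with made-up emails that have nothing to do with our exercise files:)
import pandas
s = pandas.Series(['aa@example.com', 'aa@example.com', 'bb@example.com'])
u = s.unique()    # a numpy.ndarray, not a Series
print(u)          # ['aa@example.com' 'bb@example.com'] -- spaces, no commas
print(list(u))    # ['aa@example.com', 'bb@example.com'] -- commas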
Example 2: List all unique e-mail addresses between 2 spreadsheets, whether they be under #1’s “Email,” “WorkEmail__c,” or “SchoolEmail__c” columns, or under #2’s “EMAILADDR,” “EMAIL2__C,” “EMAIL3__C,” or “EMAIL4__C” columns.
concat_series = pandas.concat([df1['Email'], df1['WorkEmail__c'], df1['SchoolEmail__c'], df2['EMAILADDR'], df2['EMAIL2__C'], df2['EMAIL3__C'], df2['EMAIL4__C']])
nonnullemails_series = concat_series.dropna()
uniquenonnullemails_ndarray = nonnullemails_series.unique()
pandas.Series(uniquenonnullemails_ndarray, name='Emails').to_csv('c:\\example\\uniqueemails2.csv', index=False, header=True)
** HEY LOOK! **
It’s the exact same code as example 1. We just added more things to the list inside “pandas.concat(...)” in the first line!
If we ran the above code after reading a spreadsheet into the “df1” variable that looked like this…
Name | Email | WorkEmail__c | SchoolEmail__c |
---|---|---|---|
Annie Appleton | aa@example.com | aa@work.com | aa@school.com |
Andrew Appleton | aa@example.com | andrew@work.com | andrew@school.com |
Berenice Benita | bb@example.com | bb@work.com | bb@school.com |
…and after reading a spreadsheet into the “df2” variable that looked like this:
Name | EMAILADDR | EMAIL2__C | EMAIL3__C | EMAIL4__C |
---|---|---|---|---|
Berenice Benita | bb@example.com | bb@school.com | bb@play.com | bb@example.com |
Then the file we saved to, uniqueemails2.csv, looks like this when opened (there was just 1 email address in the 2nd spreadsheet that we didn’t yet know about):
Emails |
---|
aa@example.com |
bb@example.com |
aa@work.com |
andrew@work.com |
bb@work.com |
aa@school.com |
andrew@school.com |
bb@school.com |
bb@play.com |
Example 3: Spreadsheet 1 has columns “First,” “Last,” & “Email.”
Spreadsheet 2 has columns “LastName,” “Em,” & “FirstName.”
Concatenate appropriately (e.g. Em = Email) & dedupe (by all 3 fields together).
col2EquivInCol1 = {'FirstName':'First', 'LastName':'Last', 'Em':'Email'}
rawconcat_df = pandas.concat([df1, df2.rename(columns=col2EquivInCol1)], sort=False)
dedupedconcat_df = rawconcat_df.drop_duplicates(subset=df1.columns)
dedupedconcat_df.to_csv('c:\\example\\concatdedupedtables.csv', index=False)
Note: The “sort=False” option in “pandas.concat(...)” keeps that command from rearranging the columns in alphabetical order.
If we ran the above code after reading a spreadsheet into the “df1” variable that looked like this…
First | Last | Email |
---|---|---|
Jimmy | Buffet | jb@example.com |
Shirley | Chisholm | sc@example.com |
Marilyn | Monroe | mm@example.com |
Cesar | Chavez | cc@example.com |
Vandana | Shiva | vs@example.com |
Andrea | Smith | as@example.com |
Albert | Howard | ah@example.com |
…and after reading a spreadsheet into the “df2” variable that looked like this:
LastName | Em | FirstName |
---|---|---|
Temple | st@example.com | Shirley |
Smith | as@example.com | Andrea |
Duck | dd@example.com | Donald |
Monroe | mm@example.com | Marilyn |
Howard | ahotherem@example.com | Albert |
Shiva | vs@example.com | Vandana |
Then the file we saved to, concatdedupedtables.csv, looks like this when opened:
First | Last | Email |
---|---|---|
Jimmy | Buffet | jb@example.com |
Shirley | Chisholm | sc@example.com |
Marilyn | Monroe | mm@example.com |
Cesar | Chavez | cc@example.com |
Vandana | Shiva | vs@example.com |
Andrea | Smith | as@example.com |
Albert | Howard | ah@example.com |
Shirley | Temple | st@example.com |
Donald | Duck | dd@example.com |
Albert | Howard | ahotherem@example.com |
Example 4: Spreadsheet 1, Spreadsheet 2, & Spreadsheet 3 all have columns “Name,” “DOB,” & “AttendedOrNot” (they’re EventBrite exports).
Add a “WhichSheet” column to each of them saying “Event1,” “Event2,” or “Event3,” concatenate, and sort by “Name,” “DOB,” & “WhichSheet.”
rawconcat_df = pandas.concat([df1.assign(WhichSheet='Event1'), df2.assign(WhichSheet='Event2'), df3.assign(WhichSheet='Event3')], sort=False)
sortedconcat_df = rawconcat_df.sort_values(by=['Name','DOB','WhichSheet'])
sortedconcat_df.to_csv('c:\\example\\concatsortedtables.csv', index=False)
Note: The “sort=False” option in “pandas.concat(...)” keeps that command from rearranging the columns in alphabetical order.
If we ran the above code after reading a spreadsheet into the “df1” variable that looked like this…
Name | DOB | AttendedOrNot |
---|---|---|
Annie Appleton | 1950-02-02 | Yes |
Berenice Benita | 2000-01-01 | No |
…and after reading a spreadsheet into the “df2” variable that looked like this…
Name | DOB | AttendedOrNot |
---|---|---|
Annie Appleton | 1950-02-02 | No |
Christina Cruz | 1975-03-03 | Yes |
…and after reading a spreadsheet into the “df3” variable that looked like this:
Name | DOB | AttendedOrNot |
---|---|---|
Christina Cruz | 1975-03-03 | No |
Then the file we saved to, concatsortedtables.csv, looks like this when opened:
Name | DOB | AttendedOrNot | WhichSheet |
---|---|---|---|
Annie Appleton | 1950-02-02 | Yes | Event1 |
Annie Appleton | 1950-02-02 | No | Event2 |
Berenice Benita | 2000-01-01 | No | Event1 |
Christina Cruz | 1975-03-03 | Yes | Event2 |
Christina Cruz | 1975-03-03 | No | Event3 |
Cheat Sheet: Merge Examples ↔
Example 1: Spreadsheet #1 is full of people.
Spreadsheet #2 is full of data about countries of the world.
Add “Country Code” & “Country Capital” columns to #1, using people’s “MailingCountry” data as a matching key to #2’s “Name” column.
columnstokeep = list(df1.columns) + ['Code','Capital']
mergedf = df1.merge(df2, how='left', left_on=['MailingCountry'], right_on=['Name'])
mergedf = mergedf[columnstokeep]
mergedf = mergedf.rename(columns={'Code':'Country Code', 'Capital':'Country Capital'})
mergedf.to_csv('c:\\example\\countryenhanced.csv', index=False)
If we ran the above code after reading a spreadsheet into the “df1” variable that looked like this…
Id | First | Last | MailingCountry |
---|---|---|---|
30829 | Cesar | Chavez | United States |
827 | Vandana | Shiva | India |
9284 | Andrea | Smith | United States |
724 | Albert | Howard | United Kingdom |
…and after reading a spreadsheet into the “df2” variable that looked like this:
Name | Code | Capital | SqKm |
---|---|---|---|
Argentina | AR | Buenos Aires | 2,780,400 |
Barbados | BB | Bridgetown | 430 |
Iceland | IC | Reykjavik | 103,000 |
India | IN | New Delhi | 3,287,263 |
Tunisia | TS | Tunis | 163,610 |
United Kingdom | UK | London | 242,495 |
United States | US | Washington, D.C. | 9,525,067 |
Vanuatu | NH | Port Vila | 12,189 |
Then the file we saved to, countryenhanced.csv, looks like this when opened:
Id | First | Last | MailingCountry | Country Code | Country Capital |
---|---|---|---|---|---|
30829 | Cesar | Chavez | United States | US | Washington, D.C. |
827 | Vandana | Shiva | India | IN | New Delhi |
9284 | Andrea | Smith | United States | US | Washington, D.C. |
724 | Albert | Howard | United Kingdom | UK | London |
Example 2: Combine 2 spreadsheets full of people and things you know about them, using “First,” “Last,” & “Email” as a matching key.
col2EquivInCol1 = {'FirstName':'First', 'LastName':'Last', 'Em':'Email'}
mergedf = df1.merge(df2.rename(columns=col2EquivInCol1), how='outer', on=['First','Last','Email'], indicator=True)
mergedf.to_csv('c:\\example\\personmerge.csv', index=False)
If we ran the above code after reading a spreadsheet into the “df1” variable that looked like this…
Id | First | Last | Email | Company |
---|---|---|---|---|
5829 | Jimmy | Buffet | jb@example.com | RCA |
2894 | Shirley | Chisholm | sc@example.com | United States Congress |
294 | Marilyn | Monroe | mm@example.com | Fox |
30829 | Cesar | Chavez | cc@example.com | United Farm Workers |
827 | Vandana | Shiva | vs@example.com | Navdanya |
9284 | Andrea | Smith | as@example.com | University of California |
724 | Albert | Howard | ah@example.com | Imperial College of Science |
…and after reading a spreadsheet into the “df2” variable that looked like this:
PersonId | FirstName | LastName | Em | FavoriteFood |
---|---|---|---|---|
983mv | Shirley | Temple | st@example.com | Lollipops |
9e84f | Andrea | Smith | as@example.com | Kale |
k28fo | Donald | Duck | dd@example.com | Pancakes |
x934 | Marilyn | Monroe | mm@example.com | Carrots |
8xi | Albert | Howard | ahotherem@example.com | Potatoes |
02e | Vandana | Shiva | vs@example.com | Amaranth |
Then the file we saved to, personmerge.csv, looks like this when opened:
Id | First | Last | Email | Company | PersonId | FavoriteFood | _merge |
---|---|---|---|---|---|---|---|
5829 | Jimmy | Buffet | jb@example.com | RCA | | | left_only |
2894 | Shirley | Chisholm | sc@example.com | United States Congress | | | left_only |
294 | Marilyn | Monroe | mm@example.com | Fox | x934 | Carrots | both |
30829 | Cesar | Chavez | cc@example.com | United Farm Workers | | | left_only |
827 | Vandana | Shiva | vs@example.com | Navdanya | 02e | Amaranth | both |
9284 | Andrea | Smith | as@example.com | University of California | 9e84f | Kale | both |
724 | Albert | Howard | ah@example.com | Imperial College of Science | | | left_only |
 | Shirley | Temple | st@example.com | | 983mv | Lollipops | right_only |
 | Donald | Duck | dd@example.com | | k28fo | Pancakes | right_only |
 | Albert | Howard | ahotherem@example.com | | 8xi | Potatoes | right_only |
Just to help drive home how this “horizontal” “merge” operation is different from a “vertical” “concat” operation, check out this code:
col2EquivInCol1 = {'FirstName':'First', 'LastName':'Last', 'Em':'Email'}
concatdf = pandas.concat([df1,df2.rename(columns=col2EquivInCol1)], sort=False)
concatdf.to_csv('c:\\example\\personconcat.csv', index=False)
The output file we’d have saved to, personconcat.csv, would look like this when opened:
Id | First | Last | Email | Company | PersonId | FavoriteFood |
---|---|---|---|---|---|---|
5829 | Jimmy | Buffet | jb@example.com | RCA | | |
2894 | Shirley | Chisholm | sc@example.com | United States Congress | | |
294 | Marilyn | Monroe | mm@example.com | Fox | | |
30829 | Cesar | Chavez | cc@example.com | United Farm Workers | | |
827 | Vandana | Shiva | vs@example.com | Navdanya | | |
9284 | Andrea | Smith | as@example.com | University of California | | |
724 | Albert | Howard | ah@example.com | Imperial College of Science | | |
 | Shirley | Temple | st@example.com | | 983mv | Lollipops |
 | Andrea | Smith | as@example.com | | 9e84f | Kale |
 | Donald | Duck | dd@example.com | | k28fo | Pancakes |
 | Marilyn | Monroe | mm@example.com | | x934 | Carrots |
 | Albert | Howard | ahotherem@example.com | | 8xi | Potatoes |
 | Vandana | Shiva | vs@example.com | | 02e | Amaranth |
As you can see, we did manage to “stack” the First, Last, & Email columns by renaming column names in df2 before adding it to the list of sheets to vertically concatenate (because pandas.concat(...) looks for column names that are exact matches of each other and just “adds on” columns not found in every spreadsheet passed to it).
However, we’ve lost any sense of connection between the fact that Marilyn Monroe works for Fox and the fact that Marilyn Monroe likes to eat carrots.
- We could alphabetize it for easier visual scanning (cdf = cdf.sort_values(by=['First','Last','Email'])) …
- We could facilitate visual scanning for duplicates on First+Last+Email by adding a helper column (cdf['isDupe'] = cdf.duplicated(subset=['First','Last','Email'], keep=False)) …
- We could even get “non-duplicates” out of our way by filtering on that column (cdf = cdf[cdf['isDupe']]) …
…but why? A horizontal merge probably serves our “find out that Marilyn works for Fox and likes carrots” use case better.
There might be business processes where we actually want our output to be a “vertical concat,” however.
- For example, perhaps we actually are just trying to look for duplicates by first/last/email, and data like Company and FavoriteFood are “extra” data we couldn’t care less about.
  - We don’t care whether they show up in the output or not.
- We want to take advantage of vertical concatenation’s ability to process dozens of spreadsheets in one command.
Point is, think about whether you would be copy-pasting vertically in Excel or whether you would be VLOOKUP-ing in Excel before you decide which is appropriate for you!
Example 3: Cross-check 2 financial transaction logs that should be identical, ensuring no Transaction ID exists in only one spreadsheet, nor has a different timestamp between the two spreadsheets.
mergedf = df1.merge(df2, how='outer', left_on=['TransactionId'], right_on=['Id'], indicator=True)
mergedf['HasProblem'] = (mergedf['_merge'] != 'both') | (mergedf['Timestamp_x'] != mergedf['Timestamp_y'])
mergedf = mergedf[['HasProblem','TransactionId','_merge','Timestamp_x','Amount_x','Timestamp_y','Amount_y']]
mergedf.to_csv('c:\\example\\transactionverify.csv', index=False)
If we ran the above code after reading a spreadsheet into the “df1” variable that looked like this…
TransactionId | Amount | Timestamp | CCLast4 |
---|---|---|---|
28499202 | $87.71 | 2018-06-14T04:02:00-05:00 | 3885 |
17689183 | $1,508.82 | 2014-11-13T18:27:00-05:00 | 9274 |
92840068 | $1.08 | 2016-02-08T15:53:00-05:00 | 9274 |
92848928 | $981.46 | 2019-01-03T07:01:00-05:00 | 1784 |
…and after reading a spreadsheet into the “df2” variable that looked like this:
Timestamp | BudgetCode | Amount | Id |
---|---|---|---|
2016-02-08T15:53:00-05:00 | 8294 | $1.08 | 92840068 |
2014-12-01T08:23:00-05:00 | 7140 | $1,508.82 | 17689183 |
2018-06-14T04:02:00-04:00 | 8294 | $87.71 | 28499202 |
2013-02-09T08:01:00-05:00 | 8294 | $517.84 | 82947820 |
Then the file we saved to, transactionverify.csv, looks like this when opened:
HasProblem | TransactionId | _merge | Timestamp_x | Amount_x | Timestamp_y | Amount_y |
---|---|---|---|---|---|---|
True | 28499202 | both | 2018-06-14T04:02:00-05:00 | $87.71 | 2018-06-14T04:02:00-04:00 | $87.71 |
True | 17689183 | both | 2014-11-13T18:27:00-05:00 | $1,508.82 | 2014-12-01T08:23:00-05:00 | $1,508.82 |
False | 92840068 | both | 2016-02-08T15:53:00-05:00 | $1.08 | 2016-02-08T15:53:00-05:00 | $1.08 |
True | 92848928 | left_only | 2019-01-03T07:01:00-05:00 | $981.46 | ||
True | | right_only | | | 2013-02-09T08:01:00-05:00 | $517.84 |
If we were working with millions of records, most of which would have “HasProblem” come out False, we might want to filter out the rows where “HasProblem” is False. We haven’t gotten to this yet, but you could use an approach for filtering on a True/False Series by adding the following line of code right before mergedf = mergedf[['HasProblem','TransactionId',...,'Amount_y']]:
mergedf = mergedf[mergedf['HasProblem']]
If you look at our output data carefully, you might notice that transaction ID 28499202 is only problematic because of a time zone difference (UTC-5 vs. UTC-4).
June 14th, 2018 is in the summer, so this might be a simple problem of one of our systems logging Daylight Savings Time incorrectly.
- We could update our Python script to ignore this particular kind of issue by converting “Timestamp” in df1 and df2 to be interpreted as real dates, rather than as plaintext, before we check for equivalency between Timestamp_x and Timestamp_y. That might be the right approach if:
  - One system logged “noon in New York” as “T12:00:00-05:00” (winter) / “T12:00:00-04:00” (summer)
  - The other system logged “noon in New York” as “T17:00:00Z” (winter) / “T16:00:00Z” (summer), which means it’s logging transactions when they happened in Greenwich Mean Time / UTC.
However, for this particular issue, that’s not quite what we’re seeing. It’s probably one of the systems producing the logs that is forgetting to account for Daylight Savings Time, so there’s probably a bug we should ask to have fixed “upstream” of our logs.
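(If you did want to compare the timestamps as real moments in time rather than as plaintext, a minimal sketch might look like this, inserted right after the merge line and before the “HasProblem” comparison; the column names are the ones from this example:)
mergedf['Timestamp_x'] = pandas.to_datetime(mergedf['Timestamp_x'], utc=True)
mergedf['Timestamp_y'] = pandas.to_datetime(mergedf['Timestamp_y'], utc=True)
# "2018-06-14T04:02:00-05:00" vs. "2018-06-14T04:02:00-04:00" would still compare
# as different (they really are different instants), but "T12:00:00-05:00" vs.
# "T17:00:00Z" would now compare as equal.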
Door Prize Script: Event-Attendance-Concatenating Loop ↕
Tweet me if you use this in real life – remember to hashtag #AwesomeAdmin & #YayUs !
To watch this run, visit https://repl.it/@rplrpl/Door-Prize-Multi-Event-Concatenator
I’ve saved 4 files inside of my codebase at Repl.it (the 3 we saw earlier, plus a new one full of RSVPs for “Python 103”).
PLEASE BE CAREFUL about the code you’re writing when you run code against “every file in a folder.” You can overwrite so many files in a millisecond.
Just like I say, “Don’t write your output to the same file that you’re reading your input from”:
- Don’t write a loop that writes output files to the same filenames OR the same directory as you’re reading your input from
Code:
import os
import pandas
lookForCSVsInThisFolder = 'C:\\FolderWhereIPutAllTheFiles\\'
listOfDataFrames = []
for x in os.listdir(lookForCSVsInThisFolder):
    if x.endswith('.csv'):
        # os.path.join() builds the full path, so this works no matter
        # which folder you run the script from
        xdf = pandas.read_csv(os.path.join(lookForCSVsInThisFolder, x))
        xdf = xdf.assign(WhichCSV = x)
        listOfDataFrames.append(xdf)
concatdf = pandas.concat(listOfDataFrames)
concatdf = concatdf.sort_values(by=['First','Last','Email','WhichCSV'])
concatdf.to_csv('c:\\example\\loopconcat.csv', index=False)
(Note: in the Repl.it, instead of setting “lookForCSVsInThisFolder” to 'C:\\FolderWhereIPutAllTheFiles\\', I set it to '.' because that means “the same folder I’m running my Python from” – this is specific to Repl.it and isn’t normally the way I want you to code.)
Note that if you don’t need the “WhichCSV” column added, your code can be much shorter, bringing the creation of “concatdf” down to a single line thanks to “list comprehensions”:
import os
import pandas
lookForCSVsInThisFolder = 'C:\\FolderWhereIPutAllTheFiles\\'
concatdf = pandas.concat([pandas.read_csv(os.path.join(lookForCSVsInThisFolder, x)) for x in os.listdir(lookForCSVsInThisFolder) if x.endswith('.csv')])
concatdf = concatdf.sort_values(by=['First','Last','Email'])  # no 'WhichCSV' column exists in this version
concatdf.to_csv('c:\\example\\loopconcat-alternate.csv', index=False)
Then the file we saved to, loopconcat.csv, looks like this when opened:
First | Last | Email | Event Name | Event Date | Attendance Status | WhichCSV |
---|---|---|---|---|---|---|
Adah | Dimmock | ad@example.com | Python for Salesforce 102 | 2019-01-26 | Cancelled | mergehandson_event3.csv |
Adah | Dimmock | ad@example.com | Python for Salesforce 103 | 2019-12-29 | RSVPed | mergehandson_event4.csv |
Ermanno | Withinshaw | ew@example.com | Python for Salesforce 101 | 2018-10-20 | Attended | mergehandson_event1.csv |
Harwilll | Menhenitt | hm@example.com | Python for Salesforce 103 | 2019-12-29 | RSVPed | mergehandson_event4.csv |
Haskel | Southerns | hs@example.com | Python for Salesforce 101 | 2018-10-20 | No-Show | mergehandson_event1.csv |
Haskel | Southerns | hs@example.com | Python for Salesforce 101-Office Hours | 2018-11-10 | No-Show | mergehandson_event2.csv |
Haskel | Southerns | hs@example.com | Python for Salesforce 102 | 2019-01-26 | Attended | mergehandson_event3.csv |
Haskel | Southerns | hs@example.com | Python for Salesforce 103 | 2019-12-29 | RSVPed | mergehandson_event4.csv |
Julianna | Judron | jj@example.com | Python for Salesforce 102 | 2019-01-26 | No-Show | mergehandson_event3.csv |
Revkah | Lilburn | rl@example.com | Python for Salesforce 101 | 2018-10-20 | Attended | mergehandson_event1.csv |
Revkah | Lilburn | rl@example.com | Python for Salesforce 103 | 2019-12-29 | Cancelled | mergehandson_event4.csv |
Exercise 5: Make “ContactsToInsert” and “CampaignMembersToInsert” for those not in Salesforce
If you accidentally closed your work, open the https://link.stthomas.edu/sfpy201901-eventmerge2 “starter code.”
You’re going to pick up with eventsdf already existing (that was the first concatenation you did in the previous exercise), and you’ll be shooting to create a file called CampaignMemberRecordsToInsert2.csv that looks like this:
ContactId | CampaignId | CampaignMemberStatus | Last | First | Email | Event Name | Event Date |
---|---|---|---|---|---|---|---|
003X61 | 701X02 | Attended | Lilburn | Revkah | rl@example.com | Python for Salesforce 101 | 2018-10-20 |
003X62 | 701X02 | No-Show | Southerns | Haskel | hs@example.com | Python for Salesforce 101 | 2018-10-20 |
003X63 | 701X02 | Attended | Withinshaw | Ermanno | ew@example.com | Python for Salesforce 101 | 2018-10-20 |
003X64 | 701X03 | No-Show | Southerns | Haskel | hs@example.com | Python for Salesforce 101-Office Hours | 2018-11-10 |
003X65 | 701X05 | Attended | Southerns | Haskel | hs@example.com | Python for Salesforce 102 | 2019-01-26 |
003X66 | 701X05 | Cancelled | Dimmock | Adah | ad@example.com | Python for Salesforce 102 | 2019-01-26 |
The ContactIds come from a command we’re going to run that I wrote to imitate exporting a DataFrame called “merge3df” to a file called “ContactsToInsert.csv,” pushing that file into Salesforce Data Loader as a Contact insertion operation, getting the “success” file back, and re-loading that “success” file back into Python as the new value of “merge3df” (with merge3df = pandas.read_csv('successBlahBlah.csv')).
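(Outside of class, against a real org, that sequence would look roughly like this – a sketch, with 'successBlahBlah.csv' standing in for whatever name Data Loader actually gives your success file:)
merge3df.to_csv('ContactsToInsert.csv', index=False)
# ...insert ContactsToInsert.csv into Salesforce with Data Loader,
# then download the resulting "success" file...
merge3df = pandas.read_csv('successBlahBlah.csv')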
Here are the steps we’ll follow to get this file:
- Do a “left” merge from “eventsdf” to “contactsdf” matching on the FIRSTNAME, LASTNAME, & EMAIL; turn on the “indicator=True” flag; save the result as “merge3df”.
- Remove from “merge3df” any rows where the value in the “_merge” column is not “left_only”; ensure change persists. (We do this by building an expression that becomes a “Series” of True/False values with the same “Item Ids” that “merge3df” has as “row IDs,” then putting that expression inside “merge3df = merge3df[…]”)
- Run the following command:
merge3df = doFakeDataLoad(merge3df)
- Delete columns from “merge3df” so that only the columns of “eventsdf” and “ID” remain; ensure the change persists to “merge3df”. (Note: from here on out, we’ve done this before, just merge1->merge3 & merge2->merge4.)
- Rename the “ID” column of “merge3df” to “ContactId”; ensure “merge3df” changes.
- Merge “merge3df” against “campaignsdf” on event name & start date; “inner” merge; save the result as “merge4df”.
- Rename the “ID” column of “merge4df” to “CampaignId”; ensure “merge4df” changes.
- Rename the “Attendance Status” column of “merge4df” to “CampaignMemberStatus”; ensure “merge4df” changes.
- Re-order the fields of “merge4df” to be: ContactId, CampaignId, CampaignMemberStatus, Last, First, Email, Event Name, Event Date. Don’t bother including “NAME” or “HAPPENED_ON__C” in your final output if they exist.
- Export your data to “CampaignMemberRecordsToInsert2.csv” & have a look. Does it look like it should?
Again, whether you’re coding in your own “solo” console or helping type into the “group” console, we’ll solve this one out loud together. The only hard part is really steps 1-3; we can just copy/paste the rest of the code and change “merge1df” to “merge3df” and “merge2df” to “merge4df” and “….csv” to “…2.csv.”
Exercise 6: Add a new “Note” column to our concatenated events roster
If you accidentally closed your work, open the https://link.stthomas.edu/sfpy201901-eventnotes “starter code.”
You’ll COPY the contents of “eventsdf” into a new DataFrame called “notesdf.” You’ll edit “Event Name” in notesdf to make it a little easier to skim, and then you’ll add a new “Note” column to notesdf and selectively fill it in so that your output data looks like this:
First | Last | Event Name | Event Date | Note |
---|---|---|---|---|
Revkah | Lilburn | PySF101 | 2018-10-20 | |
Haskel | Southerns | PySF101 | 2018-10-20 | Flag A: 2018-10-20 |
Ermanno | Withinshaw | PySF101 | 2018-10-20 | |
Haskel | Southerns | PySF101-Office Hours | 2018-11-10 | Flag B: HASKEL |
Julianna | Judron | PySF102 | 2019-01-26 | Flag B: JULIANNA |
Haskel | Southerns | PySF102 | 2019-01-26 | Flag B: HASKEL |
Adah | Dimmock | PySF102 | 2019-01-26 | Flag B: ADAH |
Here are the steps we’ll follow to get this file:
- Make a clean copy of eventsdf into a new DataFrame called notesdf (note: you can’t just say notesdf = eventsdf … you have to say notesdf = eventsdf.copy(), lest you simultaneously edit the contents of the original eventsdf).
- Overwrite the contents of the “Event Name” column of notesdf to replace 'Python for Salesforce ' with 'PySF' for better skimmability. (Note: all text-filled “Series” have a .str.replace(thingToReplace, replaceItWith) operation.)
- Get rid of the “Email” & “Attendance Status” columns. They’re just wasting screen space right now.
- Add a new blank column called “Note” to notesdf (add a new column & fill it all the way down as the value None).
- Selectively edit the value of Note to say “Flag A: ” along with the Event Date from that row if the person’s last name starts with a capital S.
- Selectively edit the value of Note to say “Flag B: ” along with an upper-cased version of the person’s first name if they’re on the roster for an event in November 2018 or later.
- print(…) or .to_csv(…) your data & have a look. Does it look like it should?
Note that two of Haskel Southerns’ 3 notes are “flag B,” even though he’s eligible for “flag A,” having a last name that begins with “S.” Why do you think that is?
Again, whether you’re coding in your own “solo” console or helping type into the “group” console, we’ll solve this one out loud together. We’ll piece things together a little bit at a time. When you’re not sure what a command should be, using a “placeholder” for data you intend to build later (like 'Hi There' as a stand-in for the “Flag A + date” data) is a great idea while you test that you got a different part of the code right – just like you would do when building a complicated Excel formula!
For reference, here’s the eventsdf data you’re starting from:
First | Last | Email | Event Name | Event Date | Attendance Status |
---|---|---|---|---|---|
Revkah | Lilburn | rl@example.com | Python for Salesforce 101 | 2018-10-20 | Attended |
Haskel | Southerns | hs@example.com | Python for Salesforce 101 | 2018-10-20 | No-Show |
Ermanno | Withinshaw | ew@example.com | Python for Salesforce 101 | 2018-10-20 | Attended |
Haskel | Southerns | hs@example.com | Python for Salesforce 101-Office Hours | 2018-11-10 | No-Show |
Julianna | Judron | jj@example.com | Python for Salesforce 102 | 2019-01-26 | No-Show |
Haskel | Southerns | hs@example.com | Python for Salesforce 102 | 2019-01-26 | Attended |
Adah | Dimmock | ad@example.com | Python for Salesforce 102 | 2019-01-26 | Cancelled |
Door Prize Script: A little pivot
Here’s one more script for you to take home.
I’ve always been a bit “meh” about statistics and pivottables.
I suppose I lean towards programming, rather than data analysis, because my heart lies with beating a computer into giving “question-askers” the answers they seek … not coming up with all the great questions.
But many of you are probably the question-askers in your organizations! So it IS important to be able to summarize data!
And honestly, one thing I love, as a Salesforce admin, about doing pivot tables and aggregations with Python, is incorporating them into scripts I’m writing to automate repetitive, boring work. Sometimes in Excel I DO make a PivotTable just to copy it back into a new tab without any formatting, tweak it a bit, and keep on editing as if it had always been an ordinary data table. I definitely do that, too, in Python.
However, the commands and approaches quickly get varied, even faster than all that nonsense about a million different meanings of df[...].
If you want to become a pivot table expert for business analytics, you have to check out the blog Practical Business Python. I recommend going through the archives and poking around 2014 and early 2015 (like all blogs, it can get more complicated over time as the author has “covered the easy stuff”).
But let’s do one simple example!
Starting with our trusty concatenated event roster, eventsdf
…
First | Last | Email | Event Name | Event Date | Attendance Status |
---|---|---|---|---|---|
Revkah | Lilburn | rl@example.com | Python for Salesforce 101 | 2018-10-20 | Attended |
Haskel | Southerns | hs@example.com | Python for Salesforce 101 | 2018-10-20 | No-Show |
Ermanno | Withinshaw | ew@example.com | Python for Salesforce 101 | 2018-10-20 | Attended |
Haskel | Southerns | hs@example.com | Python for Salesforce 101-Office Hours | 2018-11-10 | No-Show |
Julianna | Judron | jj@example.com | Python for Salesforce 102 | 2019-01-26 | No-Show |
Haskel | Southerns | hs@example.com | Python for Salesforce 102 | 2019-01-26 | Attended |
Adah | Dimmock | ad@example.com | Python for Salesforce 102 | 2019-01-26 | Cancelled |
…we’ll pivot people into a single line apiece (treating a name+email as a “person”) and display a bit of attendance information to the right of each name.
With this code:
import numpy
import pandas
evdf1 = pandas.read_csv('https://raw.githubusercontent.com/pypancsv/pypancsv/master/docs/_data/mergehandson_event1.csv')
evdf2 = pandas.read_csv('https://raw.githubusercontent.com/pypancsv/pypancsv/master/docs/_data/mergehandson_event2.csv')
evdf3 = pandas.read_csv('https://raw.githubusercontent.com/pypancsv/pypancsv/master/docs/_data/mergehandson_event3.csv')
eventsdf = pandas.concat([evdf1, evdf2, evdf3])
pivotdf = pandas.pivot_table(eventsdf, index=['First','Last','Email'], columns='Event Date', values='Attendance Status', aggfunc=numpy.min)
pivotdf = pivotdf.reset_index()
pivotdf.columns.name = None
eventDatesOffered = list(eventsdf['Event Date'].unique())
pivotdf['RSVPed'] = pivotdf[eventDatesOffered].count(axis='columns')
pivotdf['Came'] = pivotdf[eventDatesOffered].isin(['Attended']).sum(axis='columns')
pivotdf['Didnt'] = pivotdf[eventDatesOffered].isin(['No-Show','Cancelled']).sum(axis='columns')
pivotdf.to_csv('outputpivot.csv', index=False)
We’ll end up with a CSV file as output that looks like this:
First | Last | Email | 2018-10-20 | 2018-11-10 | 2019-01-26 | RSVPed | Came | Didnt |
---|---|---|---|---|---|---|---|---|
Adah | Dimmock | ad@example.com | | | Cancelled | 1 | 0 | 1 |
Ermanno | Withinshaw | ew@example.com | Attended | | | 1 | 1 | 0 |
Haskel | Southerns | hs@example.com | No-Show | No-Show | Attended | 3 | 1 | 2 |
Julianna | Judron | jj@example.com | | | No-Show | 1 | 0 | 1 |
Revkah | Lilburn | rl@example.com | Attended | | | 1 | 1 | 0 |
We often need to import a 2nd “module” (an extension to Python’s commands) called “numpy” when doing pivots, so that we can hand pandas.pivot_table(...) an aggregation function it works well with – here, numpy.min. (Recent versions of pandas also accept the plain string 'min' as an aggfunc.)
When we call the pandas.pivot_table(...) command, we specify that we want to process eventsdf, that the data to treat as a “single row” is first+last+email, that we want to set up as many new columns to the right of that as there are unique values in eventsdf’s “Event Date” column, and that at the intersection of a given person and a given date, we want to put the “minimum” (earliest-in-the-alphabet) value found among all rows of eventsdf that had that person and that event date.
(In our eventsdf data, we never have more than 1 RSVP per person per event. Picking a “minimum” function or a “maximum” function on such data is always a neat trick when you’re forced to “aggregate” your data down to a single value, and yet you know it already is a single value.)
If we were to print(...) the output of that command, it’d look pretty hideous, so to get things back to looking like a normal table (this is the part that’s like copy-pasting from a PivotTable into a new blank tab and stripping formatting and doing a little tweaking), we run these two commands back-to-back: pivotdf = pivotdf.reset_index() to get the row and column headers back where they belong, and pivotdf.columns.name = None to get a weird “Event Date” label out of the upper-left-hand corner of the table.
Next we’re going to take advantage of the fact that we just turned all the unique values found in eventsdf’s “Event Date” column into columns of pivotdf.
First, we scan eventsdf['Event Date'] for its unique values, and we store them into a variable as a list. We could type them, because there are only 3, but what if this were a much bigger table? Best to let the computer do it.
Now … what do you get if you say someDataFrame[someListOfColumnNames]?
You get a “sub-table” copy of that DataFrame for those columns that is, in and of itself, also a DataFrame!
That’s what the next few lines do in the parts to the right of the = that say pivotdf[eventDatesOffered].
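(A quick sanity check you can run yourself, assuming pivotdf and eventDatesOffered from the script above:)
print(type(pivotdf[eventDatesOffered]))                          # <class 'pandas.core.frame.DataFrame'>
print(type(pivotdf[eventDatesOffered].count(axis='columns')))    # <class 'pandas.core.series.Series'>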
- Tacked onto the end of “DataFrame”-typed data, .count(axis='columns') produces a “Series,” whose item IDs correspond to the DataFrame’s row IDs, indicating how many non-null values appear in that row of the DataFrame.
- Tacked onto the end of “DataFrame”-typed data, .isin(['No-Show','Cancelled']) produces a copy of that DataFrame, only with all the cell values replaced by whether that value is among “No-Show”/“Cancelled.” In turn, tacked onto the end of “DataFrame”-typed data that’s full of numeric values, .sum(axis='columns') produces a “Series,” whose item IDs correspond to the DataFrame’s row IDs, indicating the sum of the values in that row of the DataFrame (and remember, True = 1; False = 0 … power of 1!).
- And of course, putting such “Series”-typed expressions to the right of pivotdf['NewColumnName'] = adds a new column and fills the values down as indicated.
- Notice that this whole operation didn’t actually have anything to do with “pivoting.” This was normal “table-editing” stuff. We turned our “pivot” back into a “normal table” a while ago with pivotdf = pivotdf.reset_index(). This is the kind of “normal” stuff that you might do in Excel after you’ve copied & pasted your PivotTable back into a “normal worksheet.”
Finally, we write our table out to CSV.
And that’s a wrap for class – thanks for coming!
(“101” recording here)
Don’t peek! Possible answers for exercises 3, 4, 5, & 6
3
import pandas
pandas.set_option('expand_frame_repr', False)
df1 = pandas.read_csv('https://raw.githubusercontent.com/pypancsv/pypancsv/master/docs/_data/sample1.csv', dtype=object)
df1['Hello'] = 'Yay Us'
df1 = df1.rename(columns={'Last':'Last Name','First':'First Name'})
df1 = df1.drop(columns=['Email'])
df1 = df1[['Hello', 'Last Name', 'Company', 'First Name', 'Id']]
print(df1)
4 - 6
import pandas
pandas.set_option('expand_frame_repr', False)
evdf1 = pandas.read_csv('https://raw.githubusercontent.com/pypancsv/pypancsv/master/docs/_data/mergehandson_event1.csv')
evdf2 = pandas.read_csv('https://raw.githubusercontent.com/pypancsv/pypancsv/master/docs/_data/mergehandson_event2.csv')
evdf3 = pandas.read_csv('https://raw.githubusercontent.com/pypancsv/pypancsv/master/docs/_data/mergehandson_event3.csv')
contactsdf = pandas.read_csv('https://raw.githubusercontent.com/pypancsv/pypancsv/master/docs/_data/mergehandson_sf_contacts.csv')
campaignsdf = pandas.read_csv('https://raw.githubusercontent.com/pypancsv/pypancsv/master/docs/_data/mergehandson_sf_campaigns.csv')
def doFakeDataLoad(dfToFakeInserting):
    # Imitates a Data Loader "success" file: adds fake Salesforce IDs
    # and a STATUS column, then moves ID to the front and STATUS to the back
    idNos = range(61, 61+len(dfToFakeInserting))
    dfToFakeInserting['ID'] = ['003X'+str(x) for x in idNos]
    dfToFakeInserting['STATUS'] = 'Item Created'
    dfToFakeInserting = dfToFakeInserting[['ID'] + [x for x in dfToFakeInserting if x not in ['ID','STATUS']] + ['STATUS']]
    return dfToFakeInserting
# ######## EXERCISE 4 #########
eventsdf = pandas.concat([evdf1, evdf2, evdf3])
merge1df = eventsdf.merge(contactsdf, how='inner', left_on=['First','Last','Email'], right_on=['FIRSTNAME','LASTNAME','EMAIL'])
merge1df = merge1df[list(eventsdf.columns) + ['ID']]
merge1df = merge1df.rename(columns={'ID':'ContactId'})
merge2df = merge1df.merge(campaignsdf, how='inner', left_on=['Event Name','Event Date'], right_on=['NAME','HAPPENED_ON__C'])
merge2df = merge2df.rename(columns={'ID':'CampaignId','Attendance Status':'CampaignMemberStatus'})
merge2df = merge2df[['ContactId', 'CampaignId', 'CampaignMemberStatus', 'Last', 'First', 'Email', 'Event Name', 'Event Date']]
print(merge2df)
# ######## EXERCISE 5 #########
merge3df = eventsdf.merge(contactsdf, how='left', left_on=['First','Last','Email'], right_on=['FIRSTNAME','LASTNAME','EMAIL'], indicator=True)
notInSFSeries = merge3df['_merge']=='left_only'
merge3df = merge3df[notInSFSeries]
merge3df = doFakeDataLoad(merge3df)
merge3df = merge3df[list(eventsdf.columns) + ['ID']]
merge3df = merge3df.rename(columns={'ID':'ContactId'})
merge4df = merge3df.merge(campaignsdf, how='inner', left_on=['Event Name','Event Date'], right_on=['NAME','HAPPENED_ON__C'])
merge4df = merge4df.rename(columns={'ID':'CampaignId','Attendance Status':'CampaignMemberStatus'})
merge4df = merge4df[['ContactId', 'CampaignId', 'CampaignMemberStatus', 'Last', 'First', 'Email', 'Event Name', 'Event Date']]
print(merge4df)
# ######## EXERCISE 6 #########
notesdf = eventsdf.copy()
notesdf = notesdf.reset_index(drop=True)  # the concat kept each file's old 0,1,2... row IDs; give notesdf fresh unique ones
notesdf['Event Name'] = notesdf['Event Name'].str.replace('Python for Salesforce ','PySF')
notesdf = notesdf.drop(columns=['Email','Attendance Status'])
notesdf['Note'] = None
conditionAseries = notesdf['Last'].str.startswith('S')
notesdf.loc[conditionAseries, 'Note'] = 'Flag A: ' + notesdf['Event Date']
conditionBseries = notesdf['Event Date'] > '2018-10-31'
notesdf.loc[conditionBseries, 'Note'] = 'Flag B: ' + notesdf['First'].str.upper()
print(notesdf)