After doing a bit of Googling, I found genderize.io, a nice little API that gives you a best guess for a gender if you give it a name. If you send it this string:
https://api.genderize.io/?name=richard
you get back this result:
{"name":"richard","gender":"male","probability":"1.00","count":4381}
In other words, genderize.io believes with 100% confidence that "richard" is a male name. (From Genderize's documentation, the count "represents the number of data entries examined in order to calculate the response.")
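For anyone who would rather read that result in code than by eye, here is a minimal sketch that parses the sample response with Python's standard json module. It makes no network call and works only on the string quoted above. The helper name describe_guess is made up for illustration, and the float() conversion is an assumption that the probability may arrive either as a string (as shown) or as a number.

```python
import json

# Sample response copied from the post; nothing is fetched from the API here.
SAMPLE = '{"name":"richard","gender":"male","probability":"1.00","count":4381}'


def describe_guess(raw):
    """Turn one Genderize-style JSON response into a readable sentence."""
    data = json.loads(raw)
    # float() accepts both "1.00" and 1.0, so either encoding is handled.
    probability = float(data["probability"])
    return "{} -> {} ({:.0%} confidence, {} entries)".format(
        data["name"], data["gender"], probability, data["count"])


print(describe_guess(SAMPLE))
```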
I have more than 2,300 connections on LinkedIn, so looking up everyone's name one at a time was going to be too time-consuming. Instead, I signed up for a developer account and paid for up to 100,000 queries/month. (For more than a handful of queries, Genderize.io will rate-limit you; with a developer account, you get an access token that bypasses the rate limits.)
With an access token, here are the steps I used to get a breakdown of my LinkedIn network's gender split:
- Export LinkedIn connections
- Import the file into a Google Sheet
- Delete everything but the first name field ("Given Name")
- In a separate column, create a URL string that appends the contents of the Given Name column to a tokenized URL that includes your Genderize.io access token. For me this looked like:
=CONCATENATE("https://api.genderize.io/?apikey=ACCESSTOKEN&name=",B2)
- In a new column, use Google Sheets's "ImportData" function to execute the query represented in the adjacent column:
=importdata(C2)
- The previous step creates several columns, as Google Sheets brings the Genderize.io query results into the spreadsheet; unfortunately, it does not properly split the gender result into its own column. Create a new column and use the "Split" command to break the string [gender:"female"] into separate cells, then use "CountIF" to count how many times the word "female" appears in your worksheet. Divide that number by the total number of rows in your spreadsheet, and you have your % of female contacts.
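The URL-building and counting steps above can also be sketched in a few lines of Python. This is only an illustration of the spreadsheet logic, not a replacement for it. The function names build_query_url and share_female are hypothetical, ACCESSTOKEN is the same placeholder used in the formula, and the list of genders is made-up sample data standing in for the imported results.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.genderize.io/"


def build_query_url(name, api_key):
    """Mirror the CONCATENATE step, but URL-encode the name as well."""
    return BASE_URL + "?" + urlencode({"apikey": api_key, "name": name})


def share_female(genders):
    """Mirror the CountIF step: rows marked female divided by all rows."""
    if not genders:
        return 0.0
    return genders.count("female") / len(genders)


# Made-up results standing in for the imported column; None means no guess.
sample_genders = ["female", "male", None, "female"]

print(build_query_url("richard", "ACCESSTOKEN"))
print("{:.1%}".format(share_female(sample_genders)))
```

Note that dividing by the total number of rows, as the spreadsheet does, counts undetermined names in the denominator, so the female share comes out lower than it would if only classified names were counted.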
(If I was a better programmer, I could have built a simple Python script using Genderize.io's API to do this automatically. Maybe someone who reads this will want to build it? Let me know!)
Here you go! Replace the "names" array with the result of your LinkedIn export. NB the genderize.io API limits non-dev users to 1000 queries per day. Also the comments field is eating my indentations, which renders Python nonsensical - argh. Hopefully it's obvious from context.
I noticed that a few of my contacts include a surname in the "First Name" field ('Hillary Rodham' in the below example), and those names come back as undetermined. Might be enough to skew your data, if you assume that women are more likely than men to do this.
import requests

# Replace this list with the first names from your LinkedIn export.
names = [
    'John',
    'Jane',
    'Hillary Rodham',
]

female = 0
male = 0
cant_tell = 0
undetermined_names = []

for name in names:
    # Passing the name via params lets requests URL-encode spaces and accents.
    r = requests.get("https://api.genderize.io/", params={"name": name})
    result = r.json()
    # .get() avoids a KeyError if the response has no gender field,
    # for example when the API returns an error message instead.
    gender = result.get('gender')
    if gender == 'female':
        female += 1
    elif gender == 'male':
        male += 1
    else:
        cant_tell += 1
        undetermined_names.append(name)

# Guard against dividing by zero when no name could be classified.
determined = female + male
ratio = female / determined if determined else 0.0

print("Female: " + str(female))
print("Male: " + str(male))
print("Percent female " + '{:.1%}'.format(ratio))
print("Undetermined: " + str(cant_tell))
print(undetermined_names)
Amazing! Thanks, Jill!
Code is up on github (with a fix for the multiple-given-names problem): https://github.com/jillh510/genderize-contacts
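One simple way to handle the multiple-given-names problem is to keep only the first word of the "First Name" field before querying. The sketch below shows that idea. It is an assumption about a reasonable approach, not necessarily what the linked repository does, and the helper name first_given_name is made up for illustration.

```python
def first_given_name(field):
    """Return the first whitespace-separated word of a first-name field."""
    parts = field.split()
    # An empty or whitespace-only field yields an empty string.
    return parts[0] if parts else ""


names = ["John", "Jane", "Hillary Rodham", "  "]
cleaned = [first_given_name(n) for n in names]
print(cleaned)
```

This trades one error for another: a two-part given name such as "Mary Ann" would be queried as "Mary", which is usually fine for a gender guess but is worth knowing about.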