Build a Python Crawler to Get the Activity Stream with the GitHub API

I want to get activity entries like the ones below:

ShusenTang starred lyprince/sdtw_pytorch
chizhu starred markus-eberts/spert
Hexagram-King starred BrambleXu/knowledge-graph-learning
Yevgnen starred BrambleXu/knowledge-graph-learning
......

2.1 GitHub API

First, we check out the GitHub API documentation. If you have not enabled two-factor authentication, you can run the command below to test the API. After entering your password, you should see the response.

$ curl -u '<username>' https://api.github.com
Enter host password for user '<username>': <password>
{
  "current_user_url": "https://api.github.com/user",
  "current_user_authorizations_html_url": "https://github.com/settings/connections/applications{/client_id}",
  "authorizations_url": "https://api.github.com/authorizations",
......

However, if you have enabled two-factor authentication, you need to use a personal access token for authentication. Follow the help page to create a personal token. As for the access permission/scope, because we only want to get activity streams, selecting the notifications scope is enough.

If things go well, we should have a token now. Follow the authentication instructions and test the token with the command below.

$ curl -H "Authorization: token <TOKEN>" https://api.github.com

{
  "current_user_url": "https://api.github.com/user",
  "current_user_authorizations_html_url": "https://github.com/settings/connections/applications{/client_id}",
  "authorizations_url": "https://api.github.com/authorizations",
.......
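The same token check can be done from Python. Below is a minimal sketch using only the standard library; the helper names `build_request` and `fetch_json` are hypothetical and not part of the original post, and `<TOKEN>` stands for your own personal access token.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_request(token, url=API_ROOT):
    """Build a request that carries the token in the Authorization header."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", "token " + token)
    return request


def fetch_json(request):
    """Send the request and decode the JSON body (performs a network call)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a valid token, `fetch_json(build_request("<TOKEN>"))` should return the same dictionary of URLs that the curl command prints.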

2.2 Get Activity Stream

As described in Other Authentication Methods, we can use the command below to get user data directly.

$ curl -u <username>:<TOKEN> https://api.github.com/user
{
  "login": "xxxx",
  "id": xxxxx,
  "node_id": "xxxx",
  "avatar_url": "xxxx",
......

Next, we need to go through the API documentation to find the activity-related API. In the toggle list on the right, there are “Events”, “Feeds”, and “Notifications” under “Activity”. We need to figure out which one suits our needs.

After glancing over the documentation, we find that “List events that a user has received” under “Events” is what we need.

We can test this API with a curl command.

$ curl -u <username>:<TOKEN> https://api.github.com/users/<username>/received_events

This returns a lot of messages, so it's better to save the response as a JSON file.

$ curl -u <username>:<TOKEN> https://api.github.com/users/<username>/received_events > github_stream.json
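The download-and-save step can also be sketched in Python. This is an illustration rather than the post's own method: it sends the token in an `Authorization` header (as in section 2.1) instead of curl's `-u` option, and the function names are hypothetical.

```python
import json
import urllib.request


def received_events_url(username):
    """Return the received-events endpoint for the given user."""
    return "https://api.github.com/users/{}/received_events".format(username)


def fetch_events(username, token):
    """Download the events as a Python list (performs a network call)."""
    request = urllib.request.Request(received_events_url(username))
    request.add_header("Authorization", "token " + token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def save_events(events, path="github_stream.json"):
    """Write the events to a JSON file, like the curl redirect above."""
    with open(path, "w") as f:
        json.dump(events, f, indent=2)
```

Calling `save_events(fetch_events("<username>", "<TOKEN>"))` would then produce a `github_stream.json` file comparable to the one written by curl.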

We can look at the JSON data to get familiar with the format.

It seems there are several different event types. We can write a simple script to collect all of them.

import json

with open('github_stream.json', 'r') as f:
    data_string = f.read()  # read the file contents as a string
    data = json.loads(data_string)  # convert JSON (str, bytes or bytearray) to a Python object

# Get all event types
events = set()
for event in data:
    events.add(event['type'])
print(events)

# output
{'IssueCommentEvent', 'ForkEvent', 'PushEvent', 'PullRequestEvent', 'WatchEvent', 'IssuesEvent'}
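If you also want to know how often each type occurs, a small variation with `collections.Counter` can help. This is a sketch on hypothetical sample data that only mimics the `type` key of the real response; the function name is an assumption.

```python
from collections import Counter


def count_event_types(events):
    """Count how many events of each type appear in the list."""
    return Counter(event["type"] for event in events)


# Hypothetical sample data with the same "type" key as the real response
sample = [
    {"type": "WatchEvent"},
    {"type": "PushEvent"},
    {"type": "WatchEvent"},
]
print(count_event_types(sample).most_common())
# [('WatchEvent', 2), ('PushEvent', 1)]
```

To run it on real data, pass the list loaded from `github_stream.json` instead of `sample`.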

Comparing this with the activity stream on the GitHub home page, we can see that only three event types appear there: WatchEvent (starred), ForkEvent (forked), and PushEvent (pushed).

So we only need to get activities from these three event types. Below is the script.

# github-api-json-parse.py
import json

with open('github_stream.json', 'r') as f:
    data_string = f.read()  # read the file contents as a string
    data = json.loads(data_string)

# Map each event type to the verb shown in the activity stream
event_actions = {'WatchEvent': 'starred', 'PushEvent': 'pushed to'}

for event in data:
    if event['type'] in event_actions:
        name = event['actor']['display_login']
        action = event_actions[event['type']]
        repo = event['repo']['name']
        print('{name} {action} {repo}'.format(name=name, action=action, repo=repo))
    if event['type'] == 'ForkEvent':
        name = event['actor']['display_login']
        repo = event['repo']['name']
        forked_repo = event['payload']['forkee']['full_name']
        print('{name} forked {forked_repo} from {repo}'.format(name=name, forked_repo=forked_repo, repo=repo))

Run the script and we get the output below.

ShusenTang starred lyprince/sdtw_pytorch
chizhu starred markus-eberts/spert
Hexagram-King starred BrambleXu/knowledge-graph-learning
icoxfog417 pushed to arXivTimes/arXivTimes
icoxfog417 pushed to arXivTimes/arXivTimes
Yevgnen starred BrambleXu/knowledge-graph-learning
......
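For reuse, the parsing logic can be wrapped in a function that returns one line per event. This is a minimal sketch, assuming events carry the fields used above (`actor.display_login`, `repo.name`, `payload.forkee.full_name`); the name `format_event` and the sample event are hypothetical.

```python
EVENT_ACTIONS = {"WatchEvent": "starred", "PushEvent": "pushed to"}


def format_event(event):
    """Return one activity line, or None for event types we skip."""
    name = event["actor"]["display_login"]
    repo = event["repo"]["name"]
    if event["type"] in EVENT_ACTIONS:
        return "{} {} {}".format(name, EVENT_ACTIONS[event["type"]], repo)
    if event["type"] == "ForkEvent":
        forked_repo = event["payload"]["forkee"]["full_name"]
        return "{} forked {} from {}".format(name, forked_repo, repo)
    return None


# Hypothetical sample event, trimmed to the fields the function reads
sample = {
    "type": "WatchEvent",
    "actor": {"display_login": "ShusenTang"},
    "repo": {"name": "lyprince/sdtw_pytorch"},
}
print(format_event(sample))  # ShusenTang starred lyprince/sdtw_pytorch
```

Returning a string instead of printing makes it easy to filter out the `None` results and to test the formatting without a JSON file.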

So far, we can get the JSON data file through the API and parse it to get the activity stream. Next, we'll eliminate the JSON file and get the activity stream directly.

Check out my other posts on Medium!
GitHub: https://github.com/BrambleXu
LinkedIn: www.linkedin.com/in/xu-liang
Blog: https://bramblexu.org
