I should preface this post by saying that I’m not a Git expert so this is based on my experimentation rather than any deep knowledge. I bet I’m not the only one who merely skimmed the internals chapter in Pro Git. This post is the result of me setting out to learn the Git internals a little better and help anybody else who is trying to use pygit2 for something awesome. With that said, corrections are welcome.
Installation
While this is obvious to some, I think it’s worth pointing out that pygit2 versions track libgit2 versions. If you have libgit2 v0.18.0 installed, then you need to use pygit2 v0.18.x. Compiler errors will flow if you don’t. The docs don’t exactly mention this (pull request coming). Other than that, just follow the docs and you should be set. On MacOS, you can brew install libgit2 (make sure you brew update first) to get the latest libgit2, followed by pip install pygit2.
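As a rough illustration of the version rule above, here is a minimal sketch that compares the major.minor parts of two version strings. The helper name versions_match is made up for this example, and the "same major.minor" rule is only what I observed with the 0.18 series; it is not something pygit2 provides or guarantees.

```python
def versions_match(libgit2_version, pygit2_version):
    """Return True when both versions share the same major.minor prefix.

    Hypothetical helper: it encodes the "pygit2 0.18.x needs libgit2 0.18.x"
    rule of thumb from this post, not an official compatibility check.
    """
    def major_minor(version):
        # Drop a leading "v" and keep only the first two numeric fields.
        parts = version.lstrip('v').split('.')
        return tuple(int(p) for p in parts[:2])

    return major_minor(libgit2_version) == major_minor(pygit2_version)


print(versions_match('v0.18.0', '0.18.1'))  # same 0.18 series
print(versions_match('0.18.0', '0.17.3'))   # different series
```

Running something like this before pip install pygit2 will not fix a mismatch, but it makes the reason for a wall of compiler errors easier to spot.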
The repository
The first class almost any user of pygit2 will interact with is Repository. I’ll be traversing and introspecting the Twitter bootstrap repository in my examples.
    >>> import pygit2
    >>> repo = pygit2.Repository('/Users/dfischer/Projects/bootstrap')
    >>> repo.head.hex  # sha1 hex hash of the commit pointed to by HEAD
    u'd9b502dfb876c40b0735008bac18049c7ee7b6d2'
    >>> repo.path
    '/Users/dfischer/Projects/bootstrap/.git/'
    >>> repo.workdir
    '/Users/dfischer/Projects/bootstrap/'
There’s quite a bit more to the repository object and I’ll show it after I introduce some other git primitives and terminology.
Git objects
There are four fundamental Git objects: commits, tags, blobs and trees. Commits reference snapshots of the working directory, tags are named (and optionally annotated) references to commits, blobs are chunks of data, and trees organize blobs into a directory structure. I’ll save blobs and trees for a later post, but here are some examples of using commits and tags in pygit2.
Since commits and tags are user facing, most Git users should be familiar with them. Essentially, a commit points to a version of the repository’s working copy at a point in time. Commits also carry various interesting bits of metadata.
    >>> commit = repo.revparse_single('042bb9b5')  # equivalent to repo[u'042bb9b5']
    >>> commit.hex
    u'042bb9b51510573a9a1db6bc66cb16311d0d580b'
    >>> commit.message
    u'Merge pull request #6780 from ...'
    >>> commit.author.email  # clearly I'm editing out author info
    u'xxx@gmail.com'
    >>> commit.author.name
    u'xxx'
    >>> commit.author.offset
    -480
    >>> commit.author.time  # epoch time
    1360128599
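The author time and offset shown above are easier to read when combined into a single timezone-aware datetime. Here is a minimal sketch using only the standard library (Python 3 syntax, unlike the Python 2 session above). The helper name signature_datetime is made up for this example, and it assumes the offset is expressed in minutes relative to UTC, so -480 means UTC-8.

```python
from datetime import datetime, timedelta, timezone


def signature_datetime(epoch_seconds, offset_minutes):
    """Combine a signature's epoch time and UTC offset (in minutes)."""
    tz = timezone(timedelta(minutes=offset_minutes))
    return datetime.fromtimestamp(epoch_seconds, tz=tz)


# Values copied from the commit.author output above.
when = signature_datetime(1360128599, -480)
print(when.isoformat())
```

The resulting object still represents the same instant as the epoch value; the offset only changes how it is displayed.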
One gotcha with the repository bracket notation (repo[hex | oid]) is that the type of the key matters. A unicode string is treated as the hex object hash, while a byte string is assumed to be the binary version of the hash, called the oid.
Tags are essentially named pointers to commits but they can contain additional metadata.
    >>> tag = repo.revparse_single('v2.3.1')
    >>> tag.tagger.email
    u'xxx@gmail.com'
    >>> tag.tagger.name
    u'xxx'
    >>> tag.message
    u'v2.3.1\n'
    >>> tag.target  # binary version of the hex commit hash called an "oid"
    '\xeb$q\x8a\xddM\xd3o\xe9/\xdb\xdby\xe6\xffL\xe5\x91\x93\x00'
    >>> commit = repo[tag.target]
    >>> commit.hex
    u'eb24718add4dd36fe92fdbdb79e6ff4ce5919300'
    >>> repo[tag.target].hex == repo.revparse_single('v2.3.1^0').hex
    True
You can read all about the types of parameters that revparse_single handles at man gitrevisions or in the Git documentation under specifying revisions.
Typically, you won’t ever need to convert between hex-encoded hashes and oids, but in case you do, the conversion is trivial:
    >>> import base64
    >>> base64.b16encode(tag.oid).lower() == tag.hex
    True
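The same round trip can be sketched without pygit2 at all, using the hash and raw bytes from the tag example above. This version is written for Python 3, where byte strings and text strings are distinct types, so it is an adaptation of the conversion rather than the exact code I ran.

```python
import binascii

# Hex hash and raw bytes copied from the tag example above.
hex_hash = 'eb24718add4dd36fe92fdbdb79e6ff4ce5919300'
raw_oid = b'\xeb$q\x8a\xddM\xd3o\xe9/\xdb\xdby\xe6\xffL\xe5\x91\x93\x00'

# hex -> raw: 40 hex characters become 20 bytes.
assert binascii.unhexlify(hex_hash) == raw_oid
# raw -> hex: hexlify returns bytes, so decode to get a text string.
assert binascii.hexlify(raw_oid).decode('ascii') == hex_hash

print(len(raw_oid), len(hex_hash))
```

A SHA-1 hash is 20 bytes, which is why the hex form is always 40 characters.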
Walking commits
The Repository object makes available a walk method for iterating over commits. This script walks the commit log and writes it out to JSON.
    import json
    import sys
    from datetime import datetime

    import pygit2


    def main(repository):
        repo = pygit2.Repository(repository)

        commits = []

        # Start at HEAD and visit commits sorted by commit time
        for commit in repo.walk(repo.head.oid, pygit2.GIT_SORT_TIME):
            commits.append({
                'hash': commit.hex,
                'message': commit.message,
                'commit_date': datetime.utcfromtimestamp(
                    commit.commit_time).strftime('%Y-%m-%dT%H:%M:%SZ'),
                'author_name': commit.author.name,
                'author_email': commit.author.email,
                'parents': [c.hex for c in commit.parents],
            })

        print(json.dumps(commits, indent=2))


    if __name__ == '__main__':
        if len(sys.argv) != 2:
            print("USAGE: {0} <repository>".format(__file__))
            sys.exit(1)  # non-zero exit status signals a usage error

        main(sys.argv[1])
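To see the shape of each record without needing a repository on hand, the dictionary built inside the loop can be factored into a small function and fed a stand-in object. This is only a sketch in Python 3: commit_to_dict and fake_commit are names invented for this example, the stand-in uses made-up values, and a real commit would come from repo.walk().

```python
import json
from datetime import datetime, timezone
from types import SimpleNamespace


def commit_to_dict(commit):
    """Build the same record as the walk script from a commit-like object."""
    commit_date = datetime.fromtimestamp(commit.commit_time, tz=timezone.utc)
    return {
        'hash': commit.hex,
        'message': commit.message,
        'commit_date': commit_date.strftime('%Y-%m-%dT%H:%M:%SZ'),
        'author_name': commit.author.name,
        'author_email': commit.author.email,
        'parents': [p.hex for p in commit.parents],
    }


# Stand-in object with made-up values; it only mimics the attributes used above.
fake_commit = SimpleNamespace(
    hex='042bb9b51510573a9a1db6bc66cb16311d0d580b',
    message='Example message',
    commit_time=1360128599,
    author=SimpleNamespace(name='xxx', email='xxx@gmail.com'),
    parents=[],
)

record = commit_to_dict(fake_commit)
print(json.dumps(record, indent=2))
```

Keeping the record building separate from the walk also makes it easy to reuse the same function when iterating over all objects in the repository.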
Dump repository objects
This script dumps all tags and commits in a repository to JSON. It shows how repositories are iterable and sort of puts the whole tutorial together.
    """
    Writes all the tags and commits from a repository to JSON
    """

    import base64
    import json
    import sys
    from datetime import datetime

    import pygit2


    def main(repository):
        repo = pygit2.Repository(repository)

        objects = {
            'tags': [],
            'commits': [],
        }

        for objhex in repo:
            obj = repo[objhex]
            if obj.type == pygit2.GIT_OBJ_COMMIT:
                objects['commits'].append({
                    'hash': obj.hex,
                    'message': obj.message,
                    'commit_date': datetime.utcfromtimestamp(
                        obj.commit_time).strftime('%Y-%m-%dT%H:%M:%SZ'),
                    'author_name': obj.author.name,
                    'author_email': obj.author.email,
                    'parents': [c.hex for c in obj.parents],
                })
            elif obj.type == pygit2.GIT_OBJ_TAG:
                objects['tags'].append({
                    'hex': obj.hex,
                    'name': obj.name,
                    'message': obj.message,
                    'target': base64.b16encode(obj.target).lower(),
                    'tagger_name': obj.tagger.name,
                    'tagger_email': obj.tagger.email,
                })
            else:
                # ignore blobs and trees
                pass

        print(json.dumps(objects, indent=2))


    if __name__ == '__main__':
        if len(sys.argv) < 2:
            print("USAGE {0} <repository>".format(__file__))
            sys.exit(1)

        main(sys.argv[1])
Notes
- There’s talk of changing the pygit2 API to be more Pythonic. I was using v0.18.x for this and significant things may change in the future.
- It helps to think of a Git repository as a tree (or directed acyclic graph if you’re into that sort of thing) where the root is the latest commit. I used to think about version control where the first commit is the root, but instead it is a leaf!
- If your repository uses annotated or signed tags, there will be longer messages or PGP signatures in the tag message.
- I’ve glossed over huge chunks of pygit2 — pretty much anything that writes to the repository — but if I don’t leave something for later my loyal readers won’t come back to read more. =)
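To make the "latest commit is the root" note concrete, here is a toy sketch of a history stored as a plain dictionary and walked from the newest commit through parent links. The commit names and the walk helper are invented for this example, and the traversal is an ordinary depth-first search, not the ordering pygit2 uses.

```python
# Toy history: each key is a commit, each value lists its parents.
# The names are invented; 'd' plays the role of HEAD and merges 'b' and 'c'.
history = {
    'a': [],          # first commit: no parents, so it is a leaf
    'b': ['a'],
    'c': ['a'],
    'd': ['b', 'c'],  # merge commit
}


def walk(history, head):
    """Yield every commit reachable from head, following parent links."""
    seen = set()
    stack = [head]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue  # already visited through another branch
        seen.add(commit)
        yield commit
        stack.extend(history[commit])


order = list(walk(history, 'd'))
print(order)
```

Every commit only knows its parents, which is why a walk has to start from the newest commit and work backwards.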
I’ve only used GitPython before (it’s used by https://github.com/kennethreitz/legit also).
This looks very similar from the code examples above.
The interesting thing about pygit2 as opposed to GitPython is that pygit2 doesn’t shell out to the git command line but instead relies on a C library (libgit2). Also, GitPython seems somewhat stagnant although I don’t know how much Git has changed over that time and so maybe it doesn’t need to change. Thanks for the tip on legit.