How to set up a group project on GitHub

Matthew Lipman
6 min read · Oct 16, 2020


Collaboration is a key element of working as a Data Scientist. No one individual can accomplish what a team, or an entire company or organization, can accomplish collectively. The best companies and organizations are cognizant of how important group chemistry and efficiency are to their success. Think about your first day at work: before even being presented with your daily or weekly tasks, you are introduced to your teammates and onboarded onto the systems that will allow you to complete your responsibilities. Your ability to work together as a cohesive group on data science projects depends not only on how the group meshes from an energetic or behavioral perspective, but also on how easily you can send and receive relevant and useful information. Translated to the work you will be doing as a data scientist: the hours, days, and career you spend working locally on your computer will be meaningless if you are unable to share that work with your team or publish it to the world.

That is why Data Scientists use a Git repository hosting service. There are many great ones out there; the one I will be covering in this brief tutorial is GitHub. Let’s begin!

GitHub’s Octocat [github.com]

Step 1: Create your GitHub repository

Oxford Languages defines a repository as a “central location in which data is stored and managed.”

Go to github.com/new and create your first repository. Give it a descriptive name. Especially if your repository will be public (i.e. viewable by anyone on the internet), the name should signal what your project is about.

Step 2: “Fork” and “Clone” into your local drive

In order to work on your repository, you’ll need to bring the information living in the GitHub repository onto your local machine. The first action you’ll need to take is “forking” the repository on GitHub, which creates your own copy of it under your account. (If you created the repository yourself, it already lives under your account and you can go straight to cloning.) GitHub has a “Fork” button in the top right-hand corner of every repository.

Located in the top right corner of a GitHub repository

After selecting “Fork,” click on the “Code” dropdown button and copy the HTTPS hyperlink.

Select the clipboard to copy the text

Once you’ve forked the repository and copied the link, open your terminal. Navigate to the location where you’d like the repository to be stored on your local device using the cd command (change directory), and verify that your terminal is in the correct directory using the pwd command (print working directory). Once you have verified you are in the right spot, type git clone <the HTTPS hyperlink from GitHub>. Now the repository on GitHub has been copied onto your local machine. Nice work!
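As a rough sketch, the terminal session described above might look something like the following. The folder names and the project name my-project are hypothetical, and a local bare repository stands in for the HTTPS link you would copy from GitHub, so that the sketch runs without a network connection.

```shell
# Hypothetical setup: a scratch folder, with a bare repository standing in
# for the HTTPS link copied from GitHub's "Code" dropdown.
workdir="$(mktemp -d)"
git init --quiet --bare "$workdir/my-project.git"
repo_url="$workdir/my-project.git"   # in practice: the HTTPS link from GitHub

# Move to the folder where the project should live, then confirm the location.
mkdir -p "$workdir/projects"
cd "$workdir/projects"
pwd

# Copy the repository onto your machine and step inside it.
git clone --quiet "$repo_url"
cd my-project
git remote -v   # "origin" should point back at the repository you cloned
```

Git names the new folder after the repository, which is why the last step changes into my-project. Because the stand-in repository is empty, Git may warn that you appear to have cloned an empty repository; that is expected here.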

Umbrella thorn acacia, taken by Stolz Gary M, U.S. Fish and Wildlife Service

Step 3: Work on the correct branch

A repository can have multiple branches. If we think of our Git branches as branches on a tree, working on different branches gives you a clear separation for your work. It also allows you to “make mistakes” (i.e. learn and get better!) without that growth opportunity affecting the rest of the tree. We can always adjust or move our branches. Which leads me to one of the main points in this tutorial: do not work in the master branch (named main in newer GitHub repositories) until your group is ready to prepare the final iteration of your work. Working in the master branch, especially when doing group work, is not advised because changes made directly in master can very easily create merge conflicts. Additionally, if two or more individuals are working in the master branch and push live changes, you can introduce bugs into your code. Below is a step-by-step for how to ensure you are working on your specific branch:

  • STEP 1) In the terminal, navigate to the folder that contains your Git repository using the cd command
  • STEP 2) To figure out which branch you are in, type git branch in the terminal. This lists the branches in your local repository, with an asterisk * next to the branch you are currently working on.
  • STEP 3) If you would like to work in a different branch, type in the terminal: git checkout <insert name of the branch>
  • [OPTIONAL] STEP 4) If you would like to create a new branch and switch to it, type in the terminal: git checkout -b [new name of branch]
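The branch steps above can be sketched roughly as follows. This is only an illustration: the scratch repository, the placeholder identity, and the branch name data-cleaning are all hypothetical, and the first branch is explicitly named master to match this tutorial.

```shell
# Hypothetical setup: a scratch repository with a single commit on master.
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git symbolic-ref HEAD refs/heads/master   # name the first branch "master", as in this tutorial
git -c user.name="Example" -c user.email="example@example.com" \
    commit --quiet --allow-empty -m "Initial commit"

# STEP 2: list the branches; the asterisk marks the one you are on.
git branch

# STEP 4: create a new branch and switch to it in one command.
git checkout -b data-cleaning

# STEP 3: switch between existing branches.
git checkout master
git checkout data-cleaning
git branch   # the asterisk should now sit next to data-cleaning
```

Creating the branch comes before switching in this sketch only because a fresh repository has no second branch to switch to yet.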

Step 4: Code away!

Once you have confirmed you are in the correct branch, you can open up an IDE, or Integrated Development Environment, to write and test your code. I like to use a Jupyter Notebook. To open one, assuming you have installed Jupyter Notebook on your local computer, type jupyter notebook in the terminal. An introductory outline of what you’ll be doing in your first Jupyter Notebook includes, but is not limited to:

  • Install libraries
  • Open dataframes
  • Clean/merge/manipulate dataframes
  • Isolate and visualize your data
  • Analyze your data using various functions/regressions

When you are finished working in your Jupyter Notebook, don’t forget to save your work using the save button in the top left corner of the screen.

The save button is the icon that looks like a floppy disk…does anyone else kinda miss floppy disks?!

Step 5: Share your work by bringing it back into GitHub

  • Open your terminal and type git status to see what has changed. In most terminal setups, changes that have not yet been staged appear in red.
  • Stage your most recent changes by typing git add . and then run git status again to confirm; staged changes appear in green.
  • Once ready to commit, type git commit -m "TYPE HERE SPECIFICALLY WHAT YOUR MOST RECENT COMMIT IS ABOUT OR RATHER WHY IT IS IMPORTANT"
  • The message after -m is extremely useful: it lets your team see what each commit does without parsing through the code. This leads me to my second major point: when you write your commit message, focus on “the why.” Do not only write that the commit adds a certain function; explain why that function is relevant and useful.
  • Lastly, once your changes are committed and you are ready to push them to GitHub, type git push
  • If this gives you a message along the lines of fatal: The current branch [name of branch] has no upstream branch. To push the current branch and set the remote as upstream, use git push --set-upstream origin [name of branch], copy and paste that line (git push --set-upstream origin [name of branch]) into the terminal.
  • It will ask for your GitHub username and password (GitHub now expects a personal access token in place of the account password for pushes over HTTPS) and badabing badaboom, you’ve done it!
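Put together, the commands in this step might run roughly like this. Treat it as a sketch under stated assumptions: a local bare repository stands in for GitHub (so no username or password prompt appears), and the file clean.py, the branch data-cleaning, the placeholder identity, and the commit message are all hypothetical.

```shell
# Hypothetical setup: a bare repository standing in for GitHub, a clone of it,
# and a new branch containing one new file.
workdir="$(mktemp -d)"
git init --quiet --bare "$workdir/remote.git"
git clone --quiet "$workdir/remote.git" "$workdir/local"
cd "$workdir/local"
git config user.name "Example"
git config user.email "example@example.com"
git checkout --quiet -b data-cleaning
echo "df = df.dropna()" > clean.py

# See what has changed, stage everything, then check again before committing.
git status
git add .
git status

# Commit with a message that explains the why, not only the what.
git commit --quiet -m "Drop rows with missing values so the regression does not fail"

# The first push of a new branch needs an upstream; later pushes can be a plain "git push".
git push --quiet --set-upstream origin data-cleaning
```

After the upstream is set once, Git remembers where the branch goes, so subsequent pushes from data-cleaning only need git push.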

Conclusions:

The technology industry would not be where it is today if it were not for open source code. Sharing your work is vitally important for furthering the scientific community and helping others. Whether the code you are working on relates to the migratory patterns of blue whales or the health benefits of flossing, being able to easily communicate and collaborate with your teammates and share your work with others will help us ask important questions and push our collective work to higher ground.
