Recently, I visited the Facebook Headquarters in London to learn about the process of developing and maintaining its mobile Facebook app. Much more goes on here than you probably realize: some of Facebook's apps are handled here in their entirety, like WhatsApp for desktop and the business-oriented Workplace app.
The offices are just what you'd expect from Facebook's image, though perhaps not quite to The Social Network-levels of excess. This is a place where serious work gets done, but there's a trendy, quirky, and relaxed atmosphere nonetheless. Employees can carry laptops to work wherever they choose, there's a printing room for making posters (just because), commissioned artwork on several of the walls, and a giant Ninja Turtle — I never got an answer as to why.
Oh, and the food is incredible. I was there during Chinese New Year and I had multiple pork bellies. Good times.
However, I wasn't there to enjoy the decor and the cuisine; I was there to learn about Facebook on mobile. More specifically: how on Earth do you even go about maintaining a project this large and ambitious? The Facebook backend serves over two billion people, and the Android app alone sees a new version released every week.
I spoke with Tal Kellner via Facebook's own telepresence system. Tal is a technical program manager, in charge of the Release Engineering Team based in the Tel Aviv engineering office. She was more than happy to share the gritty details.
What I learned was pretty fascinating both from a developer perspective and as a user. Here's what I found out.
Project management at Facebook – Why Scrum > Waterfall
When looking at any large project, you need to consider your project management approach. One such example is called "waterfall" project management. This is a sequential and linear approach where you work on a specific phase in turn, like going from ideation to implementation to testing to release.
Crucially, in this approach you do not begin the next phase until the previous phase is complete. The system originates from manufacturing, where certain stages often rely on the previous stage: you need to source bricks before you can build a wall!
When it comes to software, this approach is restrictive. In the worst case, an update can take so long to roll out that it is obsolete by the time it arrives. Duke Nukem Forever, anyone?
Thus, some software companies opt instead for a more modern approach called "scrum," which is an agile methodology. This method prioritizes the work that matters most and breaks it into modular chunks. It relies on communication between internal departments and even individual agents working alone on their own corners of code.
The result, in theory, is that everyone can work on what's most pressing for them all the time, and that every other part of the business knows what they are doing. There's a high level of ownership for each engineer, and everyone is ultimately responsible for their own work. Not only does this make the company more agile, but it also hopefully increases workplace satisfaction. No one is just a cog in the machine.
I was very impressed to hear that anyone from anywhere within the organization could suggest an idea for a new feature, and then get to work on it if given the go-ahead. Sometimes this might even develop into its own separate app! Facebook is much more of a collaborative project than the top-down vision enforced by a few people (or one person) that it is often portrayed as.
This allows Facebook to run an exceedingly rapid development cycle, with a new mobile update every week and thousands of commits (proposed code changes) in between. If you think that's impressive, the web version (the backend of which also serves the mobile app) updates once every two to three hours!
Facebook is generally very supportive of new ideas and startups. It even has an initiative called LDN LAB devoted to supporting new ideas and businesses.
Finding balance
Of course, there is still always going to be a limit when it comes to what a company can handle. With this much code there is always room for improvement, but there has to come a time when the version is considered "good enough."
That's where the "golden triangle" comes into play. This triangle's three points represent features, quality, and time. Every company has a choice to make here: when it comes to crunch time, do you prioritize new features at the expense of taking a little longer? Do you allow a minor existing bug to slip through the net if it means you can add more features? When you can't do everything, you are forced to prioritize.
At Facebook, the priorities are quality and time. If an update is falling behind the allotted window, a feature will probably get pushed back, rather than a corner being cut or the update being delayed.
Version control and juggling changes
For handling these updates and changes to the code, Facebook uses its own modified version of Mercurial. That's instead of the very widely used Git, which apparently didn't scale as well for the company's purposes. Phabricator is the equivalent of GitHub, and uses a lot of plugins to help streamline workflow and sometimes just to make things a little more fun (Facebook likes its memes apparently).
For the non-programmers out there, Mercurial, like Git, is a version control system. It allows large numbers of people to work on a single piece of software, and to make changes and fixes without jeopardizing the main app version, called the "master branch." These tools help prevent code conflicts and allow for experimentation. Only once a change has been thoroughly approved on a test branch will it then be committed to the master.
Imagine if some poor programmer made a typo that broke the entire code and there was only one version! That would be a bad day for everyone.
Tools like Mercurial make it possible to implement the scrum approach with relative ease, letting everyone work on specific features and bugs simultaneously before merging it all together in one big pot.
Once a week, a release candidate will be cut from the master and this will then go through the testing phase. Coders who have spent all week working on bug fixes or new features will at this point be crossing their fingers hoping their work makes it into the new update.
Any last-minute fixes or changes made by team members must be "cherry-picked" for inclusion in the new branch by those in charge. Reportedly, engineers have been known to bribe the decision makers with gifts of chocolates and alcohol.
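For the programmers in the audience, the weekly cut and the cherry-pick step can be pictured with a toy model. This is only a sketch in Python: the Branch class, the cut_release and cherry_pick helpers, and the commit names are all made up for illustration, and none of it reflects Facebook's real Mercurial tooling.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Branch:
    """Hypothetical, simplified branch: just a name and an ordered list of commit IDs."""
    name: str
    commits: List[str] = field(default_factory=list)


def cut_release(master: Branch, name: str) -> Branch:
    # Copy the commit list so that later commits to master do not leak into the release.
    return Branch(name, list(master.commits))


def cherry_pick(release: Branch, master: Branch, commit: str, approved: bool) -> bool:
    # Only approved commits that exist on master and are not already released get picked.
    if not approved or commit not in master.commits or commit in release.commits:
        return False
    release.commits.append(commit)
    return True


master = Branch("master", ["a1", "b2", "c3"])
rc = cut_release(master, "hypothetical-rc")

# Work continues on master after the cut.
master.commits += ["d4", "e5"]

cherry_pick(rc, master, "e5", approved=True)   # approved late fix makes the release
cherry_pick(rc, master, "d4", approved=False)  # unapproved change waits for next week
```

After this runs, the release candidate holds the three commits present at the cut plus the single approved fix, while the unapproved change stays on master for the following week.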
To compile, Facebook uses another tool called Buck. This single build tool can build anything when it comes to packaging the app. There's no need for separate options like Gradle or Ant when targeting different platforms.
Catching bugs in time
With everyone working on different things, and so many updates going out on a regular basis, it's very important that companies make sure their software works and doesn't have any serious bugs. For the most part, Facebook has a pretty good track record of keeping things running.
To that end, the team splits software testing into tiers, referred to as C1, C2, and C3.
C1 is internal testing, and all employees run that version. During C2, the version goes out to 2 percent of the general public, and C3 is production. Should something truly serious be found, every employee has access to an emergency stop button that brings production to a grinding halt.
The volunteers who put themselves forward for keeping the tiers progressing go by the name "tree huggers" (because branches), and do this on top of their regular jobs.
On mobile, similar tiers are called alpha, beta, and prod. Alpha means an internal test, which all employees will run. The practice of a company using its own products in this way is called "dogfooding," from "eating your own dog food."
Testers also have some unique and interesting tools at their disposal for quickly reporting bugs. One is "Rageshake," where simply shaking the device in frustration opens a bug report, much as in Google Maps.
During alpha, which effectively refers to any internal testing, Facebook also uses automated testing to exercise the app. For example, one recently acquired piece of software called "Sapienz" essentially works by clicking every button and using every feature in a random assault until it triggers a crash. It then logs the stack trace, records the action, and reports back.
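To give a feel for the "random assault" idea, here is a toy random tester in Python. It is a sketch under heavy assumptions: ToyApp and its deliberately planted bug are invented for illustration, and this is not how Sapienz itself is implemented. It simply fires random actions at an object until one raises an exception, then returns the action history and the stack trace.

```python
import random
import traceback


class ToyApp:
    """Hypothetical stand-in for an app under test."""

    def __init__(self):
        self.cart = []

    def add_item(self):
        self.cart.append("item")

    def remove_item(self):
        # Planted bug: raises IndexError when the cart is already empty.
        self.cart.pop()

    def view_cart(self):
        return len(self.cart)


def random_assault(seed, max_steps=1000):
    """Fire random actions at a fresh ToyApp until one crashes or the step budget runs out."""
    rng = random.Random(seed)  # seeded so that any crash found can be replayed
    app = ToyApp()
    actions = ["add_item", "remove_item", "view_cart"]
    history = []
    for _ in range(max_steps):
        name = rng.choice(actions)
        history.append(name)
        try:
            getattr(app, name)()
        except Exception:
            # Report the full action sequence alongside the stack trace.
            return {"actions": history, "stack_trace": traceback.format_exc()}
    return None  # no crash found within the budget


# Try several seeds; each non-None entry is a reproducible crash report.
reports = [random_assault(seed) for seed in range(20)]
crashes = [r for r in reports if r is not None]
```

Seeding the random generator is the design choice that matters here: a crash found by chance is only useful if the same sequence of actions can be replayed by whoever has to fix it.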
The beta app is the version tested by the general public. It goes out to a small subset (~2 percent) of users, who receive the update ahead of time and provide Facebook with real-world feedback. If everything seems good, the update goes out to the entire population, and the process begins anew.
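One common way to pick a stable slice of users like this is to hash each user ID into a bucket. The Python sketch below assumes that approach; I was not told how Facebook actually selects its 2 percent, and the in_rollout function, the user IDs, and the release name are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, release: str, percent: float) -> bool:
    """Return True if this user falls inside the rollout bucket for a release."""
    # Hash user and release together so each release samples a different slice of users.
    digest = hashlib.sha256(f"{release}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < percent / 100.0


# Made-up user IDs; roughly 2 percent of them should be selected.
users = [f"user-{i}" for i in range(100_000)]
selected = [u for u in users if in_rollout(u, "hypothetical-release-1", 2.0)]
share = len(selected) / len(users)
```

Because the decision depends only on the hash, the same user gets the same answer every time without any list of beta users having to be stored, and raising the percentage later only ever adds users to the group.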
Powerful tools for automation and force multiplication
To keep this entire process as quick and as smooth as possible, Facebook uses a large number of different tools. We've already seen how the company uses Phabricator and Sapienz, but it has other tools and plugins for other stages.
A tool called Picknic gathers all of the pull requests (changes that employees have made) in one place for quick and easy reviewing.
When testing throws up an error, a bot called Nagbot informs those responsible and gently prods them into getting the work done. Using a rudimentary AI to handle this process not only ensures the work gets done, but also allows the manager to avoid being the "bad guy" by constantly nagging!
Crashbot is another bot, responsible for reporting those errors as they happen. It is preferable to metrics from the Google Play Console in that it reports in real time. Crashbot will flag an issue once the problems exceed an "acceptable crash threshold." This can be due to the number of people experiencing the error, or the number of times a single user has encountered the same error. Either way, Facebook will also have a metric showing the number of sad users.
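The two triggers described above (many users hitting an error, or one user hitting it repeatedly) can be sketched in a few lines of Python. Everything here is an assumption for illustration: the CrashReport shape, the flagged_signatures function, the crash names, and the threshold values are invented, and Crashbot's real rules were not shared with me.

```python
from collections import Counter, defaultdict
from typing import Iterable, NamedTuple, Set


class CrashReport(NamedTuple):
    user_id: str
    signature: str  # hypothetical identifier for a crash, e.g. derived from the stack trace


def flagged_signatures(
    reports: Iterable[CrashReport],
    max_users: int = 3,             # made-up threshold: distinct users affected
    max_repeats_per_user: int = 5,  # made-up threshold: repeats for a single user
) -> Set[str]:
    """Return crash signatures that exceed either threshold."""
    users_by_sig = defaultdict(set)
    repeats = Counter()
    for r in reports:
        users_by_sig[r.signature].add(r.user_id)
        repeats[(r.signature, r.user_id)] += 1

    # Trigger 1: too many distinct users saw the same crash.
    flagged = {sig for sig, users in users_by_sig.items() if len(users) > max_users}
    # Trigger 2: one user hit the same crash too many times.
    flagged |= {sig for (sig, _), n in repeats.items() if n > max_repeats_per_user}
    return flagged


reports = (
    [CrashReport(f"u{i}", "NullPointer@Feed") for i in range(4)]  # 4 different users
    + [CrashReport("u9", "OOM@Video")] * 6                        # 1 user, 6 times
    + [CrashReport("u1", "Rare@Settings")]                        # a single one-off
)
flagged = flagged_signatures(reports)
```

With these sample reports, the feed crash is flagged for breadth and the video crash for repetition, while the one-off settings crash stays below both thresholds.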
For internal communication, Facebook uses something called Workplace. This is effectively a version of Facebook intended for businesses, which provides a useful way to get information about members of the team, and communicate quickly with those sitting on the other side of the sprawling office. Facebook also sells this software to third parties.
Of course Facebook isn't going to waste time uploading each new version of its apps to the Play Store, App Store, Amazon, and all the rest. There's also an app for that called the Mobile Push Train.
Closing thoughts
Keeping an app like Facebook up to date is an immense undertaking, and the company still needs to convince users to actually install those updates. This is particularly difficult in countries where connectivity is not guaranteed. In Canada, only one percent of users still run a version of Facebook over a year old. In Ethiopia, that number is closer to 50 percent!
The team at Facebook clearly works very hard and uses a ton of tools and processes to keep everything as streamlined as possible. At the end of the day, the development team aims to adhere to five ruling principles:
- Keep the master clean.
- Have one team with expertise in release engineering.
- Release on time, often.
- Dogfood products.
- Be kind to users.
It sounds simple, but as you can see it involves a lot of spinning plates. Even maintaining all the tools used in the process is a project in itself!
For its part, Facebook maintains a friendly and light-hearted atmosphere at the office in London. The team exchanges GIFs and memes through plugins, they name rooms based on "things the British hate" and Shakespearean puns, and they take a lot of pride in their work. At Facebook, they work hard and play hard, and it seems that for the most part, the system works.
Next time a new update rolls out for one of your larger apps, spare a thought for all the work and organization it took to get it there.