Blog post -

Moving to Heroku

Why?

Since Mynewsdesk was born, we've hosted our service with a company called Globalinn, on leased hardware. It has been an excellent experience, but even excellent experiences sometimes come to an end.

Mynewsdesk has grown very fast, which has made it hard to scale and improve our service the way we want. Globalinn has also grown away from their hosting business, so about a year ago we decided that we should part ways.

That started a journey to find a new hosting partner that better fits our needs. Heroku was our main candidate early on, but we knew the move would be difficult, since they put a lot of constraints on how a website should be run. These constraints are generally good ideas, and many of them are codified in the rails_12factor gem. After evaluating our alternatives, and a bunch of load testing, we decided to go down the Heroku path anyway.
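For readers who haven't used it: adopting those conventions in a Rails app of that era was typically a one-line Gemfile change. A minimal sketch (the gem bundles stdout logging and static-asset serving, which is how Heroku expects a Rails app to behave):

```ruby
# Gemfile -- a minimal sketch, not our actual Gemfile.
# rails_12factor pulls in rails_stdout_logging (log to stdout
# instead of a file on disk) and rails_serve_static_assets
# (serve compiled assets from the dyno itself).
group :production do
  gem 'rails_12factor'
end
```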

You all know Heroku, right?

Most Ruby on Rails developers have used Heroku one way or another, as they offer a great Platform as a Service with a freemium business model. Today they serve five billion requests per day across four million apps. They are likely one of the biggest customers of Amazon AWS, which is the foundation of Heroku.

They also offer an ecosystem of quality add-on services, including their own top-notch PostgreSQL service, as well as many third-party services such as Memcachier, New Relic, Logentries, Redis Cloud and Websolr, all of which we use today.

Preparations

In total we've put about one person-year of work into the migration. The work was mostly owned by the Listen & React team at Mynewsdesk, with Kristian Hellquist as our skilled Product Owner, on top of his work as Head of Development at Mynewsdesk.

To begin with, we had to shrink our database massively to avoid needing a huge and costly PostgreSQL setup. This was done mainly by outsourcing our analytics to Google Analytics and aggregating our mail metrics, much of it handled skillfully by our developer Yu Wang.

We also had to upgrade an old, legacy Solr setup to a more modern solution. It has now been moved to Websolr, after a lot of grief, digging and hacking by Belarusian developer Alex Sergeyev.

Another part, not quite finished yet, is to depend less on long-running background jobs and cron jobs. They have been split up into smaller jobs that are less fragile across server restarts, and moved to Sidekiq and Sidekiq Cron.
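The idea can be pictured with a plain-Ruby sketch (all names hypothetical, not our actual code): instead of one long-running job that walks every record, a scheduled entry point enqueues one small job per batch, so a dyno restart only loses the batch in flight rather than the whole run.

```ruby
# Hypothetical sketch of splitting one long-running job into small,
# restart-safe batch jobs. In the real app the queue would be Sidekiq
# and the schedule Sidekiq Cron; here a plain array stands in for both.
QUEUE = []

# The "cron" entry point: instead of processing everything itself,
# it only enqueues one small job per batch of ids.
def schedule_batches(ids, batch_size: 100)
  ids.each_slice(batch_size) { |batch| QUEUE << batch }
end

# Each enqueued job handles a single batch. If the server restarts,
# only the batch currently in flight needs to be retried.
def process_batch(batch)
  batch.sum # placeholder for the real per-record work
end

schedule_batches((1..250).to_a, batch_size: 100)
QUEUE.size # => 3 small jobs instead of one big one
```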

The lack of a shared filesystem has also made us think differently about things such as file uploads, sitemaps, logging and other tasks.

As a part of the preparations we also took the opportunity to upgrade to the latest and greatest Ruby version.

Issues during deploy

Another ongoing project at Mynewsdesk is breaking up our monolithic Rails app into smaller backend services. This made it simple to migrate piece by piece, but in the end we had to move the monolith itself.

One big milestone was moving from our legacy Solr to Websolr at the end of October. The deploy went surprisingly well, and Kristian summarized it with "I have a great feeling about this." When traffic increased the next day, nothing worked. This turned out to be mainly due to a caching issue at Websolr, and partly due to how we used our keys.

Then, after lots of testing and preparation, we felt ready to make the move. With a great team of developers in the right spirit joining up on a Saturday morning in mid-November, we started the deploy.

Everything went surprisingly smoothly. We had developed a Read Only Mode that made it possible to migrate the database while keeping the site up for GET requests, which make up the majority of our traffic anyway. The only issue was a misconfigured Memcachier, which rendered a weird error with HTTP 500 written inside some assets. The total downtime was only about two minutes, which we're proud of, considering it's the largest platform change in the history of Mynewsdesk.
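A Read Only Mode like this can be pictured as a Rack middleware (a hypothetical sketch, not our actual implementation) that lets safe, read-only HTTP methods through and answers everything else with a 503:

```ruby
# Hypothetical sketch of a read-only-mode Rack middleware:
# GET and HEAD requests pass through to the app, while writes
# get a 503, so the site stays up for readers while the
# database is being migrated underneath it.
class ReadOnlyMode
  SAFE_METHODS = %w[GET HEAD].freeze

  def initialize(app, enabled: true)
    @app = app
    @enabled = enabled
  end

  def call(env)
    if @enabled && !SAFE_METHODS.include?(env['REQUEST_METHOD'])
      [503, { 'Content-Type' => 'text/plain' },
       ['Down for maintenance, back shortly.']]
    else
      @app.call(env)
    end
  end
end
```

In a Rails app, a middleware like this would be inserted into the middleware stack and toggled via an environment variable, so it can be flipped on and off without a deploy.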

We wrapped up the day, everything looked fine, and we summarized with "we have a great feeling about this", even though we realized there would be issues and had the full team prepared to take care of them.

Then Monday morning came and nothing worked. The database, while still a decent Heroku setup, was way smaller than our oversized one at Globalinn, and it was now overloaded with heavy database calls. The main culprit was a little autocomplete field that consumed most of the PostgreSQL CPU time. When the site came back up, we added a series of improvements to the weak points that the downsized database had revealed, with great help from Val Milkevich from the Funnel team.

As part of moving our servers away from Globalinn, we also moved our DNS hosting. Since we already had some domains on DNSimple, and since they support the ALIAS records we needed for our Heroku hosting, we moved our domains there. A couple of weeks after the move, they were hit by a DDoS attack that we're still recovering from. To resolve that issue we moved our DNS hosting again, this time to Cloudflare, one of the largest DNS networks in the world, which should be much more resilient to attacks.

Key takeaways

  • This was an awesome team effort with many more people involved from Mynewsdesk than mentioned in this article
  • New Relic provides invaluable insights into what needs to be improved
  • The Heroku support worked well when needed, with quick feedback from very competent engineers
  • Bad code hides behind oversized servers, but reveals itself when you move to right-sized infrastructure
  • Don't jinx it by having great feelings about stuff :)

Topics

  • Web services

Categories

  • heroku
  • ruby on rails