Export Medium posts as Markdown

First of all, why? In my case, I write on Medium in the (admittedly unlikely) hope that one day I may be popular enough to generate a modest revenue through its partner programme (while we’re on that subject, if you like my posts, it would be nice if you followed me, but don’t feel obligated). However, I also have my own site, packman.io, which is based on Jekyll and also has a blog section.

For those who don’t know Jekyll, it is a static site generator written in Ruby and distributed under the open source MIT license. If you need a portfolio/blog/documentation website, I strongly recommend you give it a go. I intend to write a post soon about how I use it to generate my own site, but for now, suffice it to say that if a user lands on https://packman.io, I don’t want to direct them away from it by sending them to read my posts on Medium. Besides, my site supports both light and dark mode, which I think is very important because white backgrounds really hurt my eyes (by the way, if you’re like me, I’d also recommend Dark Reader for all those inconsiderate sites that do not support dark mode natively).

Jekyll takes Markdown (MD) files as input and, using a templating mechanism, produces HTML files from them. So I’ve written the small script below to fetch my Medium content and convert it to MD files that Jekyll can do its magic on. Without further ado, here it is, in the hope that it will be of use to you as well:

require 'feedjira'
require 'httparty'
require 'nokogiri'
require 'reverse_markdown'
require 'fileutils'

if ARGV.length < 2
    puts "Usage: " + __FILE__ + " <medium user without the '@'> </path/to/output>"
    exit 1
end

medium_user = ARGV[0]
output_dir = ARGV[1]

FileUtils.mkdir_p(output_dir)

xml = HTTParty.get("https://medium.com/feed/@#{medium_user}").body
feed = Feedjira.parse(xml)

feed.entries.each do |e|
    # normalise `title` to arrive at a reasonable filename
    published_date = e.published.strftime("%Y-%m-%d")
    filename = output_dir + '/' + published_date + '-' + e.title.gsub(/[^0-9a-z\s]/i, '').gsub(/\s+/,'-') + '.md'
    # `File.exists?` was removed in Ruby 3.2; `File.exist?` works on all versions
    if File.exist?(filename)
        puts "#{filename} already exists. Skipping.."
        next
    end
    
    content = e.content
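    parsed_html = Nokogiri::HTML(content)
    # Use the first image as the hero image and strip the scheme to get a protocol-relative URL.
    # `sub` is used instead of `sub!` because `sub!` returns nil when nothing is replaced,
    # and posts without any image fall back to an empty string.
    first_img = parsed_html.at_xpath('//img')
    img = first_img ? first_img['src'].to_s.sub(/\Ahttps?:/, '') : ''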
    
    # Medium feed includes the hero image in the `content` field. Since Jekyll and other systems will probably render the hero image separately, remove it from the HTML before generating the Markdown
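    # `[^"]*` accepts an alt text of any length, the dots in the host name are escaped, and the
    # non-greedy `.*?` keeps the match from running on to a later figcaption on the same line
    content.sub!(/<figure><img\salt="[^"]*"\ssrc="https:\/\/cdn-images-1\.medium\.com\/max\/[0-9]+\/[0-9]\*[0-9a-zA-Z._-]+"\s\/>(<figcaption>.*?<\/figcaption>)?<\/figure>/, '')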
    
    # Turn literal "\n" sequences left in the converted text into real newlines
    result = ReverseMarkdown.convert(content).gsub(/\\n/, "\n")
    meta = <<-META
---
layout: post
author: #{e.author}
title: #{e.title}
date: #{e.published}
background: #{img}
---
    
    META
    
    File.write(filename, meta + result)
end
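
To see what the filename normalisation does, the two `gsub` calls can be tried in isolation. The sketch below is only an illustration: `post_filename` is a hypothetical helper that does not exist in the script, the `_posts` directory is an assumed output path, and `File.join` stands in for the string concatenation used above.

```ruby
# Hypothetical helper (not part of the original script): builds the output
# path using the same date format and the same two gsub calls as the script.
def post_filename(output_dir, published, title)
  date = published.strftime('%Y-%m-%d')
  # Drop everything except ASCII letters, digits and whitespace, then turn
  # each run of whitespace into a single hyphen.
  slug = title.gsub(/[^0-9a-z\s]/i, '').gsub(/\s+/, '-')
  File.join(output_dir, "#{date}-#{slug}.md")
end

published = Time.utc(2023, 4, 23, 20, 23, 44)
puts post_filename('_posts', published, "Capture your user's attention: with style!")
# prints "_posts/2023-04-23-Capture-your-users-attention-with-style.md"
```

Because the character class is ASCII-only, accented letters are dropped along with the punctuation, which is worth keeping in mind for titles that are not in English.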

If you want to download it rather than copy and paste, it’s available from GitLab as well.

Invoke it like so:

./medium_to_md.rb <medium user without the '@'> </path/to/output>

It will generate a clean Markdown file that includes the metadata (front matter in Jekyll terminology) from the original Medium post, for example:

---
layout: post
author: Jesse Portnoy
title: Capture your users attention with style
date: 2023-04-23 20:23:44 UTC
background: //cdn-images-1.medium.com/max/1024/1*TlDFO_bhcRPJDMxEceyeyw.png
---
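
If you want to check that a generated file has front matter that actually parses, something along these lines may help. It is a minimal sketch using only Ruby's standard library: `split_front_matter` is a hypothetical helper and the sample text is typed in by hand, so treat it as an illustration rather than as the way Jekyll itself reads files.

```ruby
require 'yaml'
require 'date'

# Hypothetical helper: splits a generated file into its front matter (as a
# Hash) and whatever follows the closing '---' line.
def split_front_matter(text)
  match = text.match(/\A---[ \t]*\n(.*?)\n---[ \t]*\n(.*)\z/m)
  return [{}, text] unless match

  # Date and Time are permitted in case a value is parsed as a timestamp.
  front = YAML.safe_load(match[1], permitted_classes: [Date, Time]) || {}
  [front, match[2]]
end

sample = <<~POST
  ---
  layout: post
  author: Jesse Portnoy
  title: Capture your users attention with style
  date: 2023-04-23 20:23:44 UTC
  background: //cdn-images-1.medium.com/max/1024/1*TlDFO_bhcRPJDMxEceyeyw.png
  ---

  Body of the post in Markdown.
POST

front, body = split_front_matter(sample)
puts front['title']
puts front['background']
```

One thing a check like this can catch: the script writes the title unquoted, so a title containing a colon followed by a space would not parse as YAML and would need quoting by hand.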

May the source be with you,