Recruitment - overhaul required, urgently (part II)

Posted by Jesse Portnoy on February 05, 2025 · 8 mins read

This is the second instalment in a series. In part I, I covered the importance of providing a proper job description. This part will focus on the current approach towards resumes and why it so painfully fails to serve the end goal.

The resume format

While the means of initial filtering has drastically changed (from manual review by a competent human to automatic parsing by software), the format has not changed at all and, worse still, there’s no universally adhered-to standard to use for the parsing/data-extraction process.

Let’s break this problem down to its core components:

The format is very loose and clearly aims to accommodate human thought and analysis patterns, not those of computer algorithms.

When I say the format is “loose”, I am referring to the fact that, while there’s an abundance of (sometimes contradictory) guidelines, there’s no standard. In particular (and this is important for automated parsing), there’s no unified layout, and there are several different common file formats, each with its own sub-formats.

PDF (Portable Document Format) is one such common “format”. Here’s a short blurb from Wikipedia:

PDF files may contain a variety of content besides flat text and graphics including logical structuring elements, interactive elements such as annotations and form-fields, layers, rich media (including video content), three-dimensional objects using U3D or PRC, and various other data formats.

My own resume is a PDF generated using LaTeX, from a template I found in this Git repo. I am also a co-maintainer of mdtopdf - a project that produces PDFs from Markdown.

PDFs are easy to produce and look good to the human eye. As Wikipedia puts it, PDF is “a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software”. The “independent of application software” bit is rather important because it means I do not have to use MS Word (by the way, there are several, very different, MS Word formats, since not all versions use the same format by default) or, more relevant to my case (since I use Linux exclusively on all my personal equipment), LibreOffice (in the hope that it will look okay on whatever MS Word version the recruiter opens it with).

Let’s focus on this bit from the last paragraph: look good to the human eye.

Allow me to regale you with a short quote I encountered in The FORTRAN Colouring Book by Dr. Roger Emanuel Kaufman:

Before you try to solve a problem using a computer, it’s only fair that I tell you a few things about what computers are like.

I once knew a fellow who called up a computer dating service. They tried to fix him up with an IBM machine. Needless to say, it didn’t work out. He was witty, handsome, a dashing bon vivant raconteur - terribly debonair. The computer was dumb, unimaginative, totally lacking in creativity. He spoke fluent Francais - the machine only understood FORTRAN. The poor computer couldn’t follow his exuberant dialogue - it could comprehend only the simplest of grammatical constructions. Of course, it didn’t work out. For one thing, the computer was a lousy cha cha dancer. For another thing, there was a difference of religion. The computer thought it was God and he was upset about how they’d raise the children.

To be fair, we should recognize that computers have many good points as well. True, they are basically stupid. However, they have great memories, they are incredibly fast, they are terrific typists, and they make good money. Last but not least, zey vill follow inztrukshuns to zee letter!

This book was published in 1978. We’ve made loads of technological advancements since, but the general message remains very true today (and those who go “I built software with ChatGPT without learning how to code” or “Soon developers will be made obsolete” would be well advised to read it).

The takeaway is simple: machines and humans are very different in how they interpret and process data. One salient point to understand is this: machines do not think, they compute - which is why they are called computers, not “thinkers”. It’s all in the name.

Parsing a PDF (or DOCX, or other common formats, for that matter) resume intended for the human mind is quite a complex task for a computer. Before we go on, let’s define parsing in this context:

In computer science, parsing is the process used to analyze and interpret the syntax of a text or program to extract relevant information.

Humans can do this easily and intuitively. Machines can be trained to parse resumes written with humans in mind, but it takes effort and is very error-prone. As I reckon many non-programmers are aware, machines ultimately deal with binary sequences (0s and 1s - bit on, bit off - if that makes you think of Karate Kid, we think alike :)). They can be taught to represent a large variety of data in these sequences, but some formats make that far easier than others.

When writing software, one would never opt to store or pass data around in the form of a PDF. Over the years, we’ve come up with an abundance of formats that are a compromise between what humans and machines “understand” best (in quotes because machines don’t really understand anything; they compute and process).
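
To make the contrast concrete, here’s a minimal Python sketch. The file names are made up, and the third-party pypdf library is just one of several PDF readers; the point is that text extracted from a PDF comes back as one flat string whose structure the program has to guess at, whereas a structured format hands it a ready-made hierarchy:

import json

from pypdf import PdfReader  # third-party: pip install pypdf

# Extracting text from a PDF resume yields a flat string; headings, bullet
# points and column order are lost and must be guessed at afterwards.
reader = PdfReader("resume.pdf")
flat_text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(flat_text[:200])

# Loading a structured format, by contrast, yields a ready-made hierarchy
# the program can query directly, no guessing involved.
with open("resume.json") as fh:
    resume = json.load(fh)
print(resume["name"])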

If you’re interested, you can look up a few common ones: XML, INI, JSON, YAML; there are, of course, many more.

Let’s use JSON for illustration purposes. It’s a reasonable choice, as it’s one of the most popular formats, certainly when it comes to APIs (Application Programming Interfaces). As with HTTP, even if you’ve no idea what the term means, I assure you that you make use of APIs on a daily basis. Educating people on these concepts is a laudable aspiration, but as it’s not the purpose of this particular series, I’d suggest looking the term up rather than covering it in this text. JSON handles nested hierarchies well, and we need that to represent a person’s resume.

If I wanted to format the data in my resume as a JSON, it would look something like this:

{
  "name": "Jesse Portnoy",
  "email": "jesse@packman.io",
  "github_profile": "https://github.com/jessp01",
  "gitlab_profile": "https://gitlab.com/packman.io",
  "personal_site": "http://packman.io",
  "linkedin_profile": "https://www.linkedin.com/in/jesse-portnoy-4921752",
  "title": "Multidisciplinary Programmer, Builder & Packager, Automation Engineer",
  "skills": {
    "programming": [
      "C",
      "PHP",
      "Perl",
      "JavaScript/NodeJS",
      "Dart/Flutter",
      "BASH",
      "Go",
      "SQL",
      "Ruby",
      "Python",
      "C#",
      "Java"
    ],
    "devops": [
      "LAMP",
      "Nginx",
      "AWS",
      "Google Cloud Platform",
      "Docker",
      "Vagrant",
      "Travis CI",
      "GH Actions",
      "Ansible",
      "Chef",
      "Nagios",
      "Prometheus"
    ]
  },
  "experience": [
    {
      "company": "Sirius Open Source Inc",
      "start_time": "2023-09-18T09:00:00Z",
      "end_time": "2024-10-01T17:00:00Z",
      "title": "Senior Developer",
      "description": "Things I did go here"
    },
    {
      "company": "Kaltura",
      "start_time": "2012-02-01T09:00:00Z",
      "end_time": "2023-03-31T17:00:00Z",
      "title": "Senior Developer",
      "description": "Things I did go here"
    }
  ]
}

The above is, of course, truncated, but you get the point. See what I mean by a compromise between what humans and machines “understand”? Most humans would definitely rather read the PDF, but they could still read the data formatted this way and it would make sense to them; computers have no preferences (as much as I love them, I make an effort not to humanise computers - I think it’s important), but parsing this JSON will be faster and less error-prone for them (though not as fast as a binary sequence).
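
To illustrate, here are a few lines of Python (assuming the JSON above is saved as resume.json - the file name is arbitrary) showing how trivially a machine can answer typical screening questions against it, with no layout heuristics and no guessing:

import json
from datetime import datetime

# Load the (truncated) resume from the example above.
with open("resume.json") as fh:
    resume = json.load(fh)

# Does the candidate list Go among their programming languages?
knows_go = "Go" in resume["skills"]["programming"]

# How many years of listed experience are there in total?
def years(entry):
    start = datetime.fromisoformat(entry["start_time"].replace("Z", "+00:00"))
    end = datetime.fromisoformat(entry["end_time"].replace("Z", "+00:00"))
    return (end - start).days / 365.25

total = sum(years(e) for e in resume["experience"])
print(f"Knows Go: {knows_go}, listed experience: {total:.1f} years")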

At this point, you may be thinking:

Okay, I got it, machines and humans process things differently and some formats are better suited for one rather than the other but surely, you don’t expect Salesman Smith and Solicitor Jones to manually format their resume this way to aid computers! I mean, they’re meant to be aiding us!

To which I say: NO, I do not expect any human to manually format their resume as a JSON. In fact, I don’t want to do it either; while I can do that (and it’s not that hard), there’s no reason for anyone (programmer or not) to struggle with finding a missing comma, colon or bracket in the above JSON. Instead, we should:

  • Agree on a standard (any standard, even a very flawed one, is better than no standard at all, and standards evolve to address new needs)
  • Write software that takes input in the manner humans are used to and automatically transforms it into such a JSON (or YAML, or whatever, really - something a computer does not have to work hard to parse; and remember, it’s not just about how hard it works, it’s also about how likely it is to make a mistake) - a rough sketch of such a transformation follows this list
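
By way of illustration only - the field names and the form dict below are invented for this sketch, not a proposed standard - here is roughly what that second bullet could look like in Python: the applicant types ordinary text into a form, and the software does the formatting.

import json

# Hypothetical fields, captured the way an applicant might type them into a builder UI.
form = {
    "name": "Jesse Portnoy",
    "email": "jesse@packman.io",
    "programming_skills": "C, PHP, Perl, Go",
    "jobs": [
        {"company": "Kaltura", "title": "Senior Developer",
         "from": "2012-02-01", "to": "2023-03-31"},
    ],
}

# Transform the human-friendly input into the structured resume document.
resume = {
    "name": form["name"],
    "email": form["email"],
    "skills": {
        "programming": [s.strip() for s in form["programming_skills"].split(",")],
    },
    "experience": [
        {
            "company": job["company"],
            "title": job["title"],
            "start_time": f"{job['from']}T09:00:00Z",
            "end_time": f"{job['to']}T17:00:00Z",
        }
        for job in form["jobs"]
    ],
}

print(json.dumps(resume, indent=2))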

Trust me, it’s very easy. Recruitment software is largely a proprietary/closed-source industry (which is part of the problem, actually), but I was able to find a project called open-resume. open-resume is FOSS (licensed under the AGPLv3, written mostly in TypeScript) and it provides two main functionalities:

  • Parsing existing PDFs (https://www.open-resume.com/resume-parser)
  • Producing PDFs from inputs provided in an HTML form (https://www.open-resume.com/resume-builder)

The latter functionality can easily be extended to produce, alongside the PDF, a JSON similar to the one in my example above.

If we all agreed on a standard, people could easily generate and submit both files when applying for a job, making it much easier for all involved.
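
If such a standard took the form of, say, a JSON Schema, checking a submission against it would be a one-liner for every party involved. A minimal sketch, using the third-party jsonschema library (the schema fragment here is made up; a real standard would obviously be far more complete):

from jsonschema import validate, ValidationError  # third-party: pip install jsonschema

# A tiny, made-up slice of what an agreed-upon resume schema might contain.
RESUME_SCHEMA = {
    "type": "object",
    "required": ["name", "email", "skills", "experience"],
    "properties": {
        "name": {"type": "string"},
        "email": {"type": "string"},
        "skills": {
            "type": "object",
            "properties": {
                "programming": {"type": "array", "items": {"type": "string"}},
            },
        },
        "experience": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["company", "title", "start_time", "end_time"],
            },
        },
    },
}

def conforms(resume: dict) -> bool:
    """Return True if a submitted resume follows the agreed standard."""
    try:
        validate(instance=resume, schema=RESUME_SCHEMA)
        return True
    except ValidationError as err:
        print(f"Rejected: {err.message}")
        return False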

The format aside, there’s another important point to consider: if you consult any resume “expert” (and there are plenty of those) or run yours through an online scan, they will all tell you that a resume must be two pages max (and some will even say it must be under 800 words long). This is very restrictive, especially for roles where so much emphasis is put on one’s technical experience and level of expertise with given technologies (programming languages, frameworks, platforms, etc.), and considering the overwhelming abundance of these. Going back to the quote from The FORTRAN Colouring Book:

we should recognise that computers have many good points as well. True, they are basically stupid - however, they have great memories, they are incredibly fast

To a human, each additional page translates to several minutes of processing time (hence why this limitation was introduced in the first place), but to a machine it’s utterly negligible, especially with the resources we have at our disposal today (the computing power in your phone alone is far greater than what the machines that sent people to space in 1969 had). It reminds me of a story I heard about a bureaucratic department where a paper form had to be filled in with blue ink. When someone finally questioned it, they found that the regulation had been established because the machine used to scan the form had difficulty with other ink colours. That machine had been made obsolete decades earlier and the newer processing method had no such limitation. Something to think about.

This feels like a good stopping point.

In this instalment, I attempted to explain why the way we represent resume data, as well as the restrictions we enforce on said data, needs to change to better leverage the fact that we have, for years now, had machines that can help us perform the initial resume filtering and analysis. In the next instalment, we’ll cover the methods (and people) we currently use to evaluate resumes and why these must also change (drastically). I’ll share my experience of querying LLMs about my own resume (as I suspect many recruiters do) and demonstrate how misleading the results can be (hint: very!).

As noted in part I, I welcome comments and discourse and have started formulating a plan to address these issues. I firmly believe it’s more than feasible and, from a technical standpoint (the human aspects are a different story), not overly complex. If you care, please take part and voice your opinions.