PMS - PitMan Search

project pitman shorthand python algo



tl;dr I made a pitman shorthand dictionary searcher, and the code is on gitlab.

So I was trying to learn pitman shorthand, and was searching around for an online dictionary where I could easily look up the corresponding shorthand symbol for the words I need.

I found a not-too-bad pitman shorthand converter online. Being online, it is pretty convenient, but there are usage limits, and when I was trying to find words one by one, I hit the limit pretty quickly.

pre-internet version: shorthand dictionaries

So I thought about it: how did they do it in the past, and can we do the same thing here? The pre-Internet version of this would be shorthand dictionaries.

Shorthand dictionaries are the classic way people learn shorthand, so you can be sure there is going to be sufficient information to work with, but the problem lies in the search.

I found a copy of the dictionary online, but I found searching through it really cumbersome. First, the search is rather slow (the book is huge, you cannot blame it). And second, you don’t get direct results. So I thought to myself, can I do better?

yes i can do better

Since the words are indexed, I can crawl through them and do some preprocessing to narrow down the pages I need to search, and that would be a really huge improvement in usability.

So what I did was go through the pdf files and look at the words extracted by the OCR. Well, they are not perfect, but even with this noisy data, can we do something better to index the words, and really pick a word that represents the page?

And of course we can. In preprocess.py, you can see that we basically make a guess about what kind of page this is likely to be.

testleng, testsize = 0, 3
while words:
    chars = {}
    testleng += 1

    # Count how many words on the page start with each prefix of length testleng.
    for w in words:
        chars[w[:testleng]] = chars.get(w[:testleng], 0) + 1
    likely = sorted([(-1 * chars[c], c) for c in chars])

    # If the most common prefix already covers the first testsize words, stop.
    if len([w for w in words[:testsize]
            if w.startswith(likely[0][1])]) == testsize:
        break

    # Otherwise keep only the words with that prefix and try a longer one.
    words = [w for w in words if w[:testleng] == likely[0][1]]
    if len(words) < testleng:
        words = []

So in the snippet above we make intelligent guesses based on how often the words start with a certain prefix. And since it is a dictionary, that is a pretty smart guess.
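To make the behaviour concrete, here is the same logic wrapped into a function and run on a toy OCR-style word list; the function name guess_prefix and the sample words are mine, for illustration only.

def guess_prefix(words, testsize=3):
    testleng = 0
    while words:
        chars = {}
        testleng += 1
        for w in words:
            chars[w[:testleng]] = chars.get(w[:testleng], 0) + 1
        likely = sorted([(-chars[c], c) for c in chars])
        # Stop once the top prefix covers the first testsize words on the page.
        if len([w for w in words[:testsize] if w.startswith(likely[0][1])]) == testsize:
            return likely[0][1]
        words = [w for w in words if w[:testleng] == likely[0][1]]
        if len(words) < testleng:
            words = []
    return None

print(guess_prefix(["dog", "cat", "catalogue", "catapult", "cater"]))  # -> "ca"

The stray "dog" gets voted out in the first pass, and the prefix shared by the remaining words ends up representing the page.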

Anyways, this gives us a list of “approximated” words for the different pages, and now when you search for a word, all we have to do is go down the list and give you an estimate of which page the word can be found on.
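To give a rough idea of that lookup step, here is a minimal sketch, assuming preprocessing has left us with a list of (page, prefix) pairs in page order; the names index and lookup are hypothetical, and the real search.py may well do this differently.

import bisect

# Toy index: one representative prefix per page, in page order.
index = [(12, "ab"), (13, "ac"), (14, "ad"), (15, "af")]

def lookup(word, index, window=1):
    prefixes = [prefix for _, prefix in index]
    # Find the last page whose representative prefix sorts at or before the word.
    pos = max(bisect.bisect_right(prefixes, word) - 1, 0)
    # The prefixes are only approximations, so return a small window of pages.
    lo, hi = max(pos - window, 0), min(pos + window, len(index) - 1)
    return [page for page, _ in index[lo:hi + 1]]

print(lookup("active", index))  # -> [12, 13, 14]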

what i have in the end

So the idea works like this: you search for the word, the system guesses the page for you, and you open it in your favourite pdf viewer. That is all there is to it, simple.

USAGE: python search.py <searchword>, and it will return the likely regions where your word can be found.

If you want to set it up and play with it, the code is available on github.

what’s next

Well, there are many pdf viewer plugins written in javascript. So one possible idea is to embed this list into some form of webapp, so that you can search for a word and it will flip directly to the right page for you.
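For what it is worth, a bare-bones version of that webapp could look something like this Flask sketch, which just reuses the lookup() helper and index from the sketch above and hands the page numbers to whatever pdf viewer you embed. Flask is my assumption here; the actual webapp may be built differently.

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/search")
def search():
    word = request.args.get("word", "")
    # lookup() and index are the page-guessing sketch from earlier.
    pages = lookup(word, index)
    # A javascript pdf viewer on the page could then jump straight to pages[0].
    return jsonify({"word": word, "pages": pages})

if __name__ == "__main__":
    app.run()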

If I have the time, I might work on it, but for now, that is all. Have a good day, and hope this helps.

update: I managed to convert it to a webapp