by Secrets of the Dark
Those of you who’ve used the Tor network probably know that it can be very hard to navigate at times, even when using the different pages that share links. In fact, I can relate to this – the first time I used it, I just relied on some of the link lists, which turned out to be semi-disastrous.
It does, of course, have its search engines, including not Evil, Ahmia, Grams, Sinbad, and the search engine in question – Candle, which can be accessed at Candle Search Engine. (Once again, don’t forget to access it through Tor.)
Candle’s memorable motto is “no parentheses, no boolean operators, no quotes, just words.” I recently interviewed its creator, who goes by the name “Jobi.” If you’re unfamiliar with how search engines work in general, read on, and you’ll gain some insight!
In his words, he chose the name “Candle” because it:
- “has the right amount of letters
- Ends with ‘le’
- Refers to a thing that brings light in darkness…
- …but not a lot.” (Source: Reddit, “Candle (a search engine)”)
This is how I picture Candle – I’m visual that way.
When we spoke initially on Reddit, I had asked Jobi why he wrote Candle. He said, “I wrote Candle because it was a challenge. To see if I could do it and how it would turn out. It was not designed to be a ‘dark net search engine’, just a search engine. It could index anything. I chose to index the Tor web for a couple of reasons. Mostly because it is nice and small.
“Candle runs on a Macbook. I don’t have fiber connected server farms. For me, indexing the real web would be like sucking down an ocean through a garden hose; indexing the Tor web is like sucking down a bathtub through a straw. Neither are ideal but the latter is not impossible. Also, the Tor web isn’t that well indexed, so it would be more useful.”
If you happen to be on the Tor network and feel lost, I’d recommend trying out Candle; anyhow, on to the meat of the interview!
Secrets of the Dark: What is your background with regard to coding and web development? (i.e. Do you have formal schooling in programming?)
Jobi: Yes. I studied computer science, and have been coding professionally for almost 20 years.
SOTD: What have been your experiences with running a Tor node? Have you experienced any harassment or difficulties in the process?
J: No. It just runs by itself. I have never talked to my ISP about it and they have never contacted me. Some web sites block me, but none that are important to me. My relay is not an exit. It is just a small relay on a low power machine, a single core 1.6 GHz Atom.
SOTD: Prior to creating Candle, what are some software projects you have worked on?
SOTD: You said that you ‘wrote Candle because it was a challenge.’ Do you think that the result you came up with was a successful answer to that challenge?
J: I came across a bunch of issues that I didn’t know before I started. Mostly things that are a bit fuzzy, that you can not just calculate.
It took a lot of tweaking and tuning in order to prevent lots of rubbish in the index, without filtering out good data. Wikis and forums have lots of links that are just not worth crawling. [My sentiments exactly! – Ed.]
I am very conservative about what I consider a ‘word’: Anything under 3 letters is not a word. Anything with a non-letter in it is not a word. Anything with more than 3x the same letter in a row is not a word. Etc…
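For readers who like to see rules as code, here is a minimal Python sketch of the three word rules Jobi lists. It is not Candle's actual code (Jobi works in C and C++ and the source is not public); the function name `is_word` is made up for illustration, and reading "more than 3x the same letter in a row" as "four or more identical letters" is an assumption. His "Etc…" suggests there are further rules that are not covered here.

```python
import re

# Assumption: "more than 3x the same letter in a row" means four or more
# identical characters in a row, e.g. "aaaa".
_REPEATED = re.compile(r"(.)\1{3,}")


def is_word(token: str) -> bool:
    """Return True if token passes the three rules quoted above."""
    if len(token) < 3:          # anything under 3 letters is not a word
        return False
    if not token.isalpha():     # anything with a non-letter is not a word
        return False
    if _REPEATED.search(token.lower()):  # long runs of one letter are rejected
        return False
    return True


if __name__ == "__main__":
    for candidate in ["candle", "ab", "l33t", "zzzzz", "don't"]:
        print(candidate, is_word(candidate))
```

Note that `str.isalpha()` also accepts non-ASCII letters; whether Candle does the same is not stated in the interview.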
In the end I’m quite happy with the quality of the index.
SOTD: I’ve noticed that Candle only returns the top 20 search results (as opposed to all of them). Why did you design it this way?
J: It is part of keeping it lightweight. It also prevents Candle from becoming a tool for others to just suck down the entire index.
Having a ‘next page’ button would mean I’d either have to redo the query, or cache results in ‘sessions’.
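To illustrate why a fixed cap is cheap, here is a small Python sketch that picks only the best 20 entries from a list of scored results, without sorting or storing the rest. The names `MAX_RESULTS` and `top_results` are hypothetical, and how Candle actually selects its top 20 is not described in the interview.

```python
import heapq

MAX_RESULTS = 20  # the cap mentioned in the question; the constant name is ours


def top_results(scored, limit=MAX_RESULTS):
    """Return the URLs of the `limit` best (score, url) pairs, best first."""
    best = heapq.nlargest(limit, scored, key=lambda pair: pair[0])
    return [url for _score, url in best]


if __name__ == "__main__":
    fake = [(i, f"http://example{i}.onion/") for i in range(50)]
    print(top_results(fake)[:3])
```

Because nothing beyond the first page is kept, there is no per-user state to store between requests, which fits Jobi's point about avoiding "sessions".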
SOTD: What kind of work do you do professionally? Is it related to software development, or is that a hobby?
J: I’m a software developer. My day to day work happens in C and C++.
SOTD: Even though a developer, like a magician, might ‘never reveal his secrets,’ would you be willing to give a basic explanation of how the Candle search engine is different from other popular search engines?
J: I don’t believe that Candle is ‘more special’ than others. It is different because I didn’t use any standard framework and came up with my own solutions for things like filtering and ranking.
Also, there is nothing secret about it. I just can not open source it because it uses proprietary libraries from work.
SOTD: Would you be willing to talk about yourself a little (like your educational background)?
J: As I said in question #1, I have studied computer science.
But before that I already coded. As a kid, I got an 8 bit micro. It came with a thick manual and I was curious enough to teach myself how to program it. First in BASIC, then in assembler. This was before the Internet was a thing. Later, I got (access to) a PC and started learning Pascal and C.
SOTD: Did you work with others on this project, or was Candle designed solely by you?
J: I did it solely by myself. At first I never even told anyone it was running. At some point [it] was discovered and the number of hits slowly started to ramp up.
SOTD: Have you ever used other anonymity networks besides Tor (like I2P, Freenet, or GNUnet)? If so, what has been your experience with them? (Has it been positive, negative, or something in between?)
J: I have not. I don’t use Tor that much either, but when I do, it works well enough and I don’t have problems.
SOTD: Is there any kind of content that you try to exclude from Candle search results (such as child pornography)?
J: No. That would be a very slippery slope. Once I start filtering out one thing, I implicitly start condoning everything else.
SOTD: What sorts of changes might you make to Candle’s search algorithms so that it could improve (if any)?
J: The crawling is as good as it gets.
The search result ranking is basically good, but I do still tweak it a little bit from time to time. I do not have a very satisfactory strategy to determine the order in which I visit pages. I have way more URLs than I can visit in a reasonable time, but some URLs deserve to be on a higher rotation than others.
I might add [an] ‘onion history’ feature, where it shows when an onion was up/down, when the home page title changed, things like that. I already keep track of some of that, and I would have to look into how clean and useful that data is.
SOTD: Have people in the Reddit community given you good feedback about Candle, or about Tor in general?
J: I have had a bit of good constructive feedback, but most of it was just ‘hey that looks nice’. Nobody was negative about it, i.e. ‘You suck for making this’.
SOTD: What advice might you give to someone who says, ‘I’d like to develop my own search engine – where should I start?’
J: You can always start with a crawler: read a page with links, parse it, extract the links, add those URLs to your list.
Have it crawl for a few hours, then look at your dataset and see what’s in there that shouldn’t [be].
Come up with filtering rules for those and then restart clean. Repeat this until you are happy with the dataset.
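The crawl loop Jobi describes (read a page, extract the links, add them to the list, then filter and restart) might be sketched in Python roughly as follows. This is only an illustration using the standard library: `fetch` is a hypothetical callable that you would have to supply yourself (for onion sites it would need to go through Tor), and `keep` stands in for whatever filtering rules you come up with after inspecting your dataset.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute http(s) URLs linked from html, with fragments removed."""
    parser = LinkExtractor()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")):
            links.append(absolute)
    return links


def crawl(seed, fetch, keep=lambda url: True, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text, or None on failure."""
    queue = deque([seed])
    seen = {seed}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen and keep(link):  # apply your filtering rules here
                seen.add(link)
                queue.append(link)
    return visited
```

To try it without touching the network, pass a dictionary's `get` method as `fetch`, for example `crawl("http://a.onion/", pages.get)` where `pages` maps URLs to HTML strings. Tightening `keep` and starting over from the seed is the "restart clean" step.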
You should also determine your feature set early on. For example, in Candle you can only search for individual words, not phrases.
For certain features it might be necessary to keep copies of the content you index. I decided I didn’t want that.
SOTD: You had told me that “With Candle, I try to deliver diverse results. It won’t return multiple results from the same onion, or from the same ‘identical/very similar’ onion.” Would it be possible to explain a little about how this is done?
J: When you enter some words, I look up all the URLs that have those words in it. This might contain multiple URLs from the same onion domain. If so, I only keep the ‘best’ one. It also might contain URLs from onions that are mirrors/copies/clones of each other. This is harder to determine.
Since I don’t keep copies of content, I have to base ‘identicality’ on stats and metadata like title, size, number of words, links, etc. (Have you noticed the ‘onion:…’-link underneath each result?)
Which one is the best is based on how often the words occur, how strong those words are, how many words the page has, etc.
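As a rough Python sketch of the two steps described above (keep only the best URL per onion, then drop likely mirrors using metadata alone), consider the following. Everything here is an assumption made for illustration: the `Hit` fields, the function names, and especially the exact-match fingerprint, which is stricter than the "very similar" matching Jobi mentions. The relevance `score` is taken as given, since the interview only names its ingredients.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Hit:
    url: str
    title: str
    size: int         # page size in bytes
    word_count: int
    link_count: int
    score: float      # relevance for the current query, computed elsewhere


def onion_of(hit):
    """Host part of the URL, e.g. 'abc.onion'."""
    return urlsplit(hit.url).hostname or ""


def fingerprint(hit):
    """Metadata-only signature; equal signatures are treated as mirrors."""
    return (hit.title.strip().lower(), hit.size, hit.word_count, hit.link_count)


def diversify(hits, limit=20):
    """Keep the best hit per onion, and the best hit per fingerprint."""
    chosen = []
    seen_onions = set()
    seen_prints = set()
    # Highest score first, so the first hit seen for an onion is its best one.
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        onion, mark = onion_of(hit), fingerprint(hit)
        if onion in seen_onions or mark in seen_prints:
            continue
        seen_onions.add(onion)
        seen_prints.add(mark)
        chosen.append(hit)
        if len(chosen) == limit:
            break
    return chosen
```

A real "very similar" test would presumably tolerate small differences in size or word count rather than require exact equality.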
SOTD: What projects are you currently developing, or do you plan to develop, if given the time?
J: I got an Arduino for Christmas, so currently my evening hours are devoted to making LEDs flash.
Writing Candle was really just an exercise for myself. I am still surprised about the amount of use it gets every day.
(Well Jobi, I’m glad you created it – and I’m sure millions of other Tor users are too!)