owen

I’ve been working on some improvements to Pastoid lately. It started out mostly as a response to the URL shorteners that keep popping up everywhere and getting all the press while Pastoid languishes in obscurity.

For those who don’t know, Pastoid is a site that serves two major functions. First, it’s a URL shortener, like the ubiquitous TinyURL. Second, it’s a pastebin, like pastebin.com. It has a few little extra features that set it apart, and I have a lot planned that will make it something really different and special compared to those other tools.

I recently updated the look of the site. It has been getting a mostly negative reaction; I think people don’t like the grungy purple. Maybe I’ll revise it again, but it does effect the change I wanted: the sidebar is moved to the left so that code sections can expand in the liquid layout on the right. Beyond that, I’ve been working on the thumbnail improvement project, which is the true impetus for this post.

Pastoid uses a third-party service to produce its thumbnails. I started out with a different service, but the one it uses now is much better. It takes shots of the actual URL requested, no matter how deep it is in the site, instead of just handing over generic thumbnails for sites like Google Maps. It also provides the thumbnails in multiple sizes, and even custom sizes if requested.

The service offers thumbnail caching, which is what the site uses right now. Basically, you pass some key information in a URL querystring, and it returns a thumbnail. Unfortunately, the protection that keeps people from stealing your URLs to use elsewhere causes a few little problems that lead to inconsistent behavior.

For example, the querystring is built using an MD5 hash of a secret string and the date. If my server’s date differs from the thumbnail server’s (which it seems to for about four hours a day, due to timezones), then the code is invalid and no thumbnail is returned. I’ve tried to compensate for this, but it’s just not reliable.
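The token scheme looks roughly like this; the concatenation order and date format here are assumptions on my part, since every service defines its own recipe, but the failure mode is the same either way:

```python
import hashlib
from datetime import datetime, timezone

def thumbnail_token(secret: str, day: datetime) -> str:
    """Hash a shared secret together with the date (format assumed).
    If my server and the thumbnail server disagree on what 'today'
    is, the hashes differ and the request is rejected."""
    payload = secret + day.strftime("%Y-%m-%d")
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

# Two clocks that straddle midnight produce different tokens:
my_clock = datetime(2008, 5, 1, 23, 30, tzinfo=timezone.utc)
their_clock = datetime(2008, 5, 2, 3, 30, tzinfo=timezone.utc)
print(thumbnail_token("s3cret", my_clock) == thumbnail_token("s3cret", their_clock))
# False
```

That window where the two dates disagree is exactly the few hours a day when thumbnails silently fail.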

Also, this date-based hashing means thumbnails can’t be cached from day to day. That’s a problem, since each thumbnail request costs a tenth of a credit, so I’m effectively charged again for work the service has already done. Even though the price is small, it seems a little steep for something that could be generated once, cached, and served forever.

So I’ve devised a plan.

The service also offers an API that lets me generate thumbnails behind the scenes and store them on my own. This is a great idea! I want to couple it with some cheap Amazon S3 storage to make the whole operation really cheap and fast.

This is the convoluted process:

  1. Receive the URL to thumbnail from the user.
  2. Send the thumbnail request to the thumbnail service.
  3. Receive a ping from the thumbnail service when the thumbnail generation is complete.
  4. Request the thumbnail data XML from the service and extract the URL of the zip package of generated thumbnails.
  5. Fetch the zip file of thumbnails.
  6. Unzip each thumbnail from the zip file.
  7. Upload the individual thumbnails to S3.
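Steps 4 and 6 can be sketched in Python; the XML element names and URLs here are made up, since the service’s real response format isn’t shown, and the standard-library zip handling is the part that actually runs:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

def extract_zip_url(xml_text: str) -> str:
    """Step 4: pull the zip package URL out of the service's status XML.
    The <package> element name is a guess; the real schema may differ."""
    return ET.fromstring(xml_text).findtext("package")

def unpack_thumbnails(zip_bytes: bytes) -> dict:
    """Step 6: unzip each generated thumbnail into a {filename: bytes}
    map, ready to be pushed to S3 one object at a time (step 7)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}

# Hypothetical status XML, plus an in-memory zip standing in for step 5's download:
status = "<thumbnail><package>http://example.com/thumbs.zip</package></thumbnail>"
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("small.png", b"...png bytes...")
    zf.writestr("medium.png", b"...png bytes...")

print(extract_zip_url(status))                     # http://example.com/thumbs.zip
print(sorted(unpack_thumbnails(buf.getvalue())))   # ['medium.png', 'small.png']
```

The upload in step 7 is just one S3 PUT per extracted file, keyed by the thumbnail’s filename.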

Yes, it’s terribly tricky, but amazingly I’ve gotten it working with minimal fuss, with the exception of item #3, which I can only test on the live server, since the thumbnail service can’t ping the test site behind my home firewall. It has become one of those rare projects that is satisfying in its horrible complexity, yet not so frustrating to implement that it’s a bother.

It should also be possible to use this system to fetch the content of the page itself, and then put it to two uses. First, I’ll throw a cache of the page up to S3, so that if you request it, you can get the original page contents. This will be useful if you’ve used Pastoid to bookmark someplace that later succumbs to link rot.

Second, I’ll strip the tags from the source and use the content for a full-text index. This will make not just the URL searchable, but the page content, too.
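Tag stripping can be done with the standard library’s HTML parser rather than fragile regexes; this is a minimal sketch of the idea, not Pastoid’s actual code, and it also drops script and style contents so they don’t pollute the index:

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collect only text nodes, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0  # >0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)

def strip_tags(html: str) -> str:
    """Return the page's visible text, whitespace-normalized for indexing."""
    p = TagStripper()
    p.feed(html)
    return " ".join(" ".join(p.chunks).split())

print(strip_tags("<html><script>var x;</script><body><p>Obscure page</p></body></html>"))
# Obscure page
```

The resulting text is what would get fed to the full-text index alongside the URL.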

On Pastoid, you can search for any string, and if it’s in the original URL, that link will show up in the results. But if you used Pastoid to create a short link to an obscure URL (which is the whole point, no?), you might want to search the page content instead of the URL. I think this will be very useful.

Also, I’m excited to have some other features in the pipe. I’ve added some login features, but I’ve delayed releasing registration because I’m thinking of revising the whole system to use OpenID. I’ve complained about problems with OpenID before - like the fact that if your OpenID provider goes away, you can’t recover your account - but I think that’s a minimal problem for something with content as ephemeral as Pastoid’s.

Things are shaping up pretty well, and hopefully I’ll roll out these new features, and some I didn’t reveal here, in the next week or two.