13 August 2008 - 22:39 django-xappy: Searching with Xapian

Yes, yet another search app. However, unlike other projects, say this year’s GSoC project, it doesn’t try to be generic. Rather, it is specific to Xapian, making it possible to make full use of all the advanced features the excellent Xappy library offers. Xappy is a high-level interface to Xapian (as opposed to the tedious-to-use low-level Python bindings), and supports some pretty nice stuff like facets, tags or ranges.

Yes, I also know about djapian, but… I really wanted to use Xappy!

So, here’s how it works. First, define your index (note that since I really built this for my own use where indexes usually span multiple models, it might not be as simple as it could be if you want to search through only one - at least, right now).

from django.contrib.auth.models import User
from django_xappy import Index, action, FieldActions

from app.models import Book  # the example app's model

class MyIndex(Index):
    location = '/var/www/mysite/search-index'

    class Data:
        @action(FieldActions.INDEX_FREETEXT, spell=True, language="en")
        @action(FieldActions.STORE_CONTENT)
        def name(self):
            # only users provide a name; other models leave this field empty
            if self == User:
                return self.content_object.username

        @action(FieldActions.SORTABLE, type="date")
        def date(self):
            if self == Book:
                return self.content_object.released_at
            elif self == User:
                return self.content_object.date_joined

MyIndex.register(Book)
MyIndex.register(User)

This says where the index lives, what fields it has, what Xappy actions to use for each, and from where to get the data for those fields (note that not all models have to provide data for each field).
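To illustrate how stacked decorators like the ones above can attach several actions to a single field, here is a minimal sketch. It is not django-xappy's actual implementation: this `action`, the `collect_fields` helper and the `_actions` attribute are hypothetical stand-ins, and the action constants are plain strings.

```python
# Hypothetical sketch: how stacked @action decorators could record
# per-field indexing actions. Not django-xappy's real code.

def action(kind, **options):
    """Return a decorator that records (kind, options) on the function."""
    def decorator(func):
        actions = getattr(func, "_actions", [])
        # Decorators apply bottom-up, so prepend to keep source order.
        func._actions = [(kind, options)] + actions
        return func
    return decorator


def collect_fields(data_class):
    """Map each decorated method name to its list of recorded actions."""
    return {
        name: attr._actions
        for name, attr in vars(data_class).items()
        if callable(attr) and hasattr(attr, "_actions")
    }


class Data:
    @action("INDEX_FREETEXT", spell=True, language="en")
    @action("STORE_CONTENT")
    def name(self):
        return "example"

    @action("SORTABLE", type="date")
    def date(self):
        return None


fields = collect_fields(Data)
```

An index class could then walk `fields` once at registration time to learn which actions each field needs.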

Now, the way django-xappy works is, it logs all changes in a database table instead of updating the index directly. This means that your index won’t always be up-to-date, but also, that the rest of your site’s functionality will never be affected by troubles with your search engine. Instead, you regularly apply the changes to your index (e.g. using a cronjob). The easiest way to do that is using the management command.

Let’s create the index for the first time:

PS G:\...\trunk\examples\simple> .\manage.py index --full-rebuild
Creating a new index in "index-1218575904"...
Indexing 11 objects of type "Book"...
Indexing 2 objects of type "User"...
Switching "index-1218575904" to live index...
Done.

Then, after doing some changes, say in the admin:

PS G:\...\trunk\examples\simple> .\manage.py index --update
Updating 1 index with 18 changes...
Done.
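The deferred-update approach described above can be sketched in a few lines. This only illustrates the idea under simplified assumptions: the `ChangeLog` class, its method names and the dict standing in for the Xapian index are all hypothetical, whereas django-xappy itself keeps the log in a database table.

```python
# Hypothetical sketch of "log changes now, apply them to the index later".
# A list stands in for the database table, a dict for the search index.

class ChangeLog:
    def __init__(self):
        self.pending = []          # (operation, key, document) tuples

    def record(self, operation, key, document=None):
        """Called on save/delete; never touches the index itself."""
        self.pending.append((operation, key, document))

    def apply_to(self, index):
        """Replay all pending changes in order; return how many were applied."""
        applied = 0
        for operation, key, document in self.pending:
            if operation == "delete":
                index.pop(key, None)
            else:                   # "add" or "update"
                index[key] = document
            applied += 1
        self.pending = []
        return applied


log = ChangeLog()
index = {}
log.record("add", "book:1", {"name": "Dune"})
log.record("update", "book:1", {"name": "Dune Messiah"})
log.record("add", "user:7", {"name": "alice"})
log.record("delete", "user:7")

count = log.apply_to(index)   # roughly what a cron-driven update run would do
```

Because recording a change is just an append, a broken or missing index cannot make a model save fail; the cost is that the index lags behind until the next run.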

Good. Now that we have an up to date index, let’s search:

results = search("searchterm", page=1, num_per_page=10)

Pass results to your template:

    {% if results %}
        {% for result in results %}
            {{ result.content_object }}
        {% endfor %}
    {% endif %}

Done! For more information, see the readme - and don’t forget to check out the Xappy docs as well.

As usual, the code is available on Launchpad - and now also PyPI. Via bzr:

bzr branch lp:django-xappy

For questions and support, use the django-apps Google Group and prefix your message with (xappy).

No Comments | Tags: Django

31 July 2008 - 20:41 php-languid: statistical language guesser

In the not so good old time, when I hadn’t yet seen the green grass in the land of Python, I was building websites in PHP. And one of them needed a way to identify (i.e. guess) the language of arbitrary text.

One of the most well known open source tools for this seems to be Maciej Ceglowski’s languid, written in Perl. Three years ago, I had this ported to PHP through a RAC project (meaning: While I do own the copyright, I have not written the code). And today, while cleaning up my repositories, I stumbled over it again and decided I might just as well put it out there.

So here it is:

https://launchpad.net/php-languid

I also briefly considered porting it to Python, but fortunately someone else has already done that:

http://code.google.com/p/guess-language/

2 Comments | Tags: Uncategorized

18 July 2008 - 10:43 FeedPlatform

Looking at my sort-of-todo list, I see at least 4 projects that either need or would greatly benefit from a feed aggregator-like functionality, i.e. not just simply parsing, but updating a list of feeds and keeping track of the items.

So it seems clear that the right thing to do is to implement this only once, preferably as some sort of generic library, and then reuse it. Unfortunately, the requirements are quite different: One project needs to track enclosures (which by the way also changes how item guids can potentially be identified). Sometimes notifications need to be sent out. Content has to be analysed in different ways. Two of the apps potentially handle large lists of feeds that require prioritized parsing and sophisticated error handling - in the other cases the list is small, and it wouldn’t be worth the bother. Other issues revolve around how and if to handle cover images, how to handle redirects, whether to ignore entries under certain conditions, and even what data to collect.

You get the point: It’s not that easy to fit all of this under one roof, which is also the reason I’ve been putting it off for a while now. I think I finally came up with a solution that I find satisfying though.

The whole thing centers around a Django settings.py-like config file. By default, two tables would be used, for feeds and items, and save for primary and foreign keys, each table would only need one column: the feed table stores the feed url, the item table the item guid.

Then, in said config file, you would specify a list of addins that each provide a particular, isolated piece of functionality. Addins can depend on each other, and it would be easy to write your own.

A configuration file might look like this:

# my-feedbot-config.py

USER_AGENT = 'MyFeedBot/%s (+url)' % get_version()

ADDINS = [
    # builtin
    collect('title', 'description', 'author'),
    enclosures(require=True),
    save_bandwith(), # needs columns for storing of etag etc. in db
    custom_item_filter(handler_func),

    # custom
    check_for_claimcode(),
]

Now, given the addins specified above, there’d probably be a new enclosure table, new columns for storing the meta data and http header info in the item table, and the parsing process would call out to your code to handle filtering and claimcodes.
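As a rough sketch of how such an addin list could drive the table layout and enforce dependencies: everything below is hypothetical, including the `Addin` base class, the `depends` attribute and both helper functions, and the assumption that the claimcode addin depends on collected item data. It is not FeedPlatform's actual code.

```python
# Hypothetical sketch of an addin list with dependencies; the class names
# and the `depends` attribute are illustrative, not FeedPlatform's API.

class Addin:
    depends = ()                 # addin classes that must be listed earlier

    def columns(self):
        """Extra item-table columns this addin needs."""
        return []


class collect(Addin):
    def __init__(self, *fields):
        self.fields = fields

    def columns(self):
        return list(self.fields)


class check_for_claimcode(Addin):
    depends = (collect,)         # assumed: needs collected item data


def check_dependencies(addins):
    """Raise ValueError if an addin's dependency is not listed before it."""
    seen = []
    for addin in addins:
        for dep in addin.depends:
            if not any(isinstance(a, dep) for a in seen):
                raise ValueError(
                    "%s requires %s" % (type(addin).__name__, dep.__name__))
        seen.append(addin)


def item_columns(addins):
    """Start with the single default column and let addins extend it."""
    columns = ["guid"]
    for addin in addins:
        columns.extend(addin.columns())
    return columns


ADDINS = [
    collect("title", "description", "author"),
    check_for_claimcode(),
]
check_dependencies(ADDINS)
columns = item_columns(ADDINS)
```

With no addins at all, `item_columns([])` yields just the guid column, which matches the minimal default layout described earlier.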

I’ve already checked a decent amount of code into a bazaar branch, but it’s far from finished (or usable). I’ll hopefully have enough time to work on this during the next few weeks and plan to post some updates as I go along (btw, this is not depending on Django, for once).

No Comments | Tags: Python

9 July 2008 - 14:53 django-tables: A QuerySet renderer

While trying to add a simple sorting feature to the critify game listing, I went off on a strange train of thought involving a bunch of future functionality I only have a very vague picture of, and decided that it would be best to choose the most complex approach possible and create a separate, overarchitected abstraction layer for that very purpose.

Well, something like that, though hopefully not quite as bad. Basically, the idea is:

  • to make rendering tabular data a little easier by encapsulating some repetitive parts.
  • to make working with tabular data a lot easier when it comes to the user interacting with the table.

You would define a table, which is sort of a cross between a model and a form, define its columns (i.e. fields), and then tell it how you want to sort, filter and group the data.

Some snippets of actual code should probably explain it best.

First, let’s define a table:

import django_tables as tables

class BookTable(tables.ModelTable):
    id = tables.Column(sortable=False, visible=False)
    book_name = tables.Column(name='title')
    author = tables.Column(data='author__name')
    class Meta:
        model = Book

The table is based on the Book model; thus, it will have a column for each model field, and in addition the locally defined columns will override the default ones, or add to them, very much like newforms works (you’ll find that even the internals are at times very similar to the newforms code).

So, now that we have defined the table, let’s create an instance:

initial_queryset = Book.objects.all()
books = BookTable(
         initial_queryset,
         sort=request.GET.get('sort', 'title'))

We tell the table to operate on the full book data set, and to order it by whatever the user sends along via the query string, or fall back to the default sort order based on the title (book_name) column.

Finally, you would send the table to the template:

return render_to_response('table.html', {'table': books})

Where it is easy to print it out:

<table>

<tr>
  {% for column in table.columns %}
  <th>
    {% if column.sortable %}
      <a href="?sort={{ column.name_toggled }}">
        {{ column }}
      </a>
      {% if column.is_ordered_reverse %}
        <img src="up.png" />
      {% else %}
        <img src="down.png" />
      {% endif %}
    {% else %}
      {{ column }}
    {% endif %}
  </th>
  {% endfor %}
</tr>

{% for row in table.rows %}
  <tr>
  {% for value in row %}
    <td>{{ value }}</td>
  {% endfor %}
  </tr>
{% endfor %}

</table>

The above template code generically renders any table you give it, restricted to its visible columns, and allows each column to be sorted in ascending or descending order (so long as sorting is not disabled for a column). It gives you those nice arrow icons too.

At this point you will probably wonder if it’s not a lot simpler to just say:

Book.objects.all().order_by(request.GET.get('sort'))

And you’d be right, it is, and I indicated as much at the beginning of the post. There are however a couple of nice things about the table abstraction:

  • While Django’s ORM already protects you against SQL injections, you still might like to play it safe and limit the possible values of the sort parameter (which will also ensure users won’t be able to guess your database schema by trying different values). Using django-tables, this is built in.
  • It easily allows you to expose your fields under a different name, e.g. date instead of published_at. True, it’s just cosmetics, but personally I am (probably unhealthily) particular about stuff like that.
  • Both previous points are especially relevant when you want to order via a relationship, e.g. author__birthdate. The double underscore doesn’t look all that nice, gives a rather clear insight into your database layout and also exposes that you are using Django, which may be undesirable.
  • It easily allows you to move control over what fields to expose from your templates into Python. Your templates will deal less with what to render, and more with how to render it.
  • Boilerplate code required to let the user sort the table, particularly when it comes to allowing toggling between descending/ascending, or even multiple sort fields, is reduced.
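The whitelisting and renaming described in the points above can be sketched without django-tables at all. The mapping below mirrors the BookTable example; the `resolve_sort` and `toggled` helpers are hypothetical names for this illustration, not part of the django-tables API.

```python
# Hypothetical sketch: translate a user-supplied sort parameter into a safe
# ORM ordering. Only exposed, sortable columns appear in the mapping.

EXPOSED_COLUMNS = {
    "title": "book_name",        # exposed name -> model field
    "author": "author__name",    # hides the relationship lookup
}


def resolve_sort(param, columns, default):
    """Return an order_by() argument for `param`, falling back to `default`
    if the parameter is missing or names a column that is not exposed."""
    param = param or default
    descending = param.startswith("-")
    name = param.lstrip("-")
    if name not in columns:
        descending, name = default.startswith("-"), default.lstrip("-")
    return ("-" if descending else "") + columns[name]


def toggled(param):
    """Flip between ascending and descending, as a header link would."""
    return param[1:] if param.startswith("-") else "-" + param


ordering = resolve_sort("-title", EXPOSED_COLUMNS, "title")
```

A request for `?sort=password` simply falls back to the default, so probing the query string reveals nothing about the schema.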

Additionally, it might be worth noting that there is a non-model implementation as well (use Table instead of ModelTable), that basically does the same thing with static Python data, e.g. a list of dicts.

To be perfectly honest, at this point I have no idea myself if much of this makes any sense or how useful it actually is. The fact that I haven’t really thought that much beyond the (already implemented) sorting functionality is not helping either (i.e. grouping, filtering…). I am going to toy with it for a bit though, and you are invited to do so too.

The project is maintained in bazaar and can be retrieved via Launchpad:

bzr branch lp:django-tables

There is also a source code browser, where you’ll also find the readme with a lot more information.

For questions/discussions/support use the django-apps Google Group and prefix your message with (tables).

2 Comments | Tags: Django

13 June 2008 - 0:37 django-assets

There are already a couple asset management addons for Django out there, and so I feel somewhat bad for coming up with another one.

In my defense, none of the other approaches appealed to me: django-yslow apparently simply searches recursively through the project directory, finding media files and merging them into one single package. django-assetpacker has you define your assets in the database, which is a definite no-go for me. The database is installation-specific data, my media file configuration I consider to be application/project data (although I realize that that could be argued). Finally, django-compress uses the settings file for setup, which, while better, I’m still not all that happy with either.

So what then, you ask, do I want? See, right now, the only thing that references my JS and CSS files are my templates (only a limited number of base templates in fact). I’d like it to stay this way, i.e. assets should be defined and created solely through templates. So for example:

{% load assets %}
{% assets filter="jsmin" output="packed.js" "jquery.js" "common/file1.js" %}
    <script type="text/javascript" src="{{ ASSET_URL }}"></script>
{% endassets %}

Which would merge jquery.js and file1.js, apply jsmin and store the result as packed.js. Since the tag simply renders its contents using the right ASSET_URL, you are free to reference your media files any way you want to, e.g. specify the correct media-type for CSS files.

Changes to the source files can automatically be detected (only by timestamp, right now), and the assets recreated. As an alternative, there is a management command that can be used to update manually:

./manage.py assets rebuild --parse-templates

As you can see from the command line, and this is the approach’s downside, since there is now no central repository of available assets, the command needs to parse your templates and look for usage of the assets tag. I find that a bearable compromise, however.
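The timestamp-based change detection mentioned above boils down to comparing modification times. Here is a minimal sketch of that check; the `needs_rebuild` function and the demonstration file names are assumptions for illustration, not django-assets' actual code.

```python
import os
import tempfile


def needs_rebuild(output_path, source_paths):
    """True if the output is missing or older than any source file."""
    if not os.path.exists(output_path):
        return True
    output_mtime = os.path.getmtime(output_path)
    return any(os.path.getmtime(src) > output_mtime for src in source_paths)


# Demonstration with temporary files and explicit timestamps.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "file1.js")
    output = os.path.join(tmp, "packed.js")
    with open(source, "w") as f:
        f.write("var a = 1;")
    missing_output = needs_rebuild(output, [source])   # no output yet

    with open(output, "w") as f:
        f.write("var a=1;")
    os.utime(source, (1000, 1000))
    os.utime(output, (2000, 2000))
    fresh_output = needs_rebuild(output, [source])     # output is newer

    os.utime(source, (3000, 3000))
    stale_output = needs_rebuild(output, [source])     # source changed
```

A timestamp comparison is cheap but cannot notice a changed filter setting or a source file removed from the tag, which is where a manual rebuild comes in.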

I’ve implemented a couple of the more popular minifiers and compressors, including YUI, jsmin, jspacker and cssutils. GZip is supported as well, and you can apply multiple filters:

{% assets filter="jsmin,gzip" ... %}
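Conceptually, a comma-separated filter list is just function composition applied left to right. The sketch below shows that idea with a toy whitespace-stripping filter standing in for a real minifier; the `FILTERS` registry, the `strip` filter name and the helper functions are hypothetical, only `gzip.compress` is a real standard library call.

```python
import gzip


def strip_blank_lines(data):
    """Toy stand-in for a real minifier such as jsmin."""
    lines = [line.strip() for line in data.decode("utf-8").splitlines()]
    return "\n".join(line for line in lines if line).encode("utf-8")


FILTERS = {
    "strip": strip_blank_lines,
    "gzip": gzip.compress,
}


def apply_filters(data, spec):
    """Apply the comma-separated filters in `spec` from left to right."""
    for name in spec.split(","):
        data = FILTERS[name.strip()](data)
    return data


def merge(sources):
    """Concatenate the source files' contents into one asset."""
    return b"\n".join(sources)


merged = merge([b"var a = 1;\n\n", b"  var b = 2;  \n"])
packed = apply_filters(merged, "strip,gzip")
```

Order matters here: compressing first and minifying afterwards would hand the minifier binary data it cannot parse.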

Another noteworthy feature might be the ability to rewrite relative urls in CSS files. If your output file is in a different location than your CSS source files, relative url() references will break. The cssrewrite filter will fix this:

{% assets filter="cssrewrite,yui_css" output="cache/default.css" "common/base.css" "extjs/css/extjs-all.css" %}
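To make the url() problem concrete, here is a sketch of what such a rewrite has to do. It is an illustration under simplifying assumptions, not the cssrewrite filter's actual code: `rewrite_css_urls` is a hypothetical name, directories are given as URL-style relative paths, and edge cases such as escaped parentheses are ignored.

```python
import posixpath
import re

# Matches url(...) with optional single or double quotes around the target.
URL_RE = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""")


def rewrite_css_urls(css, source_dir, output_dir):
    """Rewrite relative url() references so they still resolve after the
    stylesheet moves from source_dir to output_dir."""
    def replace(match):
        quote, url = match.group(1), match.group(2)
        if url.startswith(("/", "data:")) or "://" in url:
            return match.group(0)          # absolute: leave untouched
        target = posixpath.normpath(posixpath.join(source_dir, url))
        relative = posixpath.relpath(target, output_dir)
        return "url(%s%s%s)" % (quote, relative, quote)
    return URL_RE.sub(replace, css)


css = 'body { background: url("../img/bg.png"); } a { background: url(/abs.png); }'
rewritten = rewrite_css_urls(css, "extjs/css", "cache")
```

The reference is first resolved against the source directory and then re-expressed relative to the output directory, so the browser ends up requesting the same file as before.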

I’ve put the whole thing up as a Google Code project. I’ve also created a Google Group and since I’m planning on releasing a couple other things in the future, generically called it django-apps - I don’t see the point of having a dozen empty groups, not to mention the inconvenience of jumping between them all. Feel free to use the group for your own projects - I suggest messages should be prefixed with (appname), e.g. (assets).

No Comments | Tags: Django

6 June 2008 - 13:30 Django: Finding the current project’s path

Problem: You need the filesystem location of the current project, but your code doesn’t know which project it is used by (think 3rd party app).

Answer: Define “project location” as “location of settings file”, and you can do:

import os
import sys
from django.conf import settings
project_path = os.path.dirname(os.path.normpath(sys.modules[settings.SETTINGS_MODULE].__file__))

1 Comment | Tags: Django

26 May 2008 - 12:23 DiggPaginator update

Here’s an update (link goes to DjangoSnippets) to the paginator code I posted a while back (link goes to the old version). The main thing is the deprecation warning, which is now gone.

In particular:

  • Previously, a custom base class was used that implemented a stateful, page-aware paginator. Now that Django’s new Paginator class does essentially the same thing (only better), this base class is gone for good.
  • Note though that Django’s version uses different attribute names to provide the data, so you’ll likely have to update your code and templates (e.g. start_index instead of first_on_page, has_other_pages instead of is_paginated, and more).
  • The custom page ranges that DiggPaginator adds are page-specific and thus attributes of the Page object. Don’t confuse this with the page_range attribute on the paginator instance itself, which is the full range, as provided by the Django implementation.
  • Note the new QuerySetDiggPaginator class - since Django splits that functionality, so do we.
  • Following the example of Django’s new paginator, InvalidPage exceptions are no longer automatically converted to 404s - do that yourself.
  • There are two cool new options, softlimit and align_left, which are useful when paginating data with possibly uncertain length (like a search result). See the comments for details.

Usage example:

objects = MyModel.objects.all()
paginator = DiggPaginator(objects, 10, body=6, padding=2)
return render_to_response('template.html', {'page': paginator.page(7)})

{% if page.has_next %}{# pagelink page.next_page_number #}{% endif %}
{% for num in page.page_range %}
   {% if not num %} ...  {# literally dots #}
   {% else %}{# pagelink num #}
   {% endif %}
{% endfor %}
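For readers wondering what a page-specific range with gaps looks like, here is a simplified sketch. It is not DiggPaginator's algorithm, and its `body` and `tail` parameters are assumptions that only loosely resemble the real options: it shows a window around the current page plus a few leading and trailing pages, with `None` marking each gap (any falsy value works with the `{% if not num %}` check in the template above).

```python
def page_range_with_gaps(current, num_pages, body=5, tail=2):
    """Return page numbers to display, with None marking each gap."""
    half = body // 2
    # Keep the window inside 1..num_pages even near either end.
    start = max(1, min(current - half, num_pages - body + 1))
    window = range(start, min(num_pages, start + body - 1) + 1)

    visible = set(window)
    visible.update(range(1, min(tail, num_pages) + 1))
    visible.update(range(max(1, num_pages - tail + 1), num_pages + 1))

    result, previous = [], 0
    for page in sorted(visible):
        if page - previous > 1:
            result.append(None)          # pages were skipped: emit a gap
        result.append(page)
        previous = page
    return result


middle = page_range_with_gaps(7, 20)
```

For page 7 of 20, `middle` holds the first two pages, a gap, pages 5 to 9, another gap, and the last two pages.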

4 Comments | Tags: Django

25 May 2008 - 20:27 Django: Post-QSRF Aggregation

I’m in the process of upgrading critify to the current SVN trunk, with QSRF now included. So far, everything is going smoothly, and I believe I am about done.

At one point in the project I need to let the database do a SUM()/COUNT() calculation. Not wanting to write the complete SQL manually, I had previously used a patched Django version that allowed me to retrieve the query generated by a QuerySet before its execution, replace the select clause, then send it through connection.execute().

Fortunately, with the new and improved QuerySet implementation, that’s a thing of the past now (and not only because you can now access the generated SQL via QuerySet.query.as_sql()!). While official aggregation support is still being worked on, the groundwork seems already in place:

qs = Review.objects.filter(**random_conditions)

# insert a custom select clause
class MyCustomSelect:
    def as_sql(self, quote_func):
        return "SUM(site_review.score)/COUNT(site_review.id)"
qs.query.select = [MyCustomSelect()]

# make sure it's the only thing we select
qs.query.related_select_cols = []
qs.query.select_related = False      # or related_select_cols is regenerated
qs.query.extra_select = {}

from django.db.models.sql.constants import SINGLE
return qs.query.execute_sql(SINGLE)[0]

As they say, Voila! :)
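To see what the custom select clause actually computes, the same expression can be run as plain SQL against an in-memory SQLite database. The table and column names come from the snippet above; the schema details and sample scores are made up for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE site_review (id INTEGER PRIMARY KEY, score REAL)")
conn.executemany(
    "INSERT INTO site_review (score) VALUES (?)",
    [(7.0,), (8.5,), (9.5,)],
)

# Same expression as the custom select clause. REAL scores avoid the
# integer division SQLite would perform on two integers.
row = conn.execute(
    "SELECT SUM(site_review.score)/COUNT(site_review.id) FROM site_review"
).fetchone()
average = row[0]
conn.close()
```

The result is a single row with a single value, which is why the Django version asks for a single result and takes its first element.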

As a footnote, contrary to what I had heard upfront, I found the qsrf code remarkably easy to understand. Excellent job, Malcolm!

1 Comment | Tags: Django

20 May 2008 - 23:26 Multiline items in an Ext JS combobox

This is spectacularly simple:

.x-combo-list-item { white-space: normal; }

Multiline items in ExtJs combos

No Comments | Tags: Other

30 April 2008 - 17:43 Hijacking a meme for the sole glorification of PowerShell

Yesterday I saw this post on James Bennett’s blog - basically, people are posting the top 10 or so commands from their shell history. Now, using Windows on the desktop myself, I have nothing much to contribute there (I do manage a couple of Debian servers, but the data is probably too distributed to be useful in this context).

However, I recently decided that I would renew my interest in PowerShell (I was in the beta program early on - the pillars of Longhorn timeframe - but never did much with it). And porting this seemed like a nice opportunity to get started.

First of all, it might be worth noting that PowerShell, at least the default host, only provides a session-based history feature - close and it’s gone. Apparently it is easily possible to add persistence yourself, using a combination of the add-history cmdlet, a startup script, and aliasing exit, but I haven’t tried that (yet).

I know the suspense must be killing you, so here’s what I came up with:

PS> history | group commandline | sort count -desc | select -f 10

Or the more verbose version:

PS> get-history | group-object -property commandline | sort -property count -descending | select -first 10

That was actually pretty simple, and should be quite easy to read. Compare it to the original:

$ history | awk '{print $2}' | sort | uniq -c | sort -rn | head

Now, I like the object-based approach a lot, but you have to admit: The *nix version isn’t half bad, considering it’s text-based.

Let’s take it one step further. PowerShell conveniently stores the start and end timestamps for each history item, so instead of by count, wouldn’t it be cool to sort the list by total runtime? Unfortunately, this is considerably more difficult. The following took me about an hour (but note that I am basically new to PowerShell, and needed to figure out a lot about commands, syntax etc. along the way):

PS> history | select commandline,@{name="duration";expression={($_.endExecutionTime - $_.startExecutionTime).totalMilliseconds}} | group commandline | foreach { $_ | add-member NoteProperty Time (($_.group | measure-object -p duration -sum).sum/1000); $_ } | sort time -desc | select name,time -f 10

It’s a bit long, but - in retrospect, considering the time I spent creating it - should be moderately simple to follow as well: Calculate the runtime duration for each item, then group by the commandline string. For each group (of instances of the same command) sum up the total duration. Finally sort, and return the top 10.
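For comparison, the same group-and-sum logic can be sketched in Python. The history entries below are made up, and the `top_by_runtime` helper is a hypothetical name; the point is only to show the steps (duration per item, group by command line, sum, sort, top 10) outside of a pipeline.

```python
from collections import defaultdict

# Made-up history entries: (command line, start seconds, end seconds).
history = [
    ("bzr status", 0.0, 0.4),
    ("manage.py runserver", 1.0, 61.0),
    ("bzr status", 70.0, 70.2),
    ("manage.py test", 80.0, 95.5),
]


def top_by_runtime(entries, limit=10):
    """Sum the runtime per command line and return the longest-running."""
    totals = defaultdict(float)
    for command, start, end in entries:
        totals[command] += end - start
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


top = top_by_runtime(history)
```

The dictionary keeps the command line and its total together, which is exactly the part that needed add-member in the PowerShell version.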

There might very well be a better way. Particularly, I had trouble with the following:

  • After calculating the sum execution time for a group with measure-object, I wasn’t sure how to continue so that the next step in the pipeline has access to both the measurement result and the commandline string. I tried a hash, but failed to get sort working with it. I suppose one could use new-object, but I wasn’t keen on bothering with its syntax and long .NET dotted namespace hierarchies.

    As you can see, the version above simply adds the measure-object output to each group object itself, using add-member (which by itself doesn’t seem to output/return anything btw, which is why the foreach block ends with a single $_). Stuff like add-member feels slightly strange to me anyway - why is it even necessary? Why not allow adding new properties on the fly? It kind of feels like PowerShell has to work around .NET’s static nature. I wonder if the DLR might change some of this.

  • I’m also not sure as to how you would use measure-object to sum up timespan objects, which is why the above disposes of them early on and continues to work with milliseconds.

No Comments | Tags: Other