April 8, 2007

Round Trip Textile

I’m all about Textile.

I’ve installed Textile 2.0 on every WordPress installation, and I regularly use Backpacks and Instiki in my work, in recovery, in New Orleans. I’ve created lesson plans around Textile, and taught neighborhood leaders to use Textile, instead of web-based WYSIWYG.

Why? Because Textile works. It’s easier to teach than a flakey JavaScript editor. Once they get it, they use it, and there is little followup.

I’m a fan.

Thus, I’m creating a Textile parse for Java, which is my language of choice. There are a few available already, I know. My needs are a tad different.

It follows that my implementation is a tad different as well. Rather than processing a string in place, I’m going line by line, building the paragraph blocks, and then processing the contents recursively.

I’m finding that there are some serious differences between the Perl and PHP implentations of Textile, and the Ruby Redcloth implementation.

Thus, I’m curious as to whether or not we can standardize on a definition of Textile.

I’m also finding that all of the implementations will produce invalid XML if fed the right sort of garbage. Because I’m emitting SAX events, that will not work for me. I’d like to have bad Textile, still produce valid XML.

I’m particularly interested, because I’d like to create a round trip implementation of Textile. One where, if you can go from Textile to XML, then you can go back to Textile, so that I can store the Textile as XML.

This would be a subset of XHTML, and it would be limited. Only that XHTML which could have been produced by Textile would be acceptable.

In order to do this though, it would be nice to start from a definition of the Textile languagae, a BNF Textile grammar.

Is there interest in creating a definition of the Textile language?


April 6, 2007

Strata Storage Strategies

At the outset, Strata was developed for use with in memory and file backed implementations of tiers.

When in memory, branches and the objects can be stored in arrays. The default implementation uses ArrayList classes to store objects.

When file backed, their are two types of objects that can be stored, variable length objects and fixed length objects. Objects can be stored within the tiers or by reference.

When file backed, tiers are softly or weakly referenced, so that they can be collected. Thus, they can act as caches of the objects they reference.

Storing by refernece means storing an address. Whether or not the object is stored in the tier, or by reference outside the tier, the object needs to be in memory for comparisons, for traversal at both the leaf and the inner tier.

By reference storage can keep the object within the in memory tier, or it can perform a look up of the referenced object. The object is an simple object, like a string, it might be easier to keep it in the memory tier, as part of an address, object pair. If it is a complicated object, it can load the object through a caching mechanism.


April 4, 2007

Strata Objectives

These were once the objectives for Strata, reflecting an early understanding of the project. I’m about to rewrite them, so I thought I’d put them somewhere, besides the wiki history.

Thus, the short-term objectives for Strata are as follows.

  • Leaf nodes only to store references to file positions, not store the data itself.
  • Branches also store only references to file positions, and hold a list of objects read from those file positions to use as keys.
  • Thread-safe.
  • File backed, paged.
  • Supports efficent insertion and deletion, while maintaining the balance of the B-Tree.
  • Equality comparison query.
  • Compares byte buffer ranges, not objects.

Crazy thoughts included storing partial in the inner tiers on pages, but this presents a strange case where a partial key might not fit on a page, if the length of a key is unbounded.

What changed? For the first two points, only the way I’d state them. Since it’s now a given, I don’t know if I’d state them. There are also a few more devils in those two details, only the fields of interest may be cached for example.

“Equality comparison query”, meant that the only operator supported is the equality operator, which is also given, but that’s all you need to implement between, less than, greater than, etc. Probably could not see that then.

“Compares byte buffer ranges, not objects”, the meaning of this has escaped my memory.

The crazy thoughts are reflections on how to store the objects referenced by the B+Tree. There was a time when I thought that the objects might be stored in the pages of the tree.


February 9, 2007

Bento


Authentic c-store Japanese bento by Ramil Sagum.

Malloc

The idea behind Bento it to take a file and divide it up using the malloc algorithm as described by Doug Lea.

Bento is written in Java for JDK 1.4.2 using NIO.

You can read more about Bento on my Wiki and you can paruse the source in my Subversion repsitory and you can use it under LGPL.

Design Goals

Bento is a file format for writing out discrete blocks of data. It is designed for the storage of objects. Bento is not supposed to be a database in itself.

Bento is a robust file format that provides

  • storage of blocks of arbitrary size,
  • random access to stored blocks,
  • atomic writes of multiple blocks,
  • thread-safe allocation and deallocation of blocks,
  • and thread-safe reads and writes.

Funamentally, Bento strives to be a simple data structure that you can use to persist objects in your applications. You can use Bento as the basis for more complicated data structures or databases.

Usage

Bento can be used to serialize objects using Java serialization, to write out objects as Java primitives, or to store Java primitives.

A single Bento file can be organized into a database of different Java types. For desktop applications, this means that a single file can be the container for many different objects of many different classes.

Multiple Bento files can be organized into a database, spread across different file system, to reduce disk contention.

You can design resuable Bento based data structures and mix them into a single Bento file.

Quick Start

An example program that creates a Bento file with a static header and a single allocated block.

Bento.Creator creator = new Bento.Creator();
creator.addStaticBlock(URI.create("flag://agtrz.com/header"), 512);
Bento bento = creator.create(new File("data.bento"));

Bento.Mutator mutator = bento.mutate();
try {
    Bento.Block block = mutator.allocate(512);

    ByteBuffer bytes = block.toByteBuffer();
    for (int i = 0; i < 512 / 4; i++) {
        bytes.putInt(i);
    }
    block.write();

    Bento.Block header = mutator.load(URI.create("flag://agtrz.com/header"));

    ByteBuffer bytes = header.toByteBuffer();
    block.getAddress().write(bytes);
    header.write();

    mutator.commit();
}
finally {
    mutator.rollback();
}

bento.close();

Bento.Opener opener = new Bento.Opener();
bento = opener.open(new File("bento.data"));

ByteBuffer header = bento.read(URI.create("http://agtrz.com/header"));

ByteBuffer bytes = bento.read(new Bento.Address(header));
for (int i = 0; i < bytes.capacity() / 4; i++) {
    if (i != bytes.getInt()) {
        throw new IllegalStateException();
    }
}

bento.close();

Mutation

A Bento file is written using a Bento.Mutator. All of the changes performed by a mutator are atomic. The changes are not made perminent until the commit method of mutator is called. The changes can be discarded by calling the rollback method of the mutator.

After calling commit or rollback the mutator can be reused. Calling commit or rollback when there are no uncommited changes held by the mutator does nothing. Thus, we use a try/finally block to recover the state of a Bento file when a failure occurs.

public void storeObjects(Bento bento, Bento.Address address) {
    Bento.Mutator mutator = bento.mutate();
    try {
        Bento.Block block = mutator.load(address);
        block.toByteBuffer().putInt(this.magicNumber);
        block.write();

        mutator.commit();
    }
    finally {
        mutator.rollback();
    }
}

Concurrency

Bento synchronizes the list of free blocks, but does not synchronize the blocks themselves. Blocks can be allocated and freed in a thread-safe manner without further consideration by the application developer.

If you read and write blocks without some syncrhonization strategy, you’ll turn your Bento file into garbage in short order.

Multiple threads can read the same blocks, but blocks that are being written should not be written or read by other threads. Reads can be shared. Writes are exclusive.

A simple strategy is to use a read/write lock from the concurrency library to gaurd the file during updates.

Bento strives for liveness, however. It places as few locks as possible on the management of list of free blocks.

A more sophisticated strategy would place locks on the blocks themselves. A composite object, such as a tree, could lock a tree root element. Different threads can then operate on different trees in the same file. They would not run afoul of each other.

Atomic Writes

A Bento file is edited using a mutator the object. The mutator object keeps all block writes, allocations, and deallocations are kept in memory until the user commits or rolls back the changes made by the mutator.

Journaling

All reads, writes, allocations and frees are performed on buffers in memory. Only when a mutation is committed or rollback, is the underlying file written.

Journaling is implemented by writing out the list of operations into a journal prior to applying the changes to the file.

The journal can be implemented using blocks within the file, or it can be implemented using external temporary files. The former is good for desktop applications, where the user may move a corrupted Bento file out of reach of the external journals, or unwittingly delete the external journals. The latter is good for a server application, where multiple Bento files and journals can exist in a single directory, watched over by an administrator.


November 23, 2006

UniversalException

I do not like checked exceptions. I consider them harmful. They create a lot of noise. They insist on attention at every level of a program, when I’d prefer to decide for myself where I’d like to handle exceptional conditions.

To reduce the amount of noise in my Java programs, I created a class called UniversalException, and use it as the base class for all my exceptions. At the starting point of a program, I can catch UniversalException, and place my application wide error handling there.

A new trick is to add the ability to declare an exception and fill it full of useful information as a one liner.

if (!file.exsits())
    throw new StorageException("file.not.found")
             .info("file", file);

There is a method in UniversalException called info that adds exception information to a linked hash map of exception information. Thus, it’s pretty easy to throw an exception and have some meaningful reporting in the stack trace.

This only works because I’ve renounced checked exceptions. If it meant something to me to declare a method as one that throws StorageException, then I’d have a problem, because the info method would generalize the exception into a UniversalException.

How nice to not have to worry about that petty little language misfeature.


November 19, 2006

A Grid in jQuery

After a disappointing start with jQuery:

  1. No clear strategy to preserve state of custom objects.
  2. No response to questions in the jQuery discuss mailing list.

As a result, I fetched the Yahoo UI Library and Jack Slocum’s Yahoo UI Extension Library. That grid worked well in Internet Explorer 6 and in Firefox. In Safari, the column resize seemed to fail.

Nice. A bit heavy. There are a lot of dependencies. A lot of features that I don’t really need. I can turn them off, but the code and the complexity are still present. In essence, I wanted a table that had a sticky headers.

I found that state is stored in jQuery fancy-pants extensions by creating a hash and attaching it to the target element. I don’t know why that couldn’t have been said on jQuery discuss.

I’d imagined that there was some limitation, that not all browsers supported “expando” properties on document elements, but it must not be the case.

Here is the simple table that I’ve created in jQuery. No resizable columns.

This works on IE 6, Firefox 1.5 and Safari 2.0.4.

It was only possible once I’d abandoned table markup, however. The table is rendered with div and span elements, rather than table, tbody, tr, td, etc.

Much easier. The implementation of tables has a lot of baggage attacted to it.

However, I’m not finding that it doesn’t scale. The rendering causes the “Unresponsive Script” warning in Firefox when the table is large.

In the course of creating this grid, I fixed the innerWidth and innerHeight methods of dimensions.js, which incorrectly named the border properites, “borderLeft” instead of “borderLeftWidth”, and neglected to deduct the padding width.

Update: I’ve addressed some of the performance issues of loading the table. The time that it takes to load this new revision is greatly reduced. Speed up is a factor of ten.

I build the cells as spans, so that I can measure the width of the text, when rendered as a span, an inline element, this is possible. A assign a class name for the span, based on it’s column position. Then I’d set the dispaly property of the span to block. In the end, I’ll define a CSS class for each column, with the width set to the maximum width for that span.

By removing the assignment of the display property of the span, I got my speed up.

I also speed up loading by measuring the border and padding for each column once, rather than calculating the inner width for each column, I keep track check the offset width, and calculate the innner width by dedcuting the total of all borders and padding of the first from form the offset width. The assumes that the styles for each column are identical in all rows (which is perfectly reasonable and all I need).

It still chokes on the large datasets that I’m using, 500+ rows. For my next trick, I’m going to load incrementally, so that the unresponsive script warning does not appear.

The grid will render much faster if I have a set width, of course. I can create an optimized rendering method that builds a large HTML string and injects it, 100 rows at a time, in the case of resorting, or in the case where I’ve assigned widths to all the columns.

Update: The grid now renders concurrently. Different browsers prefer different combinations of rows per timeout and time between timeout.

The important thing is that the page is responsive. The user can scroll the grid while it loads, the back button and other controls still work.

The key tradeoff is that the slower the grid loads, the more responsive it is.

Rendering is faster in this incarnation. Absolute positioning is much faster that relative layout. Much faster than laying out the grid as blocks that float left.


November 12, 2006

B-Tree

An oak at The University of South Carolina-Columbia by David.

I’m implementing a B-Tree in Java. The project is called Strata. You can view the source, is is in the strata subdirectory of my subversion repository. There is a lot of cruft from an earlier attempt.

I’ve yet to choose a license.

This tree is the first indexing option in Momento, which is a large file XML database that I wrote last year, when I was in Ann Arbor.

Strata will index the normalized text of a specific node. The only query supported at the outset is equality.

Then I can use Lucene to do full text indexing. In order for Lucene to be effective, you see, it must be a simple matter to retrieve the document. Indexing unique atom identifers, would give a key to fetch the document resulting from a Lucene query.

My research B-Tree implemetnations can be tracked with my del.icio.us account, under the tag strata.

I’ve written a description of the tree structure at my user page on the New Orleans Wiki, under a Software Development page.