Database theory notes

Introducing the BLOB

Multimedia content organization requires a large and efficient document store. It should be version-controlled, and should track authors, editors, and document state. It should track content whether compressed or encrypted; it should track signatures if they exist; it should track and control authentication and access, and it should track related content.

Q. What happens when tracking information is lost?

  • A. We are left with a BLOB.
  • A BLOB is a binary large object: a sequence of octets of finite (though possibly very large) length, such as the image of a photograph, an executable program, or any other data. Filesystems, databases and document servers conventionally track and store a file name and a media-type identifier (e.g. the "image/jpeg" MIME type, or the .jpg file extension) alongside a BLOB, to give clues about its format and about what to use to transform or decode it.

    When tracking information has been lost, it would be nice to have a universal key that guarantees a 1:1 association with the BLOB.

    Other things about BLOBs:
    - date created
    - filespec, or URI
    - minimal edit

    Tracking mobile BLOBs (static and dynamic)

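    For static BLOBs, it is possible to use a collision-resistant hash such as SHA-256 as a database key (MD5 and SHA-1 were long used this way, but practical collisions have since been demonstrated for both). Because identical content always yields the same hash, we can identify and track a BLOB independently of the specific host, directory, or file where it resides.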

    This kind of tracking is only possible when BLOBs are static, i.e. they do not change as they move from place to place.
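
    A minimal sketch of content-keyed tracking for static BLOBs follows. The choice of SHA-256, the in-memory dictionary standing in for a database, and the names blob_key and record_sighting are all assumptions made for illustration.

```python
import hashlib

def blob_key(data: bytes) -> str:
    """Return a hex digest usable as a location-independent key for a static BLOB."""
    return hashlib.sha256(data).hexdigest()

# Toy stand-in for a database table: content key -> set of places the BLOB was seen.
locations = {}

def record_sighting(data: bytes, where: str) -> str:
    """Register that a BLOB with this content was seen at the given location."""
    key = blob_key(data)
    locations.setdefault(key, set()).add(where)
    return key

# The same bytes found on two hosts, under different names, map to one key.
k1 = record_sighting(b"example photo bytes", "hostA:/photos/cat.jpg")
k2 = record_sighting(b"example photo bytes", "hostB:/tmp/unnamed.bin")
```

    Both sightings end up under a single key, so the file name and host can be lost without losing the association. Changing even one octet of the content produces an unrelated key, which is why the scheme only works for static BLOBs.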

    Example of BLOBs that may change:

    Recognizing BLOBs that have changed.

    1. non-image BLOBs: concatenate and compress.
      Two BLOBs A and B that may be related are compared from an information-theoretic viewpoint. Take the best general-purpose file compression algorithm you know of, for example bzip2, and use it as follows:

      L1 = length( bzip2( A ) ) + length( bzip2( B ) ).
      L2 = length( bzip2( concat A B ) ).

      If the files are unrelated, L1 should be about equal to L2. If the files are related, L1 should be somewhat greater than L2. If the files are close to identical, L1 should be about double L2.
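
      The comparison can be sketched with Python's standard bz2 module. The function names and the seeded random test data are assumptions for illustration; the ratio L1 / L2 is simply one convenient way to report the two lengths together.

```python
import bz2
import random

def compressed_len(data: bytes) -> int:
    """Length in bytes of the bzip2-compressed form of data."""
    return len(bz2.compress(data))

def relatedness(a: bytes, b: bytes) -> float:
    """Return L1 / L2: about 1 for unrelated inputs, larger for related ones."""
    l1 = compressed_len(a) + compressed_len(b)   # compressed separately
    l2 = compressed_len(a + b)                   # compressed as one concatenation
    return l1 / l2

# Two unrelated BLOBs: independent pseudo-random byte strings (seeded for repeatability).
rng = random.Random(0)
a = bytes(rng.getrandbits(8) for _ in range(20000))
b = bytes(rng.getrandbits(8) for _ in range(20000))

unrelated_ratio = relatedness(a, b)   # expected to be close to 1
identical_ratio = relatedness(a, a)   # expected to be well above 1
```

      In practice the ratio for identical inputs tends to fall short of the ideal value of 2, because the compressor still spends some bits encoding the repetition, and bzip2 can only exploit a repeat that falls within one of its blocks (at most 900 kB).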

    2. image BLOBs: composting.

      Produce a (lossy) analysis and encoding of the images to identify features.

      [Figure: original | reduced (colour reductions) | contour (edge detection) | other (etc.)]

      Both the analysis and the encoding should be insensitive to cropping, scaling and rotation.

      Suggested techniques:

      • feature decomposition, edge detection and raster to vector techniques
      • frequency domain techniques
      • fractal techniques
      • neural network techniques

      Algorithms, some of which mimic parts of the image recognition centers in the human brain, may be applied to the source image(s) to extract features.
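
      As one very simple illustration of a lossy encoding, the sketch below computes an "average hash": the image is shrunk to an 8x8 grid and each cell is recorded as brighter or darker than the mean. This is a swapped-in technique, not one named in these notes, and it is insensitive only to scaling (not to cropping or rotation). Images are assumed to be plain lists of rows of grayscale values; all function names are hypothetical.

```python
def shrink(pixels, size=8):
    """Block-average a grayscale image (list of rows) down to size x size cells."""
    h, w = len(pixels), len(pixels[0])
    out = []
    for r in range(size):
        r0, r1 = r * h // size, (r + 1) * h // size
        row = []
        for c in range(size):
            c0, c1 = c * w // size, (c + 1) * w // size
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

def average_hash(pixels, size=8):
    """Encode the image as one bit per cell: 1 if the cell is brighter than the mean."""
    cells = [v for row in shrink(pixels, size) for v in row]
    mean = sum(cells) / len(cells)
    return sum(1 << i for i, v in enumerate(cells) if v > mean)

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def quadrant_image(n, top_left=True):
    """Synthetic n x n test image: one bright quadrant on a dark background."""
    half = n // 2
    def lit(x, y):
        return (x < half and y < half) if top_left else (x >= half and y >= half)
    return [[200 if lit(x, y) else 50 for x in range(n)] for y in range(n)]

small = quadrant_image(16)
large = quadrant_image(64)                   # same picture at four times the scale
other = quadrant_image(16, top_left=False)   # a different picture

d_scaled = hamming(average_hash(small), average_hash(large))
d_other = hamming(average_hash(small), average_hash(other))
```

      The scaled copy hashes to the same value as the original, while the different picture lands many bits away. A small Hamming distance between hashes is then a hint, not proof, that two image BLOBs are related.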

      Advanced systems such as neural-network based face-recognition tools require several (say 32 or so) images of a person's head. Each image is matched against every other image, and each of the resulting pairs (32 x 31 / 2 = 496) is affirmative training data for the neural network.
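
      The pair count can be checked by enumerating the unordered pairs directly. The image identifiers below are placeholders, not real files.

```python
from itertools import combinations

# Placeholder identifiers for 32 images of the same person's head.
image_ids = [f"head_{i:02d}" for i in range(32)]

# Every unordered pair of distinct images is one affirmative training example.
positive_pairs = list(combinations(image_ids, 2))
pair_count = len(positive_pairs)   # 32 * 31 / 2
```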

      To train against false-positive matches, we would probably need about 800 guaranteed-no-match images. These should consist of a few hundred test patterns of varying complexity, plus images of 10 to 15 other people.

      Human input required: selection of test images, roughly categorized according to the photo-ratings chart.

      A database of a few hundred thousand still images culled from old Usenet postings would provide the bulk of the raw data for running tests.

    3. Cluster analysis.
      Examine several independent properties of the BLOB and use multivariate clustering of them. For images, look at the file name, the format, the size and aspect ratio, results from composting, etc. See the photo ratings chart for suggested human inputs to this process.
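
      A rough sketch of clustering on several properties at once follows. The choice of single-linkage clustering with a distance cutoff is an assumption (the notes do not name a method), and the file names, property values and the cutoff of 0.25 are invented for illustration. The third property stands in for a result of the image analysis step, here a hypothetical hash distance to a reference image.

```python
import math

def normalize(rows):
    """Scale each column to the range 0..1 so no single property dominates."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    span = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - l) / s for v, l, s in zip(row, lo, span)] for row in rows]

def cluster(rows, cutoff=0.25):
    """Single-linkage clustering: items closer than cutoff end up with one label."""
    points = normalize(rows)
    labels = list(range(len(points)))
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) < cutoff:
                old, new = labels[j], labels[i]
                # Merge the two groups by relabelling every member of the old one.
                labels = [new if l == old else l for l in labels]
    return labels

# Invented properties per BLOB: (size in kB, aspect ratio, hash distance to a reference).
names = ["cat.jpg", "cat_small.jpg", "cat_copy.gif", "chart.png", "chart_v2.png"]
rows = [
    (120.0, 1.50, 2),
    (95.0, 1.50, 3),
    (130.0, 1.48, 4),
    (900.0, 0.75, 30),
    (880.0, 0.75, 28),
]

labels = cluster(rows)
groups = {}
for name, label in zip(names, labels):
    groups.setdefault(label, []).append(name)
```

      With these made-up values the three cat files fall into one group and the two charts into another. Non-numeric properties such as file name or format would first need a numeric encoding or their own distance measure, which this sketch leaves out.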


    Copyright (c) 1996-2003 by Paul Shields