Re: [Tracker] Script for media library generation

From: Martyn Russell <martyn lanedo com>
To: Jonatan Pålsson <jonatan palsson pelagicore com>
Cc: "tracker-list gnome org" <tracker-list gnome org>
Subject: Re: [Tracker] Script for media library generation
Date: Thu, 01 Aug 2013 16:34:57 +0100

On 01/08/13 15:23, Jonatan Pålsson wrote:

Hi list,


Hello Jonatan,

I'm currently working on a script which I think some of you may find
interesting. The script is for generating media files with varying
degrees of meta data fields filled in. The background to this script
is that I found it difficult to build larger media databases with
media gathered from the web. There are some repositories with media
files released under permissive licenses, but;

1. Exploiting the bandwidth of these repositories by downloading large
portions of them is bad
2. Often very many meta data fields are missing from the files
available online. While this gives a realistic view of the meta data
in actual media files, it is difficult to gather files with "perfect
metadata" from the web.

Yea, actually, Tracker has had to adapt to this, we routinely expect
data to be missing from files downloaded, most notably MP3s OR the
metadata to be hosed or the spec abused - e.g. using the wrong charset
for that MP3 ID3 tag version. We're reasonably weathered to this by now,
but actually having a lack of data is as useful as being complete there.

3. If a suitable repository, which permits you to use their bandwidth
is found, the actual transfer of the files likely takes a long time
4. Sharing the database you've built with someone else means the files
must be transferred to all parties


Indeed.

OK. So I think I have pitched the problem now. What I have done is to
combine media encoders (LAME and ImageMagick) and metadata tagging
software (id3v2 and exiftool) with the random number generator of
Python. By using the random numbers generated by Python as input to
these tools, random, reproducible (by reusing the seed for the PRNG),
media files can be created.

I'm using this to create large numbers of media files to test Tracker
extractor modules on, and it works pretty well. So far I can generate
MP3, PNG, JPG, TIF, and GIF.
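
The seeded approach described above could be sketched roughly as follows. This is not the mlg code, only a minimal illustration: the word lists, the fill_probability parameter, the track file names and all function names are made up here, and the id3v2 command lines are only built, not executed.

```python
import random

# Hypothetical field pools; the word lists mlg really uses are not shown here.
ARTISTS = ["Alpha", "Bravo", "Charlie"]
ALBUMS = ["First", "Second", "Third"]
TITLES = ["Intro", "Middle", "Outro"]

# id3v2 command-line option used for each tag field.
ID3V2_FLAGS = {"artist": "-a", "album": "-A", "title": "-t", "year": "-y"}


def random_tags(rng, fill_probability=0.7):
    """Pick random tag values, then keep each field with fill_probability."""
    candidates = {
        "artist": rng.choice(ARTISTS),
        "album": rng.choice(ALBUMS),
        "title": rng.choice(TITLES),
        "year": str(rng.randint(1960, 2013)),
    }
    # Dropping fields at random gives files with varying degrees of metadata.
    return {k: v for k, v in candidates.items() if rng.random() < fill_probability}


def id3v2_command(tags, path):
    """Build (but do not run) an id3v2 argv that would write tags to path."""
    argv = ["id3v2"]
    for field, value in tags.items():
        argv += [ID3V2_FLAGS[field], value]
    argv.append(path)
    return argv


def generate_library(seed, count):
    """Return count tagging commands; the same seed gives the same list."""
    rng = random.Random(seed)  # private PRNG, so the global state is untouched
    return [
        id3v2_command(random_tags(rng), f"track{i:04d}.mp3")
        for i in range(count)
    ]
```

Because every random choice comes from one random.Random(seed) instance, two people only need to share the seed and the count to end up with the same set of commands, rather than transferring the files themselves.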

Just before you go on, what are you trying to test here? That we
index/extract properly? Or test the data with queries to the database?

If anyone wants to have a look, I've put the script here:
https://github.com/Pelagicore/mlg
I've also put a sample run here:
https://github.com/Pelagicore/mlg/wiki/Sample-run

I should say that the script is far from finished, and probably pretty
buggy. Use with care :)

:)

At this point I wanted to ask if you had seen the data generators we
have in the Tracker tree already? NOTE: I say "data" not "file" there.


  utils/data-generators/cc/

You can run

  $ ./generate ./default.cfg

It will create a bunch of ttl files which you can import as you wish
using tracker-import. I think you can even use tracker-import *.ttl.
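
If you would rather drive the import step from a script, something like the sketch below might do. It only builds the command line. The helper name is made up here, and it assumes tracker-import accepts several files in one call, as the *.ttl form above would imply.

```python
import glob
import os


def tracker_import_command(directory):
    """Build the argv for importing every .ttl file found in directory."""
    # Sort so the import order is stable from run to run.
    ttl_files = sorted(glob.glob(os.path.join(directory, "*.ttl")))
    if not ttl_files:
        raise FileNotFoundError(f"no .ttl files in {directory}")
    # Assumption: one tracker-import call can take several files at once.
    return ["tracker-import"] + ttl_files
```

Passing the returned list to subprocess.run(..., check=True) would then perform the actual import.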

Anyway this is fake data, not based on files - so it really depends on
what you're testing. You can also tweak where the data draws its random
crap from :)

I think it would be quite useful to include your file generator in the
tracker tree for people to make use of, or at least to reference it from
a README somewhere.


--
Regards,
Martyn

Founder & Director @ Lanedo GmbH.

