Re: [Tracker] Script for media library generation



On 01/08/13 15:23, Jonatan Pålsson wrote:
Hi list,

Hello Jonatan,

I'm currently working on a script which I think some of you may find
interesting. The script is for generating media files with varying
degrees of meta data fields filled in. The background to this script
is that I found it difficult to build larger media databases with
media gathered from the web. There are some repositories with media
files released under permissive licenses, but:

1. Exploiting the bandwidth of these repositories by downloading large
portions of them is bad
2. Often very many meta data fields are missing from the files
available online. While this gives a realistic view of the meta data
in actual media files, it is difficult to gather files with "perfect
metadata" from the web.

Yeah, Tracker has had to adapt to this. We routinely expect data to be missing from downloaded files, most notably MP3s, or the metadata to be hosed or the spec abused, e.g. using the wrong charset for that MP3's ID3 tag version. We're reasonably weathered to this by now, but for testing, files that lack data are as useful as files that are complete.

3. If a suitable repository that permits you to use its bandwidth
is found, the actual transfer of the files likely takes a long time
4. Sharing the database you've built with someone else means the files
must be transferred to all parties

Indeed.

OK. So I think I have pitched the problem now. What I have done is to
combine media encoders (LAME and ImageMagick) and metadata tagging
software (id3v2 and exiftool) with the random number generator of
Python. By using the random numbers generated by Python as input to
these tools, random, reproducible (by reusing the seed for the PRNG),
media files can be created.
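
As a rough illustration of the seeded approach described above (this is
not the actual mlg code), the sketch below uses one seeded PRNG to pick
tag values and to decide which fields to leave empty, then builds id3v2
command lines without running them. The word pools, the file naming
scheme, the fill_probability parameter and all function names are
assumptions made for the example.

```python
import random

# Hypothetical word pools; a real generator would draw from larger lists.
ARTISTS = ["Alpha", "Bravo", "Charlie"]
ALBUMS = ["First", "Second", "Third"]
TITLES = ["Intro", "Interlude", "Outro"]


def random_tags(rng, fill_probability=0.7):
    """Pick tag values, leaving each field out with some probability."""
    candidates = {
        "--artist": rng.choice(ARTISTS),
        "--album": rng.choice(ALBUMS),
        "--song": rng.choice(TITLES),
        "--year": str(rng.randint(1960, 2013)),
    }
    tags = {}
    for flag, value in candidates.items():
        # Skipping a field models files with incomplete metadata.
        if rng.random() < fill_probability:
            tags[flag] = value
    return tags


def id3v2_command(path, tags):
    """Build (but do not run) an id3v2 command line for one file."""
    argv = ["id3v2"]
    for flag, value in tags.items():
        argv += [flag, value]
    argv.append(path)
    return argv


def generate_commands(seed, count):
    """Return the tagging commands for `count` files from one seed."""
    rng = random.Random(seed)  # same seed, same sequence of choices
    return [
        id3v2_command("track%04d.mp3" % i, random_tags(rng))
        for i in range(count)
    ]
```

Calling generate_commands() twice with the same seed yields the same
list of commands, so two people can rebuild the same set of tags by
sharing only the seed instead of the files.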

I'm using this to create large numbers of media files to test Tracker
extractor modules on, and it works pretty well. So far I can generate
MP3, PNG, JPG, TIF, and GIF.

Just before you go on, what are you trying to test here? That we index/extract properly? Or test the data with queries to the database?

If anyone wants to have a look, I've put the script here:
https://github.com/Pelagicore/mlg
I've also put a sample run here:
https://github.com/Pelagicore/mlg/wiki/Sample-run

I should say that the script is far from finished, and probably pretty
buggy. Use with care :)

:)

At this point I wanted to ask if you had seen the data generators we have in the Tracker tree already? NOTE: I say "data" not "file" there.

  utils/data-generators/cc/

You can run

  $ ./generate ./default.cfg

It will create a bunch of ttl files which you can import as you wish using tracker-import. I think you can even use tracker-import *.ttl.
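
If you would rather drive the import from a script, something along these lines might do. It is only a sketch: it assumes tracker-import accepts several files on one command line (as the *.ttl form above would imply), and the helper name and directory argument are made up for the example. It only builds the command; it does not run it.

```python
import glob
import os


def tracker_import_command(directory):
    """Build (but do not run) a tracker-import command for every .ttl file."""
    ttl_files = sorted(glob.glob(os.path.join(directory, "*.ttl")))
    if not ttl_files:
        raise ValueError("no .ttl files found in %r" % directory)
    # Assumption: tracker-import takes multiple file arguments at once.
    return ["tracker-import"] + ttl_files
```

The returned list can be handed to subprocess.check_call() once you are happy with what it contains.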

Anyway this is fake data, not based on files - so it really depends on what you're testing. You can also tweak where the data draws its random crap from :)

I think it would be quite useful to include your file generator in the Tracker tree for people to make use of, or at least to reference it from a README somewhere.

--
Regards,
Martyn

Founder & Director @ Lanedo GmbH.

