Re: XML-based sources (Media Factory plugin)

From: "Juan A. Suarez Romero" <jasuarez igalia com>
To: grilo-list gnome org
Subject: Re: XML-based sources (Media Factory plugin)
Date: Thu, 14 Jun 2012 18:15:38 +0200

Hello all.

Here I present the Media Factory plugin, a plugin that is able to read
XML descriptions of media sources and instantiate them.

This is the second part of the work I've been doing lately, which you
can find (not merged yet) at:

https://gitorious.org/~jasuarez/grilo/jasuarez-grilo-plugins/commits/media-factory

Note that this work builds on the previous metadata factory plugin, but
the two plugins are completely separate.

During the development of this new plugin I got valuable new feedback,
so some of the ideas will probably be applied to the metadata factory
plugin. In fact, I expect to merge both plugins into a single one before
pushing it to master.

I apologize in advance, but explaining everything this plugin provides
is quite difficult here, so I'll just summarize the most important parts
(still a long read) and show an example. I'll try to write good
documentation and do some blog posts about how to use it step by step.

As I said, this work provides a plugin able to read XML descriptions and
create media sources based on them. These sources can implement the
browse(), metadata() and search() operations.

The idea is that this system will make it easier to develop new sources
that get their content from XML-based web services. I think it will
cover more than 90% of cases, so it should be useful for a lot of people.

The core idea of the system is that we specify how an operation returns
XML content, and how an XML document is converted to Grilo media. Then,
we have different helpers around this core to implement other features.

In the "sources" directory you can find several XML descriptions for
different sources: I've created descriptions for 8 different sources,
some of them are completely new, others are re-implementations of
sources already in Grilo using the new system (like the Jamendo one).

The basic structure is the following one:

<source>
  <id>plugin id</id>
  <name>Plugin name</name>
  <description> Plugin description</description>

  [CONFIGURATION SECTION]

  [PROVIDE SECTION]

  [OPERATION SECTION]
</source>

We will see the sections later. The "id" and "name" are compulsory,
while the "description" is optional.

The "source" element has three optional attributes:

- "type": The type of the source. Right now, the only valid value is
"media"
- "autosplit": a number specifying how many elements in a row can be
served (see the "auto-split" property for more information)
- "user-agent": the user-agent to use when requesting content from the
network.


[CONFIGURATION SECTION]

Here we describe all the Grilo options the source supports. As usual,
the values of these options will be set by applications through
GrlConfig elements.

An example is:

<config>
  <key name="device">iphone</key>
  <key name="bandwidth">wifi</key>
  <key name="key"/>
</config>

In this example, the source supports 3 configuration keys: "device",
"bandwidth" and "key". The first two keys have a default value: if the
application doesn't specify a value for those keys, the defaults will be
used. The last key, named "key", has no default value: the application
must provide a value for it, otherwise the source won't be instantiated.

At this moment, we only support strings as values for the configuration
keys. It is up to the source developer to document the configuration
keys, and the values they can have.
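
The default-value rule above can be sketched in a few lines of Python.
This is only an illustration of the behaviour described, not the
plugin's code: the function name resolve_config and the use of
ValueError to stand for "the source won't be instantiated" are
assumptions.

```python
import xml.etree.ElementTree as ET

CONFIG_XML = """
<config>
  <key name="device">iphone</key>
  <key name="bandwidth">wifi</key>
  <key name="key"/>
</config>
"""


def resolve_config(config_xml, supplied):
    """Merge application-supplied values with the defaults in <config>.

    Raises ValueError when a key without a default is not supplied,
    which stands in for "the source won't be instantiated".
    """
    resolved = {}
    for key in ET.fromstring(config_xml).findall("key"):
        name = key.get("name")
        default = key.text  # None for an empty element like <key name="key"/>
        if name in supplied:
            resolved[name] = supplied[name]
        elif default is not None:
            resolved[name] = default
        else:
            raise ValueError("missing mandatory configuration key: " + name)
    return resolved
```

Calling resolve_config(CONFIG_XML, {"key": "abc"}) keeps the defaults
for "device" and "bandwidth"; leaving out "key" raises the error.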


[OPERATION SECTION]

I leave the provide section for later.

This section specifies the mapping between the operations and the XML
content that results from running them.

The format is the following:

<operation>
  [OPERATION]
  [OPERATION]
    ...
</operation>

Let's see how an operation is defined:

<search>
  <result>[RESULT]</result>
</search>

In this case we are defining the search() operation. When the user
executes this operation, the content in [RESULT] will be returned.

As said previously, XML is the core of the system. Thus, [RESULT] must
be XML content (or "something" that returns XML content once evaluated;
we will see later what this means).

"result" element allows some attributes:

- "format": the format of the [RESULT]. The default value is "xml",
which means that [RESULT] is XML content. Another valid value is
"json", which means that [RESULT] is JSON content. In this case, the
system converts the JSON to XML following a well-defined set of rules
(which I'm not explaining here), so in the end we only manage XML
content. This is useful for services that only return JSON content.

- "cache": a boolean value specifying whether the [RESULT] must be
cached. If "true", the value of [RESULT] is computed only the first time;
later requests return the same value without re-evaluating it.

- "id": gives a name to the result. Useful to use this result in other
operations.

- "ref": used to recall a previous result, so we do not need to specify
it again.
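
How "cache", "id" and "ref" interact can be pictured with a small
Python sketch. The ResultStore class and its method names are
hypothetical, made up for illustration; the plugin may organize this
quite differently.

```python
class ResultStore:
    """Hypothetical registry of named results with optional caching."""

    def __init__(self):
        self._producers = {}  # id -> (callable computing the result, cache flag)
        self._cache = {}      # id -> value computed on an earlier request

    def define(self, result_id, producer, cache=False):
        """Register a result under the name given by its "id" attribute."""
        self._producers[result_id] = (producer, cache)

    def evaluate(self, ref):
        """Return the result named by "ref", reusing the cache if enabled."""
        producer, cache = self._producers[ref]
        if cache and ref in self._cache:
            return self._cache[ref]
        value = producer()
        if cache:
            self._cache[ref] = value
        return value
```

With cache enabled, the producer (for instance, a network download)
runs only once, no matter how many operations refer to the result.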

Now, what does a [RESULT] look like? I'll use a BNF-like syntax to explain it:

[result] ::= expandable_string
           | [url_content]
           | [rest_content]
           | [regexp_content]
           | [replace_content]

[url_content] ::= "<url>" [result] "</url>"

[rest_content] ::= "<rest>" [rest_function]? [rest_param]* "</rest>"

[rest_function] ::= "<function>" expandable_string "</function>"

[rest_param] ::= "<param>" expandable_string "</param>"

[regexp_content] ::= "<regexp>" [regexp_content]*
                     [regexp_input]
                     [regexp_output]
                     [regexp_expression]
                     "</regexp>"

[regexp_input] ::= "<input>" [result] "</input>"

[regexp_output] ::= "<output>" expandable_string "</output>"

[regexp_expression] ::= "<expression>" expandable_string "</expression>"

[replace_content] ::= "<replace>"
                      [regexp_input]
                      [replace_replacement]
                      [regexp_expression]
                      "</replace>"

[replace_replacement] ::= "<replacement>"
                           expandable_string
                          "</replacement>"

First of all, let's see what an "expandable string" is. An expandable
string is a string containing one or more references to external
attributes. Those references are replaced by the values of the
attributes, or by empty strings if they have no value.

These references have the form "%type:name%" (a special case is "%%",
which evaluates to a single "%").

An example is the string "my device is %conf:device%, and user is
asking for %param:count% results". Here, "%conf:device%" is replaced by
the value of "device" in the configuration, and "%param:count%" is
replaced by the value of the parameter "count" in the operation.
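
As a rough illustration of these rules, here is a minimal Python sketch
of the expansion. The function name expand and the nested-dict layout
of the values are assumptions; only the "%type:name%" and "%%" rules
come from the description above.

```python
import re

# Matches either the "%%" escape or a "%type:name%" reference.
_REFERENCE = re.compile(r"%(?:(\w+):([\w-]+))?%")


def expand(template, values):
    """Expand "%type:name%" references; "%%" becomes a literal "%".

    `values` maps a type ("conf", "param", ...) to a dict of names.
    References without a value expand to the empty string.
    """
    def substitute(match):
        ref_type, name = match.group(1), match.group(2)
        if ref_type is None:  # the "%%" escape
            return "%"
        return str(values.get(ref_type, {}).get(name, ""))

    return _REFERENCE.sub(substitute, template)


VALUES = {"conf": {"device": "iphone"}, "param": {"count": 10}}
MESSAGE = expand(
    "my device is %conf:device%, and user is asking for %param:count% results",
    VALUES,
)
```

Here MESSAGE ends up as "my device is iphone, and user is asking for
10 results".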

The [url_content] just evaluates the inner [result], which must end up
as a URL, fetches it from the internet, and returns the content.

The [rest_content] is quite similar, but instead of reading the raw URL
it performs a REST call. This is interesting because it allows the use
of authenticated services.

[regexp_content] applies a regular expression to the input: it
evaluates the input, applies the expression, and returns the value of
the output. This type of result is heavily inspired by XBMC and, in
fact, it allows writing scrapers in the plugin; very powerful.
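
The input/expression/output trio maps quite naturally onto an ordinary
regular-expression call. The Python sketch below is an approximation
under assumptions: the function name regexp_result, returning an empty
string when nothing matches, and the sample input URL are all made up
for illustration.

```python
import re


def regexp_result(input_text, expression, output):
    """Apply `expression` to `input_text` and build `output` from the match.

    Back-references such as \\1 in `output` are filled from the groups.
    Returning "" when nothing matches is an assumption of this sketch.
    """
    match = re.search(expression, input_text)
    if match is None:
        return ""
    return match.expand(output)


# Two chained steps: the output of the inner one feeds the outer one.
INNER = regexp_result(
    "bbcimage://host/www.bbc.co.uk/img/%7bdevice%7d/pic.jpg",  # made-up input
    r"bbcimage://.+/(www.bbc.co.uk.+)/%7bdevice%7d/(.+)",
    r"\1/iphone/\2",
)
OUTER = regexp_result(INNER, r"(.*)", r"http://\1")
```

With this made-up input, OUTER is
"http://www.bbc.co.uk/img/iphone/pic.jpg".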

[replace_content] just replaces all the occurrences of the expression
with the replacement in the input. Actually, this can be done through
regexp, but in a lot of cases this is more suitable and easier to read
than regexps.
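
In Python terms, a replace boils down to a single substitution call.
This sketch assumes the expression is treated as a regular expression
(the description above does not say whether it is a pattern or a
literal string), and the name replace_result is hypothetical.

```python
import re


def replace_result(input_text, expression, replacement):
    """Replace every match of `expression` in `input_text` by `replacement`."""
    return re.sub(expression, replacement, input_text)


# Made-up sample: turn spaces into "+" to build a query string.
QUERY = replace_result("bbc world news", r" ", "+")
```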

We'll see a full example later.

[PROVIDE SECTION]

This is the other way around: how to map XML content to Grilo media
types.

The format is:

<provide>
  [MEDIA MAP]
  [MEDIA MAP]
   ...
</provide>

The key point here is the XPath expressions: we use them to look for
media in the content, and also to extract the metadata attributes. On
top of that, we can use all the tools we have to get the results: <url>,
<regexp>, <rest> and <replace>.
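
To give a feeling for the XPath-driven mapping, here is a small Python
sketch using the standard library. It is not the plugin's code: the
sample XML, the KEYS table and the map_media function are made up, and
because ElementTree only supports paths relative to the root element,
a query such as "/json/feeds/feeds" is written "./feeds/feeds" here.

```python
import xml.etree.ElementTree as ET

# Made-up content, shaped like XML converted from a JSON feed list.
CONTENT = """
<json>
  <feeds>
    <feeds><feed_url>http://example.org/world</feed_url><title>World</title></feeds>
    <feeds><feed_url>http://example.org/tech</feed_url><title>Technology</title></feeds>
  </feeds>
</json>
"""

# Metadata key -> XPath evaluated relative to each matched element.
KEYS = {"id": "feed_url", "title": "title"}


def map_media(content, query, keys):
    """Return one dict per element matched by `query`, one entry per key."""
    root = ET.fromstring(content)
    return [
        {name: element.findtext(xpath, default="") for name, xpath in keys.items()}
        for element in root.findall(query)
    ]


ALL_BOXES = map_media(CONTENT, "./feeds/feeds", KEYS)
ONE_BOX = map_media(
    CONTENT, "./feeds/feeds[feed_url='http://example.org/tech']", KEYS
)
```

The first call plays the role of a browse or search (every match
becomes a media), while the second, with a predicate on the ID, plays
the role of a metadata request for a single media.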

Again, explaining everything in a single email is quite hard :)

But before finishing, let's see an example: a plugin to show videos
from BBC News.

<!-- All network requests will use this user-agent -->
<source user-agent="BBC News 1.7.1 (iPhone; iPhone OS 4.0.1; en_GB)">
  <id>grl-bbc-news</id>
  <name>BBC Video News</name>

  <!-- Description is an optional value -->
  <description>Videos from BBC News</description>

  <!-- The source accepts two configuration options: "device" and
"bandwidth", both with default values, so they are not mandatory -->
  <config>
    <key name="device">iphone</key>
    <key name="bandwidth">wifi</key>
  </config>

  <!-- How to map from XML to Grilo media -->
  <provide>
    <!-- In the case of search/browse operations, all the results from
evaluating "query" expression will be mapped as Grilo media boxes; if
the operation is metadata(), it will use the "select" -->
    <media type="box"
           query="/json/feeds/feeds"
           select="/json/feeds/feeds[feed_url='%key:id%']">

      <!-- The metadata keys "id" and "title" are obtained by evaluating
the "feed_url" and "title" XPath expressions over each of the selected
results -->
      <key name="id">feed_url</key>
      <key name="title">title</key>

      <!-- This is a private value: it is associated with the media, and
can be requested later through the "%priv:feed_url%" reference -->
      <private name="feed_url">feed_url</private>
    </media>

    <!-- In this case, we use namespaces in the XPath. It is important
to note that if the XML content uses namespaces, we must also use them
in the XPath; otherwise it won't work. In this case, the results will be
mapped to Grilo media videos -->
    <media xmlns:a="http://www.w3.org/2005/Atom"
           xmlns:media="http://search.yahoo.com/mrss/"
           type="video"
           query="/a:feed/a:entry[*//*[starts-with(@href, 'bbcvideo')]]"
           select="/a:feed/a:entry[a:id='%key:id%']">
      <key name="id">a:id</key>
      <key name="title">a:title</key>
      <key name="description">a:summary</key>

      <!-- Here is an example of regular expressions: we evaluate the
"media:thumbnail/@url" XPath expression, which returns the value of the
"url" attribute in the <media:thumbnail> element; then, we apply the
expression "bbcimage://...." to the result, and we return an output
with references to the matched input. We name this output "1". Then,
we use this output (see ref="1"), apply the expression "(.*)" and
return a new output: basically, in the last regexp we are prepending
"http://" to the result -->
      <key name="thumbnail">
        <regexp>
          <regexp>
            <input>media:thumbnail/@url</input>
            <output id="1">\1/%conf:device%/\2</output>
            <expression>bbcimage://.+/(www.bbc.co.uk.+)/%7bdevice%7d/(.+)</expression>
          </regexp>
          <input ref="1"/>
          <output>http://\1</output>
          <expression>(.*)</expression>
        </regexp>
      </key>

      <!-- Another example of regular expressions. As this time we are
requesting content from the internet (see <url>), we mark this key as
slow. Roughly speaking, with regexp we are building a URL from a value
of the XML content, reading its content, and transforming this content
into another URL, which is the final value of the "url" key -->
      <key name="url" slow="true">
        <regexp>
          <input>
            <url>
              <regexp>
                <input>*//*[starts-with(@href, 'bbcvideo')]/@href</input>
                <output>http://www.bbc.co.uk/moira/avod/%conf:device%/\1/%conf:bandwidth%</output>
                <expression>bbcvideo://.+/...device.../(.+)/...bandwidth...</expression>
              </regexp>
            </url>
          </input>
          <output>http://news.downloads.bbc.co.uk.edgesuite.net/\1.mp4?at=\2</output>
          <expression>(mps_h264_400.*512k),
\.mp4.+hmac=([a-z0-9]+)</expression>
        </regexp>
      </key>
      <private name="feed_url">"%priv:feed_url%"</private>
    </media>
  </provide>

  <!-- Now we define the available operations -->
  <operation>

    <!-- First we define the browse operations; we can have as many
operations as we want: the first one that matches the requirements will
be used. In this case, we are saying that, from the results, we must
skip the number of elements specified in the operation and return the
number also specified in the operation -->
    <browse skip="%param:skip%" count="%param:count%">

      <!-- These are the requirements to use this operation: the ID of
the container we are browsing must be empty (right now we use
string-based metadata keys in the requirements. Also NULL or not
available values are interpreted as empty strings) -->
      <require>
        <key name="id" match="^$"/>
      </require>

      <!-- The result to return, which is the content of the URL
specified. We declare the format as JSON, so the system transforms it to
XML; the result must be cached, so the content is only downloaded from
the network the first time; and we identify this result as "categories"
(for further use) -->
      <result format="json" cache="true" id="categories">
        <url>http://www.bbc.co.uk/moira/feeds/iphone/news/en-GB/v1</url>
      </result>
    </browse>

    <!-- Another definition of the browse operation: in this case, the
container ID must start with "http://www.bbc.co.uk", and the result is
the content of the URL held in the private key "feed_url" stored in the
container -->
    <browse skip="%param:skip%" count="%param:count%">
      <require>
        <key name="id" match="^http://www.bbc.co.uk"/>
      </require>
      <result>
        <url>%priv:feed_url%</url>
      </result>
    </browse>

    <!-- Definition of the metadata() operation, which is used when the
media ID starts with "http://www.bbc.co.uk": the result is the same as
"categories" (defined in the browse operation above). In the case of
metadata, the "select" XPath will be used when mapping the returned
content to Grilo media -->
    <metadata>
      <require>
        <key name="id" match="^http://www.bbc.co.uk"/>
      </require>
      <result ref="categories"/>
    </metadata>

    <!-- Another example of metadata: in this case, the only requirement
is that media must be a video -->
    <metadata>
      <require type="video"/>
      <result>
        <url>%priv:feed_url%</url>
      </result>
    </metadata>
  </operation>
</source>
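
The way an operation is picked in the example (first definition whose
requirements match, with missing values treated as empty strings) and
the skip/count handling can be sketched in Python as follows. The
names pick_operation and page, and the list-of-pairs layout, are
assumptions made for illustration only.

```python
import re


def pick_operation(operations, media):
    """Return the name of the first operation whose requirements match.

    `operations` is a list of (name, {key: pattern}) pairs. Missing or
    None values in `media` are treated as empty strings.
    """
    for name, requirements in operations:
        if all(
            re.search(pattern, media.get(key) or "")
            for key, pattern in requirements.items()
        ):
            return name
    return None


def page(results, skip, count):
    """Drop the first `skip` results and keep at most `count` of them."""
    return results[skip:skip + count]


# The two browse definitions of the BBC example, in declaration order.
BROWSE = [
    ("categories", {"id": r"^$"}),
    ("feed", {"id": r"^http://www.bbc.co.uk"}),
]
```

Browsing the root container (no ID) selects "categories", while
browsing a container whose ID starts with "http://www.bbc.co.uk"
selects "feed".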

There are other sources using other features in the branch. Do not
hesitate to ask me if you have any questions.

	J.A.